Quick Start
===========

This page gets you from zero to a working tokenizer in under two minutes.

0. Use the bundled pretrained model (no setup required)
--------------------------------------------------------

A 32k-vocabulary model trained on Wikitext-TL-39 is shipped with the package.
After ``pip install filipino-tokenizer`` you can use it immediately:

.. code-block:: python

   from filipino_tokenizer.tagalog import TagalogTokenizer

   tok = TagalogTokenizer()
   tok.load_pretrained()

   ids = tok.encode("Kumain siya ng pagkain.")
   print(tok.decode(ids))  # kumain siya ng pagkain.

For HuggingFace Trainer / datasets, also install ``transformers``:

.. code-block:: bash

   pip install filipino-tokenizer[hf]

.. code-block:: python

   from filipino_tokenizer.tagalog import TagalogHFTokenizer

   tok = TagalogHFTokenizer()  # loads bundled model
   encoding = tok("Kumain siya ng pagkain.", return_tensors="pt")

For batched dataset tokenization with dynamic or max-length padding:

.. code-block:: python

   enc = tok(
       ["Kumain siya ng pagkain.", "Nagluluto ang nanay."],
       truncation=True,
       max_length=128,
       padding="max_length",
       return_tensors=None,  # or "pt" / "np"
   )

----

If you want to train your own model on a custom corpus, follow the steps below.

1. Prepare a corpus
-------------------

The tokenizer trains on a plain UTF-8 text file with **one sentence per line**.

.. code-block:: text

   Kumain siya ng pagkain sa hapagkainan.
   Maganda ang panahon ngayon kaya lumabas kami.
   Nagluluto ang nanay ng masarap na adobo para sa pamilya.

Save this as ``corpus.txt``. For production use, download the Wikitext-TL-39
corpus (~1.5M sentences) with the included script:

.. code-block:: bash

   pip install datasets
   python scripts/download_corpus.py

2. Train
--------

.. code-block:: python

   from filipino_tokenizer.tagalog import TagalogTokenizer

   tok = TagalogTokenizer()
   tok.train("corpus.txt", vocab_size=32000)

``vocab_size`` is the target BPE vocabulary size.
The actual vocabulary will be smaller if the corpus does not contain enough
distinct character pairs.

3. Encode and decode
--------------------

.. code-block:: python

   ids = tok.encode("Kumain siya ng pagkain.")
   # [79, 99, 115, ...]

   text = tok.decode(ids)
   # 'kumain siya ng pagkain.'

``encode()`` lowercases input and returns a ``list[int]``.
``decode()`` removes boundary markers and reconstructs the text. Because
``encode()`` lowercases its input, the result is the lowercased form of the
original.

4. Inspect tokens
-----------------

.. code-block:: python

   tokens = tok.tokenize("Kumain siya ng pagkain.")
   # ['k', '▁', 'um', '▁', 'ain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.']

``tokenize()`` returns strings instead of IDs, which is useful for debugging
and for understanding what the tokenizer is doing.

5. Save and reload
------------------

.. code-block:: python

   tok.save("my_tokenizer/")

   tok2 = TagalogTokenizer()
   tok2.load("my_tokenizer/")

``save()`` writes two files:

- ``my_tokenizer/vocab.json``: token-to-ID mapping
- ``my_tokenizer/merges.txt``: learned BPE merge rules

6. HuggingFace integration
--------------------------

``TagalogHFTokenizer`` wraps the tokenizer behind the ``PreTrainedTokenizer``
interface for use with ``Trainer``, TRL, Axolotl, and any other HF pipeline.

.. code-block:: bash

   pip install filipino-tokenizer[hf]

.. code-block:: python

   from filipino_tokenizer.tagalog import TagalogHFTokenizer

   # Option A: bundled pretrained model (no path needed)
   tok = TagalogHFTokenizer()

   # Option B: load from a directory you trained yourself
   tok = TagalogHFTokenizer.from_pretrained("my_tokenizer/")

   # Standard HuggingFace call
   encoding = tok("Kumain siya ng pagkain.", return_tensors="pt")

   # Save / reload in HF format
   tok.save_pretrained("hf_tokenizer/")
   tok2 = TagalogHFTokenizer.from_pretrained("hf_tokenizer/")

See :doc:`api/hf_tokenizer` for the full API reference.

What's next?
------------

- **Developers**: see :doc:`guides/developers` for corpus preparation, batch
  encoding, and integration with ML frameworks.
- **Researchers**: see :doc:`guides/researchers` for the morphological
  segmentation algorithm, the CBPE constraint, and evaluation methodology.
- **API details**: see :doc:`api/tokenizer`, :doc:`api/segmenter`,
  :doc:`api/bpe`, :doc:`api/hf_tokenizer`.
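
Aside: why the vocabulary can come out smaller
----------------------------------------------

Step 2 notes that the trained vocabulary may end up smaller than
``vocab_size``. The sketch below illustrates why, using a generic textbook BPE
training loop. It is not the code used by ``filipino_tokenizer`` (which also
applies morphological constraints), and ``train_toy_bpe`` is a hypothetical
helper written only for this illustration. On a tiny corpus the loop runs out
of adjacent pairs to merge long before it reaches the target size.

.. code-block:: python

   from collections import Counter


   def train_toy_bpe(words, vocab_size):
       """Generic BPE training loop (illustrative; not the library's code)."""
       # Each word starts as a tuple of single characters.
       corpus = Counter(tuple(w) for w in words)
       vocab = sorted({ch for word in corpus for ch in word})
       merges = []
       while len(vocab) < vocab_size:
           # Count adjacent symbol pairs, weighted by word frequency.
           pairs = Counter()
           for word, freq in corpus.items():
               for pair in zip(word, word[1:]):
                   pairs[pair] += freq
           if not pairs:
               break  # every word is one symbol: nothing left to merge
           # Most frequent pair; ties are broken by the pair itself.
           best = max(pairs, key=lambda p: (pairs[p], p))
           merged = best[0] + best[1]
           merges.append(best)
           if merged not in vocab:
               vocab.append(merged)
           # Rewrite every word with the chosen pair fused into one symbol.
           new_corpus = Counter()
           for word, freq in corpus.items():
               out, i = [], 0
               while i < len(word):
                   if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                       out.append(merged)
                       i += 2
                   else:
                       out.append(word[i])
                       i += 1
               new_corpus[tuple(out)] += freq
           corpus = new_corpus
       return vocab, merges


   vocab, merges = train_toy_bpe(["kain", "kumain", "pagkain"], vocab_size=32000)
   print(len(vocab), len(merges))  # both far below the 32000 target

With only three words, training stops as soon as each word has collapsed into
a single symbol. A realistically sized corpus keeps supplying new pairs, so the
target is usually reached.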