Developer Guide =============== This guide covers everything you need to integrate Filipino Tokenizer into an application or ML pipeline. Corpus preparation ------------------ The tokenizer reads a plain UTF-8 text file — one sentence per line. .. code-block:: text Kumain siya ng pagkain sa hapagkainan. Ang mga bata ay masayang naglalaro sa labas. Maganda ang panahon ngayon. **Size guidelines** +-------------------+--------------------------------------------+ | Corpus size | Recommended use | +===================+============================================+ | < 10k sentences | Prototyping / demos only | +-------------------+--------------------------------------------+ | 10k – 100k | Small-scale experiments | +-------------------+--------------------------------------------+ | 100k – 1M | Production NLP tasks | +-------------------+--------------------------------------------+ | > 1M | Large language model pre-training | +-------------------+--------------------------------------------+ A good starting corpus for Tagalog is the `WikiText-TL-39 dataset `_. **Writing a corpus programmatically** .. code-block:: python import tempfile, os sentences = [ "Kumain siya ng pagkain.", "Maganda ang panahon ngayon.", ] with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as f: f.write("\n".join(sentences)) corpus_path = f.name ---- Training -------- .. code-block:: python from filipino_tokenizer.tagalog import TagalogTokenizer tok = TagalogTokenizer() tok.train(corpus_path, vocab_size=32000) print(f"Vocab size : {len(tok.bpe.vocab)}") print(f"Merge rules : {len(tok.bpe.merges)}") **Choosing vocab_size** ``vocab_size`` sets an upper bound on the BPE vocabulary. A larger vocabulary means longer, more-complete tokens (lower fertility) but increases model embedding table size. +-------------+-------------------------------------------------------+ | vocab_size | Typical use case | +=============+=======================================================+ | 500 – 2000 | Small experiments, unit tests | +-------------+-------------------------------------------------------+ | 8000 | Lightweight production tokenizer | +-------------+-------------------------------------------------------+ | 32000 | Standard for transformer language models | +-------------+-------------------------------------------------------+ | 64000+ | Very large corpora / multilingual settings | +-------------+-------------------------------------------------------+ .. note:: If the corpus is too small to generate ``vocab_size`` unique merges, training stops early. Check ``len(tok.bpe.merges)`` after training to see the actual count. ---- Encoding -------- Single sentence ~~~~~~~~~~~~~~~ .. code-block:: python ids = tok.encode("Kumain siya ng pagkain.") # [79, 99, 115, 99, 133, 4, 154, 4, 100, 4, 125, 99, 145, 18] - Input is lowercased automatically. - Returns ``list[int]``. - Unknown characters fall back to the ```` token (ID 1). Inspecting tokens as strings ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python tokens = tok.tokenize("Kumain siya ng pagkain.") # ['k', 'um', 'ain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.'] Batch encoding ~~~~~~~~~~~~~~ There is no built-in batch method. Use a list comprehension: .. code-block:: python sentences = ["Kumain siya.", "Maganda ang panahon."] batch_ids = [tok.encode(s) for s in sentences] For large batches, use ``concurrent.futures`` to parallelise: .. code-block:: python from concurrent.futures import ThreadPoolExecutor with ThreadPoolExecutor() as pool: batch_ids = list(pool.map(tok.encode, sentences)) ---- Decoding -------- .. code-block:: python text = tok.decode(ids) # 'kumain siya ng pagkain.' - Special tokens (````, ````, ````, ````) are silently dropped. - Boundary markers (``▁``) are removed. - Output is always lowercase (encoding lowercases input). ---- Saving and loading ------------------ .. code-block:: python # Save tok.save("my_tokenizer/") # Creates my_tokenizer/vocab.json and my_tokenizer/merges.txt # Load from filipino_tokenizer.tagalog import TagalogTokenizer tok2 = TagalogTokenizer() tok2.load("my_tokenizer/") The saved files are human-readable plain text — you can inspect or version-control them. ---- Using the segmenter independently ---------------------------------- The morphological segmenter can be used without the BPE layer: .. code-block:: python from filipino_tokenizer.tagalog import TagalogSegmenter seg = TagalogSegmenter() seg.segment("kumain") # ['um', 'kain'] seg.segment("pagkain") # ['pag', 'kain'] seg.segment("pinakamahusay") # ['pinaka', 'ma', 'husay'] seg.segment("pangalan") # ['pangalan'] ← frozen form, not decomposed seg.segment("computer") # ['computer'] ← loan word, not decomposed # Segment a full sentence (splits on whitespace/punctuation first) seg.segment_text("Kumain siya ng pagkain.") # ['um', 'kain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.'] This is useful for: - Feature extraction for non-BPE models - Linguistic analysis and corpus statistics - Preprocessing for other NLP tools ---- Integrating with PyTorch ------------------------ Filipino Tokenizer produces plain Python lists, which convert directly to tensors: .. code-block:: python import torch from filipino_tokenizer.tagalog import TagalogTokenizer tok = TagalogTokenizer() tok.load("my_tokenizer/") def collate(sentences, pad_id=0): encoded = [tok.encode(s) for s in sentences] max_len = max(len(e) for e in encoded) padded = [e + [pad_id] * (max_len - len(e)) for e in encoded] return torch.tensor(padded, dtype=torch.long) batch = collate(["Kumain siya.", "Maganda ang panahon ngayon."]) # tensor of shape (2, max_seq_len) ---- Integrating with HuggingFace ``datasets`` ------------------------------------------ .. code-block:: python from datasets import Dataset from filipino_tokenizer.tagalog import TagalogTokenizer tok = TagalogTokenizer() tok.load("my_tokenizer/") raw = Dataset.from_dict({"text": ["Kumain siya.", "Maganda ang panahon."]}) def tokenize_fn(batch): return {"input_ids": [tok.encode(t) for t in batch["text"]]} tokenized = raw.map(tokenize_fn, batched=True) ---- Special token IDs ----------------- +----------+----+---------------------------------------------+ | Token | ID | Meaning | +==========+====+=============================================+ | ````| 0 | Padding (for fixed-length batches) | +----------+----+---------------------------------------------+ | ````| 1 | Unknown character fallback | +----------+----+---------------------------------------------+ | ```` | 2 | Beginning of sequence | +----------+----+---------------------------------------------+ | ```` | 3 | End of sequence | +----------+----+---------------------------------------------+ These are always at IDs 0–3 regardless of corpus. Add them manually if your model expects them: .. code-block:: python BOS, EOS = 2, 3 ids = [BOS] + tok.encode(sentence) + [EOS] ---- HuggingFace Transformers integration -------------------------------------- ``TagalogHFTokenizer`` implements the ``PreTrainedTokenizer`` interface so it works directly with any HuggingFace-compatible training framework. .. code-block:: bash pip install filipino-tokenizer[hf] Loading a trained tokenizer ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from filipino_tokenizer.tagalog import TagalogHFTokenizer tok = TagalogHFTokenizer( vocab_file="my_tokenizer/vocab.json", merges_file="my_tokenizer/merges.txt", ) print(tok.vocab_size) # 32000 print(tok.bos_token) # '' print(tok.pad_token_id) # 0 Batch encoding ~~~~~~~~~~~~~~ .. code-block:: python sentences = [ "Kumain siya ng pagkain.", "Nagtatrabaho ang tatay sa opisina araw-araw.", ] encoding = tok(sentences, padding=True, truncation=True, return_tensors="pt") # encoding["input_ids"] — shape (2, seq_len) # encoding["attention_mask"] — shape (2, seq_len) Save and reload in HuggingFace format ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python tok.save_pretrained("hf_tokenizer/") # Creates: vocab.json, merges.txt, tokenizer_config.json, special_tokens_map.json tok2 = TagalogHFTokenizer.from_pretrained("hf_tokenizer/") Building a dataset for causal LM training ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python import torch from torch.utils.data import Dataset class FilipinoTextDataset(Dataset): def __init__(self, texts, tokenizer, max_length=512): self.encodings = tokenizer( texts, max_length=max_length, padding="max_length", truncation=True, return_tensors="pt", ) def __len__(self): return self.encodings["input_ids"].shape[0] def __getitem__(self, idx): item = {k: v[idx] for k, v in self.encodings.items()} item["labels"] = item["input_ids"].clone() return item Setting up a model ~~~~~~~~~~~~~~~~~~ The only tokenizer-specific value a model needs is ``vocab_size``: .. code-block:: python from transformers import GPT2Config, GPT2LMHeadModel config = GPT2Config( vocab_size=tok.vocab_size, pad_token_id=tok.pad_token_id, bos_token_id=tok.bos_token_id, eos_token_id=tok.eos_token_id, ) model = GPT2LMHeadModel(config) The same pattern works for any architecture (``LlamaForCausalLM``, ``BertForMaskedLM``, ``T5ForConditionalGeneration``, etc.) — only the config class changes. ---- Running tests ------------- .. code-block:: bash python -m unittest discover tests -v All 49 tests should pass. Individual test files: .. code-block:: bash python -m unittest tests.test_affixes -v python -m unittest tests.test_segmenter -v python -m unittest tests.test_tokenizer -v