Developer Guide¶

This guide covers everything you need to integrate Filipino Tokenizer into an application or ML pipeline.

Corpus preparation¶

The tokenizer reads a plain UTF-8 text file — one sentence per line.

Kumain siya ng pagkain sa hapagkainan.
Ang mga bata ay masayang naglalaro sa labas.
Maganda ang panahon ngayon.

Size guidelines

Corpus size	Recommended use
< 10k sentences	Prototyping / demos only
10k – 100k	Small-scale experiments
100k – 1M	Production NLP tasks
> 1M	Large language model pre-training

A good starting corpus for Tagalog is the WikiText-TL-39 dataset.

Writing a corpus programmatically

import tempfile, os

sentences = [
    "Kumain siya ng pagkain.",
    "Maganda ang panahon ngayon.",
]

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as f:
    f.write("\n".join(sentences))
    corpus_path = f.name

Training¶

from filipino_tokenizer.tagalog import TagalogTokenizer

tok = TagalogTokenizer()
tok.train(corpus_path, vocab_size=32000)

print(f"Vocab size  : {len(tok.bpe.vocab)}")
print(f"Merge rules : {len(tok.bpe.merges)}")

Choosing vocab_size

vocab_size sets an upper bound on the BPE vocabulary. A larger vocabulary means longer, more-complete tokens (lower fertility) but increases model embedding table size.

vocab_size	Typical use case
500 – 2000	Small experiments, unit tests
8000	Lightweight production tokenizer
32000	Standard for transformer language models
64000+	Very large corpora / multilingual settings

Note

If the corpus is too small to generate vocab_size unique merges, training stops early. Check len(tok.bpe.merges) after training to see the actual count.

Encoding¶

Single sentence¶

ids = tok.encode("Kumain siya ng pagkain.")
# [79, 99, 115, 99, 133, 4, 154, 4, 100, 4, 125, 99, 145, 18]

Input is lowercased automatically.
Returns list[int].
Unknown characters fall back to the <unk> token (ID 1).

Inspecting tokens as strings¶

tokens = tok.tokenize("Kumain siya ng pagkain.")
# ['k', 'um', 'ain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.']

Batch encoding¶

There is no built-in batch method. Use a list comprehension:

sentences = ["Kumain siya.", "Maganda ang panahon."]
batch_ids = [tok.encode(s) for s in sentences]

For large batches, use concurrent.futures to parallelise:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as pool:
    batch_ids = list(pool.map(tok.encode, sentences))

Decoding¶

text = tok.decode(ids)
# 'kumain siya ng pagkain.'

Special tokens (<pad>, <unk>, <s>, </s>) are silently dropped.
Boundary markers (▁) are removed.
Output is always lowercase (encoding lowercases input).

Saving and loading¶

# Save
tok.save("my_tokenizer/")
# Creates my_tokenizer/vocab.json and my_tokenizer/merges.txt

# Load
from filipino_tokenizer.tagalog import TagalogTokenizer
tok2 = TagalogTokenizer()
tok2.load("my_tokenizer/")

The saved files are human-readable plain text — you can inspect or version-control them.

Using the segmenter independently¶

The morphological segmenter can be used without the BPE layer:

from filipino_tokenizer.tagalog import TagalogSegmenter

seg = TagalogSegmenter()

seg.segment("kumain")          # ['um', 'kain']
seg.segment("pagkain")         # ['pag', 'kain']
seg.segment("pinakamahusay")   # ['pinaka', 'ma', 'husay']
seg.segment("pangalan")        # ['pangalan']  ← frozen form, not decomposed
seg.segment("computer")        # ['computer']  ← loan word, not decomposed

# Segment a full sentence (splits on whitespace/punctuation first)
seg.segment_text("Kumain siya ng pagkain.")
# ['um', 'kain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.']

This is useful for:

Feature extraction for non-BPE models
Linguistic analysis and corpus statistics
Preprocessing for other NLP tools

Integrating with PyTorch¶

Filipino Tokenizer produces plain Python lists, which convert directly to tensors:

import torch
from filipino_tokenizer.tagalog import TagalogTokenizer

tok = TagalogTokenizer()
tok.load("my_tokenizer/")

def collate(sentences, pad_id=0):
    encoded = [tok.encode(s) for s in sentences]
    max_len = max(len(e) for e in encoded)
    padded = [e + [pad_id] * (max_len - len(e)) for e in encoded]
    return torch.tensor(padded, dtype=torch.long)

batch = collate(["Kumain siya.", "Maganda ang panahon ngayon."])
# tensor of shape (2, max_seq_len)

Integrating with HuggingFace `datasets`¶

from datasets import Dataset
from filipino_tokenizer.tagalog import TagalogTokenizer

tok = TagalogTokenizer()
tok.load("my_tokenizer/")

raw = Dataset.from_dict({"text": ["Kumain siya.", "Maganda ang panahon."]})

def tokenize_fn(batch):
    return {"input_ids": [tok.encode(t) for t in batch["text"]]}

tokenized = raw.map(tokenize_fn, batched=True)

Special token IDs¶

Token	ID	Meaning
`<pad>`	0	Padding (for fixed-length batches)
`<unk>`	1	Unknown character fallback
`<s>`	2	Beginning of sequence
`</s>`	3	End of sequence

These are always at IDs 0–3 regardless of corpus. Add them manually if your model expects them:

BOS, EOS = 2, 3
ids = [BOS] + tok.encode(sentence) + [EOS]

HuggingFace Transformers integration¶

TagalogHFTokenizer implements the PreTrainedTokenizer interface so it works directly with any HuggingFace-compatible training framework.

pip install filipino-tokenizer[hf]

Loading a trained tokenizer¶

from filipino_tokenizer.tagalog import TagalogHFTokenizer

tok = TagalogHFTokenizer(
    vocab_file="my_tokenizer/vocab.json",
    merges_file="my_tokenizer/merges.txt",
)

print(tok.vocab_size)      # 32000
print(tok.bos_token)       # '<s>'
print(tok.pad_token_id)    # 0

Batch encoding¶

sentences = [
    "Kumain siya ng pagkain.",
    "Nagtatrabaho ang tatay sa opisina araw-araw.",
]
encoding = tok(sentences, padding=True, truncation=True, return_tensors="pt")
# encoding["input_ids"]       — shape (2, seq_len)
# encoding["attention_mask"]  — shape (2, seq_len)

Save and reload in HuggingFace format¶

tok.save_pretrained("hf_tokenizer/")
# Creates: vocab.json, merges.txt, tokenizer_config.json, special_tokens_map.json

tok2 = TagalogHFTokenizer.from_pretrained("hf_tokenizer/")

Building a dataset for causal LM training¶

import torch
from torch.utils.data import Dataset

class FilipinoTextDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=512):
        self.encodings = tokenizer(
            texts,
            max_length=max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )

    def __len__(self):
        return self.encodings["input_ids"].shape[0]

    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        item["labels"] = item["input_ids"].clone()
        return item

Setting up a model¶

The only tokenizer-specific value a model needs is vocab_size:

from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=tok.vocab_size,
    pad_token_id=tok.pad_token_id,
    bos_token_id=tok.bos_token_id,
    eos_token_id=tok.eos_token_id,
)
model = GPT2LMHeadModel(config)

The same pattern works for any architecture (LlamaForCausalLM, BertForMaskedLM, T5ForConditionalGeneration, etc.) — only the config class changes.

Running tests¶

python -m unittest discover tests -v

All 49 tests should pass. Individual test files:

python -m unittest tests.test_affixes -v
python -m unittest tests.test_segmenter -v
python -m unittest tests.test_tokenizer -v

Developer Guide¶

Corpus preparation¶

Training¶

Encoding¶

Single sentence¶

Inspecting tokens as strings¶

Batch encoding¶

Decoding¶

Saving and loading¶

Using the segmenter independently¶

Integrating with PyTorch¶

Integrating with HuggingFace datasets¶

Special token IDs¶

HuggingFace Transformers integration¶

Loading a trained tokenizer¶

Batch encoding¶

Save and reload in HuggingFace format¶

Building a dataset for causal LM training¶

Setting up a model¶

Running tests¶

Integrating with HuggingFace `datasets`¶