Developer Guide
This guide covers everything you need to integrate Filipino Tokenizer into an application or ML pipeline.
Corpus preparation
The tokenizer reads a plain UTF-8 text file with one sentence per line:
Kumain siya ng pagkain sa hapagkainan.
Ang mga bata ay masayang naglalaro sa labas.
Maganda ang panahon ngayon.
Size guidelines
| Corpus size | Recommended use |
|---|---|
| < 10k sentences | Prototyping / demos only |
| 10k – 100k | Small-scale experiments |
| 100k – 1M | Production NLP tasks |
| > 1M | Large language model pre-training |
A good starting corpus for Tagalog is the WikiText-TL-39 dataset.
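To see where a corpus falls in the size guidelines, it is enough to count its non-empty lines. The sketch below is a minimal helper and not part of Filipino Tokenizer: the names `count_sentences` and `size_tier` are hypothetical, and the thresholds simply restate the table above.

```python
import os
import tempfile


def count_sentences(path):
    """Count non-empty lines; raises UnicodeDecodeError if the file is not UTF-8."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def size_tier(n_sentences):
    """Map a sentence count to the recommendation in the size guidelines."""
    if n_sentences < 10_000:
        return "Prototyping / demos only"
    if n_sentences <= 100_000:
        return "Small-scale experiments"
    if n_sentences <= 1_000_000:
        return "Production NLP tasks"
    return "Large language model pre-training"


# Demo on a throwaway corpus: three sentences and one blank line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as f:
    f.write("Kumain siya ng pagkain.\n\n"
            "Maganda ang panahon ngayon.\n"
            "Ang mga bata ay naglalaro.\n")
    demo_path = f.name

n = count_sentences(demo_path)
print(n, "->", size_tier(n))
os.remove(demo_path)
```

Opening the file with an explicit `encoding="utf-8"` doubles as a cheap validity check, since a mis-encoded corpus fails here rather than during training.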
Writing a corpus programmatically
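import tempfile

sentences = [
    "Kumain siya ng pagkain.",
    "Maganda ang panahon ngayon.",
]

# delete=False keeps the file on disk after the with-block exits,
# so corpus_path can still be passed to train() afterwards.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as f:
    f.write("\n".join(sentences))
    corpus_path = f.name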
Training
from filipino_tokenizer.tagalog import TagalogTokenizer
tok = TagalogTokenizer()
tok.train(corpus_path, vocab_size=32000)
print(f"Vocab size : {len(tok.bpe.vocab)}")
print(f"Merge rules : {len(tok.bpe.merges)}")
Choosing vocab_size
vocab_size sets an upper bound on the BPE vocabulary. A larger vocabulary
yields longer, more complete tokens and therefore lower fertility (fewer tokens
per word), but it also enlarges the model's embedding table.
| vocab_size | Typical use case |
|---|---|
| 500 – 2000 | Small experiments, unit tests |
| 8000 | Lightweight production tokenizer |
| 32000 | Standard for transformer language models |
| 64000+ | Very large corpora / multilingual settings |
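The embedding-table cost is simple arithmetic: one row per vocabulary entry and one column per hidden dimension. The sketch below assumes a hidden size of 768 (the GPT-2 small default) and 32-bit floats. Both are illustrative choices, not properties of the tokenizer, and the function names are hypothetical.

```python
def embedding_params(vocab_size, hidden_size=768):
    """Number of weights in a vocab_size x hidden_size embedding table."""
    return vocab_size * hidden_size


def embedding_megabytes(vocab_size, hidden_size=768, bytes_per_param=4):
    """Approximate size in MB (10^6 bytes), assuming float32 weights by default."""
    return embedding_params(vocab_size, hidden_size) * bytes_per_param / 1e6


for v in (2_000, 8_000, 32_000, 64_000):
    print(f"{v:>6} tokens -> {embedding_params(v):>10,} params, "
          f"{embedding_megabytes(v):6.1f} MB")
```

Under these assumptions the table grows linearly with vocab_size, so doubling the vocabulary doubles the embedding memory.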
Note
If the corpus is too small to yield enough merges to reach vocab_size, training
stops early. Check len(tok.bpe.merges) after training to see how many merges were actually learned.
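Why a small corpus ends training early can be seen in a toy BPE trainer. This is a generic character-level sketch written for illustration. It is not Filipino Tokenizer's implementation, and `train_toy_bpe` is a hypothetical name. Once every word has collapsed into a single symbol there are no adjacent pairs left, so the loop stops however many merges were requested.

```python
from collections import Counter


def train_toy_bpe(words, num_merges):
    """Generic BPE merge loop; returns the list of merged pairs in order."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge: training stops early
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges


requested = 1000
merges = train_toy_bpe(["kain", "kumain", "pagkain"], num_merges=requested)
print(f"requested {requested} merges, learned {len(merges)}")
```

Real trainers often add further stopping rules, such as a minimum pair frequency, so the actual count can be lower still.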
Encoding
Single sentence
ids = tok.encode("Kumain siya ng pagkain.")
# [79, 99, 115, 99, 133, 4, 154, 4, 100, 4, 125, 99, 145, 18]
Input is lowercased automatically.
Returns list[int].
Unknown characters fall back to the <unk> token (ID 1).
Inspecting tokens as strings
tokens = tok.tokenize("Kumain siya ng pagkain.")
# ['k', 'um', 'ain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.']
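Token strings also make it easy to estimate fertility, the average number of tokens per word. The helper below is a sketch under stated assumptions: the name `fertility` is hypothetical, words are counted by splitting on whitespace, and pure-whitespace tokens are ignored. It runs on the token list shown above rather than on a live tokenizer.

```python
def fertility(tokens, text):
    """Average number of non-whitespace tokens per whitespace-separated word."""
    n_words = len(text.split())
    n_tokens = sum(1 for t in tokens if t.strip())
    return n_tokens / n_words if n_words else 0.0


text = "Kumain siya ng pagkain."
# Output of tok.tokenize(text), copied from the example above.
tokens = ['k', 'um', 'ain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.']

print(fertility(tokens, text))  # 8 tokens / 4 words = 2.0
```

Averaging this over a held-out sample gives one way to compare different vocab_size settings.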
Batch encoding
There is no built-in batch method. Use a list comprehension:
sentences = ["Kumain siya.", "Maganda ang panahon."]
batch_ids = [tok.encode(s) for s in sentences]
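For large batches, concurrent.futures can spread the calls across worker threads. Measure the speed-up before relying on it: if encode runs as pure Python, the GIL limits how much threads can help.
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as pool:
    # pool.map returns results in input order.
    batch_ids = list(pool.map(tok.encode, sentences))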
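When the input is too large to hold in memory, the same list-comprehension pattern can be applied to a file one chunk at a time. The generator below is a sketch: `encode_file_in_chunks` is a hypothetical helper, and `fake_encode` is a stand-in for tok.encode so that the example runs without a trained tokenizer.

```python
import os
import tempfile
from itertools import islice


def encode_file_in_chunks(path, encode, chunk_size=1000):
    """Yield lists of encoded sentences, reading at most chunk_size lines at a time."""
    with open(path, encoding="utf-8") as f:
        # Lazily strip newlines and skip blank lines.
        sentences = (line.rstrip("\n") for line in f if line.strip())
        while True:
            chunk = list(islice(sentences, chunk_size))
            if not chunk:
                return
            yield [encode(s) for s in chunk]


def fake_encode(sentence):
    """Stand-in for tok.encode: lowercases, then maps characters to code points."""
    return [ord(c) for c in sentence.lower()]


with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as f:
    f.write("Kumain siya.\nMaganda ang panahon.\nSalamat.\n")
    demo_path = f.name

chunks = list(encode_file_in_chunks(demo_path, fake_encode, chunk_size=2))
print([len(c) for c in chunks])  # [2, 1]
os.remove(demo_path)
```

In real use, pass tok.encode as the `encode` argument and consume the generator in a loop instead of collecting it into a list.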
Decoding
text = tok.decode(ids)
# 'kumain siya ng pagkain.'
Special tokens (<pad>, <unk>, <s>, </s>) are silently dropped.
Boundary markers (▁) are removed.
Output is always lowercase (encoding lowercases input).
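Taken together, these rules mean that decode(encode(text)) should reproduce text.lower() whenever every character is in the vocabulary. A small check can be written against any object that has encode and decode methods. In the sketch below, `CharStubTokenizer` is a made-up character-level stand-in that only mimics the documented behaviour, and `round_trips` is a hypothetical helper name.

```python
class CharStubTokenizer:
    """Stand-in tokenizer: lowercases on encode, drops special IDs 0-3 on decode."""

    N_SPECIAL = 4  # IDs 0-3 are reserved for <pad>, <unk>, <s>, </s>

    def encode(self, text):
        return [ord(c) + self.N_SPECIAL for c in text.lower()]

    def decode(self, ids):
        return "".join(chr(i - self.N_SPECIAL) for i in ids if i >= self.N_SPECIAL)


def round_trips(tokenizer, text):
    """True if decode(encode(text)) reproduces the lowercased input."""
    return tokenizer.decode(tokenizer.encode(text)) == text.lower()


stub = CharStubTokenizer()
print(round_trips(stub, "Kumain siya ng pagkain."))
# Special tokens added around the sequence vanish on decode.
print(stub.decode([2] + stub.encode("Kumain") + [3, 0, 0]))
```

With a real tokenizer, text containing characters outside the vocabulary would not round-trip, because those characters become <unk> and are then dropped.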
Saving and loading
# Save
tok.save("my_tokenizer/")
# Creates my_tokenizer/vocab.json and my_tokenizer/merges.txt
# Load
from filipino_tokenizer.tagalog import TagalogTokenizer
tok2 = TagalogTokenizer()
tok2.load("my_tokenizer/")
The saved files are human-readable plain text, so you can inspect them or put them under version control.
Using the segmenter independently
The morphological segmenter can be used without the BPE layer:
from filipino_tokenizer.tagalog import TagalogSegmenter
seg = TagalogSegmenter()
seg.segment("kumain") # ['um', 'kain']
seg.segment("pagkain") # ['pag', 'kain']
seg.segment("pinakamahusay") # ['pinaka', 'ma', 'husay']
seg.segment("pangalan") # ['pangalan'] ← frozen form, not decomposed
seg.segment("computer") # ['computer'] ← loan word, not decomposed
# Segment a full sentence (splits on whitespace/punctuation first)
seg.segment_text("Kumain siya ng pagkain.")
# ['um', 'kain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.']
This is useful for:
Feature extraction for non-BPE models
Linguistic analysis and corpus statistics
Preprocessing for other NLP tools
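As an illustration of the corpus-statistics use case, segmenter output can be fed straight into a Counter. The sketch below hard-codes the segmentations from the examples above so that it runs without the library; with the library installed, the values would come from seg.segment(word).

```python
from collections import Counter

# Segmentations copied from the seg.segment() examples above.
segmented = {
    "kumain": ["um", "kain"],
    "pagkain": ["pag", "kain"],
    "pinakamahusay": ["pinaka", "ma", "husay"],
    "pangalan": ["pangalan"],
}

# Frequency of each morpheme across the word list.
morpheme_counts = Counter(m for parts in segmented.values() for m in parts)

# Average number of morphemes per word.
avg_morphemes = sum(len(parts) for parts in segmented.values()) / len(segmented)

print(morpheme_counts.most_common(1))  # [('kain', 2)]
print(avg_morphemes)                   # 2.0
```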
Integrating with PyTorch
Filipino Tokenizer produces plain Python lists, which convert directly to tensors:
import torch
from filipino_tokenizer.tagalog import TagalogTokenizer
tok = TagalogTokenizer()
tok.load("my_tokenizer/")
def collate(sentences, pad_id=0):
    # Encode each sentence, then right-pad to the longest one in the batch.
    encoded = [tok.encode(s) for s in sentences]
    max_len = max(len(e) for e in encoded)
    padded = [e + [pad_id] * (max_len - len(e)) for e in encoded]
    return torch.tensor(padded, dtype=torch.long)

batch = collate(["Kumain siya.", "Maganda ang panahon ngayon."])
# tensor of shape (2, max_seq_len)
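Most models also expect an attention mask that marks which positions are padding. The sketch below extends the same padding logic to build one. It uses plain Python lists so that it runs without PyTorch, and `pad_with_mask` is a hypothetical helper name. Both return values can be passed to torch.tensor in the same way as `padded` above.

```python
def pad_with_mask(encoded, pad_id=0):
    """Right-pad ID lists to equal length and build a matching 0/1 attention mask."""
    if not encoded:
        return [], []
    max_len = max(len(e) for e in encoded)
    input_ids = [e + [pad_id] * (max_len - len(e)) for e in encoded]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(e) + [0] * (max_len - len(e)) for e in encoded]
    return input_ids, attention_mask


ids, mask = pad_with_mask([[79, 99, 115], [154, 4, 100, 4, 125]])
print(ids)   # [[79, 99, 115, 0, 0], [154, 4, 100, 4, 125]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```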
Integrating with HuggingFace datasets
from datasets import Dataset
from filipino_tokenizer.tagalog import TagalogTokenizer
tok = TagalogTokenizer()
tok.load("my_tokenizer/")
raw = Dataset.from_dict({"text": ["Kumain siya.", "Maganda ang panahon."]})
def tokenize_fn(batch):
    # With batched=True, batch["text"] is a list of strings.
    return {"input_ids": [tok.encode(t) for t in batch["text"]]}

tokenized = raw.map(tokenize_fn, batched=True)
Special token IDs
| Token | ID | Meaning |
|---|---|---|
| <pad> | 0 | Padding (for fixed-length batches) |
| <unk> | 1 | Unknown character fallback |
| <s> | 2 | Beginning of sequence |
| </s> | 3 | End of sequence |
These are always at IDs 0–3 regardless of corpus. Add them manually if your model expects them:
BOS, EOS = 2, 3
ids = [BOS] + tok.encode(sentence) + [EOS]
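If the model also has a maximum sequence length, the two marker tokens need to be counted against it. The helper below is a minimal sketch with a hypothetical name, `add_special_tokens`. It truncates the content first so that BOS and EOS always survive.

```python
BOS, EOS = 2, 3


def add_special_tokens(ids, max_length=None):
    """Wrap ids in BOS/EOS, truncating the content so the total fits max_length."""
    if max_length is not None:
        if max_length < 2:
            raise ValueError("max_length must leave room for BOS and EOS")
        ids = ids[: max_length - 2]
    return [BOS] + list(ids) + [EOS]


print(add_special_tokens([79, 99, 115, 99, 133]))                # [2, 79, 99, 115, 99, 133, 3]
print(add_special_tokens([79, 99, 115, 99, 133], max_length=5))  # [2, 79, 99, 115, 3]
```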
HuggingFace Transformers integration
TagalogHFTokenizer implements the PreTrainedTokenizer interface so it works
directly with any HuggingFace-compatible training framework.
pip install "filipino-tokenizer[hf]"
Loading a trained tokenizer
from filipino_tokenizer.tagalog import TagalogHFTokenizer
tok = TagalogHFTokenizer(
    vocab_file="my_tokenizer/vocab.json",
    merges_file="my_tokenizer/merges.txt",
)
print(tok.vocab_size) # 32000
print(tok.bos_token) # '<s>'
print(tok.pad_token_id) # 0
Batch encoding
sentences = [
    "Kumain siya ng pagkain.",
    "Nagtatrabaho ang tatay sa opisina araw-araw.",
]
encoding = tok(sentences, padding=True, truncation=True, return_tensors="pt")
# encoding["input_ids"]: shape (2, seq_len)
# encoding["attention_mask"]: shape (2, seq_len)
Save and reload in HuggingFace format
tok.save_pretrained("hf_tokenizer/")
# Creates: vocab.json, merges.txt, tokenizer_config.json, special_tokens_map.json
tok2 = TagalogHFTokenizer.from_pretrained("hf_tokenizer/")
Building a dataset for causal LM training
import torch
from torch.utils.data import Dataset

class FilipinoTextDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=512):
        # Tokenize everything up front; every row is padded to max_length.
        self.encodings = tokenizer(
            texts,
            max_length=max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )

    def __len__(self):
        return self.encodings["input_ids"].shape[0]

    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        labels = item["input_ids"].clone()
        # -100 is the ignore index of the HuggingFace LM loss, so padded
        # positions do not contribute to training.
        labels[item["attention_mask"] == 0] = -100
        item["labels"] = labels
        return item
Setting up a model
The tokenizer-specific values a model needs are vocab_size and the special-token IDs:
from transformers import GPT2Config, GPT2LMHeadModel
config = GPT2Config(
    vocab_size=tok.vocab_size,
    pad_token_id=tok.pad_token_id,
    bos_token_id=tok.bos_token_id,
    eos_token_id=tok.eos_token_id,
)
model = GPT2LMHeadModel(config)
The same pattern works for other architectures (LlamaForCausalLM,
BertForMaskedLM, T5ForConditionalGeneration, etc.); only the
config and model classes change.
Running tests
python -m unittest discover tests -v
All 49 tests should pass. Individual test files:
python -m unittest tests.test_affixes -v
python -m unittest tests.test_segmenter -v
python -m unittest tests.test_tokenizer -v