Filipino Tokenizer
Morphology-aware BPE tokenization for Philippine languages.
Filipino words are built by attaching prefixes, infixes, suffixes, and circumfixes to a root. A generic tokenizer trained on English treats this morphology as noise and splits words at arbitrary character positions. Filipino Tokenizer addresses this by running a rule-based morphological segmenter before BPE, so the learned subword units follow morpheme boundaries instead of cutting across them.
from filipino_tokenizer.tagalog import TagalogTokenizer

# Train the tokenizer on a plain-text corpus.
tok = TagalogTokenizer()
tok.train("corpus.txt", vocab_size=32000)

# Tokenize a sentence ("He/She ate food.").
tok.tokenize("Kumain siya ng pagkain.")
# ['k', '▁', 'um', '▁', 'ain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.']
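The root kain (eat) stays recoverable in both kumain and pagkain, even though the surface forms look very different: pagkain splits into the prefix pag and the root kain, while in kumain the infix um is separated from the root, leaving k and ain on either side of it.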
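To make the idea of segmenting before BPE concrete, here is a minimal sketch of how pre-segmentation restricts which character pairs BPE is allowed to merge. It is not the library's implementation: the names `segment`, `pair_counts`, `PREFIXES`, and `INFIXES` are hypothetical, and the affix rules are a toy subset chosen only to cover the example words.

```python
from collections import Counter

# Hypothetical toy affix rules for illustration; a real segmenter needs many more.
PREFIXES = ("pag", "mag", "nag")
INFIXES = ("um", "in")
VOWELS = "aeiou"


def segment(word):
    """Split a lowercase word into morpheme-like pieces using the toy rules."""
    # Prefix rule: pag + kain
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            return [prefix, word[len(prefix):]]
    # Infix rule: the infix sits right after the first consonant, as in k-um-ain
    for infix in INFIXES:
        long_enough = len(word) > len(infix) + 3
        if long_enough and word[0] not in VOWELS and word[1:1 + len(infix)] == infix:
            return [word[0], infix, word[1 + len(infix):]]
    return [word]


def pair_counts(words, presegment):
    """Count adjacent character pairs, the statistic BPE uses to pick merges.

    With presegment=True, pairs are counted only inside each morpheme piece,
    so no merge can ever span a morpheme boundary.
    """
    counts = Counter()
    for word in words:
        pieces = segment(word) if presegment else [word]
        for piece in pieces:
            counts.update(zip(piece, piece[1:]))
    return counts


words = ["kumain", "pagkain"]
plain = pair_counts(words, presegment=False)
aware = pair_counts(words, presegment=True)

# Pairs that plain BPE could merge but morphology-aware BPE cannot
print(sorted(set(plain) - set(aware)))
```

In this toy setup, the pairs that disappear once segmentation is enabled are exactly those that straddle a morpheme boundary, such as the `g` + `k` pair between pag and kain. Pairs inside a piece, such as `a` + `i`, are still counted in both words.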