Filipino Tokenizer
==================

**Morphology-aware BPE tokenization for Philippine languages.**

Filipino words are built by stacking prefixes, infixes, suffixes, and
circumfixes onto a root. A generic tokenizer trained on English treats this
morphology as noise and splits words at arbitrary character positions.
Filipino Tokenizer fixes that: it uses a rule-based morphological segmenter to
identify morpheme boundaries *before* running BPE, so the learned subword
units are always linguistically meaningful.

.. code-block:: python

   from filipino_tokenizer.tagalog import TagalogTokenizer

   tok = TagalogTokenizer()
   tok.train("corpus.txt", vocab_size=32000)

   tok.tokenize("Kumain siya ng pagkain.")
   # ['k', '▁', 'um', '▁', 'ain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.']

The root *kain* (eat) appears as a single token in both *kumain* and
*pagkain*, even though the surface forms look very different.

----

.. toctree::
   :maxdepth: 1
   :caption: Getting started

   installation
   quickstart

.. toctree::
   :maxdepth: 2
   :caption: User Guides

   guides/developers
   guides/researchers

.. toctree::
   :maxdepth: 2
   :caption: API Reference

   api/tokenizer
   api/segmenter
   api/bpe
   api/hf_tokenizer

.. toctree::
   :maxdepth: 1
   :caption: Project

   changelog