Researcher Guide¶
This guide explains the linguistic theory behind the tokenizer, the algorithmic design decisions, how to evaluate it, and how to extend it to other Philippine languages.
Filipino morphology primer¶
Tagalog is an agglutinative language — complex words are formed by attaching affixes to a root. Unlike English, where affixes attach only at word edges, Tagalog also uses infixes that are inserted inside the root.
Affix types¶
| Type | Example | Segmentation | Meaning |
|---|---|---|---|
| Prefix | pagkain | pag + kain | “food” (pag- nominalises) |
| Infix -um- | kumain | k + um + ain | “ate” (-um- marks actor focus, past) |
| Infix -in- | kinain | k + in + ain | “was eaten” (-in- marks object focus) |
| Suffix | kainan | kain + an | “dining place” (-an locative) |
| Circumfix | pagkainan | pag + kain + an | “dining hall” (pag- -an together) |
Infixes are particularly important for tokenisation. The surface form kumain
does not begin with the root kain; instead the root’s first consonant k
comes first, then the infix um, then the rest of the root ain.
A character-level tokenizer sees k, u, m, a, i, n
with no concept that kain is the meaningful unit.
Nasal assimilation¶
The prefixes pang- and mang- undergo nasal assimilation when the root
begins with certain consonants. Assimilation changes the prefix's surface form and,
for some roots, also drops the root’s initial consonant (as in pamili and panulat):
| Root initial consonant | Surface prefix | Example |
|---|---|---|
| b, p | pam- / mam- | pamili (pang + bili) |
| d, t, s | pan- / man- | panulat (pang + sulat) |
| k, g | pang- / mang- | pangkain (pang + kain) |
| vowel, h, l, m, n, w, y | pang- / mang- | pangasiwa (pang + asiwa) |
The TagalogPhonology class handles these rules in both the forward (apply) and
reverse (strip) directions.
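The forward direction of the table can be sketched in a few lines. This is a simplified illustration, not the library's TagalogPhonology: the function name `apply_nasal_assimilation` is hypothetical, and the sketch drops the root's initial consonant for the b/p and d/t/s rows only because both examples in the table do so.

```python
def apply_nasal_assimilation(prefix: str, root: str) -> str:
    """Attach 'pang' or 'mang' to a root, following the table's rows (simplified)."""
    if prefix not in ("pang", "mang"):
        raise ValueError("this sketch only handles 'pang' and 'mang'")
    stem = prefix[:-2]            # 'pa' or 'ma'
    first = root[0].lower()
    if first in "bp":             # pang + bili  -> pamili
        return stem + "m" + root[1:]
    if first in "dts":            # pang + sulat -> panulat
        return stem + "n" + root[1:]
    return prefix + root          # k, g, vowels and the rest: prefix unchanged


for root in ("bili", "sulat", "kain", "asiwa"):
    print(root, "->", apply_nasal_assimilation("pang", root))
```

The reverse (strip) direction is ambiguous by nature: pamili could in principle come from a root starting with b or p, so a stripper has to propose candidates and check them against the root dictionary.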
The Constrained BPE algorithm¶
Background¶
Standard Byte Pair Encoding (BPE) learns
subword units by repeatedly merging the most frequent adjacent pair of symbols in a
corpus. Applied naively to Filipino, it produces merges that cross morpheme
boundaries: for example, merging g and k in pagkain to create gk,
even though g ends the prefix pag- and k begins the root kain.
The CBPE constraint¶
This library implements Constrained BPE (CBPE), following the approach of Tacorda et al. (2024). The constraint is simple:
No merge may combine two symbols that are separated by a morpheme boundary marker.
The boundary marker is ▁ (U+2581, LOWER ONE EIGHTH BLOCK), the same character
used by SentencePiece.
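A minimal sketch of the constraint at the pair-counting step, assuming each word is stored as a tuple of symbols with ▁ as its own symbol. The function name `count_pairs` and the data layout are illustrative assumptions, not the library's internals.

```python
from collections import Counter

BOUNDARY = "\u2581"  # the ▁ morpheme boundary marker


def count_pairs(words):
    """Count adjacent symbol pairs, skipping any pair that touches a boundary.

    `words` maps a tuple of symbols to its corpus frequency.
    """
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            if BOUNDARY in (left, right):
                continue  # never count, and so never merge, across ▁
            pairs[(left, right)] += freq
    return pairs


corpus = {tuple("pag" + BOUNDARY + "kain"): 3}
pairs = count_pairs(corpus)
print(("g", "k") in pairs)   # g and k are separated by ▁, so no such pair exists
print(pairs[("k", "a")])     # pairs inside the root are counted normally
```

Because the marker stays in the symbol sequence and is never part of a counted pair, no sequence of merges can join symbols from opposite sides of it.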
Pipeline¶
    Raw text
        │
        ▼
    Pre-tokenize          Split on whitespace and punctuation
        │
        ▼
    Morphological         TagalogSegmenter identifies morphemes;
    Segmentation          TagalogTokenizer inserts ▁ into the surface text
        │                 at morpheme boundaries
        ▼
    Surface-annotated     e.g. "pag▁kain"  for pagkain
    tokens                e.g. "k▁um▁ain"  for kumain (infix)
        │
        ▼
    CBPE Training         BPE pair-counting skips any pair that
    (or Encoding)         spans a ▁ boundary
The critical detail for infix forms: the segmenter returns ['um', 'kain']
for kumain, but these morphemes do not concatenate to give the surface word.
The _surface_annotate method maps them back to the surface text with boundary
markers: k▁um▁ain. This means:
- k and um cannot be merged (▁ between them)
- um and ain cannot be merged (▁ between them)
- k and a cannot be merged (not adjacent in the token sequence; um is between them)
The root kain is therefore split in infixed words, which is unavoidable
given the phonological reality of Tagalog infixation. For prefix/suffix forms
(pag▁kain) the root kain appears intact and receives consistent token IDs.
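The mapping from segmenter output back to the surface string can be sketched as follows. This is an assumption-laden illustration of the idea, not the library's `_surface_annotate`: the function name `surface_annotate` is hypothetical, and it assumes an infix always sits directly after the root's first character.

```python
BOUNDARY = "\u2581"  # ▁


def surface_annotate(word, morphemes):
    """Insert ▁ into the surface form of `word` at morpheme boundaries."""
    # Prefix/suffix case: the morphemes concatenate to the surface word.
    if "".join(morphemes) == word:
        return BOUNDARY.join(morphemes)
    # Infix case: the segmenter returns [infix, root], as in ['um', 'kain'].
    if len(morphemes) == 2:
        infix, root = morphemes
        if root[:1] + infix + root[1:] == word:
            return BOUNDARY.join([root[:1], infix, root[1:]])
    # Anything else is left unannotated in this sketch.
    return word


print(surface_annotate("pagkain", ["pag", "kain"]))
print(surface_annotate("kumain", ["um", "kain"]))
```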
Heap-based incremental BPE¶
The MorphAwareBPE training loop uses an optimised incremental algorithm:
- Doubly-linked list: each unique word sequence is represented as a linked list of Node objects, enabling O(1) local edits when a merge is applied.
- Max-heap with lazy deletion: the most frequent pair is found in O(log n) time. Stale heap entries (whose count has decreased since they were pushed) are skipped at pop time.
- Position index: pair_positions[pair] is a set of nodes where the pair starts, enabling targeted updates instead of a full corpus rescan.
This brings training complexity from O(N²) (naïve BPE) down to O(N log V) where N is corpus size and V is vocabulary size.
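The lazy-deletion idea can be illustrated with Python's `heapq`, which is a min-heap, so counts are stored negated. The class name `LazyMaxHeap` and its methods are hypothetical and only sketch the technique; the linked list and position index are left out.

```python
import heapq


class LazyMaxHeap:
    """Max-heap over pair counts; outdated entries are discarded when popped."""

    def __init__(self, counts):
        self.counts = dict(counts)  # current, authoritative counts
        self.heap = [(-c, pair) for pair, c in self.counts.items()]
        heapq.heapify(self.heap)

    def update(self, pair, new_count):
        """Record a new count; the old heap entry stays behind and becomes stale."""
        self.counts[pair] = new_count
        if new_count > 0:
            heapq.heappush(self.heap, (-new_count, pair))

    def pop_best(self):
        """Return (pair, count) for the most frequent pair, or None if empty."""
        while self.heap:
            neg_count, pair = heapq.heappop(self.heap)
            if self.counts.get(pair, 0) == -neg_count:
                del self.counts[pair]  # the pair is about to be merged away
                return pair, -neg_count
            # Otherwise the entry is stale: skip it.
        return None


heap = LazyMaxHeap({("a", "b"): 5, ("b", "c"): 3})
heap.update(("a", "b"), 2)   # an earlier merge consumed some occurrences
print(heap.pop_best())       # the stale (5) entry for ('a', 'b') is skipped
```

Pushing a fresh entry instead of searching for the old one keeps every update at O(log n), at the cost of some dead entries in the heap.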
Morpheme segmentation passes¶
The TagalogSegmenter runs six passes in order, numbered 0 to 5, and returns the
first successful segmentation:
| Pass | Name | Logic | Example |
|---|---|---|---|
| 0 | Frozen-form guard | If the whole word is a root and stripping a prefix yields another root with an identical dictionary definition, return the word unsegmented. | pangalan → |
| 1 | Circumfix | Try all (prefix, suffix) circumfix pairs longest-first. Accept if the core is ≥ 4 chars, is a root, and is not a redundant duplicate. | pagkainan → |
| 2 | Prefix (recursive) | Strip the longest matching prefix. Recurse on the remainder (up to depth 3) to handle stacked prefixes. Try infix detection on the remainder before accepting a bare root. | pinakamahusay → |
| 3 | Infix | Check whether inserting | kumain → |
| 4 | Suffix | Strip suffix variants (including h-insertion: | kainan → |
| 5 | Fallback | Return | computer → |
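The "first successful pass wins" control flow can be sketched with a toy root set. Everything here is illustrative: the names `try_prefix`, `try_infix`, `try_suffix` and `segment` are hypothetical, the affix lists are tiny, the frozen-form guard, circumfix pass and prefix recursion are omitted, and the fallback is assumed to return the word unsegmented.

```python
ROOTS = {"kain", "sulat", "bili"}  # toy stand-in for the root dictionary
PREFIXES = ("pag", "mag")
INFIXES = ("um", "in")
SUFFIXES = ("an", "in")


def try_prefix(word):
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        rest = word[len(prefix):]
        if word.startswith(prefix) and rest in ROOTS:
            return [prefix, rest]
    return None


def try_infix(word):
    # The infix sits after the root's first consonant: k + um + ain.
    for infix in INFIXES:
        end = 1 + len(infix)
        root = word[:1] + word[end:]
        if word[1:end] == infix and root in ROOTS:
            return [infix, root]
    return None


def try_suffix(word):
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and stem in ROOTS:
            return [stem, suffix]
    return None


def segment(word):
    """Run the passes in order and return the first successful segmentation."""
    for attempt in (try_prefix, try_infix, try_suffix):
        result = attempt(word)
        if result is not None:
            return result
    return [word]  # assumed fallback: the word is returned unsegmented


for w in ("pagkain", "kumain", "kainan", "computer"):
    print(w, segment(w))
```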
Root validation¶
Every candidate root is checked against tagalog_roots.json (~28,000 entries).
The minimum root length is 4 characters (_MIN_ROOT = 4), which eliminates
spurious matches against short dictionary fragments like gka or nda that
appear in the roots file as inflected-form artefacts.
Redundancy check¶
The _is_redundant(word, root) method compares the dictionary definitions of
the whole word and the candidate root. If they are identical, the segmentation is
rejected — this catches duplicate entries like:
- pangalan: definition “name; reputation; repute; denomination”
- alan: definition “name; reputation; repute; denomination”
Without this check, the segmenter would produce ['pang', 'alan'] for a word
that is itself a frozen lexical entry.
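Both checks reduce to dictionary lookups. The sketch below assumes the roots file has been loaded into a plain word-to-definition dict; the helper names and the toy entries for kain and pagkain are illustrative, not the library's data or API.

```python
MIN_ROOT = 4  # mirrors the _MIN_ROOT = 4 threshold described above

# Toy word -> definition mapping standing in for the loaded roots file.
DEFINITIONS = {
    "pangalan": "name; reputation; repute; denomination",
    "alan": "name; reputation; repute; denomination",
    "pagkain": "food",
    "kain": "eat",
}


def is_valid_root(candidate, definitions=DEFINITIONS):
    """A candidate root must be long enough and present in the dictionary."""
    return len(candidate) >= MIN_ROOT and candidate in definitions


def is_redundant(word, root, definitions=DEFINITIONS):
    """True when the whole word and the candidate root share one definition."""
    word_def = definitions.get(word)
    return word_def is not None and word_def == definitions.get(root)


print(is_redundant("pangalan", "alan"))  # identical definitions: reject the split
print(is_redundant("pagkain", "kain"))   # different definitions: keep the split
```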
Evaluation methodology¶
Morpheme boundary accuracy¶
The primary metric used in the demo notebooks is morpheme boundary F1:
Gold standard: manually verified morpheme segmentations for ~200 words across prefixed, infixed, suffixed, circumfixed, stacked-prefix, and unsegmentable categories.
Predicted boundaries: token split positions output by the tokenizer.
F1: harmonic mean of precision (fraction of predicted boundaries that are gold) and recall (fraction of gold boundaries that are predicted).
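    def get_boundaries(segments):
        """Return the character offsets where one segment ends and the next begins."""
        boundaries = set()
        pos = 0
        # The end of the last segment is the end of the word, not a boundary.
        for s in segments[:-1]:
            pos += len(s)
            boundaries.add(pos)
        return boundaries


    def compute_f1(gold, pred):
        """Boundary F1 between two sets of offsets.

        Returns 0.0 when either set is empty, including the case where both are
        empty (an unsegmentable word that was correctly left whole).
        """
        hits = len(gold & pred)
        prec = hits / len(pred) if pred else 0.0
        rec = hits / len(gold) if gold else 0.0
        return 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0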
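A worked example of the metric, with the two functions repeated so the snippet runs on its own. The gold and predicted segmentations are illustrative. Because `get_boundaries` sums segment lengths, segments must be given in surface order, so an infixed form would be passed as k, um, ain rather than um, kain.

```python
def get_boundaries(segments):
    boundaries = set()
    pos = 0
    for s in segments[:-1]:
        pos += len(s)
        boundaries.add(pos)
    return boundaries


def compute_f1(gold, pred):
    hits = len(gold & pred)
    prec = hits / len(pred) if pred else 0.0
    rec = hits / len(gold) if gold else 0.0
    return 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0


gold = get_boundaries(["pag", "kain", "an"])   # boundaries at offsets 3 and 7
pred = get_boundaries(["pag", "kainan"])       # boundary at offset 3 only
score = compute_f1(gold, pred)
print(gold, pred, round(score, 3))
```

Here precision is 1.0 (the one predicted boundary is correct) and recall is 0.5 (one of two gold boundaries found), giving F1 = 2/3.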
Fertility¶
Fertility is the average number of tokens per word. Lower fertility means the tokenizer represents each Filipino word with fewer, longer units:

    tokens_per_word = len(tok.encode(sentence)) / len(sentence.split())
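Over more than one sentence, fertility is usually computed from corpus-level totals rather than by averaging per-sentence ratios. A small sketch, assuming only that the tokenizer exposes some `encode` callable returning a list of tokens; the helper name `fertility` and the two stand-in encoders are hypothetical.

```python
def fertility(encode, sentences):
    """Total tokens divided by total whitespace-separated words."""
    total_tokens = sum(len(encode(s)) for s in sentences)
    total_words = sum(len(s.split()) for s in sentences)
    return total_tokens / total_words if total_words else 0.0


# Two stand-in encoders that mark the extremes of the scale.
def char_level(sentence):
    return [ch for ch in sentence if not ch.isspace()]


def word_level(sentence):
    return sentence.split()


sentences = ["kumain ako"]
print(fertility(char_level, sentences))  # one token per character
print(fertility(word_level, sentences))  # one token per word
```

A subword tokenizer should land between these two baselines.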
Root consistency¶
For a given root (e.g., kain), encode the root alone, then check whether those exact IDs appear as a contiguous subsequence in the encoding of each inflected form. For prefix/suffix forms this will always hold; for infix forms it will not (the root is split around the infix), which is expected.
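The contiguous-subsequence check can be written as below. The encoder is a stand-in lookup table with made-up token IDs, used only so the snippet runs without the library; the helper names are hypothetical.

```python
def contains_run(ids, sub):
    """True if `sub` occurs in `ids` as a contiguous run."""
    n = len(sub)
    return n > 0 and any(ids[i:i + n] == sub for i in range(len(ids) - n + 1))


def root_consistency(encode, root, forms):
    """Map each inflected form to whether the root's IDs appear intact in it."""
    root_ids = encode(root)
    return {form: contains_run(encode(form), root_ids) for form in forms}


# Made-up IDs for illustration; a real check would call the tokenizer's encode.
TOY_ENCODINGS = {
    "kain": [12],
    "pagkain": [3, 12],       # pag + kain
    "kainan": [12, 8],        # kain + an
    "kumain": [20, 5, 21],    # k + um + ain: the root is split
}

report = root_consistency(TOY_ENCODINGS.__getitem__, "kain",
                          ["pagkain", "kainan", "kumain"])
print(report)
```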
Extending to a new language¶
The library is designed for multiple Philippine languages. All affix data is stored
in four shared JSON files (data/prefix_table.json etc.) filtered by a
"language" field. Adding a new language requires:
1. Add affix entries to the JSON tables:

        {
          "mag-": [
            {"language": "Tagalog", "function": "...", "etymology": "..."},
            {"language": "Bisaya", "function": "...", "etymology": "..."}
          ]
        }

2. Add a root file, data/<language>_roots.json, with the same schema as tagalog_roots.json:

        [
          {"word": "kaon", "definition": "to eat", "language": "Bisaya", "part_of_speech": "v", "link": ""}
        ]

3. Create an affixes class:

        # src/<language>/affixes.py
        from filipino_tokenizer.base import BaseAffixes

        class BisayaAffixes(BaseAffixes):
            def __init__(self):
                super().__init__(language="Bisaya")

4. Create a roots class:

        from filipino_tokenizer.base import BaseRoots

        class BisayaRoots(BaseRoots):
            def __init__(self):
                super().__init__(language="Bisaya", filename="bisaya_roots.json")

5. Create a phonology class: subclass or replace TagalogPhonology with language-specific rules (Bisaya has different nasal assimilation patterns).
6. Create a segmenter: subclass BaseSegmenter, implementing the same pass structure with language-appropriate adjustments.
7. Create a tokenizer: wire the segmenter into MorphAwareBPE, following TagalogTokenizer as a template.
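The per-language filtering of the shared tables can be pictured as follows. This is a sketch of the idea using the JSON shape shown in step 1, not the code in BaseAffixes; the function name `affixes_for` and the inline toy table are assumptions.

```python
import json

# Toy table in the shape of the affix JSON files; not the library's real data.
PREFIX_TABLE_JSON = """
{
  "mag-": [
    {"language": "Tagalog", "function": "...", "etymology": "..."},
    {"language": "Bisaya", "function": "...", "etymology": "..."}
  ],
  "pag-": [
    {"language": "Tagalog", "function": "...", "etymology": "..."}
  ]
}
"""


def affixes_for(table, language):
    """Keep only the entries, and therefore the affixes, tagged with `language`."""
    selected = {}
    for affix, entries in table.items():
        matches = [e for e in entries if e.get("language") == language]
        if matches:
            selected[affix] = matches
    return selected


table = json.loads(PREFIX_TABLE_JSON)
print(sorted(affixes_for(table, "Tagalog")))
print(sorted(affixes_for(table, "Bisaya")))
```

With this layout, adding a language to an existing affix is a matter of appending one more entry to that affix's list.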
References¶
Tacorda, Livelo, Ong, and Cheng (2024). Constraining Byte Pair Encoding (CBPE) to improve morphological segmentation for Filipino tokenizers. The CBPE approach this library implements.
Sennrich, Haddow, and Birch (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016. arXiv:1508.07909. The original BPE paper.
Cruz, J.P. and Cheng, C. (2022). Improving Large-scale Language Models and Resources for Filipino. Source of Filipino NLP benchmarks referenced in evaluation.
Miranda, L.J. (2023). calamanCy: A Tagalog Natural Language Processing Toolkit. spaCy-based Tagalog pipeline that informed morphological analysis design.