Researcher Guide

This guide explains the linguistic theory behind the tokenizer, the algorithmic design decisions, how to evaluate it, and how to extend it to other Philippine languages.

Filipino morphology primer

Tagalog is an agglutinative language — complex words are formed by attaching affixes to a root. Unlike English, where affixes attach only at word edges, Tagalog also uses infixes that are inserted inside the root.

Affix types

Type

Example

Segmentation

Meaning

Prefix

pagkain

pag + kain

“food” (pag- nominalises)

Infix -um-

kumain

k + um + ain

“ate” (-um- marks actor focus, past)

Infix -in-

kinain

k + in + ain

“was eaten” (-in- marks object focus)

Suffix

kainan

kain + an

“dining place” (-an locative)

Circumfix

pagkainan

pag + kain + an

“dining hall” (pag- -an together)

Infixes are particularly important for tokenisation. The surface form kumain does not begin with the root kain; instead the root’s first consonant k comes first, then the infix um, then the rest of the root ain. A character-level tokenizer sees k, u, m, a, i, n with no concept that kain is the meaningful unit.

Nasal assimilation

The prefixes pang- and mang- undergo nasal assimilation when the root begins with certain consonants, which changes both the prefix surface form and drops the root’s initial consonant:

Root initial consonant

Surface prefix

Example

b, p

pam- / mam-

pamili (pang + bili)

d, t, s

pan- / man-

panulat (pang + sulat)

k, g

pang- / mang-

pangkain (pang + kain)

vowel, h, l, m, n, w, y

pang- / mang-

pangasiwa (pang + asiwa)

The TagalogPhonology class handles forward (apply) and reverse (strip) direction for these rules.


The Constrained BPE algorithm

Background

Standard Byte Pair Encoding (BPE) learns subword units by repeatedly merging the most frequent adjacent pair of symbols in a corpus. Applied naively to Filipino, it produces merges that cross morpheme boundaries — e.g., merging n and g in pagkain to create ng regardless of whether n and g belong to different morphemes.

The CBPE constraint

This library implements Constrained BPE (CBPE), following the approach of Tacorda et al. (2024). The constraint is simple:

No merge may combine two symbols that are separated by a morpheme boundary marker.

The boundary marker is (U+2581, LOWER ONE EIGHTH BLOCK), the same character used by SentencePiece.

Pipeline

Raw text
   │
   ▼
Pre-tokenize         Split on whitespace and punctuation
   │
   ▼
Morphological        TagalogSegmenter identifies morphemes;
Segmentation         TagalogTokenizer inserts ▁ into the surface text
   │                 at morpheme boundaries
   ▼
Surface-annotated    e.g. "pag▁kain" for pagkain
tokens               e.g. "k▁um▁ain" for kumain (infix)
   │
   ▼
CBPE Training        BPE pair-counting skips any pair that
(or Encoding)        spans a ▁ boundary

The critical detail for infix forms: the segmenter returns ['um', 'kain'] for kumain, but these morphemes do not concatenate to give the surface word. The _surface_annotate method maps them back to the surface text with boundary markers: k▁um▁ain. This means:

  • k and um cannot be merged ( between them)

  • um and ain cannot be merged ( between them)

  • k and a cannot be merged (not adjacent in the token sequence — um is between them)

The root fragment kain is therefore split in infix words, which is unavoidable given the phonological reality of Tagalog infixation. For prefix/suffix forms (pag▁kain) the root kain appears intact and receives consistent token IDs.

Heap-based incremental BPE

The MorphAwareBPE training loop uses an optimised incremental algorithm:

  1. Doubly-linked list — each unique word sequence is represented as a linked list of Node objects, enabling O(1) local edits when a merge is applied.

  2. Max-heap with lazy deletion — the most frequent pair is found in O(log n) time. Stale heap entries (whose count has decreased since they were pushed) are skipped at pop time.

  3. Position indexpair_positions[pair] is a set of nodes where the pair starts, enabling targeted updates instead of a full corpus rescan.

This brings training complexity from O(N²) (naïve BPE) down to O(N log V) where N is corpus size and V is vocabulary size.


Morpheme segmentation passes

The TagalogSegmenter runs five passes in order, returning the first successful segmentation:

Pass

Name

Logic

Example

0

Frozen-form guard

If the whole word is a root and stripping a prefix yields another root with an identical dictionary definition, return the word unsegmented.

pangalan['pangalan'] (not pang + alan)

1

Circumfix

Try all (prefix, suffix) circumfix pairs longest-first. Accept if the core is ≥ 4 chars, is a root, and is not a redundant duplicate.

pagkainan['pag', 'kain', 'an']

2

Prefix (recursive)

Strip the longest matching prefix. Recurse on the remainder (up to depth 3) to handle stacked prefixes. Try infix detection on the remainder before accepting a bare root.

pinakamahusay['pinaka', 'ma', 'husay']

3

Infix

Check whether inserting -um- or -in- after the first consonant gives a valid root (≥ 4 chars, in dictionary).

kumain['um', 'kain']

4

Suffix

Strip suffix variants (including h-insertion: -an/-han, -in/-hin). Accept if root is ≥ 4 chars and in dictionary.

kainan['kain', 'an']

5

Fallback

Return [word] unsegmented.

computer['computer']

Root validation

Every candidate root is checked against tagalog_roots.json (~28,000 entries). The minimum root length is 4 characters (_MIN_ROOT = 4), which eliminates spurious matches against short dictionary fragments like gka or nda that appear in the roots file as inflected-form artefacts.

Redundancy check

The _is_redundant(word, root) method compares the dictionary definitions of the whole word and the candidate root. If they are identical, the segmentation is rejected — this catches duplicate entries like:

  • pangalan — definition: “name; reputation; repute; denomination”

  • alan — definition: “name; reputation; repute; denomination”

Without this check, the segmenter would produce ['pang', 'alan'] for a word that is itself a frozen lexical entry.


Evaluation methodology

Morpheme boundary accuracy

The primary metric used in the demo notebooks is morpheme boundary F1:

  • Gold standard: manually verified morpheme segmentations for ~200 words across prefixed, infixed, suffixed, circumfixed, stacked-prefix, and unsegmentable categories.

  • Predicted boundaries: token split positions output by the tokenizer.

  • F1: harmonic mean of precision (fraction of predicted boundaries that are gold) and recall (fraction of gold boundaries that are predicted).

def get_boundaries(segments):
    boundaries = set()
    pos = 0
    for s in segments[:-1]:
        pos += len(s)
        boundaries.add(pos)
    return boundaries

def compute_f1(gold, pred):
    hits = len(gold & pred)
    prec = hits / len(pred) if pred else 0.0
    rec  = hits / len(gold) if gold else 0.0
    return 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0

Fertility

Fertility = tokens per word. Lower fertility means the tokenizer is compressing Filipino words into more meaningful units:

tokens_per_word = len(tok.encode(sentence)) / len(sentence.split())

Root consistency

For a given root (e.g., kain), encode the root alone, then check whether those exact IDs appear as a contiguous subsequence in the encoding of each inflected form. For prefix/suffix forms this will always hold; for infix forms it will not (the root is split around the infix), which is expected.


Extending to a new language

The library is designed for multiple Philippine languages. All affix data is stored in four shared JSON files (data/prefix_table.json etc.) filtered by a "language" field. Adding a new language requires:

  1. Add affix entries to the JSON tables:

    {
      "mag-": [
        {"language": "Tagalog", "function": "...", "etymology": "..."},
        {"language": "Bisaya",  "function": "...", "etymology": "..."}
      ]
    }
    
  2. Add a root filedata/<language>_roots.json, same schema as tagalog_roots.json:

    [
      {"word": "kaon", "definition": "to eat", "language": "Bisaya",
       "part_of_speech": "v", "link": ""}
    ]
    
  3. Create an affixes class:

    # src/<language>/affixes.py
    from filipino_tokenizer.base import BaseAffixes
    
    class BisayaAffixes(BaseAffixes):
        def __init__(self):
            super().__init__(language="Bisaya")
    
  4. Create a roots class:

    from filipino_tokenizer.base import BaseRoots
    
    class BisayaRoots(BaseRoots):
        def __init__(self):
            super().__init__(language="Bisaya", filename="bisaya_roots.json")
    
  5. Create a phonology class — subclass or replace TagalogPhonology with language-specific rules (Bisaya has different nasal assimilation patterns).

  6. Create a segmenter — subclass BaseSegmenter, implementing the same pass structure with language-appropriate adjustments.

  7. Create a tokenizer — wire the segmenter into MorphAwareBPE, following TagalogTokenizer as a template.


References

  • Tacorda, Livelo, Ong, and Cheng (2024). Constraining Byte Pair Encoding (CBPE) to improve morphological segmentation for Filipino tokenizers. The CBPE approach this library implements.

  • Sennrich, Haddow, and Birch (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016. arXiv:1508.07909 The original BPE paper.

  • Cruz, J.P. and Cheng, C. (2022). Improving Large-scale Language Models and Resources for Filipino. Source of Filipino NLP benchmarks referenced in evaluation.

  • Miranda, L.J. (2023). calamanCy: A Tagalog Natural Language Processing Toolkit. SpaCy-based Tagalog pipeline that informed morphological analysis design.