TagalogSegmenter
class filipino_tokenizer.tagalog.segmenter.TagalogSegmenter
Bases: BaseSegmenter
Multi-pass morphological segmenter for Tagalog.
Pass order (per SKILL.md):
1. Frozen-form guard: skip affix analysis for words whose stripped root has the same definition as the whole word in the root dictionary.
2. Circumfix detection: ka- -han, pag- -an, etc.
3. Prefix stripping: longest match first, recursive for stacked prefixes.
4. Infix detection: -um- and -in- after the first consonant.
5. Suffix stripping: -an/-han and -in/-hin phonological variants.
6. Fallback: return [word] unsegmented.
Root validation: every candidate root is checked against the root dictionary before a segmentation is accepted.
Redundancy check: if both the whole word and the stripped root appear in the dictionary with identical definitions, the analysis is rejected. This catches frozen forms like ‘pangalan’, where ‘alan’ and ‘pangalan’ share the same definition (“name; reputation; repute; denomination”).
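The redundancy check can be illustrated with a minimal sketch. It assumes the root dictionary maps each entry to a definition string; the actual storage format is not shown on this page, and `TOY_ROOTS` and `is_redundant` are hypothetical names used only for illustration. The ‘pangalan’/‘alan’ definition is the one quoted above; the other two definitions are placeholders.

```python
# Hypothetical miniature dictionary: entry -> definition string.
# Only the 'pangalan' / 'alan' definition comes from the documentation;
# the 'pagkain' / 'kain' definitions are illustrative placeholders.
TOY_ROOTS = {
    "pangalan": "name; reputation; repute; denomination",
    "alan": "name; reputation; repute; denomination",
    "pagkain": "placeholder definition A",
    "kain": "placeholder definition B",
}


def is_redundant(word, root, roots):
    """Return True when word and root are both listed with the same definition."""
    word_def = roots.get(word)
    root_def = roots.get(root)
    return word_def is not None and word_def == root_def


print(is_redundant("pangalan", "alan", TOY_ROOTS))  # True: analysis rejected
print(is_redundant("pagkain", "kain", TOY_ROOTS))   # False: analysis may proceed
```

A word that is missing from the dictionary never counts as redundant in this sketch, so ordinary derived forms are unaffected.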
_MIN_ROOT = 4: roots shorter than 4 characters are rejected to avoid spurious matches against short dictionary fragments (e.g. ‘gka’, ‘nda’) that appear as roots only because the dictionary stores inflected forms under truncated keys.
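As a rough sketch of how the length threshold combines with the dictionary lookup, the check below accepts a candidate only if it passes both tests. The set `TOY_ROOTS` and the function `is_valid_root` are assumptions made for this example, not the segmenter's internals; the short fragments are the ones named above.

```python
MIN_ROOT = 4  # mirrors the documented _MIN_ROOT value

# Hypothetical root list; 'gka' and 'nda' stand in for the truncated
# dictionary keys mentioned in the documentation.
TOY_ROOTS = {"kain", "ganda", "husay", "gka", "nda"}


def is_valid_root(candidate, roots, min_len=MIN_ROOT):
    """Accept a candidate only if it is long enough and listed as a root."""
    return len(candidate) >= min_len and candidate in roots


print(is_valid_root("kain", TOY_ROOTS))  # True: 4 characters and listed
print(is_valid_root("gka", TOY_ROOTS))   # False: listed but too short
print(is_valid_root("kaen", TOY_ROOTS))  # False: long enough but not listed
```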
VOWELS = frozenset({'a', 'e', 'i', 'o', 'u'})
Method reference
| Method | Signature | Description |
|---|---|---|
| segment | | Decompose a single word into morphemes. |
| segment_text | | Split text on whitespace/punctuation, then segment each word. |
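The text-level behaviour can be sketched as a tokenizer that keeps whitespace and punctuation as separate tokens and passes only word tokens to a word-level segmenter. This is an assumption about the approach, not the library's implementation: `segment_text_sketch`, `toy_segment`, and the regular expression are hypothetical, and the stand-in segmenter simply looks up two results copied from the Examples section.

```python
import re

# One alternative per token class: word characters, whitespace runs,
# or a single punctuation character.
_TOKEN_RE = re.compile(r"(?P<word>\w+)|(?P<space>\s+)|(?P<punct>[^\w\s])")

# Stand-in for a real word segmenter: results copied from documented examples.
_DOCUMENTED = {"kumain": ["um", "kain"], "pagkain": ["pag", "kain"]}


def toy_segment(word):
    """Return the documented segmentation if known, else the word unsegmented."""
    return _DOCUMENTED.get(word.lower(), [word])


def segment_text_sketch(text, segment_word):
    """Segment word tokens; pass whitespace and punctuation through unchanged."""
    pieces = []
    for match in _TOKEN_RE.finditer(text):
        if match.lastgroup == "word":
            pieces.extend(segment_word(match.group()))
        else:
            pieces.append(match.group())
    return pieces


print(segment_text_sketch("Kumain siya ng pagkain.", toy_segment))
# ['um', 'kain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.']
```

Keeping the separators in the output means the original spacing can be recovered from the token list, which matches the shape of the documented `segment_text` result.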
Segmentation pass order
1. Frozen-form guard: skip affix analysis for words whose stripped root has the same definition as the whole word in the root dictionary.
2. Circumfix detection: ka- -han, pag- -an, etc.
3. Prefix stripping: longest match first, recursive for stacked prefixes (depth limit: 3).
4. Infix detection: -um- and -in- after the first consonant.
5. Suffix stripping: -an/-han and -in/-hin phonological variants.
6. Fallback: return [word] unsegmented.
Root validation is applied at every pass: a candidate root must be ≥ 4 characters
and present in tagalog_roots.json.
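The pass order above can be sketched as a small pipeline in which each pass either returns a segmentation or hands the word to the next pass. Everything here is a simplified assumption for illustration: the affix lists are tiny, the root list is a hard-coded set rather than tagalog_roots.json, the frozen-form guard is reduced to a lookup set, and the function names are hypothetical.

```python
MIN_ROOT = 4
VOWELS = frozenset("aeiou")
ROOTS = {"kain", "ganda", "husay", "siya"}   # toy root list (assumption)
FROZEN = {"pangalan"}                        # toy frozen-form list (assumption)
CIRCUMFIXES = [("ka", "han"), ("pag", "an")]
PREFIXES = sorted(["pinaka", "pag", "ma"], key=len, reverse=True)  # longest first
INFIXES = ["um", "in"]
SUFFIXES = ["han", "hin", "an", "in"]


def valid(root):
    """Root validation: long enough and present in the root list."""
    return len(root) >= MIN_ROOT and root in ROOTS


def strip_circumfix(word):
    for pre, suf in CIRCUMFIXES:
        if word.startswith(pre) and word.endswith(suf):
            root = word[len(pre):-len(suf)]
            if valid(root):
                return [pre, root, suf]
    return None


def strip_prefixes(word, depth=0):
    """Recursively strip up to three stacked prefixes."""
    if valid(word):
        return [word]
    if depth >= 3:
        return None
    for prefix in PREFIXES:
        if word.startswith(prefix):
            rest = strip_prefixes(word[len(prefix):], depth + 1)
            if rest is not None:
                return [prefix] + rest
    return None


def strip_infix(word):
    """Look for an infix directly after an initial consonant."""
    if len(word) < 3 or word[0] in VOWELS:
        return None
    for infix in INFIXES:
        if word[1:1 + len(infix)] == infix:
            root = word[0] + word[1 + len(infix):]
            if valid(root):
                return [infix, root]
    return None


def strip_suffix(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            root = word[:-len(suffix)]
            if valid(root):
                return [root, suffix]
    return None


def segment(word):
    word = word.lower()
    if not word:
        return []
    if word in FROZEN:                      # pass 1: frozen-form guard
        return [word]
    for run_pass in (strip_circumfix, strip_prefixes, strip_infix, strip_suffix):
        result = run_pass(word)             # passes 2 to 5, in order
        if result is not None:
            return result
    return [word]                           # pass 6: fallback


print(segment("pinakamahusay"))  # ['pinaka', 'ma', 'husay']
print(segment("kasiyahan"))      # ['ka', 'siya', 'han']
print(segment("computer"))       # ['computer']
```

Trying circumfixes before prefixes matters in this sketch: "pagkainan" would otherwise fail the prefix pass, because "kainan" is not a listed root, and only be recovered by luck in a later pass.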
Examples
from filipino_tokenizer.tagalog import TagalogSegmenter
seg = TagalogSegmenter()
# Infix
seg.segment("kumain") # ['um', 'kain']
seg.segment("kinain") # ['in', 'kain']
# Prefix
seg.segment("pagkain") # ['pag', 'kain']
seg.segment("maganda") # ['ma', 'ganda']
# Circumfix
seg.segment("pagkainan") # ['pag', 'kain', 'an']
seg.segment("kasiyahan") # ['ka', 'siya', 'han']
# Stacked prefixes
seg.segment("pinakamahusay") # ['pinaka', 'ma', 'husay']
# Frozen form (identical definitions for whole word and stripped root)
seg.segment("pangalan") # ['pangalan']
# Loan word / no valid root found
seg.segment("computer") # ['computer']
# Empty input
seg.segment("") # []
# Case-insensitive
seg.segment("KUMAIN") == seg.segment("kumain") # True
# Full sentence
seg.segment_text("Kumain siya ng pagkain.")
# ['um', 'kain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.']