TagalogSegmenter

class filipino_tokenizer.tagalog.segmenter.TagalogSegmenter

Bases: BaseSegmenter

Multi-pass morphological segmenter for Tagalog.

Pass order (per SKILL.md):
  1. Frozen-form guard: words whose affix analysis is blocked by identical-definition duplicates in the dict.

  2. Circumfix detection: ka- -han, pag- -an, etc.

  3. Prefix stripping: longest-match-first, recursive for stacked prefixes

  4. Infix detection: -um- and -in- after first consonant

  5. Suffix stripping: -an/-han, -in/-hin phonology variants

  6. Fallback: return [word] unsegmented
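As a rough illustration of pass 4, the sketch below strips -um- or -in- when it directly follows an initial consonant and accepts the result only if the remainder is a known root. The names strip_infix and INFIXES and the one-entry ROOTS set are hypothetical stand-ins, not the library's internals.

```python
VOWELS = frozenset("aeiou")
INFIXES = ("um", "in")
ROOTS = {"kain"}  # hypothetical toy dictionary, not tagalog_roots.json


def strip_infix(word):
    """Return [infix, root] if an infix follows the first consonant, else None."""
    if len(word) < 4 or word[0] in VOWELS:
        return None
    for infix in INFIXES:
        if word[1:1 + len(infix)] == infix:
            # Rejoin the initial consonant with what follows the infix.
            root = word[0] + word[1 + len(infix):]
            if root in ROOTS:
                return [infix, root]
    return None


print(strip_infix("kumain"))  # ['um', 'kain']
print(strip_infix("kinain"))  # ['in', 'kain']
```

The infix is listed before the root here only to mirror the output order shown in the examples on this page.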

Root validation: every candidate root is checked against the root dictionary before a segmentation is accepted.

Redundancy check: if both the whole word and the stripped root appear in the dictionary with identical definitions, the analysis is rejected. This catches frozen forms like ‘pangalan’, where ‘alan’ and ‘pangalan’ share the same definition (“name; reputation; repute; denomination”).
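A minimal sketch of the redundancy check, assuming the dictionary can be viewed as a mapping from entry to definition string. The DEFINITIONS data and the is_redundant helper are hypothetical; only the ‘pangalan’/‘alan’ definition is taken from the description above.

```python
# Hypothetical toy dictionary: entry -> definition.
DEFINITIONS = {
    "pangalan": "name; reputation; repute; denomination",
    "alan": "name; reputation; repute; denomination",
    "pagkain": "food",
    "kain": "eat",
}


def is_redundant(word, root, definitions=DEFINITIONS):
    """True if word and root are both entries with identical definitions."""
    word_def = definitions.get(word)
    root_def = definitions.get(root)
    return word_def is not None and word_def == root_def


print(is_redundant("pangalan", "alan"))  # True: frozen form, analysis rejected
print(is_redundant("pagkain", "kain"))   # False: analysis may proceed
```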

_MIN_ROOT = 4: roots shorter than 4 characters are rejected to avoid spurious matches against short dictionary fragments (e.g. ‘gka’, ‘nda’) that appear as roots only because the dictionary stores inflected forms under truncated keys.
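The length rule and the dictionary lookup can be combined into a single predicate. The following is a sketch under the assumption that the root dictionary behaves like a set of strings; is_valid_root and the toy ROOTS set are illustrative names, not the class's actual helpers.

```python
MIN_ROOT = 4  # mirrors the _MIN_ROOT value described above

# Hypothetical toy dictionary that includes two truncated fragments.
ROOTS = {"kain", "ganda", "gka", "nda"}


def is_valid_root(candidate, roots=ROOTS, min_len=MIN_ROOT):
    """Accept a candidate only if it is long enough and is a dictionary entry."""
    return len(candidate) >= min_len and candidate in roots


print(is_valid_root("kain"))  # True
print(is_valid_root("gka"))   # False: in the dictionary but too short
```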

VOWELS = frozenset({'a', 'e', 'i', 'o', 'u'})
segment(word)
Parameters: word (str)
Return type: list


Method reference

Method         Signature              Description
segment        (word) -> list[str]    Decompose a single word into morphemes.
segment_text   (text) -> list[str]    Split text on whitespace/punctuation, then segment each word.
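The splitting step of segment_text can be approximated with a regular expression. This is only a sketch: it assumes whitespace runs and punctuation marks are kept as separate tokens, which is inferred from the example output on this page, and it takes the word-level segmenter as a parameter. The toy_segment function is a hypothetical stand-in for TagalogSegmenter.segment.

```python
import re

# Word runs, whitespace runs, and single punctuation marks become tokens.
_TOKEN = re.compile(r"\w+|\s+|[^\w\s]")


def segment_text(text, segment_word):
    """Tokenize text, segment word tokens, and pass other tokens through."""
    out = []
    for tok in _TOKEN.findall(text):
        if re.fullmatch(r"\w+", tok):
            out.extend(segment_word(tok))
        else:
            out.append(tok)
    return out


def toy_segment(word):
    # Hypothetical lookup table standing in for the real segmenter.
    table = {"kumain": ["um", "kain"], "pagkain": ["pag", "kain"]}
    return table.get(word.lower(), [word])


print(segment_text("Kumain siya ng pagkain.", toy_segment))
# ['um', 'kain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.']
```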


Segmentation pass order

  1. Frozen-form guard: words whose affix analysis is blocked by identical-definition duplicates in the root dictionary.

  2. Circumfix detection: ka- -han, pag- -an, etc.

  3. Prefix stripping: longest-match-first, recursive for stacked prefixes (depth limit: 3).

  4. Infix detection: -um- and -in- after first consonant.

  5. Suffix stripping: -an/-han, -in/-hin phonology variants.

  6. Fallback: return [word] unsegmented.
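To make pass 3 concrete, here is a hedged sketch of longest-match-first prefix stripping with a recursion depth limit of 3. The PREFIXES tuple, the toy ROOTS set, and strip_prefixes are assumptions made for illustration; the library's real prefix inventory is not listed on this page.

```python
# Hypothetical prefix inventory and toy root dictionary.
PREFIXES = ("pinaka", "pag", "ma", "ka")
ROOTS = {"husay", "kain", "ganda"}
MAX_DEPTH = 3  # at most three stacked prefixes


def strip_prefixes(word, depth=0):
    """Return [prefix, ..., root], or None if no valid analysis exists."""
    if word in ROOTS:
        return [word]
    if depth >= MAX_DEPTH:
        return None
    # Longest-match-first: try longer prefixes before shorter ones.
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) > len(prefix):
            rest = strip_prefixes(word[len(prefix):], depth + 1)
            if rest is not None:
                return [prefix] + rest
    return None


print(strip_prefixes("pinakamahusay"))  # ['pinaka', 'ma', 'husay']
print(strip_prefixes("computer"))       # None, so the caller falls back to [word]
```

Trying the longest prefix first keeps "pinaka" from being misread as a shorter prefix plus leftover material, and the depth limit bounds the recursion on words that never reach a valid root.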

Root validation is applied at every pass: a candidate root must be ≥ 4 characters and present in tagalog_roots.json.


Examples

from filipino_tokenizer.tagalog import TagalogSegmenter

seg = TagalogSegmenter()

# Infix
seg.segment("kumain")          # ['um', 'kain']
seg.segment("kinain")          # ['in', 'kain']

# Prefix
seg.segment("pagkain")         # ['pag', 'kain']
seg.segment("maganda")         # ['ma', 'ganda']

# Circumfix
seg.segment("pagkainan")       # ['pag', 'kain', 'an']
seg.segment("kasiyahan")       # ['ka', 'siya', 'han']

# Stacked prefixes
seg.segment("pinakamahusay")   # ['pinaka', 'ma', 'husay']

# Frozen form (identical definitions for whole word and stripped root)
seg.segment("pangalan")        # ['pangalan']

# Loan word / no valid root found
seg.segment("computer")        # ['computer']

# Empty input
seg.segment("")                # []

# Case-insensitive
seg.segment("KUMAIN") == seg.segment("kumain")   # True

# Full sentence
seg.segment_text("Kumain siya ng pagkain.")
# ['um', 'kain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.']