TagalogSegmenter
================

.. autoclass:: filipino_tokenizer.tagalog.segmenter.TagalogSegmenter
   :members:
   :undoc-members:
   :show-inheritance:

----

Method reference
----------------

.. list-table::
   :header-rows: 1
   :widths: 25 30 45

   * - Method
     - Signature
     - Description
   * - ``segment``
     - ``(word) → list[str]``
     - Decompose a single word into morphemes.
   * - ``segment_text``
     - ``(text) → list[str]``
     - Split text on whitespace and punctuation, then segment each word.
       Separators are kept as their own tokens in the output.

----

Segmentation pass order
-----------------------

1. **Frozen-form guard**: a word is returned whole when the root dictionary
   gives it the same definition as its stripped root, which blocks affix
   analysis.
2. **Circumfix detection**: ``ka- -han``, ``pag- -an``, etc.
3. **Prefix stripping**: longest match first, recursive for stacked prefixes
   (depth limit: 3).
4. **Infix detection**: ``-um-`` and ``-in-`` inserted after the first
   consonant.
5. **Suffix stripping**: ``-an``/``-han`` and ``-in``/``-hin``, covering the
   phonological variants.
6. **Fallback**: return ``[word]`` unsegmented.

Root validation is applied at every pass: a candidate root must be at least
4 characters long and present in ``tagalog_roots.json``.

----

Examples
--------

.. code-block:: python

   from filipino_tokenizer.tagalog import TagalogSegmenter

   seg = TagalogSegmenter()

   # Infix
   seg.segment("kumain")         # ['um', 'kain']
   seg.segment("kinain")         # ['in', 'kain']

   # Prefix
   seg.segment("pagkain")        # ['pag', 'kain']
   seg.segment("maganda")        # ['ma', 'ganda']

   # Circumfix
   seg.segment("pagkainan")      # ['pag', 'kain', 'an']
   seg.segment("kasiyahan")      # ['ka', 'siya', 'han']

   # Stacked prefixes
   seg.segment("pinakamahusay")  # ['pinaka', 'ma', 'husay']

   # Frozen form (identical definitions for whole word and stripped root)
   seg.segment("pangalan")       # ['pangalan']

   # Loan word / no valid root found
   seg.segment("computer")       # ['computer']

   # Empty input
   seg.segment("")               # []

   # Case-insensitive
   seg.segment("KUMAIN") == seg.segment("kumain")  # True

   # Full sentence
   seg.segment_text("Kumain siya ng pagkain.")
   # ['um', 'kain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.']
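
To make the infix pass and the root-validation rule more concrete, the
following is a minimal standalone sketch. It is not the library's
implementation: ``TOY_ROOTS`` is a hypothetical stand-in for
``tagalog_roots.json``, and ``is_valid_root``, ``split_infix`` and
``segment_sketch`` are illustrative names that do not exist in
``filipino_tokenizer``. Only the infix pass and the fallback are modelled.

.. code-block:: python

   # Illustrative sketch only; names and data here are assumptions.
   TOY_ROOTS = {"kain", "ganda", "siya", "husay"}  # stand-in for tagalog_roots.json
   MIN_ROOT_LEN = 4
   VOWELS = set("aeiou")
   INFIXES = ("um", "in")


   def is_valid_root(candidate, roots=TOY_ROOTS):
       """Root validation: long enough and present in the root set."""
       return len(candidate) >= MIN_ROOT_LEN and candidate in roots


   def split_infix(word, roots=TOY_ROOTS):
       """Return [infix, root] if an infix follows the first consonant, else None."""
       word = word.lower()
       if len(word) < 3 or word[0] in VOWELS:
           return None
       for infix in INFIXES:
           if word[1:1 + len(infix)] == infix:
               # Remove the infix and rejoin the first consonant with the rest.
               root = word[0] + word[1 + len(infix):]
               if is_valid_root(root, roots):
                   return [infix, root]
       return None


   def segment_sketch(word, roots=TOY_ROOTS):
       """Infix pass followed by the fallback; the other passes are omitted."""
       if not word:
           return []
       return split_infix(word, roots) or [word.lower()]


   print(segment_sketch("kumain"))    # ['um', 'kain']
   print(segment_sketch("kinain"))    # ['in', 'kain']
   print(segment_sketch("computer"))  # ['computer']

Because the sketch has no prefix pass, ``segment_sketch("pagkain")`` falls
through to the fallback and returns ``['pagkain']``, whereas the real
segmenter is documented to return ``['pag', 'kain']``.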