TagalogTokenizer
================

.. autoclass:: filipino_tokenizer.tagalog.tokenizer.TagalogTokenizer
   :members:
   :undoc-members:
   :show-inheritance:

----

Method reference
----------------

.. list-table::
   :header-rows: 1
   :widths: 25 30 45

   * - Method
     - Signature
     - Description
   * - ``train``
     - ``(corpus_path, vocab_size=32000)``
     - Train BPE from a plain-text corpus file.
   * - ``encode``
     - ``(text) → list[int]``
     - Encode text to token IDs.
   * - ``decode``
     - ``(ids) → str``
     - Decode token IDs back to text.
   * - ``tokenize``
     - ``(text) → list[str]``
     - Return subword strings instead of IDs (for inspection).
   * - ``load_pretrained``
     - ``()``
     - Load the bundled 32k model shipped with the package. No path needed.
   * - ``save``
     - ``(directory)``
     - Write ``vocab.json`` and ``merges.txt`` to *directory*.
   * - ``load``
     - ``(directory)``
     - Load a previously saved tokenizer.

----

Attributes
----------

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Attribute
     - Description
   * - ``tok.bpe``
     - The underlying
       :class:`~filipino_tokenizer.tagalog.bpe.MorphAwareBPE` instance.
       Access ``tok.bpe.vocab`` (dict), ``tok.bpe.merges`` (list of tuples),
       ``tok.bpe.id_to_token`` (dict).
   * - ``tok.segmenter``
     - The underlying
       :class:`~filipino_tokenizer.tagalog.segmenter.TagalogSegmenter`
       instance. Use ``tok.segmenter.segment(word)`` independently.

----

Examples
--------

Load the bundled pretrained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

No download or path required; the 32k model is shipped with the package:

.. code-block:: python

   from filipino_tokenizer.tagalog import TagalogTokenizer

   tok = TagalogTokenizer()
   tok.load_pretrained()

   ids = tok.encode("Kumain siya ng pagkain.")

Train on your own corpus
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from filipino_tokenizer.tagalog import TagalogTokenizer

   tok = TagalogTokenizer()
   tok.train("corpus.txt", vocab_size=32000)

Encode / decode round-trip
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   ids = tok.encode("Kumain siya ng pagkain.")
   assert tok.decode(ids) == "kumain siya ng pagkain."

Inspect tokens
~~~~~~~~~~~~~~

.. code-block:: python

   tok.tokenize("Pinakamahusay ang ginawa niya.")
   # ['pinaka', '▁', 'ma', '▁', 'husay', ' ', 'ang', ' ', ...]

Save and reload
~~~~~~~~~~~~~~~

.. code-block:: python

   tok.save("my_tokenizer/")

   tok2 = TagalogTokenizer()
   tok2.load("my_tokenizer/")

   assert tok.encode("test") == tok2.encode("test")
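
Sanity-check a tokenizer object
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The round-trip and attribute descriptions above can be turned into two small
reusable checks. The following is a minimal sketch, not part of the package:
``roundtrip_ok``, ``vocab_is_consistent`` and ``ToyTokenizer`` are hypothetical
names introduced here for illustration. It assumes that decoding yields
lowercased text (as the round-trip example suggests) and that
``tok.bpe.vocab`` maps token strings to IDs while ``tok.bpe.id_to_token`` maps
IDs back to token strings. The toy stand-in class only exists so the sketch
runs without the package installed.

```python
from types import SimpleNamespace


def roundtrip_ok(tok, text, normalize=str.lower):
    """Return True if decode(encode(text)) equals the normalized input.

    The lowercase default is an assumption based on the round-trip example;
    pass a different `normalize` if the tokenizer behaves otherwise.
    """
    return tok.decode(tok.encode(text)) == normalize(text)


def vocab_is_consistent(bpe):
    """Return True if bpe.id_to_token is the exact inverse of bpe.vocab.

    Assumes vocab is {token: id} and id_to_token is {id: token}.
    """
    inverse = {idx: token for token, idx in bpe.vocab.items()}
    return inverse == dict(bpe.id_to_token)


class ToyTokenizer:
    """Hypothetical stand-in, NOT the real TagalogTokenizer.

    A lowercasing, character-level tokenizer that mimics only the
    encode/decode methods and the .bpe attribute layout described above.
    """

    def __init__(self, alphabet):
        vocab = {ch: i for i, ch in enumerate(sorted(set(alphabet.lower())))}
        self.bpe = SimpleNamespace(
            vocab=vocab,
            merges=[],
            id_to_token={i: ch for ch, i in vocab.items()},
        )

    def encode(self, text):
        # Lowercase first, then map each character to its ID.
        return [self.bpe.vocab[ch] for ch in text.lower()]

    def decode(self, ids):
        # Map each ID back to its character and join.
        return "".join(self.bpe.id_to_token[i] for i in ids)


toy = ToyTokenizer("Kumain siya ng pagkain.")
print(roundtrip_ok(toy, "Kumain siya ng pagkain."))
print(vocab_is_consistent(toy.bpe))
```

With the real package, the same helpers would be called as
``roundtrip_ok(tok, "Kumain siya ng pagkain.")`` and
``vocab_is_consistent(tok.bpe)`` after ``tok.load_pretrained()``, provided the
assumptions above hold for the installed version.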