Changelog ========= 0.4.1 (2026-04-27) ------------------- Fixed ~~~~~ - **HuggingFace wrapper — ``_convert_token_to_id`` never returns ``None``** — After ``super().__init__()``, HuggingFace wraps special tokens in ``AddedToken`` objects. Passing an ``AddedToken`` as a plain ``dict`` key failed the vocab lookup in some HF versions, allowing ``None`` to propagate into padded ``input_ids`` and crashing batch tokenisation with: ``ValueError: type of None unknown: `` Fixed by using ``str(self.unk_token)`` before the vocab lookup and adding a final ``isinstance(result, int)`` guard so the function always returns a valid integer. - **Broader import-error handling for ``transformers``** — The ``except ImportError`` around the ``transformers`` import was too narrow; partial installs can raise ``AttributeError`` or other exceptions, silently stripping ``batch_encode_plus`` and other HF methods. Changed to ``except Exception`` with a proper stub-class fallback so the MRO is never crippled. 0.4.0 (2026-04-27) ------------------- Added ~~~~~ - **Rust BPE backend** — ``filipino_tokenizer._bpe_rust.CoreBPE`` is a compiled Rust extension (PyO3 + ``rustc-hash``) that replaces the pure-Python encode/decode loop. The greedy O(n²) BPE algorithm, tab-separated merge-rank lookup via ``FxHashMap``, and a reusable key buffer eliminate per-call allocation overhead. - **``MorphAwareBPE._init_rust()``** — builds the ``CoreBPE`` from the current ``vocab`` and ``merges`` after ``train()`` or ``load()``. The Python object retains its ``encode``/``decode`` API; the Rust layer is transparent to callers. - **``setup.py`` + ``setuptools-rust``** — ``pip install`` now compiles the Rust extension automatically (``setuptools-rust>=1.5.2`` added to build requirements). - **Integration test suite** — ``tests/test_rust_backend.py`` covers extension loading, encode/decode correctness, morpheme-boundary enforcement, encode-cache consistency, and full-sentence round-trips. Changed ~~~~~~~ - ``MorphAwareBPE.encode()`` and ``decode()`` delegate entirely to the Rust backend. The pure-Python ``_apply_merges()`` method has been removed. Breaking changes ~~~~~~~~~~~~~~~~ - **Rust is required to build the package.** Installing without a Rust toolchain will fail at the compilation step. See the :doc:`installation` page for setup instructions. 0.3.2 (2026-04-27) ------------------- Fixed ~~~~~ - **HuggingFace batch padding crash** — ``TagalogHFTokenizer`` now ensures special tokens resolve to valid IDs even for custom/older vocabularies. This prevents ``None`` from leaking into padded batches and fixes: ``ValueError: type of None unknown: `` - **Defensive ID conversion** — ``_convert_token_to_id()`` now handles ``None`` safely and always falls back to a valid unknown-token ID. 0.3.0 (2026-04-26) ------------------- Added ~~~~~ - **Bundled pretrained model** — ``vocab.json`` + ``merges.txt`` (32k vocabulary trained on Wikitext-TL-39) are now shipped inside the package. No separate download or path needed. - **``TagalogTokenizer.load_pretrained()``** — loads the bundled model in one call: .. code-block:: python from filipino_tokenizer.tagalog import TagalogTokenizer tok = TagalogTokenizer() tok.load_pretrained() ids = tok.encode("Kumain siya ng pagkain.") - **Zero-argument ``TagalogHFTokenizer()``** — ``vocab_file`` is now optional. Calling with no arguments loads the bundled pretrained model automatically: .. code-block:: python from filipino_tokenizer.tagalog import TagalogHFTokenizer tok = TagalogHFTokenizer() # loads bundled 32k model encoding = tok("Kumain siya ng pagkain.", return_tensors="pt") Changed ~~~~~~~ - ``TagalogHFTokenizer.__init__`` — ``vocab_file`` and ``merges_file`` are now optional (default ``None``). Existing code passing explicit file paths continues to work unchanged. 0.2.0 (2026-04-26) ------------------- Added ~~~~~ - **HuggingFace integration** — ``TagalogHFTokenizer`` wraps the tokenizer behind the ``PreTrainedTokenizer`` interface, making it compatible with ``Trainer``, TRL, Axolotl, LlamaFactory, and any other HuggingFace-based training pipeline. .. code-block:: bash pip install filipino-tokenizer[hf] .. code-block:: python from filipino_tokenizer.tagalog import TagalogHFTokenizer tok = TagalogHFTokenizer.from_pretrained("path/to/model/") encoding = tok("Kumain siya ng pagkain.", return_tensors="pt") - **Corpus download script** — ``scripts/download_corpus.py`` fetches `Wikitext-TL-39 `_ (~1.5M sentences) from HuggingFace Hub. - **Pretrained model** — 32,000-token vocabulary trained on Wikitext-TL-39 in ~2.6 minutes, included in ``demo/models/morph/``. - **Training progress reporting** — live ``stderr`` progress bars for both morphological segmentation and BPE merge phases (every 5%). - **Usage guide notebook** — ``demo/demo_tagalog_tokenizer.ipynb`` covers all library features end-to-end: training, encode/decode, segmentation, save/reload, HuggingFace integration, and SLM/LLM training setup. - **Optional dependency group** — ``pip install filipino-tokenizer[hf]`` installs ``transformers>=4.30``. Changed ~~~~~~~ - ``examples/training_tagalog_tokenizer.py`` rewritten to focus on Wikitext-TL-39. Accepts ``--vocab-size`` and ``--output`` CLI arguments. 0.1.0 (2026-04-26) ------------------- - Initial release. - ``TagalogTokenizer`` — full train / encode / decode / save / load pipeline. - ``TagalogSegmenter`` — multi-pass morphological segmenter with root validation, frozen-form guard, and recursive stacked-prefix support. - ``MorphAwareBPE`` — constrained BPE with doubly-linked list + max-heap training. - ``TagalogPhonology`` — nasal assimilation and suffix h-insertion rules. - Data: 28,000+ Tagalog roots, 92 prefixes, 34 suffixes, 2 infixes, 30 circumfixes.