Changelog
0.4.1 (2026-04-27)
Fixed
- HuggingFace wrapper (``_convert_token_to_id`` never returns ``None``): After ``super().__init__()``, HuggingFace wraps special tokens in ``AddedToken`` objects. Passing an ``AddedToken`` as a plain ``dict`` key failed the vocab lookup in some HF versions, allowing ``None`` to propagate into padded ``input_ids`` and crashing batch tokenisation with:

      ValueError: type of None unknown: <class 'NoneType'>

  Fixed by using ``str(self.unk_token)`` before the vocab lookup and adding a final ``isinstance(result, int)`` guard so the function always returns a valid integer.
- Broader import-error handling for ``transformers``: The ``except ImportError`` around the ``transformers`` import was too narrow; partial installs can raise ``AttributeError`` or other exceptions, silently stripping ``batch_encode_plus`` and other HF methods. Changed to ``except Exception`` with a proper stub-class fallback so the MRO is never crippled.
0.4.0 (2026-04-27)
Added
- Rust BPE backend: ``filipino_tokenizer._bpe_rust.CoreBPE`` is a compiled Rust extension (PyO3 + ``rustc-hash``) that replaces the pure-Python encode/decode loop. The greedy O(n²) BPE algorithm, tab-separated merge-rank lookup via ``FxHashMap``, and a reusable key buffer eliminate per-call allocation overhead.
- ``MorphAwareBPE._init_rust()``: builds the ``CoreBPE`` from the current ``vocab`` and ``merges`` after ``train()`` or ``load()``. The Python object retains its ``encode``/``decode`` API; the Rust layer is transparent to callers.
- ``setup.py`` + ``setuptools-rust``: ``pip install`` now compiles the Rust extension automatically (``setuptools-rust>=1.5.2`` added to build requirements).
- Integration test suite: ``tests/test_rust_backend.py`` covers extension loading, encode/decode correctness, morpheme-boundary enforcement, encode-cache consistency, and full-sentence round-trips.
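For readers unfamiliar with greedy BPE, the merge loop and the morpheme-boundary constraint mentioned above can be sketched in plain Python. This is an illustration only, not the ``CoreBPE`` source: ``greedy_merge``, ``encode_morphemes``, and the toy ``RANKS`` table are hypothetical, and applying merges separately inside each morpheme is an assumption about how boundaries are enforced. The tab-separated pair key mirrors the lookup format described in the entry.

```python
def greedy_merge(symbols, ranks):
    """Repeatedly merge the adjacent pair with the lowest merge rank (O(n^2))."""
    symbols = list(symbols)
    while len(symbols) > 1:
        best_rank, best_i = None, -1
        for i in range(len(symbols) - 1):
            # Merge ranks are keyed by "left<TAB>right".
            rank = ranks.get(symbols[i] + "\t" + symbols[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i < 0:
            break  # no applicable merge is left
        symbols[best_i:best_i + 2] = [symbols[best_i] + symbols[best_i + 1]]
    return symbols


def encode_morphemes(morphemes, ranks):
    """Apply merges inside each morpheme so no token spans a boundary."""
    tokens = []
    for morpheme in morphemes:
        tokens.extend(greedy_merge(morpheme, ranks))
    return tokens


# Toy merge table (hypothetical): lower rank means the merge was learned earlier.
RANKS = {"k\ta": 0, "g\tka": 1, "i\tn": 2}

print(greedy_merge("pagkain", RANKS))               # ['p', 'a', 'gka', 'in']
print(encode_morphemes(["pag", "kain"], RANKS))     # ['p', 'a', 'g', 'ka', 'in']
```

In the unconstrained call the token ``gka`` straddles the ``pag``/``kain`` boundary; encoding per morpheme prevents that merge from ever being considered.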
Changed
- ``MorphAwareBPE.encode()`` and ``decode()`` delegate entirely to the Rust backend. The pure-Python ``_apply_merges()`` method has been removed.
Breaking changes
- Rust is required to build the package. Installing without a Rust toolchain will fail at the compilation step. See the Installation page for setup instructions.
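One way to check for a Rust toolchain before installing is sketched below. The helper ``missing_rust_tools`` is hypothetical and not part of the package; it assumes the standard ``rustc`` and ``cargo`` executables are what the build needs on ``PATH``.

```python
import shutil


def missing_rust_tools(tools=("rustc", "cargo")):
    """Return the names of the given executables that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_rust_tools()
if missing:
    print("Missing Rust tools:", ", ".join(missing))
else:
    print("Rust toolchain found.")
```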
0.3.2 (2026-04-27)
Fixed
- HuggingFace batch padding crash: ``TagalogHFTokenizer`` now ensures special tokens resolve to valid IDs even for custom/older vocabularies. This prevents ``None`` from leaking into padded batches and fixes:

      ValueError: type of None unknown: <class 'NoneType'>

- Defensive ID conversion: ``_convert_token_to_id()`` now handles ``None`` safely and always falls back to a valid unknown-token ID.
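A defensive lookup of this kind might look like the sketch below. It is a standalone approximation rather than the method from ``TagalogHFTokenizer``: ``convert_token_to_id`` and ``FakeAddedToken`` are hypothetical stand-ins, and using ``0`` as a last-resort ID when the unknown token itself is missing from the vocabulary is an assumption.

```python
class FakeAddedToken:
    """Hypothetical stand-in for a wrapped special token; str() gives its text."""

    def __init__(self, content):
        self.content = content

    def __str__(self):
        return self.content


def convert_token_to_id(token, vocab, unk_token="<unk>"):
    """Map a token to an integer ID and never return None."""
    # Coerce to str first so a wrapped special token works as a dict key.
    unk_id = vocab.get(str(unk_token))
    if not isinstance(unk_id, int):
        unk_id = 0  # assumption: last-resort ID if <unk> is absent from the vocab
    if token is None:
        return unk_id
    result = vocab.get(str(token), unk_id)
    # Final guard: anything that is not an int falls back to the unknown ID.
    return result if isinstance(result, int) else unk_id


vocab = {"<unk>": 1, "kain": 5}
print(convert_token_to_id("kain", vocab))   # 5
print(convert_token_to_id(None, vocab))     # 1
print(convert_token_to_id("xyz", vocab, unk_token=FakeAddedToken("<unk>")))  # 1
```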
0.3.0 (2026-04-26)
Added
- Bundled pretrained model: ``vocab.json`` + ``merges.txt`` (32k vocabulary trained on Wikitext-TL-39) are now shipped inside the package. No separate download or path needed.
- ``TagalogTokenizer.load_pretrained()``: loads the bundled model in one call:

      from filipino_tokenizer.tagalog import TagalogTokenizer

      tok = TagalogTokenizer()
      tok.load_pretrained()  # loads the bundled 32k model
      ids = tok.encode("Kumain siya ng pagkain.")

- Zero-argument ``TagalogHFTokenizer()``: ``vocab_file`` is now optional. Calling with no arguments loads the bundled pretrained model automatically:

      from filipino_tokenizer.tagalog import TagalogHFTokenizer

      tok = TagalogHFTokenizer()  # loads bundled 32k model
      encoding = tok("Kumain siya ng pagkain.", return_tensors="pt")

Changed
- ``TagalogHFTokenizer.__init__``: ``vocab_file`` and ``merges_file`` are now optional (default ``None``). Existing code passing explicit file paths continues to work unchanged.
0.2.0 (2026-04-26)
Added
- HuggingFace integration: ``TagalogHFTokenizer`` wraps the tokenizer behind the ``PreTrainedTokenizer`` interface, making it compatible with ``Trainer``, TRL, Axolotl, LlamaFactory, and any other HuggingFace-based training pipeline.

      pip install filipino-tokenizer[hf]

      from filipino_tokenizer.tagalog import TagalogHFTokenizer

      tok = TagalogHFTokenizer.from_pretrained("path/to/model/")
      encoding = tok("Kumain siya ng pagkain.", return_tensors="pt")

- Corpus download script: ``scripts/download_corpus.py`` fetches Wikitext-TL-39 (~1.5M sentences) from HuggingFace Hub.
- Pretrained model: 32,000-token vocabulary trained on Wikitext-TL-39 in ~2.6 minutes, included in ``demo/models/morph/``.
- Training progress reporting: live ``stderr`` progress bars for both morphological segmentation and BPE merge phases (every 5%).
- Usage guide notebook: ``demo/demo_tagalog_tokenizer.ipynb`` covers all library features end-to-end: training, encode/decode, segmentation, save/reload, HuggingFace integration, and SLM/LLM training setup.
- Optional dependency group: ``pip install filipino-tokenizer[hf]`` installs ``transformers>=4.30``.
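Progress reporting at 5% intervals on ``stderr`` can be approximated as in the sketch below. The function ``progress_reporter`` and its bar format are hypothetical; the library's actual output may look different.

```python
import sys


def progress_reporter(total, label, step_pct=5, stream=sys.stderr):
    """Return an update(done) callback that redraws a bar every step_pct percent."""
    next_pct = step_pct

    def update(done):
        nonlocal next_pct
        pct = done * 100 // total
        if pct < next_pct:
            return False  # threshold not reached yet, print nothing
        filled = pct // step_pct
        bar = "#" * filled + "-" * (100 // step_pct - filled)
        stream.write(f"\r{label} [{bar}] {pct:3d}%")
        stream.flush()
        # Next report fires at the following multiple of step_pct.
        next_pct = (pct // step_pct + 1) * step_pct
        return True

    return update


update = progress_reporter(total=200, label="BPE merges")
for done in range(1, 201):
    update(done)
sys.stderr.write("\n")
```

Writing to ``stderr`` keeps the bar out of any data a training script sends to ``stdout``.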
Changed
- ``examples/training_tagalog_tokenizer.py`` rewritten to focus on Wikitext-TL-39. Accepts ``--vocab-size`` and ``--output`` CLI arguments.
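A script taking these two options could define them with ``argparse`` along the following lines. Only the flag names come from the entry above; ``build_parser``, the default values, and the help strings are assumptions for illustration.

```python
import argparse


def build_parser():
    """Build a parser with the two flags named in the changelog entry."""
    parser = argparse.ArgumentParser(
        description="Train a Tagalog tokenizer on Wikitext-TL-39."
    )
    parser.add_argument(
        "--vocab-size",
        type=int,
        default=32000,  # assumed default, matching the bundled 32k model
        help="target vocabulary size",
    )
    parser.add_argument(
        "--output",
        default="models/tagalog",  # hypothetical default output directory
        help="directory for the trained model files",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["--vocab-size", "16000", "--output", "my_model"])
print(args.vocab_size, args.output)  # 16000 my_model
```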
0.1.0 (2026-04-26)
Initial release.
- ``TagalogTokenizer``: full train / encode / decode / save / load pipeline.
- ``TagalogSegmenter``: multi-pass morphological segmenter with root validation, frozen-form guard, and recursive stacked-prefix support.
- ``MorphAwareBPE``: constrained BPE with doubly-linked list + max-heap training.
- ``TagalogPhonology``: nasal assimilation and suffix h-insertion rules.
- Data: 28,000+ Tagalog roots, 92 prefixes, 34 suffixes, 2 infixes, 30 circumfixes.
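The two phonological rules named above can be illustrated with a simplified sketch. It is not the ``TagalogPhonology`` implementation: ``assimilate_nasal`` and ``attach_suffix`` are hypothetical helpers, and they ignore real-world complications such as deletion of the root's first consonant (``pang-`` + ``putol`` giving ``pamutol``) and roots ending in a glottal stop, which take no ``h``.

```python
VOWELS = set("aeiou")


def assimilate_nasal(prefix, root):
    """Attach an ng-final prefix, adapting the nasal to the root's first sound."""
    if not prefix.endswith("ng") or not root:
        return prefix + root
    stem = prefix[:-2]
    first = root[0].lower()
    if first in "bp":
        return stem + "m" + root   # ng becomes m before b and p
    if first in "dtlrs":
        return stem + "n" + root   # ng becomes n before d, t, l, r, s
    return prefix + root           # ng is kept elsewhere (k, g, vowels, ...)


def attach_suffix(root, suffix):
    """Insert h between a vowel-final root and a vowel-initial suffix."""
    if root and suffix and root[-1].lower() in VOWELS and suffix[0].lower() in VOWELS:
        return root + "h" + suffix
    return root + suffix


print(assimilate_nasal("pang", "bansa"))  # pambansa
print(assimilate_nasal("pang", "dagat"))  # pandagat
print(assimilate_nasal("pang", "gabi"))   # panggabi
print(attach_suffix("basa", "in"))        # basahin
print(attach_suffix("sulat", "an"))       # sulatan
```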