Installation
Requirements
Python 3.10 or later
No external runtime dependencies
Pre-built wheels are published for Linux, macOS, and Windows on Python 3.10–3.13.
pip install downloads the right binary, so no compiler or Rust toolchain is needed.
Note: Installing from source (e.g. cloning the repo and running
pip install -e .) requires a Rust toolchain. See rustup.rs to install one.
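The requirements above can be checked before installing. The following is a minimal sketch using only the standard library; the helper names are hypothetical and not part of filipino-tokenizer, and treating 3.13 as the upper bound for pre-built wheels is an assumption based on the range stated above.

```python
import shutil
import sys

# Bounds taken from the requirements text; MAX_WHEEL_PYTHON is an assumption
# about which interpreters currently have pre-built wheels.
MIN_PYTHON = (3, 10)
MAX_WHEEL_PYTHON = (3, 13)


def python_supported(version=None):
    """Return True if the interpreter meets the 3.10 minimum."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= MIN_PYTHON


def wheel_expected(version=None):
    """Return True if the interpreter falls in the published wheel range."""
    if version is None:
        version = sys.version_info
    return MIN_PYTHON <= tuple(version[:2]) <= MAX_WHEEL_PYTHON


def rust_toolchain_available():
    """Return True if both cargo and rustc are found on PATH."""
    return shutil.which("cargo") is not None and shutil.which("rustc") is not None


print("Python version OK:", python_supported())
print("Pre-built wheel expected:", wheel_expected())
print("Rust toolchain found:", rust_toolchain_available())
```

If the second line prints False, or you plan to install from source, the third line should print True before you run pip.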
From PyPI
pip install filipino-tokenizer
From source
git clone https://github.com/JpCurada/filipino-tokenizer.git
cd filipino-tokenizer
pip install -e .
The -e flag installs in editable mode, so changes to the source are reflected
immediately without reinstalling.
Verify the installation
from filipino_tokenizer.tagalog import TagalogTokenizer, TagalogSegmenter
seg = TagalogSegmenter()
print(seg.segment("kumain")) # ['um', 'kain']
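To confirm which version pip actually installed, you can also query the package metadata. This is a small sketch using the standard library's importlib.metadata; the helper name installed_version is hypothetical, and the distribution name is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version("filipino-tokenizer")
if found is None:
    print("filipino-tokenizer is not installed in this environment")
else:
    print(f"filipino-tokenizer {found} is installed")
```

A "not installed" result usually means pip ran against a different interpreter or virtual environment than the one executing this script.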
Optional dependencies
HuggingFace integration
To use TagalogHFTokenizer with transformers-based training pipelines:
pip install "filipino-tokenizer[hf]"
This installs transformers>=4.30.
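If transformers was already present in the environment, it may be worth checking that it satisfies the 4.30 minimum. The sketch below compares only the major and minor components; the function names are hypothetical, and the check is a simplification rather than a full version-specifier parser.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum taken from the text above (transformers>=4.30).
REQUIRED = (4, 30)


def parse_major_minor(version_string):
    """Extract (major, minor) from a version string such as '4.41.2'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version_string, minimum=REQUIRED):
    """Return True if the version is at least the required major.minor."""
    return parse_major_minor(version_string) >= minimum


def transformers_status():
    """Return 'missing', 'too old', or 'ok' for the installed transformers."""
    try:
        found = version("transformers")
    except PackageNotFoundError:
        return "missing"
    return "ok" if meets_minimum(found) else "too old"


print("transformers:", transformers_status())
```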
Demo notebooks
The notebooks in demo/ use additional packages for comparisons and visualisations:
pip install plotly tiktoken sentencepiece jupyter
Corpus download
To download the Wikitext-TL-39 training corpus:
pip install datasets
python scripts/download_corpus.py