TagalogTokenizer¶
- class filipino_tokenizer.tagalog.tokenizer.TagalogTokenizer
Bases: object
End-to-end tokenizer for Tagalog text.
Usage:
tok = TagalogTokenizer()
tok.train("corpus.txt", vocab_size=32000)
ids = tok.encode("Kumain siya ng pagkain.")
text = tok.decode(ids)
assert text == "kumain siya ng pagkain."
- train(corpus_path, vocab_size=32000)
Train the tokenizer from a plain-text corpus file.
- Steps:
1. Read the corpus file line by line.
2. Pre-tokenize each line into words and punctuation.
3. Segment each word morphologically.
4. Insert boundary markers into the surface text at morpheme boundaries (preserving the original spelling).
5. Train BPE with the CBPE constraint.
Parameters¶
- corpus_path (str)
Path to a UTF-8 plain-text file (one sentence per line).
- vocab_size (int)
Target BPE vocabulary size.
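The last training step can be illustrated with a small sketch. This page does not define the CBPE constraint, so the code below assumes it means that BPE merges may never cross a morpheme boundary marker. The marker character "▁" is taken from the tokenize() output shown in the examples, and the function names (count_pairs, merge_pair, train_constrained_bpe) and the sample segmentations are hypothetical. This is not the package's implementation.

```python
from collections import Counter

BOUNDARY = "▁"  # assumed morpheme-boundary marker


def count_pairs(words):
    """Count adjacent symbol pairs, skipping pairs that touch a boundary marker."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            if BOUNDARY in (left, right):
                continue  # assumed constraint: never merge across a boundary
            pairs[(left, right)] += freq
    return pairs


def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in a symbol tuple with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_constrained_bpe(marked_words, num_merges):
    """Learn up to `num_merges` merges from boundary-marked words."""
    words = Counter(tuple(w) for w in marked_words)
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break  # every morpheme is already a single symbol
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges, words


# Hypothetical pre-segmented input: prefix + root, joined by the marker.
merges, words = train_constrained_bpe(["nag▁luto", "nag▁laro", "mag▁luto"], 50)
```

Under this assumption a learned token never spans two morphemes, so with enough merges each word ends up as whole morphemes separated by the marker.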
- encode(text)
Encode text into a list of integer token IDs.
The text is lowercased and split into words and punctuation. Each word is then morphologically segmented (with boundary markers inserted into the surface form), and BPE encoding is applied to the result.
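The first two stages of encoding (lowercasing and splitting into words and punctuation) might look roughly like the sketch below. The regular expression and the function name pretokenize are assumptions made for illustration. The package's actual pre-tokenizer may treat hyphens, apostrophes, or digits differently.

```python
import re

# Assumed pattern: a run of word characters, or one non-space punctuation mark.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def pretokenize(text):
    """Lowercase the text and split it into word and punctuation strings."""
    return _TOKEN_RE.findall(text.lower())


pieces = pretokenize("Kumain siya ng pagkain.")
# pieces == ['kumain', 'siya', 'ng', 'pagkain', '.']
```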
- tokenize(text)
Tokenize text into subword strings (for debugging / inspection).
Returns the string representation of each BPE token rather than integer IDs.
- decode(ids)
Decode a list of token IDs back to a readable string.
Boundary markers and special tokens are removed. Spaces between words are reconstructed by detecting word-boundary tokens.
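As a rough illustration of this clean-up, the sketch below works on token strings rather than IDs. It assumes the morpheme marker is "▁" and that word boundaries are literal " " tokens, as in the tokenize() output shown in the examples. The special-token names and the function name detokenize are hypothetical.

```python
BOUNDARY = "▁"                       # assumed morpheme-boundary marker
SPECIAL_TOKENS = {"<pad>", "<unk>"}  # hypothetical special tokens


def detokenize(tokens):
    """Drop special tokens and boundary markers, then join the remaining pieces."""
    kept = [t for t in tokens if t not in SPECIAL_TOKENS]
    # Word-boundary tokens are assumed to be plain spaces, so joining restores spacing.
    return "".join(kept).replace(BOUNDARY, "")


text = detokenize(["pinaka", "▁", "ma", "▁", "husay", " ", "ang"])
# text == 'pinakamahusay ang'
```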
- save(directory)
Save the trained tokenizer to directory.
- Creates:
vocab.json: BPE vocabulary mapping
merges.txt: ordered merge rules
- Parameters:
directory (str)
- Return type:
None
- load(directory)
Load a previously saved tokenizer from directory.
- Parameters:
directory (str)
- Return type:
None
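To make the two-file layout concrete, here is a minimal sketch of writing and reading a vocabulary mapping and an ordered merge list. The exact on-disk format is not specified on this page. The sketch assumes vocab.json is a token-to-ID JSON object and merges.txt holds one space-separated pair per line, which only works when merge symbols contain no spaces. The helper names save_files and load_files are hypothetical.

```python
import json
import os
import tempfile


def save_files(directory, vocab, merges):
    """Write vocab.json and merges.txt into `directory` (assumed formats)."""
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, "vocab.json"), "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False, indent=2)
    with open(os.path.join(directory, "merges.txt"), "w", encoding="utf-8") as f:
        for left, right in merges:
            f.write(f"{left} {right}\n")  # one merge rule per line, in order


def load_files(directory):
    """Read back the vocabulary mapping and the ordered merge rules."""
    with open(os.path.join(directory, "vocab.json"), encoding="utf-8") as f:
        vocab = json.load(f)
    with open(os.path.join(directory, "merges.txt"), encoding="utf-8") as f:
        merges = [tuple(line.rstrip("\n").split(" ")) for line in f if line.strip()]
    return vocab, merges


vocab = {"ka": 0, "in": 1, "kain": 2}
merges = [("k", "a"), ("i", "n"), ("ka", "in")]
with tempfile.TemporaryDirectory() as tmp:
    save_files(tmp, vocab, merges)
    loaded_vocab, loaded_merges = load_files(tmp)
```

Keeping the merges in file order matters because BPE applies them in the order they were learned.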
- prewarm(lines)
Pre-segment all unique words across lines to warm the segment cache.
TagalogTokenizer caches morphological segmentation per word in _segment_cache. A large corpus has millions of lines but typically only tens of thousands of unique words. Calling this before encode() / tokenize() ensures each word is segmented exactly once, cutting tokenization time by ~10x on real corpora.
Parameters¶
- lines (list[str])
The same lines you intend to tokenize.
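The caching idea can be sketched as follows. The class CachedSegmenter, its calls counter, and the whitespace-based word split are assumptions for illustration only. The real tokenizer uses its own pre-tokenizer and morphological segmenter.

```python
class CachedSegmenter:
    """Wrap a segmentation function with a per-word cache."""

    def __init__(self, segment_fn):
        self._segment_fn = segment_fn
        self._segment_cache = {}
        self.calls = 0  # how many times the underlying function actually ran

    def segment(self, word):
        """Return the cached segmentation, computing it on first use."""
        if word not in self._segment_cache:
            self.calls += 1
            self._segment_cache[word] = self._segment_fn(word)
        return self._segment_cache[word]

    def prewarm(self, lines):
        """Segment every unique word across `lines` exactly once."""
        unique_words = {w for line in lines for w in line.lower().split()}
        for word in unique_words:
            self.segment(word)


# Placeholder segmenter: returns the word unsplit.
seg = CachedSegmenter(lambda w: [w])
seg.prewarm(["kumain siya", "kumain ako", "siya ako"])
calls_after_prewarm = seg.calls   # one call per unique word
seg.segment("kumain")             # served from the cache, no new call
```

Six word occurrences here collapse to three unique words, and the saving grows with corpus size because the vocabulary grows far more slowly than the line count.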
Method reference¶
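| Method | Signature | Description |
|---|---|---|
| train | train(corpus_path, vocab_size=32000) | Train BPE from a plain-text corpus file. |
| encode | encode(text) | Encode text to token IDs. |
| decode | decode(ids) | Decode token IDs back to text. |
| tokenize | tokenize(text) | Return subword strings instead of IDs (for inspection). |
| load_pretrained | load_pretrained() | Load the bundled 32k model shipped with the package. No path needed. |
| save | save(directory) | Write vocab.json and merges.txt to directory. |
| load | load(directory) | Load a previously saved tokenizer. |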
Attributes¶
| Attribute | Description |
|---|---|
| | The underlying |
| | The underlying |
Examples¶
Load the bundled pretrained model¶
No download or path is required because the 32k model ships with the package:
from filipino_tokenizer.tagalog import TagalogTokenizer
tok = TagalogTokenizer()
tok.load_pretrained()
ids = tok.encode("Kumain siya ng pagkain.")
Train on your own corpus¶
from filipino_tokenizer.tagalog import TagalogTokenizer
tok = TagalogTokenizer()
tok.train("corpus.txt", vocab_size=32000)
Encode / decode round-trip¶
ids = tok.encode("Kumain siya ng pagkain.")
assert tok.decode(ids) == "kumain siya ng pagkain."
Inspect tokens¶
tok.tokenize("Pinakamahusay ang ginawa niya.")
# ['pinaka', '▁', 'ma', '▁', 'husay', ' ', 'ang', ' ', ...]
Save and reload¶
tok.save("my_tokenizer/")
tok2 = TagalogTokenizer()
tok2.load("my_tokenizer/")
assert tok.encode("test") == tok2.encode("test")