TagalogTokenizer¶

class filipino_tokenizer.tagalog.tokenizer.TagalogTokenizer[source]¶

Bases: object

End-to-end tokenizer for Tagalog text.

Usage:

tok = TagalogTokenizer()
tok.train("corpus.txt", vocab_size=32000)
ids = tok.encode("Kumain siya ng pagkain.")
text = tok.decode(ids)
assert text == "kumain siya ng pagkain."

train(corpus_path, vocab_size=32000)[source]¶

Train the tokenizer from a plain-text corpus file.

Steps:

Read the corpus file line-by-line.
Pre-tokenize each line into words / punctuation.
Segment each word morphologically.
Insert boundary markers into the surface text at morpheme boundaries (preserving original spelling).
Train BPE with the CBPE constraint.

Parameters¶

corpus_pathstr: Path to a UTF-8 plain-text file (one sentence per line).
vocab_sizeint: Target BPE vocabulary size.

Parameters:

corpus_path (str)
vocab_size (int)

Return type:

None

encode(text)[source]¶

Encode text into a list of integer token IDs.

The text is lowercased, split into words/punctuation, each word is morphologically segmented (with boundary markers in the surface form), and BPE encoding is applied.

Parameters:: text (str)
Return type:: list[int]

tokenize(text)[source]¶

Tokenize text into subword strings (for debugging / inspection).

Returns the string representation of each BPE token rather than integer IDs.

Parameters:: text (str)
Return type:: list[str]

decode(ids)[source]¶

Decode a list of token IDs back to a readable string.

Boundary markers and special tokens are removed. Spaces between words are reconstructed by detecting word-boundary tokens.

Parameters:: ids (list[int])
Return type:: str

save(directory)[source]¶

Save the trained tokenizer to directory.

Creates:

vocab.json — BPE vocabulary mapping
merges.txt — ordered merge rules

Parameters:: directory (str)
Return type:: None

load(directory)[source]¶

Load a previously saved tokenizer from directory.

Parameters:: directory (str)
Return type:: None

prewarm(lines)[source]¶

Pre-segment all unique words across lines to warm the segment cache.

TagalogTokenizer caches morphological segmentation per word in _segment_cache. A large corpus has millions of lines but typically only tens of thousands of unique words. Calling this before encode() / tokenize() ensures each word is segmented exactly once, cutting tokenization time by ~10x on real corpora.

Parameters¶

lineslist[str]: The same lines you intend to tokenize.

Parameters:: lines (list[str])
Return type:: None

load_pretrained()[source]¶

Load the bundled pretrained 32k Tagalog tokenizer.

No path needed — the model is included in the package:

tok = TagalogTokenizer()
tok.load_pretrained()
ids = tok.encode("Kumain siya ng pagkain.")

Return type:: None

Method reference¶

Method	Signature	Description
`train`	`(corpus_path, vocab_size=32000)`	Train BPE from a plain-text corpus file.
`encode`	`(text) → list[int]`	Encode text to token IDs.
`decode`	`(ids) → str`	Decode token IDs back to text.
`tokenize`	`(text) → list[str]`	Return subword strings instead of IDs (for inspection).
`load_pretrained`	`()`	Load the bundled 32k model shipped with the package. No path needed.
`save`	`(directory)`	Write `vocab.json` and `merges.txt` to directory.
`load`	`(directory)`	Load a previously saved tokenizer.

Attributes¶

Attribute	Description
`tok.bpe`	The underlying `MorphAwareBPE` instance. Access `tok.bpe.vocab` (dict), `tok.bpe.merges` (list of tuples), `tok.bpe.id_to_token` (dict).
`tok.segmenter`	The underlying `TagalogSegmenter` instance. Use `tok.segmenter.segment(word)` independently.

Examples¶

Load the bundled pretrained model¶

No download or path required — the 32k model is shipped with the package:

from filipino_tokenizer.tagalog import TagalogTokenizer

tok = TagalogTokenizer()
tok.load_pretrained()
ids = tok.encode("Kumain siya ng pagkain.")

Train on your own corpus¶

from filipino_tokenizer.tagalog import TagalogTokenizer

tok = TagalogTokenizer()
tok.train("corpus.txt", vocab_size=32000)

Encode / decode round-trip¶

ids = tok.encode("Kumain siya ng pagkain.")
assert tok.decode(ids) == "kumain siya ng pagkain."

Inspect tokens¶

tok.tokenize("Pinakamahusay ang ginawa niya.")
# ['pinaka', '▁', 'ma', '▁', 'husay', ' ', 'ang', ' ', ...]

Save and reload¶

tok.save("my_tokenizer/")

tok2 = TagalogTokenizer()
tok2.load("my_tokenizer/")
assert tok.encode("test") == tok2.encode("test")