TagalogTokenizer

class filipino_tokenizer.tagalog.tokenizer.TagalogTokenizer

Bases: object

End-to-end tokenizer for Tagalog text.

Usage:

tok = TagalogTokenizer()
tok.train("corpus.txt", vocab_size=32000)
ids = tok.encode("Kumain siya ng pagkain.")
text = tok.decode(ids)
assert text == "kumain siya ng pagkain."  # encode() lowercases its input

train(corpus_path, vocab_size=32000)

Train the tokenizer from a plain-text corpus file.

Steps:
  1. Read the corpus file line-by-line.

  2. Pre-tokenize each line into words / punctuation.

  3. Segment each word morphologically.

  4. Insert boundary markers into the surface text at morpheme boundaries (preserving original spelling).

  5. Train BPE with the CBPE constraint.
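Steps 2 to 4 above can be pictured with a small, self-contained sketch. It does not use the library: the prefix list, the `toy_segment` helper, and the choice of `▁` as the boundary marker are illustrative assumptions, and the real segmenter is certainly more sophisticated than this prefix peeling.

```python
import re

BOUNDARY = "▁"  # assumed marker symbol, chosen for illustration only
PREFIXES = ("pinaka", "nag", "mag", "ma")  # hypothetical toy prefix list


def pre_tokenize(line):
    """Step 2: split a line into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", line)


def toy_segment(word):
    """Step 3 (toy version): peel known prefixes off the front of a word."""
    morphemes = []
    rest = word
    stripped = True
    while stripped:
        stripped = False
        for prefix in PREFIXES:
            # Require a reasonably long remainder so short words stay whole.
            if rest.lower().startswith(prefix) and len(rest) > len(prefix) + 2:
                morphemes.append(rest[:len(prefix)])  # keep original casing
                rest = rest[len(prefix):]
                stripped = True
                break
    morphemes.append(rest)
    return morphemes


def mark_boundaries(line):
    """Step 4: join morphemes with the marker, preserving original spelling."""
    return [BOUNDARY.join(toy_segment(tok)) for tok in pre_tokenize(line)]


marked = mark_boundaries("Pinakamahusay ang ginawa niya.")
print(marked)
```

Removing the marker from each item recovers the original surface tokens, which is what "preserving original spelling" amounts to in this sketch.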

Parameters:
  • corpus_path (str) – Path to a UTF-8 plain-text file (one sentence per line).

  • vocab_size (int) – Target BPE vocabulary size.

Return type:

None

encode(text)

Encode text into a list of integer token IDs.

The text is lowercased and split into words and punctuation. Each word is then morphologically segmented (with boundary markers inserted into the surface form), and BPE encoding is applied to the result.
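The final BPE step can be sketched as follows. This is a minimal illustration, not the library's implementation: the merge list and vocabulary are made up, `▁` is assumed to be the boundary marker, and the sketch assumes that learned merges never contain the marker, so applying them cannot join symbols across a morpheme boundary.

```python
BOUNDARY = "▁"  # assumed marker symbol


def apply_merges(marked_word, merges):
    """Apply ordered merge rules to one boundary-marked word."""
    symbols = list(marked_word)  # start from single characters
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Merge the pair wherever it occurs, scanning left to right.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merges; none of them involves the boundary marker.
toy_merges = [("m", "a"), ("h", "u"), ("hu", "s"), ("a", "y"), ("hus", "ay")]
toy_vocab = {"ma": 0, BOUNDARY: 1, "husay": 2}

tokens = apply_merges("ma" + BOUNDARY + "husay", toy_merges)
ids = [toy_vocab[t] for t in tokens]
print(tokens, ids)
```

Because the marker never takes part in a merge, it survives as its own token between the two morphemes.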

Parameters:

text (str)

Return type:

list[int]

tokenize(text)

Tokenize text into subword strings (for debugging / inspection).

Returns the string representation of each BPE token rather than integer IDs.

Parameters:

text (str)

Return type:

list[str]

decode(ids)

Decode a list of token IDs back to a readable string.

Boundary markers and special tokens are removed. Spaces between words are reconstructed by detecting word-boundary tokens.
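A rough sketch of this clean-up, under stated assumptions: `▁` is taken to be the morpheme boundary marker, a literal space token is taken to be the word-boundary token, and the special-token names are hypothetical. The actual decode() may detect word boundaries differently.

```python
BOUNDARY = "▁"  # assumed morpheme boundary marker
SPECIAL_TOKENS = {"<s>", "</s>", "<pad>", "<unk>"}  # hypothetical names


def toy_decode(ids, id_to_token):
    """Drop markers and special tokens, then concatenate what remains."""
    pieces = []
    for token_id in ids:
        token = id_to_token[token_id]
        if token == BOUNDARY or token in SPECIAL_TOKENS:
            continue
        pieces.append(token)  # a " " token restores the space between words
    return "".join(pieces)


toy_id_to_token = {
    0: "<s>", 1: "pinaka", 2: BOUNDARY, 3: "ma",
    4: "husay", 5: " ", 6: "ang", 7: "</s>",
}
decoded = toy_decode([0, 1, 2, 3, 2, 4, 5, 6, 7], toy_id_to_token)
print(decoded)
```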

Parameters:

ids (list[int])

Return type:

str

save(directory)

Save the trained tokenizer to directory.

Creates:
  • vocab.json — BPE vocabulary mapping

  • merges.txt — ordered merge rules

Parameters:

directory (str)

Return type:

None

load(directory)

Load a previously saved tokenizer from directory.
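Since save() writes vocab.json and merges.txt, it can be handy to confirm that both files are present before calling load(). The helper below is a hypothetical convenience, not part of the package, and it makes no assumption about the contents of either file.

```python
from pathlib import Path

# File names taken from the save() documentation.
EXPECTED_FILES = ("vocab.json", "merges.txt")


def missing_tokenizer_files(directory):
    """Return the expected tokenizer files that are absent from directory."""
    base = Path(directory)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]
```

An empty list means the directory at least looks like a saved tokenizer, for example `missing_tokenizer_files("my_tokenizer/")`.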

Parameters:

directory (str)

Return type:

None

prewarm(lines)

Pre-segment all unique words across lines to warm the segment cache.

TagalogTokenizer caches morphological segmentation per word in _segment_cache. A large corpus may contain millions of lines but typically only tens of thousands of unique words. Calling prewarm() before encode() / tokenize() ensures each word is segmented exactly once, which speeds up tokenization by roughly 10x on real corpora.
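The effect of the cache can be illustrated with a small stand-alone sketch. The `CachedSegmenter` class and its `misses` counter are hypothetical and exist only to show the idea; the segmentation function passed in is a placeholder, not the library's segmenter.

```python
import re


class CachedSegmenter:
    """Toy wrapper that memoizes a per-word segmentation function."""

    def __init__(self, segment_fn):
        self._segment_fn = segment_fn
        self._segment_cache = {}
        self.misses = 0  # number of times the real function had to run

    def segment(self, word):
        if word not in self._segment_cache:
            self.misses += 1
            self._segment_cache[word] = self._segment_fn(word)
        return self._segment_cache[word]

    def prewarm(self, lines):
        # Collect the unique lowercased words, then segment each one once.
        unique_words = {w for line in lines for w in re.findall(r"\w+", line.lower())}
        for word in unique_words:
            self.segment(word)


lines = ["Kumain siya ng pagkain.", "Kumain siya ng isda."] * 1000
seg = CachedSegmenter(lambda word: [word])  # placeholder segmentation
seg.prewarm(lines)
print(len(lines), seg.misses)
```

Here 2,000 lines contain only five distinct words, so the underlying function runs five times; every later lookup is served from the cache.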

Parameters:

lines (list[str]) – The same lines you intend to tokenize.

Return type:

None

load_pretrained()

Load the bundled pretrained 32k Tagalog tokenizer.

No path needed — the model is included in the package:

tok = TagalogTokenizer()
tok.load_pretrained()
ids = tok.encode("Kumain siya ng pagkain.")

Return type:

None


Method reference

Method           Signature                          Description
train            (corpus_path, vocab_size=32000)    Train BPE from a plain-text corpus file.
encode           (text) → list[int]                 Encode text to token IDs.
decode           (ids) → str                        Decode token IDs back to text.
tokenize         (text) → list[str]                 Return subword strings instead of IDs (for inspection).
load_pretrained  ()                                 Load the bundled 32k model shipped with the package. No path needed.
save             (directory)                        Write vocab.json and merges.txt to directory.
load             (directory)                        Load a previously saved tokenizer.


Attributes

Attribute        Description
tok.bpe          The underlying MorphAwareBPE instance. Access tok.bpe.vocab (dict), tok.bpe.merges (list of tuples), tok.bpe.id_to_token (dict).
tok.segmenter    The underlying TagalogSegmenter instance. Use tok.segmenter.segment(word) independently.


Examples

Load the bundled pretrained model

No download or path required — the 32k model is shipped with the package:

from filipino_tokenizer.tagalog import TagalogTokenizer

tok = TagalogTokenizer()
tok.load_pretrained()
ids = tok.encode("Kumain siya ng pagkain.")

Train on your own corpus

from filipino_tokenizer.tagalog import TagalogTokenizer

tok = TagalogTokenizer()
tok.train("corpus.txt", vocab_size=32000)

Encode / decode round-trip

ids = tok.encode("Kumain siya ng pagkain.")
# encode() lowercases its input, so the decoded text comes back in lowercase.
assert tok.decode(ids) == "kumain siya ng pagkain."

Inspect tokens

tok.tokenize("Pinakamahusay ang ginawa niya.")
# ['pinaka', '▁', 'ma', '▁', 'husay', ' ', 'ang', ' ', ...]

Save and reload

tok.save("my_tokenizer/")

tok2 = TagalogTokenizer()
tok2.load("my_tokenizer/")
assert tok.encode("test") == tok2.encode("test")