MorphAwareBPE¶

class filipino_tokenizer.tagalog.bpe.MorphAwareBPE[source]¶

Bases: object

Byte-Pair Encoding tokenizer with a morpheme-boundary constraint.

During train(), the algorithm counts bigram frequencies across the corpus but skips any pair that spans a BOUNDARY marker. This guarantees that learned merges never glue together parts of different morphemes.

Vocabulary layout:: id 0 → <pad> id 1 → <unk> id 2 → <s> id 3 → </s> id 4 … 259 → individual bytes (all 256 byte values) id 260 … → learned merge tokens

PAD = '<pad>'¶

UNK = '<unk>'¶

BOS = '<s>'¶

EOS = '</s>'¶

SPECIALS = ['<pad>', '<unk>', '<s>', '</s>']¶

train(corpus, vocab_size=32000)[source]¶

Train the BPE vocabulary from corpus (list of pre-segmented strings with BOUNDARY markers between morphemes).

Uses an optimized incremental algorithm: 1. Doubly-linked list for each word. 2. Max-heap for finding the most frequent pair. 3. Lazy deletion for stale heap entries.

Parameters:

corpus (list[str])
vocab_size (int)

Return type:

None

encode(text)[source]¶

Encode text into a list of token IDs.

Delegates to the Rust CoreBPE backend for greedy BPE encoding. Results are cached per unique surface token.

Parameters:: text (str)
Return type:: list[int]

decode(ids)[source]¶

Decode a list of token IDs back to a string.

Special tokens are silently dropped; boundary markers are removed. Delegates to the Rust CoreBPE backend.

Parameters:: ids (list[int])
Return type:: str

save(directory)[source]¶

Persist the tokenizer to directory as two files:

vocab.json — maps token string → integer id
merges.txt — one merge per line, token_a<TAB>token_b

Parameters:: directory (str)
Return type:: None

load(directory)[source]¶

Load a previously saved tokenizer from directory.

Parameters:: directory (str)
Return type:: None

Method reference¶

Method	Signature	Description
`train`	`(corpus, vocab_size=32000)`	Train BPE from a list of pre-annotated strings (with `▁` markers).
`encode`	`(text) → list[int]`	Encode a boundary-annotated string to token IDs.
`decode`	`(ids) → str`	Decode token IDs back to a string (boundary markers removed).
`save`	`(directory)`	Write `vocab.json` and `merges.txt`.
`load`	`(directory)`	Load a previously saved BPE model.

Vocabulary layout¶

Token	ID	Notes
`<pad>`	0	Always present
`<unk>`	1	Unknown character fallback
`<s>`	2	Beginning of sequence
`</s>`	3	End of sequence
chars	4+	All printable ASCII (32–126) + `▁` + corpus characters, sorted, allocated in order
merges	…	Learned BPE merge tokens, in training order

The CBPE constraint¶

During train(), the algorithm counts bigram frequencies across the corpus but skips any pair that contains a ▁ boundary marker. Concretely, in _init_pair_counts():

if BOUNDARY not in pair[0] and BOUNDARY not in pair[1]:
    pair_counts[pair] += freq

This guarantees that no learned merge rule ever combines tokens from different morphemes.

Saving and loading¶

save(directory) writes two files:

vocab.json — JSON object mapping token string → integer ID.
merges.txt — one merge per line, token_a<TAB>token_b.

Both files are UTF-8 and human-readable.

from filipino_tokenizer.tagalog.bpe import MorphAwareBPE, BOUNDARY

bpe = MorphAwareBPE()
bpe.train([f"pag{BOUNDARY}kain", f"ma{BOUNDARY}ganda"] * 10, vocab_size=100)
bpe.save("bpe_model/")

bpe2 = MorphAwareBPE()
bpe2.load("bpe_model/")
assert bpe.encode(f"pag{BOUNDARY}kain") == bpe2.encode(f"pag{BOUNDARY}kain")

Constants¶

filipino_tokenizer.tagalog.bpe.BOUNDARY = "▁"¶

str(object=’’) -> str str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to ‘strict’.

The boundary marker (U+2581 LOWER ONE EIGHTH BLOCK) inserted between morphemes in surface-annotated text. Identical to the SentencePiece word-boundary character.