MorphAwareBPE¶
- class filipino_tokenizer.tagalog.bpe.MorphAwareBPE[source]¶
Bases:
objectByte-Pair Encoding tokenizer with a morpheme-boundary constraint.
During
train(), the algorithm counts bigram frequencies across the corpus but skips any pair that spans aBOUNDARYmarker. This guarantees that learned merges never glue together parts of different morphemes.- Vocabulary layout:
id 0 → <pad> id 1 → <unk> id 2 → <s> id 3 → </s> id 4 … 259 → individual bytes (all 256 byte values) id 260 … → learned merge tokens
- PAD = '<pad>'¶
- UNK = '<unk>'¶
- BOS = '<s>'¶
- EOS = '</s>'¶
- SPECIALS = ['<pad>', '<unk>', '<s>', '</s>']¶
- train(corpus, vocab_size=32000)[source]¶
Train the BPE vocabulary from corpus (list of pre-segmented strings with
BOUNDARYmarkers between morphemes).Uses an optimized incremental algorithm: 1. Doubly-linked list for each word. 2. Max-heap for finding the most frequent pair. 3. Lazy deletion for stale heap entries.
- encode(text)[source]¶
Encode text into a list of token IDs.
Delegates to the Rust CoreBPE backend for greedy BPE encoding. Results are cached per unique surface token.
- decode(ids)[source]¶
Decode a list of token IDs back to a string.
Special tokens are silently dropped; boundary markers are removed. Delegates to the Rust CoreBPE backend.
Method reference¶
Method |
Signature |
Description |
|---|---|---|
|
|
Train BPE from a list of pre-annotated strings (with |
|
|
Encode a boundary-annotated string to token IDs. |
|
|
Decode token IDs back to a string (boundary markers removed). |
|
|
Write |
|
|
Load a previously saved BPE model. |
Vocabulary layout¶
Token |
ID |
Notes |
|---|---|---|
|
0 |
Always present |
|
1 |
Unknown character fallback |
|
2 |
Beginning of sequence |
|
3 |
End of sequence |
chars |
4+ |
All printable ASCII (32–126) + |
merges |
… |
Learned BPE merge tokens, in training order |
The CBPE constraint¶
During train(), the algorithm counts bigram frequencies across the corpus but
skips any pair that contains a ▁ boundary marker. Concretely, in
_init_pair_counts():
if BOUNDARY not in pair[0] and BOUNDARY not in pair[1]:
pair_counts[pair] += freq
This guarantees that no learned merge rule ever combines tokens from different morphemes.
Saving and loading¶
save(directory) writes two files:
vocab.json— JSON object mapping token string → integer ID.merges.txt— one merge per line,token_a<TAB>token_b.
Both files are UTF-8 and human-readable.
from filipino_tokenizer.tagalog.bpe import MorphAwareBPE, BOUNDARY
bpe = MorphAwareBPE()
bpe.train([f"pag{BOUNDARY}kain", f"ma{BOUNDARY}ganda"] * 10, vocab_size=100)
bpe.save("bpe_model/")
bpe2 = MorphAwareBPE()
bpe2.load("bpe_model/")
assert bpe.encode(f"pag{BOUNDARY}kain") == bpe2.encode(f"pag{BOUNDARY}kain")
Constants¶
- filipino_tokenizer.tagalog.bpe.BOUNDARY = "▁"¶
str(object=’’) -> str str(bytes_or_buffer[, encoding[, errors]]) -> str
Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to ‘strict’.
The boundary marker (U+2581 LOWER ONE EIGHTH BLOCK) inserted between morphemes in surface-annotated text. Identical to the SentencePiece word-boundary character.