MorphAwareBPE

class filipino_tokenizer.tagalog.bpe.MorphAwareBPE[source]

Bases: object

Byte-Pair Encoding tokenizer with a morpheme-boundary constraint.

During train(), the algorithm counts bigram frequencies across the corpus but skips any pair that spans a BOUNDARY marker. This guarantees that learned merges never glue together parts of different morphemes.

Vocabulary layout:

id 0 → <pad> id 1 → <unk> id 2 → <s> id 3 → </s> id 4 … 259 → individual bytes (all 256 byte values) id 260 … → learned merge tokens

PAD = '<pad>'
UNK = '<unk>'
BOS = '<s>'
EOS = '</s>'
SPECIALS = ['<pad>', '<unk>', '<s>', '</s>']
train(corpus, vocab_size=32000)[source]

Train the BPE vocabulary from corpus (list of pre-segmented strings with BOUNDARY markers between morphemes).

Uses an optimized incremental algorithm: 1. Doubly-linked list for each word. 2. Max-heap for finding the most frequent pair. 3. Lazy deletion for stale heap entries.

Parameters:
Return type:

None

encode(text)[source]

Encode text into a list of token IDs.

Delegates to the Rust CoreBPE backend for greedy BPE encoding. Results are cached per unique surface token.

Parameters:

text (str)

Return type:

list[int]

decode(ids)[source]

Decode a list of token IDs back to a string.

Special tokens are silently dropped; boundary markers are removed. Delegates to the Rust CoreBPE backend.

Parameters:

ids (list[int])

Return type:

str

save(directory)[source]

Persist the tokenizer to directory as two files:

  • vocab.json — maps token string → integer id

  • merges.txt — one merge per line, token_a<TAB>token_b

Parameters:

directory (str)

Return type:

None

load(directory)[source]

Load a previously saved tokenizer from directory.

Parameters:

directory (str)

Return type:

None


Method reference

Method

Signature

Description

train

(corpus, vocab_size=32000)

Train BPE from a list of pre-annotated strings (with markers).

encode

(text) list[int]

Encode a boundary-annotated string to token IDs.

decode

(ids) str

Decode token IDs back to a string (boundary markers removed).

save

(directory)

Write vocab.json and merges.txt.

load

(directory)

Load a previously saved BPE model.


Vocabulary layout

Token

ID

Notes

<pad>

0

Always present

<unk>

1

Unknown character fallback

<s>

2

Beginning of sequence

</s>

3

End of sequence

chars

4+

All printable ASCII (32–126) + + corpus characters, sorted, allocated in order

merges

Learned BPE merge tokens, in training order


The CBPE constraint

During train(), the algorithm counts bigram frequencies across the corpus but skips any pair that contains a boundary marker. Concretely, in _init_pair_counts():

if BOUNDARY not in pair[0] and BOUNDARY not in pair[1]:
    pair_counts[pair] += freq

This guarantees that no learned merge rule ever combines tokens from different morphemes.


Saving and loading

save(directory) writes two files:

  • vocab.json — JSON object mapping token string → integer ID.

  • merges.txt — one merge per line, token_a<TAB>token_b.

Both files are UTF-8 and human-readable.

from filipino_tokenizer.tagalog.bpe import MorphAwareBPE, BOUNDARY

bpe = MorphAwareBPE()
bpe.train([f"pag{BOUNDARY}kain", f"ma{BOUNDARY}ganda"] * 10, vocab_size=100)
bpe.save("bpe_model/")

bpe2 = MorphAwareBPE()
bpe2.load("bpe_model/")
assert bpe.encode(f"pag{BOUNDARY}kain") == bpe2.encode(f"pag{BOUNDARY}kain")

Constants

filipino_tokenizer.tagalog.bpe.BOUNDARY = "▁"

str(object=’’) -> str str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or errors is specified, then the object must expose a data buffer that will be decoded using the given encoding and error handler. Otherwise, returns the result of object.__str__() (if defined) or repr(object). encoding defaults to sys.getdefaultencoding(). errors defaults to ‘strict’.

The boundary marker (U+2581 LOWER ONE EIGHTH BLOCK) inserted between morphemes in surface-annotated text. Identical to the SentencePiece word-boundary character.