TagalogHFTokenizer¶
- class filipino_tokenizer.tagalog.hf_tokenizer.TagalogHFTokenizer(vocab_file=None, merges_file=None, bos_token='<s>', eos_token='</s>', unk_token='<unk>', pad_token='<pad>', **kwargs)¶
Bases: PreTrainedTokenizer
HuggingFace-compatible tokenizer for Tagalog.
Wraps TagalogTokenizer (morphological segmentation + constrained BPE) behind the PreTrainedTokenizer interface so it works with the full HuggingFace ecosystem.
- Special tokens (already part of the BPE vocab):
<pad> id=0, <unk> id=1, <s> id=2, </s> id=3
- Parameters:
- vocab_files_names = {'merges_file': 'merges.txt', 'vocab_file': 'vocab.json'}¶
- model_input_names = ['input_ids', 'attention_mask']¶
Overview¶
TagalogHFTokenizer wraps TagalogTokenizer
behind the HuggingFace PreTrainedTokenizer interface. It requires
transformers>=4.30:
pip install "filipino-tokenizer[hf]"
Method reference¶
| Method | Signature | Description |
|---|---|---|
| `__call__` | | Standard HF tokenizer call; returns `input_ids` and `attention_mask`. |
| `encode` | | Encode a single string to token IDs. |
| `decode` | | Decode token IDs back to text. |
| `save_pretrained` | | Save in HF format. |
| `from_pretrained` | | Load from a local directory or HuggingFace Hub repo. |
Attributes¶
| Attribute | Description |
|---|---|
| `vocab_size` | Total vocabulary size (32,000 for the pretrained model). |
Compatibility note¶
TagalogHFTokenizer now validates special-token mappings on load. If a
custom or older tokenizer directory is missing one or more expected special
token strings, it falls back to safe in-vocabulary IDs so batched padding and
truncation remain stable in HuggingFace training/evaluation pipelines.
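The fallback described above can be pictured with a small, self-contained sketch. It is not the library's actual implementation: the helper name `resolve_special_token_ids`, the fallback policy (reuse the lowest ID present in the vocabulary), and the toy vocabularies are all assumptions made for illustration.

```python
from typing import Dict

# Special-token strings taken from the class signature above.
EXPECTED_SPECIAL_TOKENS = {
    "pad_token": "<pad>",
    "unk_token": "<unk>",
    "bos_token": "<s>",
    "eos_token": "</s>",
}


def resolve_special_token_ids(vocab: Dict[str, int]) -> Dict[str, int]:
    """Map each special-token role to an ID that exists in `vocab`.

    Hypothetical helper: falling back to the lowest ID in the vocabulary
    is an assumed policy, not the library's documented behaviour.
    """
    if not vocab:
        raise ValueError("vocabulary is empty")
    fallback_id = min(vocab.values())
    return {
        role: vocab.get(token, fallback_id)
        for role, token in EXPECTED_SPECIAL_TOKENS.items()
    }


# A complete vocabulary resolves to the IDs listed in the class description.
full_vocab = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3, "ka": 4, "in": 5}
full_ids = resolve_special_token_ids(full_vocab)

# An older vocabulary without "<pad>" still yields an in-vocabulary pad ID.
legacy_vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "ka": 3, "in": 4}
legacy_ids = resolve_special_token_ids(legacy_vocab)

print(full_ids)
print(legacy_ids)
```

The point of such a check is that every special-token ID handed to a padding or truncation routine is guaranteed to be a valid row in the model's embedding table, even when the token string itself is absent.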
Examples¶
Load the bundled pretrained model¶
No path is required, since the 32k model ships with the package:
from filipino_tokenizer.tagalog import TagalogHFTokenizer
tok = TagalogHFTokenizer() # loads bundled pretrained model
encoding = tok("Kumain siya ng pagkain.", return_tensors="pt")
# {"input_ids": tensor([[...]]), "attention_mask": tensor([[...]])}
Load from a custom trained model¶
tok = TagalogHFTokenizer.from_pretrained("my_tokenizer/")
# or pass files directly:
tok = TagalogHFTokenizer(
    vocab_file="my_tokenizer/vocab.json",
    merges_file="my_tokenizer/merges.txt",
)
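Before pointing the tokenizer at a custom directory, it can help to confirm that the directory contains the two files named in `vocab_files_names`. The sketch below uses only the standard library; `missing_tokenizer_files` is a hypothetical helper, not part of `filipino_tokenizer`, and the temporary directory stands in for a real `my_tokenizer/` folder.

```python
import tempfile
from pathlib import Path
from typing import List

# File names copied from `vocab_files_names` in the class reference.
VOCAB_FILES_NAMES = {"vocab_file": "vocab.json", "merges_file": "merges.txt"}


def missing_tokenizer_files(directory: str) -> List[str]:
    """Return the expected file names that are absent from `directory`."""
    root = Path(directory)
    return sorted(
        name for name in VOCAB_FILES_NAMES.values()
        if not (root / name).is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    # Only vocab.json exists at first, so merges.txt is reported as missing.
    (Path(tmp) / "vocab.json").write_text("{}", encoding="utf-8")
    before = missing_tokenizer_files(tmp)

    # Once merges.txt is present, nothing is missing.
    (Path(tmp) / "merges.txt").write_text("", encoding="utf-8")
    after = missing_tokenizer_files(tmp)

print(before)  # ['merges.txt']
print(after)   # []
```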
Batch with padding¶
sentences = [
    "Kumain siya ng pagkain.",
    "Nagtatrabaho ang tatay sa opisina araw-araw.",
]
encoding = tok(sentences, padding=True, truncation=True,
               max_length=128, return_tensors="pt")
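To make the effect of `padding=True` and `truncation=True` concrete, here is a minimal pure-Python sketch of the same idea: truncate each sequence to `max_length`, pad on the right to the longest sequence in the batch, and mark padded positions with 0 in the attention mask. The helper `pad_batch` and the token IDs are made up for illustration. The pad ID of 0 comes from the special-token list in the class description, and right-side padding is an assumption (it is the usual HuggingFace default).

```python
from typing import Dict, List


def pad_batch(
    sequences: List[List[int]], max_length: int, pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Truncate to `max_length`, then right-pad to the longest sequence."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max((len(seq) for seq in truncated), default=0)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    attention_mask = [
        [1] * len(seq) + [0] * (width - len(seq)) for seq in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs standing in for two encoded sentences.
batch = pad_batch([[17, 42, 9], [8, 15, 23, 4, 16]], max_length=4)
print(batch["input_ids"])       # [[17, 42, 9, 0], [8, 15, 23, 4]]
print(batch["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

The second sequence is cut from five tokens to four, and the first is padded with one `<pad>` ID so that both rows have the same length and can be stacked into a tensor.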
Save and reload¶
tok.save_pretrained("hf_tokenizer/")
tok2 = TagalogHFTokenizer.from_pretrained("hf_tokenizer/")
Use with a language model¶
from transformers import GPT2Config, GPT2LMHeadModel

# Size the embedding table and special-token IDs from the tokenizer.
model = GPT2LMHeadModel(GPT2Config(
    vocab_size=tok.vocab_size,
    pad_token_id=tok.pad_token_id,
    bos_token_id=tok.bos_token_id,
    eos_token_id=tok.eos_token_id,
))