TagalogHFTokenizer

class filipino_tokenizer.tagalog.hf_tokenizer.TagalogHFTokenizer(vocab_file=None, merges_file=None, bos_token='<s>', eos_token='</s>', unk_token='<unk>', pad_token='<pad>', **kwargs)

Bases: PreTrainedTokenizer

HuggingFace-compatible tokenizer for Tagalog.

Wraps TagalogTokenizer (morphological segmentation + constrained BPE) behind the PreTrainedTokenizer interface so it works with the full HuggingFace ecosystem.

Special tokens (already part of the BPE vocab):

<pad> id=0
<unk> id=1
<s> id=2
</s> id=3

Parameters:

vocab_file (str | None)
merges_file (str | None)
bos_token (str)
eos_token (str)
unk_token (str)
pad_token (str)

vocab_files_names = {'merges_file': 'merges.txt', 'vocab_file': 'vocab.json'}

model_input_names = ['input_ids', 'attention_mask']

property vocab_size: int

Total vocabulary size including special tokens.

get_vocab()

Return the vocabulary as a token-to-ID mapping.

Return type: dict[str, int]

convert_tokens_to_string(tokens)

Decode a list of token strings back to readable text.

Parameters: tokens (list[str])
Return type: str

save_vocabulary(save_directory, filename_prefix=None)

Save vocab.json and merges.txt to save_directory.

Parameters:

save_directory (str)
filename_prefix (str | None)

Return type: tuple[str, str]
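
As a rough illustration of what `filename_prefix` usually means, the sketch below follows the naming convention of slow tokenizers in transformers such as `GPT2Tokenizer`, where the prefix is joined to each file name with a hyphen. Whether TagalogHFTokenizer follows exactly this convention is an assumption, as is the one-pair-per-line merges format. `save_vocab_sketch` is a hypothetical helper, not part of the package.

```python
import json
import os
import tempfile


def save_vocab_sketch(vocab, merges, save_directory, filename_prefix=None):
    """Hypothetical stand-in for save_vocabulary; not the package's code."""
    # Assumed convention: "<prefix>-vocab.json" when a prefix is given.
    prefix = f"{filename_prefix}-" if filename_prefix else ""
    vocab_path = os.path.join(save_directory, prefix + "vocab.json")
    merges_path = os.path.join(save_directory, prefix + "merges.txt")

    with open(vocab_path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False)
    with open(merges_path, "w", encoding="utf-8") as f:
        # Assumed format: one space-separated merge pair per line.
        f.write("\n".join(" ".join(pair) for pair in merges))

    # Same shape as the documented return type: tuple[str, str].
    return vocab_path, merges_path


with tempfile.TemporaryDirectory() as tmp:
    toy_vocab = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3, "ka": 4}
    paths = save_vocab_sketch(toy_vocab, [("k", "a")], tmp, filename_prefix="tl")
    saved_names = [os.path.basename(p) for p in paths]
    with open(paths[0], encoding="utf-8") as f:
        reloaded_vocab = json.load(f)
```

In normal use you would call `save_pretrained` rather than `save_vocabulary` directly; the base class invokes `save_vocabulary` and writes `tokenizer_config.json` alongside the two files.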

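Overview

TagalogHFTokenizer wraps TagalogTokenizer behind the HuggingFace PreTrainedTokenizer interface. It requires transformers>=4.30; install the package with the hf extra (quoted so that shells such as zsh do not expand the brackets):

pip install "filipino-tokenizer[hf]"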
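
To confirm that an environment meets the stated minimum, a small check along these lines may help. It is only a sketch: `meets_minimum` is a hypothetical helper that compares the major and minor components of the version string and assumes both are plain integers.

```python
from importlib.metadata import PackageNotFoundError, version


def meets_minimum(installed, minimum=(4, 30)):
    """Compare the major.minor part of a version string against a minimum."""
    major, minor = (int(part) for part in installed.split(".")[:2])
    return (major, minor) >= minimum


try:
    transformers_ok = meets_minimum(version("transformers"))
except PackageNotFoundError:
    transformers_ok = False  # transformers is not installed at all
```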

Method reference

| Method | Signature | Description |
| --- | --- | --- |
| `__call__` | `(text, padding, truncation, return_tensors, ...)` | Standard HF tokenizer call; returns `input_ids` and `attention_mask`. |
| `encode` | `(text) → list[int]` | Encode a single string to token IDs. |
| `decode` | `(ids) → str` | Decode token IDs back to text. |
| `save_pretrained` | `(directory)` | Save in HF format (adds `tokenizer_config.json`). |
| `from_pretrained` | `(directory_or_repo)` | Load from a local directory or HuggingFace Hub repo. |

Attributes

| Attribute | Description |
| --- | --- |
| `vocab_size` | Total vocabulary size (32,000 for the pretrained model). |
| `bos_token` / `bos_token_id` | `"<s>"` / `2` |
| `eos_token` / `eos_token_id` | `"</s>"` / `3` |
| `pad_token` / `pad_token_id` | `"<pad>"` / `0` |
| `unk_token` / `unk_token_id` | `"<unk>"` / `1` |

Compatibility note

TagalogHFTokenizer now validates special-token mappings on load. If a custom or older tokenizer directory is missing one or more expected special token strings, it falls back to safe in-vocabulary IDs so batched padding and truncation remain stable in HuggingFace training/evaluation pipelines.
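
The exact fallback rule is not spelled out here, so the following is only a sketch of one way such validation could work, not the library's implementation. It assumes a missing special token is mapped to the ID of `<unk>` when that exists and otherwise to the smallest ID in the vocabulary. `resolve_special_ids` and both toy vocabularies are hypothetical.

```python
EXPECTED_SPECIAL_IDS = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3}


def resolve_special_ids(vocab, fallback_token="<unk>"):
    """Hypothetical helper: map each special token to an ID that exists in vocab."""
    # Assumed fallback: reuse the unknown token's ID, else the smallest ID present.
    fallback_id = vocab.get(fallback_token, min(vocab.values()))
    return {token: vocab.get(token, fallback_id) for token in EXPECTED_SPECIAL_IDS}


full_vocab = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3, "ka": 4}
legacy_vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "ka": 3}  # no "<pad>" entry

resolved_full = resolve_special_ids(full_vocab)
resolved_legacy = resolve_special_ids(legacy_vocab)
```

The point of any such scheme is that every resolved ID is a valid row in the model's embedding matrix, so padding a batch never produces an out-of-range index.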

Examples

Load the bundled pretrained model

No path is required because the 32k model ships with the package:

from filipino_tokenizer.tagalog import TagalogHFTokenizer

tok = TagalogHFTokenizer()   # loads bundled pretrained model
# return_tensors="pt" requires PyTorch to be installed
encoding = tok("Kumain siya ng pagkain.", return_tensors="pt")
# {"input_ids": tensor([[...]]), "attention_mask": tensor([[...]])}

Load from a custom trained model

from filipino_tokenizer.tagalog import TagalogHFTokenizer

# load from a directory saved in HF format:
tok = TagalogHFTokenizer.from_pretrained("my_tokenizer/")
# or pass files directly:
tok = TagalogHFTokenizer(
    vocab_file="my_tokenizer/vocab.json",
    merges_file="my_tokenizer/merges.txt",
)

Batch with padding

sentences = [
    "Kumain siya ng pagkain.",
    "Nagtatrabaho ang tatay sa opisina araw-araw.",
]
encoding = tok(sentences, padding=True, truncation=True,
               max_length=128, return_tensors="pt")
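
To make the effect of `padding=True` and `truncation=True` concrete, here is a plain-Python sketch of the resulting batch layout: sequences are cut to `max_length`, then right-padded to the longest one with the pad ID, and the attention mask marks real tokens with 1 and padding with 0. `pad_batch` is an illustrative helper rather than library code, and the token IDs are made up.

```python
def pad_batch(batch_ids, pad_token_id=0, max_length=None):
    """Illustrative re-implementation of right-padding; not the library's code."""
    if max_length is not None:
        # truncation=True: cut every sequence to at most max_length tokens.
        batch_ids = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_token_id] * (longest - len(ids)) for ids in batch_ids]
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs standing in for two encoded sentences of different length.
batch = pad_batch([[10, 11, 12, 13], [10, 11, 12]], pad_token_id=0, max_length=128)
truncated = pad_batch([[10, 11, 12, 13], [10, 11, 12]], pad_token_id=0, max_length=2)
```

With `return_tensors="pt"`, the real tokenizer returns the same rectangular layout as PyTorch tensors instead of nested lists.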

Save and reload

tok.save_pretrained("hf_tokenizer/")
tok2 = TagalogHFTokenizer.from_pretrained("hf_tokenizer/")

Use with a language model

from transformers import GPT2Config, GPT2LMHeadModel

model = GPT2LMHeadModel(GPT2Config(
    vocab_size=tok.vocab_size,
    pad_token_id=tok.pad_token_id,
    bos_token_id=tok.bos_token_id,
    eos_token_id=tok.eos_token_id,
))