TagalogHFTokenizer

class filipino_tokenizer.tagalog.hf_tokenizer.TagalogHFTokenizer(vocab_file=None, merges_file=None, bos_token='<s>', eos_token='</s>', unk_token='<unk>', pad_token='<pad>', **kwargs)[source]

Bases: PreTrainedTokenizer

HuggingFace-compatible tokenizer for Tagalog.

Wraps TagalogTokenizer (morphological segmentation + constrained BPE) behind the PreTrainedTokenizer interface so it works with the full HuggingFace ecosystem.

Special tokens (already part of the BPE vocab):

  • <pad> (id=0)

  • <unk> (id=1)

  • <s> (id=2)

  • </s> (id=3)

Parameters:
  • vocab_file (str | None)

  • merges_file (str | None)

  • bos_token (str)

  • eos_token (str)

  • unk_token (str)

  • pad_token (str)

vocab_files_names = {'merges_file': 'merges.txt', 'vocab_file': 'vocab.json'}
model_input_names = ['input_ids', 'attention_mask']
property vocab_size: int

Total vocabulary size including special tokens.

get_vocab()[source]

Return the vocabulary as a mapping from token string to token ID.

Return type:

dict[str, int]
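
The mapping returned by get_vocab() goes from token string to ID, so looking up a token by ID needs the inverse. A minimal sketch of building it follows. The dictionary is a toy stand-in for a real get_vocab() result: the four special tokens match the IDs listed above, and the remaining entries are invented for illustration.

```python
# Toy stand-in for tok.get_vocab(); entries after the special tokens are made up.
vocab = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3, "kain": 4, "um": 5}

# Invert token -> ID into ID -> token. This assumes IDs are unique,
# which holds for any well-formed vocabulary.
id_to_token = {idx: token for token, idx in vocab.items()}

# The first four IDs should be the special tokens in their documented order.
special = [id_to_token[i] for i in range(4)]
print(special)
```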

convert_tokens_to_string(tokens)[source]

Decode a list of token strings back to readable text.

Parameters:

tokens (list[str])

Return type:

str

save_vocabulary(save_directory, filename_prefix=None)[source]

Save vocab.json and merges.txt to save_directory.

Parameters:
  • save_directory (str)

  • filename_prefix (str | None)

Return type:

tuple[str, str]
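
The returned tuple holds the paths of the two files written. As a rough sketch of how filename_prefix is conventionally applied by slow tokenizers in transformers (a "<prefix>-" prepended to each file name), the hypothetical helper below writes both files using only the standard library. The helper name, the toy vocabulary, and the one-pair-per-line merges layout are assumptions for illustration, not this class's actual implementation.

```python
import json
import os
import tempfile


def save_vocab_files(vocab, merges, save_directory, filename_prefix=None):
    """Hypothetical stand-in for save_vocabulary; not the library's code."""
    os.makedirs(save_directory, exist_ok=True)
    # Assumed convention: "<prefix>-vocab.json" when a prefix is given.
    prefix = f"{filename_prefix}-" if filename_prefix else ""
    vocab_path = os.path.join(save_directory, prefix + "vocab.json")
    merges_path = os.path.join(save_directory, prefix + "merges.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False)
    with open(merges_path, "w", encoding="utf-8") as f:
        # Assumed layout: one space-separated merge pair per line.
        f.write("\n".join(" ".join(pair) for pair in merges))
    return vocab_path, merges_path


toy_vocab = {"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3, "ka": 4, "in": 5}
toy_merges = [("k", "a"), ("i", "n")]
out_dir = tempfile.mkdtemp()
paths = save_vocab_files(toy_vocab, toy_merges, out_dir, filename_prefix="tl")
print([os.path.basename(p) for p in paths])
```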


Overview

TagalogHFTokenizer wraps TagalogTokenizer behind the HuggingFace PreTrainedTokenizer interface. It requires transformers>=4.30; install the package with the hf extra:

pip install filipino-tokenizer[hf]

Method reference

Method            Signature                                          Description
__call__          (text, padding, truncation, return_tensors, ...)   Standard HF tokenizer call; returns input_ids and attention_mask.
encode            (text) → list[int]                                 Encode a single string to token IDs.
decode            (ids) → str                                        Decode token IDs back to text.
save_pretrained   (directory)                                        Save in HF format (adds tokenizer_config.json).
from_pretrained   (directory_or_repo)                                Load from a local directory or HuggingFace Hub repo.


Attributes

Attribute                  Description
vocab_size                 Total vocabulary size (32,000 for the pretrained model).
bos_token / bos_token_id   "<s>" / 2
eos_token / eos_token_id   "</s>" / 3
pad_token / pad_token_id   "<pad>" / 0
unk_token / unk_token_id   "<unk>" / 1

Compatibility note

TagalogHFTokenizer now validates special-token mappings on load. If a custom or older tokenizer directory is missing one or more expected special token strings, it falls back to safe in-vocabulary IDs so batched padding and truncation remain stable in HuggingFace training/evaluation pipelines.
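
The exact fallback policy is not spelled out here, so the following is only a sketch of one plausible way to resolve special-token IDs against a vocabulary that lacks some of them. The helper name resolve_special_ids and the choice of fallback (the <unk> ID if present, otherwise the smallest ID in the vocabulary) are assumptions for illustration, not the library's confirmed behavior.

```python
EXPECTED_SPECIAL_TOKENS = ["<pad>", "<unk>", "<s>", "</s>"]


def resolve_special_ids(vocab, fallback_token="<unk>"):
    """Hypothetical helper: give every special token an ID that exists in vocab."""
    if not vocab:
        raise ValueError("vocabulary is empty; no in-vocabulary fallback exists")
    # Assumed policy: fall back to the <unk> ID, else the smallest known ID.
    fallback_id = vocab.get(fallback_token, min(vocab.values()))
    return {tok: vocab.get(tok, fallback_id) for tok in EXPECTED_SPECIAL_TOKENS}


# An "older" toy vocabulary with no <pad> entry.
older_vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "ka": 3}
resolved = resolve_special_ids(older_vocab)
print(resolved)
```

Whatever policy is used, the property that matters for batching is that every resolved ID indexes a real row of the embedding matrix, so padding never produces an out-of-range ID.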


Examples

Load the bundled pretrained model

No path is required; the 32k model ships with the package:

from filipino_tokenizer.tagalog import TagalogHFTokenizer

tok = TagalogHFTokenizer()   # loads bundled pretrained model
encoding = tok("Kumain siya ng pagkain.", return_tensors="pt")
# {"input_ids": tensor([[...]]), "attention_mask": tensor([[...]])}

Load from a custom trained model

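from filipino_tokenizer.tagalog import TagalogHFTokenizer

# Load from a directory containing vocab.json and merges.txt
tok = TagalogHFTokenizer.from_pretrained("my_tokenizer/")
# or pass files directly:
tok = TagalogHFTokenizer(
    vocab_file="my_tokenizer/vocab.json",
    merges_file="my_tokenizer/merges.txt",
)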

Batch with padding

sentences = [
    "Kumain siya ng pagkain.",
    "Nagtatrabaho ang tatay sa opisina araw-araw.",
]
encoding = tok(sentences, padding=True, truncation=True,
               max_length=128, return_tensors="pt")
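
To make the effect of padding=True and truncation=True concrete, here is a small standard-library sketch of the default HuggingFace behavior: sequences are truncated on the right to max_length, padded on the right to the longest sequence in the batch, and the attention mask marks real tokens with 1 and padding with 0. The function pad_batch and the token IDs are hypothetical; only the pad ID of 0 comes from the special-token table above.

```python
def pad_batch(batch_ids, pad_id=0, max_length=None):
    """Hypothetical re-implementation of right-side truncation and padding."""
    if max_length is not None:
        # Truncate from the right, as HF tokenizers do by default.
        batch_ids = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs standing in for two encoded sentences of different length.
batch = pad_batch([[2, 17, 42, 3], [2, 17, 42, 99, 7, 3]], max_length=128)
print(batch["input_ids"])       # [[2, 17, 42, 3, 0, 0], [2, 17, 42, 99, 7, 3]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]
```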

Save and reload

tok.save_pretrained("hf_tokenizer/")
tok2 = TagalogHFTokenizer.from_pretrained("hf_tokenizer/")

Use with a language model

from transformers import GPT2Config, GPT2LMHeadModel

model = GPT2LMHeadModel(GPT2Config(
    vocab_size=tok.vocab_size,
    pad_token_id=tok.pad_token_id,
    bos_token_id=tok.bos_token_id,
    eos_token_id=tok.eos_token_id,
))