Developer Guide

This guide covers everything you need to integrate Filipino Tokenizer into an application or ML pipeline.

Corpus preparation

The tokenizer trains on a plain UTF-8 text file with one sentence per line:

Kumain siya ng pagkain sa hapagkainan.
Ang mga bata ay masayang naglalaro sa labas.
Maganda ang panahon ngayon.
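Raw text rarely arrives in exactly this shape. The sketch below is one way to normalise it before training, assuming the input is already split into sentences. The helper name write_clean_corpus is hypothetical and not part of the library.

```python
def write_clean_corpus(raw_lines, out_path):
    """Write non-empty, whitespace-normalised lines to out_path, one per line.

    Hypothetical helper: it does not split sentences, it only tidies lines
    that are already one sentence each.
    """
    kept = 0
    with open(out_path, "w", encoding="utf-8", newline="\n") as f:
        for line in raw_lines:
            # Collapse runs of whitespace (tabs, double spaces) and trim the ends.
            cleaned = " ".join(line.split())
            if cleaned:  # skip blank or whitespace-only lines
                f.write(cleaned + "\n")
                kept += 1
    return kept
```

The return value is the number of sentences written, which is useful when checking the corpus against the size guidelines.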

Size guidelines

Corpus size          Recommended use
< 10k sentences      Prototyping / demos only
10k – 100k           Small-scale experiments
100k – 1M            Production NLP tasks
> 1M                 Large language model pre-training

A good starting corpus for Tagalog is the WikiText-TL-39 dataset.
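To see where an existing corpus file falls in the size guidelines, a short check like the following can help. Both function names are hypothetical. The guidelines do not say which tier the exact boundaries (100k, 1M) belong to, so assigning them to the lower tier is an assumption.

```python
def count_sentences(path):
    """Count non-empty lines in a one-sentence-per-line UTF-8 corpus file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def size_tier(n_sentences):
    """Map a sentence count to the recommended use from the size guidelines.

    Assumption: boundary values (exactly 100k or 1M) go to the lower tier.
    """
    if n_sentences < 10_000:
        return "Prototyping / demos only"
    if n_sentences <= 100_000:
        return "Small-scale experiments"
    if n_sentences <= 1_000_000:
        return "Production NLP tasks"
    return "Large language model pre-training"
```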

Writing a corpus programmatically

import tempfile

sentences = [
    "Kumain siya ng pagkain.",
    "Maganda ang panahon ngayon.",
]

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as f:
    f.write("\n".join(sentences))
    corpus_path = f.name

Training

from filipino_tokenizer.tagalog import TagalogTokenizer

tok = TagalogTokenizer()
tok.train(corpus_path, vocab_size=32000)

print(f"Vocab size  : {len(tok.bpe.vocab)}")
print(f"Merge rules : {len(tok.bpe.merges)}")

Choosing vocab_size

vocab_size sets an upper bound on the BPE vocabulary. A larger vocabulary produces longer, more complete tokens (lower fertility, i.e. fewer tokens per word) but enlarges the model's embedding table.

vocab_size     Typical use case
500 – 2000     Small experiments, unit tests
8000           Lightweight production tokenizer
32000          Standard for transformer language models
64000+         Very large corpora / multilingual settings
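Fertility can be measured directly when comparing candidate vocabulary sizes. The sketch below assumes fertility means average tokens per whitespace-separated word and that whitespace tokens should not count. The function name is hypothetical, and tokenize is any callable that returns a list of token strings.

```python
def fertility(tokenize, sentences):
    """Average number of non-whitespace tokens per whitespace-separated word.

    Punctuation tokens are counted, so the value sits slightly above what
    the words alone would give.
    """
    n_tokens = 0
    n_words = 0
    for sentence in sentences:
        n_words += len(sentence.split())
        # Skip tokens that are only whitespace (e.g. the ' ' separators).
        n_tokens += sum(1 for tok in tokenize(sentence) if tok.strip())
    return n_tokens / n_words if n_words else 0.0
```

With a trained tokenizer this would be called as fertility(tok.tokenize, held_out_sentences). Lower values mean words are split into fewer pieces.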

Note

If the corpus is too small to produce enough unique merges to reach vocab_size, training stops early. Check len(tok.bpe.merges) after training to see how many merges were actually learned.
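The check described in the note can be wrapped in a small helper. This is a sketch with a hypothetical name. It relies only on the tok.bpe.vocab and tok.bpe.merges attributes used in the training example, and treats a vocabulary smaller than the requested size as a sign of early stopping, which is an assumption.

```python
def training_report(tok, requested_vocab_size):
    """Summarise what training actually produced versus what was requested."""
    actual_vocab = len(tok.bpe.vocab)
    return {
        "requested_vocab_size": requested_vocab_size,
        "actual_vocab_size": actual_vocab,
        "merges": len(tok.bpe.merges),
        # Assumption: a smaller vocabulary than requested means training
        # ran out of merge candidates.
        "stopped_early": actual_vocab < requested_vocab_size,
    }
```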


Encoding

Single sentence

ids = tok.encode("Kumain siya ng pagkain.")
# e.g. [79, 99, 115, 99, 133, 4, 154, 4, 100, 4, 125, 99, 145, 18]
# (actual IDs depend on the trained vocabulary)

  • Input is lowercased automatically.

  • Returns list[int].

  • Unknown characters fall back to the <unk> token (ID 1).

Inspecting tokens as strings

tokens = tok.tokenize("Kumain siya ng pagkain.")
# ['k', 'um', 'ain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.']

Batch encoding

There is no built-in batch method. Use a list comprehension:

sentences = ["Kumain siya.", "Maganda ang panahon."]
batch_ids = [tok.encode(s) for s in sentences]

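For large batches, concurrent.futures can spread the work across workers. Threads only help if encode spends its time outside the Python GIL; for CPU-bound pure-Python encoding, ProcessPoolExecutor is usually the better choice. The thread-based version: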

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as pool:
    batch_ids = list(pool.map(tok.encode, sentences))

Decoding

text = tok.decode(ids)
# 'kumain siya ng pagkain.'

  • Special tokens (<pad>, <unk>, <s>, </s>) are silently dropped.

  • Boundary markers are removed.

  • Output is always lowercase (encoding lowercases input).
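Given these rules, decode(encode(text)) should equal the lowercased input whenever the text contains no unknown characters. A sanity check along those lines might look like the sketch below. The function name is hypothetical, and tok is any object with encode and decode methods.

```python
def roundtrip_failures(tok, sentences):
    """Return (original, decoded) pairs where decoding does not give back
    the lowercased input.

    Sentences containing characters outside the vocabulary are expected to
    show up here, since <unk> is dropped on decode.
    """
    failures = []
    for sentence in sentences:
        decoded = tok.decode(tok.encode(sentence))
        if decoded != sentence.lower():
            failures.append((sentence, decoded))
    return failures
```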


Saving and loading

# Save
tok.save("my_tokenizer/")
# Creates my_tokenizer/vocab.json and my_tokenizer/merges.txt

# Load
from filipino_tokenizer.tagalog import TagalogTokenizer
tok2 = TagalogTokenizer()
tok2.load("my_tokenizer/")

The saved files are human-readable plain text — you can inspect or version-control them.
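Because both files are plain text, a quick summary can be produced with the standard library alone. The sketch below assumes vocab.json parses as a JSON object or array and that merges.txt holds one merge rule per non-empty line. Neither layout is confirmed here, and the function name is hypothetical.

```python
import json
import os


def summarize_saved(directory):
    """Return entry counts for vocab.json and merges.txt in a saved tokenizer."""
    with open(os.path.join(directory, "vocab.json"), encoding="utf-8") as f:
        vocab = json.load(f)
    with open(os.path.join(directory, "merges.txt"), encoding="utf-8") as f:
        # Assumption: each non-empty line is one merge rule.
        merge_lines = [line for line in f if line.strip()]
    return {"vocab_entries": len(vocab), "merge_rules": len(merge_lines)}
```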


Using the segmenter independently

The morphological segmenter can be used without the BPE layer:

from filipino_tokenizer.tagalog import TagalogSegmenter

seg = TagalogSegmenter()

seg.segment("kumain")          # ['um', 'kain']
seg.segment("pagkain")         # ['pag', 'kain']
seg.segment("pinakamahusay")   # ['pinaka', 'ma', 'husay']
seg.segment("pangalan")        # ['pangalan']  ← frozen form, not decomposed
seg.segment("computer")        # ['computer']  ← loan word, not decomposed

# Segment a full sentence (splits on whitespace/punctuation first)
seg.segment_text("Kumain siya ng pagkain.")
# ['um', 'kain', ' ', 'siya', ' ', 'ng', ' ', 'pag', 'kain', '.']

This is useful for:

  • Feature extraction for non-BPE models

  • Linguistic analysis and corpus statistics

  • Preprocessing for other NLP tools
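As an illustration of the corpus-statistics use case, morpheme frequencies can be tallied with collections.Counter. This is a sketch with a hypothetical function name. segment_text is any callable that returns a list of pieces, such as seg.segment_text.

```python
from collections import Counter


def morpheme_counts(segment_text, sentences):
    """Count how often each segment appears across the given sentences.

    Whitespace pieces are skipped; punctuation pieces are kept.
    """
    counts = Counter()
    for sentence in sentences:
        for piece in segment_text(sentence):
            if piece.strip():
                counts[piece] += 1
    return counts
```

Calling morpheme_counts(seg.segment_text, corpus_lines).most_common(20) would then list the most frequent affixes and roots.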


Integrating with PyTorch

Filipino Tokenizer produces plain Python lists, which convert directly to tensors:

import torch
from filipino_tokenizer.tagalog import TagalogTokenizer

tok = TagalogTokenizer()
tok.load("my_tokenizer/")

def collate(sentences, pad_id=0):
    encoded = [tok.encode(s) for s in sentences]
    max_len = max(len(e) for e in encoded)
    padded = [e + [pad_id] * (max_len - len(e)) for e in encoded]
    return torch.tensor(padded, dtype=torch.long)

batch = collate(["Kumain siya.", "Maganda ang panahon ngayon."])
# tensor of shape (2, max_seq_len)
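The collate function above returns only padded IDs. Many models also expect an attention mask that marks which positions are real tokens. A minimal pure-Python sketch of building both is shown below. The function name is hypothetical, and the resulting lists can be passed to torch.tensor in the same way as above.

```python
def pad_with_mask(encoded, pad_id=0):
    """Right-pad each ID list to the longest length and build a matching mask.

    Returns (input_ids, attention_mask), where the mask holds 1 for real
    tokens and 0 for padding.
    """
    max_len = max((len(ids) for ids in encoded), default=0)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in encoded]
    attention_mask = [
        [1] * len(ids) + [0] * (max_len - len(ids)) for ids in encoded
    ]
    return input_ids, attention_mask
```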

Integrating with HuggingFace datasets

from datasets import Dataset
from filipino_tokenizer.tagalog import TagalogTokenizer

tok = TagalogTokenizer()
tok.load("my_tokenizer/")

raw = Dataset.from_dict({"text": ["Kumain siya.", "Maganda ang panahon."]})

def tokenize_fn(batch):
    return {"input_ids": [tok.encode(t) for t in batch["text"]]}

tokenized = raw.map(tokenize_fn, batched=True)

Special token IDs

Token    ID    Meaning
<pad>    0     Padding (for fixed-length batches)
<unk>    1     Unknown character fallback
<s>      2     Beginning of sequence
</s>     3     End of sequence

These are always at IDs 0–3 regardless of corpus. Add them manually if your model expects them:

BOS, EOS = 2, 3
sentence = "Kumain siya ng pagkain."
ids = [BOS] + tok.encode(sentence) + [EOS]
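When sequences also need to fit a maximum length, the content has to be truncated before the markers are added so that <s> and </s> are never cut off. The sketch below shows one way to do that. The helper name and the max_length parameter are hypothetical.

```python
BOS_ID, EOS_ID = 2, 3  # fixed special-token IDs from the table above


def with_special_tokens(ids, max_length=None):
    """Wrap ids in <s> ... </s>, truncating the content to fit max_length."""
    if max_length is not None:
        if max_length < 2:
            raise ValueError("max_length must leave room for <s> and </s>")
        # Reserve two positions for the BOS and EOS markers.
        ids = ids[: max_length - 2]
    return [BOS_ID] + list(ids) + [EOS_ID]
```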

HuggingFace Transformers integration

TagalogHFTokenizer implements the PreTrainedTokenizer interface so it works directly with any HuggingFace-compatible training framework.

pip install "filipino-tokenizer[hf]"

Loading a trained tokenizer

from filipino_tokenizer.tagalog import TagalogHFTokenizer

tok = TagalogHFTokenizer(
    vocab_file="my_tokenizer/vocab.json",
    merges_file="my_tokenizer/merges.txt",
)

print(tok.vocab_size)      # e.g. 32000, depending on the trained vocabulary
print(tok.bos_token)       # '<s>'
print(tok.pad_token_id)    # 0

Batch encoding

sentences = [
    "Kumain siya ng pagkain.",
    "Nagtatrabaho ang tatay sa opisina araw-araw.",
]
encoding = tok(sentences, padding=True, truncation=True, return_tensors="pt")
# encoding["input_ids"]       — shape (2, seq_len)
# encoding["attention_mask"]  — shape (2, seq_len)

Save and reload in HuggingFace format

tok.save_pretrained("hf_tokenizer/")
# Creates: vocab.json, merges.txt, tokenizer_config.json, special_tokens_map.json

tok2 = TagalogHFTokenizer.from_pretrained("hf_tokenizer/")

Building a dataset for causal LM training

import torch
from torch.utils.data import Dataset

class FilipinoTextDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=512):
        self.encodings = tokenizer(
            texts,
            max_length=max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )

    def __len__(self):
        return self.encodings["input_ids"].shape[0]

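    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        labels = item["input_ids"].clone()
        # Exclude padded positions from the loss (-100 is the ignore index
        # used by HuggingFace causal-LM losses).
        labels[item["attention_mask"] == 0] = -100
        item["labels"] = labels
        return item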

Setting up a model

The model config takes its vocab_size and special-token IDs from the tokenizer:

from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=tok.vocab_size,
    pad_token_id=tok.pad_token_id,
    bos_token_id=tok.bos_token_id,
    eos_token_id=tok.eos_token_id,
)
model = GPT2LMHeadModel(config)

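The same pattern applies to other architectures (LlamaForCausalLM, BertForMaskedLM, T5ForConditionalGeneration, etc.): swap in the matching config and model classes. Some architectures expect additional special tokens (for example, a mask token for BertForMaskedLM), which must be added to the tokenizer first.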
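Since vocab_size fixes the first dimension of the embedding table, its cost can be estimated with simple arithmetic before building a model. The sketch below uses hypothetical function names. The hidden size of 768 is the GPT2Config default (n_embd), and 4 bytes per parameter assumes float32 weights.

```python
def embedding_params(vocab_size, hidden_size):
    """Number of parameters in a vocab_size x hidden_size embedding table."""
    return vocab_size * hidden_size


def embedding_mib(vocab_size, hidden_size, bytes_per_param=4):
    """Approximate size of the embedding table in MiB (float32 by default)."""
    return embedding_params(vocab_size, hidden_size) * bytes_per_param / (1024 ** 2)


print(embedding_params(32000, 768))  # 24576000
print(embedding_mib(32000, 768))     # 93.75
```

Doubling vocab_size doubles both figures, which is the trade-off to weigh against lower fertility.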


Running tests

python -m unittest discover tests -v

All 49 tests should pass. Individual test files:

python -m unittest tests.test_affixes -v
python -m unittest tests.test_segmenter -v
python -m unittest tests.test_tokenizer -v