TagalogHFTokenizer
==================

.. autoclass:: filipino_tokenizer.tagalog.hf_tokenizer.TagalogHFTokenizer
    :members:
    :undoc-members:
    :show-inheritance:

----

Overview
--------

``TagalogHFTokenizer`` wraps
:class:`~filipino_tokenizer.tagalog.tokenizer.TagalogTokenizer` behind the
HuggingFace ``PreTrainedTokenizer`` interface. It requires
``transformers>=4.30``:

.. code-block:: bash

    pip install filipino-tokenizer[hf]

----

Method reference
----------------

.. list-table::
    :header-rows: 1
    :widths: 30 35 35

    * - Method
      - Signature
      - Description
    * - ``__call__``
      - ``(text, padding, truncation, return_tensors, ...)``
      - Standard HF tokenizer call; returns ``input_ids``, ``attention_mask``.
    * - ``encode``
      - ``(text) → list[int]``
      - Encode a single string to token IDs.
    * - ``decode``
      - ``(ids) → str``
      - Decode token IDs back to text.
    * - ``save_pretrained``
      - ``(directory)``
      - Save in HF format (adds ``tokenizer_config.json``).
    * - ``from_pretrained``
      - ``(directory_or_repo)``
      - Load from a local directory or HuggingFace Hub repo.

----

Attributes
----------

.. list-table::
    :header-rows: 1
    :widths: 25 75

    * - Attribute
      - Description
    * - ``vocab_size``
      - Total vocabulary size (32,000 for the pretrained model).
    * - ``bos_token`` / ``bos_token_id``
      - ``""`` / ``2``
    * - ``eos_token`` / ``eos_token_id``
      - ``""`` / ``3``
    * - ``pad_token`` / ``pad_token_id``
      - ``""`` / ``0``
    * - ``unk_token`` / ``unk_token_id``
      - ``""`` / ``1``

Compatibility note
------------------

``TagalogHFTokenizer`` now validates special-token mappings on load. If a
custom or older tokenizer directory is missing one or more expected special
token strings, it falls back to safe in-vocabulary IDs so batched padding and
truncation remain stable in HuggingFace training/evaluation pipelines.

----

Examples
--------

Load the bundled pretrained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

No path required; the 32k model is shipped with the package:

.. code-block:: python

    from filipino_tokenizer.tagalog import TagalogHFTokenizer

    tok = TagalogHFTokenizer()  # loads bundled pretrained model
    encoding = tok("Kumain siya ng pagkain.", return_tensors="pt")
    # {"input_ids": tensor([[...]]), "attention_mask": tensor([[...]])}

Load from a custom trained model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    tok = TagalogHFTokenizer.from_pretrained("my_tokenizer/")

    # or pass files directly:
    tok = TagalogHFTokenizer(
        vocab_file="my_tokenizer/vocab.json",
        merges_file="my_tokenizer/merges.txt",
    )

Batch with padding
~~~~~~~~~~~~~~~~~~

.. code-block:: python

    sentences = [
        "Kumain siya ng pagkain.",
        "Nagtatrabaho ang tatay sa opisina araw-araw.",
    ]
    encoding = tok(
        sentences,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )

Save and reload
~~~~~~~~~~~~~~~

.. code-block:: python

    tok.save_pretrained("hf_tokenizer/")
    tok2 = TagalogHFTokenizer.from_pretrained("hf_tokenizer/")

Use with a language model
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from transformers import GPT2Config, GPT2LMHeadModel

    model = GPT2LMHeadModel(GPT2Config(
        vocab_size=tok.vocab_size,
        pad_token_id=tok.pad_token_id,
        bos_token_id=tok.bos_token_id,
        eos_token_id=tok.eos_token_id,
    ))
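
Sketch of the special-token fallback
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The fallback described in the compatibility note can be illustrated with a
small, self-contained sketch. It is not the library's actual implementation.
The helper ``resolve_special_token_ids``, the token strings (``<pad>``,
``<unk>``, ``<s>``, ``</s>``) and the fallback policy (use the smallest ID
present in the vocabulary) are all assumptions made for illustration. The
sketch only shows why resolving every special token to an in-vocabulary ID
keeps padding and truncation from producing out-of-range indices.

.. code-block:: python

    from typing import Dict

    # Hypothetical token strings, chosen for illustration only; the strings
    # used by the real tokenizer may differ.
    EXPECTED_SPECIAL_TOKENS = {
        "pad_token": "<pad>",
        "unk_token": "<unk>",
        "bos_token": "<s>",
        "eos_token": "</s>",
    }


    def resolve_special_token_ids(vocab: Dict[str, int]) -> Dict[str, int]:
        """Map each special-token role to an ID that exists in ``vocab``.

        Assumed policy (not taken from the library): use the token's own ID
        when its string is in the vocabulary, otherwise fall back to the
        smallest ID in the vocabulary so the result is always in range.
        """
        if not vocab:
            raise ValueError("vocab must not be empty")
        fallback_id = min(vocab.values())
        return {
            role: vocab.get(token, fallback_id)
            for role, token in EXPECTED_SPECIAL_TOKENS.items()
        }


    # An older vocabulary that lacks the BOS and EOS strings.
    old_vocab = {"<pad>": 0, "<unk>": 1, "ka": 2, "in": 3}
    ids = resolve_special_token_ids(old_vocab)
    # pad and unk keep their own IDs; bos and eos fall back to ID 0.

A real tokenizer might prefer a different fallback, for example reusing the
unknown-token ID. The property that matters for batching is that every
resolved ID is a valid index into the embedding table.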