huggingface-tokenizers

Skill from ovachiever/droid-tings

What it does

Enables lightning-fast text tokenization with Rust-powered performance, supporting custom vocabulary training and multiple tokenization algorithms for NLP tasks.

Skill Details

SKILL.md

Fast tokenizers optimized for research and production. Rust-based implementation tokenizes 1GB in <20 seconds. Supports BPE, WordPiece, and Unigram algorithms. Train custom vocabularies, track alignments, handle padding/truncation. Integrates seamlessly with transformers. Use when you need high-performance tokenization or custom tokenizer training.

Overview

# HuggingFace Tokenizers - Fast Tokenization for NLP

Fast, production-ready tokenizers with Rust performance and Python ease-of-use.

When to use HuggingFace Tokenizers

Use HuggingFace Tokenizers when:

  • You need extremely fast tokenization (<20 s per GB of text)
  • You are training custom tokenizers from scratch
  • You want alignment tracking (token → original text position)
  • You are building production NLP pipelines
  • You need to tokenize large corpora efficiently

Performance:

  • Speed: <20 seconds to tokenize 1GB on CPU
  • Implementation: Rust core with Python/Node.js bindings
  • Efficiency: 10-100× faster than pure Python implementations

Use alternatives instead:

  • SentencePiece: Language-independent, used by T5/ALBERT
  • tiktoken: OpenAI's BPE tokenizer for GPT models
  • transformers AutoTokenizer: when you only need to load pretrained tokenizers (it uses this library internally)

Quick start

Installation

```bash
# Install tokenizers
pip install tokenizers

# With transformers integration
pip install tokenizers transformers
```

Load pretrained tokenizer

```python
from tokenizers import Tokenizer

# Load from HuggingFace Hub
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Encode text
output = tokenizer.encode("Hello, how are you?")
print(output.tokens)  # ['hello', ',', 'how', 'are', 'you', '?']
print(output.ids)     # [7592, 1010, 2129, 2024, 2017, 1029]

# Decode back
text = tokenizer.decode(output.ids)
print(text)  # "hello, how are you?"
```

Train custom BPE tokenizer

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize tokenizer with BPE model
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Configure trainer
trainer = BpeTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2
)

# Train on files
files = ["train.txt", "validation.txt"]
tokenizer.train(files, trainer)

# Save
tokenizer.save("my-tokenizer.json")
```

Training time: ~1-2 minutes for 100MB corpus, ~10-20 minutes for 1GB

Batch encoding with padding

```python
# Enable padding
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")

# Encode batch
texts = ["Hello world", "This is a longer sentence"]
encodings = tokenizer.encode_batch(texts)

for encoding in encodings:
    print(encoding.ids)
# [101, 7592, 2088, 102, 3, 3, 3]
# [101, 2023, 2003, 1037, 2936, 6251, 102]
```

Tokenization algorithms

BPE (Byte-Pair Encoding)

How it works:

  1. Start with character-level vocabulary
  2. Find most frequent character pair
  3. Merge into new token, add to vocabulary
  4. Repeat until vocabulary size reached
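
The merge loop itself is simple enough to sketch in plain Python. The toy example below (made-up word counts, not the library's Rust implementation) walks through steps 1-4:

```python
# Toy BPE merge loop: count adjacent pairs, merge the most frequent one, repeat.
from collections import Counter

# Made-up corpus: each word is a tuple of current tokens with its frequency.
word_freqs = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}

def most_frequent_pair(word_freqs):
    pairs = Counter()
    for word, freq in word_freqs.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def apply_merge(pair, word_freqs):
    merged = {}
    for word, freq in word_freqs.items():
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])  # fuse the pair into one token
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

for step in range(3):  # in practice: repeat until the target vocabulary size is reached
    pair = most_frequent_pair(word_freqs)
    word_freqs = apply_merge(pair, word_freqs)
    print(f"merge {step + 1}: {pair}")
```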

Used by: GPT-2, GPT-3, RoBERTa, BART, DeBERTa

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

tokenizer = Tokenizer(BPE(unk_token="<|endoftext|>"))
tokenizer.pre_tokenizer = ByteLevel()

trainer = BpeTrainer(
    vocab_size=50257,
    special_tokens=["<|endoftext|>"],
    min_frequency=2
)

tokenizer.train(files=["data.txt"], trainer=trainer)
```

Advantages:

  • Handles OOV words well (breaks into subwords)
  • Flexible vocabulary size
  • Good for morphologically rich languages

Trade-offs:

  • Tokenization depends on merge order
  • May split common words unexpectedly

WordPiece

How it works:

  1. Start with character vocabulary
  2. Score merge pairs: frequency(pair) / (frequency(first) × frequency(second))
  3. Merge highest scoring pair
  4. Repeat until vocabulary size reached
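
As a quick illustration of the scoring rule (the counts here are invented, not from a real corpus), a pair whose parts rarely occur apart scores higher than a pair built from tokens that are frequent on their own:

```python
# WordPiece-style merge score: frequency(pair) / (frequency(first) * frequency(second))
# Counts are made up for illustration only.
counts = {"un": 120, "##able": 80, "th": 900, "##e": 1000}
pair_counts = {("un", "##able"): 60, ("th", "##e"): 150}

for (first, second), pair_freq in pair_counts.items():
    score = pair_freq / (counts[first] * counts[second])
    print(f"{first}+{second}: {score:.6f}")

# un+##able: 60 / (120 * 80)    = 0.006250 -> merged early, parts rarely appear alone
# th+##e:    150 / (900 * 1000) = 0.000167 -> merged later, parts are common anyway
```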

Used by: BERT, DistilBERT, MobileBERT

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import BertNormalizer

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    continuing_subword_prefix="##"
)

tokenizer.train(files=["corpus.txt"], trainer=trainer)
```

Advantages:

  • Prioritizes informative merges: a high score means the pair appears together far more often than its parts appear independently
  • Used successfully in BERT (state-of-the-art results)

Trade-offs:

  • Unknown words become [UNK] if no subword match
  • Saves only the final vocabulary, not the merge rules (larger files)

Unigram

How it works:

  1. Start with large vocabulary (all substrings)
  2. Compute loss for corpus with current vocabulary
  3. Remove tokens with minimal impact on loss
  4. Repeat until vocabulary size reached
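
The pruning criterion relies on how much the corpus loss grows when a token is removed. The sketch below (made-up probabilities; the real trainer estimates them with EM) shows the core computation: the negative log-likelihood of a word's most probable segmentation.

```python
# Toy Unigram scoring: best segmentation of a word under a fixed token vocabulary.
# Probabilities are invented for illustration.
import math

probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.07, "hug": 0.10, "hugs": 0.08}

def best_segmentation(word, probs):
    # best[i] = (lowest negative log-prob of word[:i], tokens achieving it)
    best = [(0.0, [])] + [(math.inf, None)] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start][0] != math.inf:
                cost = best[start][0] - math.log(probs[piece])
                if cost < best[end][0]:
                    best[end] = (cost, best[start][1] + [piece])
    return best[-1]

loss, tokens = best_segmentation("hugs", probs)
print(tokens, round(loss, 3))  # ['hugs'] with loss -log(0.08) ~= 2.526
```

Removing a token (say "hugs") forces longer, lower-probability segmentations and raises this loss; the trainer drops the tokens whose removal raises the total corpus loss the least.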

Used by: ALBERT, T5, mBART, XLNet (via SentencePiece)

```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())

trainer = UnigramTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<s>", "</s>"],
    unk_token="<unk>"
)

tokenizer.train(files=["data.txt"], trainer=trainer)
```

Advantages:

  • Probabilistic (finds most likely tokenization)
  • Works well for languages without word boundaries
  • Handles diverse linguistic contexts

Trade-offs:

  • Computationally expensive to train
  • More hyperparameters to tune

Tokenization pipeline

Complete pipeline: Normalization → Pre-tokenization → Model → Post-processing
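
All four stages hang off a single Tokenizer object. A minimal BERT-style assembly, sketched with components that appear in the subsections below, looks roughly like this:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.normalizers import NFD, Lowercase, StripAccents, Sequence
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))                     # model
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])   # normalization
tokenizer.pre_tokenizer = Whitespace()                                  # pre-tokenization
tokenizer.post_processor = TemplateProcessing(                          # post-processing
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
# The model still needs training (or a loaded vocabulary) before it can encode text.
```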

Normalization

Clean and standardize text:

```python
from tokenizers.normalizers import NFD, StripAccents, Lowercase, Sequence

tokenizer.normalizer = Sequence([
    NFD(),          # Unicode normalization (decompose)
    Lowercase(),    # Convert to lowercase
    StripAccents()  # Remove accents
])

# Input: "Héllo WORLD"
# After normalization: "hello world"
```

Common normalizers:

  • NFD, NFC, NFKD, NFKC - Unicode normalization forms
  • Lowercase() - Convert to lowercase
  • StripAccents() - Remove accents (é → e)
  • Strip() - Remove whitespace
  • Replace(pattern, content) - Regex replacement

Pre-tokenization

Split text into word-like units:

```python
from tokenizers.pre_tokenizers import Whitespace, Punctuation, Sequence, ByteLevel

# Split on whitespace and punctuation
tokenizer.pre_tokenizer = Sequence([
    Whitespace(),
    Punctuation()
])

# Input: "Hello, world!"
# After pre-tokenization: ["Hello", ",", "world", "!"]
```

Common pre-tokenizers:

  • Whitespace() - Split on spaces, tabs, newlines
  • ByteLevel() - GPT-2 style byte-level splitting
  • Punctuation() - Isolate punctuation
  • Digits(individual_digits=True) - Split digits individually
  • Metaspace() - Replace spaces with ▁ (SentencePiece style)

Post-processing

Add special tokens for model input:

```python
from tokenizers.processors import TemplateProcessing

# BERT-style: [CLS] sentence [SEP]
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)
```

Common patterns:

```python
# GPT-2: sentence <|endoftext|>
TemplateProcessing(
    single="$A <|endoftext|>",
    special_tokens=[("<|endoftext|>", 50256)]
)

# RoBERTa: <s> sentence </s>
TemplateProcessing(
    single="<s> $A </s>",
    pair="<s> $A </s> </s> $B </s>",
    special_tokens=[("<s>", 0), ("</s>", 2)]
)
```

Alignment tracking

Track token positions in original text:

```python
text = "Hello, world!"
output = tokenizer.encode(text)

# Get token offsets
for token, offset in zip(output.tokens, output.offsets):
    start, end = offset
    print(f"{token:10} → [{start:2}, {end:2}): {text[start:end]!r}")

# Output:
# hello      → [ 0,  5): 'Hello'
# ,          → [ 5,  6): ','
# world      → [ 7, 12): 'world'
# !          → [12, 13): '!'
```

Use cases:

  • Named entity recognition (map predictions back to text)
  • Question answering (extract answer spans)
  • Token classification (align labels to original positions)
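
For example, a hedged sketch of the NER use case: assuming you have one predicted label per token from some model (the labels below are made up), the offsets recover the matching characters in the original string.

```python
# Hypothetical NER post-processing: predictions must align with output.tokens.
text = "Hugging Face is based in Paris"
output = tokenizer.encode(text)

predictions = ["O"] * len(output.tokens)
predictions[1] = "B-ORG"  # pretend the model tagged token 1 as the start of an entity

for label, (start, end) in zip(predictions, output.offsets):
    if label != "O":
        print(label, repr(text[start:end]))  # e.g. B-ORG 'Hugging'
```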

Integration with transformers

Load with AutoTokenizer

```python
from transformers import AutoTokenizer

# AutoTokenizer automatically uses fast tokenizers when available
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Check if using fast tokenizer
print(tokenizer.is_fast)  # True

# Access the underlying tokenizers.Tokenizer
fast_tokenizer = tokenizer.backend_tokenizer
print(type(fast_tokenizer))  # <class 'tokenizers.Tokenizer'>
```

Convert custom tokenizer to transformers

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from transformers import PreTrainedTokenizerFast

# Train custom tokenizer
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# ... train tokenizer ...
tokenizer.save("my-tokenizer.json")

# Wrap for transformers
transformers_tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="my-tokenizer.json",
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]"
)

# Use like any transformers tokenizer
outputs = transformers_tokenizer(
    "Hello world",
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
```

Common patterns

Train from iterator (large datasets)

```python
from datasets import load_dataset

# Load dataset
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# Create batch iterator
def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]["text"]

# Train tokenizer
tokenizer.train_from_iterator(
    batch_iterator(),
    trainer=trainer,
    length=len(dataset)  # For progress bar
)
```

Training time: ~10-20 minutes for a 1 GB corpus

Enable truncation and padding

```python
# Enable truncation
tokenizer.enable_truncation(max_length=512)

# Enable padding
tokenizer.enable_padding(
    pad_id=tokenizer.token_to_id("[PAD]"),
    pad_token="[PAD]",
    length=512  # Fixed length, or None for batch max
)

# Encode with both
output = tokenizer.encode("This is a long sentence that will be truncated...")
print(len(output.ids))  # 512
```

Multi-processing

```python
from tokenizers import Tokenizer
from multiprocessing import Pool

# Load tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")

# Corpus to encode: here, one text per line of an example file
corpus = open("corpus.txt").read().splitlines()

def encode_batch(texts):
    return tokenizer.encode_batch(texts)

# Process large corpus in parallel
with Pool(8) as pool:
    # Split corpus into chunks
    chunk_size = 1000
    chunks = [corpus[i:i + chunk_size] for i in range(0, len(corpus), chunk_size)]

    # Encode in parallel
    results = pool.map(encode_batch, chunks)
```

Speedup: 5-8× with 8 cores

Performance benchmarks

Training speed

| Corpus Size | BPE (30k vocab) | WordPiece (30k) | Unigram (8k) |
|-------------|-----------------|-----------------|--------------|
| 10 MB       | 15 sec          | 18 sec          | 25 sec       |
| 100 MB      | 1.5 min         | 2 min           | 4 min        |
| 1 GB        | 15 min          | 20 min          | 40 min       |

Hardware: 16-core CPU, tested on English Wikipedia

Tokenization speed

| Implementation | 1 GB corpus | Throughput |
|----------------|-------------|------------|
| Pure Python    | ~20 minutes | ~50 MB/min |
| HF Tokenizers  | ~15 seconds | ~4 GB/min  |
| Speedup        | 80×         | 80×        |

Test: English text, average sentence length 20 words
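
Throughput depends heavily on hardware and text, so treat the numbers above as rough. A quick, hedged way to measure it on your own corpus:

```python
# Minimal throughput check (results will vary with CPU, text length, and tokenizer).
import time
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
texts = ["This is a sample sentence for benchmarking tokenization throughput."] * 100_000

start = time.perf_counter()
tokenizer.encode_batch(texts)
elapsed = time.perf_counter() - start

total_mb = sum(len(t.encode("utf-8")) for t in texts) / 1e6
print(f"{total_mb:.1f} MB in {elapsed:.2f} s -> {total_mb / elapsed * 60:.0f} MB/min")
```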

Memory usage

| Task                  | Memory  |
|-----------------------|---------|
| Load tokenizer        | ~10 MB  |
| Train BPE (30k vocab) | ~200 MB |
| Encode 1M sentences   | ~500 MB |

Supported models

Pre-trained tokenizers available via from_pretrained():

BERT family:

  • bert-base-uncased, bert-large-cased
  • distilbert-base-uncased
  • roberta-base, roberta-large

GPT family:

  • gpt2, gpt2-medium, gpt2-large
  • distilgpt2

T5 family:

  • t5-small, t5-base, t5-large
  • google/flan-t5-xxl

Other:

  • facebook/bart-base, facebook/mbart-large-cc25
  • albert-base-v2, albert-xlarge-v2
  • xlm-roberta-base, xlm-roberta-large

Browse all: https://huggingface.co/models?library=tokenizers

References

  • [Training Guide](references/training.md) - Train custom tokenizers, configure trainers, handle large datasets
  • [Algorithms Deep Dive](references/algorithms.md) - BPE, WordPiece, Unigram explained in detail
  • [Pipeline Components](references/pipeline.md) - Normalizers, pre-tokenizers, post-processors, decoders
  • [Transformers Integration](references/integration.md) - AutoTokenizer, PreTrainedTokenizerFast, special tokens

Resources

  • Docs: https://huggingface.co/docs/tokenizers
  • GitHub: https://github.com/huggingface/tokenizers ⭐ 9,000+
  • Version: 0.20.0+
  • Course: https://huggingface.co/learn/nlp-course/chapter6/1
  • Papers: BPE (Sennrich et al., 2016), WordPiece (Schuster & Nakajima, 2012)