sentencepiece

What it does

Tokenizes raw Unicode text using unsupervised BPE and Unigram algorithms, enabling multilingual and CJK language support with deterministic, lightweight vocabulary generation.


Skill Details

SKILL.md

Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigram algorithms. Fast (50k sentences/sec), lightweight (6MB memory), deterministic vocabulary. Used by T5, ALBERT, XLNet, mBART. Train on raw text without pre-tokenization. Use when you need multilingual support, CJK languages, or reproducible tokenization.

Overview

# SentencePiece - Language-Independent Tokenization

Unsupervised tokenizer that works on raw text without language-specific preprocessing.

When to use SentencePiece

Use SentencePiece when:

  • Building multilingual models (no language-specific rules)
  • Working with CJK languages (Chinese, Japanese, Korean)
  • Need reproducible tokenization (deterministic vocabulary)
  • Want to train on raw text (no pre-tokenization needed)
  • Require lightweight deployment (6MB memory, 50k sentences/sec)

Performance:

  • Speed: 50,000 sentences/sec
  • Memory: ~6MB for loaded model
  • Languages: All (language-independent)

Use alternatives instead:

  • HuggingFace Tokenizers: Faster training, more flexibility
  • tiktoken: OpenAI models (GPT-3.5/4)
  • BERT WordPiece: English-centric tasks

Quick start

Installation

```bash
# Python
pip install sentencepiece

# C++ (requires CMake)
git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake .. && make -j $(nproc)
sudo make install
```

Train model

```bash
# Command-line (BPE with 8000 vocab)
spm_train --input=data.txt --model_prefix=m --vocab_size=8000 --model_type=bpe
```

```python
# Python API
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='m',
    vocab_size=8000,
    model_type='bpe'
)
```

Training time: ~1-2 minutes for 100MB corpus

Encode and decode

```python
import sentencepiece as spm

# Load model
sp = spm.SentencePieceProcessor(model_file='m.model')

# Encode to pieces
pieces = sp.encode('This is a test', out_type=str)
print(pieces)  # ['▁This', '▁is', '▁a', '▁test']

# Encode to IDs
ids = sp.encode('This is a test', out_type=int)
print(ids)  # [284, 47, 11, 1243]

# Decode
text = sp.decode(ids)
print(text)  # "This is a test"
```
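
The `encode` call above also accepts a whole list of sentences at once, and recent sentencepiece releases expose `add_bos`/`add_eos` flags; a minimal sketch, assuming the `m.model` file from the training step:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

# Batch encoding: passing a list of sentences returns a list of ID lists.
batch_ids = sp.encode(['This is a test', 'Another sentence'], out_type=int)
print(batch_ids)

# Optional BOS/EOS markers (supported by recent sentencepiece releases;
# requires the model to define bos/eos pieces, which is the training default).
ids = sp.encode('This is a test', out_type=int, add_bos=True, add_eos=True)
print(ids)
```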

Language-independent design

Whitespace as symbol (▁)

```python
text = "Hello world"

pieces = sp.encode(text, out_type=str)
print(pieces)  # ['▁Hello', '▁world']

# Decode preserves spaces
decoded = sp.decode_pieces(pieces)
print(decoded)  # "Hello world"
```

Key principle: treat text as raw Unicode; whitespace is represented by the meta symbol ▁ (U+2581)
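
That meta symbol is also why decoding plain text is so simple: joining the pieces and mapping ▁ back to a space reproduces the original string. A small illustrative sketch (not the library's internal implementation, which also handles unknown and control pieces):

```python
pieces = ['▁Hello', '▁world']

# Join the pieces, turn the U+2581 meta symbol back into spaces,
# and drop the leading space introduced by the first piece.
manual = ''.join(pieces).replace('\u2581', ' ').lstrip()
print(manual)  # "Hello world"

# For simple inputs this matches sp.decode_pieces(pieces).
```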

Tokenization algorithms

BPE (Byte-Pair Encoding)

```python
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='bpe_model',
    vocab_size=16000,
    model_type='bpe'
)
```

Used by: mBART

Unigram (default)

```python
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='unigram_model',
    vocab_size=8000,
    model_type='unigram'
)
```

Used by: T5, ALBERT, XLNet
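
To see how the two algorithms segment text differently, load both models trained above and encode the same word; the exact pieces depend entirely on your corpus, so the comments are only illustrative:

```python
import sentencepiece as spm

# Assumes both models from the training snippets above exist on disk.
bpe = spm.SentencePieceProcessor(model_file='bpe_model.model')
uni = spm.SentencePieceProcessor(model_file='unigram_model.model')

text = 'internationalization'
print('BPE:    ', bpe.encode(text, out_type=str))  # e.g. ['▁intern', 'ation', 'al', 'ization']
print('Unigram:', uni.encode(text, out_type=str))  # segmentation usually differs

# BPE builds its vocabulary by greedily merging frequent symbol pairs;
# Unigram keeps the most probable segmentation under a unigram language model.
```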

Training configuration

Essential parameters

```python
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='m',
    vocab_size=32000,
    model_type='unigram',
    character_coverage=0.9995,   # 1.0 for CJK
    user_defined_symbols=['[SEP]', '[CLS]'],
    unk_piece='<unk>',
    num_threads=16
)
```
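
Once trained, the resulting model can be sanity-checked directly from Python; `get_piece_size`, `piece_to_id`, and `id_to_piece` are part of the standard API. A sketch, assuming the model trained above:

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

print(sp.get_piece_size())       # 32000, the requested vocab_size
print(sp.piece_to_id('[SEP]'))   # user_defined_symbols get dedicated IDs
print(sp.id_to_piece(0))         # '<unk>' with default special-token settings
print(sp.unk_id(), sp.bos_id(), sp.eos_id(), sp.pad_id())  # pad_id() is -1 when padding is disabled (default)
```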

Character coverage

| Language Type | Coverage | Rationale |
|---------------|----------|-----------|
| English | 0.9995 | Most common chars |
| CJK (Chinese) | 1.0 | All characters needed |
| Multilingual | 0.9995 | Balance |
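
For a CJK corpus the main change is full character coverage (and usually a larger vocabulary); a hedged sketch with placeholder file names:

```python
import sentencepiece as spm

# Hypothetical Chinese/Japanese/Korean corpus; character_coverage=1.0 keeps
# every observed character so rare hanzi/kanji do not collapse to <unk>.
spm.SentencePieceTrainer.train(
    input='cjk_corpus.txt',   # placeholder path
    model_prefix='cjk',
    vocab_size=32000,         # CJK models typically need larger vocabularies
    model_type='unigram',
    character_coverage=1.0
)
```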

Encoding options

Subword regularization

```python
# Sample different tokenizations
for _ in range(3):
    pieces = sp.encode('tokenization', out_type=str, enable_sampling=True, alpha=0.1)
    print(pieces)

# Output (different each time):
# ['▁token', 'ization']
# ['▁tok', 'en', 'ization']
```

Use case: Data augmentation for robustness.
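
In practice the augmentation is just re-encoding each training sentence several times with sampling enabled; a sketch (corpus and sample count are placeholders):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

corpus = ['This is a test', 'Subword regularization improves robustness']

augmented = []
for sentence in corpus:
    for _ in range(4):  # number of sampled variants per sentence (placeholder)
        ids = sp.encode(sentence, out_type=int,
                        enable_sampling=True, nbest_size=-1, alpha=0.1)
        augmented.append(ids)

# Each sentence now appears with several segmentations, which acts as
# tokenization-level noise for the downstream model.
print(len(augmented))  # 8
```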

Common patterns

T5-style training

```python
spm.SentencePieceTrainer.train(
    input='c4_corpus.txt',
    model_prefix='t5',
    vocab_size=32000,
    model_type='unigram',
    user_defined_symbols=[f'<extra_id_{i}>' for i in range(100)],  # T5 sentinel tokens
    unk_id=2,
    eos_id=1,
    pad_id=0,
    bos_id=-1  # T5 does not use a BOS token
)
```

Integration with transformers

```python
from transformers import T5Tokenizer

# T5 uses SentencePiece internally
tokenizer = T5Tokenizer.from_pretrained('t5-base')

inputs = tokenizer('translate English to French: Hello', return_tensors='pt')
```
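
A model trained with `spm_train` can also be wrapped directly; the slow `T5Tokenizer` takes the `.model` file as its `vocab_file`. A sketch, assuming the `m.model` from the quick start (details vary across transformers versions):

```python
from transformers import T5Tokenizer

# Wrap a locally trained SentencePiece model (sketch; behavior may differ
# slightly between transformers versions).
tokenizer = T5Tokenizer(vocab_file='m.model')

print(tokenizer.tokenize('This is a test'))
print(tokenizer('This is a test').input_ids)
```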

Performance benchmarks

Training speed

| Corpus | BPE (16k) | Unigram (8k) |
|--------|-----------|--------------|
| 100 MB | 1-2 min | 3-4 min |
| 1 GB | 10-15 min | 30-40 min |

Tokenization speed

  • SentencePiece: 50,000 sentences/sec
  • HF Tokenizers: 200,000 sentences/sec (4× faster)
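
The throughput figures above depend heavily on sentence length, hardware, and model size; a rough way to measure them for your own model is to time a batch encode:

```python
import time
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='m.model')

# Hypothetical workload: one sentence repeated; substitute real data
# for a meaningful number.
sentences = ['This is a reasonably long benchmark sentence.'] * 50_000

start = time.perf_counter()
sp.encode(sentences, out_type=int)   # batch encode
elapsed = time.perf_counter() - start

print(f'{len(sentences) / elapsed:,.0f} sentences/sec')
```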

Supported models

T5 family: t5-base, t5-large (32k vocab, Unigram)

ALBERT: albert-base-v2 (30k vocab, Unigram)

XLNet: xlnet-base-cased (32k vocab, Unigram)

mBART: facebook/mbart-large-50 (250k vocab, BPE)

References

  • [Training Guide](references/training.md) - Detailed options, corpus preparation
  • [Algorithms](references/algorithms.md) - BPE vs Unigram, subword regularization

Resources

  • GitHub: https://github.com/google/sentencepiece ⭐ 10,000+
  • Paper: https://arxiv.org/abs/1808.06226 (EMNLP 2018)
  • Version: 0.2.0+