Heads up: posts on this site are drafted by Claude and fact-checked by Codex. Both can still get things wrong — read with care and verify anything load-bearing before relying on it.

Why does tokenization exist?

Computers can already read bytes. So why do language models insist on chopping text into these weird half-words first?

AI & ML intro Apr 29, 2026

Why it exists

A language model’s input layer is a lookup table. One row per token, each row a vector. Before any “thinking” happens, your text has to be turned into integer IDs that index into that table. The question is: what should those units be?
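To make that concrete, here is a minimal sketch of the lookup. The sizes and IDs are illustrative, not any real model's:

import numpy as np

# The input layer is literally a table: one row per token ID.
vocab_size, embed_dim = 50_000, 768          # illustrative sizes
embedding_table = np.random.randn(vocab_size, embed_dim)

# Tokenization has already turned the text into integer IDs.
token_ids = [3923, 374, 1646]                # hypothetical IDs

# "Reading" the input is just row lookup, no computation yet.
input_vectors = embedding_table[token_ids]   # shape: (3, 768)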

Two obvious answers fail in opposite directions.

One unit per word. Sounds clean. Falls apart immediately. Every typo, every plural, every hyphenation, every language other than English needs its own row. “run”, “running”, “ran”, “runs”, “Running” — five rows for one idea. The lookup table balloons to millions of rows, most of them seen only a handful of times during training, too rarely for the model to learn anything useful about them. And the first time a user types a word the table has never seen, the model is blind: there’s no row to look up. This is the out-of-vocabulary problem, and it haunted NLP for decades.

One unit per character (or byte). Also clean. Also fails, but in the other direction. Now the vocabulary is tiny — a few hundred rows — but every sentence is enormous. “transformer” is one concept to a human and eleven characters to a model. The model has to spend its first several layers just re-discovering that those eleven symbols form a word before it can do anything interesting. Sequence length blows up, context window budget evaporates, and attention, whose compute cost grows quadratically with sequence length, gets brutally expensive.
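To put rough numbers on that quadratic penalty, a back-of-envelope comparison. The characters-per-word and tokens-per-word figures below are common rules of thumb for English, not guarantees for any particular tokenizer:

# The same 1,000-word passage, as characters vs. as tokens.
# ~6 chars/word and ~1.3 tokens/word are rough English rules of thumb.
char_positions = 1_000 * 6      # about 6,000 character-level positions
token_positions = 1_300         # about 1,300 token-level positions

# Attention compares every position against every other position,
# so cost scales with the square of sequence length.
ratio = (char_positions / token_positions) ** 2
print(f"character-level attention costs ~{ratio:.0f}x more")  # ~21x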

Tokenization is the compromise. Cut text into pieces that are bigger than characters but smaller than words, chosen so that common stuff (the, ing, tion, JavaScript) becomes a single token, and rare or unseen stuff (xyzzy42, a new product name, an emoji nobody’s seen before) gracefully falls back to several smaller tokens. Vocabulary stays manageable (~30k–200k entries), sequences stay short, and nothing is ever truly out-of-vocabulary because in the worst case you can always spell it out one byte at a time.
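You can watch both ends of that trade-off with a real tokenizer. A quick sketch using the tiktoken library (exact IDs and split points depend on which encoding you load):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # a GPT-4-era tokenizer

# Common strings map to a few large pieces...
print(enc.encode("the"), enc.encode("tokenization"))

# ...rare strings fall back to more, smaller pieces, but never fail.
print(enc.encode("xyzzy42"))

# And encoding always round-trips, byte-level fallback included.
ids = enc.encode("日本語 xyzzy42")
assert enc.decode(ids) == "日本語 xyzzy42"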

Why it matters now

Once you start building on top of LLMs, tokenization stops being an academic detail and starts showing up in your bills, your bugs, and your benchmarks.

If you ship anything LLM-shaped to production, tokenization is in the critical path of cost, latency, and correctness.

The short answer

tokenization = a learned vocabulary + a deterministic algorithm that splits any string into pieces from that vocabulary

The vocabulary is built once, by scanning a giant corpus and greedily merging the byte pairs that co-occur most. At inference time, a fixed algorithm chops your input into the longest pieces from that vocabulary it can find, falling back to bytes when needed. The model only ever sees the resulting sequence of integer IDs.

How it works

The dominant algorithm in modern LLMs is BPE, byte-pair encoding (or close variants: WordPiece, SentencePiece, tiktoken’s byte-level BPE). The training procedure is almost embarrassingly simple:

1. Start with the vocabulary = all individual bytes (or characters).
2. Count every adjacent pair of symbols in your training corpus.
3. The most frequent pair becomes a new symbol; add it to the vocabulary.
4. Apply that merge everywhere in the corpus.
5. Repeat until the vocabulary is the size you want (e.g. 100k).
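Here is a toy version of that loop, working on characters instead of raw bytes for readability. Real byte-level BPE adds pre-tokenization and a lot of bookkeeping, but the core is this:

from collections import Counter

def train_bpe(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    symbols = list(corpus)                            # 1. start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))    # 2. count adjacent pairs
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]             # 3. most frequent pair wins
        merges.append(best)
        merged, i = [], 0                             # 4. apply the merge everywhere
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges                                     # 5. stop at the target size

print(train_bpe("low lower lowest low low", num_merges=4))
# first merges: ('l', 'o'), then ('lo', 'w'); 'low' quickly becomes one unit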

You end up with a vocabulary that has the bytes at the bottom (so nothing is ever unrepresentable), short common sequences in the middle (th, ing, er), and whole common words or sub-words at the top (tokenization, JavaScript, the). Note the leading space in many entries: most modern tokenizers treat “the” and “ the” as different tokens, because spacing matters for reconstructing the original text.
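The leading-space behavior is easy to check directly (tiktoken again; the exact IDs depend on the encoding):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("the"))    # the word at the start of a string
print(enc.encode(" the"))   # the same word mid-sentence: a different token ID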

At inference, splitting a string is the merge process replayed: start from bytes, keep applying the highest-priority merges until no more apply. It’s deterministic, it’s fast, and it produces the same tokens for the same input every time.
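The inference side is a few lines too. A sketch that pairs with the trainer above, replaying merges in the order they were learned (real tokenizers use ranked lookups and much faster data structures):

def encode(text: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split text by replaying merges in priority (training) order."""
    symbols = list(text)                  # start from characters (bytes in real BPE)
    for pair in merges:                   # earlier merges have higher priority
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols                        # a real tokenizer then maps pieces to IDs

# Merge rules as a trainer like the sketch above might learn them.
merges = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t")]
print(encode("lowest", merges))           # ['low', 'est'], identical on every run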

A worked example with the GPT-style tokenizer:

"tokenization is fun"  →  ["token", "ization", " is", " fun"]   (4 tokens)
"tokenizashun is fun"  →  ["token", "iz", "ash", "un", " is", " fun"] (6)
"strawberry"           →  ["str", "aw", "berry"]                (3 tokens, no 'r' visible)
"日本語"                →  ["日", "本", "語"] or several bytes each, depending on tokenizer

A few things that surprise people the first time:

1. Misspellings cost extra: “tokenizashun is fun” takes 6 tokens where the correct spelling takes 4.
2. The space travels with the word: “ is” and “is” are different tokens.
3. Non-English text often splits into more, smaller pieces per word than English.
4. The model never sees letters, only token IDs, which is why counting the r’s in “strawberry” is famously hard.

The deep reason tokenization is good enough to stay is that it pushes a hard problem (segmenting text) out of the model and into a cheap, fixed preprocessing step. The model gets to spend its capacity on the things it’s uniquely good at — composing meaning across the sequence — and not on re-deriving “these eleven bytes are the word transformer” on every forward pass.

Going deeper

A note on what I’m sure of: the algorithmic shape (BPE-style merges, byte-level fallback, deterministic encoding) and the practical consequences (cost, context, non-English overhead, the strawberry-r family of bugs) are well-established. The relative quality and adoption of specific tokenizers shifts model-by-model and year-by-year — verify against the current model card rather than memorize.