Why does tokenization exist?
Computers can already read bytes. So why do language models insist on chopping text into these weird half-words first?
Why it exists
A language model’s input layer is a lookup table. One row per token, each row a vector. Before any “thinking” happens, your text has to be turned into integer IDs that index into that table. The question is: what should those units be?
Two obvious answers fail in opposite directions.
One unit per word. Sounds clean. Falls apart immediately. Every typo, every plural, every hyphenation, every language beyond English needs its own row. “run”, “running”, “ran”, “runs”, “Running”: five rows for one idea. The lookup table balloons to millions of rows, most of them seen only a handful of times during training, so the model never learns anything useful about them. And the first time a user types a word the table has never seen, the model is blind: there’s no row to look up. This is the out-of-vocabulary problem, and it haunted NLP for decades.
One unit per character (or byte). Also clean. Also fails, but in the other direction. Now the vocabulary is tiny, a few hundred rows, but every sentence is enormous. “transformer” is one concept to a human and eleven characters to a model. The model has to spend its first several layers just re-discovering that those eleven symbols form a word, before it can do anything interesting. Sequence length blows up, context window budget evaporates, and attention, whose compute cost grows quadratically with sequence length, gets brutally expensive.
Tokenization is the compromise. Cut text into pieces that are bigger than characters but smaller than words, chosen so that common stuff (“the”, “ing”, “tion”, “JavaScript”) becomes a single token, and rare or unseen stuff (“xyzzy42”, a new product name, an emoji nobody’s seen before) gracefully falls back to several smaller tokens. Vocabulary stays manageable (~30k–200k entries), sequences stay short, and nothing is ever truly out-of-vocabulary because in the worst case you can always spell it out one byte at a time.
Why it matters now
Once you start building on top of LLMs, tokenization stops being an academic detail and starts showing up in your bills, your bugs, and your benchmarks.
- Pricing is per-token. Every API you call charges by tokens in and tokens out. “How long is this prompt?” is not a question about characters or words; it’s a question about that model’s tokenizer. The same string costs different amounts on different models (a cost sketch follows this list).
- Context windows are measured in tokens. “200k context” is 200,000 tokens, not characters. How much of your codebase actually fits depends on how token-efficient the tokenizer is for your content. Code, non-English languages, and structured data tokenize very differently from English prose.
- Weird model failures trace back here. The classic “how many ‘r’s in strawberry?” miss happens partly because the model never sees the letters individually: it sees a couple of tokens and is being asked a question about a representation it doesn’t have. The same family of bug covers arithmetic on long numbers, exact string manipulation, and counting characters.
- Prompt injection and jailbreaks sometimes exploit tokenizer quirks. Unicode lookalikes, unusual whitespace, and rare-token “glitch” sequences can route a prompt through paths the safety training never saw.
If you ship anything LLM-shaped to production, tokenization is in the critical path of cost, latency, and correctness.
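To make the pricing point concrete, here is a minimal cost-estimation sketch in Python using OpenAI’s tiktoken library. The prices and the cl100k_base encoding are placeholders I chose for illustration; match both to your actual provider and model.

```python
# pip install tiktoken
import tiktoken

# Placeholder prices in $ per million tokens -- NOT real quotes;
# check your provider's current pricing page.
PRICE_IN_PER_MTOK = 3.00
PRICE_OUT_PER_MTOK = 15.00

# The encoding must match the model you're billed for.
enc = tiktoken.get_encoding("cl100k_base")

def estimate_cost(prompt: str, expected_output_tokens: int) -> float:
    """Rough estimate: tokens in at the input rate, expected tokens
    out at the output rate."""
    input_tokens = len(enc.encode(prompt))
    return (input_tokens * PRICE_IN_PER_MTOK
            + expected_output_tokens * PRICE_OUT_PER_MTOK) / 1_000_000

print(estimate_cost("Summarize the attached report in three bullets.", 300))
```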
The short answer
tokenization = a learned vocabulary + a deterministic algorithm that splits any string into pieces from that vocabulary
The vocabulary is built once, by scanning a giant corpus and greedily merging the byte pairs that co-occur most. At inference time, a fixed algorithm replays those merges to chop your input into pieces from that vocabulary, falling back to raw bytes when needed. The model only ever sees the resulting sequence of integer IDs.
How it works
The dominant algorithm in modern LLMs is BPE (or close variants: WordPiece, SentencePiece, tiktoken’s byte-level BPE). The training procedure is almost embarrassingly simple:
1. Start with the vocabulary = all individual bytes (or characters).
2. Count every adjacent pair of symbols in your training corpus.
3. The most frequent pair becomes a new symbol; add it to the vocabulary.
4. Apply that merge everywhere in the corpus.
5. Repeat until the vocabulary is the size you want (e.g. 100k).
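Here is a toy sketch of that loop in Python, operating on raw bytes. The function names are mine, and it is deliberately naive (no pre-tokenization, no tie-breaking rules, quadratic re-scanning), but it is the same greedy merge idea:

```python
from collections import Counter

def train_bpe(corpus: bytes, num_merges: int) -> dict:
    """Toy BPE trainer: start from the 256 byte values, repeatedly
    merge the most frequent adjacent pair into a new symbol."""
    seq = list(corpus)          # step 1: vocabulary = individual bytes
    merges = {}                 # (left, right) -> new symbol id, in merge order
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))        # step 2: count adjacent pairs
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # step 3: most frequent pair
        merges[(a, b)] = next_id
        seq = _apply(seq, a, b, next_id)          # step 4: apply it everywhere
        next_id += 1                              # step 5: repeat to target size
    return merges

def _apply(seq, a, b, new_id):
    """Replace every adjacent (a, b) in seq with new_id."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

merges = train_bpe(b"low lower lowest slow slowly", num_merges=10)
```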
You end up with a vocabulary that has the bytes at the bottom (so nothing is ever unrepresentable), short common sequences in the middle (“th”, “ing”, “er”), and whole common words or sub-words at the top (“tokenization”, “JavaScript”, “the”). Note the leading space: most modern tokenizers treat “ the” (space included) and “the” as different tokens, because spacing matters for reconstruction.
At inference, splitting a string is the merge process replayed: start from bytes, keep applying the highest-priority merges until no more apply. It’s deterministic, it’s fast, and it produces the same tokens for the same input every time.
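Continuing the toy sketch above (reusing its _apply helper and merges table, both my own illustrative names), encoding is exactly that replay:

```python
def encode(text: bytes, merges: dict) -> list[int]:
    """Replay learned merges: of all mergeable pairs currently in the
    sequence, apply the one learned earliest; repeat until none apply."""
    seq = list(text)
    while True:
        candidates = [p for p in zip(seq, seq[1:]) if p in merges]
        if not candidates:
            return seq  # worst case: plain bytes, never out-of-vocabulary
        # The earliest-learned merge has the smallest symbol id.
        a, b = min(candidates, key=lambda p: merges[p])
        seq = _apply(seq, a, b, merges[(a, b)])

ids = encode(b"lowest", merges)   # same input -> same ids, every time
```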
Some worked examples with a GPT-style tokenizer (exact splits vary by model and vocabulary):
"tokenization is fun" → ["token", "ization", " is", " fun"] (4 tokens)
"tokenizashun is fun" → ["token", "iz", "ash", "un", " is", " fun"] (6)
"strawberry" → ["str", "aw", "berry"] (3 tokens, no 'r' visible)
"日本語" → ["日", "本", "語"] or several bytes each, depending on tokenizer
A few things that surprise people the first time:
- Tokens are not morphemes. They’re whatever pairs happened to co-occur often in the training corpus. “ization” is one token because lots of English words end that way; the tokenizer doesn’t know it’s a suffix, it just knows those bytes show up together.
- The same word tokenizes differently in different positions. “Hello” at the start of a string and “ Hello” mid-sentence are usually different tokens. This is correct behavior, but it makes “count the tokens of this word” a slightly ill-posed question (see the sketch after this list).
- Non-English text often costs 2–4× more tokens per character. Because the merges were learned from corpora that were mostly English (or mostly code, or mostly whatever the trainer optimized for), other scripts fall back to shorter, less efficient pieces. A Japanese sentence and its English translation can have wildly different token counts even when they carry the same meaning. This is a real cost-and-fairness issue, not a rounding error.
- Switching tokenizers means retraining. A model is welded to the exact vocabulary it was trained on. You can’t take GPT’s tokenizer, swap in Llama’s, and expect anything sensible to come out — the embedding rows are indexed by ID, and the IDs no longer mean what they meant.
- There’s an active research thread on getting rid of tokenization entirely. Byte-level and “tokenizer-free” architectures (ByT5, Charformer, more recently work on byte-level transformers) try to operate directly on bytes, paying the longer-sequence cost in exchange for removing a brittle preprocessing step. As of writing, the production frontier is still tokenized; the case for going byte-native is real but hasn’t won. I don’t have a confident read on how soon, or whether, that changes.
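The position and non-English points above are easy to check directly. A quick sketch, again with tiktoken; the Japanese sentence is my own rough example, and exact IDs and ratios depend on the encoding:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Same word, different position: the leading space changes the token.
print(enc.encode("Hello"))     # one id for "Hello"
print(enc.encode(" Hello"))    # a different id for " Hello"

# Same meaning, different script: compare tokens per character.
en = "Tokenization is interesting."
ja = "トークン化は面白い。"    # rough Japanese equivalent (my example)
print(len(enc.encode(en)) / len(en))
print(len(enc.encode(ja)) / len(ja))
```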
The deep reason tokenization is good enough to stay is that it pushes a hard problem (segmenting text) out of the model and into a cheap, fixed preprocessing step. The model gets to spend its capacity on the things it’s uniquely good at, composing meaning across the sequence, and not on re-deriving “these eleven bytes are the word ‘transformer’” on every forward pass.
Famous related terms
- BPE (Byte Pair Encoding): BPE = greedy merge of most-common pairs. The dominant tokenization algorithm; originally a 1994 compression idea (Gage), repurposed for NLP by Sennrich et al., 2016.
- WordPiece: WordPiece ≈ BPE + a likelihood-based merge criterion. The variant used by BERT.
- SentencePiece: SentencePiece = BPE/unigram + treating input as a raw stream. Doesn’t require pre-tokenized words; popular for languages written without whitespace.
- Unigram LM tokenization: unigram = pick a vocab that maximizes corpus likelihood under a unigram model. An alternative to BPE used in SentencePiece.
- tiktoken: tiktoken = OpenAI’s fast byte-level BPE implementation; the reference for GPT-family token counts.
- Embeddings: embedding = learned vector representation of a discrete thing; what tokens get turned into after tokenization, i.e. the float vectors the model actually does math on.
- Context window: context window = max number of tokens a model can attend to in one pass; the budget tokenization spends against. Always measured in tokens, not characters.
- OOV (out-of-vocabulary): OOV = an input at inference time that the training vocabulary has no entry for; the problem subword tokenization makes go away by always being able to fall back to bytes.
Going deeper
- Neural Machine Translation of Rare Words with Subword Units (Sennrich, Haddow, Birch, 2016) — the paper that brought BPE into modern NLP.
- SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing (Kudo & Richardson, 2018).
- The OpenAI tiktoken repo — read the README and play with the encoder for ten minutes; it’s the fastest way to build intuition for what your prompts actually look like to the model.
- Andrej Karpathy’s “Let’s build the GPT tokenizer” video — a from-scratch walkthrough of BPE that makes the whole thing concrete.
A note on what I’m sure of: the algorithmic shape (BPE-style merges, byte-level fallback, deterministic encoding) and the practical consequences (cost, context, non-English overhead, the strawberry “r” family of bugs) are well-established. The relative quality and adoption of specific tokenizers shifts model-by-model and year-by-year; verify against the current model card rather than relying on memory.