Why LLMs can't count the r's in 'strawberry'
A model that can write a sonnet stumbles on a question a five-year-old gets right. The reason isn't intelligence — it's that the model never sees the letters.
Why it exists
You’ve probably tried it. You ask ChatGPT or Claude “how many r’s are in the word strawberry?” and watch a model that can debug your code, summarize a research paper, and write passable poetry confidently answer “two.” You correct it. It apologizes, recounts, and sometimes still gets it wrong. The screenshot has been a meme for years now, and it survives every model upgrade with surprising stubbornness.
The instinct is to call this a bug, or a sign that the model is “dumber than it looks.” Neither is quite right. The mistake isn’t in the model’s reasoning — it’s in what the model is allowed to look at. By the time the question reaches the network, the word “strawberry” isn’t there anymore. What arrives is a small handful of integers, and you’re asking it to count something inside a representation it doesn’t have.
This is the cleanest, most viral demonstration of a structural quirk in how every modern LLM reads text. It’s worth understanding because the same quirk shows up everywhere — arithmetic on long numbers, exact string edits, character-by-character transformations — and once you see it, half of the “weird LLM failure” genre stops looking weird.
Why it matters now
Every chatbot you use ships with this property. So does every coding agent, every customer-service bot, every “AI summarizer.” When users probe a model and find it failing on a kindergarten task, trust collapses faster than it should — because the failure looks like unreliability in general, when it’s actually a narrow, predictable artifact.
For people building on LLMs, this matters in a few specific places:
- Don’t ask the model to do character-level work directly. Counting letters, reversing strings, checking exact spelling, character-by-character edits. The model can fake it sometimes, but it’s the wrong tool. Use a code interpreter or a regex.
- “It got the easy thing wrong, can I trust the hard thing?” is a fair question with a non-obvious answer. A model that miscounts r’s may still be excellent at higher-level reasoning, because the two failure modes have different causes. The strawberry test isn’t a general intelligence test — it’s a tokenizer test.
- Reasoning models help, partially. Models trained to think out loud often catch this by spelling the word first, then counting. It’s not a fix to the input representation — it’s a workaround the model learned to apply. Sometimes it works; sometimes it doesn’t.
The short answer
LLM letter-counting failure = tokenizer hides letters + no character-level grounding
The model never receives the word “strawberry” as ten letters. It receives a short sequence of integer IDs that each stand for a chunk of the word. To count r’s it would have to recover the spelling from those chunks — and it was never explicitly trained to do that, only to predict plausible next tokens.
How it works
Before any “thinking” happens, your text is run through a tokenizer — a deterministic preprocessing step that chops the string into pieces drawn from a learned vocabulary of roughly 50k–200k entries. Modern OpenAI models use a byte-level BPE tokenizer (implemented in the tiktoken library), and most other frontier models use close cousins.
For common English words, the tokenizer typically merges large chunks. “strawberry” gets split into a small number of subword pieces — often two or three, depending on the exact tokenizer and whether there’s a leading space. (I’m being deliberately vague about the exact split: it varies by model, and Anthropic in particular doesn’t fully publish Claude’s tokenizer, so any specific claim like “Claude splits it as X+Y+Z” would be guessing. You can verify the OpenAI split yourself in seconds with tiktoken or the public tokenizer playground.)
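If you want to watch the lossy step happen, here’s a minimal sketch using OpenAI’s tiktoken library. The encoding names are ones tiktoken ships publicly (“o200k_base”, “cl100k_base”); swap in whichever you care about. The script asserts nothing about the split; it just prints whatever pieces your chosen encoding produces, including for the spelled-out version of the word that comes up again below.

```python
# Minimal tokenizer inspection with OpenAI's tiktoken (pip install tiktoken).
# "o200k_base" and "cl100k_base" are public encodings shipped with the library;
# other models use different tokenizers, so the split you see here only tells
# you about these encodings.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

for text in ["strawberry", " strawberry", "s t r a w b e r r y"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r:30} -> {len(ids)} tokens, ids={ids}, pieces={pieces}")

# Everything downstream of this step sees only the integer IDs. Counting the
# r's in "strawberry" means recovering the spelling from those integers.
```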
Whatever the exact pieces are, the important fact is what happens next: each piece becomes an integer ID, and those IDs are all the model ever sees. The string “strawberry” might enter the tokenizer, but what actually gets embedded and fed to the first transformer layer is something like [123, 4567, 890] (made-up numbers). There are no letters in there. There is no r to count.
So when you ask “how many r’s are in strawberry?”, the model is being asked to answer a question about a representation it threw away at the door. The question makes sense to you, the human reading the prompt as characters. To the model, the prompt itself is a sequence of opaque chunk-IDs, and the word “strawberry” inside it is a couple of those chunks.
Why does the model so often get it almost right — answering “two” instead of throwing up its hands? Because somewhere in pretraining, sentences like “strawberry is spelled s-t-r-a-w-b-e-r-r-y” showed up. The model has fragmentary, indirect knowledge of how words spell. It can sometimes recall that knowledge, sometimes can’t, and sometimes recalls a slightly wrong version. So it produces a confident-sounding number that’s frequently off-by-one. This is the same machinery that produces hallucinations — fluent text generated from a partial, lossy memory — applied to a question whose true answer was never reliably in the training data in the form the model needs.
A few seams worth seeing:
- It’s not really “hallucination” in the made-up-a-fact sense. The model isn’t fabricating a paper or inventing a quote. It’s failing at a perception task — what letters are in this token? — that it was never given the inputs to do well.
- Character-level models wouldn’t have this problem. Architectures that operate directly on bytes or characters (ByT5, Charformer, more recent byte-level transformers) see every letter. They pay for it in sequence length and compute. The case for going byte-native is real, but it hasn’t displaced tokenized models at the production frontier — and I don’t have a confident read on whether or when that changes.
- Tool use fixes it cleanly. Give the model a Python interpreter and ask it to count: "strawberry".count("r") returns 3. Done. The fix isn’t smarter weights; it’s letting the model offload the character-level operation to something that actually sees characters. (A short sketch after this list shows the same idea on a few related tasks.)
- Reasoning models partly mitigate it. Models trained to produce long chains of thought often handle this by first writing the word out letter-by-letter in their scratchpad — “s, t, r, a, w, b, e, r, r, y” — and then counting. Spelling out a word coaxes the tokenizer into emitting one-letter tokens (or close to it), which gives the rest of the forward pass actual letters to work with. It’s a behavioral workaround that lives in the chain of thought, not a fix to the input pipeline.
- The bug generalizes. “How many words in this paragraph?”, “reverse this string”, “what’s the 7th character?”, “do these two long numbers add correctly?” — all the same family. Operations that need to see the substrate beneath the tokens are operations the model is structurally bad at. Many of the famous “LLMs can’t do basic math” examples are really tokenizer artifacts in disguise: a 12-digit number gets split into a few chunks, and the model is asked to do digit-by-digit arithmetic on chunks that don’t line up with digits.
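For contrast, here is what this whole family of tasks looks like to something that actually sees characters. This is the substance of the tool-use fix: the model emits a couple of lines like these and reads back the result, instead of guessing from chunk IDs. The byte view at the end is also, roughly, what a byte-level model would receive as input: one integer per byte instead of one per subword.

```python
# Character-level operations that are trivial once the characters are visible.
# This is what the "hand it to a code interpreter" fix amounts to.
word = "strawberry"

print(word.count("r"))      # 3
print(word[::-1])           # "yrrebwarts"
print(word[6])              # the 7th character (index 6): "e"
print(len("how many words are in this sentence".split()))  # word count

# Exact arithmetic on long numbers is the same story: digits and carries,
# no subword chunks to misalign.
print(123456789012 + 987654321098)

# A byte-level view of the word: one integer per byte, which is roughly what
# a byte-level transformer receives as input.
print(list(word.encode("utf-8")))   # [115, 116, 114, 97, 119, 98, 101, 114, 114, 121]
```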
The deep point: the strawberry question is a small, repeatable demonstration that the model and the user are looking at different objects. You see a string of letters. The model sees a sequence of subword IDs. Most of the time the gap doesn’t matter, because most language tasks don’t require seeing through the tokens. When the task does — counting, spelling, exact transformations — the gap is the whole story.
Famous related terms
- Tokenization — tokenization = learned vocabulary + deterministic split into subword pieces. The preprocessing step that throws the letters away.
- BPE (Byte Pair Encoding) — BPE = greedy merge of the most-common adjacent pairs. The algorithm behind almost every modern LLM tokenizer. (A toy version of the merge loop appears after this list.)
- Hallucination — hallucination = next-token model + no built-in "I don't know" + a prompt the model can't actually answer. The strawberry miss is a tokenizer-flavored cousin of this.
- How to spot hallucinations — practical heuristics; the strawberry test is one of the cheapest red flags for “this answer wasn’t grounded in something the model could actually see.”
- Chain-of-thought — letting the model spell things out before answering. The standard mitigation, and how reasoning models partly route around the tokenizer.
- Tool use / code interpreter — tool use ≈ LLM + an external function it can call mid-generation. Hand off character-level work to something that actually sees characters. The clean fix.
- Character-level / byte-level models — byte-level model = transformer + bytes/characters as the input units. Architectures that skip tokenization entirely. No strawberry bug; longer sequences and higher compute cost.
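If “greedy merge of the most-common adjacent pairs” feels abstract, here is a toy sketch of the BPE training loop. It is nothing like a production tokenizer (no byte fallback, no pre-tokenization rules, a five-word corpus), but the core move is the same: count adjacent symbol pairs, fuse the most frequent pair into a new vocabulary entry, repeat.

```python
# Toy BPE training loop: repeatedly merge the most frequent adjacent pair.
# A sketch of the idea from Sennrich et al. (2016), not a real tokenizer --
# production BPE adds byte fallback, pre-tokenization, and 50k-200k merges.
from collections import Counter

corpus = ["strawberry", "blueberry", "raspberry", "berry", "straw"]
# Start from characters: every word is a tuple of single-character symbols.
words = [tuple(w) for w in corpus]

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(w, pair):
    out, i = [], 0
    while i < len(w):
        if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
            out.append(w[i] + w[i + 1])  # fuse the pair into one symbol
            i += 2
        else:
            out.append(w[i])
            i += 1
    return tuple(out)

merges = []
for _ in range(8):  # 8 merges is plenty for this tiny corpus
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = [merge_pair(w, pair) for w in words]

print(merges)  # the learned merge rules, in order
print(words)   # each word is now a handful of multi-letter chunks
```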
Going deeper
- Andrej Karpathy’s Let’s build the GPT tokenizer video — the clearest from-scratch walkthrough of BPE. After watching, the strawberry behavior stops feeling mysterious.
- The OpenAI tiktoken repo and the public tokenizer playgrounds — paste “strawberry” in and watch the split happen.
- Sennrich, Haddow, Birch, Neural Machine Translation of Rare Words with Subword Units (2016) — the paper that brought BPE into modern NLP.
What I’m confident about: the mechanism (tokenizer hides letters, model sees IDs, character-level operations land outside the model’s input representation) is well-established. What I’m deliberately vague about: the exact token split for “strawberry” in any specific model — it depends on the tokenizer, and Claude’s tokenizer in particular isn’t fully public. If you want the precise split for a model you care about, run it through that model’s tokenizer rather than trusting a number from a blog post.