
Why LLMs can't count the r's in 'strawberry'

A model that can write a sonnet stumbles on a question a five-year-old gets right. The reason isn't intelligence — it's that the model never sees the letters.

AI & ML · intro · May 2, 2026

Why it exists

You’ve probably tried it. You ask ChatGPT or Claude “how many r’s are in the word strawberry?” and watch a model that can debug your code, summarize a research paper, and write passable poetry confidently answer “two.” You correct it. It apologizes, recounts, and sometimes still gets it wrong. The screenshot has been a meme for years now, and it survives every model upgrade with surprising stubbornness.

The instinct is to call this a bug, or a sign that the model is “dumber than it looks.” Neither is quite right. The mistake isn’t in the model’s reasoning — it’s in what the model is allowed to look at. By the time the question reaches the network, the word “strawberry” isn’t there anymore. What arrives is a small handful of integers, and you’re asking it to count something inside a representation it doesn’t have.

This is the cleanest, most viral demonstration of a structural quirk in how every modern LLM reads text. It’s worth understanding because the same quirk shows up everywhere — arithmetic on long numbers, exact string edits, character-by-character transformations — and once you see it, half of the “weird LLM failure” genre stops looking weird.

Why it matters now

Every chatbot you use ships with this property. So does every coding agent, every customer-service bot, every “AI summarizer.” When users probe a model and find it failing on a kindergarten task, trust collapses faster than it should — because the failure looks like unreliability in general, when it’s actually a narrow, predictable artifact.

For people building on LLMs, this matters wherever a task quietly depends on character-level precision: exact string edits, character-by-character transformations, arithmetic on long numbers, anything where an off-by-one is a real bug.

The short answer

LLM letter-counting failure = tokenizer hides letters + no character-level grounding

The model never receives the word “strawberry” as ten letters. It receives a short sequence of integer IDs that each stand for a chunk of the word. To count r’s it would have to recover the spelling from those chunks — and it was never explicitly trained to do that, only to predict plausible next tokens.
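For contrast, the operation the question actually asks for is a one-liner for anything that receives characters:

```python
# Character-level counting: trivial when the letters are actually visible.
# The model never gets to run anything like this; the letters are gone
# before the first layer of the network sees the input.
print("strawberry".count("r"))  # 3
```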

How it works

Before any “thinking” happens, your text is run through a tokenizer — a deterministic preprocessing step that chops a string into pieces drawn from a learned vocabulary of ~50k–200k entries. Modern OpenAI models use a byte-level BPE tokenizer (the tiktoken library), and most other frontier models use close cousins.

For common English words, the tokenizer typically merges large chunks. “strawberry” gets split into a small number of subword pieces — often two or three, depending on the exact tokenizer and whether there’s a leading space. (I’m being deliberately vague about the exact split: it varies by model, and Anthropic in particular doesn’t fully publish Claude’s tokenizer, so any specific claim like “Claude splits it as X+Y+Z” would be guessing. You can verify the OpenAI split yourself in seconds with tiktoken or the public tokenizer playground.)
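If you want to see it yourself, here is a minimal sketch using tiktoken. The o200k_base encoding is the one recent OpenAI models use; the exact split and IDs will differ for other encodings and other vendors, which is exactly why no specific split is quoted here.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by recent OpenAI models

for text in ["strawberry", " strawberry"]:  # a leading space can change the split
    ids = enc.encode(text)
    pieces = [enc.decode_single_token_bytes(i) for i in ids]
    print(f"{text!r} -> {ids} -> {pieces}")
```

Whatever comes out, notice that it is a handful of integers, not ten letters.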

Whatever the exact pieces are, the important fact is what happens next: each piece becomes an integer ID, and the model only ever sees those integers. The string “strawberry” might enter the tokenizer, but what reaches the first transformer layer is something like [123, 4567, 890]. There are no letters in there. There is no r to count.
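To make “only ever sees those integers” concrete, here is a schematic of the first thing that happens inside the network: each ID just indexes a row of an embedding table. The sizes and IDs below are made up for illustration, not any real model’s values.

```python
import numpy as np

# Toy sizes for illustration; real models use ~100k-row tables and
# vectors thousands of dimensions wide.
vocab_size, d_model = 1_000, 8
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((vocab_size, d_model))

token_ids = [123, 456, 789]                     # stand-ins for the word's pieces
first_layer_input = embedding_table[token_ids]  # shape: (3, d_model)

# These vectors are all the model gets. Nothing in them is an "r", or a
# letter at all; any spelling knowledge has to come from patterns the
# model absorbed during training.
print(first_layer_input.shape)  # (3, 8)
```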

So when you ask “how many r’s are in strawberry?”, the model is being asked to answer a question about a representation it threw away at the door. The question makes sense to you, the human reading the prompt as characters. To the model, the prompt itself is a sequence of opaque chunk-IDs, and the word “strawberry” inside it is a couple of those chunks.

Why does the model so often get it almost right — answering “two” instead of throwing up its hands? Because somewhere in pretraining, sentences like “strawberry is spelled s-t-r-a-w-b-e-r-r-y” showed up. The model has fragmentary, indirect knowledge of how words spell. It can sometimes recall that knowledge, sometimes can’t, and sometimes recalls a slightly wrong version. So it produces a confident-sounding number that’s frequently off-by-one. This is the same machinery that produces hallucinations — fluent text generated from a partial, lossy memory — applied to a question whose true answer was never reliably in the training data in the form the model needs.

The seam underneath all of this is worth stating plainly.

The deep point: the strawberry question is a small, repeatable demonstration that the model and the user are looking at different objects. You see a string of letters. The model sees a sequence of subword IDs. Most of the time the gap doesn’t matter, because most language tasks don’t require seeing through the tokens. When the task does — counting, spelling, exact transformations — the gap is the whole story.

Going deeper

What I’m confident about: the mechanism (tokenizer hides letters, model sees IDs, character-level operations land outside the model’s input representation) is well-established. What I’m deliberately vague about: the exact token split for “strawberry” in any specific model — it depends on the tokenizer, and Claude’s tokenizer in particular isn’t fully public. If you want the precise split for a model you care about, run it through that model’s tokenizer rather than trusting a number from a blog post.