Hacks & Hallucinations: How Many Rs in Strawberry?
The famous counting failure that reveals everything about how LLMs actually see text. Not a bug — a consequence of tokenization. With reproducible prompts and the surprisingly clever workarounds.
The question that broke the internet
In mid-2024, somebody asked ChatGPT how many Rs are in the word "strawberry." It said two.
The screenshot went viral. Ridiculed. Memed. Used as proof that LLMs are dumb, overhyped, not-actually-intelligent. For weeks, every AI skeptic had a new favorite example.
The problem is that "LLMs are dumb" is the least interesting explanation. The interesting one is much weirder: the model cannot see letters.
Let me show you why.
The model sees tokens, not characters
Every modern language model starts by breaking text into tokens — small fragments of words drawn from a fixed vocabulary of 30,000 to 200,000 entries. The exact algorithm varies (BPE, WordPiece, SentencePiece) but the idea is the same: turn text into a sequence of integers the neural net can actually process.
Here's what matters: a token is not a letter. A token might be a whole common word ("the"), a sub-word chunk ("ing"), a punctuation mark (","), or — for rarer words — several pieces glued together.
Let's tokenize "strawberry" using a real tokenizer you can run locally:
```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
print(tok.tokenize("strawberry"))
# ['straw', 'berry']
```
Two tokens. Not ten characters. Two tokens.
When you ask the model "how many Rs are in strawberry," what it actually receives is something like:
['how', ' many', ' R', 's', ' are', ' in', ' straw', 'berry', '?']
The model is being asked to count something (letter R) inside two opaque chunks (straw and berry) whose internal structure it was never given. The only way it can answer correctly is if — during training — it happened to see enough examples of "strawberry has three Rs" to memorize the fact as a token-to-token association.
Different tokenizers split the word differently. Different models were trained on different corpora with different associations. That's why some frontier models answer correctly and others don't. The correct answer is memorized, not computed.
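You can see the mechanism with a toy tokenizer, no neural net required. This is a deliberately simplified greedy longest-match sketch; real BPE learns its merges from data, and the vocabulary here is invented for illustration:

```python
# Toy illustration (not real BPE): greedy longest-match over a
# made-up vocabulary, to show why the model never sees letters.
VOCAB = {"straw", "berry", "how", "many", "are", "in", "r", "s", " "}

def toy_tokenize(text: str) -> list[str]:
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest match first
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character passes through
            i += 1
    return tokens

print(toy_tokenize("strawberry"))  # ['straw', 'berry']
```

Everything downstream of this step operates on those two opaque chunks; the letters inside them are gone.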
Reproducing it cleanly
Here's a prompt that reliably reveals the architectural ceiling across multiple model families:
```
For each of the following words, count the number of times the letter
"r" appears. Show your work character by character.

- strawberry
- embarrassment
- preferred
- correspondence
- transfer
```
Run this on a few different models. Watch how confidently they produce wrong answers for some, right answers for others, with no apparent pattern. The pattern is there — it's just invisible from the outside. It's which words happened to have their letter counts discussed in training.
Here's the tell: if you ask the model to "show your work character by character," it will often comply — emitting s-t-r-a-w-b-e-r-r-y and counting correctly. That's not because it suddenly got smarter. It's because once the word is spelled out with hyphens, each letter is its own token. Now the model can actually count the things you asked it to count.
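If you want ground truth to score the models against, the interpreter computes it trivially, because it iterates over actual characters rather than tokens:

```python
# Ground truth for the test prompt, computed character by character,
# which is exactly what a tokenizer-blind model cannot do.
words = ["strawberry", "embarrassment", "preferred",
         "correspondence", "transfer"]

for w in words:
    print(f"{w}: {w.count('r')} r(s)")
# strawberry: 3, embarrassment: 2, preferred: 3,
# correspondence: 2, transfer: 2
```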
The trick that works
This is the best workaround in practice:
```
Spell "strawberry" one letter at a time with spaces between each letter.
Then count the "r"s in the spaced-out version.
```
When the model emits s t r a w b e r r y, each letter becomes an individual token (most tokenizers split on whitespace). Now the "counting Rs" task is well-posed for the first time. The model can correctly produce "3."
You've essentially given the model a way to see what you wanted it to see. You re-tokenized the input into a form where the task makes sense.
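You can simulate outside the model exactly what the trick does inside it. Once the word is spaced out, the count falls out of a loop over standalone letters:

```python
word = "strawberry"
spelled = " ".join(word)  # 's t r a w b e r r y'

# With every letter standing alone, counting is a per-token task.
count = sum(1 for ch in spelled.split() if ch == "r")
print(spelled, "->", count)  # s t r a w b e r r y -> 3
```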
Other words that break LLMs
The strawberry failure has cousins. Every word whose letter structure wasn't heavily discussed in training is a landmine. A non-exhaustive list of words that trip GPT-4 class models:
- "mississippi" — how many Ss? (four)
- "bookkeeper" — three double-letter pairs: oo, kk, ee (the only common English word with three consecutive double letters)
- "raspberry" — how many letters total? Models undercount because "rasp" and "berry" merge
- "facetiously" — contains all five vowels in order (AEIOU)
- "queueing" — five vowels in a row; models drop letters when listing them
The pattern: whenever the question depends on character-level structure and not semantic meaning, the model is probably relying on memory, and memory is patchy.
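Every one of these claims is checkable with a few lines of character-level code, which is exactly the point; the interpreter sees the letters the model can't:

```python
def double_letter_pairs(word: str) -> list[str]:
    """Consecutive positions where two identical letters sit side by side."""
    return [word[i] + word[i + 1]
            for i in range(len(word) - 1) if word[i] == word[i + 1]]

def vowels_in_order(word: str) -> list[str]:
    """The word's vowels in order of appearance (y excluded)."""
    return [c for c in word if c in "aeiou"]

print("mississippi".count("s"))           # 4
print(double_letter_pairs("bookkeeper"))  # ['oo', 'kk', 'ee']
print(vowels_in_order("facetiously"))     # ['a', 'e', 'i', 'o', 'u']
```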
Why didn't they fix it?
They tried.
The obvious fix is a character-level tokenizer — one token per character. Nothing to merge, nothing to split, every letter visible to the model. It works. It also produces sequences roughly 5x longer than BPE, which means:
- 5x more compute per forward pass
- 5x more memory
- 5x slower inference
- 5x smaller effective context window
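For scale: the question from earlier came out to 9 BPE-style tokens, but at one token per character the same string is over three times longer. A character-level "tokenizer" in Python is just `list()`:

```python
# Character-level tokenization: every letter visible, but the
# sequence gets much longer than its BPE equivalent.
sentence = "how many Rs are in strawberry?"

char_tokens = list(sentence)       # one token per character
print(len(char_tokens), "tokens")  # 30 tokens (vs ~9 BPE tokens)
print(char_tokens.count("r"))      # 4 (one in "are", three in "strawberry")
```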
In 2024, researchers at Meta showed a variant called Byte Latent Transformer (BLT) that operates on raw bytes but uses a dynamic patching system to keep the effective sequence length close to BPE. Early results are promising. Whether frontier models adopt it is a different question — it's a massive architecture change.
Claude, GPT, and Gemini instead fight the problem at a different layer: they use tool use. When you ask Claude to count Rs in strawberry, a modern version will quietly reach for its Python interpreter, run "strawberry".count("r"), and give you the exact answer. The tokenizer is still the same. It just got a calculator.
If you're building on top of these models and care about character-level correctness, the lesson is: don't ask the model to count. Ask it to generate the code that counts. Let the interpreter do what the transformer can't see.
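That pattern is easy to wire up yourself. Instead of asking for the number, ask the model to emit something like this, then execute it in a sandbox (a minimal sketch of the counting side):

```python
def count_letter(word: str, letter: str) -> int:
    """Exact character-level count: what the interpreter sees
    that the transformer can't."""
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))   # 3
print(count_letter("Mississippi", "s"))  # 4
```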
The lesson
🍓 The Strawberry Test is not about how dumb AI is. It's about where AI thinks.
A model's attention mechanism can only operate on things it was tokenized to see. The representation is the ceiling. Whatever your input is chopped into — tokens, words, bytes, patches — is the finest grain your model can possibly reason about. Finer than that and it's literally blind.
Once you understand this, you stop being surprised by character-level failures and start designing around them. Any task involving:
- Counting characters, syllables, or vowels
- Reversing, rotating, or permuting letters
- Spelling out loud
- Rhyming
- Pig Latin and similar games
- Crosswords and anagrams
…should be handed to a tool, not the raw model.
The models are smart. Just not at the letter level. They were never looking at letters.
Try it yourself: the 🍓 Strawberry Test prompt is in our hack & hallucinate pack. Run it on whichever model you use most. Get a feel for where it passes and where it breaks. Then remember what you saw the next time someone asks you to ship a chatbot that counts words.