Hacks & Hallucinations: A Field Guide to AI's Weirdest Failures

a-gnt CommunityApril 12, 20265 min read

Why AI models hallucinate, where they break, and how to make them do strange things on purpose. The first post in a new series on the weird, broken, and fascinating edges of modern AI.

hacks-and-hallucinations ai-safety llm tutorial in-the-weeds

The thing nobody tells you

The first time an AI confidently told me a fact that didn't exist, I almost fell for it. It was a citation — a paper, an author, a year, a quote. All of it pristine. All of it fake.

That wasn't a bug. It was the system working exactly as designed.

Modern language models don't store facts the way a database stores rows. They store a web of statistical relationships between tokens — fragments of words that frequently appear together. When you ask one a question, it doesn't look anything up. It generates the next most plausible token, then the next, then the next, all the way to the end of the sentence. At every step, "plausible" means "this is the kind of thing that sounded right in the training data."

That process is hallucination by default. The surprise isn't that AI makes things up. The surprise is how often it's right.

This series is about the other times — when it's wrong, and why it's wrong. Not the clumsy mistakes (those are boring). The weird ones. The edges where the architecture shows through. The failures that teach you more about how LLMs work than any whitepaper will.

Three kinds of wrong

Over the last eighteen months, every AI failure I've seen in the wild falls into one of three buckets.

1. Structural failures

These come from how the model is built. They have nothing to do with training data or bad prompting. They're baked into the architecture itself.

The most famous example: ask a modern LLM how many Rs are in "strawberry." Many of them will say two. It's not stupidity — it's that the model never sees the letters "s-t-r-a-w-b-e-r-r-y." It sees tokens. Depending on the tokenizer, "strawberry" might be one token or three, but almost never ten letters. Asking it to count characters is like asking you to count the pixels in a printed page — the thing you're counting isn't even visible to you.

Structural failures are predictable. Once you understand where the architecture is thin, you can trigger them on demand. The Strawberry Test. The Time Paradox. The Arithmetic Mirror. Each one teaches you a specific truth about how the inside of a transformer works.

2. Training artifacts

These come from the training data. They're not bugs — they're echoes.

If the internet at scale says "the capital of Australia is Sydney" more often than "the capital of Australia is Canberra" (it doesn't, but bear with me), a naive model will confidently tell you Sydney. Models carry the biases, gaps, and errors of their training corpus like ghosts. They will confidently cite books that don't exist, because that's the shape a citation takes. They will give you a doctor's advice because training data is full of doctors' advice. They won't admit ignorance because training data rewards confidence.

Artifacts are learnable. A well-tuned model with good RLHF will suppress most of them. But the shape is still there, and you can surface it with the right prompt.

3. Adversarial manipulation

These are attacks. Someone — a user, an attacker, a mischievous prompt designer — makes the model do something it's not supposed to.

Prompt injection is the classic case: you trust user input, the input contains instructions disguised as content, the model follows them instead of yours. The attack works because the model has no way to tell data from commands. To the attention mechanism, it's all just tokens.

Adversarial manipulation is exploitable, and that makes it scarier than the first two. It's also the category where defense actually works — sanitize inputs, fence instructions, use structured prompts, never trust what a user pastes.

Why this series exists

If you build anything with AI in it — a chatbot, an agent, a writing assistant, a support tool — you need to know where it breaks. Not in the abstract. Concretely. Reproducibly. With the specific prompt that causes the specific failure.

The alternative is shipping something that works great in your demo and humiliates you in production. We've all seen it happen. A lawyer submitting a brief with fake citations. A customer service bot giving out coupon codes it invented. A support chatbot promising refunds it can't actually process.

Those aren't edge cases. They're Tuesdays.

The curriculum

Each post in this series picks one category of failure and goes deep:

[The Strawberry Test] — How tokenization sets the ceiling on what a model can count, and why asking an LLM to spell backward is a trick question.

[The Date Hallucination] — Why your AI thinks it's still 2023, what "knowledge cutoff" actually means, and how to force accurate temporal reasoning.

[Prompt Injection in the Wild] — Five real attacks, how they work, and the defense patterns that stop them. With code.

[The Repetition Death Spiral] — How LLMs get stuck in loops, why greedy decoding amplifies it, and the day OpenAI's API returned "poem poem poem" for an hour.

[Unicode Gremlins] — Zero-width joiners, bidi overrides, homoglyphs, and the invisible characters that can rewrite an AI's instructions.

We'll also release a pack of hack & hallucinate prompts — copy-paste demos you can run in any AI chat to see each failure yourself — and a set of defensive skills like scrub-unicode and spot-hallucination that you can drop into your workflow.

Two rules for reading along

One: don't be mad at the AI. These failures are not character flaws. They're consequences of architecture and training that are invariably tied to the things models do well. A model that never hallucinated would also never guess. A model immune to prompt injection would probably be worse at following instructions. The failures are the price of the abilities.

Two: try everything. The posts in this series are worth reading, but the prompts are worth running. Every single failure in this series is still reproducible in some form in at least one major frontier model as of this writing. Go try them. Get a feel for the edges. The only way to ship AI that survives contact with users is to know where the walls are.

See you in the weeds.

Share this post:

Ratings & Reviews

0.0

out of 5

0 ratings

No reviews yet. Be the first to share your experience.