Teju's Blog

Full stack engineer and AI architect. Notes from the work.


Hallucinations in AI: what they are and how to prevent them

A model says the Eiffel Tower is in Berlin and people call it a hallucination, like the model was on something. It was not. It computed the next token, and the next token, and the next, and at no point did it ever check whether what it was saying was true. There is no fact-checker in there. There is only a probability distribution over what word comes next.

What “hallucination” really means

The word is convenient. It is also slightly wrong. A hallucination implies a system thought it was perceiving something real and was mistaken. A language model is not perceiving anything. It is sampling tokens from a probability distribution conditioned on the prompt and on its own previous output.

Harry Frankfurt’s On Bullshit gets at the better word. A liar knows the truth and works against it. A bullshitter does not care about the truth one way or the other. The output is shaped to be persuasive, plausible, well-formed. Whether it is true is incidental.

Models are not lying when they make things up. They do not have a notion of truth to lie against.

That said, “hallucination” is the term the industry uses, so I will use it for the rest of this post.

How a model writes a sentence

A token is a chunk of text. For the models you use day-to-day, a token is roughly four characters of English, give or take. The string “hallucination” is around four tokens. The string “the” is one.

When you send a prompt, the model:

  1. Splits it into tokens.
  2. Looks up each token’s embedding (a fixed-size vector, typically a few thousand dimensions).
  3. Runs the sequence through a stack of transformer layers, each of which mixes information across positions via attention.
  4. At the end, produces a vector of logits, one per token in the vocabulary (typically 30K to 200K entries).
  5. Softmaxes that vector into a probability distribution.
  6. Samples one token from the distribution.
  7. Appends it to the sequence and goes back to step 2.

That is the whole loop. Every word you have ever seen from a language model came from steps 2 through 7 repeated, one token at a time.
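Steps 4 to 6 can be sketched in a few lines of Go. This is a minimal illustration, not how any particular model is implemented: the four-token vocabulary, the prompt, and the logits are made up, and a real model computes the logits with its transformer stack.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// softmax turns raw logits into probabilities that sum to 1.
func softmax(logits []float64) []float64 {
	// Subtract the largest logit first for numerical stability.
	maxLogit := logits[0]
	for _, l := range logits {
		if l > maxLogit {
			maxLogit = l
		}
	}
	probs := make([]float64, len(logits))
	sum := 0.0
	for i, l := range logits {
		probs[i] = math.Exp(l - maxLogit)
		sum += probs[i]
	}
	for i := range probs {
		probs[i] /= sum
	}
	return probs
}

// sample picks one index from probs, using u in [0, 1) as the random draw.
func sample(probs []float64, u float64) int {
	cumulative := 0.0
	for i, p := range probs {
		cumulative += p
		if u < cumulative {
			return i
		}
	}
	return len(probs) - 1 // guard against floating-point rounding
}

func main() {
	// Hypothetical vocabulary and logits for the prompt
	// "The capital of France is". Both are assumptions for illustration.
	vocab := []string{" Paris", " Lyon", " Berlin", " a"}
	logits := []float64{6.0, 2.5, 1.0, 2.0}

	probs := softmax(logits)
	for i, tok := range vocab {
		fmt.Printf("%-10q %.3f\n", tok, probs[i])
	}
	next := vocab[sample(probs, rand.Float64())]
	fmt.Printf("sampled: %q\n", next)
}
```

Nothing in `softmax` or `sample` looks at whether the chosen token is true. The only input is the list of scores.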

Here is what that distribution looks like for a friendly prompt.

[Interactive demo: the next-token distribution, sampled one token at a time. Probabilities are illustrative; the shape matters more than the exact numbers.]

Notice that even when the answer is correct, the mechanism is “what is a plausible next token”. The model is not pulling Paris from a fact store. It is finding that Paris is the highest-probability continuation given the text it was trained on. Whether the highest-probability continuation happens to be true does not change the mechanism.

Why hallucinations are not a bug

The model has no truth function. It only has “what does the training distribution suggest comes next?” That distribution lives in the model’s weights, and training those weights optimizes for predicting the next token, which rewards coherence and fluency. Truthfulness is not directly optimized for.

When you ask “who wrote the paper introducing the transformer architecture?”, the model has seen the strings Vaswani et al. and Attention Is All You Need close to each other thousands of times during training. The continuation flows naturally. Correct answer, by accident of the data.

Ask “who wrote the 2024 paper on small-batch contrastive losses for vision-language alignment?”, and the model has seen far fewer references that fit. The continuation still has to come from somewhere. The model will produce a name (often a real researcher in the field), a venue, even a DOI, because those are the shapes that fit the prompt. It does not know which combination is real.

Here is the same mechanic applied to a fabrication:

[Interactive demo: a fabricated citation, generated token by token.]

Each token, on its own, is a reasonable continuation. The whole sentence is invented. Making the model bigger does not fix it, because a bigger model still runs the same loop.

The garden path

There is a second reason hallucinations stick around. Once the model commits to a token, the rest of the sentence is conditioned on that choice. Even if a slightly different choice would have led to a true statement, the model has no “undo” button mid-sentence.

Prompt: 'Attention Is All You Need was published at'

  NeurIPS (52%) → “in 2017” (true)
  ICML (31%) → “in 2018” (false but consistent)
  ICLR (10%) → “in 2018” (false but consistent)

If the model picks NeurIPS, the next-token distribution centers on 2017, which is correct. If it picks ICML, the distribution centers on a different year, which is wrong, but consistent with the wrong venue. The sentence reads fluently either way; the model has no way to notice mid-sentence that it took the wrong branch.

You can sometimes catch a model mid-mistake by asking it to verify the answer in a fresh prompt. The new prompt does not have the garden path baked in. The “second look” is often more reliable than the first answer, because the model is not committed to defending the earlier choice.

The flavors of hallucination

Five shapes I see most often, with the prompts that tend to trigger them:

Confabulated facts. Specifics that sound right but are wrong. Capital cities of unfamiliar regions, dates of historical events outside the famous ones, version numbers of libraries. Triggered by any “what is the X of Y” question where the model has thin training data.

Fabricated citations. Real-looking but invented papers, URLs, books. The author names are usually real people in the field. The titles are plausible. The DOIs are formatted correctly. Nothing about the citation actually exists. Triggered by asking for references the model has not memorized verbatim.

Hallucinated APIs. Method names that should exist but do not. df.unique_sorted() (not on pandas), ctx.WithTimeoutOrCancel() (not in Go’s context package), requests.post_json() (not in requests). Triggered by code generation tasks where the model is following patterns rather than recalling exact APIs.

Self-contradiction. Two statements in the same response that cannot both be true. The model says at the top that the function returns a list, and at the bottom that it returns a generator. Triggered by long outputs where the model has drifted from what it said earlier in the response.

Plausible-but-wrong reasoning. Step-by-step explanations that arrive at the wrong answer through logic that sounds correct. Triggered by math, logic, and counterfactuals. Chain-of-thought helps, but does not eliminate it: the model can produce a fluent argument for the wrong conclusion.

How to reduce them

You cannot eliminate hallucinations entirely. From an unattended production agent, expect a low single-digit percentage of responses to contain a factual error, not zero. The techniques below have moved the needle for me, in rough order of impact.

Grounding (RAG)

Retrieve relevant documents at query time and put them in the prompt. The model is much more likely to quote the documents than to invent.

User query → Embed query → Vector index → Top K docs → Prompt + docs → Model → Answer with citations

This is the single biggest thing you can do for factual tasks. A model with the right documents in context produces correct answers most of the time. A model without them produces plausible answers.

Two caveats. The model can still hallucinate when it interpolates across documents. And it can confidently quote a document that does not actually answer the question, so include document IDs in the answer and let users click through to check.
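The retrieval half of that pipeline can be sketched as follows. This is a toy under stated assumptions: the `Doc` type, the three-dimensional embeddings, and the prompt wording are all made up for illustration, and a linear scan stands in for a real vector index.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// Doc is a hypothetical indexed document with a precomputed embedding.
type Doc struct {
	ID        string
	Text      string
	Embedding []float64
}

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK returns the k docs most similar to the query embedding.
func topK(query []float64, docs []Doc, k int) []Doc {
	ranked := make([]Doc, len(docs))
	copy(ranked, docs)
	sort.SliceStable(ranked, func(i, j int) bool {
		return cosine(query, ranked[i].Embedding) > cosine(query, ranked[j].Embedding)
	})
	if k > len(ranked) {
		k = len(ranked)
	}
	return ranked[:k]
}

// buildPrompt puts the retrieved docs, tagged with their IDs, ahead of the
// question so the answer can cite them.
func buildPrompt(question string, docs []Doc) string {
	var b strings.Builder
	b.WriteString("Answer using only the sources below. Cite source IDs.\n\n")
	for _, d := range docs {
		fmt.Fprintf(&b, "[%s] %s\n", d.ID, d.Text)
	}
	fmt.Fprintf(&b, "\nQuestion: %s\n", question)
	return b.String()
}

func main() {
	// Toy embeddings; a real system would get these from an embedding model.
	docs := []Doc{
		{ID: "doc-1", Text: "Attention Is All You Need introduced the transformer.", Embedding: []float64{0.9, 0.1, 0.0}},
		{ID: "doc-2", Text: "pandas is a Python data analysis library.", Embedding: []float64{0.0, 0.2, 0.9}},
		{ID: "doc-3", Text: "Transformer layers mix information across positions via attention.", Embedding: []float64{0.8, 0.3, 0.1}},
	}
	query := []float64{1.0, 0.2, 0.0} // pretend embedding of the question
	fmt.Print(buildPrompt("Which paper introduced the transformer?", topK(query, docs, 2)))
}
```

Carrying the document IDs into the prompt is what makes the click-through check possible later.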

Constrained generation

Force the output to match a structure: a JSON schema, a regex, or a context-free grammar (llama.cpp’s GBNF, Outlines, the OpenAI structured outputs API). At each step, the candidate tokens are filtered down to only those consistent with the grammar, and the sampler picks from what is left.

go
type BookCitation struct {
    Title  string `json:"title"`
    Author string `json:"author"`
    Year   int    `json:"year"`
    ISBN   string `json:"isbn"` // pattern: ^97[89]\d{10}$
}

The model still invents the content. The grammar makes sure the content has the right shape, which is sometimes enough (the ISBN regex makes the model produce something that at least looks like an ISBN). For real correctness, you still need grounding.
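The per-token filtering can be sketched as a mask over the candidate tokens followed by renormalization. This is an illustration only: the `Candidate` type, the probabilities, and the digits-only rule (standing in for the grammar state while filling the year field) are assumptions, not the internals of any of the libraries named above.

```go
package main

import "fmt"

// Candidate is one possible next token with its model-assigned probability.
type Candidate struct {
	Token string
	Prob  float64
}

// constrain keeps only the candidates the grammar allows and renormalizes so
// the surviving probabilities sum to 1. It returns nil if nothing survives.
func constrain(cands []Candidate, allowed func(string) bool) []Candidate {
	var kept []Candidate
	total := 0.0
	for _, c := range cands {
		if allowed(c.Token) {
			kept = append(kept, c)
			total += c.Prob
		}
	}
	if total == 0 {
		return nil
	}
	for i := range kept {
		kept[i].Prob /= total
	}
	return kept
}

// isDigits stands in for a grammar state that accepts only digits.
func isDigits(tok string) bool {
	if tok == "" {
		return false
	}
	for _, r := range tok {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

func main() {
	// Made-up candidates for the position where the year should go.
	cands := []Candidate{
		{Token: "2017", Prob: 0.40},
		{Token: "around", Prob: 0.30},
		{Token: "2018", Prob: 0.20},
		{Token: "\"", Prob: 0.10},
	}
	for _, c := range constrain(cands, isDigits) {
		fmt.Printf("%-6s %.3f\n", c.Token, c.Prob)
	}
}
```

The mask removes “around” and the quote character, and the two years split the remaining probability two to one. It says nothing about which year is right.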

Tool use

Stop asking the model to recall, start asking it to look up. Give it a search tool, a read_docs tool, a query_db tool. The model decides what to fetch; the fetched content is real.

This works because the model is not bad at deciding what to look up. It is bad at remembering. Give it the right interface and the failure mode flips from “wrong answer” to “could not find it”, which is a much better failure mode.
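A minimal sketch of that flipped failure mode, assuming a hypothetical `read_docs` tool backed by an in-memory map. The `Tool` type, the dispatcher, and the map contents are illustrative; a real tool would hit a search index, documentation site, or database.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is what a tool returns when it has nothing, instead of guessing.
var ErrNotFound = errors.New("not found")

// Tool is a lookup the model can call by name with one string argument.
type Tool func(arg string) (string, error)

// docs is a hypothetical stand-in for real documentation.
var docs = map[string]string{
	"context.WithTimeout": "func WithTimeout(parent Context, timeout time.Duration) (Context, CancelFunc)",
}

// readDocs returns the stored entry for symbol, or ErrNotFound.
func readDocs(symbol string) (string, error) {
	if text, ok := docs[symbol]; ok {
		return text, nil
	}
	return "", ErrNotFound
}

var tools = map[string]Tool{"read_docs": readDocs}

// callTool dispatches a tool call the model asked for.
func callTool(name, arg string) (string, error) {
	tool, ok := tools[name]
	if !ok {
		return "", fmt.Errorf("unknown tool %q", name)
	}
	return tool(arg)
}

func main() {
	for _, symbol := range []string{"context.WithTimeout", "context.WithTimeoutOrCancel"} {
		result, err := callTool("read_docs", symbol)
		if errors.Is(err, ErrNotFound) {
			fmt.Printf("%s: could not find it\n", symbol)
			continue
		}
		if err != nil {
			fmt.Printf("%s: %v\n", symbol, err)
			continue
		}
		fmt.Printf("%s: %s\n", symbol, result)
	}
}
```

The invented method from the earlier list comes back as “could not find it” rather than as a confident signature.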

Self-evaluation

After generating an answer, ask the model (or, better, a different model) to verify it against the source material. This catches confidently-wrong answers maybe 30 to 60 percent of the time in my use.

PROMPT:  Given the answer below, identify any claim not supported
         by the source documents. Quote the unsupported claim verbatim.

ANSWER:  {answer}
SOURCES: {sources}

Self-evaluation is cheap to add and easy to over-trust. The verifier is also a language model. It can hallucinate too.
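One cheap guard follows from the prompt itself. It asks for verbatim quotes, so any quote the verifier returns that does not appear in the answer is itself made up. The sketch below fills the template and applies that check. The function names and the shape of the verifier's output (a list of quoted strings) are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// verifyTemplate mirrors the verification prompt shown above.
const verifyTemplate = `Given the answer below, identify any claim not supported
by the source documents. Quote the unsupported claim verbatim.

ANSWER:  %s
SOURCES: %s`

// buildVerifyPrompt fills the template with the answer and its sources.
func buildVerifyPrompt(answer string, sources []string) string {
	return fmt.Sprintf(verifyTemplate, answer, strings.Join(sources, "\n---\n"))
}

// unverifiableQuotes returns the quotes that do not occur verbatim in the
// answer, which means the verifier did not really quote them.
func unverifiableQuotes(answer string, quotes []string) []string {
	var bad []string
	for _, q := range quotes {
		if !strings.Contains(answer, q) {
			bad = append(bad, q)
		}
	}
	return bad
}

func main() {
	answer := "The first Linux kernel was released in 1991 by Linus Torvalds."
	sources := []string{"Linus Torvalds announced Linux in 1991."}
	fmt.Println(buildVerifyPrompt(answer, sources))

	// Hypothetical verifier output: one real quote, one invented one.
	quotes := []string{"released in 1991", "released in 1992"}
	fmt.Println("not in the answer:", unverifiableQuotes(answer, quotes))
}
```

This only checks that the verifier quoted the answer faithfully. Whether the quoted claim is actually unsupported still depends on the verifier being right.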

Lower temperature

Temperature controls the sharpness of the sampling distribution. Temperature 0 is greedy decoding: always pick the highest-probability token. Lower temperature reduces creative deviations but does not eliminate hallucinations, because the highest-probability token can itself be wrong (see the fabricated-citation demo above).

A useful default: temperature 0 for anything that should be reproducible (code, structured output, citations). Temperature 0.7 or above for brainstorming, drafting, creative tasks.
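To see what temperature does to the numbers, here is a small sketch that divides the logits by the temperature before the softmax. The logits are made up for illustration. Temperature 0 is handled as a plain argmax, since dividing by zero is not defined.

```go
package main

import (
	"fmt"
	"math"
)

// softmaxWithTemperature divides each logit by temperature before the
// softmax. Temperature must be greater than 0.
func softmaxWithTemperature(logits []float64, temperature float64) []float64 {
	maxLogit := logits[0]
	for _, l := range logits {
		if l > maxLogit {
			maxLogit = l
		}
	}
	probs := make([]float64, len(logits))
	sum := 0.0
	for i, l := range logits {
		probs[i] = math.Exp((l - maxLogit) / temperature)
		sum += probs[i]
	}
	for i := range probs {
		probs[i] /= sum
	}
	return probs
}

// greedy returns the index of the largest logit: the temperature-0 case.
func greedy(logits []float64) int {
	best := 0
	for i, l := range logits {
		if l > logits[best] {
			best = i
		}
	}
	return best
}

func main() {
	logits := []float64{2.0, 1.0, 0.5} // hypothetical scores for three tokens
	for _, t := range []float64{0.2, 1.0, 2.0} {
		fmt.Printf("T=%.1f  %.3f\n", t, softmaxWithTemperature(logits, t))
	}
	fmt.Println("greedy pick:", greedy(logits))
}
```

Lowering the temperature piles more probability onto whichever token already scores highest. It never changes which token that is, so a wrong top token stays on top.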

Smaller, narrower prompts

A short, specific prompt with one task hallucinates less than a long, multi-part prompt asking the model to “summarize, then analyze, then suggest follow-ups, then…” The model has fewer chances to drift on a task it can hold in its head.

If you are seeing hallucinations and you have a 3000-token system prompt, the first thing to try is making it smaller.

Cite-as-you-generate

Have the model output its answer interleaved with source IDs:

The first Linux kernel was released in 1991 [src:1] by Linus Torvalds [src:1]
while he was a student at the University of Helsinki [src:2].

This makes verification mechanical. It also has a subtle effect on the generation itself: the model is conditioned at every step to produce a claim that has a source, which biases it toward claims it can ground.
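A sketch of the mechanical part, assuming the `[src:N]` marker format from the example above. The function names are hypothetical, and the check only confirms that every cited ID refers to a source that was actually supplied. It does not confirm that the source supports the claim.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// srcMarker matches inline markers such as [src:2].
var srcMarker = regexp.MustCompile(`\[src:(\d+)\]`)

// citedIDs returns the distinct source IDs cited in answer, in order of
// first appearance.
func citedIDs(answer string) []int {
	seen := map[int]bool{}
	var ids []int
	for _, m := range srcMarker.FindAllStringSubmatch(answer, -1) {
		id, err := strconv.Atoi(m[1])
		if err != nil {
			continue // digit run too long for an int; skip it
		}
		if !seen[id] {
			seen[id] = true
			ids = append(ids, id)
		}
	}
	return ids
}

// unknownIDs returns cited IDs that are not among the supplied sources.
func unknownIDs(answer string, supplied map[int]bool) []int {
	var missing []int
	for _, id := range citedIDs(answer) {
		if !supplied[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	answer := "The first Linux kernel was released in 1991 [src:1] by Linus Torvalds [src:1] " +
		"while he was a student at the University of Helsinki [src:2]."
	supplied := map[int]bool{1: true} // pretend only source 1 was in the prompt
	fmt.Println("cited:", citedIDs(answer))
	fmt.Println("cited but never supplied:", unknownIDs(answer, supplied))
}
```

A citation to a source that was never in the prompt is a fabricated citation caught before a human has to read anything.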

Defensive UX

A confident wrong answer is much less harmful when the user can click through to the source it came from. Cite every factual claim. Render confidence as a visible state instead of burying it in the prose. Make “I do not know” a valid response rather than a failure mode. Do not lean on the model’s own self-reports: asking “are you sure?” sometimes catches a mistake, but the same machine that produced the wrong answer is the one being asked.

