Context in AI: what it is and how to use it

A model says the Eiffel Tower is in Berlin and we call it a hallucination. A model says the Eiffel Tower is in Berlin after you spent three paragraphs telling it about your trip to Berlin, and we have to admit it was sort of our fault. The first kind was the subject of the last post. This one is about the second kind, and more broadly about context: the buffer the model is writing from, what fits in it, and how the shape of what you put in there changes what comes out.

What context actually is

Context is everything the model can see when it picks the next token: the system prompt, the conversation history, any retrieved documents, the current user message, and the assistant’s own in-progress reply, all concatenated into one long token sequence. The model has no other working memory. If it is not in the context, it does not exist for that token.

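A typical chat prompt has this shape: a system prompt first, then the conversation history, usually any retrieved documents, and finally the current user message.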

All of this is one flat sequence to the model. The “fields” are not really fields. They are a convention you (or the chat template) maintain with special tokens like <|im_start|>system to mark transitions. The model is just reading.
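To make the flattening concrete, here is a minimal sketch of a ChatML-style template. The `<|im_start|>` and `<|im_end|>` markers follow the convention mentioned above; the `render_chat` helper and the message contents are hypothetical, and real chat templates differ from model to model.

```python
# Sketch: flatten role-tagged messages into one string, ChatML-style.
# render_chat is a hypothetical helper, not any library's API.

def render_chat(messages: list[dict[str, str]]) -> str:
    """Concatenate messages into a single prompt string.

    Each message becomes <|im_start|>{role}\n{content}<|im_end|>\n, and the
    string ends with an open assistant turn for the model to continue.
    """
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    # Leave the assistant turn open: generation continues from here.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Where is the Eiffel Tower?"},
]
prompt = render_chat(messages)
print(prompt)
```

What the model receives is the tokenized form of that one string. The roles survive only as marker tokens inside it.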

How context shapes the next token

Recall the loop from the hallucinations post: at every step, the model produces a probability distribution over the next token, samples one, appends it, repeats. The distribution is conditioned on the entire context. Change the context, change the distribution.

Here is what that looks like in practice. The same final question with different framing.

Without an example:

Default behaviour for an arithmetic prompt: pick a digit.

With one example added before it:

One in-context example shifts the distribution dramatically toward the same format.

This is in-context learning. The model was not retrained between the two demos. The weights are identical. The only difference is what was in the prompt, and the difference at the output is dramatic. A few tokens of demonstration are doing the work of a fine-tune.
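As a rough sketch of how such a pair of prompts might be assembled: the `Q:`/`A:` layout, the example question, and the `build_prompt` helper are all illustrative assumptions, not the exact prompts used in the demos.

```python
from typing import Optional

# Sketch: the same question with and without one in-context demonstration.
# The prompt wording and the "Q:/A:" format are illustrative assumptions.

def build_prompt(
    question: str,
    examples: Optional[list[tuple[str, str]]] = None,
) -> str:
    """Prepend zero or more (question, answer) demonstrations to a question."""
    lines = []
    for example_question, example_answer in examples or []:
        lines.append(f"Q: {example_question}")
        lines.append(f"A: {example_answer}")
    lines.append(f"Q: {question}")
    # End on an open answer slot for the model to fill in.
    lines.append("A:")
    return "\n".join(lines)


zero_shot = build_prompt("What is 7 + 5?")
one_shot = build_prompt(
    "What is 7 + 5?",
    examples=[("What is 2 + 3?", "The answer is five.")],
)
print(zero_shot)
print("---")
print(one_shot)
```

Both prompts end in the same two lines. The only change is the demonstration in front, which is what nudges the model toward answering in a spelled-out sentence instead of a bare digit.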

The flip side: noise in the context can pull the distribution toward unrelated material.

The prior paragraph primed 'Berlin' as a high-probability token. The model still gets it right here, but much less confidently than it should.

The model lands on Paris, but the gap between Paris and the wrong answer is much narrower than it would have been with a clean prompt. Push the priming a little harder, give the model a less famous landmark, and it would have flipped. Context is double-edged. It is the lever you use to steer the model, and it is the thing that pulls the model off course when you fill it with the wrong stuff.

Why the context window has a limit

Transformer attention is roughly $O(n^2)$: every token attends to every other token. Double the context length, quadruple the compute. The math post walks through why; the $QK^T$ multiplication is the culprit. There are real engineering tricks that ease the pain (Flash Attention, sliding-window attention, ring attention), but the cost is real, and it is why calls to even 1M-token models cost significantly more when you actually fill the window.
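A back-of-the-envelope sketch of the quadratic growth, counting entries in the $n \times n$ score matrix for a single head in a single layer. It deliberately ignores causal masking, batching, and the optimizations listed above, and the token counts are arbitrary.

```python
# Sketch: count query-key pairs in full self-attention for one head, one layer.
# This ignores causal masking, batching, and kernel-level optimizations.

def attention_pairs(n_tokens: int) -> int:
    """Number of entries in the n x n score matrix QK^T."""
    return n_tokens * n_tokens


for n in (8_000, 16_000, 32_000):
    print(f"{n:>6} tokens -> {attention_pairs(n):>13,} score entries")

# Doubling the length quadruples the number of entries.
ratio = attention_pairs(16_000) / attention_pairs(8_000)
print(ratio)  # 4.0
```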

There is also a softer limit: models were trained on sequences up to some length. Going past that without position-extension tricks (RoPE scaling, ALiBi, and friends) produces unpredictable behavior. A model trained on 8K tokens does not magically know what to do with 100K.

And there is a third, less obvious limit. Even when the window is huge, the model’s effective context is often much smaller than the advertised number. That is the next section.

Lost in the middle

Liu et al. (2023), “Lost in the Middle”, showed that LLMs recall information from the start and end of a long context much better than from the middle. The accuracy curve looks like a U: high at both ends, valley in between.

For practical purposes: if you stuff 20 retrieved documents into the prompt and the right answer is in document 10, the model is more likely to miss it than if the answer were in document 1 or document 20. A bigger window does not by itself make the effect go away.

Two operational consequences. Put critical information at the start and the end. And keep the context lean. Every extra token in the middle pushes the important stuff toward the recall valley.
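One way to act on the first point is to reorder retrieved documents so the strongest candidates sit at the edges. This is a minimal sketch that assumes the input is already sorted most-relevant-first by some retriever; the `reorder_for_edges` name and the sample document labels are hypothetical.

```python
# Sketch: place the highest-ranked documents at the edges of the context.
# Input is assumed to be sorted most-relevant-first by some retriever.

def reorder_for_edges(docs_by_relevance: list[str]) -> list[str]:
    """Alternate documents between the front and the back of the list.

    Rank 1 goes first, rank 2 goes last, rank 3 second, rank 4
    second-to-last, and so on, so the weakest documents land in the middle.
    """
    front, back = [], []
    for rank, doc in enumerate(docs_by_relevance):
        if rank % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]


ranked = ["d1", "d2", "d3", "d4", "d5"]
print(reorder_for_edges(ranked))  # ['d1', 'd3', 'd5', 'd4', 'd2']
```

Reordering only helps with placement. It does nothing about the second point, so trimming the list first is still worth doing.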

Context engineering

A 200K context window does not mean you should fill it. The model’s effective context is shorter than the advertised size for any task that requires integrating across the whole window, and adding marginally relevant material can degrade accuracy on the actual task. When an agent is failing, the instinct is to pile more into the prompt; the fix is more often to remove the stuff that is not pulling its weight. The tactics below help you put less in, more usefully, in rough order of impact.

(Context engineering is about what sits in the prompt. The sibling craft of how to phrase the parts you write is what most people mean by prompt engineering; the two overlap and reinforce each other.)

Put structure in

The model parses your prompt better when it has clear sections. Markdown headings, code fences, or XML-style tags all help. Anthropic explicitly recommends XML for Claude:

<conventions>
- Use sync.WaitGroup, not raw goroutines, unless you can explain why.
- Always wrap returned errors with %w.
- Prefer named return values for functions over 20 lines.
</conventions>

<file path="server/handler.go">
package server
...
</file>

<task>
Review the file against the conventions. List violations with line numbers.
</task>

The model has seen lots of structured documents during training. Structure helps it find what it needs. As a bonus, structure helps you tell sections apart in your own logs.

Sandwich your instructions

Put critical instructions at the start AND the end of the prompt. Especially the end. The last thing the model read is the freshest in its working memory.

You are a code reviewer for a Go project. Follow the conventions below strictly.

<conventions>...</conventions>

<file>...</file>

Reminder: only flag violations of the conventions above. Do not suggest
stylistic preferences that are not in the conventions.

The closing reminder is doing real work. Without it, the model often improvises extra suggestions you did not ask for, because the conventions are now buried 2000 tokens back.
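If you build prompts in code, the sandwich is easy to enforce in one place. A minimal sketch, reusing the tags and wording from the example above; `build_review_prompt` and its parameters are hypothetical names.

```python
# Sketch: assemble a prompt with tagged sections and a closing reminder.
# Tag names and wording mirror the example above; the helper is hypothetical.

def build_review_prompt(conventions: str, file_path: str, file_body: str) -> str:
    """Instruction first, tagged context in the middle, reminder last."""
    opening = (
        "You are a code reviewer for a Go project. "
        "Follow the conventions below strictly."
    )
    reminder = (
        "Reminder: only flag violations of the conventions above. "
        "Do not suggest stylistic preferences that are not in the conventions."
    )
    sections = [
        opening,
        f"<conventions>\n{conventions}\n</conventions>",
        f'<file path="{file_path}">\n{file_body}\n</file>',
        reminder,
    ]
    return "\n\n".join(sections)


prompt = build_review_prompt(
    conventions="- Always wrap returned errors with %w.",
    file_path="server/handler.go",
    file_body="package server",
)
print(prompt)
```

Because the reminder is appended by the builder, it stays at the end no matter how large the file in the middle gets.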

Just-in-time retrieval

Do not dump 50 documents into the context “just in case”. Give the model a tool and let it fetch what it needs, when it needs it.

This trades a few extra round trips for a much tighter context. The model’s attention is not spread across noise. Latency goes up a bit. Accuracy usually goes up a lot.
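A minimal sketch of the idea, with retrieval exposed as a function the model can call. Everything here is an assumption made for illustration: the in-memory `CORPUS`, the `search_docs` tool, the word-overlap scoring, and the shape of the tool-call dictionary. Real tool-calling APIs define their own schemas, and the call below is hard-coded where a model would normally emit it.

```python
# Sketch: expose retrieval as a tool instead of pre-loading every document.
# The corpus, tool name, and call format are hypothetical.

CORPUS = {
    "deploy.md": "Deploys run from the main branch via the release pipeline.",
    "oncall.md": "The on-call rotation changes every Monday at 09:00 UTC.",
    "style.md": "Always wrap returned errors with %w.",
}


def search_docs(query: str, limit: int = 1) -> list[str]:
    """Return up to `limit` documents sharing the most words with the query."""
    query_words = set(query.lower().split())
    scored = []
    for name, text in CORPUS.items():
        overlap = len(query_words & set(text.lower().split()))
        if overlap:
            scored.append((overlap, name, text))
    # Highest overlap first; documents with no overlap are never returned.
    scored.sort(reverse=True)
    return [f"{name}: {text}" for _, name, text in scored[:limit]]


TOOLS = {"search_docs": search_docs}


def handle_tool_call(call: dict) -> list[str]:
    """Dispatch a tool call of the form {"name": ..., "arguments": {...}}."""
    return TOOLS[call["name"]](**call["arguments"])


# In a real loop the model would emit this call; here it is hard-coded.
result = handle_tool_call(
    {"name": "search_docs", "arguments": {"query": "when does on-call rotation change"}}
)
print(result)
```

Only the one matching document enters the context. The other two never cost a token.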

Strip noise before you put it in

If a document is five pages and one paragraph is relevant, summarize or extract that paragraph before putting it in the prompt. Run a smaller, cheaper model over the source first (“does this passage answer the question?”). Forward only the useful parts.

Every irrelevant token is one your important tokens have to compete with.
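A sketch of that pre-filtering step. The relevance check is injectable so a small model can be plugged in; the default `keyword_relevance` is a toy word-overlap heuristic standing in for that model, and the stopword list, threshold, and sample passages are all made up for illustration.

```python
from typing import Callable

# Sketch: keep only passages that a cheap relevance check accepts.
# keyword_relevance stands in for the "smaller, cheaper model"; it is a toy
# heuristic, not a recommendation.

STOPWORDS = {"the", "is", "when", "a", "for", "by"}


def content_words(text: str) -> set[str]:
    """Lowercase, strip basic punctuation, and drop stopwords."""
    return {word.strip(".,?!") for word in text.lower().split()} - STOPWORDS


def keyword_relevance(question: str, passage: str) -> bool:
    """Toy check: at least two content words in common."""
    return len(content_words(question) & content_words(passage)) >= 2


def filter_passages(
    question: str,
    passages: list[str],
    is_relevant: Callable[[str, str], bool] = keyword_relevance,
) -> list[str]:
    """Return the passages for which is_relevant(question, passage) is true."""
    return [p for p in passages if is_relevant(question, p)]


passages = [
    "The cache is invalidated when the config file changes.",
    "Our team offsite is planned for the spring.",
    "Cache entries expire after ten minutes by default.",
]
kept = filter_passages("When is the cache invalidated?", passages)
print(kept)
```

In a real pipeline `is_relevant` would wrap a call to the cheaper model, and the filter would stay the same.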

Summarize long histories

For chats with many turns, do not keep the entire history. After N turns, replace the older middle of the conversation with a summary and keep the most recent K turns verbatim. It is a classic sliding window with a summary pinned at the front.

The summary is short, dense, deterministic. It loses detail but keeps the gist so the model does not lose the thread.
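A minimal sketch of that compaction step. It assumes `turns` holds only the conversation (the system prompt is kept separately), that `max_turns` and `keep_recent` play the roles of N and K, and that `summarize` is a callback you supply. The `placeholder_summarize` function is a stand-in; a real one would call a model.

```python
from typing import Callable

Turn = dict[str, str]

# Sketch: summarize older turns and keep the most recent ones verbatim.
# max_turns (N), keep_recent (K), and the summarize callback are assumptions.

def compact_history(
    turns: list[Turn],
    summarize: Callable[[list[Turn]], str],
    max_turns: int = 6,
    keep_recent: int = 4,
) -> list[Turn]:
    """Return turns unchanged if short, else [summary] + last keep_recent turns.

    keep_recent is assumed to be at least 1.
    """
    if len(turns) <= max_turns:
        return turns
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary_turn = {
        "role": "system",
        "content": "Summary of earlier conversation: " + summarize(older),
    }
    return [summary_turn] + recent


def placeholder_summarize(older: list[Turn]) -> str:
    """Stand-in summarizer: report how many turns were folded away."""
    return f"{len(older)} earlier turns omitted."


history = [{"role": "user", "content": f"message {i}"} for i in range(10)]
compacted = compact_history(history, placeholder_summarize)
print(len(compacted))  # 5
```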

System prompt hygiene

A 3000-token system prompt is rarely better than a 300-token one. Rules buried deep in a long system prompt tend to be followed less reliably than rules near the top. If your system prompt is enormous and the model keeps ignoring rules, the first thing to try is making it smaller.

If you must have a long system prompt, sandwich it too: critical rules at the very top, critical rules at the very bottom, examples and background in between.

Context vs parametric memory

The model has two kinds of “memory”:

Parametric memory lives in the model’s weights. Vast (billions of parameters’ worth of patterns, sometimes more), lossy (it cannot reliably quote things verbatim), and frozen (you cannot edit it at inference time; changing it means fine-tuning or retraining).

Context memory is what is in the current prompt. Precise (whatever you put in is what the model sees), bounded (by the context window, from a few thousand tokens up to around a million depending on the model), and ephemeral (gone once the call ends unless you send it again).

Useful agentic systems lean on both. Parametric memory for general reasoning and language. Context memory, filled by RAG and tool calls and conversation history, for the current facts and the specifics of the task.

When a model is getting facts wrong, the first thing to change is what sits in front of it at runtime; reaching for fine-tuning to “make it remember” is almost always the wrong instinct.