Teju's Blog

Full stack engineer and AI architect. Notes from the work.


Prompt engineering: what prompts are and how to write good ones

“Prompt engineering” gets eye-rolled as a job title, fairly. As a skill it is just writing clear instructions to a system that takes everything you say literally and statistically. That is harder than it sounds, and getting better at it has a much higher payoff than people give it credit for.

What a prompt actually is

A prompt is the full text the model sees at the start of a turn. In a chat API, that text gets assembled from a few roles:

System message(persona and rules) Conversation history(prior user/assistant turns) Current user message(the task)

Each role gets wrapped with special tokens to mark the boundary. The model sees one long sequence. The roles are a convention, not a separate input channel. The mechanical details (and what to put where) are in the context post.

How phrasing shapes the output

Recall from the hallucinations post: at every step the model samples a token from a probability distribution conditioned on the prompt. The prompt is the lever. Same model, same weights, different prompt, different distribution.

Watch a zero-shot prompt first. The model has to guess the format you want.

Zero-shot. The distribution is spread across capitalised words and a stray 'The' that hints the model is about to write a sentence instead of a label.

Now with one example before it:

One example plus a constrained label set. The distribution sharpens, the format matches the example, the capitalisation matches the example.

Same model, same weights. The second prompt earns a sharper, lower-case answer in the expected format.

The patterns that move the needle

In rough order of impact:

Be specific

The biggest prompt-engineering mistake is asking for “a summary” when you want “three bullets, each under fifteen words, written for someone who knows X”. The model is happy to be specific. It is also happy to be vague. Pick one.

Bad:  Summarize this article.

Good: Summarize this article in three bullets. Each bullet is one sentence.
      The audience is a senior engineer who has not read the article and
      will not read it. Do not include preamble.

The “do not include preamble” line saves you the “Here is a summary of the article:” warmup that chat models love.

Show with examples, not just tell

A few-shot example is worth fifty words of explanation. The model is excellent at pattern matching. The second demo above shifts the output format with no rules, just one example.

This is in-context learning, the same mechanism explained in the context post. The practical translation: if you can write three good examples of what you want, you almost never need a long instruction.

Use structure to mark the parts

Models parse prompts much more reliably when you mark sections. Markdown headings, code fences, or XML-style tags all work. Anthropic’s docs recommend XML for Claude:

xml
<task>
Review this Go file for goroutine leaks.
</task>

<conventions>
- A leak is any goroutine that can outlive its parent context.
- Defer/cancel patterns are not leaks.
</conventions>

<file path="server.go">
package main
...
</file>

This is partly cargo-culted (the model does not literally understand XML) and partly real (the model has seen mountains of structured documents during training). Use what your provider recommends.

Constrain the output shape

“Reply with one word” is a constraint. “Reply with JSON matching this schema” is a stronger one. The stronger you constrain, the less the model can drift. Modern providers offer structured outputs via JSON schema or grammar (OpenAI structured outputs, Anthropic tool use, llama.cpp’s GBNF). Use them when the output is going somewhere mechanical.

When you want creative output, the constraint is “do not constrain”. Crank temperature, leave the format open, let the model surprise you.

Tell it what NOT to do

Models like to be helpful. Sometimes the help is the problem.

Do not apologise. Do not say "Certainly" or "I'd be happy to".
Do not summarise what I asked for. Just answer.

Two or three lines, real token savings, real tone control.

Put critical instructions at the end too

The “lost in the middle” effect (covered in the context post) means rules in the middle of a long prompt get forgotten. A short reminder at the end of the prompt is one of the highest-impact edits you can make:

[System prompt with rules]
[Long context, documents, etc.]
[User question]

Reminder: only respond using the documents above. If the answer is not
in them, say "I do not know".

Chain-of-thought, briefly

For problems that require reasoning (math, multi-step logic, planning), asking the model to “think step by step” before answering measurably improves accuracy. Most modern models do this implicitly when the task is complex, but the explicit cue still helps:

Solve the problem below. Think through your reasoning before giving the
final answer. Put the final answer on a line starting with "Answer:".

The cost is more tokens, which means more latency and money. The benefit is fewer wrong-but-confident answers. For trivial tasks it is overkill.

Tool-use prompts

When you give the model tools (functions it can call), the tool descriptions are prompts too. Treat them as carefully as the system prompt.

go
type Tool struct {
    Name        string
    Description string // <-- this is a prompt
    Schema      map[string]any
}

The description tells the model when to call the tool and what its arguments mean. A description like “Get data” leaves the model guessing. A description like “Look up a user by their email address. Returns the user’s display name and timezone. Use when the user mentions someone by name or email.” removes most of the ambiguity.

If you find yourself adding rules to the system prompt like “do not call the X tool when Y”, the right fix is almost always to improve the X tool’s description.

Iterating

The first draft of a prompt is the cheap part. Most of the work is in iterating on it.

yes no Write Run on 5 cases Pass? Save and move on Adjust prompt or examples

The “5 cases” is the load-bearing part. A small eval set is what turns prompt-tweaking from vibes into something you can compare versions on.

Minimum viable eval: a JSON file with 10-30 rows of {input, expected_shape}. Diff the model’s output against the expected shape, count passes. Run it after every prompt change. If you cannot define expected_shape, you do not yet understand the task well enough to prompt for it.


← all posts