Chain of thought: why thinking out loud helps a model think

A language model that answers a math problem in one token usually gets it wrong. The same model, given the same problem, asked to “think step by step before answering”, usually gets it right. The model did not get smarter between the two prompts. It just had more tokens to think with.

That gap is what chain-of-thought prompting exploits, and it is also the seed of the new generation of reasoning models.

The original observation

Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (2022), showed that adding worked examples of step-by-step reasoning to a prompt substantially improves a model’s accuracy on arithmetic, commonsense, and symbolic reasoning problems.

The simplest version, with one example:

Q: There are 15 trees in the grove. Grove workers planted some today.
   After planting, there are 21 trees. How many did they plant?
A: There were 15 trees before. After planting, there are 21.
   So they planted 21 - 15 = 6 trees. The answer is 6.

Q: Tom had 23 apples. He ate 7 and gave 4 to his sister. How many does
   he have left?
A:

The first Q/A is a demonstration. The model, primed by it, produces a step-by-step answer for the second question instead of jumping straight to a number.
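A prompt in this shape is easy to assemble programmatically. The following is a minimal sketch, not a reference implementation: the `Demo` class and `build_few_shot_prompt` function are hypothetical names introduced here, and the Q/A layout simply mirrors the example above.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One worked example: a question plus its step-by-step answer."""
    question: str
    reasoning: str  # the full answer text, ending with "The answer is ..."


def build_few_shot_prompt(demos: list[Demo], question: str) -> str:
    """Lay out the demonstrations as Q/A pairs, then leave the final A: open."""
    blocks = [f"Q: {d.question}\nA: {d.reasoning}" for d in demos]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


trees = Demo(
    question=(
        "There are 15 trees in the grove. Grove workers planted some today. "
        "After planting, there are 21 trees. How many did they plant?"
    ),
    reasoning=(
        "There were 15 trees before. After planting, there are 21. "
        "So they planted 21 - 15 = 6 trees. The answer is 6."
    ),
)
few_shot_prompt = build_few_shot_prompt(
    [trees],
    "Tom had 23 apples. He ate 7 and gave 4 to his sister. How many does he have left?",
)
print(few_shot_prompt)
```

The prompt ends on an open `A:` so that the model's continuation is the answer itself, written in the style of the demonstration.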

Kojima et al., “Large Language Models are Zero-Shot Reasoners” (2022), showed that you do not even need the worked example. Starting the answer with the phrase “Let’s think step by step” is often enough.

Q: Tom had 23 apples. He ate 7 and gave 4 to his sister. How many does
   he have left?
A: Let's think step by step.

Both versions work. Both have the same mechanism underneath.
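The zero-shot variant is even shorter to build. This sketch assumes the same Q/A layout as the example above; `build_zero_shot_prompt` and `TRIGGER` are hypothetical names, and the trigger is a parameter so that other phrasings can be tried.

```python
TRIGGER = "Let's think step by step."


def build_zero_shot_prompt(question: str, trigger: str = TRIGGER) -> str:
    """Start the answer with a trigger phrase so the model continues with reasoning."""
    return f"Q: {question}\nA: {trigger}"


zero_shot_prompt = build_zero_shot_prompt(
    "Tom had 23 apples. He ate 7 and gave 4 to his sister. How many does he have left?"
)
print(zero_shot_prompt)
```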

Why it works

Two things are happening at once.

More tokens means more compute. Every token the model produces requires a full forward pass through the network. A one-token answer gets one pass through the weights. A fifty-token chain of reasoning gets fifty. The model is literally doing more thinking before committing to the final number.

Each token conditions the next. Reasoning tokens become part of the context the model sees when picking subsequent ones. Recall from the hallucinations post that the model samples each token from a distribution conditioned on everything before it. If the earlier tokens are “Tom started with 23, then ate 7, leaving 16, then gave away 4, leaving”, the distribution for the answer token sharpens around “12” and away from random guesses.

Spreading the computation across many forward passes, each conditioned on the last, is what makes the difference. A single-token answer has only one chance to be right.
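Both effects show up in the shape of a decoding loop. The sketch below is a toy: `scripted_model` is a hypothetical stand-in that replays a fixed token script instead of running a network, so it says nothing about accuracy. It only shows that each generated token costs one call to the model and that every call sees all earlier tokens.

```python
from typing import Callable


def generate(
    next_token: Callable[[list[str]], str],
    prompt: list[str],
    stop: str = "<eos>",
    max_tokens: int = 64,
) -> tuple[list[str], int]:
    """Autoregressive loop: one model call (forward pass) per generated token."""
    context = list(prompt)
    passes = 0
    for _ in range(max_tokens):
        token = next_token(context)  # conditioned on the prompt plus all tokens so far
        passes += 1
        context.append(token)        # the new token becomes context for the next call
        if token == stop:
            break
    return context[len(prompt):], passes


def scripted_model(script: list[str], prompt_len: int) -> Callable[[list[str]], str]:
    """Hypothetical stand-in for a real model: replays a fixed script of tokens."""
    def next_token(context: list[str]) -> str:
        return script[len(context) - prompt_len]
    return next_token


prompt = ["Q:", "23", "apples,", "ate", "7,", "gave", "4.", "A:"]
direct_script = ["12", "<eos>"]
cot_script = ["23", "-", "7", "=", "16", ";", "16", "-", "4", "=", "12", "<eos>"]

_, direct_passes = generate(scripted_model(direct_script, len(prompt)), prompt)
_, cot_passes = generate(scripted_model(cot_script, len(prompt)), prompt)
print(direct_passes, cot_passes)
```

In the direct script the answer token is chosen on the very first call. In the reasoning script it is chosen only after ten earlier calls, each of which has added to the context.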

Here is the same arithmetic prompt without any reasoning:

[Figure: answer distribution when the model is asked for the answer directly.] The distribution is spread; the model is essentially guessing the arithmetic in one forward pass.

And with the reasoning already laid out:

[Figure: answer distribution after the reasoning has been written out.] The answer distribution collapses to a single token. The reasoning did not change the math; it changed what the model was conditioned on.
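One way to put a number on “spread” versus “collapsed” is the Shannon entropy of the answer distribution. The probabilities below are made up for illustration and were not measured from any model; only the entropy calculation itself is standard.

```python
import math


def entropy_bits(probs: dict[str, float]) -> float:
    """Shannon entropy in bits; higher means the distribution is more spread out."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


# Hypothetical answer-token probabilities, invented for this sketch.
direct = {"12": 0.30, "16": 0.25, "20": 0.20, "19": 0.15, "11": 0.10}
after_reasoning = {"12": 0.97, "16": 0.02, "20": 0.01}

print(f"direct:          {entropy_bits(direct):.2f} bits")
print(f"after reasoning: {entropy_bits(after_reasoning):.2f} bits")
```

With real token probabilities from an API that exposes them, the same function would show how much a given chain of reasoning narrows the answer.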

Two flavors

Zero-shot CoT. Add a single trigger phrase (“Let’s think step by step”, “First, let me work through this carefully”, “Show your reasoning”). One extra line of prompt, a chunk of extra latency. Often a 5-30 percentage-point bump on reasoning tasks.

Few-shot CoT. Include one or more worked examples with reasoning steps shown. More accurate than zero-shot on hard problems. Costs more prompt tokens. The right call when the task is unusual enough that the model needs to see the format you want.

Pick zero-shot first. If accuracy is not where you need it, add examples.

Self-consistency

A clever extension: sample several CoT completions at non-zero temperature and take the majority vote on the final answer. Wang et al., “Self-Consistency Improves Chain of Thought Reasoning in Language Models” (2022), found this beats single-sample CoT by another 5-15 percentage points on hard problems.

The cost is N times the API spend, and N times the latency too unless the samples run in parallel. Use it when correctness on a small number of queries is worth more than throughput.
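The voting step is only a few lines. In this sketch `sample_answer` is a hypothetical callable that runs one CoT completion at non-zero temperature and returns the extracted final answer, or `None` if no answer could be parsed. Here it is replaced by canned values so the example runs without a model.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistent_answer(
    sample_answer: Callable[[], Optional[str]], n: int = 5
) -> Optional[str]:
    """Draw n independent answers and return the most common one."""
    answers = [a for a in (sample_answer() for _ in range(n)) if a is not None]
    if not answers:
        return None  # every sample failed to produce a parseable answer
    return Counter(answers).most_common(1)[0][0]


# Canned stand-in for five sampled completions; a real sampler would call a model.
canned = iter(["12", "16", "12", None, "12"])
voted = self_consistent_answer(lambda: next(canned), n=5)
print(voted)
```

Only the final answers are compared, not the reasoning text, so two chains that reach the same answer by different routes still count as agreeing.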

When not to use it

Chain of thought is not always the right call.

Trivial tasks. Asking “what is 2+2” with “let’s think step by step” is wasted tokens and latency. The model does not need to think.

Latency-sensitive interactive UI. In many UIs, a user staring at a “thinking…” spinner for three seconds is a worse experience than a fast answer that is occasionally wrong. Pick the right tradeoff for the surface.

Tasks that need a strict output shape. CoT produces long, freeform reasoning. If you need JSON or a single label, you either have to post-process (extract the final answer from the reasoning) or use a structured-output mode that suppresses the reasoning. Both work; both have friction.

Direct lookup. “What is the capital of France?” does not benefit from reasoning. The model just knows.

Use CoT when getting the answer requires combining facts, not when it requires recalling them.
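For the strict-output case, the post-processing route can be as small as a regular expression. This sketch assumes completions end with a sentence of the form “The answer is N”, as in the few-shot demonstration earlier; `extract_answer` is a hypothetical helper, and a different answer format would need a different pattern.

```python
import re
from typing import Optional

# Matches "The answer is 6", "the answer is -3.5", and similar numeric forms.
ANSWER_RE = re.compile(r"The answer is\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_answer(completion: str) -> Optional[str]:
    """Return the last numeric answer stated in the completion, or None."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None


completion = (
    "There were 15 trees before. After planting, there are 21. "
    "So they planted 21 - 15 = 6 trees. The answer is 6."
)
print(extract_answer(completion))
```

Taking the last match rather than the first guards against a chain that states a tentative answer and then corrects itself. Returning `None` lets the caller decide whether to retry or fall back.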

Reasoning models: CoT built in

Since 2024, providers have shipped models with reasoning baked in: OpenAI’s o-series, Anthropic’s “thinking” models, DeepSeek-R1. The pitch is the same as zero-shot CoT, internalised: the model spends extra tokens reasoning before its visible answer, and you do not have to prompt for it.

In practice, these models:

Do CoT (or something denser) automatically.
Often hide the reasoning from the final response to keep the user-facing output clean.
Cost more per call, because they generate the hidden reasoning.
Reach higher accuracy ceilings on hard math, code, and logic.

If you have access to one and your task benefits from reasoning, use it. You will spend less prompt-engineering effort and get similar or better results than hand-rolling CoT on a non-reasoning model. On the cheaper non-reasoning models the trigger phrase is still the cheapest accuracy upgrade you can apply.