The math behind LLMs, mostly without tears

A language model is a function. Specifically: input a list of tokens, output a probability for every possible next token. Everything else (the architecture, the parameters, the training) is just machinery for computing that function quickly and accurately.

There is math below. Most of it is matrix multiplication; if you can keep track of dimensions when you multiply matrices, you can follow what is happening at each step.

What the model is, in plain shapes

The model takes a list of tokens (say, the tokens for The cat sat) and outputs a long vector of probabilities, one entry per token in the vocabulary, summing to 1. For models like GPT-4 or Claude, the vocabulary has 100K to 200K entries.

Two numbers keep showing up below. $n$ is the number of tokens currently being read; it can be at most the model's context length (8K, 32K, 1M, depending on the model). $d$ is the embedding dimension, the size of the vector each token gets turned into inside the model (often 4096 or larger).

The model’s parameters are the weights inside its matrices. The whole function is differentiable with respect to those parameters, which is what makes training the model possible at all.

Tokens become vectors

The first thing the model does is convert each input token into a vector. There is a giant lookup table called the embedding matrix:

E \in \mathbb{R}^{|V| \times d}

$|V|$ is the vocabulary size and $d$ is the embedding dimension from above. Each row of $E$ is the vector for one vocabulary token. Looking up a token is just selecting that row.

If the input is The cat sat, the model produces a matrix $X \in \mathbb{R}^{n \times d}$ where each row is the embedding of one input token.

Each input token is used as an index into the embedding matrix $E$, and the row at that index is copied into the corresponding row of $X$. Repeat for every input token; that is the entire tokens-to-vectors step. The interesting math happens after.
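The lookup described above can be sketched in a few lines of plain Python. This is a toy illustration, not how a real model stores its weights: the four-word vocabulary, the token ids, and the values in `E` are all made up, and `embed` is a hypothetical helper name.

```python
# Toy vocabulary and embedding matrix (assumed values, for illustration only).
# Real models have |V| around 100K and d around 4096.
vocab = {"The": 0, "cat": 1, "sat": 2, "mat": 3}  # hypothetical token ids

# E has one row per vocabulary token: shape |V| x d, here 4 x 3.
E = [
    [0.1, 0.2, 0.3],  # "The"
    [0.4, 0.5, 0.6],  # "cat"
    [0.7, 0.8, 0.9],  # "sat"
    [1.0, 1.1, 1.2],  # "mat"
]

def embed(token_ids, E):
    """Return X (n x d): row i is a copy of the embedding row for input token i."""
    return [list(E[t]) for t in token_ids]

token_ids = [vocab[w] for w in ["The", "cat", "sat"]]
X = embed(token_ids, E)
print(len(X), len(X[0]))  # prints: 3 3  (n = 3 rows, d = 3 columns)
```

No arithmetic happens here: the step is pure indexing, which is why it is usually described as a table lookup rather than a matrix multiplication.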

A transformer block, the shape

A transformer is a stack of blocks that are identical in structure, each with its own weights. Each block does two things in order: self-attention, then a feed-forward network. There is a residual connection and a layer normalization around each, but those are details.

Stack $L$ of these blocks (often 32 to 96 of them), feed the output into a final linear layer, take a softmax, and you have the next-token distribution. Everything below is what happens inside one block.
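The wiring of a block and of the stack can be sketched as below. Several things here are assumptions rather than anything the text specifies: the names `layer_norm`, `residual`, `block`, and `stack` are hypothetical, the normalization is applied before each sublayer (one common arrangement; others exist), the learned scale and shift of layer normalization are omitted, and the attention and feed-forward sublayers are passed in as placeholder functions.

```python
import math

def layer_norm(row, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]

def residual(X, sublayer):
    """Apply the sublayer to the normalized input, then add the result back onto X."""
    normed = [layer_norm(row) for row in X]
    out = sublayer(normed)
    return [[x + o for x, o in zip(x_row, o_row)] for x_row, o_row in zip(X, out)]

def block(X, attention, ffn):
    """One transformer block: self-attention, then the feed-forward network."""
    X = residual(X, attention)
    X = residual(X, ffn)
    return X

def stack(X, blocks):
    """Run X (n x d) through the blocks in order; the shape stays n x d throughout."""
    for attention, ffn in blocks:
        X = block(X, attention, ffn)
    return X

# Placeholder sublayers so the sketch runs: each maps an n x d matrix to an n x d matrix.
def zero(M):
    return [[0.0 for _ in row] for row in M]

def halve(M):
    return [[0.5 * v for v in row] for row in M]

X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
Y = stack(X, [(halve, halve)] * 4)   # L = 4 blocks
print(len(Y), len(Y[0]))             # prints: 2 3
```

The point of the sketch is the shape discipline: every sublayer takes an $n \times d$ matrix and returns one, so blocks can be stacked to any depth.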

Attention: the one new idea

Self-attention is what makes transformers work. Each token in the sequence “looks at” every other token and decides how much information to pull from each. The result is a new vector for each token that incorporates context from the rest of the sequence.

From the input $X$, the model computes three projections using three learned weight matrices:

Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V

where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$. $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix. Each row of $Q$ is “what this token is looking for”. Each row of $K$ is “what this token offers”. Each row of $V$ is “what this token will hand over if it is attended to”. (If you have seen “KV cache” mentioned in serving notes, $K$ and $V$ here are exactly what is cached: precomputed for previous tokens so subsequent ones do not have to recompute them.)

Attention scores come from comparing every query to every key:

\text{scores} = \frac{Q K^T}{\sqrt{d_k}}

The $\sqrt{d_k}$ is a scaling factor that keeps the numbers stable as $d_k$ grows. The result is an $n \times n$ matrix where entry $(i, j)$ tells you how much token $i$ should attend to token $j$. This $n \times n$ shape is also why attention scales as $O(n^2 d_k)$ per layer: doubling the context length quadruples the work, which is why long-context models cost more per call.
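The quadratic growth is easy to check with arithmetic. The sketch below counts multiply-adds for the score matrix only; `attention_score_cost` is a hypothetical helper, and the head dimension of 128 is an assumed value chosen for illustration.

```python
def attention_score_cost(n, d_k):
    """Multiply-adds needed to form the n x n score matrix Q K^T for one head."""
    return n * n * d_k

d_k = 128  # assumed head dimension, for illustration only

for n in (8_000, 16_000, 32_000):
    print(n, attention_score_cost(n, d_k))

# Doubling n multiplies the cost by four, whatever d_k is.
ratio = attention_score_cost(16_000, d_k) / attention_score_cost(8_000, d_k)
print(ratio)  # prints: 4.0
```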

A softmax (row-wise) turns the scores into probabilities; each row sums to 1:

A = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)

The output: each token’s new vector is a weighted combination of the value vectors:

\text{Attention}(Q, K, V) = A V

That is the whole operation: five matrix multiplications (the three projections, $QK^T$, and $AV$) with a softmax in the middle.
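The formulas above translate almost line for line into code. Below is a minimal single-head sketch in plain Python, with matrices stored as lists of rows. The helper names (`matmul`, `transpose`, `softmax`, `attention`) and all the numbers are made up for illustration, and there is no masking, matching the formulas as written.

```python
import math

def matmul(A, B):
    """Multiply an (m x k) matrix by a (k x p) matrix, both stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax(z):
    """Softmax of one row; subtracting the max avoids overflow without changing the result."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def attention(X, W_Q, W_K, W_V):
    """Single-head self-attention. Returns (A, A V): A is n x n, the output is n x d_k."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(W_Q[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]
    A = [softmax(row) for row in scores]  # row-wise softmax
    return A, matmul(A, V)

# Toy sizes: n = 3 tokens, d = 4, d_k = 2. The values are arbitrary.
X = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
W_Q = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
W_K = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
W_V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

A, out = attention(X, W_Q, W_K, W_V)
print(len(A), len(A[0]))      # prints: 3 3  (n x n attention weights)
print(len(out), len(out[0]))  # prints: 3 2  (n x d_k output)
```

Each row of `A` sums to 1, so each row of the output is a weighted average of the rows of $V$.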

A picture of attention

For a six-token sentence, the attention matrix $A$ might look like this:

        The    cat    sat    on     the    mat
The     0.74   0.08   0.06   0.04   0.05   0.03
cat     0.12   0.62   0.10   0.06   0.05   …
sat     0.08   0.42   0.34   0.07   0.05   0.04
on      0.06   0.08   0.10   0.58   0.10   0.08
the     0.04   0.05   0.06   0.20   0.51   0.14
mat     0.03   0.05   0.06   0.18   0.16   0.52

Illustrative attention weights. Each row sums to 1 (the row's view of the sentence). Strong diagonal means each token mostly attends to itself; off-diagonal weight is the model pulling context from related tokens.

Each row is one token’s attention distribution: how much it pulled from every other token (including itself). In a real model the patterns are far less diagonal and far more interpretable: prepositions attend to the noun they modify, pronouns attend to their antecedents, verbs attend to their subjects.

To see what one row of that matrix actually does, here is the same computation traced from the perspective of one query token: the dot products are computed one at a time, then softmax normalises them into attention weights, then those weights produce an output vector by mixing the value rows.

Attention from cat's point of view. The Q vector compares to every K row (dot product). Softmax turns those scores into weights that sum to 1. The output is a weighted sum of the V rows.

A real transformer does this with many heads in parallel, each with its own $W_Q, W_K, W_V$. The outputs are concatenated and projected back to dimension $d$. That is multi-head attention. The math is the same shape; there are just $h$ copies of it side by side.
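A minimal sketch of the multi-head version follows, again in plain Python. Some choices here are assumptions not stated in the text: the output projection is called `W_O`, the head dimension is set to $d_k = d/h$ (a common choice), and the weights are random numbers from a seeded generator rather than learned values.

```python
import math
import random

def matmul(A, B):
    """Multiply an (m x k) matrix by a (k x p) matrix, both stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def head(X, W_Q, W_K, W_V):
    """One attention head: softmax(Q K^T / sqrt(d_k)) V, shape n x d_k."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(W_Q[0])
    K_T = [list(col) for col in zip(*K)]
    A = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(Q, K_T)]
    return matmul(A, V)

def multi_head(X, heads, W_O):
    """Run h heads side by side, concatenate their outputs, project back to d."""
    outputs = [head(X, W_Q, W_K, W_V) for W_Q, W_K, W_V in heads]  # h matrices, n x d_k
    concat = [[v for out in outputs for v in out[i]] for i in range(len(X))]  # n x (h * d_k)
    return matmul(concat, W_O)  # n x d

rng = random.Random(0)

def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

n, d, h = 3, 8, 2
d_k = d // h  # assumed: each head works in a d/h-dimensional subspace
X = random_matrix(n, d)
heads = [tuple(random_matrix(d, d_k) for _ in range(3)) for _ in range(h)]
W_O = random_matrix(h * d_k, d)

Y = multi_head(X, heads, W_O)
print(len(Y), len(Y[0]))  # prints: 3 8  (same n x d shape as the input)
```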

The softmax, in detail

The softmax keeps coming back, so it is worth understanding directly. Given a vector of logits $z \in \mathbb{R}^k$, the softmax produces a probability vector $p \in \Delta^{k-1}$ (nonnegative entries that sum to 1):

p_i = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}

Two properties matter:

The bigger $z_i$ is relative to the others, the bigger $p_i$. The relationship is exponential, which is why softmax is “sharp”: small changes in $z$ produce large changes in $p$ at the top of the distribution.

The output is invariant to adding a constant. Adding the same number to every $z_i$ does not change $p$, because the common factor cancels between numerator and denominator. This is what lets implementations of softmax and softmax-cross-entropy loss stay numerically stable: you subtract $\max(z)$ before taking the exponential, so the largest exponent is 0 and nothing overflows.
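Both properties are easy to check numerically. The sketch below uses only Python's `math` module; the logit values are arbitrary, and `softmax` here is a hand-written helper, not a library function.

```python
import math

def softmax(z):
    """Numerically stable softmax: shift by max(z) so the largest exponent is 0."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Shift invariance: adding 100 to every logit leaves the probabilities unchanged.
z = [2.0, 1.0, 0.1]
p = softmax(z)
q = softmax([v + 100.0 for v in z])
print(all(abs(a - b) < 1e-9 for a, b in zip(p, q)))  # prints: True

# Why the shift matters: exponentiating a large logit directly overflows a float.
try:
    math.exp(1000.0)
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)  # prints: True

# The shifted version handles the same logits without trouble.
big = softmax([1000.0, 999.0, 998.0])
print(abs(sum(big) - 1.0) < 1e-9)  # prints: True
```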

Temperature is just a divisor inside the exponential:

p_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}

$T = 1$ is the default. As $T \to 0$ the distribution sharpens until all the mass sits on the top token (greedy decoding). $T > 1$ flattens it (more varied, “more creative” sampling). This is the temperature you set in the OpenAI or Anthropic API call.

The effect on a sample distribution of logits, with $p_i = e^{z_i/T} / \sum_j e^{z_j/T}$:

Paris $z = 2.40$, the $z = 0.50$, in $z = 1.10$, France $z = 0.80$, a $z = 0.30$, located $z = -0.20$.

The same logits, different temperatures. At T<0.5 the distribution collapses to 'Paris' (greedy). At T=1 it spreads. At T>1.5 it flattens toward uniform, and the model is essentially picking at random from a shortlist.
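The figure's numbers can be reproduced with a few lines of Python. The tokens and logits are the ones listed above; the function name `softmax_with_temperature` and the three temperatures tried are choices made for this sketch.

```python
import math

def softmax_with_temperature(z, T):
    """Divide the logits by T, then apply a numerically stable softmax."""
    scaled = [v / T for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["Paris", "the", "in", "France", "a", "located"]
logits = [2.40, 0.50, 1.10, 0.80, 0.30, -0.20]

by_T = {T: softmax_with_temperature(logits, T) for T in (0.25, 1.0, 2.0)}

for T, p in by_T.items():
    row = ", ".join(f"{tok} {100 * pi:.1f}%" for tok, pi in zip(tokens, p))
    print(f"T={T}: {row}")
```

With these logits, 'Paris' gets roughly 99% of the mass at $T = 0.25$, roughly 55% at $T = 1$, and roughly 34% at $T = 2$. The ranking of the tokens never changes; only how much the top one dominates.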

Putting it together

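A complete forward pass through a transformer-based LLM, in order:

1. Look up each input token's row in the embedding matrix $E$ to get $X \in \mathbb{R}^{n \times d}$.
2. Pass $X$ through the $L$ transformer blocks. Each block applies self-attention, then a feed-forward network, with a residual connection and a layer normalization around each.
3. Feed the output into a final linear layer to get one logit per vocabulary token.
4. Take a softmax (with temperature $T$) to get the next-token distribution.
5. Sample a token from that distribution.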

To generate text, you append the sampled token to the input and run the whole thing again. That is autoregressive decoding.
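The decoding loop itself is short. In the sketch below the model is replaced by a stand-in: `next_token_probs` is a hypothetical function that returns a made-up distribution (it favours whichever token follows the last one in a tiny invented vocabulary). The vocabulary, the `<eos>` stop token, and the `generate` helper are all assumptions for illustration; only the loop structure reflects the text.

```python
import random

VOCAB = ["The", "cat", "sat", "on", "the", "mat", "<eos>"]  # toy vocabulary (assumed)

def next_token_probs(tokens):
    """Stand-in for the model's forward pass (hypothetical).

    Puts 0.9 of the probability on the token that follows the last one in VOCAB
    (wrapping around) and spreads the remaining 0.1 evenly over the others.
    """
    nxt = (tokens[-1] + 1) % len(VOCAB)
    rest = 0.1 / (len(VOCAB) - 1)
    return [0.9 if i == nxt else rest for i in range(len(VOCAB))]

def generate(prompt, max_new_tokens, rng):
    """Autoregressive decoding: sample a token, append it, run the model again."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)  # one full forward pass per new token
        token = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        tokens.append(token)
        if VOCAB[token] == "<eos>":       # assumed stop condition
            break
    return tokens

rng = random.Random(0)  # seeded so repeated runs sample the same tokens
out = generate([0], max_new_tokens=10, rng=rng)
print(" ".join(VOCAB[t] for t in out))
```

Every new token costs one more pass through the whole model, which is why caching $K$ and $V$ for earlier tokens saves so much work during generation.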