Teju's Blog

Full stack engineer and AI architect. Notes from the work.


The math behind LLMs, mostly without tears

A language model is a function. Specifically: input a list of tokens, output a probability for every possible next token. Everything else (the architecture, the parameters, the training) is just machinery for computing that function quickly and accurately.

There is math below. Most of it is matrix multiplication; if you can keep track of dimensions when you multiply matrices, you can follow what is happening at each step.

What the model is, in plain shapes

The model takes a list of tokens (say, the tokens for “The cat sat”) and outputs a long vector of probabilities, one entry per token in the vocabulary, summing to 1. For models like GPT-4 or Claude, the vocabulary has on the order of 100K to 200K entries.

Two numbers keep showing up below. $n$ is the number of tokens currently being read; it can be at most the model's context length (8K, 32K, 1M, depending on the model). $d$ is the embedding dimension, the size of the vector each token gets turned into inside the model (often 4096 or larger).

The model’s parameters are the weights inside its matrices. The whole function is differentiable with respect to those parameters, which is what makes training the model possible at all.

Tokens become vectors

The first thing the model does is convert each input token into a vector. There is a giant lookup table called the embedding matrix:

$$E \in \mathbb{R}^{|V| \times d}$$

$|V|$ is the vocabulary size and $d$ is the embedding dimension. Each row of $E$ is the vector for one vocabulary token. Looking up a token is just selecting that row.

If the input is “The cat sat”, the model produces a matrix $X \in \mathbb{R}^{n \times d}$ where each row is the embedding of one input token.

[Figure: the input tokens “The”, “cat”, “sat” each select a row of $E$; those rows become rows 1, 2 and 3 of $X$.]
Each input token is used as an index into the embedding matrix $E$. The row at that index is copied into a row of $X$. Repeat for every input token.

That is the entire tokens-to-vectors step. The interesting math happens after.
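
The lookup can be sketched in a few lines of plain Python. The five-word vocabulary and the numbers in `E` are made up for illustration; a real model has a vocabulary of 100K or more entries and a much larger $d$.

```python
# Toy embedding lookup: |V| = 5 vocabulary entries, d = 3 dimensions.
# The vocabulary and the values in E are invented for this example.
vocab = {"The": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

E = [
    [0.1, 0.2, 0.3],  # row 0: "The"
    [0.4, 0.5, 0.6],  # row 1: "cat"
    [0.7, 0.8, 0.9],  # row 2: "sat"
    [1.0, 1.1, 1.2],  # row 3: "on"
    [1.3, 1.4, 1.5],  # row 4: "mat"
]


def embed(token_ids, E):
    """Return X, an n x d matrix: one copied row of E per input token."""
    return [list(E[t]) for t in token_ids]


token_ids = [vocab[w] for w in ["The", "cat", "sat"]]
X = embed(token_ids, E)
print(len(X), len(X[0]))  # n = 3 rows, d = 3 columns
```

No arithmetic happens here: the step is pure indexing, which is why it is cheap even when $E$ is very large.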

A transformer block, the shape

A transformer is a stack of structurally identical blocks, each with its own weights. Each block does two things in order: self-attention, then a feed-forward network. There is a residual connection and a layer normalization around each, but those are details. The diagram:

X → LayerNorm → Self-attention → + residual → LayerNorm → Feed-forward MLP → + residual → Output to next block

Stack $L$ of these blocks (often 32 to 96 of them), feed the output into a final linear layer, take a softmax, and you have the next-token distribution. Everything below is what happens inside one block.
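
The wiring of one block can be sketched in plain Python. This is a sketch under assumptions: it follows the order in the diagram (LayerNorm before each sublayer), leaves out LayerNorm's learned scale and shift, and takes `attention` and `mlp` as placeholder functions supplied by the caller. Those two names are mine, not from any library.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def transformer_block(X, attention, mlp):
    """LayerNorm -> sublayer -> add the residual, done twice.

    `attention` maps an n x d matrix to an n x d matrix (tokens interact).
    `mlp` maps one d-vector to one d-vector (applied to each token separately).
    """
    normed = [layer_norm(row) for row in X]
    attn_out = attention(normed)
    X = [[a + b for a, b in zip(row, out)] for row, out in zip(X, attn_out)]

    mlp_out = [mlp(layer_norm(row)) for row in X]
    return [[a + b for a, b in zip(row, out)] for row, out in zip(X, mlp_out)]


# Stand-in sublayers so the sketch runs: both return zeros, so only the
# residual path contributes and X passes through unchanged.
def zeros_attention(M):
    return [[0.0] * len(row) for row in M]


def zeros_mlp(v):
    return [0.0] * len(v)


X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
Y = transformer_block(X, zeros_attention, zeros_mlp)
print(Y == X)  # True: with zero sublayers only the residual remains
```

The residual additions are why the input and output of a block have the same $n \times d$ shape, which is what lets the blocks stack.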

Attention: the one new idea

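Self-attention is what makes transformers work. Each token in the sequence “looks at” the other tokens and decides how much information to pull from each. (In GPT-style decoders a causal mask limits each token to itself and the tokens before it; the formulas below leave the mask out for simplicity.) The result is a new vector for each token that incorporates context from the rest of the sequence.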

From the input $X$, the model computes three projections using three learned weight matrices:

$$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V$$

where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$. $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix. Each row of $Q$ is “what this token is looking for”. Each row of $K$ is “what this token offers”. Each row of $V$ is “what this token will hand over if it is attended to”. (If you have seen “KV cache” mentioned in serving notes, $K$ and $V$ here are exactly what is cached: precomputed for previous tokens so subsequent ones do not have to recompute them.)

Attention scores come from comparing every query to every key:

$$\text{scores} = \frac{Q K^T}{\sqrt{d_k}}$$

The $\sqrt{d_k}$ is a scaling factor that keeps the numbers stable as $d_k$ grows: without it, the dot products get larger with $d_k$ and push the softmax toward saturation. The result is an $n \times n$ matrix where entry $(i, j)$ tells you how much token $i$ should attend to token $j$. This $n \times n$ shape is also why attention scales as $O(n^2 d_k)$ per layer: doubling the context length quadruples the attention work, which is why long-context models cost more per call.
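
The quadratic growth is easy to check with arithmetic. The sketch below counts only the multiply-adds needed to form $QK^T$ for a single head; the head dimension `d_k = 128` is an assumed value for illustration, not a figure from any particular model.

```python
def attention_score_cost(n, d_k):
    """Multiply-adds needed to form the n x n matrix Q K^T for one head."""
    return n * n * d_k


d_k = 128  # assumed head dimension, for illustration only

for n in (8_000, 16_000, 32_000):
    print(n, attention_score_cost(n, d_k))

ratio = attention_score_cost(16_000, d_k) / attention_score_cost(8_000, d_k)
print(ratio)  # 4.0: doubling n quadruples the work
```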

A softmax (row-wise) turns the scores into probabilities; each row sums to 1:

$$A = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)$$

The output: each token’s new vector is a weighted combination of the value vectors:

$$\text{Attention}(Q, K, V) = A V$$

That is the whole operation: five matrix multiplications (three projections, then $QK^T$, then $AV$) with a softmax in the middle.

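Shapes at each step:
$X$: $n \times d$
$Q = X W_Q$: $n \times d_k$
$K = X W_K$: $n \times d_k$
$V = X W_V$: $n \times d_k$
$\text{scores} = Q K^T / \sqrt{d_k}$: $n \times n$
$A = \text{softmax}$ over rows: $n \times n$
$\text{Output} = A V$: $n \times d_k$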
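
Here is the same sequence of steps as a small, runnable sketch in plain Python, with matrices stored as lists of rows. The toy input and weights are made up, there is a single head, and there is no causal mask, matching the formulas above. The helper names `matmul`, `transpose` and `attention` are my own.

```python
import math


def matmul(A, B):
    """Multiply an (m x k) matrix by a (k x p) matrix, both lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def softmax(z):
    m = max(z)  # shift by the max so exp() cannot overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def attention(X, W_Q, W_K, W_V):
    """Single-head attention. Returns (A, output): n x n weights, n x d_k output."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)  # each n x d_k
    d_k = len(W_Q[0])
    scores = matmul(Q, transpose(K))                           # n x n
    A = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return A, matmul(A, V)                                     # n x d_k


# Toy sizes: n = 3 tokens, d = 4, d_k = 2. All values are invented.
X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0, 0.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_K = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_V = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]

A, out = attention(X, W_Q, W_K, W_V)
print(len(A), len(A[0]))      # 3 3  (n x n)
print(len(out), len(out[0]))  # 3 2  (n x d_k)
```

A production implementation would use a tensor library and batch these operations; the loops here only exist to keep every shape visible.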

A picture of attention

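For a six-token sentence, the attention matrix $A$ might look like this (rows are the attending token, columns are the token attended to):

|     | The  | cat  | sat  | on   | the  | mat  |
|-----|------|------|------|------|------|------|
| The | 0.74 | 0.08 | 0.06 | 0.04 | 0.05 | 0.03 |
| cat | 0.12 | 0.62 | 0.10 | 0.06 | 0.05 | 0.05 |
| sat | 0.08 | 0.42 | 0.34 | 0.07 | 0.05 | 0.04 |
| on  | 0.06 | 0.08 | 0.10 | 0.58 | 0.10 | 0.08 |
| the | 0.04 | 0.05 | 0.06 | 0.20 | 0.51 | 0.14 |
| mat | 0.03 | 0.05 | 0.06 | 0.18 | 0.16 | 0.52 |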
Illustrative attention weights. Each row sums to 1 (the row's view of the sentence). Strong diagonal means each token mostly attends to itself; off-diagonal weight is the model pulling context from related tokens.

Each row is one token’s attention distribution: how much it pulled from every token, including itself. In a real model the patterns are usually far less diagonal, and they vary by head and layer. Some heads show interpretable patterns (prepositions attending to the noun they modify, pronouns to their antecedents, verbs to their subjects), while many others are harder to read.

To see what one row of that matrix actually does, here is the same computation traced from the perspective of one query token. The dot products of its query with each key are computed one at a time, then softmax normalises them into attention weights, then those weights produce an output vector by mixing the value rows.

[Interactive figure: attention from “cat” over the tokens “The”, “cat”, “sat”.]
Attention from cat's point of view. The Q vector compares to every K row (dot product). Softmax turns those scores into weights that sum to 1. The output is a weighted sum of the V rows.

A real transformer does this with many heads in parallel, each with its own $W_Q, W_K, W_V$. The outputs are concatenated and projected back to dimension $d$. That is multi-head attention. The math is the same shape; there are just $h$ copies of it side by side.
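
A minimal sketch of the multi-head version, again in plain Python. Some assumptions to flag: the output projection is called `W_O` here (the text above does not name it), the head size is set to `d_k = d // h` as is common, and the weights are random numbers from a seeded generator purely so the shapes can be checked.

```python
import math
import random


def matmul(A, B):
    """(m x k) times (k x p), with matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def head(X, W_Q, W_K, W_V):
    """One attention head: returns an n x d_k matrix."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(W_Q[0])
    K_T = [list(col) for col in zip(*K)]
    A = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(Q, K_T)]
    return matmul(A, V)


def multi_head(X, heads, W_O):
    """Run every head, concatenate along the feature axis, project back to d."""
    outputs = [head(X, *weights) for weights in heads]  # h matrices, each n x d_k
    concat = [sum((out[i] for out in outputs), []) for i in range(len(X))]  # n x (h * d_k)
    return matmul(concat, W_O)  # n x d


rng = random.Random(0)  # seeded so the sketch is reproducible


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


n, d, h = 3, 8, 2
d_k = d // h  # assumed convention: split d evenly across heads
X = rand_matrix(n, d)
heads = [(rand_matrix(d, d_k), rand_matrix(d, d_k), rand_matrix(d, d_k)) for _ in range(h)]
W_O = rand_matrix(h * d_k, d)

out = multi_head(X, heads, W_O)
print(len(out), len(out[0]))  # 3 8  (n x d, same shape as X)
```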

The softmax, in detail

The softmax keeps coming back, so it is worth understanding directly. Given a vector of logits $z \in \mathbb{R}^k$, the softmax produces a probability vector $p \in \Delta^k$:

$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}$$

Two properties matter:

The bigger $z_i$ is relative to the others, the bigger $p_i$. The relationship is exponential, which is why softmax is “sharp”: small changes in $z$ produce large changes in $p$ at the top of the distribution.

The output is invariant to adding a constant. Adding the same number to every $z_i$ does not change $p$. This is what makes a numerically stable implementation possible: you subtract $\max(z)$ before taking the exponential, so the largest exponent is 0 and nothing overflows.
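
Both properties can be checked directly. The sketch below compares a naive softmax with the max-subtracted one; the function names and the test logits are my own choices for illustration.

```python
import math


def softmax_naive(z):
    """Direct translation of the formula; overflows for large logits."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_stable(z):
    """Subtract max(z) first: the largest exponent becomes 0."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Invariance: shifting every logit by the same constant leaves p unchanged.
z = [2.4, 0.5, 1.1]
shifted = [v + 100.0 for v in z]
p, q = softmax_stable(z), softmax_stable(shifted)
print(all(abs(a - b) < 1e-9 for a, b in zip(p, q)))  # True

# Stability: exp(1000) is too large for a float, so the naive version fails.
big = [1000.0, 999.0, 998.0]
try:
    softmax_naive(big)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
print(naive_overflowed)  # True

stable = softmax_stable(big)
print(abs(sum(stable) - 1.0) < 1e-9)  # True
```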

Temperature is just a divisor inside the exponential:

$$p_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$

$T = 1$ is the default. $T \to 0$ makes the distribution sharper (in the limit, greedy decoding). $T > 1$ flattens it (more creative sampling). This is the temperature you set in the OpenAI or Anthropic API call.

The effect on a sample distribution of logits:

| Token   | Logit $z$ |
|---------|-----------|
| Paris   | 2.40      |
| the     | 0.50      |
| in      | 1.10      |
| France  | 0.80      |
| a       | 0.30      |
| located | -0.20     |
The same logits, different temperatures. At T<0.5 the distribution collapses to 'Paris' (greedy). At T=1 it spreads. At T>1.5 it flattens toward uniform, and the model is essentially picking at random from a shortlist.
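
To reproduce the effect without a slider, here is a short sketch that applies the temperature formula to the six logits from the figure. The function name `softmax_with_temperature` and the particular temperatures tried are my own choices.

```python
import math


def softmax_with_temperature(logits, T):
    """Softmax of logits / T, shifted by the max for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


tokens = ["Paris", "the", "in", "France", "a", "located"]
logits = [2.40, 0.50, 1.10, 0.80, 0.30, -0.20]

for T in (0.25, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    row = "  ".join(f"{tok}={p:.1%}" for tok, p in zip(tokens, probs))
    print(f"T={T}: {row}")
```

With these logits, “Paris” gets roughly 55% of the mass at $T = 1$ and roughly 86% at $T = 0.5$; the ranking of the tokens never changes, only how concentrated the distribution is.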

Putting it together

A complete forward pass through a transformer-based LLM, in order:

1. Tokens
2. Embed: X = E[token_ids]
3. Add position info
4. Transformer block 1
5. Transformer block 2
6. ... block L
7. Final hidden state h_n
8. Linear: logits = h_n @ W_out
9. Softmax: p = softmax(logits / T)
10. Sample next token

To generate text, you append the sampled token to the input and run the whole thing again. That is autoregressive decoding.
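
The loop itself is short. In the sketch below, `toy_model` is a hypothetical stand-in for the whole forward pass: it maps the tokens so far to one logit per vocabulary entry, using a made-up rule over a four-token vocabulary. Everything else (the temperature-scaled softmax, the sampling, the append) is the decoding loop described above.

```python
import math
import random


def softmax(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def generate(next_token_logits, tokens, steps, T=1.0, rng=None):
    """Autoregressive decoding: sample a token, append it, run the model again."""
    rng = rng or random.Random(0)  # seeded so the sketch is reproducible
    tokens = list(tokens)
    for _ in range(steps):
        logits = next_token_logits(tokens)  # one full forward pass
        probs = softmax(logits, T)
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_id)
    return tokens


def toy_model(tokens):
    """Hypothetical stand-in for an LLM: strongly prefers (last token + 1) mod 4."""
    logits = [0.0, 0.0, 0.0, 0.0]
    logits[(tokens[-1] + 1) % 4] = 5.0
    return logits


result = generate(toy_model, [0], steps=5, T=0.1)
print(result)  # at this low temperature the toy model counts upward: [0, 1, 2, 3, 0, 1]
```

A real serving stack would not recompute everything from scratch on each step; that is the job of the KV cache mentioned earlier. The loop structure is the same either way.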

