Teju's Blog

Full stack engineer and AI architect. Notes from the work.


Weights: what a language model actually 'knows'

A language model is a function. Text goes in, a probability distribution over the next token comes out. Most of the function is fixed; you cannot change it at inference time. The fixed part is billions of numbers arranged into matrices. Those numbers are the weights.

Almost every other piece of the LLM toolkit (prompts, RAG, fine-tuning) makes more sense once you know what the weights are and what they are not.

What the weights actually are

A transformer is mostly a stack of matrix multiplications. Input tokens get turned into vectors, those vectors get multiplied through a sequence of matrices, and the final output is a vector of logits (one per token in the vocabulary). The matrices are the weights.

[Diagram: Input tokens → Embedding weights → Layer 1 weights → Layer 2 weights → Layer N weights → Output weights → Next-token distribution]
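As a rough sketch of that pipeline, here is a toy forward pass in plain Python. Everything in it is an assumption made for illustration: the four-word vocabulary, the hand-picked numbers, and the names `EMBEDDING`, `LAYER`, `OUTPUT` and `next_token_distribution`. A real transformer also has attention, non-linearities and normalisation, all omitted here. The only point is that the "knowledge" lives in the matrices.

```python
import math

# Toy vocabulary and hand-picked weights; all values are illustrative assumptions.
VOCAB = ["the", "cat", "sat", "mat"]

# Embedding weights: one 2-dimensional vector per token.
EMBEDDING = [
    [1.0, 0.0],   # the
    [0.0, 1.0],   # cat
    [1.0, 1.0],   # sat
    [-1.0, 0.5],  # mat
]

# A single "layer": a 2x2 matrix applied to the token vector.
LAYER = [
    [0.5, -1.0],
    [1.5, 0.5],
]

# Output weights: project the 2-dimensional vector to one logit per token.
OUTPUT = [
    [0.2, -0.5, 1.0, 0.3],
    [-0.4, 0.1, 2.0, 0.6],
]


def matvec(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(vector)))
        for j in range(len(matrix[0]))
    ]


def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(token):
    """Embedding lookup, one layer, output projection, then softmax."""
    vector = EMBEDDING[VOCAB.index(token)]
    hidden = matvec(vector, LAYER)
    logits = matvec(hidden, OUTPUT)
    return dict(zip(VOCAB, softmax(logits)))


probs = next_token_distribution("cat")
```

With these particular numbers, "sat" gets the highest probability after "cat". Change any entry in the matrices and the distribution changes with it.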

The size depends on the model. A 7B model has roughly seven billion numbers; a 70B has roughly seventy billion. Each number is typically stored as a 16-bit float (bfloat16 or fp16), which is two bytes, so a 7B model is around 14GB on disk and a 70B is around 140GB. Those weights have to fit in GPU or system memory to run at a usable speed, which is why running large models locally is a hardware question, not a software one.
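The arithmetic is simple enough to write down. This is a back-of-the-envelope sketch; the function name `weights_size_gb` and the `BYTES_PER_PARAM` table are mine, and the fp32 and int8 rows are added only for comparison. It counts the weights alone and ignores everything else a running model needs memory for.

```python
# Bytes used to store one weight in each format.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weights_size_gb(num_params, dtype="bf16"):
    """Approximate storage for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


size_7b = weights_size_gb(7e9)    # 7B parameters at 16 bits each
size_70b = weights_size_gb(70e9)  # 70B parameters at 16 bits each
```

Under these assumptions `size_7b` comes out at 14.0 and `size_70b` at 140.0, matching the figures above.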

The numbers themselves are continuous and dense. There is no row called “Paris is the capital of France”. There is no neuron that “stores” your favourite movie. The patterns are spread across the weights in a way that nobody can fully interpret.

What they encode (and what they do not)

Weights encode patterns. After enough training, the weights have internalised statistical regularities like “the token ‘Paris’ is highly probable after the prefix ‘The capital of France is’”. They have also internalised “code blocks tend to sit inside triple backticks”, “polite responses often start with ‘I’d be happy to’”, and “after ‘def’ in a Python file, an identifier comes next”.

What they do not store, in any retrievable sense:

  • Specific facts the way a database stores rows. The model can usually produce true facts about topics that were well covered in training, but it is reconstructing them, not looking them up. This is the same mechanism that produces hallucinations when the reconstruction goes wrong.
  • Information from after the training cutoff. Whatever the model was trained on, it has seen. Whatever happened after, it has not. No amount of prompting gets you 2025 news from a model trained in 2023.
  • Information about you specifically. Unless your text was in the training data, the model has never seen you. New facts have to be told to it via the prompt every time.

Weights are the model’s long-term memory, but it is a lossy, fuzzy, undirected memory. A database it is not.

How the weights got that way

Pretraining is one giant optimisation. The model starts with random weights. It is shown a huge amount of text (trillions of tokens). At each step it tries to predict the next token, gets a loss based on how wrong it was, and the weights are nudged in the direction that would have made the prediction less wrong. Backpropagation works out that direction for every weight; gradient descent (in practice, a variant of it) does the nudging. Repeat over trillions of tokens and you get the patterns above.

[Diagram: Training data (trillions of tokens) → Model with weights → Predict next token → Compute loss → Backprop: nudge weights → update, back to the model]
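The loop can be shown at toy scale. The sketch below is a minimal illustration, not how a real model is trained: the "model" is a single vector of logits for one fixed prefix, the observed next tokens are made up, and the names `OBSERVED_NEXT` and `train` are mine. There are no layers to backpropagate through, so the gradient of the cross-entropy loss can be written out directly.

```python
import math
import random

VOCAB = ["Paris", "London", "Rome"]

# Toy "training data": next tokens seen after one fixed prefix,
# e.g. "The capital of France is". The counts are made up for illustration.
OBSERVED_NEXT = ["Paris"] * 8 + ["London", "Rome"]


def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=2000, learning_rate=0.1, seed=0):
    """Stochastic gradient descent on a one-vector 'model'."""
    rng = random.Random(seed)
    # Start from small random weights: one logit per vocabulary token.
    weights = [rng.uniform(-0.1, 0.1) for _ in VOCAB]
    for _ in range(steps):
        target = VOCAB.index(rng.choice(OBSERVED_NEXT))
        probs = softmax(weights)
        # The cross-entropy loss for this example is -log(probs[target]).
        # Its gradient with respect to each logit is (prob - 1) for the
        # target token and prob for every other token.
        for i in range(len(weights)):
            grad = probs[i] - (1.0 if i == target else 0.0)
            weights[i] -= learning_rate * grad  # nudge toward "less wrong"
    return weights


trained = train()
probs = dict(zip(VOCAB, softmax(trained)))
```

After training, "Paris" ends up with most of the probability mass because it dominated the toy data. Nothing in `trained` says "Paris is the capital of France"; there are just three numbers that happen to produce that preference.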

After pretraining, the weights know how to do next-token prediction across the distribution of text the model saw. They do not yet know how to be a chatbot. That comes from a second stage: instruction tuning (and sometimes RLHF), which is itself a form of fine-tuning.

Pretraining a frontier model costs tens of millions of dollars or more. A single inference call (using the model once) costs many orders of magnitude less. This asymmetry is why nobody trains a frontier model from scratch unless they have to.

Why this matters when you build with LLMs

Three operational consequences worth knowing.

The weights are static. Every call to a deployed model uses the same weights. If the model is wrong about something, it will be wrong about that thing on every call until either you change what you send (prompt or context) or the weights themselves are updated (a new fine-tune or a model swap).

The weights are not yours. When you call a hosted model, you are renting access to weights someone else trained. The provider can update them. If you are building on a specific version, pin the version explicitly. Most APIs let you.

Weights interact with context in a specific way. What is in the prompt usually takes precedence over whatever the weights “think”. If the weights say “the latest version of Python is 3.11” but your prompt says “the latest version of Python is 3.13”, the model will generally use 3.13 in its output. Retrieval-augmented generation, when you see it later, is just an elaborate version of this.

A small demonstration

Picture the weights-only distribution for a time-sensitive factual prompt. The weights pick one of several plausible answers, biased by how recent the training cutoff was. The 'correct' answer changes every six months. The weights do not.

If your user asks “what is the latest Python?”, and you want today’s actual answer, there is no single entry in the model you can edit to fix this. You either fine-tune (expensive, slow to update) or you put the current answer into the prompt at runtime (cheap, instant).
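The runtime option can be as small as a string template. The sketch below is one possible shape, not a prescribed format: `build_prompt`, the wording of the instructions and the fixed date are all assumptions, and it stops at building the prompt text rather than calling any particular model API.

```python
from datetime import date


def build_prompt(question, facts, today=None):
    """Prepend current facts to the user's question.

    `facts` is a list of short strings supplied by the application at
    request time; where they come from (a config file, a database, a
    search index) is up to the caller.
    """
    today = today or date.today()
    lines = [f"Today's date: {today.isoformat()}", "Current facts:"]
    lines += [f"- {fact}" for fact in facts]
    lines += [
        "",
        "Answer using the facts above when they are relevant.",
        "",
        f"Question: {question}",
    ]
    return "\n".join(lines)


prompt = build_prompt(
    "What is the latest Python?",
    ["The latest version of Python is 3.13."],  # value from the example above
    today=date(2025, 1, 15),  # placeholder date so the output is reproducible
)
```

Updating the answer means changing the string passed in `facts`, not the model.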

The same logic applies to your internal API, today’s date, recent news, anything your company changed last week. The runtime way in is the prompt; the build-time way is fine-tuning a new set of weights, which most teams never need.

