Fine-tuning LLMs: what it is, when you actually need it

Every team I have worked with on LLM applications has eventually said “we should fine-tune”. Almost none of them needed to; the model was rarely the thing failing.

Most “fine-tune it” instincts are about getting different behaviour out of a model. Almost all of those instincts can be satisfied without touching the weights at all: better prompts, better retrieved context, better tool definitions. This post is about the small set of cases where you actually do need to change the model.

What fine-tuning actually is

Training a base language model from scratch takes thousands of GPUs and months. Fine-tuning is a much smaller version of the same process: take a pre-trained model, run more gradient descent on a much smaller dataset, update the weights.

What changes during fine-tuning is the weights themselves. After it is done, the model’s next-token distributions for prompts in your domain shift toward the patterns in your dataset. The model has not “memorised facts”. It has shifted the distribution.

How it differs from prompts and context

Three ways to change what a model produces, in order of cost:

Prompt engineering (post): minutes of work, changes the wording of the input.
Context engineering / RAG (post): hours to days, changes what information is in the input.
Fine-tuning (this post): days to weeks, changes the model’s weights.

Prompts and context are runtime knobs. The model is unchanged; you change what you send it. Fine-tuning is a build-time knob. You change the model itself.

When to reach for which:

If your problem is “the model formats the output wrong”, a better prompt or a few-shot example fixes it. Do not fine-tune for that.

If your problem is “the model does not know about our internal API”, RAG fixes it. Do not fine-tune for that either.

If your problem is “the model talks in a default friendly assistant voice and we want a clipped, professional tone across thousands of calls”, fine-tuning might be your answer. Prompts can shift tone, but only so far before they start contradicting the model’s training.

When fine-tuning is actually worth it

Five cases where I have seen fine-tuning pay back the investment:

Consistent format or style across all calls. The model produces output in a specific shape every time, without you having to spell it out in every prompt. Saves tokens at runtime too.

A task the base model is bad at. Niche domains (medical coding, legal contract parsing, specific game logs) where the model has thin training data. A few thousand examples of the right answer move it.

Latency and cost. A 7B fine-tuned model can match a 70B base model on a specific task. If you serve millions of calls, that math works out fast.

Format constraint that resists prompting. Some output shapes (a specific structured DSL, an unusual XML dialect) are easier to teach via examples than to specify in words.

Behavior the prompt cannot reliably impose. “Refuse politely if asked about X” works in a system prompt about 80% of the time. Fine-tuning gets you closer to 100% with less prompt overhead.

Outside these cases, the answer is almost always “better prompt or better context, not fine-tune”.

The flavors

Not all fine-tuning is equal.

Full fine-tuning updates every parameter. Highest quality, highest cost. Rarely the right choice anymore.

LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices that get added to the weights. Updates are tiny (often under 1% of the parameter count). Trains faster, uses less memory, deploys easily (the adapter is a small file).

QLoRA is LoRA on a quantised base model. Lets you fine-tune large models on a single GPU. Slight quality loss vs full LoRA, large cost win.

Instruction tuning is fine-tuning specifically on instruction/response pairs to make a base model follow instructions. This is what most “chat” or “instruct” model variants are: a base model plus instruction tuning.

RLHF and DPO are preference-based methods. Instead of feeding the model right answers, you feed it pairs of “this is better than that” and let it learn the preference. Useful for tone, style, and refusal behavior.

For most teams: start with LoRA or QLoRA. Move to full fine-tuning only if evals say you need it.

A practical LoRA recipe

The simplest path is HuggingFace TRL plus PEFT, which handles the boilerplate.

python

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_id = "meta-llama/Llama-3.2-3B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Each row in your dataset is { "messages": [...] } in chat format.
data = load_dataset("json", data_files="train.jsonl", split="train")

lora = LoraConfig(
    r=16,                     # rank; 8-32 is typical
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

cfg = SFTConfig(
    output_dir="./out",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    save_strategy="epoch",
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=data,
    peft_config=lora,
    args=cfg,
)
trainer.train()
trainer.save_model("./out/final")

Three things worth pointing out.

The dataset shape matters more than the model size. A clean, consistent dataset of a few thousand examples beats a noisy dataset of a few hundred thousand. Spend most of your time on the data, not the hyperparameters.

The rank r controls how much capacity the adapter has. 8 for cheap experiments, 16-32 for serious work, 64+ if you have evidence you need it. Doubling r roughly doubles training memory.

The learning rate is the second most-tuned hyperparameter after r. 1e-4 to 5e-4 is the usual range for LoRA. Start at 2e-4 and adjust based on the training-loss curves.

Evals come first

Without evals you cannot tell whether a fine-tune helped or hurt. Put the eval set together before you start training, not after.

A minimum eval setup:

Held-out test set. 100-500 examples not in your training data, same shape as your training data.
Automated scoring where possible. Exact match for classification. Schema match for structured output. ROUGE or BLEU for similarity if you must. Human eval for the rest.
A baseline to beat. The prompt-only version of the same model. Or a smaller model with the same prompt. Or your previous fine-tune.

If your fine-tune does not beat the baseline on the test set, do not ship it.