RAG: retrieval-augmented generation, beyond document chunks

If you have built one AI feature on a hosted model, the second one you tried to build probably needed information the model did not have. Your internal API. Last week’s pricing table. The Slack thread from Tuesday. Retrieval-augmented generation (RAG) is the standard way to put that information in front of the model at inference time without changing the model itself.

The “embed the docs, stuff the top-K into the prompt” version is a one-afternoon demo. The interesting parts (what you embed, how you chunk, how you actually retrieve, and how you check that retrieval is working) are where the demo turns into a system.

The basic pattern

A RAG pipeline has five jobs: turn the query into a vector, find similar vectors in a store, fetch the underlying chunks, build a prompt that includes them, and generate an answer.

1. query: "why is the sky blue?"
2. embedding: [0.12, -0.34, 0.78, ...]
3. vector search: cosine sim · top-K
4. retrieved: 3 chunks · scores 0.91 / 0.84 / 0.71
5. augment prompt: system + docs + question
6. LLM: generate grounded answer
7. answer: "sunlight scatters off air molecules..."

The seven stages of a RAG request. In production every one of these stages has its own knobs and failure modes.

A naive implementation in Go, using whichever embedding model and vector store you like:

func RAG(ctx context.Context, question string) (string, error) {
    qVec, err := embedder.Embed(ctx, question)
    if err != nil {
        return "", err
    }
    hits, err := store.Search(ctx, qVec, 5)
    if err != nil {
        return "", err
    }
    texts := make([]string, len(hits))
    for i, h := range hits {
        texts[i] = h.Text
    }
    prompt := fmt.Sprintf(`Use ONLY the documents below to answer.
If the documents do not contain the answer, say "I do not know".

Documents:
%s

Question: %s
`, strings.Join(texts, "\n\n"), question)
    return llm.Generate(ctx, prompt)
}

About twenty lines. It works. It is also fragile in every interesting way: the chunking, the embedding choice, the retrieval method, the lack of re-ranking, the unbounded context, the missing eval. The rest of this post takes each of those in turn.

RAG is not only about documents

The word “document” carries baggage. People hear it and think PDF or wiki page. Most useful RAG systems retrieve other things.

Code. Embed code chunks (functions, classes, files) so a coding agent can find relevant existing implementations before writing new ones. The chunking has to respect language structure (split on functions, not at character N).

Conversation history. For long-running agents, the conversation transcript itself becomes a retrieval corpus. The agent retrieves earlier turns that relate to the current task instead of dragging the entire history along in context.

Agent memory. Tools, prior actions, observed results. An autonomous agent that runs for hundreds of steps quickly fills its context window; embedding past observations and retrieving the relevant ones per step is the same RAG pattern applied to the agent’s own history.

Structured data. Embed table rows, JSON objects, or graph nodes. The retrieval target is a record, not a passage. The augmentation step formats the records as readable text for the model.

Multi-modal. Images embedded with a model like CLIP, or audio and video chunks embedded with a comparable multi-modal encoder. In the typical setup the query is text and the retrieved items are images. Same machinery.

API specs and schemas. When the model needs to call an API it has not seen before, retrieve the relevant endpoint specs. Cheaper than fine-tuning, more flexible than hardcoding.

In every case it is the same machinery: pick a unit of retrieval, embed it, store it, retrieve top-K at query time, format and include in the prompt.
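To make the shared machinery concrete, here is a minimal sketch of the retrieve-top-K step as a brute-force cosine search over an in-memory slice. The `Item` type, the `TopK` function, and the two-dimensional toy vectors are all assumptions for illustration; a real system would use an embedding model and a vector store instead.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Item is a hypothetical retrieval unit: a function, a table row, a
// conversation turn, anything with an ID, display text, and an embedding.
type Item struct {
	ID   string
	Text string
	Vec  []float64
}

// cosine returns the cosine similarity of a and b. It returns 0 when the
// lengths differ or either vector has zero norm.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// TopK returns the k items most similar to query, best first.
func TopK(items []Item, query []float64, k int) []Item {
	type scored struct {
		item  Item
		score float64
	}
	all := make([]scored, len(items))
	for i, it := range items {
		all[i] = scored{it, cosine(it.Vec, query)}
	}
	sort.SliceStable(all, func(i, j int) bool { return all[i].score > all[j].score })
	if k > len(all) {
		k = len(all)
	}
	if k < 0 {
		k = 0
	}
	out := make([]Item, k)
	for i := range out {
		out[i] = all[i].item
	}
	return out
}

func main() {
	// Three different kinds of retrieval unit living in one index.
	items := []Item{
		{ID: "fn:ParseConfig", Text: "func ParseConfig(...)", Vec: []float64{1, 0}},
		{ID: "row:42", Text: "plan=pro price=20", Vec: []float64{0, 1}},
		{ID: "turn:7", Text: "user asked about pricing", Vec: []float64{0.6, 0.8}},
	}
	for _, it := range TopK(items, []float64{0, 1}, 2) {
		fmt.Println(it.ID, it.Text)
	}
}
```

The point of the sketch is that nothing in `TopK` cares what the item is. Only the chunking step before it and the formatting step after it change per data type.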

Chunking is harder than it looks

The chunk plays two roles, and those two roles want different things. Retrieval wants chunks small enough that a single chunk is about one thing (so the embedding is meaningful). The model wants chunks big enough to contain a complete thought (so it can use them).

Five chunking strategies, in rough order from “obvious” to “actually good”.

Fixed-size with overlap

Split text every N tokens (often 200-500), with a token overlap (often 10-20%) so context near boundaries is not lost. The classic LangChain / LlamaIndex default.

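// FixedChunks splits text into windows of size tokens, each sharing overlap
// tokens with the previous window. tokenizer is whichever tokenizer you use.
func FixedChunks(text string, size, overlap int) []string {
    if size <= 0 {
        return nil
    }
    if overlap < 0 || overlap >= size {
        overlap = 0 // keep the stride (size - overlap) positive, or the loop never ends
    }
    tokens := tokenizer.Encode(text)
    var chunks []string
    for i := 0; i < len(tokens); i += size - overlap {
        end := i + size
        if end > len(tokens) {
            end = len(tokens)
        }
        chunks = append(chunks, tokenizer.Decode(tokens[i:end]))
        if end == len(tokens) {
            break // text fully covered; another window would only repeat the overlap
        }
    }
    return chunks
}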

Easy to implement, terrible at respecting natural boundaries. Sentences get split mid-clause, headings get separated from their bodies, code gets sliced at random.

Structure-aware

Split on the document’s actual structure. Markdown headings, HTML tags, code function boundaries, paragraph breaks. Then if a section is too big, split it further (fall back to size-based within that section).

func MarkdownChunks(text string, maxTokens int) []string {
    var chunks []string
    for _, s := range splitOnHeadings(text) {
        if tokenCount(s) <= maxTokens {
            chunks = append(chunks, s)
        } else {
            chunks = append(chunks, FixedChunks(s, maxTokens, 40)...)
        }
    }
    return chunks
}

For wiki-shaped content and code, this is the strongest default. Each chunk is a coherent unit.

Recursive

Try the largest natural separator first (double newline), and if any resulting chunk is still too big, split it on the next-largest separator (single newline), and so on down to characters. The “fallback ladder”.

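// separators run from coarsest to finest; the empty string splits into single characters.
var separators = []string{"\n\n", "\n", ". ", " ", ""}

// RecursiveChunks splits text into pieces of at most maxSize bytes (len counts
// bytes, not tokens), trying each separator in turn. The separator that falls
// on a chunk boundary is dropped.
func RecursiveChunks(text string, maxSize int) []string {
    if len(text) <= maxSize {
        return []string{text}
    }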
    for _, sep := range separators {
        if !strings.Contains(text, sep) {
            continue
        }
        parts := strings.Split(text, sep)
        var chunks []string
        buf := ""
        for _, p := range parts {
            candidate := p
            if buf != "" {
                candidate = buf + sep + p
            }
            if len(candidate) <= maxSize {
                buf = candidate
                continue
            }
            if buf != "" {
                chunks = append(chunks, buf)
            }
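            // With the empty separator, p is a single character and cannot be
            // split further; recursing on it would never terminate.
            if len(p) <= maxSize || sep == "" {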
                buf = p
            } else {
                buf = ""
                chunks = append(chunks, RecursiveChunks(p, maxSize)...)
            }
        }
        if buf != "" {
            chunks = append(chunks, buf)
        }
        return chunks
    }
    return []string{text[:maxSize]}
}

A reasonable middle ground when you do not know the document structure in advance.

Semantic

Embed every sentence, then group consecutive sentences whose embeddings are close. A new chunk starts when the cosine similarity between the running average and the next sentence drops below some threshold. The idea is that chunk boundaries follow topic shifts, not character counts.

Higher quality on essays and long-form articles. Slower (you embed once to chunk, then embed each chunk again to store). Worth it when retrieval quality matters and the corpus is small enough that the extra embedding cost is fine.
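A minimal sketch of the grouping rule described above, assuming the text is already split into sentences and that an embedding call is available. `EmbedFunc`, `SemanticChunks`, and the keyword-counting `toyEmbed` are hypothetical names for illustration; the toy embedder stands in for a real model and must return vectors of a fixed dimension.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// EmbedFunc is a stand-in for a real embedding model call (hypothetical
// signature). It must always return vectors of the same length.
type EmbedFunc func(sentence string) []float64

// cosine returns the cosine similarity of a and b, or 0 for mismatched
// lengths or zero-norm vectors.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// SemanticChunks groups consecutive sentences into chunks. A new chunk starts
// when the similarity between the current chunk's running average and the
// next sentence drops below threshold.
func SemanticChunks(sentences []string, embed EmbedFunc, threshold float64) []string {
	var chunks []string
	var current []string
	var sum []float64 // running sum of the embeddings in the current chunk
	for _, s := range sentences {
		v := embed(s)
		// Cosine ignores scale, so comparing against the sum gives the
		// same result as comparing against the average.
		if len(current) > 0 && cosine(sum, v) < threshold {
			chunks = append(chunks, strings.Join(current, " "))
			current, sum = nil, nil
		}
		if sum == nil {
			sum = make([]float64, len(v))
		}
		for i := range v {
			sum[i] += v[i]
		}
		current = append(current, s)
	}
	if len(current) > 0 {
		chunks = append(chunks, strings.Join(current, " "))
	}
	return chunks
}

// toyEmbed is NOT a real embedding: it counts two groups of keywords so the
// example runs without a model.
func toyEmbed(s string) []float64 {
	s = strings.ToLower(s)
	return []float64{
		float64(strings.Count(s, "sky") + strings.Count(s, "light")),
		float64(strings.Count(s, "price") + strings.Count(s, "plan")),
	}
}

func main() {
	sentences := []string{
		"The sky is blue.",
		"Blue light scatters most.",
		"The pro plan has a new price.",
		"Each plan is billed monthly.",
	}
	for i, c := range SemanticChunks(sentences, toyEmbed, 0.5) {
		fmt.Printf("chunk %d: %s\n", i, c)
	}
}
```

The threshold is the knob to tune: too high and every sentence becomes its own chunk, too low and the whole document collapses into one.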

Hierarchical

Embed at multiple granularities. Store sentence-level chunks for precise retrieval, paragraph-level chunks for context. At query time, retrieve sentences, then fetch their parent paragraphs to give the model.

The “parent-document retrieval” pattern in LangChain. Best of both worlds: small embedding units (better matching) and large context units (better generation). Costs you N+M embeddings instead of N.
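The retrieve-small, return-large step can be sketched as follows. This is an illustration under assumptions, not LangChain's implementation: the `Child` type, the `RetrieveParents` function, and the toy vectors are hypothetical, and the brute-force ranking stands in for a real vector store.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Child is a small, precisely embedded unit (for example a sentence) that
// remembers which larger parent (for example a paragraph) it came from.
type Child struct {
	ParentID string
	Text     string
	Vec      []float64
}

// cosine returns the cosine similarity of a and b, or 0 for mismatched
// lengths or zero-norm vectors.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// RetrieveParents ranks children against the query, takes the top k, and
// returns the distinct parent texts in order of first appearance.
func RetrieveParents(children []Child, parents map[string]string, query []float64, k int) []string {
	ranked := make([]Child, len(children))
	copy(ranked, children)
	sort.SliceStable(ranked, func(i, j int) bool {
		return cosine(ranked[i].Vec, query) > cosine(ranked[j].Vec, query)
	})
	if k > len(ranked) {
		k = len(ranked)
	}
	if k < 0 {
		k = 0
	}
	seen := map[string]bool{}
	var out []string
	for _, c := range ranked[:k] {
		if seen[c.ParentID] {
			continue // several matching sentences share one paragraph
		}
		seen[c.ParentID] = true
		if text, ok := parents[c.ParentID]; ok {
			out = append(out, text)
		}
	}
	return out
}

func main() {
	parents := map[string]string{
		"p1": "Full paragraph about Rayleigh scattering.",
		"p2": "Full paragraph about pricing.",
	}
	children := []Child{
		{ParentID: "p1", Text: "Blue light scatters most.", Vec: []float64{1, 0}},
		{ParentID: "p1", Text: "Air molecules scatter sunlight.", Vec: []float64{0.9, 0.1}},
		{ParentID: "p2", Text: "The pro plan costs more.", Vec: []float64{0, 1}},
	}
	for _, p := range RetrieveParents(children, parents, []float64{1, 0}, 2) {
		fmt.Println(p)
	}
}
```

Deduplicating by parent matters: the top few sentences often come from the same paragraph, and sending that paragraph twice wastes context.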

Embedding choices that matter

Three knobs.

The model. Pick from the leaderboards (MTEB is the standard). For most English-only work, text-embedding-3-small (OpenAI), voyage-3-lite (Voyage), or bge-base-en-v1.5 (open-weight) are fine defaults. For multilingual, text-embedding-3-large or a multilingual BGE variant.

The dimension. Older models are typically 768-dim. Newer ones are 1536, 3072, or higher. Bigger embeddings capture more nuance, but they cost more storage and make search slower. Some modern models support “matryoshka” dimensions: the model is trained at, say, 3072 dims in a way that lets you truncate to 256 with only a small quality loss. Start at 768 or 1024 unless you measure a real benefit at higher.
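A minimal sketch of client-side truncation, assuming the embedding model was trained with matryoshka-style objectives (truncating an arbitrary model's vectors this way will usually hurt quality). `Truncate` and `BytesPerVector` are hypothetical helper names, and the storage figure assumes 4-byte float32 components.

```go
package main

import (
	"fmt"
	"math"
)

// Truncate keeps the first dims components of vec and rescales the result to
// unit length, so dot-product search still behaves like cosine similarity.
func Truncate(vec []float64, dims int) []float64 {
	if dims > len(vec) {
		dims = len(vec)
	}
	if dims < 0 {
		dims = 0
	}
	out := make([]float64, dims)
	copy(out, vec[:dims])
	var norm float64
	for _, x := range out {
		norm += x * x
	}
	norm = math.Sqrt(norm)
	if norm == 0 {
		return out // nothing to rescale
	}
	for i := range out {
		out[i] /= norm
	}
	return out
}

// BytesPerVector is the raw storage for one vector stored as float32.
func BytesPerVector(dims int) int {
	return dims * 4
}

func main() {
	short := Truncate([]float64{3, 4, 12}, 2)
	fmt.Println(short)

	// Storage per vector at full and truncated size.
	fmt.Println(BytesPerVector(3072), BytesPerVector(256)) // prints "12288 1024"
}
```

At one million vectors that is roughly 12.3 GB against 1.0 GB of raw vector data, before any index overhead. Some hosted embedding APIs accept a dimensions parameter that performs the truncation server-side; check your provider's documentation before doing it by hand.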

What text you actually pass to the model. For documents: usually the chunk text plus any helpful context (the document title, the section heading). For queries: this is where the next section gets interesting.

What to embed: HyDE and friends

The naive move is to embed the user’s query and search with that. The problem: queries and documents look different. A user types “sky blue why?”. A document says “Light from the sun appears white, but it is made up of all the colors of the rainbow. When sunlight enters the atmosphere…”

In embedding space, terse questions tend to cluster in one region and prose documents in another. The cosine similarity between them can be poor even when they are about the same thing. Many embedding models are trained mostly on document-shaped text, so short queries sit close to out of distribution.

Three workarounds.

HyDE: embed the answer instead

Gao et al., “Precise Zero-Shot Dense Retrieval without Relevance Labels”, 2022. Ask the LLM to generate a hypothetical answer to the user’s question, then embed the answer and use it as the search vector.

func HyDERetrieve(ctx context.Context, question string, k int) ([]Chunk, error) {
    hypothetical, err := llm.Generate(ctx, "Write a short paragraph answering: "+question)
    if err != nil {
        return nil, err
    }
    hVec, err := embedder.Embed(ctx, hypothetical)
    if err != nil {
        return nil, err
    }
    return store.Search(ctx, hVec, k)
}

The hypothetical answer is in the same shape as the documents, so its embedding lands in the document region of the space. Even when the hypothetical is partially wrong (the LLM hallucinates), the embedding still tends to land near real documents about the right topic.

An illustrative 2D embedding space. The query q sits far from the relevant docs (blue) because it is shaped like a query, not a document. The hypothetical answer h, generated by the LLM and embedded, lands inside the relevant cluster.

Cost: one extra LLM call per query (and the extra latency). Win: meaningful accuracy bump on queries that are much shorter than the documents you are searching.

Query rewriting

Same family of idea: rewrite the query into something more document-shaped. “What is the cheapest way to host a Postgres database?” → “comparison of postgres hosting providers by price”. One small LLM call. Cheaper than HyDE and helps on the same class of problems.

Multi-query

Generate several different rephrasings of the question, embed each, union the retrieval results. Costs you more API calls, gives you recall on questions where the right answer was retrieved only by one of the phrasings.

func MultiQuery(ctx context.Context, question string, n, k int) ([]Chunk, error) {
    prompt := fmt.Sprintf("Write %d different ways to phrase this query, one per line: %s", n, question)
    out, err := llm.Generate(ctx, prompt)
    if err != nil {
        return nil, err
    }
    // Always search with the original question too; skip blank lines in the LLM output.
    queries := []string{question}
    for _, line := range strings.Split(out, "\n") {
        if line = strings.TrimSpace(line); line != "" {
            queries = append(queries, line)
        }
    }
    // Union the hits by chunk ID, keeping first-seen order so the result is
    // deterministic (ranging over a map would return chunks in random order).
    seen := map[string]bool{}
    var merged []Chunk
    for _, q := range queries {
        vec, err := embedder.Embed(ctx, q)
        if err != nil {
            return nil, err
        }
        hits, err := store.Search(ctx, vec, k)
        if err != nil {
            return nil, err
        }
        for _, h := range hits {
            if !seen[h.ID] {
                seen[h.ID] = true
                merged = append(merged, h)
            }
        }
    }
    return merged, nil
}

Useful when recall matters more than latency (offline summarisation, research agents).

Retrieval: dense, sparse, hybrid

The vector-similarity retrieval everyone calls “RAG” is dense retrieval. It works well when the query and the documents share semantic meaning even if they share no words. It fails when the user’s query contains specific tokens that need to match exactly (a product SKU, an error code, an unusual name).

For exact-match-y queries, the classic algorithm is sparse retrieval: BM25 (Best Matching 25), an evolution of TF-IDF. It scores documents by how many query terms they contain, weighted by how rare those terms are in the corpus.
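To show what “how many query terms, weighted by rarity” means in practice, here is a minimal BM25 sketch. It assumes whitespace tokenization, the common defaults k1 = 1.2 and b = 0.75, and the non-negative IDF variant ln(1 + (N - n + 0.5) / (n + 0.5)); production engines add stemming, stop words, and an inverted index. `BM25Scores` and `tokenize` are illustrative names.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

const (
	bm25K1 = 1.2  // how quickly repeated occurrences of a term stop adding score
	bm25B  = 0.75 // how strongly long documents are penalized
)

// tokenize lower-cases and splits on whitespace. Real systems do much more.
func tokenize(s string) []string {
	return strings.Fields(strings.ToLower(s))
}

// BM25Scores returns one score per document for the given query.
func BM25Scores(docs []string, query string) []float64 {
	n := float64(len(docs))
	tf := make([]map[string]int, len(docs)) // term frequency per document
	df := map[string]int{}                  // number of documents containing each term
	lens := make([]float64, len(docs))
	var totalLen float64
	for i, d := range docs {
		tf[i] = map[string]int{}
		toks := tokenize(d)
		for _, t := range toks {
			if tf[i][t] == 0 {
				df[t]++
			}
			tf[i][t]++
		}
		lens[i] = float64(len(toks))
		totalLen += lens[i]
	}
	scores := make([]float64, len(docs))
	if len(docs) == 0 || totalLen == 0 {
		return scores
	}
	avgLen := totalLen / n
	for _, term := range tokenize(query) {
		nt := float64(df[term])
		if nt == 0 {
			continue // term appears nowhere in the corpus
		}
		idf := math.Log(1 + (n-nt+0.5)/(nt+0.5)) // rare terms weigh more
		for i := range docs {
			f := float64(tf[i][term])
			if f == 0 {
				continue
			}
			norm := 1 - bm25B + bm25B*lens[i]/avgLen
			scores[i] += idf * f * (bm25K1 + 1) / (f + bm25K1*norm)
		}
	}
	return scores
}

func main() {
	docs := []string{
		"error E1234 while connecting to database",
		"the database connection guide",
		"how to reset your password",
	}
	for i, s := range BM25Scores(docs, "E1234 database") {
		fmt.Printf("doc %d: %.3f\n", i, s)
	}
}
```

The error code appears in one document only, so it carries a high IDF and dominates the score. That exact-token behaviour is what dense retrieval tends to miss.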

Hybrid retrieval runs both and combines their results. The standard combination is Reciprocal Rank Fusion (RRF):

$$\text{RRF}(d) = \sum_{r \in \text{retrievers}} \frac{1}{k_{\text{rrf}} + \text{rank}_r(d)}$$

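where $k_{\text{rrf}}$ is a constant (usually 60) that damps the gap between adjacent ranks, so no single first place dominates, and $\text{rank}_r(d)$ is the 1-based position of $d$ in retriever $r$’s list. A retriever that did not return $d$ contributes nothing to the sum. Each retriever votes with a rank rather than a raw score, so BM25 scores and cosine similarities never need a common scale. A document that ranks well in either list gets a bump, and a document that ranks well in both rises to the top.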
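The formula translates almost directly into code. A minimal sketch, assuming each retriever returns a list of document IDs ordered best first with no duplicates; the `RRF` function name and the alphabetical tie-break are choices made for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// RRF fuses several ranked lists of document IDs into one list, best first.
// k is the smoothing constant from the formula (commonly 60).
func RRF(rankings [][]string, k float64) []string {
	scores := map[string]float64{}
	for _, ranking := range rankings {
		for i, id := range ranking {
			scores[id] += 1 / (k + float64(i+1)) // ranks are 1-based
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		if scores[ids[i]] != scores[ids[j]] {
			return scores[ids[i]] > scores[ids[j]]
		}
		return ids[i] < ids[j] // deterministic order for ties
	})
	return ids
}

func main() {
	dense := []string{"a", "b", "c"}
	sparse := []string{"c", "a", "d"}
	fmt.Println(RRF([][]string{dense, sparse}, 60)) // prints "[a c b d]"
}
```

Here `a` (ranks 1 and 2) scores 1/61 + 1/62 and `c` (ranks 3 and 1) scores 1/63 + 1/61, so both outrank `b` and `d`, which each appear in only one list.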

Engineering-wise: Elasticsearch and OpenSearch have built-in BM25. pgvector plus a tsvector column gets you dense and keyword search in one Postgres database (the built-in full-text ranking is not BM25, but it fills the same exact-match role). Specialist vector DBs (Pinecone, Weaviate, Qdrant) usually offer hybrid search as a built-in option. Use it. Often the single largest improvement a RAG system can make is “we added BM25 alongside our dense retrieval”.

Re-ranking

Initial retrieval is fast and recall-oriented. You ask for top 20 or top 50 and you get back the candidates closest in embedding space. Some of them are great; some of them snuck in because the embedding happened to be close even though the semantic match is poor.

A re-ranker is a second-pass model that takes each retrieved chunk plus the query and scores the pair directly. Usually a cross-encoder: the query and the document are concatenated and run through a transformer that outputs a relevance score. Slower than dense retrieval (you cannot precompute it), but much more accurate per pair because the model gets to look at both texts together.

Standard pattern: retrieve 20-50 with the dense+sparse pipeline, re-rank with a cross-encoder, pass the top 3-5 to the LLM. Cohere Rerank, BAAI/bge-reranker-large, and mxbai-rerank-large-v1 are the usual choices.

func RetrieveAndRerank(ctx context.Context, question string, kInitial, kFinal int) ([]Chunk, error) {
    qVec, err := embedder.Embed(ctx, question)
    if err != nil {
        return nil, err
    }
    candidates, err := store.Search(ctx, qVec, kInitial)
    if err != nil {
        return nil, err
    }
    pairs := make([]Pair, len(candidates))
    for i, c := range candidates {
        pairs[i] = Pair{Query: question, Doc: c.Text}
    }
    scores, err := reranker.Score(ctx, pairs) // cross-encoder
    if err != nil {
        return nil, err
    }
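    if len(scores) != len(candidates) {
        return nil, fmt.Errorf("reranker returned %d scores for %d candidates", len(scores), len(candidates))
    }
    // Sort a slice of indices by score, then reorder candidates. Sorting
    // candidates directly would swap chunks without swapping their scores,
    // so later comparisons would read the wrong score.
    order := make([]int, len(candidates))
    for i := range order {
        order[i] = i
    }
    sort.SliceStable(order, func(a, b int) bool { return scores[order[a]] > scores[order[b]] })
    ranked := make([]Chunk, len(candidates))
    for i, idx := range order {
        ranked[i] = candidates[idx]
    }
    candidates = ranked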
    if kFinal > len(candidates) {
        kFinal = len(candidates)
    }
    return candidates[:kFinal], nil
}

The cost is one reranker forward pass per (query, candidate) pair. For 30 candidates and a fast reranker, that is typically in the range of 100-300ms. Worth it almost every time you care about quality.

What goes wrong, in production

Five failure modes I see often.

The right chunk was not in top-K. Recall failure. Sometimes the chunking split the answer across two chunks and neither carried enough signal alone. Sometimes the embedding model genuinely thought another chunk was closer. Fix: hybrid retrieval, multi-query, re-ranking, larger K.

The right chunk was in top-K but the model ignored it. Generation failure. Often the chunk was buried in the middle of a long context window where the model recalls badly (the “lost in the middle” effect from the context post). Fix: shorter contexts, put critical info at the start and end, re-rank so the best chunk is first.

Contradictory chunks. Two retrieved docs disagree. The model picks one without telling you. Fix: at retrieval time, if the top results contradict each other on any extracted facts, surface that to the user instead of generating a confident single answer.

Stale data. The corpus was indexed in March; the user is asking about a June change. The retrieval works, but the retrieved content is wrong. Fix: include last_updated in metadata, decay scores for old docs, re-embed regularly.

Confidently invented citations. The model cites a document ID but the cited content does not actually support the claim. Fix: extract the cited spans and check they exist in the retrieved chunks; if not, mark the citation as unverified.
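The citation check can be sketched as a plain substring test. This assumes the model is prompted to return each citation as a chunk ID plus a verbatim quote; the `Citation` type and `VerifyCitations` function are hypothetical. It catches quotes that do not exist in the cited chunk, but not a real quote that fails to support the claim; that still needs a faithfulness eval.

```go
package main

import (
	"fmt"
	"strings"
)

// Citation is a hypothetical structured citation extracted from the model's
// answer: which chunk it points at and the span it claims to quote.
type Citation struct {
	ChunkID string
	Quote   string
}

// normalize lower-cases and collapses whitespace so that line breaks and
// spacing differences do not cause false mismatches.
func normalize(s string) string {
	return strings.Join(strings.Fields(strings.ToLower(s)), " ")
}

// VerifyCitations reports, for each citation, whether the quoted span
// actually occurs in the cited chunk. chunks maps chunk ID to chunk text and
// should contain only the chunks that were passed to the model.
func VerifyCitations(cites []Citation, chunks map[string]string) []bool {
	verified := make([]bool, len(cites))
	for i, c := range cites {
		text, found := chunks[c.ChunkID]
		quote := normalize(c.Quote)
		verified[i] = found && quote != "" && strings.Contains(normalize(text), quote)
	}
	return verified
}

func main() {
	chunks := map[string]string{
		"c1": "Blue light is scattered more than red light.\nThis is Rayleigh scattering.",
	}
	cites := []Citation{
		{ChunkID: "c1", Quote: "this is rayleigh scattering"},
		{ChunkID: "c1", Quote: "this is Mie scattering"},
		{ChunkID: "c9", Quote: "blue light"},
	}
	for i, ok := range VerifyCitations(cites, chunks) {
		status := "verified"
		if !ok {
			status = "unverified"
		}
		fmt.Printf("citation %d (%s): %s\n", i, cites[i].ChunkID, status)
	}
}
```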

Evals are the only way to know

RAG is two systems chained, and you need to eval each of them separately:

Retrieval metrics.

Recall@k: of all the relevant documents for a query, what fraction did you retrieve in the top K? Requires labelled (query, relevant-docs) pairs.
MRR (Mean Reciprocal Rank): averaged $\frac{1}{\text{rank of first relevant doc}}$ across queries. Rewards getting the right answer high in the list.
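NDCG (Normalized Discounted Cumulative Gain): scores the whole ranked list rather than only the first hit. Each relevant document adds gain discounted by its position, and relevance can be graded (some docs more relevant than others).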
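The three retrieval metrics are short enough to write yourself. A minimal sketch, assuming retrieved lists hold unique document IDs ordered best first, and using the linear-gain form of DCG with a log2 position discount; the function names are illustrative.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// RecallAtK is the fraction of relevant documents found in the top k results.
func RecallAtK(retrieved []string, relevant map[string]bool, k int) float64 {
	if len(relevant) == 0 {
		return 0
	}
	hits := 0
	for i, id := range retrieved {
		if i >= k {
			break
		}
		if relevant[id] {
			hits++
		}
	}
	return float64(hits) / float64(len(relevant))
}

// ReciprocalRank is 1/rank of the first relevant result, or 0 if there is none.
// MRR is the mean of this value over all queries.
func ReciprocalRank(retrieved []string, relevant map[string]bool) float64 {
	for i, id := range retrieved {
		if relevant[id] {
			return 1 / float64(i+1)
		}
	}
	return 0
}

// NDCGAtK divides the discounted gain of the actual ranking by that of the
// ideal ranking. grades maps document ID to a relevance grade (0 if absent).
func NDCGAtK(retrieved []string, grades map[string]float64, k int) float64 {
	dcg := 0.0
	for i, id := range retrieved {
		if i >= k {
			break
		}
		dcg += grades[id] / math.Log2(float64(i+2)) // position 1 has discount log2(2) = 1
	}
	ideal := make([]float64, 0, len(grades))
	for _, g := range grades {
		ideal = append(ideal, g)
	}
	sort.Sort(sort.Reverse(sort.Float64Slice(ideal)))
	idcg := 0.0
	for i, g := range ideal {
		if i >= k {
			break
		}
		idcg += g / math.Log2(float64(i+2))
	}
	if idcg == 0 {
		return 0
	}
	return dcg / idcg
}

func main() {
	retrieved := []string{"d1", "d2", "d3"}
	relevant := map[string]bool{"d2": true, "d4": true}
	grades := map[string]float64{"d2": 3, "d4": 1}

	fmt.Println(RecallAtK(retrieved, relevant, 3))   // prints "0.5"
	fmt.Println(ReciprocalRank(retrieved, relevant)) // prints "0.5"
	fmt.Printf("%.3f\n", NDCGAtK(retrieved, grades, 3))
}
```

In the example one of the two relevant documents is retrieved (recall 0.5) and it sits at rank 2 (reciprocal rank 0.5). Averaging `ReciprocalRank` over the labelled query set gives MRR.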

Generation metrics.

Faithfulness: are the model’s claims actually supported by the retrieved chunks? Usually measured by another LLM that grades the answer against the sources.
Answer relevance: does the answer address the question that was asked? Cheap LLM-as-judge call.
Context precision: of the chunks passed to the model, how many were used? Low context precision means most of what you pass is dead weight, so you can shrink K and save tokens.

Set up the eval before you tune anything. Five to ten labelled queries are enough to start; the gradient of “is my latest change better or worse” is what matters, not the absolute number. The frameworks worth looking at: Ragas, TruLens, or write your own (often the right call once you understand what you are measuring).