RAG — Retrieval-Augmented Generation

When the model doesn't have your data in its head, you fetch it from a vector store or full-text search and hand it to the model. RAG is a pattern, not a product.

What it is

RAG is the pattern "find relevant context → drop it into the prompt →
let the model answer from that". Not a new generation of AI, not magic.
A workflow.
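
The whole workflow fits in a dozen lines. A minimal sketch; the
retrieve and llm stubs below are hypothetical stand-ins for your
search layer and model client:

    # Minimal RAG loop. `retrieve` and `llm` are placeholder stubs:
    # swap in your real search layer and model client.
    def retrieve(question: str, k: int = 5) -> list[str]:
        return ["(chunk about your data)"] * k      # stub: a real store goes here

    def llm(prompt: str) -> str:
        return "(model answer)"                     # stub: a real API call goes here

    def answer(question: str) -> str:
        context = "\n\n".join(retrieve(question))   # find relevant context
        prompt = (                                  # drop it into the prompt
            "Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        return llm(prompt)                          # let the model answer from it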

It exists because LLMs:

  • have a fixed cutoff date,
  • can't reliably answer questions about your data they've never seen,
  • hallucinate when they have nothing to grip onto.

How it works

  1. Indexing. Cut documents into chunks (typically 500–1500 tokens),
     embed each one, store in a vector store (pgvector, Qdrant, LanceDB).
  2. Retrieval. Embed the query, fetch the top-K closest chunks. Often
     hybrid with BM25 full-text, because embeddings alone lose on
     keywords and numbers (steps 1–2 are sketched after this list).
  3. Reranker (optional). A cross-encoder re-scores results by actual
     relevance. Without it your top-10 is often "similar-sounding" chunks,
     not answers.
  4. Generation. Top-N chunks (typically 4–8) go into the prompt as
     context next to the original question (steps 3–4 get a second
     sketch below).
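
Steps 1–2 in Python, as a sketch. Assumes the sentence-transformers
and rank_bm25 packages; the in-memory numpy "store" stands in for
pgvector/Qdrant/LanceDB, and the model name is just one common
default, not a recommendation:

    # Steps 1-2: chunk, embed, index, then hybrid dense + BM25 retrieval.
    import numpy as np
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
        # Naive character-based splitting. Production code usually splits
        # on headings/paragraphs and counts tokens, not characters.
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

    # 1. Indexing: chunk every document, embed every chunk.
    docs = ["...your documents..."]
    chunks = [c for d in docs for c in chunk(d)]
    vectors = encoder.encode(chunks, normalize_embeddings=True)   # shape (n, dim)
    bm25 = BM25Okapi([c.lower().split() for c in chunks])

    # 2. Retrieval: blend cosine similarity with BM25 keyword scores.
    def retrieve(query: str, k: int = 10, alpha: float = 0.5) -> list[str]:
        q = encoder.encode([query], normalize_embeddings=True)[0]
        dense = vectors @ q                            # cosine, since normalized
        sparse = bm25.get_scores(query.lower().split())
        if sparse.max() > 0:
            sparse = sparse / sparse.max()             # crude score normalization
        scores = alpha * dense + (1 - alpha) * sparse
        return [chunks[i] for i in np.argsort(scores)[::-1][:k]]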

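Steps 3–4, continuing the snippet above. The cross-encoder checkpoint
named here is a common public one, used only as an illustration:

    # Steps 3-4: cross-encoder rerank, then prompt assembly.
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def rerank(query: str, candidates: list[str], n: int = 5) -> list[str]:
        # 3. Score each (query, chunk) pair for actual relevance.
        scores = reranker.predict([(query, c) for c in candidates])
        ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
        return [c for _, c in ranked[:n]]

    def build_prompt(query: str) -> str:
        # 4. Top-N reranked chunks go into the prompt next to the question.
        top = rerank(query, retrieve(query, k=10), n=5)
        context = "\n\n---\n\n".join(top)
        return (
            "Answer using only the context below. "
            "If the context is not enough, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}"
        )
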
When it makes sense (and when it doesn't)

Yes: company knowledge base, documentation, legal or medical
archive, data that keeps changing.

No: creative tasks without ground truth, questions solvable by a
single SQL JOIN, cases where a long context window (1M-token models) is enough.

Common mistakes

  • Sentence-level chunking. You lose "what's the subject of this sentence".
  • Embedding only, no BM25. Embeddings can't tell "GPT-5" from "GPT-4".
  • No reranker. Top-K embedding similarity ≠ top-K relevance.
  • No evals. If you can't tell whether RAG beats the raw model, you
    might as well turn it off (a minimal harness is sketched below).
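
A minimal eval harness, as a sketch. The gold questions, the substring
check, and the rag_answer / raw_answer stubs are all hypothetical
placeholders for your own data and model calls:

    # Compare RAG against the raw model on a small gold set.
    GOLD = [  # hypothetical examples: (question, string the answer must contain)
        ("When did we ship v2?", "March 2024"),
        ("Who owns the billing service?", "payments team"),
    ]

    def rag_answer(q: str) -> str:
        return ""   # stub: retrieval + generation goes here

    def raw_answer(q: str) -> str:
        return ""   # stub: the bare model, no retrieval

    def accuracy(answer_fn) -> float:
        hits = sum(want.lower() in answer_fn(q).lower() for q, want in GOLD)
        return hits / len(GOLD)

    print(f"RAG: {accuracy(rag_answer):.0%}  raw: {accuracy(raw_answer):.0%}")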

What to remember

RAG is tooling around the prompt. It doesn't solve tasks the model
can't do. It solves tasks the model can do but has no data for.