RAG — Retrieval-Augmented Generation

When the model doesn't have your data in its head, you fetch it from a vector store or full-text search and hand it to the model. RAG is a pattern, not a product.

What it is

RAG is the pattern "find relevant context → drop it into the prompt →
let the model answer from that". Not a new generation of AI, not magic.
A workflow.
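
The whole workflow fits in a dozen lines. A minimal sketch; the
retrieve and llm stubs below are hypothetical stand-ins for your
search layer and model client:

    # Minimal RAG loop. `retrieve` and `llm` are placeholder stubs:
    # swap in your real search layer and model client.
    def retrieve(question: str, k: int = 5) -> list[str]:
        return ["(chunk about your data)"] * k      # stub: a real store goes here

    def llm(prompt: str) -> str:
        return "(model answer)"                     # stub: a real API call goes here

    def answer(question: str) -> str:
        context = "\n\n".join(retrieve(question))   # find relevant context
        prompt = (                                  # drop it into the prompt
            "Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        return llm(prompt)                          # let the model answer from it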

It exists because LLMs:

  • have a fixed cutoff date,
  • can't reliably answer questions about your data they've never seen,
  • hallucinate when they have nothing to grip onto.

How it works

  1. Indexing. Cut documents into chunks (typically 500–1500 tokens),
     embed each one, store in a vector store (pgvector, Qdrant, LanceDB).
  2. Retrieval. Embed the query, fetch the top-K closest chunks. Often
     hybrid with BM25 full-text, because embeddings alone lose on
     keywords and numbers (steps 1–2 are sketched after this list).
  3. Reranker (optional). A cross-encoder re-scores results by actual
     relevance. Without it your top-10 is often "similar-sounding" chunks,
     not answers.
  4. Generation. Top-N chunks (typically 4–8) go into the prompt as
     context next to the original question (steps 3–4 get a second
     sketch below).
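
Steps 1–2 in Python, as a sketch. Assumes the sentence-transformers
and rank_bm25 packages; the in-memory numpy "store" stands in for
pgvector/Qdrant/LanceDB, and the model name is just one common
default, not a recommendation:

    # Steps 1-2: chunk, embed, index, then hybrid dense + BM25 retrieval.
    import numpy as np
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
        # Naive character-based splitting. Production code usually splits
        # on headings/paragraphs and counts tokens, not characters.
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

    # 1. Indexing: chunk every document, embed every chunk.
    docs = ["...your documents..."]
    chunks = [c for d in docs for c in chunk(d)]
    vectors = encoder.encode(chunks, normalize_embeddings=True)   # shape (n, dim)
    bm25 = BM25Okapi([c.lower().split() for c in chunks])

    # 2. Retrieval: blend cosine similarity with BM25 keyword scores.
    def retrieve(query: str, k: int = 10, alpha: float = 0.5) -> list[str]:
        q = encoder.encode([query], normalize_embeddings=True)[0]
        dense = vectors @ q                            # cosine, since normalized
        sparse = bm25.get_scores(query.lower().split())
        if sparse.max() > 0:
            sparse = sparse / sparse.max()             # crude score normalization
        scores = alpha * dense + (1 - alpha) * sparse
        return [chunks[i] for i in np.argsort(scores)[::-1][:k]]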

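Steps 3–4, continuing the snippet above. The cross-encoder checkpoint
named here is a common public one, used only as an illustration:

    # Steps 3-4: cross-encoder rerank, then prompt assembly.
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def rerank(query: str, candidates: list[str], n: int = 5) -> list[str]:
        # 3. Score each (query, chunk) pair for actual relevance.
        scores = reranker.predict([(query, c) for c in candidates])
        ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
        return [c for _, c in ranked[:n]]

    def build_prompt(query: str) -> str:
        # 4. Top-N reranked chunks go into the prompt next to the question.
        top = rerank(query, retrieve(query, k=10), n=5)
        context = "\n\n---\n\n".join(top)
        return (
            "Answer using only the context below. "
            "If the context is not enough, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}"
        )
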
When it makes sense (and when it doesn't)

Yes: company knowledge base, documentation, legal or medical
archive, data that keeps changing.

No: creative tasks without ground truth, questions solvable by a
single SQL JOIN, cases where a long context window (1M-token models) is enough.

Common mistakes

  • Sentence-level chunking. You lose "what's the subject of this sentence".
  • Embedding only, no BM25. Embeddings can't tell "GPT-5" from "GPT-4".
  • No reranker. Top-K embedding similarity ≠ top-K relevance.
  • No evals. If you can't tell whether RAG beats the raw model, you
    might as well turn it off (a minimal harness is sketched below).
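
A minimal eval harness, as a sketch. The gold questions, the substring
check, and the rag_answer / raw_answer stubs are all hypothetical
placeholders for your own data and model calls:

    # Compare RAG against the raw model on a small gold set.
    GOLD = [  # hypothetical examples: (question, string the answer must contain)
        ("When did we ship v2?", "March 2024"),
        ("Who owns the billing service?", "payments team"),
    ]

    def rag_answer(q: str) -> str:
        return ""   # stub: retrieval + generation goes here

    def raw_answer(q: str) -> str:
        return ""   # stub: the bare model, no retrieval

    def accuracy(answer_fn) -> float:
        hits = sum(want.lower() in answer_fn(q).lower() for q, want in GOLD)
        return hits / len(GOLD)

    print(f"RAG: {accuracy(rag_answer):.0%}  raw: {accuracy(raw_answer):.0%}")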

What to remember

RAG is tooling around the prompt. It doesn't solve tasks the model
can't do. It solves tasks the model can do but has no data for.