Search your knowledge base first. Then answer grounded in real documents. The AI stops guessing and starts citing.
AI models make things up. They generate confident-sounding answers about facts they don't actually know, dates they've never seen, and documents they can't access. This is the fundamental hallucination problem, and Retrieval-Augmented Generation (RAG) is the most widely deployed solution.
The concept is straightforward: before the AI answers your question, search a knowledge base for relevant passages and include them in the prompt. Now the model generates an answer grounded in actual source material rather than relying solely on what it memorized during training. The difference is like asking someone to answer from memory versus handing them the reference book first.
This composition builds on three simpler techniques: Give It the Source, Recall First, and Index First. RAG combines document indexing (prepare knowledge for search), retrieval (find what's relevant), and context augmentation (give the AI real sources to cite) into a production-ready pipeline.
RAG isn't one step — it's a pipeline where each stage matters.
Chunk: Split documents into smaller pieces. Too large and they dilute the context; too small and they lose meaning.
Embed: Convert each chunk into a numerical vector — a mathematical fingerprint that captures its meaning.
Retrieve: When a question arrives, embed it too and find the chunks with the closest vectors. Fast, but can be imprecise.
Rerank: Score each retrieved chunk against the actual question. This precision step can improve results by 30–40%.
Augment: Feed the top chunks into the prompt alongside the question. The AI answers grounded in real sources.
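The stages above can be sketched end to end. This is a minimal toy, not a production system: a bag-of-words count vector stands in for a real neural embedding model, the reranking stage is omitted, and every function name is illustrative rather than a library API.

```python
import re
from collections import Counter
from math import sqrt

def embed(text):
    # Toy "embedding": a bag-of-words count vector.
    # Real systems use a neural embedding model instead.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, chunks, k=2):
    # Vector search: return the k chunks closest to the question.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question, retrieved):
    # Augmentation: put the retrieved sources in the prompt, numbered for citation.
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(retrieved, 1))
    return ("Answer using only the numbered sources below, and cite them.\n\n"
            f"{context}\n\nQuestion: {question}")
```

Swapping `embed` for a real embedding model and adding a reranking pass between `retrieve` and `build_prompt` turns this skeleton into the full pipeline.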
Consider an employee asking about their company's parental leave policy.
Without RAG, the AI would guess a generic answer. With RAG, it cites the exact policy.
How you split documents is one of the biggest decisions in a RAG system. Get it wrong and even perfect retrieval won't help.
Fixed-size: Split every 512 tokens with 50-token overlap. Easy to implement but may break mid-sentence or mid-thought.
Semantic: Split by paragraph or section boundaries. Preserves meaning and context, but chunk sizes vary.
Sentence-window: Group 5 sentences per chunk. Natural boundaries with consistent sizes. A solid middle ground.
Parent-document: Small chunks for precise retrieval, but return the larger parent chunk for context. Best of both worlds.
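The two simplest strategies are easy to sketch directly. The function names and default sizes below are illustrative choices, not a standard API; a real tokenizer or sentence splitter would feed these functions.

```python
def fixed_chunks(tokens, size=512, overlap=50):
    # Fixed-size chunking: stride forward by (size - overlap) tokens,
    # so consecutive chunks share `overlap` tokens of context.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

def sentence_windows(sentences, per_chunk=5):
    # Sentence-window chunking: natural boundaries, consistent sizes.
    return [" ".join(sentences[i:i + per_chunk])
            for i in range(0, len(sentences), per_chunk)]
```

The overlap in `fixed_chunks` is what keeps a thought that straddles a boundary from being lost entirely: it appears whole in at least one chunk.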
The basic "retrieve then generate" pattern has evolved into several specialized variants.
Standard RAG: Retrieve top-K chunks, stuff them into the prompt, generate. Simple and effective for many use cases.
Self-RAG: The model decides if it needs retrieval, skips the search for questions it already knows, and evaluates its own answers for groundedness.
Agentic RAG: An agent decides what, when, and how to retrieve in a loop. It can reformulate queries, try different sources, and validate results.
These pipelines fail in predictable ways:
Chunks too large — the relevant sentence gets buried in paragraphs of irrelevant text, diluting the signal
Chunks too small — you retrieve the right sentence but lose the surrounding context needed to understand it
No reranking — vector search is fast but imprecise; without reranking, irrelevant chunks crowd out useful ones
Wrong embedding model — a general-purpose embedding may not capture domain-specific terminology well
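The missing-reranking failure is the easiest to fix: reranking is just a second, more precise scoring pass over the vector-search candidates. A sketch, where the hypothetical `overlap_score` stands in for a production cross-encoder model:

```python
def rerank(question, candidates, score, top_n=3):
    # Second-stage precision: rescore each candidate against the question
    # with a more expensive, more accurate scorer than vector distance.
    return sorted(candidates, key=lambda c: score(question, c), reverse=True)[:top_n]

def overlap_score(question, chunk):
    # Stand-in scorer: fraction of question words present in the chunk.
    # In production this would be a cross-encoder model, not word overlap.
    q = set(question.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q)
```

Because the reranker only sees the handful of candidates vector search returned, it can afford to be slow and precise where the first stage was fast and approximate.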
RAG works because it plays to the strengths of both search and generation. Search engines are excellent at finding relevant documents but terrible at synthesizing answers. Language models are excellent at synthesis but unreliable at recalling specific facts. RAG combines them: let search handle the facts, let the model handle the language.
There's a deeper reason too: grounding reduces hallucination because the model has less need to "fill in" from memory. When relevant source text is right there in the prompt, the path of least resistance is to paraphrase and cite rather than fabricate.
Don't ask the AI to answer from memory. Search your documents first, retrieve the most relevant passages, and put them in the prompt. The model generates answers grounded in real sources — not guesses.
RAG is the production evolution of the single-prompt technique Give It the Source. Where that technique manually pastes context into a prompt, RAG automates the process: finding, ranking, and inserting the right context dynamically.
It connects to ReAct when agents decide what to retrieve (Agentic RAG), to Plan-and-Execute when complex queries require multi-step retrieval strategies, and to Self-Ask when the system generates sub-questions to retrieve different aspects of an answer.