
RAG 101: Retrieve, then Generate

1. Why RAG exists
2. Anatomy of a RAG pipeline
3. Build it: Input -> Retriever -> LLM -> Output
4. Tuning retrieval
5. Citations and grounding
6. Run it for real
Lesson overview

Build a retrieval-augmented pipeline from scratch. Why naked LLMs hallucinate, what retrieval really does, and how to ship something you can trust.

The problem we are solving

Imagine you are building a support assistant for a SaaS product. Your users ask things like "how do I rotate an API key?" and "what's your data retention policy?" If you call GPT-4 directly, it will confidently answer using either generic SaaS knowledge or a hallucinated version of your docs. Both are wrong, and both will eventually generate a support ticket that says "your AI told me to do X and now my account is broken."

The problem is not that the LLM is dumb — it is that the LLM has no access to your specific facts. Your refund policy, your pricing, your API surface, your changelog: none of it is in the training data, or if it is, the model has no idea which version to trust.

We need a system that looks up the right passages from your corpus and forces the model to answer from those passages, with citations the user can verify. That is RAG.

Why this graph shape

The four-node graph — Input, Retriever, LLM, Output — is the smallest topology that captures the RAG pattern. Every box has one job. The Retriever is its own node (not folded into the LLM) because retrieval is a separate failure mode you will want to debug, swap, and monitor independently.

The question flows to both the Retriever and the LLM. This is intentional: the Retriever needs the question to search; the LLM needs the question to answer. If you fold them into a single edge, you lose the ability to swap retrievers without re-plumbing the LLM call.

The Input and Output nodes are explicit because real systems have I/O contracts. Pretending the graph "just runs" hides the most important interface in your application — the one between your service and everything else.
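The shape above can be sketched in plain Python before reaching for LangGraph. Everything here is a stand-in: the `docs` dict, the keyword matching in `retriever_node`, and the canned response in `llm_node` are illustrative stubs, not real library APIs.

```python
# Four-node topology: Input -> Retriever -> LLM -> Output.
# The question reaches both the Retriever and the LLM.

docs = {
    "rotate-keys": "To rotate an API key, go to Settings > API Keys.",
    "retention": "Data is retained for 30 days after account closure.",
}

def input_node(raw_request: dict) -> dict:
    # I/O contract: validate and normalize the incoming request.
    return {"question": raw_request["question"].strip()}

def retriever_node(state: dict) -> dict:
    # The question flows into retrieval independently of the LLM call.
    q = state["question"].lower()
    hits = [text for text in docs.values()
            if any(word in text.lower() for word in q.split() if len(word) > 3)]
    return {**state, "context": hits}

def llm_node(state: dict) -> dict:
    # The LLM sees both the question and the retrieved context.
    answer = (f"Based on {len(state['context'])} passage(s): {state['context'][0]}"
              if state["context"] else "I don't know.")
    return {**state, "answer": answer}

def output_node(state: dict) -> dict:
    # Explicit output contract: expose only what callers need.
    return {"answer": state["answer"], "sources": state["context"]}

def run(raw_request: dict) -> dict:
    return output_node(llm_node(retriever_node(input_node(raw_request))))

result = run({"question": "How do I rotate an API key?"})
print(result["answer"])
```

Because each node is a plain function over shared state, swapping the retriever means replacing one function, exactly the independence the graph shape is meant to buy you.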

Prerequisites

  • Comfort with Python (we use LangChain/LangGraph Python in code samples)
  • Basic familiarity with HTTP APIs and JSON
  • Conceptual understanding of what an LLM is (token prediction over a context window)
  • No prior LangChain/LangGraph experience required

What you will learn

  • Explain why naked LLM calls hallucinate on domain-specific questions
  • Describe the five-stage RAG pipeline: chunk, embed, store, retrieve, generate
  • Build a working RAG graph in LangGraph with state, nodes, and edges
  • Choose reasonable defaults for chunk size, k, and embedding model
  • Add hybrid search and reranking when pure dense retrieval falls short
  • Force the model to produce verifiable citations

Common pitfalls

Embedding mismatch at query time

You must embed the query with the same model you embedded the corpus with. Mixing models silently produces garbage retrievals — the vectors are technically the same shape but live in incomparable spaces. Wrap the embedding model in a singleton and assert the model name at both ingest and query.
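One way to enforce this, sketched with a stand-in embed function. The class, the model name, and the singleton pattern are illustrative, not a library API:

```python
# Pin the embedding model name once; assert it at every construction,
# so ingest and query cannot silently diverge.

EXPECTED_MODEL = "text-embedding-3-small"  # assumed model name

class PinnedEmbedder:
    _instance = None

    def __new__(cls, model_name: str):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.model_name = model_name
        assert cls._instance.model_name == model_name, (
            f"Embedding model mismatch: index built with "
            f"{cls._instance.model_name!r}, queried with {model_name!r}"
        )
        return cls._instance

    def embed(self, text: str) -> list[float]:
        # Stand-in for the real embedding call.
        return [float(ord(c)) for c in text[:8]]

ingest_embedder = PinnedEmbedder(EXPECTED_MODEL)
query_embedder = PinnedEmbedder(EXPECTED_MODEL)  # same instance
```

Passing any other model name raises immediately, turning a silent retrieval-quality bug into a loud startup error.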

Chunk overlap matters more than chunk size

Teams tune chunk size endlessly and ignore overlap. Without overlap, the sentence that explains your refund policy gets cut in half on a chunk boundary and never appears whole in any chunk. Start with 15-20% overlap and only reduce it if context windows are tight.
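A toy character-level chunker makes the failure concrete. The sizes and the document below are made up; real splitters (such as LangChain's RecursiveCharacterTextSplitter) apply the same sliding-window idea:

```python
def chunk(text: str, size: int, overlap: int) -> list[str]:
    # Slide a window of `size` chars, stepping by size - overlap.
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = ("Refunds are processed within 14 days. "
       "To request a refund, email billing with your invoice number. "
       "Annual plans are refunded pro rata.")

# No overlap: the refund-request sentence is split across a boundary.
no_overlap = chunk(doc, size=60, overlap=0)
# 20% overlap: each chunk repeats the tail of the previous one, so
# sentences near a boundary appear whole in at least one chunk.
with_overlap = chunk(doc, size=60, overlap=12)

print(any("email billing" in c for c in no_overlap))    # phrase lost
print(any("email billing" in c for c in with_overlap))  # phrase intact
```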

Indexing PII without realizing it

If your corpus contains customer emails, support transcripts, or internal Slack, you are embedding PII into a vector database. Embeddings are not encryption: anyone with read access to the store can approximately reconstruct the text or simply query for it. Redact before ingest, not after.
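A minimal redaction pass might look like the sketch below. The two regexes (emails and US-style phone numbers) are illustrative only; a production pipeline should use a dedicated PII detector:

```python
import re

# Redact before ingest: replace PII with typed placeholders so the
# redacted text still reads naturally for embedding.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

transcript = "Customer jane.doe@example.com called from 555-867-5309 about refunds."
clean = redact(transcript)
print(clean)
```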

Trusting cosine similarity scores as relevance

Cosine similarity is a proxy for relevance, not relevance itself. A score of 0.82 in one corpus may be excellent; in another it is garbage. Never threshold on raw cosine. Either learn a per-corpus threshold from a labeled eval set or rerank and threshold on the reranker score.
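Learning the threshold from labels can be as simple as a sweep over observed scores. The scores and labels below are invented illustration data:

```python
# Labeled (cosine score, human relevance judgment) pairs from an eval set.
labeled = [
    (0.91, True), (0.84, True), (0.82, False),
    (0.79, True), (0.74, False), (0.61, False),
]

def best_threshold(pairs):
    # Try each observed score as the cutoff; keep the most accurate one.
    best_t, best_acc = 0.0, -1.0
    for t, _ in pairs:
        acc = sum((s >= t) == rel for s, rel in pairs) / len(pairs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

t = best_threshold(labeled)
print(t)
```

The same sweep works unchanged on reranker scores, which are usually better calibrated across corpora than raw cosine.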

No eval set means no improvements

If you do not have 50-200 labeled (query, expected answer, expected sources) tuples, you cannot tell whether a change improves the system or breaks it. Build the eval set on day one, even if it is tiny and hand-written. Every tuning decision flows through it.
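Even a two-item eval set gives you a number to watch. The `retrieve` stub and file names below are placeholders for your actual retriever and corpus:

```python
# Hand-written eval tuples: query plus the source that should be retrieved.
eval_set = [
    {"query": "how do I rotate an API key?", "expected_source": "api-keys.md"},
    {"query": "what is the data retention policy?", "expected_source": "retention.md"},
]

def retrieve(query: str, k: int = 3) -> list[str]:
    # Stand-in: a real implementation would hit the vector store.
    index = {"rotate": "api-keys.md", "retention": "retention.md"}
    return [doc for word, doc in index.items() if word in query.lower()][:k]

def hit_rate(cases) -> float:
    # Fraction of queries whose expected source appears in the top-k results.
    hits = sum(case["expected_source"] in retrieve(case["query"])
               for case in cases)
    return hits / len(cases)

print(hit_rate(eval_set))
```

Run this after every change to chunking, k, or the embedding model; a drop tells you the change broke retrieval before any user does.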

Streaming hides retrieval latency

Users see the LLM stream tokens and assume the system is fast. Meanwhile, your retrieval is doing a 4-second hybrid search every call. Always log p50/p95 latency per stage, not just end-to-end. The slow stage is almost never the one you think.
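A sketch of per-stage timing using only the standard library; the stage names and the `timed` helper are illustrative:

```python
import time
from statistics import quantiles

# Accumulate per-stage latencies so p50/p95 can be reported per stage,
# not just end-to-end.
latencies: dict[str, list[float]] = {}

def timed(stage: str, fn, *args, **kwargs):
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    latencies.setdefault(stage, []).append(time.perf_counter() - start)
    return result

def p50_p95(stage: str) -> tuple[float, float]:
    qs = quantiles(latencies[stage], n=100)  # needs several samples
    return qs[49], qs[94]

# Simulate calls; here retrieval is the slow stage.
for _ in range(20):
    timed("retrieve", time.sleep, 0.002)
    timed("generate", time.sleep, 0.001)

p50, p95 = p50_p95("retrieve")
print(f"retrieve p50={p50 * 1000:.1f}ms p95={p95 * 1000:.1f}ms")
```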
