Build a retrieval-augmented pipeline from scratch. Why naked LLMs hallucinate, what retrieval really does, and how to ship something you can trust.
Imagine you are building a support assistant for a SaaS product. Your users ask things like "how do I rotate an API key?" and "what's your data retention policy?" If you call GPT-4 directly, it will confidently answer using either generic SaaS knowledge or a hallucinated version of your docs. Both are wrong, and both will eventually generate a support ticket that says "your AI told me to do X and now my account is broken."
The problem is not that the LLM is dumb — it is that the LLM has no access to your specific facts. Your refund policy, your pricing, your API surface, your changelog: none of it is in the training data, or if it is, the model has no idea which version to trust.
We need a system that looks up the right passages from your corpus and forces the model to answer from those passages, with citations the user can verify. That is RAG.
The four-node graph — Input, Retriever, LLM, Output — is the smallest topology that captures the RAG pattern. Every box has one job. The Retriever is its own node (not folded into the LLM) because retrieval is a separate failure mode you will want to debug, swap, and monitor independently.
The question flows to both the Retriever and the LLM. This is intentional: the Retriever needs the question to search; the LLM needs the question to answer. If you fold them into a single edge, you lose the ability to swap retrievers without re-plumbing the LLM call.
The Input and Output nodes are explicit because real systems have I/O contracts. Pretending the graph "just runs" hides the most important interface in your application — the one between your service and everything else.
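Here is a minimal sketch of that topology in plain Python, assuming nothing about your framework. `Passage`, `Answer`, `retrieve`, and `generate` are hypothetical names chosen for illustration; the point is that the retriever and the LLM call are separate, swappable functions, and the question is handed to both.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Passage:
    text: str
    source: str
    score: float

@dataclass
class Answer:
    text: str
    citations: List[str]

def run_pipeline(
    question: str,
    retrieve: Callable[[str], List[Passage]],         # Retriever node
    generate: Callable[[str, List[Passage]], Answer],  # LLM node
) -> Answer:
    # Input node: the question fans out to both the Retriever and the LLM.
    passages = retrieve(question)
    answer = generate(question, passages)
    # Output node: the explicit contract your service exposes to callers.
    return answer
```

Because `retrieve` and `generate` are injected, you can swap a BM25 retriever for a vector store, or one LLM provider for another, without touching the pipeline itself.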
You must embed the query with the same model you embedded the corpus with. Mixing models silently produces garbage retrievals — the vectors can have the same dimensionality, so nothing errors, but they live in incomparable spaces. Wrap the embedding model in a singleton and assert the model name at both ingest and query.
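One way to enforce that, sketched below. `EMBED_MODEL` and the `embed` body are placeholders for whatever provider you use; the point is a single process-wide instance plus a loud assertion the moment ingest and query disagree.

```python
EMBED_MODEL = "text-embedding-3-small"  # assumed model name, for illustration

class Embedder:
    _instance = None

    def __init__(self, model: str):
        self.model = model

    @classmethod
    def get(cls, model: str = EMBED_MODEL) -> "Embedder":
        if cls._instance is None:
            cls._instance = cls(model)
        # Fail loudly if ingest and query ever disagree on the model.
        assert cls._instance.model == model, (
            f"Embedding model mismatch: store built with {cls._instance.model}, "
            f"query requested {model}"
        )
        return cls._instance

    def embed(self, texts: list[str]) -> list[list[float]]:
        raise NotImplementedError("call your embedding provider here")
```

Storing the model name alongside the index and checking it at query time gives you the same guarantee across process restarts.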
Teams tune chunk size endlessly and ignore overlap. Without overlap, the sentence that explains your refund policy gets cut in half on a chunk boundary and never appears whole in any chunk. Start with 15-20% overlap and only reduce it if context windows are tight.
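A minimal character-level chunker with overlap, for illustration; real pipelines usually split on sentence or heading boundaries first, but the overlap arithmetic is the same. The default sizes are assumptions, not recommendations.

```python
def chunk(text: str, size: int = 800, overlap: int = 150) -> list[str]:
    """Fixed-size chunks with roughly 20% overlap (sketch; tune per corpus)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        # Step forward by less than a full chunk so boundary sentences
        # appear whole in at least one chunk.
        start += size - overlap
    return chunks
```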
If your corpus contains customer emails, support transcripts, or internal Slack, you are embedding PII into a vector database. Embeddings are not encryption — anyone with read access to the store can reverse-engineer the text or query for it. Redact before ingest, not after.
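A sketch of what "redact before ingest" means in practice. The regexes below are illustrative only and catch just the obvious patterns; a real pipeline should use a dedicated PII detector rather than hand-rolled expressions.

```python
import re

# Illustrative patterns: obvious emails and long digit runs.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_DIGITS = re.compile(r"\b\d{9,}\b")  # account numbers, phone numbers, etc.

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    text = LONG_DIGITS.sub("[NUMBER]", text)
    return text

# Redaction runs before chunking and embedding, so PII never
# reaches the vector store:
#   docs = [redact(d) for d in raw_docs]
```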
Cosine similarity is a proxy for relevance, not relevance itself. A score of 0.82 in one corpus may be excellent; in another it is garbage. Never threshold on raw cosine. Either learn a per-corpus threshold from a labeled eval set or rerank and threshold on the reranker score.
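A sketch of thresholding on a reranker score instead of raw cosine. `rerank` is a placeholder for whatever cross-encoder or reranking API you use, and `min_score` is illustrative; the actual value should come from your labeled eval set, not a guess.

```python
def filter_passages(question, passages, rerank, min_score=0.3):
    """Keep only passages the reranker scores above a learned threshold."""
    # rerank(question, passages) -> list of (passage, score) pairs
    scored = rerank(question, passages)
    return [p for p, s in scored if s >= min_score]
```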
If you do not have 50-200 labeled (query, expected answer, expected sources) tuples, you cannot tell whether a change improves the system or breaks it. Build the eval set on day one, even if it is tiny and hand-written. Every tuning decision flows through it.
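A minimal shape for that eval set and one metric over it, assuming the retriever returns passages with a `source` field as in the earlier sketch. Retrieval recall is the first number to watch: if the right sources never come back, nothing downstream can fix it.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    query: str
    expected_answer: str
    expected_sources: list[str]

def retrieval_recall(cases: list[EvalCase], retrieve) -> float:
    """Fraction of cases where every expected source appears in the top-k."""
    hits = 0
    for case in cases:
        retrieved = {p.source for p in retrieve(case.query)}
        if set(case.expected_sources) <= retrieved:
            hits += 1
    return hits / len(cases)
```

Run this before and after every chunking, embedding, or threshold change; a tiny hand-written set still catches most regressions.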
Users see the LLM stream tokens and assume the system is fast. Meanwhile, your retrieval is doing a 4-second hybrid search every call. Always log p50/p95 latency per stage, not just end-to-end. The slow stage is almost never the one you think.
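A low-tech way to get per-stage numbers without pulling in a metrics library: a context manager per stage and a crude percentile over the recorded samples. The stage names are arbitrary.

```python
import time
from collections import defaultdict

STAGE_TIMINGS = defaultdict(list)  # stage name -> list of latencies in seconds

class timed:
    """Context manager that records wall-clock time for one pipeline stage."""
    def __init__(self, stage: str):
        self.stage = stage
    def __enter__(self):
        self.start = time.perf_counter()
    def __exit__(self, *exc):
        STAGE_TIMINGS[self.stage].append(time.perf_counter() - self.start)

def percentile(values, p):
    """Crude nearest-rank percentile; assumes values is non-empty."""
    values = sorted(values)
    return values[int(p / 100 * (len(values) - 1))]

# Usage:
#   with timed("retrieval"):
#       passages = retrieve(question)
#   with timed("generation"):
#       answer = generate(question, passages)
#   p95_retrieval = percentile(STAGE_TIMINGS["retrieval"], 95)
```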