Fundamentals
1. Overview
Large Language Models are trained on a fixed snapshot of data; knowledge is baked into model weights and cannot change without retraining. This creates three hard problems: a knowledge cutoff (no access to post-training events), hallucinations (the model generates plausible-sounding but wrong answers when it lacks a fact), and poor coverage of proprietary or niche domains that are under-represented in the training data.
Retrieval-Augmented Generation (RAG) decouples knowledge storage from language generation. Instead of forcing the model to memorise facts, it retrieves relevant external documents at inference time and conditions generation on that retrieved context.
2. The Core RAG Loop
- The user submits a query.
- A retriever searches an external knowledge base and returns the top-k most relevant chunks.
- Retrieved chunks are injected into the prompt as context.
- The LLM generates an answer grounded in that context.
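A minimal Python sketch of this loop, assuming the caller supplies an embedding function and an LLM completion call; the in-memory cosine-similarity search stands in for a real vector database:

```python
import numpy as np
from typing import Callable, Sequence

def rag_answer(
    query: str,
    chunks: Sequence[str],               # pre-indexed text chunks (built at indexing time, Section 4)
    chunk_vectors: np.ndarray,           # one embedding row per chunk
    embed: Callable[[str], np.ndarray],  # embedding model of your choice
    generate: Callable[[str], str],      # LLM completion call of your choice
    k: int = 3,
) -> str:
    # 1. Embed the query with the same model used at indexing time.
    q = embed(query)
    # 2. Retrieve the top-k chunks by cosine similarity.
    sims = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q) + 1e-9
    )
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[::-1][:k])
    # 3. Augment: inject the retrieved chunks into the prompt.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    # 4. Generate an answer grounded in that context.
    return generate(prompt)
```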
3. RAG vs. Fine-Tuning — Choosing the Right Tool
This is one of the most common conceptual interview questions. The short answer: use RAG when the problem is about knowledge access; use fine-tuning when the problem is about model behaviour.
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Knowledge type | Dynamic, large, frequently changing | Stable, compact |
| Primary goal | Access external facts at inference time | Change reasoning style or output format |
| Cost to update | Re-index documents (cheap) | Retrain or fine-tune model (expensive) |
| Explainability | Citations traceable to source chunks | Opaque — knowledge in weights |
| Hallucination risk | Reduced (grounded in retrieved text) | Not directly addressed |
| Common use case | Enterprise Q&A, support bots, doc search | Instruction following, code style, tone |
In practice, RAG and fine-tuning are complementary. A model may be fine-tuned for instruction following, while RAG supplies factual grounding at runtime.
4. High-Level RAG Pipeline
A standard RAG system has four stages:
- Indexing — documents are chunked, embedded, and stored in a vector index.
- Retrieval — the user query is embedded and the most similar chunks are fetched.
- Augmentation — retrieved chunks are injected into the prompt alongside the query.
- Generation — the LLM generates a grounded answer conditioned on query + context.
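The indexing stage can be sketched as follows. The fixed-size, overlapping character windows are an illustrative chunking choice (real systems often split on sentences, tokens, or document structure), and `embed` is the same user-supplied embedding callable as in the loop sketch above:

```python
import numpy as np
from typing import Callable, List, Tuple

def chunk_text(text: str, size: int = 500, overlap: int = 100) -> List[str]:
    # Fixed-size windows with overlap, so meaning split across a boundary
    # still appears intact in at least one chunk.
    step = size - overlap
    return [text[i : i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_index(
    documents: List[str],
    embed: Callable[[str], np.ndarray],
) -> Tuple[List[str], np.ndarray]:
    # Chunk every document, embed every chunk, and keep the two aligned:
    # chunks[i] is the text whose vector is chunk_vectors[i].
    chunks = [c for doc in documents for c in chunk_text(doc)]
    chunk_vectors = np.array([embed(c) for c in chunks])
    return chunks, chunk_vectors   # feed these to rag_answer() at query time
```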
5. Key Failure Modes of Vanilla RAG
| Failure Mode | Description |
|---|---|
| Poor recall | Relevant documents exist but are not retrieved |
| Poor precision | Retrieved documents are irrelevant or noisy |
| Chunking errors | Semantic meaning is fragmented across chunk boundaries |
| Context overflow | Retrieved context exceeds the model's context window |
| Model ignores context | LLM falls back on parametric knowledge despite good retrieval |
| No verification | System produces fluent but wrong answers with no detection |
Most real-world RAG systems extend vanilla RAG to explicitly address these failure modes by adding reranking, query rewriting, verification, and feedback loops.
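For example, a reranking step can sit between retrieval and generation: retrieve a generous candidate set with fast vector search, then re-score each (query, chunk) pair with a slower but more accurate cross-encoder to improve precision. A sketch using the sentence-transformers CrossEncoder (the model name is one common public checkpoint, not a requirement):

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, chunk) pair jointly, which is slower
# than vector search but much better at judging true relevance.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```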
6. RAG vs. Fine-Tuning vs. Long-Context LLM
With frontier models now supporting 128K–1M token context windows, a third option exists: simply include all knowledge directly in the context.
| Dimension | RAG | Fine-Tuning | Long-Context LLM |
|---|---|---|---|
| Knowledge update | Re-index (cheap, fast) | Retrain (expensive, slow) | Update document in prompt |
| Knowledge size | Unlimited (external store) | Bounded by training data | Bounded by context window |
| Latency | Retrieval adds latency | None at inference | Higher for very long prompts |
| Cost | Retrieval + generation | Training cost (one-time) | Token cost scales with context length |
| Explainability | Citations traceable | Opaque | Traceable if documents are numbered |
| Hallucination risk | Reduced (grounded) | Not directly addressed | Reduced (but lost-in-the-middle effect) |
| Best for | Large, changing knowledge bases | Style/behaviour change | Small, stable knowledge bases |
Decision framework:
- Use RAG if the knowledge base is large (>100K tokens), changes frequently, or requires traceability/citations.
- Use fine-tuning if the problem is about model behaviour (tone, format, reasoning style) rather than knowledge access.
- Use long-context LLM if the knowledge base fits in the context window AND is stable (few updates) AND latency/cost are acceptable.
- Combine all three for production systems: fine-tune for behaviour, RAG for dynamic knowledge retrieval, long-context for short static reference documents.
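The framework above can be made concrete as a simple heuristic; the inputs and the 100K-token threshold are illustrative rather than canonical, and real decisions usually weigh latency and budget as well:

```python
def choose_approach(
    knowledge_tokens: int,      # approximate size of the knowledge base
    changes_frequently: bool,   # is the content updated often?
    needs_citations: bool,      # must answers be traceable to sources?
    behaviour_change: bool,     # is the goal tone/format/reasoning style?
) -> list[str]:
    choices = []
    if behaviour_change:
        choices.append("fine-tuning")
    if knowledge_tokens > 100_000 or changes_frequently or needs_citations:
        choices.append("RAG")
    elif knowledge_tokens > 0:
        choices.append("long-context prompt")
    return choices or ["plain prompting"]
```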