Fundamentals
1. Overview
Large Language Models are trained on a fixed snapshot of data; knowledge is baked into model weights and cannot change without retraining. This creates three hard problems: a knowledge cutoff (no access to post-training events), hallucinations (the model generates plausible-sounding but wrong answers when it lacks a fact), and poor coverage of proprietary or niche domains that are under-represented in the training data.
Retrieval-Augmented Generation (RAG) decouples knowledge storage from language generation. Instead of forcing the model to memorise facts, it retrieves relevant external documents at inference time and conditions generation on that retrieved context.
2. The Core RAG Loop
- The user submits a query.
- A retriever searches an external knowledge base and returns the top-k most relevant chunks.
- Retrieved chunks are injected into the prompt as context.
- The LLM generates an answer grounded in that context.
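A minimal Python sketch of this loop, assuming the caller supplies an embedding function and an LLM completion call; the in-memory cosine-similarity search stands in for a real vector database:

```python
import numpy as np
from typing import Callable, Sequence

def rag_answer(
    query: str,
    chunks: Sequence[str],               # pre-indexed text chunks (built at indexing time, Section 4)
    chunk_vectors: np.ndarray,           # one embedding row per chunk
    embed: Callable[[str], np.ndarray],  # embedding model of your choice
    generate: Callable[[str], str],      # LLM completion call of your choice
    k: int = 3,
) -> str:
    # 1. Embed the query with the same model used at indexing time.
    q = embed(query)
    # 2. Retrieve the top-k chunks by cosine similarity.
    sims = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q) + 1e-9
    )
    context = "\n\n".join(chunks[i] for i in np.argsort(sims)[::-1][:k])
    # 3. Augment: inject the retrieved chunks into the prompt.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    # 4. Generate an answer grounded in that context.
    return generate(prompt)
```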
3. RAG vs. Fine-Tuning — Choosing the Right Tool
This is one of the most common conceptual interview questions. The short answer: use RAG when the problem is about knowledge access; use fine-tuning when the problem is about model behaviour.
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Knowledge type | Dynamic, large, frequently changing | Stable, compact |
| Primary goal | Access external facts at inference time | Change reasoning style or output format |
| Cost to update | Re-index documents (cheap) | Retrain or fine-tune model (expensive) |
| Explainability | Citations traceable to source chunks | Opaque — knowledge in weights |
| Hallucination risk | Reduced (grounded in retrieved text) | Not directly addressed |
| Common use case | Enterprise Q&A, support bots, doc search | Instruction following, code style, tone |
In practice, RAG and fine-tuning are complementary. A model may be fine-tuned for instruction following, while RAG supplies factual grounding at runtime.
4. High-Level RAG Pipeline
A standard RAG system has four stages:
- Indexing — documents are chunked, embedded, and stored in a vector index.
- Retrieval — the user query is embedded and the most similar chunks are fetched.
- Augmentation — retrieved chunks are injected into the prompt alongside the query.
- Generation — the LLM generates a grounded answer conditioned on query + context.
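The indexing stage can be sketched as follows. The fixed-size, overlapping character windows are an illustrative chunking choice (real systems often split on sentences, tokens, or document structure), and `embed` is the same user-supplied embedding callable as in the loop sketch above:

```python
import numpy as np
from typing import Callable, List, Tuple

def chunk_text(text: str, size: int = 500, overlap: int = 100) -> List[str]:
    # Fixed-size windows with overlap, so meaning split across a boundary
    # still appears intact in at least one chunk.
    step = size - overlap
    return [text[i : i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_index(
    documents: List[str],
    embed: Callable[[str], np.ndarray],
) -> Tuple[List[str], np.ndarray]:
    # Chunk every document, embed every chunk, and keep the two aligned:
    # chunks[i] is the text whose vector is chunk_vectors[i].
    chunks = [c for doc in documents for c in chunk_text(doc)]
    chunk_vectors = np.array([embed(c) for c in chunks])
    return chunks, chunk_vectors   # feed these to rag_answer() at query time
```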
5. Key Failure Modes of Vanilla RAG
| Failure Mode | Description |
|---|---|
| Poor recall | Relevant documents exist but are not retrieved |
| Poor precision | Retrieved documents are irrelevant or noisy |
| Chunking errors | Semantic meaning is fragmented across chunk boundaries |
| Context overflow | Retrieved context exceeds the model's context window |
| Model ignores context | LLM falls back on parametric knowledge despite good retrieval |
| No verification | System produces fluent but wrong answers with no detection |
Most real-world RAG systems extend vanilla RAG to explicitly address these failure modes by adding reranking, query rewriting, verification, and feedback loops.
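For example, a reranking step can sit between retrieval and generation: retrieve a generous candidate set with fast vector search, then re-score each (query, chunk) pair with a slower but more accurate cross-encoder to improve precision. A sketch using the sentence-transformers CrossEncoder (the model name is one common public checkpoint, not a requirement):

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, chunk) pair jointly, which is slower
# than vector search but much better at judging true relevance.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]
```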
6. RAG vs. Fine-Tuning vs. Long-Context LLM
With frontier models now supporting 128K–1M token context windows, a third option exists: simply include all knowledge directly in the context.
| Dimension | RAG | Fine-Tuning | Long-Context LLM |
|---|---|---|---|
| Knowledge update | Re-index (cheap, fast) | Retrain (expensive, slow) | Update document in prompt |
| Knowledge size | Unlimited (external store) | Bounded by training data | Bounded by context window |
| Latency | Retrieval adds latency | None at inference | Higher for very long prompts |
| Cost | Retrieval + generation | Training cost (one-time) | Token cost scales with context length |
| Explainability | Citations traceable | Opaque | Traceable if documents are numbered |
| Hallucination risk | Reduced (grounded) | Not directly addressed | Reduced (but lost-in-the-middle effect) |
| Best for | Large, changing knowledge bases | Style/behaviour change | Small, stable knowledge bases |
Decision framework:
- Use RAG if the knowledge base is large (>100K tokens), changes frequently, or requires traceability/citations.
- Use fine-tuning if the problem is about model behaviour (tone, format, reasoning style) rather than knowledge access.
- Use long-context LLM if the knowledge base fits in the context window AND is stable (few updates) AND latency/cost are acceptable.
- Combine all three for production systems: fine-tune for behaviour, RAG for dynamic knowledge retrieval, long-context for short static reference documents.
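The framework above can be made concrete as a simple heuristic; the inputs and the 100K-token threshold are illustrative rather than canonical, and real decisions usually weigh latency and budget as well:

```python
def choose_approach(
    knowledge_tokens: int,      # approximate size of the knowledge base
    changes_frequently: bool,   # is the content updated often?
    needs_citations: bool,      # must answers be traceable to sources?
    behaviour_change: bool,     # is the goal tone/format/reasoning style?
) -> list[str]:
    choices = []
    if behaviour_change:
        choices.append("fine-tuning")
    if knowledge_tokens > 100_000 or changes_frequently or needs_citations:
        choices.append("RAG")
    elif knowledge_tokens > 0:
        choices.append("long-context prompt")
    return choices or ["plain prompting"]
```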