# RAG Evaluation
## 1. Overview
Evaluation is one of the hardest aspects of RAG. The system has multiple interacting components, no single ground-truth output, and can fail silently — generating fluent but incorrect answers. Effective evaluation requires multiple layers covering retrieval quality, generation quality, and faithfulness.
## 2. Why RAG Evaluation Is Hard

Unlike traditional NLP systems, RAG systems:

- Do not have a single ground-truth output (multiple valid answers may exist).
- Depend on external knowledge sources that can be wrong, stale, or irrelevant.
- Can fail silently, generating fluent but factually wrong answers.
- Have multiple interacting components: a retrieval failure propagates into a generation failure, yet is hard to attribute from end-to-end behaviour alone.

No single metric is sufficient. Effective evaluation requires layered coverage.
## 3. Component vs. End-to-End Evaluation

### Component-Level Evaluation

Each module is tested independently.

- Retrieval: Are relevant documents retrieved? Are they ranked correctly?
- Generation: Given perfect context, can the model answer correctly?

| Pros | Cons |
|---|---|
| Easier to debug failures; clear error attribution; fast offline iteration | Does not capture compounding errors; may overestimate real-world performance |
### End-to-End Evaluation

The full pipeline is tested from user query to final answer.

| Pros | Cons |
|---|---|
| Reflects real user experience; captures interaction effects between components | Hard to diagnose root causes; more expensive and noisy |
Strong systems use both — component evaluation during development, end-to-end evaluation before deployment.
## 4. Retrieval Metrics

### Recall@k

Fraction of queries for which at least one relevant document appears in the top-k retrieved results.

- Why it matters: If recall is low, generation cannot recover. Especially critical for factual QA.
- Limitation: Binary notion of relevance; does not consider ranking quality within top-k.
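A minimal sketch of Recall@k as defined above, assuming each query comes with a ranked list of retrieved document IDs and a set of relevant IDs (hypothetical inputs):

```python
def recall_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant doc in the top-k."""
    hits = sum(1 for ranked, rel in zip(retrieved, relevant) if rel & set(ranked[:k]))
    return hits / len(retrieved)

# Example: only the first of two queries has a relevant doc in its top-2.
recall_at_k([["d1", "d2"], ["d5", "d6"]], [{"d2"}, {"d9"}], k=2)  # -> 0.5
```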
### Mean Reciprocal Rank (MRR)

Measures how early the first relevant document appears in the ranked list: MRR is the average of 1/rank across queries, where rank is the position of the first relevant document.

- Why it matters: Rewards systems that rank relevant documents earlier; useful when only one document is needed.
- Limitation: Considers only the first relevant document; ignores any others.
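A matching sketch for MRR, under the same assumed inputs as the Recall@k example:

```python
def mean_reciprocal_rank(retrieved: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant document per query."""
    total = 0.0
    for ranked, rel in zip(retrieved, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break  # queries with no relevant doc retrieved contribute 0
    return total / len(retrieved)
```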
### nDCG (Normalised Discounted Cumulative Gain)

Measures the quality of the full ranked list using graded relevance, penalising relevant documents that appear lower in the ranking.

- Why it matters: More realistic for multi-document relevance; handles graded (not just binary) relevance labels.
- Limitation: Requires graded relevance annotations; more complex to compute and interpret.
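A minimal sketch of nDCG@k using the linear-gain formulation (an exponential-gain variant, 2^grade − 1, is also common); `grades` maps doc IDs to graded relevance labels, with unjudged documents treated as grade 0:

```python
import math

def ndcg_at_k(ranked: list[str], grades: dict[str, int], k: int) -> float:
    """DCG of the actual top-k ranking divided by the DCG of the ideal one."""
    def dcg(rels: list[int]) -> float:
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels))

    actual = dcg([grades.get(doc_id, 0) for doc_id in ranked[:k]])
    ideal = dcg(sorted(grades.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```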
### Precision@k

Fraction of top-k retrieved documents that are relevant.

- Why it matters: Complements recall; high precision means less noisy context for the LLM.
- Limitation: Must be considered alongside recall; a system can have high Precision@3 with low recall.
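And a per-query Precision@k sketch to complete the set (averaging over queries works as in the Recall@k example):

```python
def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked[:k]  # note: divides by len(top_k) if fewer than k were retrieved
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)
```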
High Recall@k is often more critical than precision in RAG retrieval, because the LLM can filter irrelevant context — but it cannot invent missing information.
## 5. Generation Quality Metrics
| Metric | What It Measures | Limitation |
|---|---|---|
| Exact Match (EM) | Exact string match with reference answer | Too strict for NL generation; penalises valid paraphrasing |
| F1 Score (token overlap) | Token-level overlap with reference | Surface-level; misses semantic equivalence |
| BLEU / ROUGE | N-gram overlap with reference text | Poorly correlates with factual correctness; rewards fluency |
| BERTScore | Semantic similarity via contextual embeddings | Better than n-gram but still not factuality-aware |
These metrics primarily measure fluency and surface similarity, not truthfulness. A hallucinated answer can score well if it is fluent and partially overlaps with the reference.
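For concreteness, a minimal sketch of Exact Match and token-level F1 in the SQuAD style; the normalisation (lowercasing, punctuation stripping) is illustrative rather than canonical:

```python
import string
from collections import Counter

def _normalise(text: str) -> list[str]:
    # Lowercase and strip punctuation before whitespace tokenisation.
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()

def exact_match(prediction: str, reference: str) -> float:
    return float(_normalise(prediction) == _normalise(reference))

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = _normalise(prediction), _normalise(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```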
## 6. Faithfulness and Groundedness
The most critical RAG-specific evaluation dimension. A system can score well on generation metrics while still hallucinating — generating correct-sounding text not supported by the retrieved documents.
### Faithfulness

Question: Is every claim in the answer supported by the retrieved context?

- Measured by sentence-level entailment checks (see the sketch below) or LLM-as-a-judge prompting.
- Failure mode: Correct-sounding claims that are not in any retrieved document.
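One possible realisation of the entailment approach scores each answer sentence against the retrieved context with an off-the-shelf NLI model; in this sketch, the model choice (facebook/bart-large-mnli) and the 0.5 threshold are assumptions, not a standard:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "facebook/bart-large-mnli"  # assumed NLI model; any MNLI-style model works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def faithfulness(answer_sentences: list[str], context: str, threshold: float = 0.5) -> float:
    """Fraction of answer sentences entailed by the retrieved context."""
    supported = 0
    for sentence in answer_sentences:
        # Premise = retrieved context, hypothesis = the claim being verified.
        inputs = tokenizer(context, sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(dim=-1)[0]
        # bart-large-mnli label order: contradiction (0), neutral (1), entailment (2).
        if probs[2].item() >= threshold:
            supported += 1
    return supported / len(answer_sentences) if answer_sentences else 0.0
```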
### Groundedness

Question: Does every specific claim in the answer trace back to a retrieved source?

- Measured by claim extraction followed by source matching or citation validation.
- Failure mode: Answers that are correct (by coincidence or from parametric knowledge) but unsupported by the retrieved text.
### Answer Relevance
Question: Does the answer actually address the original question?
- Measured by LLM-as-a-judge or semantic similarity between answer and query.
### Context Relevance
Question: Were the retrieved documents actually relevant to the query?
- Measured by LLM-based relevance scoring per retrieved chunk.
## 7. LLM-as-a-Judge (RAGAS Framework)

Using an LLM to evaluate RAG outputs is increasingly the standard approach. RAGAS is a widely used framework that evaluates four dimensions, most without requiring ground-truth labels for generation:
| Dimension | Question Answered |
|---|---|
| Faithfulness | Are all claims in the answer supported by retrieved context? |
| Answer Relevance | Is the answer on-topic for the original query? |
| Context Recall | Does the retrieved context contain what's needed to answer? (Needs reference answer) |
| Context Precision | Are retrieved documents relevant, or is there noise? |
Benefits: Scales without human labelling; captures semantic nuance beyond lexical overlap.
Risks:

- Bias towards fluent answers: LLM judges may reward well-written hallucinations.
- Sensitivity to prompt design: scoring rubrics matter significantly.
- Self-preference bias: models tend to rate outputs from similar architectures more favourably.
Best practice: Validate LLM-as-a-judge scores against human judgments on a held-out set before trusting them for production decisions.
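For reference, a minimal sketch of running these four metrics with the ragas library; this follows the 0.1-era API, and the column names and imports may differ in newer releases:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One toy example; "ground_truth" is only needed for context recall.
data = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris is the capital and largest city of France."]],
    "ground_truth": ["Paris"],
})

result = evaluate(
    data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores between 0 and 1
```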
## 8. Human Evaluation Protocols
Human evaluation remains the gold standard for RAG systems.
Common criteria: Correctness, completeness, faithfulness, clarity, usefulness.
Protocol design:

- Blind evaluation (evaluators don't know which system produced which answer).
- Multiple annotators per example (typically 3).
- Measure inter-annotator agreement, e.g. Cohen's Kappa (see the sketch below).
Tradeoffs: High cost; low scalability; slow iteration — but irreplaceable for final validation.
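A minimal sketch of measuring pairwise agreement with Cohen's Kappa via scikit-learn; the binary correctness labels are hypothetical. With three annotators, either average the pairwise kappas or switch to Fleiss' Kappa:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary correctness judgments from two annotators on eight answers.
annotator_a = [1, 1, 0, 1, 0, 1, 1, 0]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0.0 = chance level
```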
## 9. Common RAG Failure Types and Root Causes
| Failure Type | Diagnosis | Fix |
|---|---|---|
| Relevant doc not retrieved | Low Recall@k | Improve embeddings; use hybrid retrieval; adjust chunk size |
| Retrieved but not used by LLM | Good recall, low faithfulness | Improve prompt; citation constraints; rerank more aggressively |
| Partial hallucinations | Mixed faithful + invented claims | Faithfulness scoring; instruction-tune with grounding |
| Over-reliance on parametric knowledge | Model ignores retrieved context | Explicit grounding instructions; Self-RAG |
| Stale or contradictory sources | Corpus not updated | Add timestamps; filter by recency; update index |
## 10. Agent Evaluation
For single-agent and multi-agent evaluation — metrics, benchmarks, outcome vs. trajectory evaluation, and multi-agent coordination failures — see the dedicated page: Agent Evaluation.
## 11. Building an Evaluation Dataset
Challenge: Creating reliable ground-truth labels for RAG is expensive.
Approaches:
- Synthetic generation: Use an LLM to generate questions from your document corpus, creating question–context–answer triples (see the sketch below). Fast and cheap, but synthetic questions may not match real user queries.
- Production logging: Sample real user queries and have humans label relevant documents and correct answers. Most realistic, but requires live traffic.
- Expert annotation: For high-stakes domains (medical, legal), pay subject matter experts to create a gold-standard test set. Expensive but highest quality.
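As an illustration of the synthetic-generation approach, a minimal sketch using the OpenAI Python client; the model name, prompt wording, and output format are placeholder assumptions:

```python
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Write one factual question answerable solely from the passage below, "
    "then its answer.\nFormat:\nQ: <question>\nA: <answer>\n\nPassage:\n{chunk}"
)

def generate_qa_pair(chunk: str, model: str = "gpt-4o-mini") -> tuple[str, str]:
    # Each (question, chunk, answer) triple becomes one evaluation example.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(chunk=chunk)}],
    )
    text = response.choices[0].message.content
    question, _, answer = text.partition("\nA:")
    return question.removeprefix("Q:").strip(), answer.strip()
```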