RAG & Agents — Interview Q&A

A comprehensive collection of interview questions covering Retrieval-Augmented Generation, LLM Agents, and Context Engineering. Format mirrors the GenAI Interview Q&A document — each question includes a structured answer, follow-ups, and coding challenges where applicable.


Chapter 1: RAG Fundamentals


Q1. What problem does RAG solve that fine-tuning does not?

RAG solves the knowledge access problem. LLMs have a training cutoff and fixed parametric knowledge — they cannot access information that postdates training or is not in the training corpus. Fine-tuning changes how the model behaves (style, reasoning, output format) but does not allow the model to look up new facts. A fine-tuned model still hallucinates when asked about events or facts outside its training data.

RAG decouples knowledge storage from model weights: the model retrieves relevant documents at inference time and grounds its answer in that retrieved evidence. The knowledge base can be updated by re-indexing new documents — no retraining required.

Key distinction:

  • RAG: "What do the retrieved documents say about X?"

  • Fine-tuning: "How should I respond to questions about X?"

Follow-up: Can you use RAG and fine-tuning together?

Yes, and this is common in production. Fine-tuning handles "how to respond" (instruction following, output format, tone); RAG handles "what to respond with" (factual content). A typical production pattern pairs a model fine-tuned for citation-constrained generation with RAG for retrieval.


Q2. Formally, how does RAG change the generation objective?

Vanilla LLM generation maximises P(y | q) — the probability of answer y given only the query q. The model relies entirely on parametric knowledge baked into its weights.

RAG conditions generation on both the query and retrieved documents:

P(y | q, d₁, d₂, ..., dₖ)

The model generates an answer grounded in the retrieved evidence d₁:k rather than relying solely on its internal parameters. This reduces hallucination because incorrect parametric beliefs can be overridden by retrieved facts.
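
For reference, the original RAG formulation (Lewis et al., 2020) makes the retrieval step explicit by marginalising over the top-k retrieved documents, whereas most modern pipelines simply concatenate them into one prompt:

P(y | q) ≈ Σ_{d ∈ top-k} P_η(d | q) · P_θ(y | q, d)

where P_η is the retriever's relevance distribution and P_θ is the generator.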

Follow-up: What is the risk of this objective?

If retrieval fails (wrong documents retrieved), the model generates answers conditioned on irrelevant or incorrect evidence. The model cannot distinguish good retrieval from bad retrieval without explicit verification. This is why faithfulness evaluation and retrieval quality metrics are both necessary.


Q3. What are the main failure modes of vanilla RAG?

Failure Mode | Description | Primary Fix
Poor recall | Relevant documents exist but aren't retrieved | Hybrid retrieval; improve embeddings
Poor precision | Irrelevant documents retrieved and used | Reranking; CRAG
Chunking errors | Semantic meaning split across chunk boundaries | Recursive/semantic chunking; parent-doc retrieval
Context overflow | Retrieved context exceeds context window | Compression; selective retrieval
Model ignores context | LLM uses parametric knowledge instead of retrieved text | Explicit grounding instructions; Self-RAG
No verification | Fluent but wrong answer returned with no flag | Faithfulness scoring; citation constraints

Follow-up: Which failure is hardest to detect?

"Model ignores context" — the system appears to work (fluent answer, no error) but the answer is grounded in the model's parametric knowledge rather than retrieved documents. This is detectable only via faithfulness evaluation (checking that every claim in the answer is supported by a retrieved chunk).


Q4. When should you use a long-context LLM instead of RAG?

Use a long-context LLM when:

  1. The knowledge base fits entirely in the context window (typically ≤ 200K tokens for current frontier models).

  2. The knowledge base is stable — infrequent updates mean the overhead of a retrieval index is not justified.

  3. Latency and cost for a large context are acceptable for the use case.

Use RAG when:

  1. The knowledge base is larger than the context window (millions of documents).

  2. The knowledge base changes frequently (re-indexing is cheaper than re-generating the full context).

  3. Citations and traceability are required (RAG produces attributable sources).

  4. Cost must scale with query rather than with the size of the entire knowledge base.

Follow-up: What are the risks of long-context LLMs?

The lost-in-the-middle effect — LLMs attend less to content in the middle of long contexts. Very long prompts are also expensive (cost scales with token count) and slow. For production systems with large, changing knowledge bases, RAG remains the preferred approach even as context windows grow.


Chapter 2: Retrieval and Indexing


Q5. What is the difference between sparse and dense retrieval?

Sparse retrieval (BM25, TF-IDF):

  • Represents documents as sparse vectors of term counts/weights.

  • Matches documents by lexical overlap (same words in query and document).

  • Fast, interpretable, excellent for exact-match queries and technical terms.

  • Fails on synonyms, paraphrases, or semantic similarity without lexical match.

Dense retrieval (bi-encoders):

  • Embeds query and documents into a shared dense vector space.

  • Matches by semantic similarity (cosine similarity or dot product in the embedding space).

  • Handles synonyms and paraphrases; generalises across phrasings.

  • Slower index build (embedding each document); requires ANN index for fast retrieval.

Hybrid (BM25 + Dense + RRF):

  • Run both in parallel; fuse ranked lists with Reciprocal Rank Fusion.

  • Consistently outperforms either alone across retrieval benchmarks.

  • Industry standard for production RAG.

Follow-up: What is Reciprocal Rank Fusion (RRF)?

RRF merges multiple ranked lists without needing score normalisation. For each document, its RRF score is the sum of 1 / (k + rank_in_list_i) across all lists (k=60 is typical). Documents that rank well in multiple lists score highest. RRF is robust to score scale differences between BM25 and dense retrievers.
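
A minimal sketch of RRF in Python (function and variable names are illustrative):

from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); k damps the weight of top ranks.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a BM25 ranking and a dense-retrieval ranking.
fused = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d5", "d3"]])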


Q6. What is a cross-encoder and when is it used in RAG?

A cross-encoder takes a (query, document) pair as a single input and produces a relevance score. Unlike a bi-encoder that encodes query and document independently, a cross-encoder allows full attention between query and document tokens — enabling much more precise relevance judgement.

Trade-off: Full attention is expensive — cross-encoders cannot be pre-computed and indexed. They are used only as a reranker on the top-N candidates (typically top-50–100) from the initial fast retrieval step.

Pipeline:

All documents → ANN retrieval (fast, ~100ms) → top-100 candidates
    → Cross-encoder reranker (slow, ~500ms for 100 docs) → top-5–10
    → Pass to LLM

When to add a reranker: When initial retrieval precision is insufficient — retrieved documents are topically related but not precisely relevant to the query. Cross-encoders typically recover 5–15% additional precision over bi-encoder retrieval alone.
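
A sketch of the reranking step, assuming the sentence-transformers CrossEncoder class; the checkpoint name is an example MS MARCO reranker:

from sentence_transformers import CrossEncoder

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Rerank the top-N candidates from fast retrieval with a cross-encoder."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    # Score every (query, document) pair with full cross-attention.
    scores = model.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]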


Chapter 3: Advanced RAG Techniques


Q7. What is HyDE and how does it bridge the query-document gap?

HyDE (Hypothetical Document Embeddings) uses an LLM to generate a hypothetical ideal answer to the query, then embeds that answer rather than the original query for retrieval.

Why it works: A dense, fluent passage (the hypothetical answer) lives much closer in vector space to real documents on the same topic than a short, sparse user query does — even when they cover the same concept. This bridges the vocabulary and length mismatch between queries and documents.

Query: "How does attention mechanism work?"
  ↓
LLM generates: "The attention mechanism computes query, key, and value
matrices from input embeddings and uses scaled dot-product..."
  ↓
Embed this passage → retrieve real documents most similar to it

Risk: If the LLM hallucinates in the hypothetical answer, those errors are encoded into the retrieval vector — potentially retrieving misleading documents. Use HyDE with models that are reliable for the domain.

Coding challenge:

def hyde_retrieval(query: str, llm, retriever) -> list[Document]:
    # Step 1: generate a hypothetical passage that answers the query.
    hypothetical = llm.invoke(
        f"Write a detailed answer to: {query}\nAnswer:"
    )
    # Step 2: retrieve real documents closest to the hypothetical passage,
    # not to the original (short, sparse) query.
    return retriever.retrieve(hypothetical.content)


Q8. What is Self-RAG and how does it differ from standard RAG?

Standard RAG: Always retrieves; always uses what is retrieved; no verification.

Self-RAG: Trains the LLM to generate reflection tokens inline that control retrieval and self-assess groundedness:

Reflection Token | Question | Values
[Retrieve] | Should I retrieve? | yes / no
[IsRel] | Is the retrieved document relevant? | relevant / irrelevant
[IsSup] | Is my generated text supported? | fully / partially / not supported
[IsUse] | Is the response useful? | 1–5

In the paper's evaluations, Self-RAG (Llama-2 7B and 13B) outperforms ChatGPT and retrieval-augmented Llama-2-chat on tasks including PopQA and ASQA. The model only retrieves when needed ([Retrieve]=yes), reducing latency for simple queries.

Key trade-off: Requires fine-tuning the generation model. Cannot be applied to black-box APIs. CRAG achieves similar corrective benefits without modifying the LLM.


Q9. How does GraphRAG handle queries that standard vector RAG fails on?

Standard vector RAG retrieves chunks similar to the query by cosine similarity. It fails on:

  1. Global queries — "What are the main themes in this corpus?" No single chunk captures this.

  2. Multi-hop queries — "How is entity A connected to event B?" The connection may span many chunks.

GraphRAG (Microsoft, 2024) builds a knowledge graph from the corpus:

  1. Extract entities and relationships (LLM-based).

  2. Apply Leiden algorithm for community detection (groups of tightly related entities).

  3. Generate LLM summaries for each community.

Query time:

  • Global queries: map-reduce over community summaries.

  • Local queries: traverse graph from the most relevant entity.

Build cost is high (LLM calls for entity extraction + community summarisation). Best for: large corpora where thematic understanding or entity relationship queries are frequent.
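
A simplified sketch of the global-query path (map-reduce over community summaries); llm stands for any prompt-in, text-out callable, and the prompts are illustrative rather than GraphRAG's actual prompts:

def global_query(question: str, community_summaries: list[str], llm) -> str:
    # Map: get a partial answer from each community summary independently.
    partial_answers = [
        llm(f"Using only this community summary, answer the question.\n"
            f"Summary: {summary}\nQuestion: {question}")
        for summary in community_summaries
    ]
    # Reduce: synthesise the partial answers into one global answer.
    joined = "\n---\n".join(partial_answers)
    return llm(f"Combine these partial answers into a single answer to "
               f"'{question}':\n{joined}")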


Chapter 4: RAG Evaluation


Q10. Why is no single metric sufficient to evaluate a RAG system?

RAG has multiple failure modes that different metrics capture:

Metric Type | What It Catches | What It Misses
Recall@k | Whether relevant docs are retrieved | Ranking quality, precision
Faithfulness | Whether answer is grounded in retrieved docs | Whether retrieved docs are correct
Answer correctness | Whether answer matches ground truth | Whether grounding is the source
End-to-end F1 | Final answer quality | Root cause of failures

A system can have: high recall + low faithfulness (retrieves well, then hallucinates); high faithfulness + wrong answer (correctly cites wrong source); perfect generation + poor retrieval (lucky parametric recall). All layers must be evaluated to diagnose root causes.


Q11. What are the RAGAS evaluation dimensions and what does each measure?

RAGAS (Es et al., 2023) provides reference-free evaluation of RAG systems using an LLM as a judge:

Dimension | Question | Requires Ground Truth?
Faithfulness | Are all claims in the answer supported by retrieved context? | No
Answer Relevance | Is the answer on-topic for the original query? | No
Context Recall | Does the retrieved context contain what's needed to answer? | Yes (reference answer)
Context Precision | Are retrieved documents relevant, or is there noise? | No

Key risks of LLM-as-a-judge:

  1. Fluency bias — well-written hallucinations can score high.

  2. Prompt sensitivity — scoring rubric significantly affects results.

  3. Self-preference — models prefer outputs from the same model family.

Mitigation: Validate RAGAS scores against human judgments on a held-out set before using for production decisions.
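
A naive sketch of the faithfulness check (not the actual RAGAS implementation); llm is any prompt-in, text-out callable and the claim-decomposition rubric is an assumption:

def faithfulness_score(answer: str, contexts: list[str], llm) -> float:
    """Fraction of answer claims judged supported by the retrieved context."""
    # 1. Decompose the answer into atomic claims.
    claims = llm(
        f"List each factual claim in the following answer, one per line:\n{answer}"
    ).splitlines()
    claims = [c.strip() for c in claims if c.strip()]
    if not claims:
        return 1.0
    context_block = "\n".join(contexts)
    # 2. Ask the judge whether each claim is supported by the context.
    supported = sum(
        llm(
            f"Context:\n{context_block}\n\nClaim: {claim}\n"
            "Is the claim supported by the context? Answer yes or no."
        ).strip().lower().startswith("yes")
        for claim in claims
    )
    return supported / len(claims)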

Follow-up: How do you build an evaluation set without labelled data?

Three approaches: (1) Synthetic — LLM generates question-context-answer triples from your corpus; (2) Production logging — sample real user queries and have humans label correct answers; (3) Expert annotation — domain experts create a gold-standard test set (expensive, highest quality). Start synthetic for iteration speed; refine with human annotations before deployment.


Q12. How do you diagnose whether a RAG failure is in retrieval or generation?

Run an oracle experiment: manually find the correct document and inject it directly into the prompt, bypassing retrieval entirely.

  • If the model answers correctly with the oracle document → the failure is in retrieval (fix: improve embeddings, use hybrid retrieval, adjust chunking).

  • If the model still fails with the correct document in context → the failure is in generation (fix: improve prompting, citation constraints, grounding instructions, or model capability).

This isolates the failure mode cleanly and prevents the common error of optimising retrieval when the real problem is generation (or vice versa).
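
A minimal sketch of the oracle experiment; the retriever and generate interfaces are assumptions standing in for your own pipeline:

def oracle_experiment(query: str, gold_doc: str, retriever, generate) -> dict:
    """Generate twice: once with retrieved context, once with the known-correct document."""
    retrieved_answer = generate(query, context=retriever.retrieve(query))
    oracle_answer = generate(query, context=[gold_doc])  # retrieval bypassed
    # If oracle_answer is correct but retrieved_answer is not -> retrieval failure.
    # If both are wrong -> generation (prompting or model) failure.
    return {"retrieved": retrieved_answer, "oracle": oracle_answer}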


Chapter 5: LLM Agent Fundamentals


Q13. What is the difference between an LLM and an LLM agent?

An LLM takes a fixed input and produces a single output. It answers a question in one shot.

An LLM agent uses an LLM as its reasoning engine but wraps it in a loop:

while not done:
    thought = LLM.reason(state, history, tools)
    if thought.requires_tool:
        result = tool.call(thought.tool_name, thought.args)
        history.append(result)
    else:
        return thought.final_answer

The agent can call external tools, observe their results, and dynamically decide the next step — repeating until the task is complete.

Key differences:

Dimension | LLM | LLM Agent
Flow | Fixed: one input → one output | Iterative: reason → act → observe
External state | None | Maintains state across steps
Tool access | None | Calls APIs, databases, code interpreters
Planning | Single response | Dynamic, step-by-step

Q14. How does function calling work at the API level?

The caller passes a list of tool schemas (JSON with name, description, parameters) alongside the user message. If the model decides a tool call is needed, instead of generating text it emits a structured JSON object:

{
  "tool_call": {
    "name": "search_web",
    "arguments": {"query": "RAG benchmarks 2025", "max_results": 3}
  }
}

The calling framework executes the tool and appends the result as a tool message back to the conversation. The model then continues generating — either calling another tool or producing a final answer.
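
A sketch of this round trip, assuming an OpenAI-style chat-completions API; the search_web tool and its stub implementation are illustrative:

import json
from openai import OpenAI

def search_web(query: str, max_results: int = 3) -> str:
    # Illustrative stub: a real implementation would call a search API.
    return f"Top {max_results} results for: {query}"

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "search_web",
        "description": "Search the web and return the top results as text.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "max_results": {"type": "integer"},
            },
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "What are the latest RAG benchmarks?"}]
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
msg = response.choices[0].message

if msg.tool_calls:  # the model chose to call a tool instead of answering directly
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)
    result = search_web(**args)           # the framework executes the tool
    messages.append(msg)                  # the assistant's tool-call turn
    messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    # Second call: the model sees the observation and answers or calls another tool.
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)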

Key design rules for tool schemas:

  • Names must be descriptive — the model uses the name and description to decide when to call.

  • Avoid overlapping tool descriptions — ambiguity causes wrong tool selection.

  • Mark parameters required only if the tool genuinely cannot run without them.

  • Keep return formats concise and structured — avoid dumping raw HTML.


Q15. What is the ReAct pattern and why is it the dominant agent architecture?

ReAct (Reason + Act) (Yao et al., 2022) interleaves Thought, Action, and Observation steps before producing a final answer.

Thought: I need to find the CEO of Anthropic.
Action: search_web("Anthropic CEO 2025")
Observation: Dario Amodei is the CEO of Anthropic.
Thought: I have the answer.
Answer: Dario Amodei is the CEO of Anthropic.

Why it's dominant:

  1. The Thought step (chain-of-thought) forces explicit reasoning before acting — reduces impulsive wrong tool calls.

  2. Observations ground the model in real retrieved information, reducing hallucination.

  3. The loop allows self-correction — if an observation is unexpected, the next thought can adapt.

Limitations: Linear (no backtracking); no exploration; prone to tool-call loops. Advanced alternatives: LATS (tree search), Reflexion (reflection + retry).


Chapter 6: Agent Memory and Planning


Q16. What are the four memory types in LLM agents?

Memory Type | What It Stores | Where | Example
Working (in-context) | Active context: task, history, tool outputs | Context window | Current conversation, scratchpad
Episodic | Specific past interactions with timestamps | Vector DB | "On task X, tool Y returned Z"
Semantic | General facts and domain knowledge | Weights or RAG index | World knowledge, domain facts
Procedural | Action policies and skills | System prompt or fine-tuned weights | "Always check docs before coding"

Key distinction — episodic vs. semantic: Episodic memory records what the agent experienced (a concrete interaction). Semantic memory records what the agent knows (general facts). Periodic reflection converts episodic memories into semantic ones, preventing the episodic store from flooding with low-level events.

Follow-up: How does MemGPT extend the context window?

MemGPT treats the context window like CPU RAM and external storage like disk. The agent is given explicit memory management functions (archival_memory_search, archival_memory_insert) it calls like any tool. When the context fills, old content is paged to external storage; the agent explicitly retrieves it when needed. This enables unbounded effective context.


Q17. What is LATS and how does it improve over ReAct?

LATS (Language Agent Tree Search) (Zhou et al., 2023) replaces ReAct's greedy single-path search with Monte Carlo Tree Search (MCTS):

  1. Selection — pick the most promising node using UCB1.

  2. Expansion — generate new action candidates.

  3. Evaluation — score each with an LLM critic (not a random rollout).

  4. Backpropagation — update scores of all nodes on the path.

  5. Reflection — when a path fails, generate a verbal critique and use it to inform other branches.

Results: HotpotQA: LATS 73.2% vs. ReAct 35.1%. HumanEval (GPT-4): LATS 94.4%.

When to use: Tasks with high combinatorial complexity where the best answer matters more than speed. The cost is significantly more LLM calls.
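
For reference, the UCB1 rule used in the selection step balances a node's average value (exploitation) against how rarely it has been visited (exploration):

UCB1(i) = V̄ᵢ + c · √(ln N / nᵢ)

where V̄ᵢ is the average value of child node i, nᵢ its visit count, N the parent's visit count, and c the exploration constant.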


Q18. What is Reflexion and how does verbal reinforcement work?

Reflexion (Shinn et al., 2023) adds a reflection step after a failed attempt: the agent generates a natural language critique of what went wrong, stores it in episodic memory, and uses it on the next attempt.

Attempt 1 → fail
Reflection: "I called the wrong API endpoint. I should check the documentation first."
  ↓ stored in memory
Attempt 2 → informed by reflection → succeed

Why verbal > numerical reward: Natural language critiques are directly interpretable by the LLM on the next attempt, providing specific corrective guidance rather than just a scalar signal.

Results: The paper reports 91% pass@1 on HumanEval with GPT-4 as the base model, versus roughly 80% for the GPT-4 baseline in the same evaluation.


Chapter 7: Agent Frameworks and Protocols


Q19. What is the core abstraction in LangGraph?

LangGraph models agent execution as a directed graph with cycles:

  • Nodes: Python functions (LLM calls, tool calls, state transforms).

  • Edges: Transitions between nodes — unconditional or conditional (based on state).

  • State: A typed dictionary passed through the graph; each node reads from and writes to it.

Cycles are what enable the ReAct loop — execution can return from the tools node back to the LLM node indefinitely. Conditional edges implement routing logic ("if the model called a tool, go to tools; else return to user"). Human-in-the-loop interrupt points pause execution at any node for external approval.
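
A minimal ReAct-style graph, sketched against LangGraph's StateGraph API; the node functions and routing condition are placeholders:

from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    messages: list            # conversation history, including tool observations
    pending_tool: str | None  # name of the tool the model wants to call, if any

def call_model(state: AgentState) -> dict:
    # Placeholder: call the LLM here; set "pending_tool" when it requests a tool.
    return {"messages": state["messages"], "pending_tool": None}

def call_tools(state: AgentState) -> dict:
    # Placeholder: execute the requested tool and append the observation.
    return {"messages": state["messages"], "pending_tool": None}

def route(state: AgentState) -> str:
    return "tools" if state["pending_tool"] else END

graph = StateGraph(AgentState)
graph.add_node("agent", call_model)
graph.add_node("tools", call_tools)
graph.set_entry_point("agent")
graph.add_conditional_edges("agent", route, {"tools": "tools", END: END})
graph.add_edge("tools", "agent")   # the cycle that enables the ReAct loop
app = graph.compile()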


Q20. What problem does MCP solve?

Before MCP, every AI tool had to be written as a custom adapter for each framework — a tool for LangChain couldn't be used in Claude Desktop without reimplementation.

Model Context Protocol (MCP) defines a universal client-server protocol:

  • MCP Server: A tool (database, API, file system) exposes itself once as an MCP server.

  • MCP Client: Any MCP-compatible host (Claude, GPT-4, custom app) connects and discovers available tools automatically.

This reduces the integration problem from N×M (N models × M tools) to N+M (N hosts + M servers).

Protocol: JSON-RPC 2.0 over stdio (local) or HTTP+SSE (remote). Servers expose Tools, Resources, and Prompts.
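
An illustrative tools/call request over JSON-RPC 2.0 (the tool name and arguments are examples, not an excerpt from the spec); the server replies with a result message containing the tool output:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "query_database",
    "arguments": {"sql": "SELECT COUNT(*) FROM orders"}
  }
}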

Follow-up: What is A2A and how does it complement MCP?

A2A (Agent-to-Agent protocol, Google, April 2025) standardises agent-to-agent communication. MCP handles model→tool connections; A2A handles agent→agent delegation. In a production system: an orchestrator uses A2A to delegate to sub-agents; each sub-agent uses MCP to connect to its tools.


Chapter 8: Agent Safety and Production


Q21. What are the main agent failure modes in production?

Failure Mode | Description | Mitigation
Hallucinated actions | Agent fabricates tool results or invents non-existent tools | Log every tool call; compare agent's stated observations to actual returns
Scope creep | Agent expands beyond assigned task | Principle of least privilege; tool allowlists
Cascading errors | Early wrong observation poisons downstream reasoning | Verification steps between major stages; oracle experiments
Context loss | Key instructions forgotten in long tasks | Periodic constraint re-injection; MemGPT paging
Tool misuse | Right tool, wrong arguments | Schema validation before execution; few-shot examples
Prompt injection | Adversarial content in tool outputs hijacks behaviour | Sanitise tool outputs; safety classifier; privilege separation
Infinite loops | Same tool called repeatedly without progress | Loop detector; step budget; divergence detector
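
A minimal loop detector and step budget, sketched in Python; the thresholds are arbitrary examples:

def is_stuck(tool_calls: list[tuple[str, str]], max_repeats: int = 3, max_steps: int = 25) -> bool:
    """Flag a run that repeats the same call or exceeds its step budget.

    tool_calls is the ordered history of (tool_name, serialised_arguments) pairs.
    """
    if len(tool_calls) >= max_steps:
        return True                      # step budget exhausted
    last = tool_calls[-max_repeats:]
    # Same tool with identical arguments N times in a row -> no progress.
    return len(last) == max_repeats and len(set(last)) == 1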

Follow-up: What is prompt injection specific to agents?

In agents, the attack surface expands from user input to any tool output (web pages, documents, emails). An attacker who controls a webpage the agent visits can inject instructions that redirect the agent's behaviour. Mitigation: treat all tool outputs as untrusted; sanitise before context injection; use privilege separation (read tools cannot trigger write-capable tools).


Q22. When should a production agent require human approval before acting?

Use this decision framework:

Action Type | Reversible? | HITL Required?
Read file / web search | Yes | No
Write to internal notes | Yes | No
Send email / Slack message | No | Yes
Delete file or DB record | No | Yes
Make financial transaction | No | Always
Deploy to production | No | Always

In LangGraph: graph.add_node("confirm", human_approval_node) inserted before any irreversible action node. Execution pauses; the human reviews the planned action and approves or rejects.


Chapter 9: Context Engineering


Q23. What is context engineering?

Context engineering is the practice of deliberately constructing the information environment (context window) that an LLM operates in — choosing what to include, how to compress it, how to sequence it, and how to isolate different components.

Four pillars:

Pillar | Key Techniques
Writing | System prompt design, structured output schemas, prompt templates
Selecting | RAG retrieval, hybrid search, reranking, memory retrieval
Compressing | LLMLingua (20× compression, 1.5% quality loss), RECOMP, summarisation, prompt caching
Isolating | Section delimiters, numbered citations, tool output sandboxing

Context engineering has become the primary engineering discipline for production LLM applications because the context window is now the primary artefact — not the prompt string.


Q24. How does LLMLingua compress prompts and what is the trade-off?

LLMLingua uses a small proxy LLM to score each token's conditional probability P(token | prefix). Tokens that are easily predictable (high probability) carry less information and are dropped.

  • Compression ratio: Up to 20×

  • Quality loss: ~1.5% on downstream tasks

  • Result: Compressed RAG context passed to the main LLM — lower latency, lower cost

Trade-off: Adds an LLM call for the compression step. Best when: retrieved documents are long and noisy; the main LLM's context window is a bottleneck; cost and latency matter.
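
An illustrative sketch of the core idea, not the actual LLMLingua implementation: score each token's probability under a small proxy LM (gpt2 here as a stand-in) and drop the most predictable tokens. Real LLMLingua adds budget controllers and sentence- and token-level passes that this toy version omits:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compress(text: str, keep_ratio: float = 0.5, proxy: str = "gpt2") -> str:
    tokenizer = AutoTokenizer.from_pretrained(proxy)
    model = AutoModelForCausalLM.from_pretrained(proxy)
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Probability of each token given its prefix (shift logits by one position).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = log_probs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]
    # Keep the least predictable tokens (lowest log-prob = most informative);
    # the first token has no prefix, so it is always kept.
    n_keep = max(1, int(keep_ratio * (ids.shape[1] - 1)))
    keep = set((torch.argsort(token_lp)[:n_keep] + 1).tolist()) | {0}
    kept_ids = [tid for i, tid in enumerate(ids[0].tolist()) if i in keep]
    return tokenizer.decode(kept_ids)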


Q25. What is the lost-in-the-middle effect and how do you mitigate it in RAG?

LLMs attend most strongly to content at the beginning and end of the context window. Content placed in the middle receives consistently less attention. Liu et al. (2023) showed 20+ percentage point accuracy drops when the key document is in the middle vs. at the start.

Mitigation for RAG:

  • Place the most relevant documents at the beginning of the retrieved context block (captures attention sink).

  • Place the second-most relevant at the end (captures recency bias).

  • Less relevant documents go in the middle.

  • Place the current query last in the overall prompt (recency bias puts the question at maximum attention).
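
A small helper implementing this ordering, assuming docs is already sorted by descending relevance:

def reorder_for_attention(docs: list[str]) -> list[str]:
    """Place the best doc first, the second best last, and the rest in the middle."""
    if len(docs) < 3:
        return docs
    return [docs[0]] + docs[2:] + [docs[1]]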


Quick Reference Cheat Sheet

RAG Pipeline Stages

1. Indexing:    chunk → embed → store in vector index

2. Retrieval:   embed query → ANN search → rerank

3. Augmentation: inject top-k chunks into prompt

4. Generation:  LLM generates grounded answer

Key Retrieval Metrics

Metric | Formula | What It Measures
Recall@k | % queries with ≥1 relevant doc in top-k | Coverage
MRR | avg(1/rank of first relevant doc) | Rank of first relevant
nDCG | Weighted by position and graded relevance | Full ranking quality
Precision@k | % of top-k docs that are relevant | Noise level
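
A sketch of Recall@k and MRR over an evaluation set; each item pairs the ranked list of retrieved doc IDs with the set of relevant IDs:

def recall_at_k(results: list[tuple[list[str], set[str]]], k: int) -> float:
    """Fraction of queries with at least one relevant document in the top-k."""
    hits = sum(bool(set(ranked[:k]) & relevant) for ranked, relevant in results)
    return hits / len(results)

def mrr(results: list[tuple[list[str], set[str]]]) -> float:
    """Mean reciprocal rank of the first relevant document (0 if none retrieved)."""
    total = 0.0
    for ranked, relevant in results:
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results)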

RAGAS Dimensions

Dimension | Reference-Free? | Measures
Faithfulness | Yes | Answer grounded in retrieved context
Answer Relevance | Yes | Answer addresses the question
Context Recall | No | Retrieved context covers the answer
Context Precision | Yes | Retrieved docs are relevant

Agent Memory Types

Working   → context window (volatile, fast)
Episodic  → vector DB (past experiences, timestamped)
Semantic  → weights or RAG index (general facts)
Procedural → system prompt or fine-tuned weights (action policies)

Planning Methods at a Glance

ReAct        → greedy, single-path, adaptive
Plan-Execute → upfront plan, parallel steps, brittle to surprises
LATS         → tree search, expensive, best quality
Reflexion    → retry with reflection, medium cost

MCP vs. A2A

MCP → model ↔ tool  (filesystem, database, API)
A2A → agent ↔ agent (orchestrator ↔ sub-agent)

Agent Failure Mode Mitigations

Hallucinated actions  → log + compare actual tool returns
Scope creep           → least privilege tool allowlists
Cascading errors      → verification checkpoints
Context loss          → constraint re-injection; MemGPT
Prompt injection      → sanitise tool outputs; privilege separation
Infinite loops        → loop detector; step budget