Agent Evaluation
Overview¶
Evaluating agents is fundamentally harder than evaluating a single LLM response. A plain LLM call has one input and one output. An agent executes a sequence of decisions — choosing tools, interpreting results, planning next steps — and the quality of the final answer depends on every step in that chain.
This page covers evaluation for single-agent systems and multi-agent systems separately, since multi-agent introduces additional dimensions (coordination, inter-agent communication, emergent failures) that don't exist in single-agent settings.
1. Single-Agent Evaluation¶
1.1 Why It Is Harder Than Evaluating a Single LLM Call¶
| Challenge | Explanation |
|---|---|
| Non-determinism | Same prompt can produce different tool-call sequences across runs |
| Variable trajectory length | One agent solves a task in 3 steps; another takes 12 — both may be correct |
| Partial credit | An agent that retrieves the right document but formats the answer incorrectly has failed — but how badly? |
| Tool call verification | Did the agent call the right tool with the right arguments, or get lucky with the output? |
| Environment dependency | Many tasks require a live browser, code interpreter, or database — hard to reproduce exactly |
1.2 Outcome vs. Trajectory Evaluation¶
Outcome-level: Did the agent produce the correct final answer? Binary — easy to automate but ignores how the agent got there. An agent that hallucinated the right answer without using tools looks identical to one that used tools correctly.
Trajectory-level: Was each intermediate step correct? Did the agent call the right tools, in the right order, with correct arguments? Much richer signal but requires annotating full traces — expensive at scale.
Use outcome metrics for benchmarking across models; trajectory metrics for debugging and improvement.
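To make the distinction concrete, here is a minimal sketch of both scoring modes over a recorded trace. The `Step`/`Trace` shapes, the exact-match answer check, and the hand-annotated reference trajectory are illustrative assumptions, not a standard API:

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str    # which tool the agent invoked
    args: dict   # the arguments it passed

@dataclass
class Trace:
    steps: list[Step]    # the full trajectory, in order
    final_answer: str

def outcome_score(trace: Trace, expected: str) -> bool:
    """Outcome-level: only the final answer is checked."""
    return trace.final_answer.strip() == expected.strip()

def trajectory_score(trace: Trace, reference: list[Step]) -> float:
    """Trajectory-level: fraction of the annotated reference
    trajectory the agent reproduced before first diverging."""
    matched = 0
    for actual, ref in zip(trace.steps, reference):
        if actual.tool == ref.tool and actual.args == ref.args:
            matched += 1
        else:
            break  # order matters: stop at the first divergence
    return matched / len(reference)
```

In practice the outcome check is usually fuzzier (answer normalisation, an LLM judge), and trajectory matching may need to allow reordering of independent steps.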
1.3 Key Metrics¶
| Metric | Definition | Notes |
|---|---|---|
| Task Completion Rate (TCR) | Fraction of tasks fully solved | Primary metric on most benchmarks |
| Step Efficiency | Avg. steps to solve a task | A system with high TCR but 3× the steps is costly in production |
| Tool Call Accuracy | Did the agent invoke the correct tool with correct arguments, per step? | Requires trajectory annotation |
| Grounding Rate | Fraction of agent claims traceable to tool outputs (not parametric knowledge) | Same concept as RAG faithfulness |
| Retry Rate | How often does the agent repeat the same action without progress? | High retry rate signals weak planning |
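As a sketch of how these roll up over a batch of runs, assuming each run is logged with a solved flag plus its step list (a hypothetical schema, not a standard format):

```python
def compute_metrics(runs: list[dict]) -> dict:
    """Aggregate TCR, step efficiency, and retry rate over runs,
    where each run is assumed to be logged as:
      {"solved": bool, "steps": [{"tool": str, "args": dict}, ...]}
    """
    tcr = sum(r["solved"] for r in runs) / len(runs)
    avg_steps = sum(len(r["steps"]) for r in runs) / len(runs)

    # Retry rate: steps that exactly repeat the previous action
    # (same tool, same args) -- a cheap proxy for "no progress".
    repeats = total = 0
    for r in runs:
        steps = r["steps"]
        total += len(steps)
        repeats += sum(
            1 for prev, cur in zip(steps, steps[1:])
            if cur["tool"] == prev["tool"] and cur["args"] == prev["args"]
        )
    return {
        "task_completion_rate": tcr,
        "step_efficiency": avg_steps,
        "retry_rate": repeats / total if total else 0.0,
    }
```

Step efficiency is only meaningful relative to a baseline: compare it across models or against annotated minimum step counts rather than reading the absolute number.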
1.4 LLM-as-Judge for Agent Trajectories¶
For open-ended tasks (writing a report, debugging code), binary pass/fail is insufficient. An LLM judge evaluates:
- Was the final output correct and complete?
- Were tool calls reasonable given the task?
- Did the agent get stuck, hallucinate actions, or waste steps?
Risk: LLM judges reward fluent, well-structured trajectories even when factually wrong. Always validate judge scores against a human-labelled held-out set before using them for production decisions.
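A minimal judge sketch along these lines, assuming a placeholder `call_llm` helper standing in for whichever chat client you use; the rubric dimensions mirror the bullets above, and JSON output keeps scores machine-readable:

```python
import json

JUDGE_PROMPT = """You are grading an agent trajectory.

Task: {task}
Trajectory (tool calls and observations): {trajectory}
Final answer: {answer}

Score each dimension from 1 to 5 and reply with JSON only:
{{"correctness": 0, "tool_use": 0, "efficiency": 0, "rationale": ""}}
"""

def judge_trajectory(task: str, trajectory: str, answer: str) -> dict:
    # call_llm is a placeholder for whatever chat-completion client
    # you use: it takes a prompt string and returns the model's text.
    raw = call_llm(JUDGE_PROMPT.format(
        task=task, trajectory=trajectory, answer=answer))
    return json.loads(raw)
```

Requesting a rationale alongside the scores makes spot-checking the judge against the human-labelled held-out set considerably faster.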
1.5 Single-Agent Benchmarks¶
| Benchmark | Task Type | What It Tests |
|---|---|---|
| WebArena (Zhou et al., 2023) | Web navigation (shopping, Reddit, GitLab) | Multi-step browser control on real websites |
| SWE-bench Verified (Jimenez et al., 2023) | GitHub issue resolution | Code agents; long-horizon software engineering |
| AgentBench (Liu et al., 2023) | 8 environments (OS, DB, browser, games) | Breadth across agent task types |
| GAIA (Mialon et al., 2023) | Real-world QA requiring tools | Factual grounding; tool selection |
| τ-bench (Yao et al., 2024) | Retail/airline customer service | Tool use + policy compliance in realistic workflows |
SWE-bench Verified is the current de facto standard for coding agents; most major agent systems now report their task completion rate on it.
2. Multi-Agent Evaluation¶
2.1 Why Multi-Agent Is Even Harder¶
Single-agent evaluation has one trajectory to inspect. Multi-agent systems have N concurrent trajectories plus the communication between them. New failure modes emerge that don't exist in single-agent settings:
| Challenge | Explanation |
|---|---|
| Credit assignment | Which agent caused the final output to be correct or wrong? |
| Coordination failures | Agents may produce contradictory outputs, duplicate work, or deadlock waiting on each other |
| Sub-task specification quality | If the orchestrator decomposes the task incorrectly, all workers fail — but worker metrics look fine |
| Emergent errors | No individual agent is wrong, but their combined outputs produce an incorrect result |
| Communication overhead | Agents may pass malformed or incomplete context to each other, silently degrading quality |
2.2 Evaluation Levels¶
Multi-agent systems must be evaluated at three levels simultaneously:
- System level: Did the overall task succeed?
- Orchestrator level: Was the task decomposition correct? Were the sub-tasks well-specified?
- Worker level: Did each agent complete its sub-task correctly? Did it use the right tools?
Evaluating only at the system level misses orchestrator failures. Evaluating only at the worker level misses coordination failures.
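One way to enforce this is to score every trace at all three levels and store the results together, so a failure can be localised. The record below is an illustrative shape, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class MultiLevelEval:
    task_solved: bool                       # system level
    decomposition_correct: bool             # orchestrator level
    subtasks_well_specified: bool           # orchestrator level
    worker_subtask_solved: dict[str, bool]  # worker level, per worker id

def localise_failure(e: MultiLevelEval) -> str:
    """For a failed run, attribute blame to the first level that broke."""
    if not (e.decomposition_correct and e.subtasks_well_specified):
        return "orchestrator"
    if not all(e.worker_subtask_solved.values()):
        return "worker"
    return "coordination"  # every part looks fine, yet the task failed
```

Attributing a failure to "coordination" when every individual check passes is exactly the emergent-error case from the table above.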
2.3 Metrics¶
System-level:
| Metric | Definition |
|---|---|
| End-to-end TCR | Fraction of high-level tasks fully solved by the system |
| Coordination efficiency | Steps taken ÷ minimum steps needed (≥ 1; higher values indicate redundant or duplicated work) |
| Time-to-completion | Wall-clock time accounting for parallel agent execution |
Orchestrator-level:
| Metric | Definition |
|---|---|
| Decomposition correctness | Are the sub-tasks necessary and sufficient for solving the high-level task? |
| Sub-task specification quality | Are worker instructions precise enough that each worker can execute without ambiguity? |
| Synthesis quality | Does the orchestrator correctly combine worker outputs into a coherent final answer? |
Worker-level:
Same metrics as single-agent: TCR per sub-task, tool call accuracy, grounding rate.
Inter-agent communication:
| Metric | Definition |
|---|---|
| Message faithfulness | Does what one agent communicates to another accurately reflect what it found? |
| Redundancy rate | What fraction of tool calls across agents duplicate work already done by another agent? |
| Contradiction rate | How often do workers produce outputs that directly contradict each other? |
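Redundancy rate is the most mechanical of these to compute, assuming each tool call is logged as an (agent, tool, canonicalised-arguments) tuple; faithfulness and contradiction checks usually need an LLM judge or NLI model on top. A sketch:

```python
def redundancy_rate(tool_calls: list[tuple[str, str, str]]) -> float:
    """tool_calls: (agent_id, tool_name, canonical_args) per call,
    in execution order. A call counts as redundant when a *different*
    agent already made the identical (tool, args) call earlier."""
    first_caller: dict[tuple[str, str], str] = {}
    redundant = 0
    for agent, tool, args in tool_calls:
        key = (tool, args)
        if key in first_caller and first_caller[key] != agent:
            redundant += 1
        first_caller.setdefault(key, agent)
    return redundant / len(tool_calls) if tool_calls else 0.0
```

Canonicalising arguments (e.g., JSON with sorted keys) matters: trivially different encodings of the same call would otherwise hide duplicates.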
2.4 Coordination Failure Patterns¶
Deadlock: Agent A waits for Agent B's output before proceeding; Agent B waits for Agent A. No progress is made.
Redundant execution: Two workers independently call the same tool with the same arguments because the orchestrator didn't share intermediate results.
Contradictory outputs: Worker A concludes X; Worker B concludes ¬X. The orchestrator synthesises an incoherent answer or picks arbitrarily.
Context loss at handoff: Worker A produces a correct result but communicates it to Worker B incompletely (truncation, missing metadata). Worker B makes decisions on degraded information.
Detection: Log all inter-agent messages. Compare what each agent claims to have received vs. what the sending agent actually produced.
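A cheap first pass on that comparison, assuming the message log records both sides of each handoff (a hypothetical schema), flags strict truncations; semantic faithfulness checks would sit on top of this:

```python
def handoff_losses(messages: list[dict]) -> list[dict]:
    """messages: log entries assumed to look like
    {"sender": ..., "receiver": ..., "sent": str, "received": str}.
    Flags handoffs where the received payload is a strict prefix of
    what the sender produced -- a cheap truncation detector."""
    flagged = []
    for m in messages:
        sent, received = m["sent"], m["received"]
        if received != sent and sent.startswith(received):
            flagged.append({**m, "lost_chars": len(sent) - len(received)})
    return flagged
```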
2.5 Multi-Agent Benchmarks¶
Dedicated multi-agent benchmarks are still emerging. Current options:
| Benchmark | Notes |
|---|---|
| GAIA (multi-agent mode) | Tasks too complex for one agent; measures system-level completion |
| AgentBench (multi-agent tracks) | Some environments support multi-agent execution |
| ChatDev / SWE-bench (pipeline agents) | Software development with specialised coder/reviewer/tester agents |
| CAMEL role-playing dataset | Evaluates cooperative reasoning between two agents assigned complementary roles |
Multi-agent benchmarks are significantly less mature than single-agent benchmarks. Most production teams build custom internal evaluation suites using their actual task distribution.
2.6 Practical Evaluation Strategy¶
Because public multi-agent benchmarks are limited, a pragmatic approach is:
1. Decompose into sub-task unit tests. For each worker agent type, build a targeted eval set for its sub-task (e.g., "does the research agent reliably retrieve the right document?"). These are cheap and fast.
2. Sample full end-to-end traces. Run the full multi-agent system on a representative task set and human-review complete traces for a sample of failures.
3. Inject faults. Deliberately provide a worker with a malformed input or wrong tool result, and check whether the orchestrator detects and recovers or propagates the error silently (see the sketch after this list).
4. Monitor in production. Log all inter-agent messages and outcomes. Contradiction rate, retry rate, and step count are the most informative signals for ongoing health.
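A minimal fault-injection harness for step 3, assuming your framework exposes each worker's tools as a mutable mapping (the `worker_env` and `.tools` names are assumptions, not a real framework API):

```python
def inject_tool_fault(worker_env, tool_name: str):
    """Swap one worker tool for a version that returns a corrupted
    payload. worker_env and its .tools mapping are assumptions about
    your own harness, not a real framework API."""
    original = worker_env.tools[tool_name]
    worker_env.tools[tool_name] = lambda *a, **kw: "ERROR: malformed output"
    return original  # so the test can restore the real tool afterwards
```

In a test, inject the fault, run the system once, and assert that the orchestrator either retried the tool or surfaced the failure explicitly, rather than synthesising an answer over the corrupted output.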