Verification Metrics

Overview

Verification metrics measure whether a model's output can be checked for correctness without human judgment. They are the backbone of scalable evaluation for coding, mathematics, and formal reasoning — and increasingly, the reward signal in RL training pipelines.


1. pass@k

Origin: Chen et al. (2021), introduced with the HumanEval benchmark for code generation.

Definition: The probability that at least one of k independently sampled completions passes all test cases for a given problem.

Unbiased estimator (from the original paper):

\[\text{pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\]

Where:

  • \(n\) = total samples generated per problem (typically \(n \geq k\))
  • \(c\) = number of samples that pass all tests
  • \(k\) = number of samples shown to the user

Why not just compute the fraction correct? Plugging the empirical pass rate into \(1 - (1 - c/n)^k\) gives a biased estimate of pass@k. The formula above is unbiased: it is exactly the fraction of size-\(k\) subsets of the \(n\) samples that contain at least one passing sample.
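
A minimal implementation, using the numerically stable product form from the reference code released with the original paper:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of samples passing all tests
    k: evaluation budget (k <= n)
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # 1 - C(n-c, k) / C(n, k), expanded as a product to avoid huge binomials
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))
```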

pass@1 vs pass@k

| Metric | What it measures | When to use |
|---|---|---|
| pass@1 | Single-shot correctness | Production: user gets one answer |
| pass@k (k > 1) | Coverage / diversity of solutions | Evaluation or Best-of-N settings |

Typical values on HumanEval:

  • Codex (Chen et al., 2021): pass@1 ~28%
  • GPT-4: pass@1 ~85%+
  • Frontier models (2025): pass@1 90–95%+

2. Majority Voting (maj@k / Self-Consistency)

Sample \(k\) independent completions; take the most common final answer as the prediction.

Formal definition:

\[\text{maj@}k = \mathbf{1}\!\left[\text{mode}\{y_1, \ldots, y_k\} = y^*\right]\]

Key insight (Wang et al., 2023): For reasoning tasks with a fixed final answer (math problems, multiple-choice), majority voting over chain-of-thought samples is significantly more accurate than greedy decoding alone.
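
A minimal sketch; in practice the hard part is extracting and normalizing the final answer from each chain of thought, which is assumed done here:

```python
from collections import Counter

def maj_at_k(answers: list[str], gold: str) -> bool:
    """Self-consistency for one problem: True iff the modal answer is gold.

    answers: final answers extracted from k sampled chains of thought.
    Ties break by first occurrence (a simplification; the paper does not
    prescribe a tie-break rule).
    """
    mode, _ = Counter(answers).most_common(1)[0]
    return mode == gold
```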

When it works:

  • Tasks with discrete, verifiable final answers (GSM8K, MATH, AIME)
  • When errors in individual samples are uncorrelated

When it fails:

  • Open-ended generation (no natural "mode")
  • When the model is systematically wrong (correlated errors dominate)

3. Best-of-N (BoN) with a Verifier

Generate \(N\) candidates; a verifier selects the best one. The metric is the accuracy of the selected candidate.

Variants:

| Verifier type | Selection criterion | Use case |
|---|---|---|
| Oracle | Ground-truth answer | Upper-bound measurement |
| ORM (Outcome Reward Model) | Reward assigned to the complete solution | Math, code |
| PRM (Process Reward Model) | Aggregated step-level scores | Multi-step reasoning |
| LLM judge | Preference score | Open-ended tasks |
| Test suite | pass@1 on held-out tests | Code generation |

Scaling behavior: Best-of-N accuracy typically grows roughly linearly in \(\log N\): each doubling of \(N\) yields a similar additive gain, until the verifier's own errors flatten (or, with an imperfect reward model, reverse) the curve. This is the empirical basis for test-time compute scaling.
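
The selection step itself is trivial; all the difficulty lives in the scoring function. A sketch (extract_answer is a hypothetical answer-extraction helper):

```python
from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate the verifier scores highest."""
    return max(candidates, key=score)

# Oracle verifier for a math problem whose known answer is "42":
#   best_of_n(samples, score=lambda s: float(extract_answer(s) == "42"))
```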

See ORMs & PRMs for detailed coverage of reward-model verifiers.


4. Functional Correctness

Code evaluation uses functional correctness: does the generated code produce the right output on a test suite?
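
A bare-bones harness for this, assuming trusted code (real harnesses such as HumanEval's sandbox execution, since generated code is untrusted):

```python
import os
import subprocess
import sys
import tempfile

def passes_tests(candidate_src: str, test_src: str, timeout_s: float = 5.0) -> bool:
    """Run a candidate solution plus its unit tests in a subprocess.

    Returns True iff the process exits cleanly (failed asserts yield a
    nonzero exit code); timeouts count as failures.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_src + "\n\n" + test_src)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout_s)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # infinite loops are failures
    finally:
        os.unlink(path)
```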

HumanEval

  • 164 hand-written Python problems with docstrings and unit tests
  • Metric: pass@k (k = 1, 10, 100)
  • Limitation: Small dataset; now saturated by frontier models

SWE-bench Verified

  • Real GitHub issues requiring understanding an existing codebase, identifying the root cause, and producing a patch that passes the repo's test suite
  • Far harder than HumanEval: each issue demands repository-scale context, fault localization, and a patch that may span multiple files rather than a single function
  • Status (2025): Claude Opus 4.6 leads at ~80.8% resolve rate; considered the current gold standard for coding ability

LiveCodeBench

  • Continuously updated with new competitive programming problems post-training cutoff
  • Addresses contamination by using only problems published after model training
  • More reliable for frontier model comparisons than static HumanEval

5. Reinforcement Learning with Verifiable Rewards (RLVR)

Core idea: Use a deterministic verifier as the reward signal in RL training, eliminating the need for a trained reward model.

Training loop:

  1. Model generates a chain-of-thought and final answer
  2. Verifier checks the answer against ground truth (or test cases)
  3. Binary or graded reward is returned
  4. Policy is updated (typically via GRPO or PPO; see the advantage sketch below)
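
Step 4's group-relative credit assignment, as used in GRPO, normalizes each completion's verifier reward against the other completions for the same prompt, removing the need for a learned value function. A minimal sketch:

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Z-score each completion's reward within its prompt group (GRPO)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)  # identical rewards carry no signal
    return [(r - mu) / sigma for r in rewards]
```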

Verifiable domains:

| Domain | Verification method |
|---|---|
| Mathematics | Symbolic answer matching (e.g., sympy.simplify(a - b) == 0) |
| Code | Test suite execution |
| Formal proofs | Proof assistant (Lean 4, Coq) |
| Multiple-choice | Exact string match |
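
A minimal verifier for the mathematics row, assuming the final answer string has already been extracted from the chain of thought:

```python
import sympy

def math_reward(model_answer: str, gold_answer: str) -> float:
    """Binary RLVR reward via symbolic equivalence checking."""
    try:
        diff = sympy.simplify(sympy.sympify(model_answer)
                              - sympy.sympify(gold_answer))
        return 1.0 if diff == 0 else 0.0
    except (sympy.SympifyError, TypeError):
        return 0.0  # an unparseable answer earns no reward
```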

Why RLVR outperforms learned reward models for reasoning:

  • No learned-reward hacking: the verifier encodes ground truth rather than a learned approximation (format gaming remains possible; see Failure Modes below)
  • No distribution shift: verification is always accurate regardless of what the policy generates
  • Cheap to scale: verification is computationally trivial compared to reward model inference

Used in: DeepSeek-R1, Qwen QwQ, recent o-series reasoning models.


6. Process-Level Verification

Standard pass@k and functional correctness only check answers. Process-level metrics check whether the reasoning steps are correct.

CoT-Pass@k

Extends pass@k by requiring both correct intermediate steps and a correct final answer. A model that reaches the right answer through invalid reasoning is marked incorrect.

Why it matters: Models can "shortcut" to correct answers via pattern matching without genuine reasoning. CoT-Pass@k penalizes this.
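
Mechanically, CoT-Pass@k can reuse the unbiased pass@k estimator from the pass@k section above, changing only what counts as a pass. A sketch, assuming per-step validity judgments (from a PRM or a rule-based checker) are attached to each sample under hypothetical keys:

```python
def cot_pass_count(samples: list[dict]) -> int:
    """Count samples qualifying under CoT-Pass@k: all steps valid AND
    the final answer correct. Feed the result into pass_at_k() as c.

    Hypothetical keys: "steps_valid" is a list of per-step booleans,
    "answer_correct" a boolean.
    """
    return sum(all(s["steps_valid"]) and s["answer_correct"] for s in samples)
```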

Step-Level Reward (PRM Signal)

A PRM assigns a score \(r_t \in [0,1]\) to each reasoning step \(t\). The aggregate score is:

\[R = \prod_{t=1}^{T} r_t \quad \text{(product aggregation)}\]

or

\[R = \min_{t} r_t \quad \text{(min aggregation)}\]

Min aggregation penalizes any single incorrect step; product aggregation penalizes chains with many weak steps. Both have been used in practice; min is more conservative.
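
Both aggregators in code:

```python
import math

def aggregate_prm(step_scores: list[float], mode: str = "min") -> float:
    """Combine per-step PRM scores r_t in [0, 1] into one solution score."""
    if mode == "min":
        return min(step_scores)        # one bad step sinks the whole chain
    if mode == "prod":
        return math.prod(step_scores)  # many mediocre steps compound
    raise ValueError(f"unknown aggregation mode: {mode}")
```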


7. Math Benchmarks as Verification Targets

| Benchmark | Difficulty | Notes |
|---|---|---|
| GSM8K | Elementary/middle school | Saturated (~99% for frontier models) |
| MATH-500 | Competition math (AMC/AIME) | Symbolic + multi-step; still discriminative |
| AIME 2025/2026 | High-school olympiad | Current frontier; top models 85–94% |
| GPQA Diamond | PhD-level (bio/phys/chem) | Google-proof; human experts ~65% |
| HLE (Humanity's Last Exam) | Expert-level, cross-domain | Designed to be unsolvable by current LLMs |

Saturation trajectory: Each generation of frontier models forces the community to a harder benchmark. GPQA Diamond and AIME are the current frontier; HLE is positioned as the next.


8. Failure Modes and Limitations

Metric Gaming

  • Models trained with RLVR can learn to produce answers in the format the verifier expects without correct reasoning (e.g., emitting a plausibly formatted final answer with no valid derivation)
  • Mitigation: require full solution trace, use PRM to verify steps

Test Suite Incompleteness

  • A code solution can pass all provided test cases but fail on edge cases
  • Mitigation: generate diverse tests (property-based testing, fuzzing); see the sketch below
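
For instance, a property-based test written with the Hypothesis library checks a candidate against a reference property over many generated inputs, catching edge cases (empty lists, duplicates) that a small fixed suite misses. my_sort is a hypothetical generated function under test:

```python
from hypothesis import given, strategies as st

from candidate import my_sort  # hypothetical module holding the generated code

@given(st.lists(st.integers()))
def test_sort_matches_reference(xs):
    # Fails the candidate on any generated input where it diverges
    # from Python's built-in sorted().
    assert my_sort(xs) == sorted(xs)
```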

Contamination

  • If AIME 2024 problems appear in training data, pass@k on AIME 2024 is inflated
  • Mitigation: use post-cutoff benchmarks (LiveBench, LiveCodeBench); track contamination with n-gram overlap detection, sketched below
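
A minimal overlap check (13-grams are a common choice, used e.g. in GPT-3's contamination analysis; building the training-corpus n-gram index is assumed done):

```python
def ngram_overlap(text: str,
                  corpus_ngrams: set[tuple[str, ...]],
                  n: int = 13) -> float:
    """Fraction of a benchmark item's word n-grams present in the
    training corpus; high overlap flags likely contamination."""
    tokens = text.split()
    grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    if not grams:
        return 0.0  # text shorter than n tokens
    return len(grams & corpus_ngrams) / len(grams)
```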