ORMs & PRMs
1. Overview
ORM (Outcome Reward Model) and PRM (Process Reward Model) are two paradigms for scoring model outputs in test-time compute. They differ in granularity: an ORM judges only the final answer, while a PRM judges each intermediate reasoning step.
Both are used to guide Best-of-N sampling and tree search at inference time.
2. ORM vs PRM
| Aspect | ORM | PRM |
|---|---|---|
| Granularity | Final answer only | Each reasoning step |
| Feedback | Single scalar | Vector of per-step scores |
| Supervision cost | Low (binary correct/incorrect) | High (step-level human annotation) |
| Error localisation | Cannot identify where reasoning fails | Identifies the failing step |
| Tree/beam search | Poor — only terminal rewards | Excellent — prunes bad paths early |
| Interpretability | Low | High |
Analogy: ORM grades only the final answer on a test. PRM grades the work shown for each step.
3. How They Work
ORM
def orm_score(problem, solution):
    # Single scalar for the whole solution (verify_final_answer is an
    # assumed checker of final-answer correctness)
    return 1.0 if verify_final_answer(solution) else 0.0
Single forward pass per complete solution. Cheap at inference time.
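In a Best-of-N setup the ORM runs once per completed sample. A minimal sketch, assuming a generate callable that samples one full solution (an illustration, not any specific library's API):

def best_of_n_orm(problem, generate, n=16):
    # Draw n independent complete solutions, keep the ORM's top pick.
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda sol: orm_score(problem, sol))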
PRM
def prm_score(problem, reasoning_steps):
    # One score per step, conditioned on the problem and the steps so far
    # (verify_step is an assumed step-level verifier)
    return [verify_step(step, context=(problem, reasoning_steps[:i]))
            for i, step in enumerate(reasoning_steps)]
One forward pass per step — O(n) passes where n = number of steps. PRM scores are aggregated (min, mean, or product) across steps to rank candidates.
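A minimal sketch of that aggregation step; the min rule is popular because a single weak step caps the whole chain:

import math

def aggregate(step_scores, how="min"):
    # Collapse per-step PRM scores into one scalar used to rank candidates.
    if how == "min":      # one bad step sinks the whole solution
        return min(step_scores)
    if how == "mean":
        return sum(step_scores) / len(step_scores)
    if how == "product":  # treats steps as independent success probabilities
        return math.prod(step_scores)
    raise ValueError(f"unknown aggregation: {how}")

Candidates are then ranked by this scalar exactly as they would be by an ORM score.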
4. Training Data Requirements
ORM: (problem, complete_solution) → binary label. Straightforward to collect.
PRM: (problem, step_1, ..., step_n) → step-level labels. Expensive. Two approaches to reduce annotation cost:
- Math-Shepherd (2024): Automated step-level annotation using Monte Carlo rollouts: estimate a step's value by sampling completions from it and checking final-answer correctness (sketched after this list).
- Weak-to-strong supervision: Use a weaker model to generate step labels for a stronger model.
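A minimal sketch of the Monte Carlo labelling idea; sample_completion and is_correct are assumed callables for illustration, not Math-Shepherd's actual API:

def mc_step_value(problem, prefix, sample_completion, is_correct, k=8):
    # Monte Carlo estimate: a prefix's value is the fraction of k rollouts
    # from that prefix that reach a correct final answer.
    hits = sum(is_correct(problem, sample_completion(problem, prefix))
               for _ in range(k))
    return hits / k

def label_steps(problem, steps, sample_completion, is_correct, k=8):
    # Soft PRM label for step i: the MC value of the prefix ending at step i.
    return [mc_step_value(problem, steps[:i + 1], sample_completion, is_correct, k)
            for i in range(len(steps))]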
5. Role in Test-Time Search
PRMs unlock tree search at inference time:
Query
│
▼
Generate k partial reasoning paths
│
▼
PRM scores each step → prune low-scoring branches
│
▼
Extend surviving paths → repeat until complete
│
▼
ORM re-scores final candidates → return best
This is more efficient than plain Best-of-N sampling: bad reasoning paths are abandoned early rather than generated to completion before scoring.
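A minimal sketch of PRM-guided beam search under assumed callables (extend proposes candidate next steps; prm_step_score and orm_score_fn wrap the two reward models):

def prm_guided_search(problem, extend, prm_step_score, orm_score_fn,
                      beam=4, branch=3, max_depth=10):
    # Each beam entry: (steps so far, running min of its step scores).
    beams = [([], 1.0)]
    for _ in range(max_depth):
        candidates = []
        for steps, worst in beams:
            for step in extend(problem, steps, n=branch):
                score = prm_step_score(problem, steps + [step])
                candidates.append((steps + [step], min(worst, score)))
        # Prune: keep only the top-`beam` partial paths by PRM score.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    # ORM re-scores the surviving candidates; return the best path.
    return max(beams, key=lambda c: orm_score_fn(problem, c[0]))[0]

Real implementations also terminate paths that emit a final answer before max_depth; this sketch runs to a fixed depth for brevity.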
6. Key Research
- "Let's Verify Step by Step" (OpenAI, 2023): PRMs substantially outperform ORMs on MATH benchmark. Best-of-N with PRM >> Best-of-N with ORM.
- Math-Shepherd (2024): Automated PRM data collection; reduces human labelling cost.
- DeepSeek R1 (2025): Trains with RLVR (Reinforcement Learning with Verifiable Rewards): rule-based checks on the final answer supply an ORM-style outcome signal without training a separate reward model.
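A minimal sketch of such a verifiable reward; extract_final_answer and the gold_answer field are assumptions for illustration, not DeepSeek's implementation:

def verifiable_reward(problem, completion, extract_final_answer):
    # Rule-based outcome check: no learned reward model involved.
    answer = extract_final_answer(completion)  # assumed answer parser
    return 1.0 if answer == problem["gold_answer"] else 0.0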
7. When to Use Each
Use ORM when:
- Task has clear right/wrong answers
- Limited annotation budget
- Fast inference is required
Use PRM when:
- Multi-step reasoning (math, code, planning)
- Tree search is in the pipeline
- Interpretability and step-level debugging matter
- Combining with Best-of-N sampling for complex tasks