Compute-Optimal Inference
1. Overview
Test-time compute scaling (also called inference-time scaling) allocates additional computational resources during inference — rather than training — to improve performance on complex tasks. Instead of scaling model parameters, it uses techniques like multiple sampling, extended reasoning chains, or search algorithms at inference time.
Core Principle
A smaller model paired with a well-chosen inference strategy can outperform a larger model using standard greedy decoding. For hard reasoning tasks, research suggests that spending extra compute at inference time is often more cost-effective than training a larger model.
2. Why It Matters
| Claim | Evidence |
|---|---|
| Smaller model + test-time search > larger model | 7B model with BoN/MCTS can match 34B greedy decoding on MATH |
| Inference compute can substitute for training compute | OpenAI o1/o3, DeepSeek R1 |
| System 2 thinking is unlockable at inference | Extended CoT enables deliberate step-by-step reasoning |
3. Core Techniques
3.1 Parallel Scaling
Best-of-N (BoN): Generate N candidate outputs, score each with a reward model or verifier, and return the highest-scoring one. With a reliable verifier, the error rate falls roughly exponentially in N: error ∝ e^(−cN). See Best-of-N Sampling.
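A minimal sketch of the BoN loop, assuming a `generate(prompt)` sampling call and a `score(candidate)` reward-model or verifier score; both are stand-ins, shown here with toy implementations so the snippet runs:

```python
import random

def best_of_n(prompt, generate, score, n=8):
    """Sample n candidates and return the one the scorer rates highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins; replace with real model and verifier calls.
generate = lambda prompt: random.choice(["42", "41", "6 * 7 = 42", "not sure"])
score = lambda candidate: 1.0 if "42" in candidate else 0.0

print(best_of_n("What is 6 * 7?", generate, score, n=8))
```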
Majority Voting / Self-Consistency: Sample N reasoning paths, select the most frequent answer. No reward model required — works well for multiple-choice and math tasks.
Weighted Voting: Weight each candidate's vote by a confidence score (e.g., from a reward model or verifier) before aggregating.
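The two voting variants differ only in how votes are counted. A small sketch, where the (answer, confidence) pairs are assumed to come from N sampled reasoning paths:

```python
from collections import Counter, defaultdict

def majority_vote(answers):
    """Self-consistency: the most frequent final answer wins."""
    return Counter(answers).most_common(1)[0][0]

def weighted_vote(samples):
    """Weighted voting: sum per-answer confidence instead of raw counts."""
    totals = defaultdict(float)
    for answer, confidence in samples:
        totals[answer] += confidence
    return max(totals, key=totals.get)

# Five sampled reasoning paths, each yielding (final_answer, confidence).
samples = [("12", 0.9), ("12", 0.7), ("13", 0.95), ("12", 0.6), ("13", 0.5)]
print(majority_vote([a for a, _ in samples]))  # "12" (3 of 5 paths agree)
print(weighted_vote(samples))                  # "12" (total weight 2.2 vs 1.45)
```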
3.2 Sequential Scaling
Chain-of-Thought (CoT): Decompose into intermediate steps. Longer reasoning chains improve performance on hard tasks.
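As a concrete illustration, a zero-shot CoT prompt can be as simple as appending a step-by-step instruction; the wording below is one common pattern, not a prescribed template:

```python
def cot_prompt(question: str) -> str:
    # Zero-shot CoT: ask for intermediate steps before the final answer.
    return (
        f"Question: {question}\n"
        "Think step by step, showing each intermediate result, "
        "then give the final answer on a line starting with 'Answer:'."
    )

print(cot_prompt("A train travels 60 km in 45 minutes. What is its speed in km/h?"))
```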
Self-Refinement: Model critiques and revises its own output iteratively.
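A sketch of the refinement loop, assuming hypothetical `generate`, `critique`, and `revise` calls (stubbed here so the example runs):

```python
def self_refine(prompt, generate, critique, revise, max_rounds=3):
    """Draft, then repeatedly critique and revise until the critic is satisfied."""
    draft = generate(prompt)
    for _ in range(max_rounds):
        feedback = critique(prompt, draft)
        if not feedback:                      # empty feedback: no issues found
            break
        draft = revise(prompt, draft, feedback)
    return draft

# Toy stand-ins so the sketch runs; swap in real model calls.
generate = lambda p: "draft v0"
critique = lambda p, d: "" if d.endswith("v2") else "be more specific"
revise   = lambda p, d, f: d[:-1] + str(int(d[-1]) + 1)
print(self_refine("Summarise the report.", generate, critique, revise))  # draft v2
```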
Tree Search (MCTS, Beam Search + PRM): Explore the reasoning tree with a Process Reward Model scoring intermediate steps. Prunes bad paths early; more efficient than BoN for structured problems.
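A schematic of step-level beam search guided by a PRM; `propose_steps`, `prm_score`, and `is_complete` are assumptions standing in for the step generator, the process reward model, and a termination check:

```python
def beam_search_prm(question, propose_steps, prm_score, is_complete,
                    beam_width=4, max_depth=8):
    """Keep only the beam_width highest-scoring partial reasoning chains per step."""
    beams = [[]]                                   # each beam is a list of steps
    for _ in range(max_depth):
        expansions = [chain + [step]
                      for chain in beams
                      for step in propose_steps(question, chain)]
        if not expansions:
            break
        # The PRM scores each partial chain, so weak paths are pruned early
        # instead of being completed and then discarded, as BoN would do.
        expansions.sort(key=lambda c: prm_score(question, c), reverse=True)
        beams = expansions[:beam_width]
        if any(is_complete(c) for c in beams):
            break
    return max(beams, key=lambda c: prm_score(question, c))
```

Pruning at every step is what makes this cheaper than BoN on structured problems: compute goes into extending promising chains rather than finishing doomed ones.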
3.3 Hybrid
Generate multiple chains in parallel, refine each sequentially, and select the best final answer. Reasoning models such as o1/o3 and R1 are believed to combine parallel and sequential scaling along these lines, though published details of their inference pipelines are limited.
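Schematically (the callables are placeholders, not a description of any particular model's internals):

```python
def parallel_refine_select(prompt, generate, refine, score, n=4):
    """Hybrid scaling: sample n chains, refine each one, then pick the best."""
    drafts = [generate(prompt) for _ in range(n)]     # parallel sampling
    refined = [refine(prompt, d) for d in drafts]     # sequential refinement
    return max(refined, key=score)                    # final selection
```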
4. Compute-Optimal Inference
Empirically, performance scales roughly as a power law in inference compute (Performance ∝ Compute^α) up to a saturation point N*, beyond which additional samples yield diminishing returns.
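One way to operationalise the saturation point: under the toy error model error ∝ e^(−cN) from Section 3.1, stop adding samples once the marginal accuracy gain falls below a threshold. The constants below (c, the gain threshold) are illustrative assumptions, not measured values:

```python
import math

def saturation_point(c=0.35, min_gain=0.005, n_max=64):
    """Smallest N at which one more sample improves expected accuracy
    by less than min_gain, under the toy model error = exp(-c * N)."""
    for n in range(1, n_max):
        gain = math.exp(-c * n) - math.exp(-c * (n + 1))
        if gain < min_gain:
            return n
    return n_max

print(saturation_point())  # N* under these assumed constants
```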
Allocation strategy by budget:
| Budget | Recommended Approach |
|---|---|
| Low | Better base model, greedy decoding |
| Medium | Best-of-N (N=4–16) or CoT |
| High | Tree search + PRM; extended CoT |
Memory bottleneck: Long reasoning chains stress the KV-cache. Sparse attention (top-k or block-sparse) mitigates this.
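For illustration, a minimal NumPy sketch of top-k sparse attention over a long cached context. It is purely didactic: it still computes all scores and only sparsifies the softmax and value read, whereas production kernels avoid materialising the full score vector.

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=64):
    """One query attends only to its k highest-scoring cached tokens.

    q: (d,) query; K, V: (T, d) cached keys/values; k: tokens kept.
    """
    scores = K @ q / np.sqrt(q.shape[0])            # (T,) attention logits
    keep = np.argpartition(scores, -k)[-k:]         # indices of the top-k logits
    w = np.exp(scores[keep] - scores[keep].max())   # softmax over kept tokens only
    w /= w.sum()
    return w @ V[keep]                              # (d,) attention output

# Toy example: an 8k-token cache, attending to the 64 most relevant entries.
rng = np.random.default_rng(0)
q = rng.normal(size=128)
K = rng.normal(size=(8192, 128))
V = rng.normal(size=(8192, 128))
print(topk_sparse_attention(q, K, V).shape)  # (128,)
```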
5. Training for Test-Time Scaling
Models need to be trained to exploit extra compute effectively.
Supervised Fine-Tuning (SFT): Train on long CoT examples (synthetic or distilled). Models learn to imitate extended reasoning.
Reinforcement Learning (RL): Train with outcome rewards on verifiable tasks (math, code). RL discovers reasoning strategies without labeled chains. DeepSeek R1-Zero achieved strong reasoning from pure RL with no SFT.
RLVR (Reinforcement Learning with Verifiable Rewards): Rewards are generated automatically in domains with ground truth, enabling scalable RL training without human labeling.
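A minimal sketch of a verifiable reward for math-style answers; the "Answer:" extraction pattern and numeric tolerance are assumptions, and real pipelines use task-specific checkers (boxed-answer parsing for math, unit tests for code):

```python
import re

def verifiable_reward(completion: str, ground_truth: float, tol: float = 1e-6) -> float:
    """Binary reward: 1.0 iff the model's final numeric answer matches the label."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    if match is None:
        return 0.0
    return 1.0 if abs(float(match.group(1)) - ground_truth) < tol else 0.0

print(verifiable_reward("60 km in 0.75 h is 80 km/h.\nAnswer: 80", 80.0))  # 1.0
print(verifiable_reward("Roughly 75, I think.\nAnswer: 75", 80.0))         # 0.0
```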
6. Key Models (2024–2025)
| Model | Approach | Notable Result |
|---|---|---|
| OpenAI o1 (Sept 2024) | Hidden CoT + test-time search | Strong STEM benchmark results |
| OpenAI o3 (Dec 2024) | Deliberative alignment | 87.7% GPQA Diamond, 87.5% ARC-AGI (high-compute setting) |
| DeepSeek R1 (Jan 2025) | Pure RL, visible <think> tags, MoE (671B / 37B active) | 97.3% MATH-500, ~20× cheaper than o1 |
7. When to Use
Use test-time scaling for:
- Complex multi-step reasoning (math, code, planning)
- Tasks with verifiable correctness
- When a smaller model must match a larger one
Avoid when:
- Latency is the primary constraint
- Task is simple factual lookup
- Compute budget is very tight
Known failure mode — Overthinking: Beyond the saturation point, additional inference compute yields diminishing returns while latency keeps growing. RL-trained reasoning models can also learn to generate redundant chains that look thorough but add no correctness signal. Mitigations: adaptive early stopping once a confidence threshold is met, and length penalties during RL training.
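A sketch of confidence-based early stopping for parallel sampling: keep drawing samples only until one answer clearly dominates the vote. The threshold and sample bounds are illustrative assumptions:

```python
import random
from collections import Counter

def sample_until_confident(prompt, generate_answer, threshold=0.8,
                           min_samples=4, max_samples=32):
    """Stop sampling once one answer holds at least `threshold` of the votes."""
    votes = Counter()
    for i in range(1, max_samples + 1):
        votes[generate_answer(prompt)] += 1
        answer, count = votes.most_common(1)[0]
        if i >= min_samples and count / i >= threshold:
            return answer, i                  # confident early exit
    return votes.most_common(1)[0][0], max_samples

# Toy stand-in: a sampler that answers "42" about 90% of the time.
random.seed(0)
sampler = lambda prompt: "42" if random.random() < 0.9 else "41"
print(sample_until_confident("What is 6 * 7?", sampler))  # ('42', 4) with this seed
```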
8. Interaction with Other Inference Topics
- Speculative decoding can reduce latency of long reasoning chains — see LLM-Inference-Speed repo
- KV-cache pressure from long CoT is managed via Paged Attention (see LLM-Inference-Speed repo)
- BoN and PRM are the two core primitives — covered in this section