Compute-Optimal Inference
1. Overview
Test-time compute scaling (also called inference-time scaling) allocates additional computational resources during inference — rather than training — to improve performance on complex tasks. Instead of scaling model parameters, it uses techniques like multiple sampling, extended reasoning chains, or search algorithms at inference time.
Core Principle
A smaller model paired with a well-chosen inference strategy can outperform a larger model using standard greedy decoding. For hard reasoning tasks, research suggests that spending extra compute at inference time is often more cost-effective than training a larger model.
2. Why It Matters
| Claim | Evidence |
|---|---|
| Smaller model + test-time search > larger model | 7B model with BoN/MCTS can match 34B greedy decoding on MATH |
| Inference compute can substitute for training compute | OpenAI o1/o3, DeepSeek R1 |
| System 2 thinking is unlockable at inference | Extended CoT enables deliberate step-by-step reasoning |
3. Core Techniques
3.1 Parallel Scaling
Best-of-N (BoN): Generate N candidate outputs, score each with a reward model or verifier, and return the highest-scoring one. With a reliable verifier, the error rate falls roughly exponentially in N: error ∝ e^(−cN). See Best-of-N Sampling.
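A minimal sketch of the BoN loop, assuming a `generate(prompt)` sampling call and a `score(candidate)` reward-model or verifier score; both are stand-ins, shown here with toy implementations so the snippet runs:

```python
import random

def best_of_n(prompt, generate, score, n=8):
    """Sample n candidates and return the one the scorer rates highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins; replace with real model and verifier calls.
generate = lambda prompt: random.choice(["42", "41", "6 * 7 = 42", "not sure"])
score = lambda candidate: 1.0 if "42" in candidate else 0.0

print(best_of_n("What is 6 * 7?", generate, score, n=8))
```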
Majority Voting / Self-Consistency: Sample N reasoning paths, select the most frequent answer. No reward model required — works well for multiple-choice and math tasks.
Weighted Voting: Weight each candidate's vote by a confidence score (e.g., from a reward model or verifier) before aggregating.
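The two voting variants differ only in how votes are counted. A small sketch, where the (answer, confidence) pairs are assumed to come from N sampled reasoning paths:

```python
from collections import Counter, defaultdict

def majority_vote(answers):
    """Self-consistency: the most frequent final answer wins."""
    return Counter(answers).most_common(1)[0][0]

def weighted_vote(samples):
    """Weighted voting: sum per-answer confidence instead of raw counts."""
    totals = defaultdict(float)
    for answer, confidence in samples:
        totals[answer] += confidence
    return max(totals, key=totals.get)

# Five sampled reasoning paths, each yielding (final_answer, confidence).
samples = [("12", 0.9), ("12", 0.7), ("13", 0.95), ("12", 0.6), ("13", 0.5)]
print(majority_vote([a for a, _ in samples]))  # "12" (3 of 5 paths agree)
print(weighted_vote(samples))                  # "12" (total weight 2.2 vs 1.45)
```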
3.2 Sequential Scaling
Chain-of-Thought (CoT): Decompose into intermediate steps. Longer reasoning chains improve performance on hard tasks.
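As a concrete illustration, a zero-shot CoT prompt can be as simple as appending a step-by-step instruction; the wording below is one common pattern, not a prescribed template:

```python
def cot_prompt(question: str) -> str:
    # Zero-shot CoT: ask for intermediate steps before the final answer.
    return (
        f"Question: {question}\n"
        "Think step by step, showing each intermediate result, "
        "then give the final answer on a line starting with 'Answer:'."
    )

print(cot_prompt("A train travels 60 km in 45 minutes. What is its speed in km/h?"))
```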
Self-Refinement: Model critiques and revises its own output iteratively.
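A sketch of the refinement loop, assuming hypothetical `generate`, `critique`, and `revise` calls (stubbed here so the example runs):

```python
def self_refine(prompt, generate, critique, revise, max_rounds=3):
    """Draft, then repeatedly critique and revise until the critic is satisfied."""
    draft = generate(prompt)
    for _ in range(max_rounds):
        feedback = critique(prompt, draft)
        if not feedback:                      # empty feedback: no issues found
            break
        draft = revise(prompt, draft, feedback)
    return draft

# Toy stand-ins so the sketch runs; swap in real model calls.
generate = lambda p: "draft v0"
critique = lambda p, d: "" if d.endswith("v2") else "be more specific"
revise   = lambda p, d, f: d[:-1] + str(int(d[-1]) + 1)
print(self_refine("Summarise the report.", generate, critique, revise))  # draft v2
```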
Tree Search (MCTS, Beam Search + PRM): Explore the reasoning tree with a Process Reward Model scoring intermediate steps. Prunes bad paths early; more efficient than BoN for structured problems.
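A schematic of step-level beam search guided by a PRM; `propose_steps`, `prm_score`, and `is_complete` are assumptions standing in for the step generator, the process reward model, and a termination check:

```python
def beam_search_prm(question, propose_steps, prm_score, is_complete,
                    beam_width=4, max_depth=8):
    """Keep only the beam_width highest-scoring partial reasoning chains per step."""
    beams = [[]]                                   # each beam is a list of steps
    for _ in range(max_depth):
        expansions = [chain + [step]
                      for chain in beams
                      for step in propose_steps(question, chain)]
        if not expansions:
            break
        # The PRM scores each partial chain, so weak paths are pruned early
        # instead of being completed and then discarded, as BoN would do.
        expansions.sort(key=lambda c: prm_score(question, c), reverse=True)
        beams = expansions[:beam_width]
        if any(is_complete(c) for c in beams):
            break
    return max(beams, key=lambda c: prm_score(question, c))
```

Pruning at every step is what makes this cheaper than BoN on structured problems: compute goes into extending promising chains rather than finishing doomed ones.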
3.3 Hybrid
Generate multiple chains in parallel, refine each sequentially, and select the best final answer. Reasoning models such as o1/o3 and R1 are believed to combine parallel and sequential scaling along these lines, though published details of their inference pipelines are limited.
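Schematically (the callables are placeholders, not a description of any particular model's internals):

```python
def parallel_refine_select(prompt, generate, refine, score, n=4):
    """Hybrid scaling: sample n chains, refine each one, then pick the best."""
    drafts = [generate(prompt) for _ in range(n)]     # parallel sampling
    refined = [refine(prompt, d) for d in drafts]     # sequential refinement
    return max(refined, key=score)                    # final selection
```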
4. Compute-Optimal Inference
Empirically, performance scales roughly as a power law in inference compute (Performance ∝ Compute^α) up to a saturation point N*, beyond which additional samples yield diminishing returns.
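One way to operationalise the saturation point: under the toy error model error ∝ e^(−cN) from Section 3.1, stop adding samples once the marginal accuracy gain falls below a threshold. The constants below (c, the gain threshold) are illustrative assumptions, not measured values:

```python
import math

def saturation_point(c=0.35, min_gain=0.005, n_max=64):
    """Smallest N at which one more sample improves expected accuracy
    by less than min_gain, under the toy model error = exp(-c * N)."""
    for n in range(1, n_max):
        gain = math.exp(-c * n) - math.exp(-c * (n + 1))
        if gain < min_gain:
            return n
    return n_max

print(saturation_point())  # N* under these assumed constants
```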
Allocation strategy by budget:
| Budget | Recommended Approach |
|---|---|
| Low | Better base model, greedy decoding |
| Medium | Best-of-N (N=4–16) or CoT |
| High | Tree search + PRM; extended CoT |
Memory bottleneck: Long reasoning chains stress the KV-cache. Sparse attention (top-k or block-sparse) mitigates this.
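For illustration, a minimal NumPy sketch of top-k sparse attention over a long cached context. It is purely didactic: it still computes all scores and only sparsifies the softmax and value read, whereas production kernels avoid materialising the full score vector.

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=64):
    """One query attends only to its k highest-scoring cached tokens.

    q: (d,) query; K, V: (T, d) cached keys/values; k: tokens kept.
    """
    scores = K @ q / np.sqrt(q.shape[0])            # (T,) attention logits
    keep = np.argpartition(scores, -k)[-k:]         # indices of the top-k logits
    w = np.exp(scores[keep] - scores[keep].max())   # softmax over kept tokens only
    w /= w.sum()
    return w @ V[keep]                              # (d,) attention output

# Toy example: an 8k-token cache, attending to the 64 most relevant entries.
rng = np.random.default_rng(0)
q = rng.normal(size=128)
K = rng.normal(size=(8192, 128))
V = rng.normal(size=(8192, 128))
print(topk_sparse_attention(q, K, V).shape)  # (128,)
```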
5. Training for Test-Time Scaling
Models need to be trained to exploit extra compute effectively.
Supervised Fine-Tuning (SFT): Train on long CoT examples (synthetic or distilled). Models learn to imitate extended reasoning.
Reinforcement Learning (RL): Train with outcome rewards on verifiable tasks (math, code). RL discovers reasoning strategies without labeled chains. DeepSeek R1-Zero achieved strong reasoning from pure RL with no SFT.
RLVR (Reinforcement Learning with Verifiable Rewards): Rewards are generated automatically in domains with ground truth, enabling scalable RL training without human labeling.
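A minimal sketch of a verifiable reward for math-style answers; the "Answer:" extraction pattern and numeric tolerance are assumptions, and real pipelines use task-specific checkers (boxed-answer parsing for math, unit tests for code):

```python
import re

def verifiable_reward(completion: str, ground_truth: float, tol: float = 1e-6) -> float:
    """Binary reward: 1.0 iff the model's final numeric answer matches the label."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    if match is None:
        return 0.0
    return 1.0 if abs(float(match.group(1)) - ground_truth) < tol else 0.0

print(verifiable_reward("60 km in 0.75 h is 80 km/h.\nAnswer: 80", 80.0))  # 1.0
print(verifiable_reward("Roughly 75, I think.\nAnswer: 75", 80.0))         # 0.0
```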
6. Key Models (2024–2025)
| Model | Approach | Notable Result |
|---|---|---|
| OpenAI o1 (Sept 2024) | Hidden CoT + test-time search | Strong STEM benchmark results |
| OpenAI o3 (Dec 2024) | Deliberative alignment | 87.7% GPQA Diamond, 87.5% ARC-AGI (high-compute setting) |
| DeepSeek R1 (Jan 2025) | Pure RL, visible <think> tags, MoE (671B / 37B active) | 97.3% MATH-500, ~20× cheaper than o1 |
7. When to Use
Use test-time scaling for:
- Complex multi-step reasoning (math, code, planning)
- Tasks with verifiable correctness
- When a smaller model must match a larger one
Avoid when:
- Latency is the primary constraint
- Task is simple factual lookup
- Compute budget is very tight
Known failure mode — Overthinking: Beyond the saturation point, additional inference compute yields diminishing returns while latency keeps growing. RL-trained reasoning models can also learn to generate redundant chains that look thorough but add no correctness signal. Mitigations: adaptive early stopping once a confidence threshold is met, and length penalties during RL training.
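A sketch of confidence-based early stopping for parallel sampling: keep drawing samples only until one answer clearly dominates the vote. The threshold and sample bounds are illustrative assumptions:

```python
import random
from collections import Counter

def sample_until_confident(prompt, generate_answer, threshold=0.8,
                           min_samples=4, max_samples=32):
    """Stop sampling once one answer holds at least `threshold` of the votes."""
    votes = Counter()
    for i in range(1, max_samples + 1):
        votes[generate_answer(prompt)] += 1
        answer, count = votes.most_common(1)[0]
        if i >= min_samples and count / i >= threshold:
            return answer, i                  # confident early exit
    return votes.most_common(1)[0][0], max_samples

# Toy stand-in: a sampler that answers "42" about 90% of the time.
random.seed(0)
sampler = lambda prompt: "42" if random.random() < 0.9 else "41"
print(sample_until_confident("What is 6 * 7?", sampler))  # ('42', 4) with this seed
```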
8. Interaction with Other Inference Topics
- Speculative decoding can reduce latency of long reasoning chains — see LLM-Inference-Speed repo
- KV-cache pressure from long CoT is managed via Paged Attention (see LLM-Inference-Speed repo)
- BoN and PRM are the two core primitives — covered in this section