Best-of-N Sampling
1. Overview
Best-of-N (BoN) sampling generates N candidate outputs from a model and selects the best one using a reward model or scoring function. It is the simplest form of test-time compute scaling.
x* = argmax R(x) over x ∈ {x₁, ..., x_N}
Error reduction is approximately exponential, error ∝ e^(-cN), up to a saturation point: if samples are independent and each fails with probability p, BoN with a reliable scorer fails only when all N candidates do, giving error ≈ p^N = e^(-N·ln(1/p)).
2. Core Mechanics
Algorithm:
1. Sample N completions from the model (with temperature sampling for diversity)
2. Score each with a reward function R(x)
3. Return the highest-scoring completion
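A minimal Python sketch of this loop, assuming hypothetical `generate_completion(prompt, temperature)` and `reward(prompt, completion)` callables standing in for your model's sampling call and scoring function:

```python
def best_of_n(prompt, generate_completion, reward, n=8, temperature=0.8):
    """Return the highest-reward completion among n sampled candidates."""
    # 1. Sample n completions; temperature > 0 keeps the candidates diverse
    candidates = [generate_completion(prompt, temperature) for _ in range(n)]
    # 2. Score every candidate with the reward function
    scores = [reward(prompt, c) for c in candidates]
    # 3. Hard-select the argmax candidate
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]
```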
Sampling for diversity: Use temperature 0.7–1.0 or top-p nucleus sampling. Too-low temperature produces near-identical candidates; diversity is essential for BoN to work.
Reward function options:
| Type | Description | Use case |
|---|---|---|
| Trained reward model | Neural net trained on human preference data | General quality |
| Rule-based | Length, format compliance, keyword checks | Structured outputs |
| LLM-as-judge | Another LLM scores quality | Subjective tasks |
| Process Reward Model (PRM) | Scores each reasoning step | Math, code, multi-step |
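As a concrete example of the rule-based row, here is a toy scorer for JSON-formatted outputs; the checks and weights are illustrative, not canonical:

```python
import json

def rule_based_reward(completion: str, max_len: int = 2000) -> float:
    """Toy rule-based reward: format compliance plus a soft length check."""
    try:
        json.loads(completion)      # format check: must parse as JSON
    except json.JSONDecodeError:
        return 0.0                  # hard fail on malformed output
    score = 1.0
    if len(completion) <= max_len:  # penalise overly long responses
        score += 0.5
    return score
```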
3. Computational Cost
- Time: O(N × T) where T = generation time per sample
- Space: O(N × L) where L = average response length
- Parallelizable: All N samples can be generated in parallel given sufficient GPU memory
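In an API-serving setting the calls are I/O-bound, so a thread pool suffices to issue all N requests concurrently (for local GPU inference you would batch instead); a sketch, reusing the hypothetical `generate_completion` from above:

```python
from concurrent.futures import ThreadPoolExecutor

def sample_parallel(prompt, generate_completion, n=8, temperature=0.8):
    """Launch all n generation calls at once and collect the results."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(generate_completion, prompt, temperature)
                   for _ in range(n)]
        return [f.result() for f in futures]
```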
Practical N values: N=4–16 provides a good quality-cost balance. Beyond N=64–100, gains diminish significantly. Measure empirically on your task.
4. BoN vs. Other Techniques
| Technique | Training Cost | Inference Cost | Best Use Case |
|---|---|---|---|
| Best-of-N | None (uses existing reward model) | High (N×) | Quick quality boost, verifiable tasks |
| Beam Search | None | Medium | Deterministic, high-probability output |
| RLHF fine-tuning | Very high | Low | Production at scale |
| Speculative decoding | None | Lower latency | Speed, not quality improvement |
BoN vs. beam search: Beam search is deterministic — it tracks the top-k highest-probability prefixes. BoN samples independently with diversity and uses an external quality metric beyond log-probability.
5. Variants
Weighted BoN: Combine multiple high-scoring candidates via weighted averaging rather than hard selection — improves robustness when the top-1 reward model score is noisy.
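For tasks with an extractable final answer, weighted BoN is often realised as a reward-weighted vote rather than literal averaging of text; a sketch, where `extract_answer` is a hypothetical task-specific parser:

```python
import math
from collections import defaultdict

def weighted_bon(candidates, scores, extract_answer, temperature=1.0):
    """Softmax-weight each candidate's answer by its reward, then vote."""
    max_s = max(scores)
    weights = [math.exp((s - max_s) / temperature) for s in scores]  # stable softmax
    votes = defaultdict(float)
    for cand, w in zip(candidates, weights):
        votes[extract_answer(cand)] += w  # pool weight across identical answers
    return max(votes, key=votes.get)
```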
Adaptive N: Determine N based on prompt difficulty or reward variance. Easy prompts: N=2–4. Hard reasoning: N=32–64+.
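One simple adaptive scheme samples in small batches and stops once the best score plateaus; the batch size and stopping threshold below are illustrative:

```python
def adaptive_best_of_n(prompt, generate_completion, reward,
                       batch=4, max_n=64, eps=0.01):
    """Grow N batch by batch; stop early when the best score stops improving."""
    best, best_score = None, float("-inf")
    for _ in range(max_n // batch):
        prev_best = best_score
        for _ in range(batch):
            c = generate_completion(prompt, 0.8)
            s = reward(prompt, c)
            if s > best_score:
                best, best_score = c, s
        if best_score - prev_best < eps:   # negligible gain from the last batch
            break
    return best
```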
Process-Based BoN: Score using a PRM rather than an ORM — evaluates reasoning steps, enabling better selection for math and code. See ORM vs PRM.
Self-Consistency: A special case of BoN without a reward model — generate N reasoning chains, take majority vote on the final answer. Works well for classification and math without needing a separate scorer.
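Since no scorer is involved, self-consistency reduces to a majority vote; a sketch, again assuming a hypothetical `extract_answer`:

```python
from collections import Counter

def self_consistency(prompt, generate_completion, extract_answer, n=16):
    """Majority vote over the final answers of n sampled reasoning chains."""
    answers = [extract_answer(generate_completion(prompt, 0.8))
               for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```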
6. Failure Modes
Reward hacking: The model exploits weaknesses in the reward model to score highly without genuine quality. Mitigations: ensemble reward models, KL regularisation toward the base model, diverse reward training data.
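An ensemble mitigation can be as simple as scoring each candidate by its worst judge, so a candidate must fool every reward model at once; a minimal sketch:

```python
def ensemble_reward(prompt, completion, reward_models):
    """Pessimistic ensemble: take the minimum score across reward models."""
    return min(rm(prompt, completion) for rm in reward_models)
```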
Diminishing returns: Beyond a task-dependent saturation point N*, additional samples provide negligible improvement. Always plot a quality-vs-N curve before committing to large N.
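One cheap way to get that curve: score a single large pool of samples once, then estimate BoN quality at each N by repeated subsampling; `quality` here is whatever hypothetical ground-truth metric you have for the pool:

```python
import random

def bon_quality_curve(scores, quality, ns=(1, 2, 4, 8, 16, 32), trials=1000):
    """Estimate expected BoN quality at each N from one pre-scored pool."""
    curve = {}
    for n in ns:
        total = 0.0
        for _ in range(trials):
            idxs = random.sample(range(len(scores)), n)  # draw N without replacement
            pick = max(idxs, key=lambda i: scores[i])    # select by reward score
            total += quality[pick]                       # evaluate with ground truth
        curve[n] = total / trials
    return curve
```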
Latency in production: N × generation time is too slow for real-time applications. Use speculative decoding (see LLM-Inference-Speed repo) or distil BoN-selected examples into a fine-tuned model for deployment.