Best-of-N Sampling
1. Overview
Best-of-N (BoN) sampling generates N candidate outputs from a model and selects the best one using a reward model or scoring function. It is the simplest form of test-time compute scaling.
x* = argmax R(x) over x ∈ {x₁, ..., x_N}
Error reduction is approximately exponential, error ∝ e^(-cN), up to a saturation point: if samples are independent and each fails with probability p, BoN with a reliable scorer fails only when all N candidates do, giving error ≈ p^N = e^(-N·ln(1/p)).
2. Core Mechanics
Algorithm:
1. Sample N completions from the model (with temperature sampling for diversity)
2. Score each with a reward function R(x)
3. Return the highest-scoring completion
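A minimal Python sketch of this loop, assuming hypothetical `generate_completion(prompt, temperature)` and `reward(prompt, completion)` callables standing in for your model's sampling call and scoring function:

```python
def best_of_n(prompt, generate_completion, reward, n=8, temperature=0.8):
    """Return the highest-reward completion among n sampled candidates."""
    # 1. Sample n completions; temperature > 0 keeps the candidates diverse
    candidates = [generate_completion(prompt, temperature) for _ in range(n)]
    # 2. Score every candidate with the reward function
    scores = [reward(prompt, c) for c in candidates]
    # 3. Hard-select the argmax candidate
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]
```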
Sampling for diversity: Use temperature 0.7–1.0 or top-p nucleus sampling. Too-low temperature produces near-identical candidates; diversity is essential for BoN to work.
Reward function options:
| Type | Description | Use case |
|---|---|---|
| Trained reward model | Neural net trained on human preference data | General quality |
| Rule-based | Length, format compliance, keyword checks | Structured outputs |
| LLM-as-judge | Another LLM scores quality | Subjective tasks |
| Process Reward Model (PRM) | Scores each reasoning step | Math, code, multi-step |
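As a concrete example of the rule-based row, here is a toy scorer for JSON-formatted outputs; the checks and weights are illustrative, not canonical:

```python
import json

def rule_based_reward(completion: str, max_len: int = 2000) -> float:
    """Toy rule-based reward: format compliance plus a soft length check."""
    try:
        json.loads(completion)      # format check: must parse as JSON
    except json.JSONDecodeError:
        return 0.0                  # hard fail on malformed output
    score = 1.0
    if len(completion) <= max_len:  # penalise overly long responses
        score += 0.5
    return score
```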
3. Computational Cost
- Time: O(N × T) where T = generation time per sample
- Space: O(N × L) where L = average response length
- Parallelizable: All N samples can be generated in parallel given sufficient GPU memory
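In an API-serving setting the calls are I/O-bound, so a thread pool suffices to issue all N requests concurrently (for local GPU inference you would batch instead); a sketch, reusing the hypothetical `generate_completion` from above:

```python
from concurrent.futures import ThreadPoolExecutor

def sample_parallel(prompt, generate_completion, n=8, temperature=0.8):
    """Launch all n generation calls at once and collect the results."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(generate_completion, prompt, temperature)
                   for _ in range(n)]
        return [f.result() for f in futures]
```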
Practical N values: N=4–16 provides a good quality-cost balance. Beyond N=64–100, gains diminish significantly. Measure empirically on your task.
4. BoN vs. Other Techniques
| Technique | Training Cost | Inference Cost | Best Use Case |
|---|---|---|---|
| Best-of-N | None (uses existing reward model) | High (N×) | Quick quality boost, verifiable tasks |
| Beam Search | None | Medium | Deterministic, high-probability output |
| RLHF fine-tuning | Very high | Low | Production at scale |
| Speculative decoding | None | Lower latency | Speed, not quality improvement |
BoN vs. beam search: Beam search is deterministic — it tracks the top-k highest-probability prefixes. BoN samples independently with diversity and uses an external quality metric beyond log-probability.
5. Variants
Weighted BoN: Combine multiple high-scoring candidates via weighted averaging rather than hard selection — improves robustness when the top-1 reward model score is noisy.
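For tasks with an extractable final answer, weighted BoN is often realised as a reward-weighted vote rather than literal averaging of text; a sketch, where `extract_answer` is a hypothetical task-specific parser:

```python
import math
from collections import defaultdict

def weighted_bon(candidates, scores, extract_answer, temperature=1.0):
    """Softmax-weight each candidate's answer by its reward, then vote."""
    max_s = max(scores)
    weights = [math.exp((s - max_s) / temperature) for s in scores]  # stable softmax
    votes = defaultdict(float)
    for cand, w in zip(candidates, weights):
        votes[extract_answer(cand)] += w  # pool weight across identical answers
    return max(votes, key=votes.get)
```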
Adaptive N: Determine N based on prompt difficulty or reward variance. Easy prompts: N=2–4. Hard reasoning: N=32–64+.
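One simple adaptive scheme samples in small batches and stops once the best score plateaus; the batch size and stopping threshold below are illustrative:

```python
def adaptive_best_of_n(prompt, generate_completion, reward,
                       batch=4, max_n=64, eps=0.01):
    """Grow N batch by batch; stop early when the best score stops improving."""
    best, best_score = None, float("-inf")
    for _ in range(max_n // batch):
        prev_best = best_score
        for _ in range(batch):
            c = generate_completion(prompt, 0.8)
            s = reward(prompt, c)
            if s > best_score:
                best, best_score = c, s
        if best_score - prev_best < eps:   # negligible gain from the last batch
            break
    return best
```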
Process-Based BoN: Score using a PRM rather than an ORM — evaluates reasoning steps, enabling better selection for math and code. See ORM vs PRM.
Self-Consistency: A special case of BoN without a reward model — generate N reasoning chains, take majority vote on the final answer. Works well for classification and math without needing a separate scorer.
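Since no scorer is involved, self-consistency reduces to a majority vote; a sketch, again assuming a hypothetical `extract_answer`:

```python
from collections import Counter

def self_consistency(prompt, generate_completion, extract_answer, n=16):
    """Majority vote over the final answers of n sampled reasoning chains."""
    answers = [extract_answer(generate_completion(prompt, 0.8))
               for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```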
6. Failure Modes
Reward hacking: The model exploits weaknesses in the reward model to score highly without genuine quality. Mitigations: ensemble reward models, KL regularisation toward the base model, diverse reward training data.
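An ensemble mitigation can be as simple as scoring each candidate by its worst judge, so a candidate must fool every reward model at once; a minimal sketch:

```python
def ensemble_reward(prompt, completion, reward_models):
    """Pessimistic ensemble: take the minimum score across reward models."""
    return min(rm(prompt, completion) for rm in reward_models)
```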
Diminishing returns: Beyond a task-dependent saturation point N*, additional samples provide negligible improvement. Always plot a quality-vs-N curve before committing to large N.
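One cheap way to get that curve: score a single large pool of samples once, then estimate BoN quality at each N by repeated subsampling; `quality` here is whatever hypothetical ground-truth metric you have for the pool:

```python
import random

def bon_quality_curve(scores, quality, ns=(1, 2, 4, 8, 16, 32), trials=1000):
    """Estimate expected BoN quality at each N from one pre-scored pool."""
    curve = {}
    for n in ns:
        total = 0.0
        for _ in range(trials):
            idxs = random.sample(range(len(scores)), n)  # draw N without replacement
            pick = max(idxs, key=lambda i: scores[i])    # select by reward score
            total += quality[pick]                       # evaluate with ground truth
        curve[n] = total / trials
    return curve
```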
Latency in production: N × generation time is too slow for real-time applications. Use speculative decoding (see LLM-Inference-Speed repo) or distil BoN-selected examples into a fine-tuned model for deployment.