# RLOO

## 1. Overview
REINFORCE Leave-One-Out (RLOO) is a multi-sample variant of REINFORCE that uses the mean reward of the other samples in a group as the baseline for each individual sample. It was applied to LLM fine-tuning by Ahmadian et al. (2024) in "Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs" (ACL 2024).
Key result: RLOO consistently outperforms PPO, DPO, and RAFT across all tested models and datasets while using 50–70% less GPU memory and running 2–3× faster than PPO.
## 2. The Big Picture: From REINFORCE to RLOO
Vanilla REINFORCE estimates the gradient from a single sample per prompt, leading to high variance. RLOO's fix: sample \(k\) responses per prompt and, for each one, use the mean reward of the other \(k-1\) responses as its baseline.
| Method | Samples per prompt | Baseline | Variance | Critic needed |
|---|---|---|---|---|
| REINFORCE | 1 | Running mean (batched) | High | No |
| RLOO | k | Leave-one-out mean | Low | No |
| PPO | 1 | Trained value function | Lowest | Yes |
| GRPO | k | Group-normalized | Low | No |
## 3. Formulation

### 3.1 RLOO Gradient Estimator
For a prompt \(x\), sample \(k\) responses \(\{y_1, \ldots, y_k\}\) from the current policy. The gradient for the \(i\)-th sample uses the mean reward of the remaining \(k-1\) samples as its baseline:

\[
\nabla_\theta J(\theta) \approx \frac{1}{k} \sum_{i=1}^{k} \left( R_i - \frac{1}{k-1} \sum_{j \neq i} R_j \right) \nabla_\theta \log \pi_\theta(y_i \mid x)
\]

where \(R_i = r_\phi(x, y_i) - \beta \log \frac{\pi_\theta(y_i|x)}{\pi_{\text{ref}}(y_i|x)}\) is the KL-penalized reward.
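For intuition, a small worked example with invented numbers: suppose \(k = 4\) and the KL-penalized rewards in a group are \(R = (0.9, 0.3, 0.6, 0.2)\). The advantage of the first sample is

\[
\hat{A}_1 = R_1 - \frac{R_2 + R_3 + R_4}{3} = 0.9 - \frac{0.3 + 0.6 + 0.2}{3} \approx 0.53,
\]

so the first response is reinforced in proportion to how much it beats its siblings, with no critic involved.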
### 3.2 Why This Baseline Is Effective
The leave-one-out baseline is:
- Unbiased: \(b_i = \frac{1}{k-1} \sum_{j \neq i} R_j\) depends only on the other samples, and \(y_i \perp \{y_j\}_{j \neq i}\), so \(\mathbb{E}[b_i \, \nabla_\theta \log \pi_\theta(y_i|x)] = \mathbb{E}[b_i]\,\mathbb{E}[\nabla_\theta \log \pi_\theta(y_i|x)] = 0\); subtracting the baseline leaves the expected gradient unchanged (verified numerically in the sketch after this list).
- Low variance: The baseline is computed on-the-fly for each sample at each step — no moving average lag or critic bias.
- Free: no extra model or training is required; the baseline comes from the \(k\) samples that are generated anyway.
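Both properties are easy to check empirically. Below is a minimal sketch (not from the paper; the 3-arm softmax bandit and all numbers are invented for illustration) comparing gradient estimates with and without the leave-one-out baseline. The two estimators should agree in the mean (unbiasedness) while the RLOO variant shows a lower standard deviation (variance reduction).

```python
import torch

torch.manual_seed(0)
logits = torch.zeros(3, requires_grad=True)   # softmax policy over 3 arms
arm_rewards = torch.tensor([1.0, 0.5, 0.0])   # deterministic reward per arm
k = 4                                         # samples per "prompt"

def grad_estimate(use_loo: bool) -> torch.Tensor:
    probs = torch.softmax(logits, dim=0)
    actions = torch.multinomial(probs, k, replacement=True)
    rewards = arm_rewards[actions]
    if use_loo:
        # Leave-one-out baseline: mean reward of the other k-1 samples
        baseline = (rewards.sum() - rewards) / (k - 1)
    else:
        baseline = torch.zeros(k)
    logp = torch.log(probs[actions])
    loss = -((rewards - baseline).detach() * logp).mean()
    return torch.autograd.grad(loss, logits)[0]

for use_loo in (False, True):
    grads = torch.stack([grad_estimate(use_loo) for _ in range(2000)])
    print(f"LOO={use_loo}: mean={grads.mean(0).tolist()} "
          f"avg std={grads.std(0).mean():.3f}")
```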
### 3.3 Relationship to GRPO
RLOO and GRPO both sample \(k\) responses per prompt and use group-based baselines. The key differences lie in the clipping and in how the advantage is normalized:

| | RLOO | GRPO |
|---|---|---|
| Base algorithm | REINFORCE (no clipping) | PPO (with clipping) |
| Advantage | Leave-one-out mean | Group mean, z-score normalized |
| KL constraint | Explicit additive penalty (folded into the reward) | Explicit additive penalty (added to the loss in the original formulation) |
| Bias | Unbiased | Group normalization introduces slight bias |
RLOO's gradient is theoretically unbiased; GRPO's z-score normalization changes the effective gradient scale.
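The contrast fits in a few lines. A sketch, assuming `returns` is a `(B, k)` tensor of KL-penalized returns (the placeholder values here are for illustration only):

```python
import torch

B, k = 2, 4
returns = torch.randn(B, k)  # placeholder KL-penalized returns, one row per prompt

# RLOO: subtract the mean of the other k-1 returns (unbiased)
rloo_adv = returns - (returns.sum(-1, keepdim=True) - returns) / (k - 1)

# GRPO: z-score within the group; dividing by the std rescales the gradient
grpo_adv = (returns - returns.mean(-1, keepdim=True)) / (returns.std(-1, keepdim=True) + 1e-6)
```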
## 4. Implementation
```python
import torch

def rloo_loss(policy, ref_policy, reward_model, batch, k=4, beta=0.05):
    """RLOO policy-gradient loss.

    batch: dict with 'prompts' (B,) and 'responses' (B, k) --
    k responses already sampled per prompt.
    """
    prompts = batch['prompts']      # (B,)
    responses = batch['responses']  # (B, k)
    B = len(prompts)

    rewards = torch.zeros(B, k)
    logp = torch.zeros(B, k)
    logp_ref = torch.zeros(B, k)
    for i in range(k):
        with torch.no_grad():  # reward model and reference policy are frozen
            rewards[:, i] = reward_model(prompts, responses[:, i])
            logp_ref[:, i] = ref_policy.log_prob(prompts, responses[:, i])
        logp[:, i] = policy.log_prob(prompts, responses[:, i])

    # KL-penalized returns: (B, k). The KL term is part of the reward
    # signal, so it is detached rather than differentiated through.
    returns = rewards - beta * (logp - logp_ref).detach()

    # Leave-one-out baseline for sample i: mean of the other k-1 returns
    total = returns.sum(dim=1, keepdim=True)    # (B, 1)
    loo_baseline = (total - returns) / (k - 1)  # (B, k)
    advantages = (returns - loo_baseline).detach()

    # REINFORCE loss: log-probs weighted by leave-one-out advantages
    loss = -(logp * advantages).mean()
    return loss
```
Key details:
- `logp_ref` must be computed under `torch.no_grad()`: the reference policy is frozen
- `advantages` must be detached before multiplying by `logp`: the baseline must not receive gradients
- Use gradient clipping (max norm 1.0) for stability, as in the step sketch below
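Putting it together, a hypothetical training step (the `optimizer` and batch-sampling setup are assumptions, not from the paper):

```python
# Hypothetical single training step around rloo_loss
optimizer.zero_grad()
loss = rloo_loss(policy, ref_policy, reward_model, batch, k=4, beta=0.05)
loss.backward()
# Gradient clipping is the main stability safeguard in place of PPO's clip ratio
torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
optimizer.step()
```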
## 5. Why RLOO Instead of PPO?
Ahmadian et al. (2024) argue that most of PPO's complexity was designed for RL from scratch, which does not apply to LLM fine-tuning:
PPO's value function is unnecessary: A pre-trained LLM already generates high-quality text. The variance of single-sample REINFORCE is low enough that the bias introduced by a learned value function outweighs the variance reduction benefit.
Empirical results (Ahmadian et al., 2024):
- RLOO outperforms PPO on helpfulness and harmlessness benchmarks
- RLOO uses 50–70% less vRAM than PPO
- RLOO runs 2× faster (1B models) to 3× faster (6.9B models) than PPO
- RLOO outperforms DPO and RAFT on all tested datasets
## 6. Comparison with Other Methods
| Aspect | RLOO | PPO | GRPO | DPO |
|---|---|---|---|---|
| Critic needed | No | Yes | No | No |
| Samples per prompt | k | 1 | k (G) | N/A (offline) |
| Baseline | Leave-one-out | Value function | Group z-score | Implicit |
| Unbiased gradient | Yes | No | No | N/A |
| Memory vs PPO | −50–70% | baseline | −50% | −50% |
| Speed vs PPO | 2–3× faster | baseline | ~2× faster | ~2× faster |
| Clipping | None | Yes | Yes | N/A |
| Best for | General RLHF | Stable RL | Verifiable rewards | Preference data |
## 7. Limitations
No clipping: Without a trust-region constraint, RLOO can take large gradient steps on high-variance batches. Gradient clipping is the primary safeguard, but it is less principled than PPO's clip ratio.
Memory scales with k: Storing \(k\) responses and their log-probabilities per prompt increases memory linearly with \(k\). Typical values of \(k = 4\)–\(8\) remain manageable.
Online sampling required: Like all on-policy methods, RLOO must generate new samples at each training step. This is more expensive than DPO, which trains on a static offline dataset.
No token-level credit assignment: The same advantage is applied to every token in a sequence, so individual tokens receive no differentiated credit. This is especially coarse for long chain-of-thought responses. See DAPO for a token-level loss.