
RLOO

1. Overview

REINFORCE Leave-One-Out (RLOO) is a multi-sample variant of REINFORCE that uses the mean reward of the other samples in a group as the baseline for each individual sample. It was applied to LLM fine-tuning by Ahmadian et al. (2024) in "Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs" (ACL 2024).

Key result: RLOO consistently outperforms PPO, DPO, and RAFT across all tested models and datasets while using 50–70% less GPU memory and running 2–3× faster than PPO.


2. The Big Picture: From REINFORCE to RLOO

Vanilla REINFORCE estimates the gradient from a single sample per prompt, leading to high variance. The solution: sample \(k\) responses per prompt and use the other \(k-1\) responses as a baseline.

| Method | Samples per prompt | Baseline | Variance | Critic needed |
|---|---|---|---|---|
| REINFORCE | 1 | Running mean (batched) | High | No |
| RLOO | k | Leave-one-out mean | Low | No |
| PPO | 1 | Trained value function | Lowest | Yes |
| GRPO | k | Group-normalized | Low | No |

3. Formulation

3.1 RLOO Gradient Estimator

For a prompt \(x\), sample \(k\) responses \(\{y_1, \ldots, y_k\}\) from the current policy. The gradient for the \(i\)-th sample uses the mean reward of the remaining \(k-1\) samples as baseline:

\[ \nabla_\theta J(\theta) = \frac{1}{k} \sum_{i=1}^k \nabla_\theta \log \pi_\theta(y_i | x) \cdot \underbrace{\left( R_i - \frac{1}{k-1} \sum_{j \neq i} R_j \right)}_{\text{leave-one-out advantage } \hat{A}_i} \]

where \(R_i = r_\phi(x, y_i) - \beta \log \frac{\pi_\theta(y_i|x)}{\pi_{\text{ref}}(y_i|x)}\) is the KL-penalized reward.
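As a quick worked example (with made-up numbers), take \(k = 4\) and KL-penalized rewards \(R = (1.0,\ 0.2,\ 0.6,\ 0.4)\) for one prompt. The leave-one-out advantage of the first sample is

\[ \hat{A}_1 = R_1 - \frac{R_2 + R_3 + R_4}{3} = 1.0 - \frac{0.2 + 0.6 + 0.4}{3} = 0.6, \]

and likewise \(\hat{A}_2 \approx -0.467\), \(\hat{A}_3 \approx 0.067\), \(\hat{A}_4 = -0.2\). Above-average responses get positive advantages, below-average ones get negative advantages, and the group's advantages always sum to zero.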

3.2 Why This Baseline Is Effective

The leave-one-out baseline is:

  • Unbiased: since \(y_i\) is drawn independently of \(\{y_j\}_{j \neq i}\), the baseline \(b_i = \frac{1}{k-1} \sum_{j \neq i} R_j\) is independent of the sample being scored, so subtracting it does not bias the gradient estimate (see the short derivation after this list).
  • Low variance: The baseline is computed on-the-fly for each sample at each step — no moving average lag or critic bias.
  • Free: No extra model or training required — just multiple forward passes.
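
The unbiasedness in the first bullet follows from the standard score-function identity: because \(b_i\) is independent of \(y_i\), the baseline term has zero expected gradient,

\[ \mathbb{E}_{y_i \sim \pi_\theta}\!\left[ b_i \, \nabla_\theta \log \pi_\theta(y_i|x) \right] = \mathbb{E}[b_i] \; \mathbb{E}_{y_i \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(y_i|x) \right] = \mathbb{E}[b_i] \cdot \nabla_\theta \sum_{y} \pi_\theta(y|x) = \mathbb{E}[b_i] \cdot 0 = 0. \]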

3.3 Relationship to GRPO

RLOO and GRPO both sample \(k\) responses per prompt and use group-based baselines. They differ mainly in how the advantage is normalized and whether the policy update is clipped:

| | RLOO | GRPO |
|---|---|---|
| Base algorithm | REINFORCE (no clipping) | PPO (with clipping) |
| Advantage | Leave-one-out mean | Group mean (z-score normalized) |
| KL constraint | Explicit additive penalty | Explicit additive penalty |
| Bias | Unbiased | Group normalization introduces slight bias |

RLOO's gradient estimator is unbiased; GRPO's z-score normalization (dividing by the group's reward standard deviation) changes the effective gradient scale and is the source of the slight bias noted above.
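
As a minimal sketch of that difference (assuming a returns tensor of shape (B, k); the function names are illustrative, not from either paper):

import torch

def loo_advantages(returns: torch.Tensor) -> torch.Tensor:
    # RLOO: baseline for each sample is the mean of the other k-1 returns in its group
    k = returns.size(1)
    baseline = (returns.sum(dim=1, keepdim=True) - returns) / (k - 1)
    return returns - baseline

def grpo_advantages(returns: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # GRPO-style: z-score within the group (mean-center, then divide by the group std)
    mean = returns.mean(dim=1, keepdim=True)
    std = returns.std(dim=1, keepdim=True)
    return (returns - mean) / (std + eps)

The division by the group standard deviation in the second function is exactly the rescaling referred to above.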


4. Implementation

import torch

def rloo_loss(policy, ref_policy, reward_model, batch, k=4, beta=0.05):
    """
    batch: dict with 'prompts' (B,) and 'responses' (B, k) —
           k responses already sampled per prompt.
    """
    prompts = batch['prompts']          # (B,)
    responses = batch['responses']      # (B, k)

    B = len(prompts)
    rewards = torch.zeros(B, k)
    logp = torch.zeros(B, k)
    logp_ref = torch.zeros(B, k)

    for i in range(k):
        # Current-policy log-prob of the full response: keeps gradients
        logp[:, i] = policy.log_prob(prompts, responses[:, i])
        # Reward model and frozen reference policy need no gradients
        with torch.no_grad():
            rewards[:, i] = reward_model(prompts, responses[:, i])
            logp_ref[:, i] = ref_policy.log_prob(prompts, responses[:, i])

    # KL-penalized returns: (B, k); the KL term is treated as part of the reward
    returns = rewards - beta * (logp - logp_ref).detach()

    # Leave-one-out baseline for sample i: mean of the other k-1 returns
    total = returns.sum(dim=1, keepdim=True)          # (B, 1)
    loo_baseline = (total - returns) / (k - 1)        # (B, k)

    # Advantages are constants w.r.t. the policy parameters
    advantages = (returns - loo_baseline).detach()

    # REINFORCE-style policy gradient loss
    loss = -(logp * advantages).mean()
    return loss

Key details:

  • logp_ref must be computed with torch.no_grad() — reference policy is frozen
  • advantages must be detached before multiplying by logp — baseline must not receive gradients
  • Use gradient clipping (max norm 1.0) for stability
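
A hedged sketch of how rloo_loss might sit inside a training step, with the gradient clipping mentioned above (the optimizer choice, learning rate, and dataloader object are illustrative assumptions):

import torch

optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)

for batch in dataloader:   # each batch carries prompts plus k freshly sampled responses
    loss = rloo_loss(policy, ref_policy, reward_model, batch, k=4, beta=0.05)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)  # stability safeguard
    optimizer.step()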

5. Why RLOO Instead of PPO?

Ahmadian et al. (2024) argue that most of PPO's complexity was designed for RL from scratch, which does not apply to LLM fine-tuning:

PPO's value function is unnecessary: a pre-trained LLM is already a strong initialization, so each full completion can be treated as a single action whose sampled reward is a usable learning signal. In this regime, the variance of single-sample REINFORCE is low enough that the bias introduced by a learned value function outweighs its variance-reduction benefit.

Empirical results (Ahmadian et al., 2024):

  • RLOO outperforms PPO on helpfulness and harmlessness benchmarks
  • RLOO uses 50–70% less vRAM than PPO
  • RLOO runs 2× faster (1B models) to 3× faster (6.9B models) than PPO
  • RLOO outperforms DPO and RAFT on all tested datasets

6. Comparison with Other Methods

| Aspect | RLOO | PPO | GRPO | DPO |
|---|---|---|---|---|
| Critic needed | No | Yes | No | No |
| Samples per prompt | k | 1 | k (G) | N/A (offline) |
| Baseline | Leave-one-out | Value function | Group z-score | Implicit |
| Unbiased gradient | Yes | No | No | N/A |
| Memory vs PPO | −50–70% | baseline | −50% | −50% |
| Speed vs PPO | 2–3× faster | baseline | ~2× faster | ~2× faster |
| Clipping | None | Yes | Yes | N/A |
| Best for | General RLHF | Stable RL | Verifiable rewards | Preference data |

7. Limitations

No clipping: Without a trust-region constraint, RLOO can take large gradient steps on high-variance batches. Gradient clipping is the primary safeguard, but it is less principled than PPO's clip ratio.

Memory scales with k: Storing \(k\) responses and their log-probabilities per prompt increases memory linearly with \(k\). Typical \(k = 4\)–8 is manageable.

Online sampling required: Like all on-policy methods, RLOO must generate new samples at each training step. This is more expensive than DPO, which trains on a static offline dataset.
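
For a sense of what that per-step sampling involves, here is a minimal sketch using a Hugging Face-style generate call (the helper name, generation settings, and the assumption that policy and tokenizer are a causal LM and its tokenizer are illustrative, not from the paper):

import torch

def sample_k_responses(policy, tokenizer, prompts, k=4, max_new_tokens=256):
    # Each training step re-samples k completions per prompt from the *current* policy
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = policy.generate(
            **inputs,
            do_sample=True,
            num_return_sequences=k,      # k samples per prompt
            max_new_tokens=max_new_tokens,
        )
    # outputs has shape (B * k, seq_len); decode and regroup per prompt downstream
    return outputs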

No token-level credit assignment: the same sequence-level advantage is applied to every token in a response, so tokens that did not contribute to the outcome receive the same credit as those that did. This is especially coarse for long chain-of-thought responses. See DAPO for a token-level loss.