REINFORCE
1. Overview¶
REINFORCE is the foundational policy gradient algorithm (Williams, 1992) and the conceptual basis for all modern RL-based LLM fine-tuning. It directly optimizes the expected reward by following the gradient of the log-probability of generated sequences, weighted by their reward.
While PPO and GRPO dominate current RLHF pipelines, recent work (Ahmadian et al., 2024) showed that REINFORCE and its multi-sample variant RLOO consistently outperform PPO when starting from a strong pre-trained LLM — challenging the assumption that PPO's complexity is necessary.
2. The Big Picture: REINFORCE in RLHF¶
| Stage | Role |
|---|---|
| 1. SFT | Initialize policy π_θ from supervised fine-tuning |
| 2. Reward model | Score completions: r(x, y) |
| 3. REINFORCE | Update π_θ to maximize E[r(x, y)] − β·KL |
No critic/value network is trained. The reward signal flows directly back to the policy through the log-probability gradient.
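At a high level, stage 3 might look like the sketch below; sft_model, reward_model, prompt_loader, and update_policy are placeholders standing in for the components produced in stages 1 and 2 and for the REINFORCE update defined in section 5:

import copy

policy = copy.deepcopy(sft_model)   # pi_theta, initialized from the SFT checkpoint
ref_policy = sft_model              # frozen reference used only for the KL penalty

for prompts in prompt_loader:
    responses = policy.generate(prompts)         # sample y ~ pi_theta(.|x)
    rewards = reward_model(prompts, responses)   # r(x, y) from the reward model
    # REINFORCE step toward E[r(x, y)] - beta*KL; no critic/value network anywhere
    update_policy(policy, ref_policy, prompts, responses, rewards)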
3. Formulation¶
3.1 Policy Gradient Theorem¶
For a language model generating sequence \(y = (y_1, \ldots, y_T)\) given prompt \(x\), the policy gradient theorem gives

\[
\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ \nabla_\theta \log \pi_\theta(y \mid x)\, R(x, y) \right],
\qquad
\log \pi_\theta(y \mid x) = \sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t}),
\]

where \(R(x, y)\) is the total return — typically the reward model score minus a KL penalty against the frozen reference (SFT) policy:

\[
R(x, y) = r(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}.
\]
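To make the estimator concrete, here is a minimal toy sketch in PyTorch (not the full LLM setup): a single sampled sequence from a small categorical policy is weighted by a placeholder return, and the gradient of the surrogate loss is exactly \(-\nabla_\theta \log \pi_\theta(y \mid x)\, R(x, y)\).

import torch

# Toy stand-in for the policy: T = 5 independent steps over a 10-token vocabulary.
# In a real LLM, `logits` would come from pi_theta conditioned on x and y_<t.
torch.manual_seed(0)
logits = torch.randn(5, 10, requires_grad=True)              # (T, vocab)
log_probs = torch.log_softmax(logits, dim=-1)

# Sample a sequence y ~ pi_theta(.|x) and attach a placeholder return R(x, y).
tokens = torch.distributions.Categorical(logits=logits).sample()   # (T,)
R = torch.tensor(1.7)                                        # placeholder reward

# log pi_theta(y|x) = sum_t log pi_theta(y_t | x, y_<t)
seq_logp = log_probs[torch.arange(5), tokens].sum()

# Surrogate loss whose gradient is -grad log pi_theta(y|x) * R(x, y):
# a single-sample REINFORCE estimate of -grad J(theta).
loss = -seq_logp * R
loss.backward()
print(logits.grad.shape)   # torch.Size([5, 10])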
3.2 Variance Reduction with a Baseline¶
The gradient remains unbiased when any constant baseline \(b\) (one that does not depend on the sampled sequence) is subtracted from the return:

\[
\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ \nabla_\theta \log \pi_\theta(y \mid x)\, \big(R(x, y) - b\big) \right].
\]

A well-chosen baseline can reduce the variance of the estimator substantially.
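A one-line check of why the baseline leaves the gradient unbiased (it only uses the fact that \(\pi_\theta(\cdot \mid x)\) sums to one):

\[
\mathbb{E}_{y \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(y \mid x)\, b \right]
= b \sum_{y} \pi_\theta(y \mid x)\, \nabla_\theta \log \pi_\theta(y \mid x)
= b \sum_{y} \nabla_\theta \pi_\theta(y \mid x)
= b\, \nabla_\theta \sum_{y} \pi_\theta(y \mid x)
= b\, \nabla_\theta 1 = 0.
\]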
Common baseline choices:
| Baseline | Computation | Notes |
|---|---|---|
| Zero (no baseline) | None | Highest variance |
| Running mean | Mean over recent rewards | Simple, slight bias |
| Per-prompt mean | Mean over multiple samples from the same prompt | Computed on the fly; unbiased with a leave-one-out mean (see RLOO); sketched below |
| Value function (PPO) | Trained critic network | Lowest variance, adds bias and complexity |
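For the per-prompt option, a minimal sketch (assuming K completions are sampled per prompt and their returns are arranged in a (num_prompts, K) tensor; the function name here is illustrative):

import torch

def per_prompt_advantages(rewards: torch.Tensor, leave_one_out: bool = True) -> torch.Tensor:
    """rewards: (num_prompts, K) returns for K samples of each prompt."""
    K = rewards.shape[1]
    if leave_one_out:
        # Leave-one-out baseline: mean of the other K-1 samples (unbiased, RLOO-style).
        baseline = (rewards.sum(dim=1, keepdim=True) - rewards) / (K - 1)
    else:
        # Plain per-prompt mean (includes the sample itself; slightly biased).
        baseline = rewards.mean(dim=1, keepdim=True)
    return rewards - baseline

# Example: 2 prompts, 4 samples each.
rewards = torch.tensor([[1.0, 0.5, 0.0, 2.5],
                        [0.2, 0.4, 0.1, 0.3]])
print(per_prompt_advantages(rewards))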
3.3 Token-Level vs Sequence-Level¶
Sequence-level: one gradient signal per sequence — every token's log-probability is weighted by the same return \(R(x, y)\).
Token-level (per-step): each token receives its own credit-assignment signal. Used in DAPO; important for long chain-of-thought responses where early tokens are far from the final reward. A sketch of both variants follows below.
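A minimal sketch of the two loss variants, assuming per-token log-probabilities logp of shape (batch, T), a padding mask, and pre-computed advantages; in the token-level call below the per-sequence advantage is simply broadcast to every token, whereas a DAPO-style setup would supply genuinely per-token credit:

import torch

def sequence_level_loss(logp, advantages, mask):
    """logp: (B, T) per-token log-probs, advantages: (B,), mask: (B, T), 1 for real tokens."""
    seq_logp = (logp * mask).sum(dim=1)          # log pi_theta(y|x) per sequence
    return -(seq_logp * advantages).mean()       # one weight per sequence

def token_level_loss(logp, token_advantages, mask):
    """token_advantages: (B, T) per-token credit, e.g. the sequence advantage
    broadcast to every token, or a finer-grained per-step signal."""
    per_token = -(logp * token_advantages * mask)
    return per_token.sum() / mask.sum()          # average over all real tokens

# Example shapes only; values are placeholders.
B, T = 2, 4
logp = torch.randn(B, T)
mask = torch.ones(B, T)
adv = torch.tensor([0.5, -0.3])
print(sequence_level_loss(logp, adv, mask))
print(token_level_loss(logp, adv[:, None].expand(B, T), mask))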
4. Why REINFORCE Works Well for LLMs¶
PPO was designed for RL from scratch, where the policy starts out random, unstable, and highly non-stationary. Fine-tuning a pre-trained LLM violates these assumptions:
- Strong initialization: A pre-trained LLM already generates fluent, reasonable text — the policy doesn't need a critic to stay stable.
- Concentrated distributions: At each step, the LLM assigns high probability mass to a few tokens, so the variance from single-sample estimation is lower than in tabular RL.
- Short optimization horizon: RLHF fine-tunes for relatively few steps on top of SFT, so the complex value-function bootstrapping of PPO adds little benefit.
5. Implementation¶
def reinforce_loss(policy, ref_policy, reward_model, batch, beta=0.05):
    prompts, responses = batch['prompts'], batch['responses']
    # Reward model scores, one scalar per (prompt, response) pair
    rewards = reward_model(prompts, responses)
    # KL penalty against the frozen reference policy
    logp = policy.log_prob(prompts, responses)
    logp_ref = ref_policy.log_prob(prompts, responses).detach()
    kl = logp - logp_ref
    # Total return: reward minus KL penalty
    returns = rewards - beta * kl
    # Baseline: mean return over the batch, detached so it carries no gradient
    baseline = returns.mean().detach()
    # Advantages are fixed weights on logp (score-function estimator), so detach them as well
    advantages = (returns - baseline).detach()
    # Policy gradient loss
    loss = -(logp * advantages).mean()
    return loss
Key implementation details:
- The reference policy is frozen — no gradient flows through `logp_ref`
- Gradient clipping (max norm 1.0) is essential to prevent instability
- Using `detach()` on the baseline and on the advantages is mandatory — they are fixed weights and must not affect the gradient
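One way a training step could wire this together (a hedged sketch; policy, ref_policy, reward_model, optimizer, and batch are placeholders with the interfaces used above):

import torch

def train_step(policy, ref_policy, reward_model, optimizer, batch, beta=0.05):
    loss = reinforce_loss(policy, ref_policy, reward_model, batch, beta=beta)
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping (max norm 1.0) guards against destabilizing updates
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()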
6. REINFORCE vs PPO¶
| Aspect | REINFORCE | PPO |
|---|---|---|
| Critic required | No | Yes |
| Memory | ~1× model size | ~2× (policy + critic) |
| Variance | Higher (baseline helps) | Lower (value estimates) |
| Stability | Good with LLM initialization | Sensitive to critic quality |
| Implementation | Simple | Complex |
| Compute | Lower (no critic forward/backward passes) | Higher (critic trained alongside the policy) |
| Bias | Unbiased gradient | Biased (value function) |
7. Limitations¶
High variance: Without a learned value function, gradient estimates can be noisy — especially on tasks where reward varies widely across samples. Mitigated by baselines and multi-sample averaging (see RLOO).
No clipping: Vanilla REINFORCE has no trust-region constraint. A single unlucky batch can cause large, destabilizing updates. Mitigated by gradient clipping and small learning rates.
Sequence-level credit assignment: The same gradient signal is applied to every token in the sequence, regardless of which tokens actually contributed to the reward. Mitigated by token-level loss (see DAPO).