
PPO

1. Overview

Proximal Policy Optimization (PPO) is a reinforcement learning algorithm widely used in fine-tuning Large Language Models (LLMs) under the Reinforcement Learning from Human Feedback (RLHF) framework. It helps bridge the gap between human preferences and LLM outputs by optimizing the model's responses to align with what humans find helpful, safe, or relevant.

Key Insight: PPO enables LLMs to learn from scalar rewards (derived from human preferences) while maintaining training stability through controlled policy updates.



2. RLHF Pipeline

RLHF typically consists of three stages:

Stage 1: Supervised Fine-Tuning (SFT)

  • Train a base LLM on high-quality human demonstration data (prompt–response pairs)
  • Creates a model that can follow instructions but may not align perfectly with preferences
  • Output: SFT model that serves as the initialization for PPO

Stage 2: Reward Model (RM) Training

  • Collect human preference data: show pairs of responses and ask humans which is better
  • Train a model to assign scalar rewards to outputs based on human preferences (commonly via a pairwise ranking loss; see the sketch below)
  • The RM learns to predict which responses humans would prefer
  • Output: Reward model that can score any model output
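
In InstructGPT-style pipelines, the RM is commonly trained with a pairwise Bradley-Terry ranking loss that pushes the score of the preferred response above the rejected one. A minimal sketch, where r_chosen and r_rejected are the RM's scalar scores for a preference pair:

import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    # maximize P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()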

Stage 3: Reinforcement Learning (PPO)

  • Fine-tune the policy (SFT model) to maximize predicted rewards from the RM
  • Use PPO to balance reward maximization with maintaining similarity to the original model
  • Output: Aligned LLM that generates preferred responses

💡 Intuition: PPO teaches the LLM to generate preferred responses indirectly, using the reward model as scalable feedback instead of requiring human labels for every output.



3. Why PPO Instead of Direct Human Feedback?

Direct human labeling for all outputs is impractical and noisy. PPO helps by:

  • Scaling feedback: Reward models generalize human preferences to unseen outputs
  • Credit assignment: Uses value function and advantage to propagate sequence-level rewards to tokens
  • Stable updates: Ensures the model does not deviate too far from its original behavior (preventing mode collapse)
  • Efficient optimization: Can generate multiple trajectories and learn from them without constant human annotation


4. PPO Key Concepts

4.1 Components

| Component | Description | Role in Training |
| --- | --- | --- |
| Policy Model (π_θ) | The trainable LLM generating responses | Being optimized to maximize rewards |
| Reward Model (R_ϕ) | Evaluates outputs, providing scalar rewards | Provides the learning signal |
| Reference Model (π_ref) | Frozen copy of the initial SFT model | Anchors the KL penalty, preventing excessive drift |
| Value Function (V_θ) | Estimates the expected reward for a given prompt | Reduces variance in advantage estimation |
| Advantage (A_t) | How much better an action is than expected: A = R - V_θ(s) | Guides the direction and magnitude of updates |

4.2 Intuition

PPO adjusts the LLM to improve rewards without drastic changes:

  • Generates outputs → reward model evaluates → advantage guides update → policy improves
  • The clipped objective prevents extreme updates and maintains stability
  • The KL penalty keeps the model close to the reference policy to prevent reward hacking


5. PPO Objective Function

The Proximal Policy Optimization (PPO) algorithm optimizes a policy model π_θ while constraining how much it can diverge from the old policy π_θ_old, the snapshot of the policy that generated the current batch of samples.

5.1. Probability Ratio

\[ r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \]

The ratio measures how much the new policy's likelihood of an action has changed compared to the old policy.

Interpretation:

  • \(r_t > 1\): New policy assigns higher probability to this action
  • \(r_t < 1\): New policy assigns lower probability to this action
  • \(r_t \approx 1\): Policies are similar for this action

This ratio quantifies the magnitude and direction of policy change for each sampled token or action.
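
In implementations, the ratio is computed in log space for numerical stability. A minimal sketch, assuming per-token log-probability tensors logprobs_new and logprobs_old:

import torch

ratio = torch.exp(logprobs_new - logprobs_old)  # elementwise r_t(theta)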


5.2. Clipped PPO Objective

The clipped surrogate loss ensures stable updates by removing the incentive for large deviations in \(r_t(\theta)\):

\[ L^{PPO}(\theta) = \mathbb{E}_t \left[\min\left(r_t(\theta) A_t,\ \text{clip}(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon)\ A_t\right)\right] \]

Where:

  • A_t: Advantage function — how much better an action is than expected
  • ε: Clipping threshold (typically 0.1–0.2)
  • The min operation limits large, destabilizing updates

Why Clipping Works (see the sketch after this list):

  • If A_t > 0 (good action): the update increases the action's probability, but the gain is capped once \(r_t\) exceeds (1+ε)
  • If A_t < 0 (bad action): the update decreases the action's probability, but the gain is capped once \(r_t\) falls below (1-ε)
  • Prevents the policy from changing too dramatically in a single update
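
A minimal sketch of the clipped surrogate as a standalone function (it is an objective to be maximized, so implementations usually negate it as a loss):

import torch

def clipped_surrogate(ratio, advantages, eps=0.2):
    # min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t), averaged over tokens
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()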

Why You Need Both Clipping and a KL Penalty

In the original PPO paper by Schulman et al. (2017), two variants were proposed as alternatives to each other, not as mechanisms meant to be used together:

  • Variant 1: PPO-Clip: Uses the clipped surrogate objective. The probability ratio r(θ) = π_θ / π_old is clipped to [1−ε, 1+ε], which prevents the optimization step from making too large a policy update. This is the version most people think of as "PPO."
  • Variant 2: PPO-Penalty: Instead of clipping, it adds a KL divergence penalty term directly to the objective, with an adaptive coefficient that increases if KL gets too large and decreases if it's too small. This acts as a soft constraint on how far the new policy can drift.

In the RLHF/LLM fine-tuning context (like InstructGPT / ChatGPT training), practitioners often use PPO-Clip plus an additional KL penalty against a frozen reference model. This is because they're solving a somewhat different problem than standard RL:

  • Clipping prevents the policy from changing too much within a single update step relative to the previous iteration (π_old). It's a local, per-step safeguard.
  • KL penalty against the reference prevents the policy from drifting too far cumulatively over the entire training run from the original pretrained model. It's a global safeguard.

Clipping alone doesn't prevent the model from slowly drifting very far from the original pretrained model over many small, clipped steps — each step is small, but they compound. The KL term against the frozen reference catches this cumulative drift, which matters because you want the model to retain its general language capabilities, not collapse into some degenerate policy that only maximizes the reward model.



6. Value Function, Advantage, and Reward Computation

The PPO algorithm relies on several auxiliary components that ensure stable and meaningful policy updates.

6.1. Cumulative Reward (Return)

The cumulative reward (or return) represents the total discounted reward starting from time t:

\[ R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} \]
  • \(r_t\): reward received at time t (from the reward model in RLHF)
  • \(γ\): discount factor (typically 0.95–0.99)

Reward Simplification in RLHF:

In language model fine-tuning, the setup is simplified:

  • A prompt acts as the state s
  • The model's response (a sequence of tokens) is treated as the action a
  • A reward model (RM) assigns a single scalar reward \(r(s, a)\) for the entire sequence

Therefore: \(R = r(s, a)\)

This eliminates the need to sum discounted rewards across timesteps, simplifying PPO training.
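
For intuition, the general discounted return can be computed with a single backward scan over a reward sequence; in the RLHF special case the sum collapses to the single terminal reward. A minimal sketch:

def discounted_returns(rewards, gamma=0.99):
    # computes R_t for every timestep, scanning backwards
    R, returns = 0.0, []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    return returns[::-1]

With one sequence-level reward, rewards = [0, 0, ..., r(s, a)], so each R_t is simply a discounted copy of the final reward.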


6.2. Value Function

The value function estimates the expected return given a state (or prompt context):

\[ V_\theta(s_t) \approx \mathbb{E}[R_t \mid s_t] \]

The value loss penalizes inaccurate predictions:

\[ L^{value}(\theta) = \frac{1}{2} \left(V_\theta(s_t) - R_t\right)^2 \]

Implementation Details:

In practice, the value function is implemented as a learned neural network head attached to the policy model.

During training:

  1. The reward model provides rewards \(r_t\) for each sequence
  2. The cumulative discounted reward \(R_t\) is computed
  3. The value head learns to predict \(V_θ(s_t)\) to match the observed return \(R_t\)

There are two common approaches:

  • Monte Carlo estimate: directly use full episode returns \(R_t\) (common in RLHF)
  • Bootstrapped estimate: use \(r_t + γ V_θ(s_{t+1})\) to reduce variance

The value function serves as a baseline for computing the advantage.
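
A minimal sketch of such a head, assuming access to the transformer's final hidden states (hidden_size is the model dimension):

import torch.nn as nn

class ValueHead(nn.Module):
    # maps each hidden state to a scalar value estimate
    def __init__(self, hidden_size):
        super().__init__()
        self.v = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states):              # (batch, seq_len, hidden_size)
        return self.v(hidden_states).squeeze(-1)   # (batch, seq_len)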


6.3. Advantage Function

The advantage quantifies how much better an action \(a_t\) was compared to the expected baseline:

\[ A_t = R_t - V_\theta(s_t) \]

In practice, PPO often uses Generalized Advantage Estimation (GAE) for smoother and lower-variance estimates:

\[ A_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l} \]

where:

  • \(δ_t = r_t + γ V_θ(s_{t+1}) - V_θ(s_t)\)
  • \(λ\) is the GAE smoothing factor (typically 0.9–0.97)

Advantage in Practice for LLMs:

In LLM fine-tuning with PPO, the advantage is typically computed at the sequence level:

  1. For each prompt \(s\), the model generates a sequence \(a = (a_1, a_2, ..., a_T)\)
  2. The reward model provides a scalar reward \(r(s, a)\) for the whole sequence
  3. The value head predicts \(V_θ(s)\), estimating the expected reward before generation
  4. The advantage is computed as: \(A = r(s, a) - V_θ(s)\)

When Token-Level Advantages Are Used:

Some implementations compute token-level advantages to better attribute credit:

  • Assign the same scalar reward to all tokens in a sequence
  • Use GAE to smooth the signal: \(A_t = GAE(r_t, V_θ(s_t))\)
  • Provides more stable gradients and finer control during backpropagation

Summary:

  • Sequence-level PPO: \(A = r(s, a) - V_θ(s)\) → simpler, effective for sparse rewards
  • Token-level PPO: Uses GAE to propagate reward information across tokens (see the sketch below)
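
A minimal sketch of GAE for one trajectory, assuming 1-D tensors of per-token rewards and value predictions:

import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    # backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

In the RLHF setup described above, rewards would be all zeros except for the sequence-level score placed at the final token.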

6.4. Entropy Bonus (Exploration Term)

The entropy loss encourages the policy to explore rather than prematurely converge:

\[ H[\pi_\theta] = - \sum_a \pi_\theta(a|s_t) \log \pi_\theta(a|s_t) \]

Higher entropy = more exploration and diversity in generated responses.

Why Entropy Matters:

  • Prevents the model from becoming too deterministic
  • Maintains diversity in outputs
  • Helps avoid mode collapse where the model only generates a few "safe" responses
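
A minimal sketch computing the mean per-token entropy from the policy's logits:

import torch

def mean_token_entropy(logits):
    # entropy of the next-token distribution, averaged over positions
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean()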

6.5. Combined PPO Loss

The full training objective combines all three components:

\[ L_{total}(\theta) = -L^{PPO}(\theta) + c_1 \cdot L^{value}(\theta) - c_2 \cdot H[\pi_\theta] \]

Where:

  • \(H[π_θ]\): entropy term promoting exploration
  • \(c_1\): value loss coefficient (typically 0.5–1.0)
  • \(c_2\): entropy coefficient (typically 0.01–0.1)

Additional: KL Penalty Term

In practice, many implementations add a KL divergence penalty to prevent the policy from drifting too far from the reference model:

\[ L_{total}(\theta) = -L^{PPO}(\theta) + c_1 \cdot L^{value}(\theta) - c_2 \cdot H[\pi_\theta] + c_3 \cdot D_{KL}(\pi_\theta || \pi_{ref}) \]

Where:

  • \(c_3\): KL penalty coefficient (adaptive or fixed, typically 0.01–0.1)
  • \(D_{KL}\): KL divergence between current and reference policy


7. Iterative PPO Update Flow

The training loop follows these steps:

  1. Generate responses with the current policy and record their log probabilities (these become the "old" policy log probabilities)
  2. Compute rewards using the reward model
  3. Compute log probabilities under the frozen reference model (for the KL penalty)
  4. Estimate values using the value head
  5. Compute advantages (A = R - V)
  6. Compute the probability ratio (r_t = π_new / π_old)
  7. Update the policy using the clipped surrogate loss
  8. Update the value function to better predict returns
  9. Apply the entropy bonus to maintain exploration
  10. Apply the KL penalty against the reference model to prevent cumulative drift
  11. Repeat: each new round of generation implicitly refreshes π_old, while the reference model typically stays frozen for the entire run

Intuition: PPO increases the probability of actions that were better than expected, but only within a controlled trust region, ensuring stable learning.



8. Implementation Example (Pseudocode)

# Training loop
for epoch in range(num_epochs):
    for batch in dataloader:
        prompts = batch['prompts']

        # 1. Generate responses with the current policy
        responses = policy_model.generate(prompts)

        # 2. Compute reward from the reward model (sequence-level)
        rewards = reward_model(prompts, responses)

        # 3. Log probabilities at sampling time: the "old" policy
        #    probabilities used in the PPO ratio (no gradients), plus
        #    the frozen reference model's for the KL penalty
        with torch.no_grad():
            logprobs_old = policy_model.logprobs(prompts, responses)
            logprobs_ref = ref_model.logprobs(prompts, responses)

        # 4. Value estimates at sampling time serve as the baseline
        with torch.no_grad():
            values_old = value_head(prompts)  # V_theta(s)

        # 5. Compute advantages (sequence-level)
        advantages = rewards - values_old
        # Normalize advantages for stability
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        # Mini-batch updates (multiple optimization epochs on the same samples)
        for _ in range(ppo_epochs):
            # 6. Recompute log probs and values under the *current* parameters
            #    (these carry gradients; logprobs_old does not)
            logprobs_policy = policy_model.logprobs(prompts, responses)
            values = value_head(prompts)

            # 7. Probability ratio and clipped surrogate loss
            ratio = torch.exp(logprobs_policy - logprobs_old)
            clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
            policy_loss = -torch.mean(
                torch.min(ratio * advantages, clipped_ratio * advantages)
            )

            # 8. Value loss against the observed returns
            value_loss = 0.5 * torch.mean((values - rewards) ** 2)

            # 9. Entropy bonus (sample-based estimate from the generated
            #    tokens; a full implementation sums over the vocabulary)
            entropy = -torch.mean(logprobs_policy)

            # 10. KL penalty against the frozen reference model
            #     (sample-based estimate of D_KL(pi_theta || pi_ref))
            kl_div = torch.mean(logprobs_policy - logprobs_ref)

            # 11. Combine losses
            total_loss = (
                policy_loss +
                c1 * value_loss -
                c2 * entropy +
                c3 * kl_div
            )

            # 12. Backpropagate and update
            optimizer.zero_grad()
            total_loss.backward()
            torch.nn.utils.clip_grad_norm_(policy_model.parameters(), max_grad_norm)
            optimizer.step()

    # 13. The "old" policy is refreshed implicitly at the next generation
    #     step; the KL reference model typically stays frozen for the
    #     whole run, though some implementations sync it occasionally:
    if (epoch + 1) % update_ref_interval == 0:
        ref_model.load_state_dict(policy_model.state_dict())


9. Limitations and Challenges of PPO in LLM Training

🧩 1. KL Divergence Sensitivity

PPO adds a KL penalty to prevent the model from drifting too far:

\[ L = L^{PPO} - \beta D_{KL}(\pi_{\theta} || \pi_{ref}) \]

Challenges:

  • Too small \(β\): model diverges, may collapse to degenerate solutions
  • Too large \(β\): very slow learning, model stays too close to initialization
  • Solution: Adaptive KL control adjusts \(β\) based on the observed KL divergence (a controller sketch follows)
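
A minimal sketch of such a controller, loosely following the adaptive scheme from the PPO paper (hyperparameter names and defaults are illustrative):

class AdaptiveKLController:
    # nudges beta so the observed KL tracks a target value
    def __init__(self, init_beta=0.1, target_kl=6.0, horizon=10000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl, n_steps):
        # proportional error, clipped to avoid violent swings in beta
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        self.beta *= 1.0 + 0.1 * error * n_steps / self.horizon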

⏳ 2. High Training Cost

Computational Requirements:

  • Multiple models in memory: policy, reference, reward model, value head
  • Fine-tuning large LLMs can require thousands of GPU-hours
  • Need to generate samples, compute rewards, and train simultaneously
  • Typically requires distributed training across many GPUs

Memory Challenges:

  • Reference model is often a frozen copy of the policy
  • Reward model may be as large as the policy model
  • Requires efficient batching and gradient accumulation

⚠️ 3. Reward Hacking

The Problem:

  • LLM may over-optimize for the reward model instead of true human preferences
  • Exploits weaknesses or biases in the reward model
  • Can result in responses that "game" the reward model

Common Examples:

  • Overly verbose or repetitive responses (if length correlates with reward)
  • Excessive politeness or flattery
  • Technically correct but misleading or unhelpful responses
  • Responses that avoid controversial topics even when appropriate

Mitigations:

  • Regularization through KL penalty
  • Diverse and robust reward model training
  • Iterative improvement of reward models
  • Human evaluation of final outputs

🧮 4. Sparse or Noisy Rewards

Sparse Rewards:

  • One reward per sequence makes credit assignment harder
  • Difficult to determine which tokens contributed to high/low reward
  • Increases variance in gradient estimates

Noisy Rewards:

  • Subjective or inconsistent human preferences
  • Reward model uncertainty
  • Can lead to unstable updates and poor convergence

Solutions:

  • Token-level advantage estimation (GAE)
  • Larger batch sizes to reduce variance
  • Reward model ensembles
  • Value function as a learned baseline

🔁 5. Credit Assignment Problem

Challenge:

  • Per-token updates but per-sequence rewards create ambiguity
  • Which specific tokens led to high/low rewards?
  • Early tokens affect later generation but get same reward signal

Approaches:

  • GAE for token-level credit assignment
  • Shaped rewards (e.g., intermediate rewards for partial sequences)
  • Curriculum learning (start with simpler tasks)

⚖️ 6. Exploration vs Alignment Trade-off

The Dilemma:

  • Encouraging exploration may generate unsafe or off-distribution outputs
  • Too little exploration leads to mode collapse
  • Need to balance diversity with safety and alignment

Mitigations:

  • Carefully tuned entropy coefficient
  • Safety constraints in reward model
  • Filtered sampling (reject unsafe outputs before training)

🔍 7. Implementation Complexity

Technical Challenges:

  • Multiple models with different update schedules
  • Careful hyperparameter tuning (ε, c_1, c_2, c_3, learning rate)
  • Numerical stability (log probabilities, ratio clipping)
  • Can be unstable if any component is suboptimal

Engineering Challenges:

  • Distributed training coordination
  • Efficient sampling and reward computation
  • Memory management for large models
  • Reproducibility across runs

🎯 8. Reward Model Quality Bottleneck

Issue:

  • PPO is only as good as the reward model
  • Garbage in, garbage out: poor reward model → poor aligned model
  • Reward model may not capture all aspects of human preference

Implications:

  • Need high-quality preference data for reward model training
  • Reward model must generalize beyond its training distribution
  • Continuous iteration on reward model alongside policy training

📊 9. Distribution Shift

Problem:

  • As the policy improves, it generates outputs different from the initial SFT model
  • Reward model may not generalize to these new outputs (out-of-distribution)
  • Can lead to reward model exploits or failures

Solutions:

  • Online reward model updates with new samples
  • Conservative updates (small ε, high KL penalty)
  • Iterative data collection and reward model retraining

10. Alternative Approaches and Recent Developments

Direct Preference Optimization (DPO)

  • Eliminates the separate reward model and PPO training
  • Directly optimizes the policy from preference data (its loss is shown below)
  • Simpler and more stable than PPO
  • Lower computational cost
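
For reference, the DPO loss optimizes the policy directly on preference pairs \((x, y_w, y_l)\) (preferred and rejected responses), reusing the frozen reference model as an implicit KL anchor:

\[ L_{DPO}(\theta) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \right) \right] \]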

RLAIF (RL from AI Feedback)

  • Uses AI model instead of humans to provide feedback
  • More scalable but potentially less aligned with human values
  • Can be combined with human feedback

Constitutional AI

  • Uses principles and critiques to guide behavior
  • Can reduce need for extensive human preference data
  • Complementary to RLHF/PPO


11. Best Practices for PPO in LLM Training

Hyperparameter Tuning

  • Start with conservative values (small ε, learning rate)
  • Use learning rate warmup (gradually increase from 0)
  • Monitor KL divergence and adjust β adaptively
  • Normalize advantages for stable training

Data Quality

  • Ensure diverse, high-quality prompts
  • Balance prompt distribution across topics
  • Regularly update preference data
  • Filter out low-quality or adversarial examples

Monitoring and Debugging

  • Track multiple metrics: reward, KL, entropy, value loss
  • Log sample generations at regular intervals
  • Monitor for reward hacking patterns
  • Use tensorboard or wandb for visualization

Computational Efficiency

  • Use gradient checkpointing for memory
  • Mixed precision training (FP16/BF16)
  • Distributed training across GPUs
  • Batch prompts of similar lengths together

Safety and Alignment

  • Regular human evaluation
  • Red-team testing throughout training
  • Maintain capability benchmarks
  • Implement safety filters and guardrails