
PPO

1. Overview

Proximal Policy Optimization (PPO) is a reinforcement learning algorithm widely used in fine-tuning Large Language Models (LLMs) under the Reinforcement Learning from Human Feedback (RLHF) framework. It helps bridge the gap between human preferences and LLM outputs by optimizing the model's responses to align with what humans find helpful, safe, or relevant.

Key Insight: PPO enables LLMs to learn from scalar rewards (derived from human preferences) while maintaining training stability through controlled policy updates.



2. RLHF Pipeline

RLHF typically consists of three stages:

Stage 1: Supervised Fine-Tuning (SFT)

  • Train a base LLM on high-quality human demonstration data (prompt–response pairs)
  • Creates a model that can follow instructions but may not align perfectly with preferences
  • Output: SFT model that serves as the initialization for PPO

Stage 2: Reward Model (RM) Training

  • Collect human preference data: show pairs of responses and ask humans which is better
  • Train a model to assign scalar rewards to outputs based on human preferences (commonly via a pairwise ranking loss; see the sketch below)
  • The RM learns to predict which responses humans would prefer
  • Output: Reward model that can score any model output
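
In InstructGPT-style pipelines, the RM is commonly trained with a pairwise Bradley-Terry ranking loss that pushes the score of the preferred response above the rejected one. A minimal sketch, where r_chosen and r_rejected are the RM's scalar scores for a preference pair:

import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    # maximize P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()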

Stage 3: Reinforcement Learning (PPO)

  • Fine-tune the policy (SFT model) to maximize predicted rewards from the RM
  • Use PPO to balance reward maximization with maintaining similarity to the original model
  • Output: Aligned LLM that generates preferred responses

💡 Intuition: PPO teaches the LLM to generate preferred responses indirectly, using the reward model as scalable feedback instead of requiring human labels for every output.



3. Why PPO Instead of Direct Human Feedback?

Direct human labeling for all outputs is impractical and noisy. PPO helps by:

  • Scaling feedback: Reward models generalize human preferences to unseen outputs
  • Credit assignment: Uses value function and advantage to propagate sequence-level rewards to tokens
  • Stable updates: Ensures the model does not deviate too far from its original behavior (preventing mode collapse)
  • Efficient optimization: Can generate multiple trajectories and learn from them without constant human annotation


4. PPO Key Concepts

4.1 Components

| Component | Description | Role in Training |
| --- | --- | --- |
| Policy Model (π_θ) | The trainable LLM generating responses | Being optimized to maximize rewards |
| Reward Model (R_ϕ) | Evaluates outputs, providing scalar rewards | Provides the learning signal |
| Reference Model (π_ref) | Frozen copy of the initial SFT model | Anchors the KL penalty, preventing excessive drift |
| Value Function (V_θ) | Estimates the expected reward for a given prompt | Reduces variance in advantage estimation |
| Advantage (A_t) | How much better an action is than expected: A = R - V_θ(s) | Guides the direction and magnitude of updates |

4.2 Intuition

PPO adjusts the LLM to improve rewards without drastic changes:

  • Generates outputs → reward model evaluates → advantage guides update → policy improves
  • The clipped objective prevents extreme updates and maintains stability
  • The KL penalty keeps the model close to the reference policy to prevent reward hacking


5. PPO Objective Function

The Proximal Policy Optimization (PPO) algorithm optimizes a policy model π_θ while constraining how much it can diverge from the old policy π_θ_old, the snapshot of the policy that generated the current batch of samples.

5.1. Probability Ratio

\[ r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \]

The ratio measures how much the new policy's likelihood of an action has changed compared to the old policy.

Interpretation:

  • \(r_t > 1\): New policy assigns higher probability to this action
  • \(r_t < 1\): New policy assigns lower probability to this action
  • \(r_t \approx 1\): Policies are similar for this action

This ratio quantifies the magnitude and direction of policy change for each sampled token or action.
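
In implementations, the ratio is computed in log space for numerical stability. A minimal sketch, assuming per-token log-probability tensors logprobs_new and logprobs_old:

import torch

ratio = torch.exp(logprobs_new - logprobs_old)  # elementwise r_t(theta)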


5.2. Clipped PPO Objective

The clipped surrogate loss ensures stable updates by removing the incentive for large deviations in \(r_t(\theta)\):

\[ L^{PPO}(\theta) = \mathbb{E}_t \left[\min\left(r_t(\theta) A_t,\ \text{clip}(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon)\ A_t\right)\right] \]

Where:

  • A_t: Advantage function — how much better an action is than expected
  • ε: Clipping threshold (typically 0.1–0.2)
  • The min operation limits large, destabilizing updates

Why Clipping Works (see the sketch after this list):

  • If A_t > 0 (good action): the update increases the action's probability, but the gain is capped once \(r_t\) exceeds (1+ε)
  • If A_t < 0 (bad action): the update decreases the action's probability, but the gain is capped once \(r_t\) falls below (1-ε)
  • Prevents the policy from changing too dramatically in a single update
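
A minimal sketch of the clipped surrogate as a standalone function (it is an objective to be maximized, so implementations usually negate it as a loss):

import torch

def clipped_surrogate(ratio, advantages, eps=0.2):
    # min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t), averaged over tokens
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()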

Why You Need Both Clipping and a KL Penalty

In the original PPO paper by Schulman et al. (2017), two variants were proposed as alternatives to each other, not as mechanisms meant to be used together:

  • Variant 1: PPO-Clip: Uses the clipped surrogate objective. The probability ratio r(θ) = π_θ / π_old is clipped to [1−ε, 1+ε], which prevents the optimization step from making too large a policy update. This is the version most people think of as "PPO."
  • Variant 2: PPO-Penalty: Instead of clipping, it adds a KL divergence penalty term directly to the objective, with an adaptive coefficient that increases if KL gets too large and decreases if it's too small. This acts as a soft constraint on how far the new policy can drift.

In the RLHF/LLM fine-tuning context (like InstructGPT / ChatGPT training), practitioners often use PPO-Clip plus an additional KL penalty against a frozen reference model. This is because they're solving a somewhat different problem than standard RL:

  • Clipping prevents the policy from changing too much within a single update step relative to the previous iteration (π_old). It's a local, per-step safeguard.
  • KL penalty against the reference prevents the policy from drifting too far cumulatively over the entire training run from the original pretrained model. It's a global safeguard.

Clipping alone doesn't prevent the model from slowly drifting very far from the original pretrained model over many small, clipped steps — each step is small, but they compound. The KL term against the frozen reference catches this cumulative drift, which matters because you want the model to retain its general language capabilities, not collapse into some degenerate policy that only maximizes the reward model.



6. Value Function, Advantage, and Reward Computation

The PPO algorithm relies on several auxiliary components that ensure stable and meaningful policy updates.

6.1. Cumulative Reward (Return)

The cumulative reward (or return) represents the total discounted reward starting from time t:

\[ R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} \]
  • \(r_t\): reward received at time t (from the reward model in RLHF)
  • \(γ\): discount factor (typically 0.95–0.99)

Reward Simplification in RLHF:

In language model fine-tuning, the setup is simplified:

  • A prompt acts as the state s
  • The model's response (a sequence of tokens) is treated as the action a
  • A reward model (RM) assigns a single scalar reward \(r(s, a)\) for the entire sequence

Therefore: \(R = r(s, a)\)

This eliminates the need to sum discounted rewards across timesteps, simplifying PPO training.
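
For intuition, the general discounted return can be computed with a single backward scan over a reward sequence; in the RLHF special case the sum collapses to the single terminal reward. A minimal sketch:

def discounted_returns(rewards, gamma=0.99):
    # computes R_t for every timestep, scanning backwards
    R, returns = 0.0, []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    return returns[::-1]

With one sequence-level reward, rewards = [0, 0, ..., r(s, a)], so each R_t is simply a discounted copy of the final reward.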


6.2. Value Function

The value function estimates the expected return given a state (or prompt context):

\[ V_\theta(s_t) \approx \mathbb{E}[R_t \mid s_t] \]

The value loss penalizes inaccurate predictions:

\[ L^{value}(\theta) = \frac{1}{2} \left(V_\theta(s_t) - R_t\right)^2 \]

Implementation Details:

In practice, the value function is implemented as a learned neural network head attached to the policy model.

During training:

  1. The reward model provides rewards \(r_t\) for each sequence
  2. The cumulative discounted reward \(R_t\) is computed
  3. The value head learns to predict \(V_θ(s_t)\) to match the observed return \(R_t\)

There are two common approaches:

  • Monte Carlo estimate: directly use full episode returns \(R_t\) (common in RLHF)
  • Bootstrapped estimate: use \(r_t + γ V_θ(s_{t+1})\) to reduce variance

The value function serves as a baseline for computing the advantage.
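
A minimal sketch of such a head, assuming access to the transformer's final hidden states (hidden_size is the model dimension):

import torch.nn as nn

class ValueHead(nn.Module):
    # maps each hidden state to a scalar value estimate
    def __init__(self, hidden_size):
        super().__init__()
        self.v = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states):              # (batch, seq_len, hidden_size)
        return self.v(hidden_states).squeeze(-1)   # (batch, seq_len)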


6.3. Advantage Function

The advantage quantifies how much better an action \(a_t\) was compared to the expected baseline:

\[ A_t = R_t - V_\theta(s_t) \]

In practice, PPO often uses Generalized Advantage Estimation (GAE) for smoother and lower-variance estimates:

\[ A_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l} \]

where:

  • \(δ_t = r_t + γ V_θ(s_{t+1}) - V_θ(s_t)\)
  • \(λ\) is the GAE smoothing factor (typically 0.9–0.97)

Advantage in Practice for LLMs:

In LLM fine-tuning with PPO, the advantage is typically computed at the sequence level:

  1. For each prompt \(s\), the model generates a sequence \(a = (a_1, a_2, ..., a_T)\)
  2. The reward model provides a scalar reward \(r(s, a)\) for the whole sequence
  3. The value head predicts \(V_θ(s)\), estimating the expected reward before generation
  4. The advantage is computed as: \(A = r(s, a) - V_θ(s)\)

When Token-Level Advantages Are Used:

Some implementations compute token-level advantages to better attribute credit:

  • Assign the same scalar reward to all tokens in a sequence
  • Use GAE to smooth the signal: \(A_t = GAE(r_t, V_θ(s_t))\)
  • Provides more stable gradients and finer control during backpropagation

Summary:

  • Sequence-level PPO: \(A = r(s, a) - V_θ(s)\) → simpler, effective for sparse rewards
  • Token-level PPO: Uses GAE to propagate reward information across tokens (see the sketch below)
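
A minimal sketch of GAE for one trajectory, assuming 1-D tensors of per-token rewards and value predictions:

import torch

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    # backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

In the RLHF setup described above, rewards would be all zeros except for the sequence-level score placed at the final token.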

6.4. Entropy Bonus (Exploration Term)

The entropy loss encourages the policy to explore rather than prematurely converge:

\[ H[\pi_\theta] = - \sum_a \pi_\theta(a|s_t) \log \pi_\theta(a|s_t) \]

Higher entropy = more exploration and diversity in generated responses.

Why Entropy Matters:

  • Prevents the model from becoming too deterministic
  • Maintains diversity in outputs
  • Helps avoid mode collapse where the model only generates a few "safe" responses
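
A minimal sketch computing the mean per-token entropy from the policy's logits:

import torch

def mean_token_entropy(logits):
    # entropy of the next-token distribution, averaged over positions
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean()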

6.5. Combined PPO Loss

The full training objective combines all three components:

\[ L_{total}(\theta) = -L^{PPO}(\theta) + c_1 \cdot L^{value}(\theta) - c_2 \cdot H[\pi_\theta] \]

Where:

  • \(H[π_θ]\): entropy term promoting exploration
  • \(c_1\): value loss coefficient (typically 0.5–1.0)
  • \(c_2\): entropy coefficient (typically 0.01–0.1)

Additional: KL Penalty Term

In practice, many implementations add a KL divergence penalty to prevent the policy from drifting too far from the reference model:

\[ L_{total}(\theta) = -L^{PPO}(\theta) + c_1 \cdot L^{value}(\theta) - c_2 \cdot H[\pi_\theta] + c_3 \cdot D_{KL}(\pi_\theta || \pi_{ref}) \]

Where:

  • \(c_3\): KL penalty coefficient (adaptive or fixed, typically 0.01–0.1)
  • \(D_{KL}\): KL divergence between current and reference policy


7. Iterative PPO Update Flow

The training loop follows these steps:

  1. Generate responses with the current policy and record their log probabilities (these become the "old" policy log probabilities)
  2. Compute rewards using the reward model
  3. Compute log probabilities under the frozen reference model (for the KL penalty)
  4. Estimate values using the value head
  5. Compute advantages (A = R - V)
  6. Compute the probability ratio (r_t = π_new / π_old)
  7. Update the policy using the clipped surrogate loss
  8. Update the value function to better predict returns
  9. Apply the entropy bonus to maintain exploration
  10. Apply the KL penalty against the reference model to prevent cumulative drift
  11. Repeat: each new round of generation implicitly refreshes π_old, while the reference model typically stays frozen for the entire run

Intuition: PPO increases the probability of actions that were better than expected, but only within a controlled trust region, ensuring stable learning.



8. Implementation Example (Pseudocode)

# Training loop
for epoch in range(num_epochs):
    for batch in dataloader:
        prompts = batch['prompts']

        # 1. Generate responses with the current policy
        responses = policy_model.generate(prompts)

        # 2. Compute reward from the reward model (sequence-level)
        rewards = reward_model(prompts, responses)

        # 3. Log probabilities at sampling time: the "old" policy
        #    probabilities used in the PPO ratio (no gradients), plus
        #    the frozen reference model's for the KL penalty
        with torch.no_grad():
            logprobs_old = policy_model.logprobs(prompts, responses)
            logprobs_ref = ref_model.logprobs(prompts, responses)

        # 4. Value estimates at sampling time serve as the baseline
        with torch.no_grad():
            values_old = value_head(prompts)  # V_theta(s)

        # 5. Compute advantages (sequence-level)
        advantages = rewards - values_old
        # Normalize advantages for stability
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        # Mini-batch updates (multiple optimization epochs on the same samples)
        for _ in range(ppo_epochs):
            # 6. Recompute log probs and values under the *current* parameters
            #    (these carry gradients; logprobs_old does not)
            logprobs_policy = policy_model.logprobs(prompts, responses)
            values = value_head(prompts)

            # 7. Probability ratio and clipped surrogate loss
            ratio = torch.exp(logprobs_policy - logprobs_old)
            clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
            policy_loss = -torch.mean(
                torch.min(ratio * advantages, clipped_ratio * advantages)
            )

            # 8. Value loss against the observed returns
            value_loss = 0.5 * torch.mean((values - rewards) ** 2)

            # 9. Entropy bonus (sample-based estimate from the generated
            #    tokens; a full implementation sums over the vocabulary)
            entropy = -torch.mean(logprobs_policy)

            # 10. KL penalty against the frozen reference model
            #     (sample-based estimate of D_KL(pi_theta || pi_ref))
            kl_div = torch.mean(logprobs_policy - logprobs_ref)

            # 11. Combine losses
            total_loss = (
                policy_loss +
                c1 * value_loss -
                c2 * entropy +
                c3 * kl_div
            )

            # 12. Backpropagate and update
            optimizer.zero_grad()
            total_loss.backward()
            torch.nn.utils.clip_grad_norm_(policy_model.parameters(), max_grad_norm)
            optimizer.step()

    # 13. The "old" policy is refreshed implicitly at the next generation
    #     step; the KL reference model typically stays frozen for the
    #     whole run, though some implementations sync it occasionally:
    if (epoch + 1) % update_ref_interval == 0:
        ref_model.load_state_dict(policy_model.state_dict())


9. Limitations and Challenges of PPO in LLM Training

🧩 1. KL Divergence Sensitivity

PPO adds a KL penalty to prevent the model from drifting too far:

\[ L = L^{PPO} - \beta D_{KL}(\pi_{\theta} || \pi_{ref}) \]

Challenges:

  • Too small \(β\): model diverges, may collapse to degenerate solutions
  • Too large \(β\): very slow learning, model stays too close to initialization
  • Solution: Adaptive KL control adjusts \(β\) based on the observed KL divergence (a controller sketch follows)
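
A minimal sketch of such a controller, loosely following the adaptive scheme from the PPO paper (hyperparameter names and defaults are illustrative):

class AdaptiveKLController:
    # nudges beta so the observed KL tracks a target value
    def __init__(self, init_beta=0.1, target_kl=6.0, horizon=10000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl, n_steps):
        # proportional error, clipped to avoid violent swings in beta
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        self.beta *= 1.0 + 0.1 * error * n_steps / self.horizon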

⏳ 2. High Training Cost

Computational Requirements:

  • Multiple models in memory: policy, reference, reward model, value head
  • Fine-tuning large LLMs can require thousands of GPU-hours
  • Need to generate samples, compute rewards, and train simultaneously
  • Typically requires distributed training across many GPUs

Memory Challenges:

  • Reference model is often a frozen copy of the policy
  • Reward model may be as large as the policy model
  • Requires efficient batching and gradient accumulation

⚠️ 3. Reward Hacking

The Problem:

  • LLM may over-optimize for the reward model instead of true human preferences
  • Exploits weaknesses or biases in the reward model
  • Can result in responses that "game" the reward model

Common Examples:

  • Overly verbose or repetitive responses (if length correlates with reward)
  • Excessive politeness or flattery
  • Technically correct but misleading or unhelpful responses
  • Responses that avoid controversial topics even when appropriate

Mitigations:

  • Regularization through KL penalty
  • Diverse and robust reward model training
  • Iterative improvement of reward models
  • Human evaluation of final outputs

🧮 4. Sparse or Noisy Rewards

Sparse Rewards:

  • One reward per sequence makes credit assignment harder
  • Difficult to determine which tokens contributed to high/low reward
  • Increases variance in gradient estimates

Noisy Rewards:

  • Subjective or inconsistent human preferences
  • Reward model uncertainty
  • Can lead to unstable updates and poor convergence

Solutions:

  • Token-level advantage estimation (GAE)
  • Larger batch sizes to reduce variance
  • Reward model ensembles
  • Value function as a learned baseline

🔁 5. Credit Assignment Problem

Challenge:

  • Per-token updates but per-sequence rewards create ambiguity
  • Which specific tokens led to high/low rewards?
  • Early tokens affect later generation but get same reward signal

Approaches:

  • GAE for token-level credit assignment
  • Shaped rewards (e.g., intermediate rewards for partial sequences)
  • Curriculum learning (start with simpler tasks)

⚖️ 6. Exploration vs Alignment Trade-off

The Dilemma:

  • Encouraging exploration may generate unsafe or off-distribution outputs
  • Too little exploration leads to mode collapse
  • Need to balance diversity with safety and alignment

Mitigations:

  • Carefully tuned entropy coefficient
  • Safety constraints in reward model
  • Filtered sampling (reject unsafe outputs before training)

🔍 7. Implementation Complexity

Technical Challenges:

  • Multiple models with different update schedules
  • Careful hyperparameter tuning (ε, c_1, c_2, c_3, learning rate)
  • Numerical stability (log probabilities, ratio clipping)
  • Can be unstable if any component is suboptimal

Engineering Challenges:

  • Distributed training coordination
  • Efficient sampling and reward computation
  • Memory management for large models
  • Reproducibility across runs

🎯 8. Reward Model Quality Bottleneck

Issue:

  • PPO is only as good as the reward model
  • Garbage in, garbage out: poor reward model → poor aligned model
  • Reward model may not capture all aspects of human preference

Implications:

  • Need high-quality preference data for reward model training
  • Reward model must generalize beyond its training distribution
  • Continuous iteration on reward model alongside policy training

📊 9. Distribution Shift

Problem:

  • As the policy improves, it generates outputs different from the initial SFT model
  • Reward model may not generalize to these new outputs (out-of-distribution)
  • Can lead to reward model exploits or failures

Solutions:

  • Online reward model updates with new samples
  • Conservative updates (small ε, high KL penalty)
  • Iterative data collection and reward model retraining

10. Alternative Approaches and Recent Developments

Direct Preference Optimization (DPO)

  • Eliminates the separate reward model and PPO training
  • Directly optimizes the policy from preference data (its loss is shown below)
  • Simpler and more stable than PPO
  • Lower computational cost
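
For reference, the DPO loss optimizes the policy directly on preference pairs \((x, y_w, y_l)\) (preferred and rejected responses), reusing the frozen reference model as an implicit KL anchor:

\[ L_{DPO}(\theta) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \right) \right] \]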

RLAIF (RL from AI Feedback)

  • Uses AI model instead of humans to provide feedback
  • More scalable but potentially less aligned with human values
  • Can be combined with human feedback

Constitutional AI

  • Uses principles and critiques to guide behavior
  • Can reduce need for extensive human preference data
  • Complementary to RLHF/PPO


11. Best Practices for PPO in LLM Training

Hyperparameter Tuning

  • Start with conservative values (small ε, learning rate)
  • Use learning rate warmup (gradually increase from 0)
  • Monitor KL divergence and adjust β adaptively
  • Normalize advantages for stable training

Data Quality

  • Ensure diverse, high-quality prompts
  • Balance prompt distribution across topics
  • Regularly update preference data
  • Filter out low-quality or adversarial examples

Monitoring and Debugging

  • Track multiple metrics: reward, KL, entropy, value loss
  • Log sample generations at regular intervals
  • Monitor for reward hacking patterns
  • Use tensorboard or wandb for visualization

Computational Efficiency

  • Use gradient checkpointing for memory
  • Mixed precision training (FP16/BF16)
  • Distributed training across GPUs
  • Batch prompts of similar lengths together

Safety and Alignment

  • Regular human evaluation
  • Red-team testing throughout training
  • Maintain capability benchmarks
  • Implement safety filters and guardrails