DAPO
1. Overview¶
DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) is an RL fine-tuning algorithm for LLMs developed by ByteDance Seed and Tsinghua AIR (Yu et al., 2025). It extends GRPO with four targeted fixes that address failure modes observed when scaling RL to long chain-of-thought (CoT) reasoning tasks.
Key result: DAPO achieves 50 points on AIME 2024 with Qwen2.5-32B (vs. 47 for DeepSeek-R1-Zero-32B), using 50% fewer training steps, with fully open-sourced training code.
2. The Big Picture: Why GRPO Needs Fixing at Scale¶
GRPO works well in small-scale settings but exhibits four failure modes when training long-CoT reasoning models at scale:
| Problem | Root cause | DAPO fix |
|---|---|---|
| Entropy collapse | Symmetric clip penalizes exploration | Clip-Higher (asymmetric clip) |
| Wasted compute on trivial prompts | Groups with all-correct/all-wrong get zero gradient | Dynamic Sampling |
| Short-response bias | Sequence-level loss weights short/long equally by count | Token-level Policy Gradient Loss |
| Reward noise from truncation | Truncated responses get harsh penalties | Overlong Reward Shaping |
3. DAPO Formulation¶
DAPO starts from the GRPO objective but modifies each component:

\[
\mathcal{J}_{\text{DAPO}}(\theta) = \mathbb{E}_{(q,a)\sim\mathcal{D},\ \{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\text{old}}}(\cdot\mid q)} \left[ \frac{1}{\sum_{i=1}^{G}|o_i|} \sum_{i=1}^{G}\sum_{t=1}^{|o_i|} \min\!\Big( r_{i,t}(\theta)\,\hat{A}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\varepsilon_{\text{low}},\, 1+\varepsilon_{\text{high}}\big)\,\hat{A}_{i,t} \Big) \right]
\]

where \(r_{i,t}(\theta) = \pi_\theta(o_{i,t}\mid q, o_{i,<t}) / \pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})\) is the token-level importance ratio, \(\hat{A}_{i,t} = \big(R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})\big) / \operatorname{std}(\{R_j\}_{j=1}^{G})\) is the group-normalized advantage, \(\varepsilon_{\text{low}} \neq \varepsilon_{\text{high}}\) (decoupled clipping), and the normalization is over tokens rather than sequences.
The KL penalty term is removed in DAPO — entropy is controlled by the asymmetric clip instead.
4. The Four Key Innovations¶
4.1 Clip-Higher (Decoupled Clip Ratio)¶
Problem: GRPO uses a symmetric clip \([1-\varepsilon, 1+\varepsilon]\). The upper clip constrains how much the model can increase the probability of good responses, suppressing exploration and causing entropy collapse — the model converges prematurely to a narrow output distribution.
Fix: Use separate thresholds for the lower and upper clip bounds:

\[
\operatorname{clip}\big(r_{i,t}(\theta),\ 1-\varepsilon_{\text{low}},\ 1+\varepsilon_{\text{high}}\big)
\]
Typical values: \(\varepsilon_{\text{low}} = 0.2\), \(\varepsilon_{\text{high}} = 0.28\).
Effect: The model can increase probability of correct responses more aggressively (via the looser upper bound), while still being constrained from catastrophically reducing probability of any response (via the tighter lower bound). This maintains entropy and promotes diversity.
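As a minimal numeric illustration (using the typical \(\varepsilon\) values above), upward moves of the importance ratio survive up to 1.28 before clipping, while downward moves are still capped at 0.8:

```python
import torch

def clip_ratio(ratio, eps_low=0.2, eps_high=0.28):
    """Clip-Higher: decoupled lower/upper bounds on the importance ratio."""
    return torch.clamp(ratio, 1 - eps_low, 1 + eps_high)

ratios = torch.tensor([0.5, 0.9, 1.1, 1.25, 1.5])
print(clip_ratio(ratios))  # tensor([0.8000, 0.9000, 1.1000, 1.2500, 1.2800])
```

With a symmetric clip at \(\varepsilon = 0.2\), the last ratio would have been cut to 1.2; the looser upper bound is exactly what lets low-probability correct tokens grow faster.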
4.2 Dynamic Sampling¶
Problem: When all \(G\) samples for a prompt have the same reward (all correct or all incorrect), the group advantage is zero — no gradient signal is generated. These prompts waste compute.
Fix: During each training batch, filter out prompt groups where all samples produce the same outcome:
Keep group i only if: 0 < accuracy_i < 1
Sampling continues until the batch has the required number of effective (non-degenerate) groups.
Effect: Every gradient step uses only prompts that provide a learning signal. This improves both training efficiency and stability.
Hyperparameters: minimum and maximum group-accuracy thresholds. Typically, only groups with accuracy exactly 0 or exactly 1 are filtered out.
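The filtering rule can be sketched as follows, assuming a hypothetical mapping from each prompt to its list of \(G\) binary rewards:

```python
def filter_effective_groups(groups, G):
    """Dynamic sampling filter: keep only groups with 0 < accuracy < 1.

    `groups` maps a prompt to its list of G binary rewards (an assumed
    input format for illustration).
    """
    effective = {}
    for prompt, rewards in groups.items():
        accuracy = sum(rewards) / G
        if 0.0 < accuracy < 1.0:  # mixed outcomes => nonzero advantage
            effective[prompt] = rewards
    return effective

groups = {"p1": [1, 1, 1, 1], "p2": [1, 0, 1, 0], "p3": [0, 0, 0, 0]}
print(filter_effective_groups(groups, G=4))  # {'p2': [1, 0, 1, 0]}
```

In a full training loop this filter runs inside the sampling loop, which keeps drawing new prompts until the batch contains the required number of effective groups.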
4.3 Token-Level Policy Gradient Loss¶
Problem: GRPO and most RL methods use a sequence-level loss: the advantage for the entire sequence is computed once, averaged within each sequence, and then averaged across sequences:

\[
\mathcal{L}_{\text{seq}} = -\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\!\Big(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\varepsilon_{\text{low}},\, 1+\varepsilon_{\text{high}}\big)\,\hat{A}_{i,t}\Big)
\]

Because each sequence carries weight \(1/G\) regardless of length, every token in a short response carries far more weight than a token in a long response. In long-CoT tasks this under-weights high-quality long reasoning chains and under-penalizes undesirable patterns in long ones (e.g. repetition), biasing training toward short responses.
Fix: Normalize the loss by total token count rather than by sequence count:

\[
\mathcal{L}_{\text{token}} = -\frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|} \min\!\Big(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\varepsilon_{\text{low}},\, 1+\varepsilon_{\text{high}}\big)\,\hat{A}_{i,t}\Big)
\]
Effect: Long reasoning chains contribute proportionally more to the gradient, preventing the model from learning to give short non-reasoning answers.
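The weighting difference is easy to see with a toy comparison of per-token gradient weights for one short and one long response (a pure-Python sketch, lengths chosen for illustration):

```python
def per_token_weight_seq_level(lengths):
    # Sequence-level: each sequence gets weight 1/G, split over its tokens
    G = len(lengths)
    return [1 / (G * L) for L in lengths]

def per_token_weight_token_level(lengths):
    # Token-level: every token gets the same weight 1/(total token count)
    total = sum(lengths)
    return [1 / total for _ in lengths]

lengths = [10, 1000]  # one short response, one long reasoning chain
print(per_token_weight_seq_level(lengths))    # [0.05, 0.0005]
print(per_token_weight_token_level(lengths))  # each token weighs 1/1010
```

Under the sequence-level loss, a token in the short answer carries 100x the weight of a token in the long chain; under the token-level loss, all tokens weigh the same.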
4.4 Overlong Reward Shaping¶
Problem: When a generated response is truncated at the maximum length limit, it typically receives a large negative reward (incomplete answer = wrong). This creates noisy, unstable gradient signals and discourages the model from generating longer reasoning chains even when they are beneficial.
Fix: Apply a soft length penalty instead of a hard truncation reward. With a buffer of length \(L_{\text{cache}}\) before the hard cap \(L_{\max}\):

\[
R_{\text{length}}(y) =
\begin{cases}
0, & |y| \le L_{\max} - L_{\text{cache}} \\[4pt]
\dfrac{(L_{\max} - L_{\text{cache}}) - |y|}{L_{\text{cache}}}, & L_{\max} - L_{\text{cache}} < |y| \le L_{\max} \\[4pt]
-1, & |y| > L_{\max}
\end{cases}
\]

Responses that approach \(L_{\max}\) receive a gradually increasing penalty rather than an abrupt large negative reward.
Effect: Smoother reward landscape; more stable gradients for long-CoT training.
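A direct implementation of this piecewise penalty (the default lengths below are illustrative choices, not prescribed values):

```python
def overlong_penalty(length, L_max=20_480, L_cache=4_096):
    """Soft overlong punishment: 0 inside the safe zone, linearly
    decreasing toward -1 inside the buffer, -1 past the hard cap."""
    if length <= L_max - L_cache:
        return 0.0
    if length <= L_max:
        return ((L_max - L_cache) - length) / L_cache
    return -1.0

print(overlong_penalty(10_000))  # 0.0  (inside the safe zone)
print(overlong_penalty(18_432))  # -0.5 (halfway through the buffer)
print(overlong_penalty(25_000))  # -1.0 (past the cap)
```

This shaping term is added to the correctness reward, so a truncated-but-otherwise-promising response loses credit gradually instead of being scored as a hard failure.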
5. Implementation Sketch¶
```python
# DAPO training step (sketch; `policy.log_probs(prompt, resp)` is an assumed
# helper returning per-token log-probabilities as a 1-D tensor)
import torch

def dapo_step(policy, policy_old, reward_fn, prompts, G,
              eps_low=0.2, eps_high=0.28):
    effective_groups = []
    for prompt in prompts:
        # Sample G responses from the behavior policy
        responses = [policy_old.generate(prompt) for _ in range(G)]
        rewards = [reward_fn(prompt, r) for r in responses]
        # Dynamic sampling: skip trivial groups; in practice, keep sampling
        # until the batch holds the required number of effective groups
        accuracy = sum(r > 0 for r in rewards) / G
        if accuracy == 0.0 or accuracy == 1.0:
            continue
        effective_groups.append((prompt, responses, rewards))

    total_loss = torch.tensor(0.0)
    total_tokens = 0
    for prompt, responses, rewards in effective_groups:
        r = torch.tensor(rewards, dtype=torch.float32)
        # Group-normalized advantages
        advantages = (r - r.mean()) / (r.std() + 1e-8)
        for resp, adv in zip(responses, advantages):
            logp_new = policy.log_probs(prompt, resp)          # shape (T,)
            with torch.no_grad():
                logp_old = policy_old.log_probs(prompt, resp)  # shape (T,)
            ratio = (logp_new - logp_old).exp()
            # Clip-Higher: asymmetric clip bounds
            clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)
            total_loss = total_loss + (-torch.min(ratio * adv, clipped * adv)).sum()
            total_tokens += ratio.numel()
    # Token-level normalization: divide by total tokens, not sequences
    return total_loss / max(total_tokens, 1)
```
6. DAPO vs GRPO vs PPO¶
| Aspect | DAPO | GRPO | PPO |
|---|---|---|---|
| Critic needed | No | No | Yes |
| Clip ratio | Asymmetric (ε_low ≠ ε_high) | Symmetric | Symmetric |
| KL penalty | No (entropy via clip) | Yes | Yes |
| Loss level | Token | Sequence | Token |
| Trivial group filtering | Yes (dynamic sampling) | No | No |
| Long-CoT stability | High | Moderate | Moderate |
| Entropy collapse prevention | Explicit (Clip-Higher) | Partial (KL) | Partial (entropy bonus) |
| AIME 2024 (32B) | 50 pts | 47 pts (DeepSeek R1-Zero) | — |
7. Limitations¶
More hyperparameters: \(\varepsilon_{\text{low}}\), \(\varepsilon_{\text{high}}\), dynamic sampling thresholds, overlong shaping parameters — each adds a tuning dimension. The original paper provides defaults but these may not transfer across model families.
Dynamic sampling changes batch composition: Filtering trivial groups means the effective batch size varies. If a difficulty-calibrated dataset is used and most prompts are trivial (or all hard), sampling can stall. Careful dataset curation is required.
No formal convergence guarantee: Removing the KL penalty and using asymmetric clipping breaks the theoretical bounds from PPO's trust-region analysis. Empirical stability is well-demonstrated but not theoretically proven.
Requires verifiable rewards: Like GRPO and RLVR, DAPO depends on a binary or graded correctness signal. It is not directly applicable to open-ended generation tasks without a reward model.