DeepSeek RL Fine-Tuning
1. Overview
This document provides a comprehensive overview of the DeepSeek-R1 strategy for fine-tuning and preference-tuning large language models (LLMs). It covers the RL methods, distinctions from traditional approaches, the GRPO optimization algorithm, multi-stage training pipeline, reward design, model distillation, and additional technical details.
DeepSeek-R1 introduces a novel approach to improving reasoning capabilities and general instruction-following in large language models using reinforcement learning (RL). Rather than relying solely on large human-annotated supervised fine-tuning datasets and learned reward models, DeepSeek emphasizes verifiable tasks (particularly reasoning, mathematics, and code), multi-stage pipelines, and knowledge distillation to smaller models.
1.1 Key Variants
- DeepSeek-R1-Zero: RL-only variant without initial supervised fine-tuning (SFT). Uses verifiable reasoning tasks (e.g., math, code, logic) with automatically computable reward signals.
- DeepSeek-R1: Multi-stage pipeline starting with a "cold-start" SFT, followed by reasoning-oriented RL, generation of an SFT dataset from high-quality RL outputs, further SFT fine-tuning, and then a second RL stage for broader instruction-following.
2. The GRPO Algorithm: Overview
This section provides a summary of the core optimization algorithm used in DeepSeek-R1: Group Relative Policy Optimization (GRPO).
For full mathematical details and derivation, see the dedicated GRPO Algorithm document.
2.1 Intuition
Instead of training a value network (critic) as in PPO, GRPO samples multiple candidate outputs for each prompt and scores each one relative to the others in its group: a candidate's advantage is its reward normalized by the group's mean and standard deviation, not a rank. This rewards responses that outperform their peers, emphasizing relative improvement over absolute reward magnitude. Dropping the critic also removes a second large model from the training loop, which reduces memory and compute cost and avoids the instability that value-function training can introduce.
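The group-relative comparison can be illustrated with a minimal sketch. The function name, the use of the population standard deviation, and the small `eps` term are assumptions made for this example, not details taken from the DeepSeek implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the G samples (assumption)
    # eps is an illustrative safeguard: it avoids division by zero when
    # every candidate in the group receives the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, G = 4 sampled answers scored 1.0 (correct) or 0.0 (incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Candidates above the group mean get a positive advantage and those below get a negative one. A group whose candidates all score the same yields zero advantages and therefore no learning signal for that prompt.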
2.2 Key Features
- Group-based candidate sampling: Group size \(G\) typically ranges from 8 to 16
- Advantage computation: \(A_i = \frac{r_i - \mathrm{mean}(r_{1..G})}{\mathrm{std}(r_{1..G})}\), where \(r_i\) is the reward of candidate \(o_i\)
- PPO-style clipped ratio: Applied to the probability ratio between the new and old policy for each candidate
- KL regularization: Penalizes drift from a frozen reference policy
- No explicit value function: Avoids training and storing a critic comparable in size to the policy, which matters for large-scale LLM fine-tuning efficiency
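The clipped ratio and KL penalty listed above can be combined as in the following sketch. It is a simplification under stated assumptions: it works on whole-sequence log-probabilities rather than per-token terms, and the `clip_eps` and `kl_coef` defaults are placeholders, not values confirmed by this document. The function name is hypothetical.

```python
import math


def grpo_sample_objective(logp_new, logp_old, logp_ref, advantage,
                          clip_eps=0.2, kl_coef=0.04):
    """Objective (to be maximized) for one candidate in a group.

    logp_new / logp_old / logp_ref are sequence log-probabilities under the
    current, sampling-time, and frozen reference policies (a simplification).
    """
    # PPO-style probability ratio between new and old policy.
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Pessimistic (min) choice between unclipped and clipped terms.
    surrogate = min(ratio * advantage, clipped * advantage)
    # Non-negative KL estimate x - log(x) - 1 with x = pi_ref / pi_new.
    log_x = logp_ref - logp_new
    kl = math.exp(log_x) - log_x - 1.0
    return surrogate - kl_coef * kl
```

When the three policies agree, the ratio is 1 and the KL term is 0, so the objective equals the advantage. Clipping only limits how much a favorable ratio can increase the objective; an unfavorable move is still penalized in full.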
3. Reward Design
DeepSeek divides reward design into two main domains: reasoning-oriented tasks and general instruction-following tasks.
3.1 Reasoning-Oriented Tasks
The reward function for reasoning tasks includes:
- Correctness: Automatically verified through solvers for math answers or compilers/tests for code solutions
- Chain-of-Thought (CoT) / Format Enforcement: Encourages structured reasoning via tags or designated reasoning segments
- Language Consistency / Style: Penalizes language mixing (e.g., switching between English and Chinese within one response) or incoherent formatting
- Weighted Sum: The overall reward combines correctness with readability and style metrics
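The components listed above could be combined roughly as follows. This is a sketch only: the `<think>` tag name, the exact-match answer check, the character-based language proxy, and all weights are assumptions chosen for illustration, and every function name is hypothetical.

```python
import re

# Assumed format: reasoning inside <think>...</think>, then a final answer.
THINK_PATTERN = re.compile(r"^<think>.+?</think>\s*\S", re.DOTALL)


def format_reward(response):
    """1.0 if the response has a reasoning segment followed by an answer."""
    return 1.0 if THINK_PATTERN.match(response.strip()) else 0.0


def correctness_reward(response, reference_answer):
    """Exact match on the text after the closing tag (a toy verifier)."""
    final = response.split("</think>")[-1].strip()
    return 1.0 if final == reference_answer.strip() else 0.0


def language_consistency(response):
    """Share of alphabetic characters that are ASCII (crude English proxy)."""
    letters = [c for c in response if c.isalpha()]
    if not letters:
        return 1.0
    return sum(c.isascii() for c in letters) / len(letters)


def reasoning_reward(response, reference_answer,
                     w_correct=1.0, w_format=0.5, w_lang=0.5):
    """Weighted sum of correctness, format, and language consistency."""
    return (w_correct * correctness_reward(response, reference_answer)
            + w_format * format_reward(response)
            + w_lang * language_consistency(response))


good = "<think>2 + 2 = 4</think> 4"
mixed = "<think>答案是 4</think> 4"
```

In practice the correctness check would be a math solver or a test harness rather than a string comparison. The sketch only shows how rule-based signals can be summed into one scalar reward.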
3.2 General Instruction-Following Tasks
For broader tasks, DeepSeek employs:
- Preference models or mixtures of rule-based checks for helpfulness, harmlessness, and style
- Learned reward models for tasks beyond verifiable reasoning domains
- Integration after the reasoning-oriented RL stage to develop comprehensive instruction-following capabilities
4. Multi-Stage Training Pipeline
The DeepSeek-R1 training strategy follows a systematic multi-stage approach:
| Stage | Description |
|---|---|
| Stage 1: Cold-Start SFT | A small curated dataset of chain-of-thought reasoning examples bootstraps the model, stabilizing the initial policy before intensive RL. |
| Stage 2: Reasoning-Oriented RL | GRPO applied to verifiable reasoning tasks (math, code, logic) drives emergent reasoning capability—either via RL only (DeepSeek-R1-Zero) or RL after SFT (DeepSeek-R1). |
| Stage 3: Rejection Sampling → SFT Dataset | RL-generated outputs are filtered by quality and readability to create a high-quality SFT dataset, addressing issues like language mixing or readability observed in R1-Zero. |
| Stage 4: Second RL Stage (General Instruction-Following) | Expands prompt coverage to include broad instructions and incorporates general reward signals (helpfulness, style, harmlessness) to generalize beyond reasoning tasks. |
| Stage 5: Distillation to Smaller Models | Uses the high-capability RL-trained model as a teacher to generate reasoning-rich data, then fine-tunes smaller student models via SFT on that data (rather than performing full RL on smaller models). |
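Stage 3 in the table above can be sketched as a simple filter loop. The `generate`, `is_correct`, and `is_readable` callables are hypothetical stand-ins for the RL checkpoint and the quality checks; the toy generator below only cycles through canned strings so the example runs without a model.

```python
import itertools


def rejection_sample(prompts, generate, is_correct, is_readable,
                     samples_per_prompt=4):
    """Keep only sampled responses that pass both filters."""
    dataset = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            response = generate(prompt)
            if is_correct(prompt, response) and is_readable(response):
                dataset.append({"prompt": prompt, "response": response})
    return dataset


# Toy stand-in for a model: returns canned outputs in order.
canned = itertools.cycle(["4", "5", "4 (答案)"])


def toy_generate(prompt):
    return next(canned)


sft_data = rejection_sample(
    ["What is 2 + 2?"],
    toy_generate,
    is_correct=lambda prompt, response: response.startswith("4"),
    is_readable=lambda response: response.isascii(),  # crude mixing check
    samples_per_prompt=3,
)
```

Of the three samples, one is wrong and one mixes languages, so only the first survives into the SFT dataset.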
4.1 Pipeline Highlights
- The reasoning-only RL variant (R1-Zero) demonstrates that emergent reasoning can arise via RL alone (without SFT) but suffers from readability and language consistency issues
- For R1 proper, the cold-start SFT "kick-starts" the policy, improving readability and general language handling before RL
- Distilled models are available in multiple sizes: 1.5B, 7B, 8B, 14B, 32B, and 70B parameters, based on Qwen2.5 and Llama3 series
- According to the DeepSeek-R1 report, R1 achieves performance comparable to OpenAI's o1-1217 model on reasoning tasks
5. Distinctive Features Compared to Traditional Methods
| Feature | Conventional RLHF / SFT + RL | DeepSeek-R1 Strategy |
|---|---|---|
| Initial SFT | Often uses large human-annotated datasets | R1-Zero: none; R1: small cold-start SFT |
| Reward Source | Learned reward model (often from human preferences) | Reasoning tasks: rule-based correctness and format checks; General tasks: mixture of rule-based checks and learned reward models |
| Policy Optimization | PPO (with value network/critic, learned rewards) | GRPO (group-normalized advantages + clipped ratio + KL penalty) |
| Domain Focus | Broad instruction-following from the start | Emphasis on reasoning first → then general instructions |
| Post-RL Dataset Generation | Sometimes limited | RL outputs → filtered → SFT dataset → distillation |
| Distillation to Smaller Models | Optional / less emphasized | Explicit large → dataset → smaller models path |
| Emergence of Reasoning | Often via SFT + RL; may require large annotated data | Demonstrated via RL alone (R1-Zero), then refined by SFT + RL |
6. Technical & Training Details
6.1 Model Sizes and Releases
- Base models: R1-Zero and R1 are built on DeepSeek-V3-Base, a Mixture-of-Experts (MoE) model with 671B total parameters, of which 37B are activated per token
- Distilled variants: Available in 1.5B, 7B, 8B, 14B, 32B, and 70B parameter configurations based on Qwen2.5 and Llama3 series
6.2 Reward Function & Sampling Details
- Reasoning tasks: The reward function is largely rule-based, checking final answers for correctness, checking format tags around reasoning sections, and penalizing language mixing
- Process vs. outcome rewards: Training relies on outcome rewards (is the final answer correct?) rather than process reward models that score intermediate reasoning steps; the R1 report describes step-level rewards as hard to define at a fine-grained level and prone to reward hacking
- Evaluation sampling: Benchmark results for the R1-Distill models use temperature 0.6 and top-p 0.95, with 64 responses generated per query to estimate pass@1
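With several sampled responses per query, pass@1 can be estimated as the fraction of correct responses for each query, averaged over all queries. A minimal sketch follows; the function names are hypothetical and the example uses 4 samples per query instead of 64 to keep it short.

```python
def pass_at_1(correct_flags):
    """Average correctness over the k sampled responses for one query."""
    return sum(correct_flags) / len(correct_flags)


def benchmark_pass_at_1(per_query_flags):
    """Mean of the per-query pass@1 estimates."""
    return sum(pass_at_1(flags) for flags in per_query_flags) / len(per_query_flags)


# Two queries, k = 4 samples each; True means the sampled answer was correct.
flags = [
    [True, False, True, True],
    [False, False, False, False],
]
score = benchmark_pass_at_1(flags)
```

Averaging over many samples gives a lower-variance estimate than scoring a single sampled response per query.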
6.3 Training and Distillation Strategy
The distillation process leverages the high-capability teacher model to generate large "reasoning-rich" datasets. Student models are then fine-tuned via SFT (not full RL) to inherit reasoning patterns. While smaller models may underperform the teacher, they offer substantial cost and efficiency improvements.
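The data-generation half of this process might look like the sketch below. `teacher_generate` is a hypothetical callable standing in for the teacher model, and the prompt/completion JSONL layout is an assumed convention for SFT data, not a format specified by this document.

```python
import json


def build_distillation_records(prompts, teacher_generate):
    """Collect teacher outputs as prompt/completion pairs for student SFT."""
    records = []
    for prompt in prompts:
        completion = teacher_generate(prompt)
        if completion.strip():  # skip empty generations
            records.append({"prompt": prompt, "completion": completion})
    return records


def to_jsonl(records):
    """Serialize records one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def fake_teacher(prompt):
    # Placeholder for a real teacher model call.
    return "" if prompt == "skip me" else f"<think>reasoning</think> answer to {prompt}"


records = build_distillation_records(["Q1", "skip me", "Q2"], fake_teacher)
jsonl = to_jsonl(records)
```

The student is then fine-tuned on these records with an ordinary supervised objective, which is what keeps distillation much cheaper than running RL on each small model.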
6.4 Observed Strengths & Weaknesses
Strengths:
- Emergent reasoning capability via RL
- High performance on reasoning benchmarks
- Efficient multi-stage training approach
Weaknesses:
- R1-Zero exhibits readability and language mixing issues due to skipping SFT
- Distilled models experience some performance degradation compared to the full model
- General instruction-following may lag in smaller or early-stage variants
7. Summary Table
| Component | Role | Example / Notes |
|---|---|---|
| Policy Model (LLM) | Learns improved policy via RL | DeepSeek-R1, DeepSeek-R1-Zero |
| Reference Model | Provides KL regularization baseline | Frozen initial policy, e.g. the SFT checkpoint (in GRPO) |
| Reward Function | Scores responses | Correctness, readability, chain-of-thought format |
| Group Size (G) | Sampling granularity for GRPO | 8–16 outputs per prompt (typical) |
| Advantage (\(A_i\)) | Relative performance metric within group | Normalization: \((r_i - \text{mean}) / \text{std}\) |
| Objective (GRPO) | PPO-style surrogate + KL penalty | See GRPO doc for full derivation |
| Training Pipeline | Multi-stage (Cold-SFT → RL → SFT → RL → Distill) | Reasoning first, then broad instruction |
| Distillation | Transfer reasoning to smaller models | Student models 1.5B–70B params |
| Goal | Efficient reasoning/instruction fine-tuning | Stable RL fine-tuning for large LLMs |
8. Advantages & Limitations
Advantages
- Emergent reasoning: R1-Zero demonstrates that reasoning capability can emerge from RL alone, without a supervised fine-tuning stage or large human-annotated SFT datasets
- Efficient training: Multi-stage strategy combining SFT, RL, filtering, and distillation
- Verifiable rewards: Correctness and format-based signals reduce noise and training instability
- Scalable deployment: Distillation enables smaller, deployable models with reasoning capability for cost-effective production use
Limitations
- Limited transparency: Full details on datasets, hyperparameters, and training costs are not publicly available
- Instruction-following gaps: General instruction-following beyond reasoning may lag, especially in R1-Zero and smaller distilled variants
- Distillation trade-offs: SFT-only post-distillation may not fully retain RL-derived benefits in smaller models
- Filtering dependency: Effective reward design and RL output filtering remain critical; low-quality RL outputs create bottlenecks