DeepSeek RL Fine-Tuning

1. Overview

This document provides a comprehensive overview of the DeepSeek-R1 strategy for fine-tuning and preference-tuning large language models (LLMs). It covers the RL methods, distinctions from traditional approaches, the GRPO optimization algorithm, multi-stage training pipeline, reward design, model distillation, and additional technical details.

DeepSeek-R1 introduces a novel approach to improving reasoning capabilities and general instruction-following in large language models using reinforcement learning (RL). Rather than relying solely on large human-annotated supervised fine-tuning datasets and learned reward models, DeepSeek emphasizes verifiable tasks (particularly reasoning, mathematics, and code), multi-stage pipelines, and knowledge distillation to smaller models.

1.1 Key Variants

  • DeepSeek-R1-Zero: RL-only variant without initial supervised fine-tuning (SFT). Uses verifiable reasoning tasks (e.g., math, code, logic) with automatically computable reward signals.
  • DeepSeek-R1: Multi-stage pipeline starting with a "cold-start" SFT, followed by reasoning-oriented RL, generation of a new SFT dataset from high-quality RL outputs, another round of SFT, and a second RL stage for broader instruction-following.


2. The GRPO Algorithm: Overview

This section provides a summary of the core optimization algorithm used in DeepSeek-R1: Group Relative Policy Optimization (GRPO).

For full mathematical details and derivation, see the dedicated GRPO Algorithm document.

2.1 Intuition

Instead of using a value network (critic) as in PPO, GRPO samples multiple candidate outputs for each prompt and scores each candidate relative to the rest of its group: rewards are standardized within the group, so the policy is pushed toward responses that outperform their peers rather than toward high absolute reward magnitudes. By avoiding the complexity and instability of critic/value training, GRPO offers more stable fine-tuning for LLMs at scale.


2.2 Key Features

  • Group-based candidate sampling: Group size \(G\) is typically 8–16
  • Advantage computation: \(A_i = \frac{r_i - \mathrm{mean}(r_{1..G})}{\mathrm{std}(r_{1..G})}\), where \(r_i\) is the reward of candidate \(o_i\) (implemented in the sketch after this list)
  • PPO-style clipped ratio: Applied to new policy versus old policy for each candidate
  • KL regularization: Prevents drift from a reference policy
  • No explicit value function: Critical for large-scale LLM fine-tuning efficiency
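
To make these pieces concrete, below is a minimal NumPy sketch of the per-group GRPO objective. It works at the sequence level (the actual objective is computed per token), and the clipping threshold, KL coefficient, and random inputs are illustrative placeholders rather than DeepSeek's published hyperparameters.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: standardize each candidate's reward
    against the mean and std of its own group (Section 2.2)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_objective(logp_new, logp_old, logp_ref, rewards,
                   clip_eps=0.2, kl_coef=0.04):
    """Single-group GRPO objective (to be maximized): PPO-style clipped
    ratio weighted by group-relative advantages, minus a KL penalty
    toward a frozen reference policy. clip_eps and kl_coef are
    illustrative values, not DeepSeek's published settings."""
    adv = grpo_advantages(rewards)
    ratio = np.exp(logp_new - logp_old)              # pi_new / pi_old
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * adv, clipped * adv).mean()
    # Unbiased k3-style estimator of KL(pi_new || pi_ref)
    log_r = logp_ref - logp_new
    kl = (np.exp(log_r) - log_r - 1.0).mean()
    return surrogate - kl_coef * kl

# Example: one prompt with a group of G = 8 sampled responses
rng = np.random.default_rng(0)
G = 8
rewards = rng.integers(0, 2, size=G).astype(float)   # verifiable 0/1 rewards
logp_old = rng.normal(-10.0, 1.0, size=G)            # sequence log-probs
logp_new = logp_old + rng.normal(0.0, 0.1, size=G)
logp_ref = logp_old + rng.normal(0.0, 0.1, size=G)
print(grpo_objective(logp_new, logp_old, logp_ref, rewards))
```

Because advantages are standardized within each group, a group whose candidates are all correct (or all wrong) contributes no learning signal; the update only moves probability mass toward candidates that beat their peers.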


3. Reward Design

DeepSeek divides reward design into two main domains: reasoning-oriented tasks and general instruction-following tasks.

3.1 Reasoning-Oriented Tasks

The reward function for reasoning tasks includes the following components (a sketch follows the list):

  • Correctness: Automatically verified through solvers for math answers or compilers/tests for code solutions
  • Chain-of-Thought (CoT) / Format Enforcement: Encourages structured reasoning via tags or designated reasoning segments
  • Language Consistency / Style: Penalizes language mixing (e.g., English mixed with Chinese in a single response) or incoherent formatting
  • Weighted Sum: The overall reward combines correctness with readability and style metrics
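
A minimal sketch of such a rule-based reward is shown below. The <think>/<answer> tags follow the template described for R1-Zero, but the extraction logic, weights, and the specific language-mixing check here are assumptions for illustration, not DeepSeek's published reward.

```python
import re

def reasoning_reward(response: str, reference_answer: str) -> float:
    """Illustrative rule-based reward for a math-style task: a weighted
    sum of correctness, format enforcement, and a language-consistency
    penalty. Weights are assumptions; real correctness checks would use
    answer normalization, solvers, or test execution."""
    # Correctness: compare the extracted final answer to the reference
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    correct = 1.0 if answer == reference_answer.strip() else 0.0

    # Format enforcement: reasoning must appear inside designated tags
    has_cot = 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0

    # Language consistency: penalize mixing CJK and Latin scripts
    mixed = bool(re.search(r"[\u4e00-\u9fff]", response)) and \
            bool(re.search(r"[A-Za-z]", response))

    return 1.0 * correct + 0.2 * has_cot - (0.5 if mixed else 0.0)

print(reasoning_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4"))
```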

3.2 General Instruction-Following Tasks

For broader tasks, DeepSeek employs:

  • Preference models or mixtures of rule-based checks for helpfulness, harmlessness, and style
  • Learned reward models for tasks beyond verifiable reasoning domains
  • Integration after the reasoning-oriented RL stage to develop comprehensive instruction-following capabilities (a sketch of such a mixture follows)
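
As a hedged illustration of how a learned reward model and rule-based checks might be combined, consider the sketch below; the callables and the additive, unweighted composition are hypothetical, since the exact mixture is not public.

```python
from typing import Callable, Sequence

def general_reward(prompt: str, response: str,
                   rm_score: Callable[[str, str], float],
                   rule_checks: Sequence[Callable[[str, str], float]]) -> float:
    """Hypothetical mixture reward for general instruction-following:
    a learned preference-model score plus additive rule-based checks
    (e.g., harmlessness, style, formatting). The composition and
    weighting are illustrative assumptions."""
    score = rm_score(prompt, response)
    for check in rule_checks:
        score += check(prompt, response)
    return score
```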


4. Multi-Stage Training Pipeline

The DeepSeek-R1 training strategy follows a systematic multi-stage approach:

| Stage | Description |
| --- | --- |
| Stage 1: Cold-Start SFT | A small curated dataset of chain-of-thought reasoning examples bootstraps the model, stabilizing the initial policy before intensive RL. |
| Stage 2: Reasoning-Oriented RL | GRPO applied to verifiable reasoning tasks (math, code, logic) drives emergent reasoning capability, either via RL only (DeepSeek-R1-Zero) or RL after SFT (DeepSeek-R1). |
| Stage 3: Rejection Sampling → SFT Dataset | RL-generated outputs are filtered for quality and readability to create a high-quality SFT dataset, addressing issues such as language mixing and poor readability observed in R1-Zero. |
| Stage 4: Second RL Stage (General Instruction-Following) | Expands prompt coverage to broad instructions and incorporates general reward signals (helpfulness, style, harmlessness) to generalize beyond reasoning tasks. |
| Stage 5: Distillation to Smaller Models | Uses the high-capability RL-trained model as a teacher to generate reasoning-rich data, then fine-tunes smaller student models via SFT on that data (rather than performing full RL on the smaller models). |

4.1 Pipeline Highlights

  • The reasoning-only RL variant (R1-Zero) demonstrates that emergent reasoning can arise via RL alone (without SFT) but suffers from readability and language consistency issues
  • For R1 proper, the cold-start SFT "kick-starts" the policy, improving readability and general language handling before RL
  • Distilled models are available in multiple sizes: 1.5B, 7B, 8B, 14B, 32B, and 70B parameters, based on Qwen2.5 and Llama3 series
  • According to public reports, R1 achieves performance comparable to OpenAI's o1-1217 on reasoning and multitask benchmarks


5. Distinctive Features Compared to Traditional Methods

| Feature | Conventional RLHF / SFT + RL | DeepSeek-R1 Strategy |
| --- | --- | --- |
| Initial SFT | Often uses large human-annotated datasets | R1-Zero: none; R1: small cold-start SFT |
| Reward Source | Learned reward model (often from human preferences) | Reasoning tasks: rule-based correctness + ranking; general tasks: mixture |
| Policy Optimization | PPO (with value network/critic, learned rewards) | GRPO (group ranking + clipped ratio + KL penalty) |
| Domain Focus | Broad instruction-following from the start | Emphasis on reasoning first, then general instructions |
| Post-RL Dataset Generation | Sometimes limited | RL outputs → filtered → SFT dataset → distillation |
| Distillation to Smaller Models | Optional / less emphasized | Explicit large model → dataset → smaller models path |
| Emergence of Reasoning | Often via SFT + RL; may require large annotated data | Demonstrated via RL alone (R1-Zero), then refined by SFT + RL |


6. Technical & Training Details

6.1 Model Sizes and Releases

  • Base models: R1-Zero and R1 are built on a Mixture-of-Experts (MoE) architecture with 671B total parameters, of which approximately 37B are activated per token
  • Distilled variants: Available in 1.5B, 7B, 8B, 14B, 32B, and 70B parameter configurations based on Qwen2.5 and Llama3 series

6.2 Reward Function & Sampling Details

  • Reasoning tasks: The reward function is largely rule-based, checking final answers for correctness and format tags for reasoning sections, and penalizing language mixing
  • Process vs. outcome rewards: Outcome rewards (final-answer correctness) proved more effective than weaker process-level signals that score intermediate reasoning steps
  • Distillation sampling: For R1-Distill models, generation settings included temperature 0.6 and top-p 0.95, with 64 responses per query used to estimate pass@1 (see the sketch after this list)
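
To make the evaluation protocol concrete, here is a small sketch of how pass@1 can be estimated from 64 samples per query; the random 0/1 matrix stands in for real verifier outputs.

```python
import numpy as np

# correct[q, s] = 1 if sample s for query q passes the automatic checker.
# In the reported setup, 64 responses are sampled per query at
# temperature 0.6 and top-p 0.95; the data below is a random stand-in.
rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=(100, 64))  # 100 queries x 64 samples

# pass@1 per query is the fraction of its samples that are correct;
# the reported figure averages this fraction over all queries.
pass_at_1 = correct.mean(axis=1).mean()
print(f"estimated pass@1 = {pass_at_1:.3f}")
```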

6.3 Training and Distillation Strategy

The distillation process leverages the high-capability teacher model to generate large "reasoning-rich" datasets. Student models are then fine-tuned via SFT (not full RL) to inherit reasoning patterns. While smaller models may underperform the teacher, they offer substantial cost and efficiency improvements.
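
The data-generation half of this pipeline can be sketched as follows; `teacher_generate` and `verify` are hypothetical callables standing in for the RL-trained teacher model and the rule-based quality filter.

```python
def build_distillation_set(teacher_generate, verify, prompts,
                           samples_per_prompt=4):
    """Sketch of distillation data generation: the RL-trained teacher
    samples reasoning traces, a verifier keeps only correct and
    readable ones, and the survivors become SFT pairs for the student.
    Both callables are hypothetical placeholders."""
    sft_pairs = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = teacher_generate(prompt)
            if verify(prompt, completion):  # correctness + readability filter
                sft_pairs.append({"prompt": prompt, "completion": completion})
    return sft_pairs
```

The student is then trained with ordinary supervised fine-tuning on the resulting pairs; no RL is run on the student itself.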


6.4 Observed Strengths & Weaknesses

Strengths:

  • Emergent reasoning capability via RL
  • High performance on reasoning benchmarks
  • Efficient multi-stage training approach

Weaknesses:

  • R1-Zero exhibits readability and language mixing issues due to skipping SFT
  • Distilled models experience some performance degradation compared to the full model
  • General instruction-following may lag in smaller or early-stage variants


7. Summary Table

| Component | Role | Example / Notes |
| --- | --- | --- |
| Policy Model (LLM) | Learns improved policy via RL | DeepSeek-R1, DeepSeek-R1-Zero |
| Reference Model | Provides KL regularization baseline | Frozen SFT model (in GRPO) |
| Reward Function | Scores responses | Correctness, readability, chain-of-thought format |
| Group Size \(G\) | Sampling granularity for GRPO | 8–16 outputs per prompt (typical) |
| Advantage \(A_i\) | Relative performance metric within group | Normalization: \((r_i - \mathrm{mean}) / \mathrm{std}\) |
| Objective (GRPO) | PPO-style surrogate + KL penalty | See GRPO doc for full derivation |
| Training Pipeline | Multi-stage (Cold-SFT → RL → SFT → RL → Distill) | Reasoning first, then broad instruction |
| Distillation | Transfer reasoning to smaller models | Student models 1.5B–70B params |
| Goal | Efficient reasoning/instruction fine-tuning | Stable RL fine-tuning for large LLMs |


8. Advantages & Limitations

Advantages

  • Emergent reasoning: R1-Zero demonstrates that reasoning capability can emerge via RL alone, without large human-annotated SFT datasets
  • Efficient training: Multi-stage strategy combining SFT, RL, filtering, and distillation
  • Verifiable rewards: Correctness and format-based signals reduce noise and training instability
  • Scalable deployment: Distillation enables smaller, deployable models with reasoning capability for cost-effective production use

Limitations

  • Limited transparency: Full details on datasets, hyperparameters, and training costs are not publicly available
  • Instruction-following gaps: General instruction-following beyond reasoning may lag, especially in R1-Zero and smaller distilled variants
  • Distillation trade-offs: SFT-only distillation may not fully transfer RL-derived benefits to smaller models
  • Filtering dependency: Effective reward design and RL output filtering remain critical; low-quality RL outputs create bottlenecks