Supervised Fine-Tuning (SFT)¶
Supervised Fine-Tuning (SFT) transforms a pre-trained base model into a useful assistant that follows instructions, respects formats, and exhibits desired interaction behavior. While pre-training builds a broad world model, SFT shapes how that knowledge is expressed.
Core concept: Behavioral alignment via supervised learning.
1. Conceptual Foundation¶
1.1 What SFT Optimizes¶
SFT maps broad knowledge into a consistent, controllable interface.
Conceptual shift:
- Pre-training: "Continue this text"
- SFT: "Respond appropriately to a user instruction"
This transformation happens through supervised learning, not reinforcement learning.
1.2 Core Components¶
- Instruction Tuning: Defines what behavior you teach
- Task Formatting: Defines how that behavior is presented
2. Training Mechanics¶
2.1 Training Objective¶
Standard cross-entropy loss with selective loss masking.
Given:
- \(x\) = prompt tokens (system + user)
- \(y\) = assistant response tokens
SFT minimizes the negative log-likelihood of the response conditioned on the prompt:

\[
\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid x, y_{<t}\right)
\]

Key: Tokens in \(x\) are excluded from the loss.
Why masking matters:
- Prevents prompt memorization
- Ensures gradients only optimize response generation
- Stabilizes alignment behavior
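In practice the mask is applied through the label tensor. A minimal sketch, assuming the common Hugging Face/PyTorch convention where label \(-100\) is skipped by the cross-entropy loss:

```python
# Minimal sketch of loss masking: prompt positions get label -100, which
# PyTorch's cross-entropy (and HF trainers) ignore by default (ignore_index=-100).
def build_masked_labels(prompt_ids: list, response_ids: list):
    input_ids = prompt_ids + response_ids
    labels = [-100] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Example: only the three response tokens contribute to the loss.
ids, labels = build_masked_labels([101, 7592, 2129], [2023, 2003, 102])
assert labels[:3] == [-100, -100, -100]
```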
2.2 Task Formatting¶
Modern SFT uses structured role-based templates (ChatML, LLaMA-style):
```
<|system|> You are a helpful assistant. <|end_of_text|>
<|user|> Summarize this article. <|end_of_text|>
<|assistant|> The article discusses...
```
Why formatting is critical:
- Implicit policy learning: Role tokens act as soft behavioral constraints
- Gradient routing: Loss masking + role tokens shape response behavior
- Inference controllability: Enables injection of safety/tools without retraining
- Multi-turn state compression: Helps model compress dialogue history into latent states
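A minimal sketch of rendering this template programmatically. The role and end-of-text tokens follow the example above and are illustrative; real pipelines use the chat template defined by the target tokenizer:

```python
# Sketch: render chat turns into the role-based template shown above,
# leaving the assistant turn open so the training target (or generation)
# continues from "<|assistant|> ".
def render_chat(messages: list) -> str:
    parts = [f"<|{m['role']}|> {m['content']} <|end_of_text|>" for m in messages]
    return "\n".join(parts) + "\n<|assistant|> "

prompt = render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this article."},
])
```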
3. Data Strategy¶
3.1 Prompt Diversity as Regularization¶
Diversity prevents shortcut learning and maintains circuit coverage.
Semantic diversity:
- Math and symbolic reasoning
- Code synthesis
- Creative generation
- Factual recall
- Conversational grounding
Structural diversity:
- Paraphrased intents → prevents lexical memorization
- Variable verbosity → avoids length priors
- Explicit vs implicit constraints → forces instruction parsing
Key insight: Insufficient diversity causes models to learn response templates instead of instruction semantics.
3.2 The LIMA Hypothesis¶
"Less Is More for Alignment"
~1,000 high-quality examples can outperform tens of thousands of noisy ones.
Implications:
- Curation > scale
- Labeler expertise is critical
- Reduces overfitting and style bias
3.3 Typical SFT Data Mix¶
- Reasoning and step-by-step explanations
- Creative and stylistic writing
- Coding and math problems
- Safety and refusal examples
- Multi-turn conversations
4. Synthetic Data Generation¶
4.1 Self-Instruct¶
- Start with small human-curated seed set
- Use strong model to generate new instructions/responses
- Filter for quality
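A compressed sketch of the loop. `generate` is a placeholder for a call to the strong model, and the quality filter shown is deliberately crude:

```python
import random

def self_instruct(seed_tasks, generate, n_rounds=100):
    """Self-Instruct sketch: sample a few seeds, ask a strong model for a
    new instruction plus a response, and keep pairs that pass a filter."""
    pool = list(seed_tasks)
    dataset = []
    for _ in range(n_rounds):
        seeds = random.sample(pool, k=min(3, len(pool)))
        instruction = generate("Write one new instruction in the style of:\n" + "\n".join(seeds))
        response = generate(instruction)
        # crude quality filter: non-empty and not a duplicate of an existing prompt
        if instruction and response and instruction not in pool:
            pool.append(instruction)
            dataset.append({"instruction": instruction, "response": response})
    return dataset
```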
4.2 Evol-Instruct as Curriculum Learning¶
Creates implicit difficulty progression:
- Base instruction
- Added constraints
- Multi-hop reasoning
- Strict formatting/safety requirements
Improves:
- Instruction decomposition
- Constraint satisfaction
- Planning depth
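A sketch of the difficulty ladder as iterative prompt rewriting. The evolution prompts below are illustrative rather than the original Evol-Instruct templates, and `generate` again stands in for an LLM call:

```python
# Illustrative evolution steps; each rewrite makes the instruction harder.
EVOLUTIONS = [
    "Add a concrete constraint (word limit, required format) to this task: {inst}",
    "Rewrite this task so answering requires at least two reasoning steps: {inst}",
    "Add a strict output-format and safety requirement to this task: {inst}",
]

def evolve(instruction, generate):
    """Return a curriculum of progressively harder variants of one instruction."""
    curriculum = [instruction]
    for template in EVOLUTIONS:
        instruction = generate(template.format(inst=instruction))
        curriculum.append(instruction)
    return curriculum
```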
4.3 Rejection Sampling (Best-of-N)¶
Process:
- Generate \(K\) responses per prompt
- Score using reward model or stronger reference model
- Select best response
- Fine-tune on selected outputs
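A minimal Best-of-N sketch, with `sample` standing in for the policy model and `score` for the reward model or stronger reference model (both placeholders):

```python
def best_of_n(prompt, sample, score, n=8):
    """Rejection sampling sketch: draw n candidates, keep the highest-scoring one."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: score(prompt, response))

def build_sft_pairs(prompts, sample, score, n=8):
    # The selected (prompt, best response) pairs become the next SFT dataset.
    return [{"prompt": p, "response": best_of_n(p, sample, score, n)} for p in prompts]
```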
Benefits:
- Sharpens instruction adherence without RL
- Reduces variance vs RLHF
- Biases toward high-reward modes
Risks:
- Over-optimization toward reward model
- Reduced output diversity
- Reward hacking with weak scorers
4.4 Reasoning Distillation¶
A variant specific to reasoning models: run a large System 2 model (o1, DeepSeek-R1) with high inference compute, collect its verified reasoning traces, then SFT a smaller model on those traces. This has enabled 7B–14B models to exhibit structured step-by-step reasoning previously limited to frontier-scale systems.
Unlike standard rejection sampling (which samples from the model being trained), reasoning distillation uses a stronger teacher — the small model learns reasoning patterns it could not discover independently.
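A sketch of trace collection with answer verification, assuming each problem carries a reference answer; `teacher_generate` and `extract_answer` are hypothetical helpers:

```python
def collect_distillation_traces(problems, teacher_generate, extract_answer):
    """Keep only teacher traces whose final answer matches the reference;
    the surviving (question, trace) pairs become SFT data for the student."""
    traces = []
    for problem in problems:
        trace = teacher_generate(problem["question"])   # long chain-of-thought output
        if extract_answer(trace) == problem["answer"]:  # verify before keeping
            traces.append({"prompt": problem["question"], "response": trace})
    return traces
```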
5. Training Optimizations¶
| Technique | Purpose | Explanation |
|---|---|---|
| Packing | Throughput | Concatenates short samples into single context to avoid padding waste |
| Loss Masking | Correct gradients | Computes loss only on assistant tokens |
| NEFTune | Generalization | Adds noise to embeddings to prevent token-level overfitting |
| Low LR | Stability | Typical: \(10^{-6}\) to \(5 \times 10^{-6}\) |
| Dropout | Regularization | Reduces stylistic memorization |
5.1 Packing vs Padding¶
Padding: All sequences padded to longest length
- Wastes compute on [PAD] tokens
- Low token utilization
- Simple implementation
Packing: Multiple samples concatenated into one sequence
[Prompt₁ → Response₁ <EOS> Prompt₂ → Response₂ <EOS> ...]
Benefits:
- 2-3x throughput improvement
- Higher gradient signal density
- Better GPU utilization
Implementation requirements:
- Loss masking at sample boundaries
- Attention masking to prevent cross-example leakage
- Proper EOS handling
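A greedy packing sketch over already-tokenized, already-masked examples. Real implementations additionally build a block-diagonal attention mask (or reset position ids) so packed examples cannot attend to each other:

```python
def pack_examples(examples, max_len, eos_id):
    """Concatenate (input_ids, labels) pairs into sequences of at most max_len
    tokens, separated by EOS. Loss masking is assumed to be applied upstream;
    examples longer than max_len are not split in this sketch."""
    packs, cur_ids, cur_labels = [], [], []
    for ex in examples:
        ids = ex["input_ids"] + [eos_id]
        labels = ex["labels"] + [eos_id]
        if cur_ids and len(cur_ids) + len(ids) > max_len:
            packs.append({"input_ids": cur_ids, "labels": cur_labels})
            cur_ids, cur_labels = [], []
        cur_ids = cur_ids + ids
        cur_labels = cur_labels + labels
    if cur_ids:
        packs.append({"input_ids": cur_ids, "labels": cur_labels})
    return packs
```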
5.2 NEFTune (Noisy Embeddings Fine-Tuning)¶
Core idea: Inject controlled noise into embeddings during training.
Why it works:
- Prevents token memorization
- Improves OOD robustness
- Smooths loss landscape
How:
- Add small random noise to the token embeddings during training (the original NEFTune recipe samples uniform noise in \([-1, 1]\) and scales it by \(\alpha / \sqrt{Ld}\), where \(L\) is sequence length and \(d\) is embedding dimension)
- Keep the noise magnitude small so token semantics are preserved
- The continuously varying noise regularizes gradient updates; it is disabled at inference
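A sketch of the noise injection itself, using the scaling from the NEFTune paper; \(\alpha\) is a tuned hyperparameter (values around 5-15 are typical):

```python
import torch

def neftune_noise(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add NEFTune-style noise to token embeddings during training.
    embeddings: (batch, seq_len, hidden_dim). Uniform noise in [-1, 1],
    scaled by alpha / sqrt(seq_len * hidden_dim). Skip this at inference."""
    seq_len, dim = embeddings.shape[1], embeddings.shape[2]
    scale = alpha / (seq_len * dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-1.0, 1.0) * scale
    return embeddings + noise
```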
Recent trends (2025-2026):
- Standard in instruction-tuning pipelines
- Layer-wise noise schedules common
- Combined with packing for long-context dialogues
6. Common Challenges¶
6.1 Catastrophic Forgetting¶
Model loses pre-training knowledge.
Causes:
- Narrow SFT domain
- High learning rates
- Full fine-tuning on small datasets
Mitigations:
- Mix 5-10% pre-training data
- Use LoRA/PEFT methods
- Lower learning rates
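A small sketch of the replay mitigation: each batch keeps a 5-10% slice of raw pre-training text so gradients keep touching the original distribution:

```python
import random

def mixed_batch(sft_examples, pretrain_examples, batch_size=32, replay_frac=0.07):
    """Fill roughly 5-10% of each SFT batch with pre-training samples (replay)."""
    n_replay = max(1, int(batch_size * replay_frac))
    batch = random.sample(sft_examples, batch_size - n_replay)
    batch += random.sample(pretrain_examples, n_replay)
    random.shuffle(batch)
    return batch
```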
6.2 Overfitting¶
Model learns labeler style, not task intent.
Symptoms:
- Over-politeness
- Repetitive phrasing
- Template-like answers
Mitigations:
- Prompt diversity
- Early stopping
- NEFTune noise injection
6.3 Increased Hallucinations¶
Caused by knowledge contradiction between SFT and pre-training data.
Mitigations:
- Fact-consistent SFT data
- Retrieval-augmented generation
- Post-SFT preference optimization
7. LoRA vs Full Fine-Tuning¶
7.1 LoRA (PEFT)¶
- ✅ Low compute cost
- ✅ Preserves base model knowledge
- ✅ Lower forgetting risk
- Use for: Behavior/style changes
7.2 Full Fine-Tuning¶
- ✅ Handles large domain shifts
- ⚠️ Higher forgetting risk
- ⚠️ Requires careful regularization
- Use for: Knowledge changes
Rule of thumb: Behavior change → LoRA | Knowledge change → Full fine-tuning
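A minimal LoRA setup sketch using the Hugging Face peft library; the checkpoint name and target module names (q_proj, v_proj) are illustrative and depend on the base architecture:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Sketch: wrap a frozen base model with low-rank adapters so only a small
# fraction of parameters is trained, which limits forgetting.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example checkpoint
config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # architecture-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()
```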
8. SFT vs Pre-training¶
| Aspect | Pre-training | Supervised Fine-Tuning |
|---|---|---|
| Objective | World modeling | Behavior alignment |
| Data scale | Trillions of tokens | 10k-100k samples |
| Loss | Full sequence NTP | Masked response NTP |
| Compute | Massive | Moderate |
| Primary risk | Under-training | Overfitting, forgetting |
9. Key Takeaways¶
- SFT is about behavior, not knowledge – most capabilities come from pre-training
- Quality > Quantity – LIMA showed ~1k great examples can beat 10k mediocre ones
- Diversity is regularization – prevents template learning and shortcut heuristics
- Implementation details matter – packing, masking, and NEFTune significantly impact results
- Balance is critical – avoid forgetting pre-training knowledge while learning new behavior
- Synthetic data is powerful – but requires careful quality control and diversity
- Monitor for alignment taxes – hallucinations, over-refusal, and repetitive style
10. RL Post-Training and Reasoning Models¶
SFT is one form of post-training. RL-based post-training — using RLHF, PPO, DPO, or GRPO with outcome or process reward models — is what activates inference-time reasoning strategies in models like o1 and DeepSeek-R1. These techniques are covered in the LLM Alignment & Reasoning repo.
The reasoning distillation approach (§4.4) is the bridge back: RL-trained reasoning traces from a large model become SFT training data for smaller models.