ZeRO and DeepSpeed¶
1. The Core Problem¶
Standard Data Parallelism replicates all model states on every GPU (Ψ = number of model parameters):
- Model parameters: 2Ψ bytes (FP16)
- Gradients: 2Ψ bytes (FP16)
- Optimizer states: 12Ψ bytes (mixed-precision Adam: FP32 master weights, momentum, variance)
Total per GPU: ~16Ψ bytes
For a 7B model:
- 7B × 16 bytes = 112 GB (doesn't fit on 80GB GPU!)
ZeRO's insight: This redundancy is unnecessary. Share the memory burden.
2. ZeRO: Zero Redundancy Optimizer¶
Goal: Eliminate memory redundancy while keeping the efficiency of Data Parallelism.
The Three Stages¶
| Stage | Shards | Memory per GPU | Communication | Example (7B, 8 GPUs) |
|---|---|---|---|---|
| ZeRO-1 | Optimizer states only | 4Ψ + 12Ψ/N | ≈ DP (All-Reduce gradients, All-Gather updated params) | 14 + 14 + 84/8 = 38.5 GB |
| ZeRO-2 | Optimizer + Gradients | 2Ψ + 14Ψ/N | ≈ DP (Reduce-Scatter gradient shards, All-Gather updated params) | 14 + 14/8 + 84/8 = 26.25 GB |
| ZeRO-3 | Optimizer + Gradients + Parameters | 16Ψ/N | ~1.5× DP (gather params on the fly) | 14/8 + 14/8 + 84/8 = 14 GB |
Note: These numbers exclude activations, which can be substantial.
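The "Memory per GPU" column follows directly from which of the three terms gets divided by N. A quick back-of-envelope helper (a sketch; the function name and the FP16/Adam byte counts are assumptions from the setup above):
def zero_memory_gb(n_params_billion, n_gpus, stage):
    """Per-GPU model-state memory in GB: FP16 params/grads (2 bytes each),
    FP32 Adam states (12 bytes/param). Activations not included."""
    p = n_params_billion              # Ψ in billions, so 1 byte/param ≈ 1 GB
    params, grads, optim = 2 * p, 2 * p, 12 * p
    if stage >= 1: optim /= n_gpus    # ZeRO-1: shard optimizer states
    if stage >= 2: grads /= n_gpus    # ZeRO-2: + shard gradients
    if stage >= 3: params /= n_gpus   # ZeRO-3: + shard parameters
    return params + grads + optim

print(zero_memory_gb(7, 8, 1))  # 38.5
print(zero_memory_gb(7, 8, 2))  # 26.25
print(zero_memory_gb(7, 8, 3))  # 14.0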
3. Stage-by-Stage Breakdown¶
ZeRO-1: Optimizer State Sharding¶
What's sharded: Only optimizer states (FP32 master weights, momentum, variance)
Each GPU stores:
- Full parameters (2Ψ)
- Full gradients (2Ψ)
- 1/N of optimizer states (12Ψ / N)
Communication (see the sketch at the end of this subsection):
- Standard All-Reduce for gradients (same as DDP)
- Each GPU updates only the parameter slice whose optimizer states it owns, then the updated parameters are All-Gathered
Memory savings: up to 4× (from 16Ψ to 4Ψ + 12Ψ/N, ≈4Ψ for large N)
Use case: Drop-in improvement over DDP with minimal overhead
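A minimal sketch of that update pattern with raw torch.distributed collectives — illustrative only, not DeepSpeed's implementation; the simplified Adam update (fixed betas, no bias correction) and the flat-slice sharding are assumptions:
import torch
import torch.distributed as dist

@torch.no_grad()
def zero1_step(param, grad, exp_avg, exp_avg_sq, lr=1e-4):
    """One ZeRO-1-style step. exp_avg / exp_avg_sq hold this rank's 1/N Adam states."""
    rank, world = dist.get_rank(), dist.get_world_size()
    dist.all_reduce(grad, op=dist.ReduceOp.AVG)        # full gradients everywhere, as in DDP
    flat, gflat = param.flatten(), grad.flatten()      # assumes contiguous, numel divisible by world
    n = flat.numel() // world
    my = slice(rank * n, (rank + 1) * n)
    g = gflat[my]
    exp_avg.mul_(0.9).add_(g, alpha=0.1)               # simplified Adam moments
    exp_avg_sq.mul_(0.999).addcmul_(g, g, value=0.001)
    flat[my] -= lr * exp_avg / (exp_avg_sq.sqrt() + 1e-8)
    dist.all_gather_into_tensor(flat, flat[my].contiguous())  # replicate updated params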
ZeRO-2: + Gradient Sharding¶
What's sharded: Optimizer states + gradients
Each GPU stores:
- Full parameters (2Ψ)
- 1/N of gradients (2Ψ / N)
- 1/N of optimizer states (12Ψ / N)
Communication:
- Reduce-Scatter instead of All-Reduce during backward
- Each GPU receives only its gradient shard
- Each GPU updates its own parameter slice with its local optimizer states, then updated parameters are All-Gathered (same as ZeRO-1)
Memory savings: up to 8× (from 16Ψ to 2Ψ + 14Ψ/N, ≈2Ψ for large N)
Use case: Models up to ~13B on 8×A100 (80GB); at 13B the model states alone are ~49 GB per GPU (26 + 14×13/8), so a 40GB card would already be too small
ZeRO-3: + Parameter Sharding¶
What's sharded: Everything (optimizer states + gradients + parameters)
Each GPU stores:
- 1/N of parameters (2Ψ / N)
- 1/N of gradients (2Ψ / N)
- 1/N of optimizer states (12Ψ / N)
Communication:
Forward pass:
- All-Gather to reconstruct layer parameters
- Compute forward
- Discard parameters
Backward pass:
- All-Gather to reconstruct layer parameters (again)
- Compute backward
- Reduce-Scatter gradients to owning GPU
- Discard parameters
Memory savings: N× (from 16Ψ to 16Ψ/N, linear in GPU count)
Trade-off: ~50% more communication volume than standard DP (parameters are All-Gathered twice per layer: once in forward, once in backward)
Use case: Very large models (70B+ parameters)
4. ZeRO-3 Communication Pattern¶
# Pseudo-code for one layer forward/backward
# Forward
params = all_gather(param_shard) # Reconstruct full parameters
output = layer(input, params)
del params # Free memory immediately
# Backward
params = all_gather(param_shard) # Reconstruct again
grad_input, grad_params = backward(output_grad, params)
del params
grad_shard = reduce_scatter(grad_params) # Each GPU gets its shard
Key insight: Parameters are materialized only when needed, then immediately freed.
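The same pattern with real torch.distributed collectives, for a single linear layer whose weight rows are sharded 1/N per rank — a minimal sketch of the idea, not DeepSpeed's implementation (function names and the row-sharding scheme are assumptions):
import torch
import torch.distributed as dist

def zero3_linear_forward(x, weight_shard):
    """All-Gather the sharded weight, compute, free the full copy."""
    world = dist.get_world_size()
    rows, cols = weight_shard.shape
    full = torch.empty(world * rows * cols, dtype=weight_shard.dtype,
                       device=weight_shard.device)
    dist.all_gather_into_tensor(full, weight_shard.flatten())  # materialize params
    out = x @ full.view(world * rows, cols).T
    del full                                                   # free immediately
    return out

def shard_gradient(full_grad):
    """Reduce-Scatter: sum the full gradient across ranks, keep this rank's 1/N."""
    world = dist.get_world_size()
    shard = torch.empty(full_grad.numel() // world,
                        dtype=full_grad.dtype, device=full_grad.device)
    dist.reduce_scatter_tensor(shard, full_grad.flatten())
    return shard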
5. ZeRO Offload¶
Problem: Even ZeRO-3 may not fit very large models on GPU.
Solution: Offload to CPU RAM (or even NVMe).
CPU Offload¶
What's offloaded:
- Optimizer states (always)
- Parameters (optional)
- Gradients (optional)
Workflow:
GPU: Compute-heavy operations (forward, backward)
↕ PCIe transfer
CPU: Store optimizer states, occasionally parameters
When to use:
- Training 10B+ models on consumer GPUs (e.g., RTX 3090)
- Limited GPU memory
- Have sufficient CPU RAM
Trade-off: GPU memory ↓↓, Training speed ↓ (PCIe bandwidth: ~25 GB/s vs GPU memory: ~2000 GB/s)
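A back-of-envelope check on that gap, using the 7B/Adam numbers from above (the bandwidth figures are the rough estimates quoted, not measurements):
optimizer_bytes = 7e9 * 12        # Adam states for a 7B model: 84 GB
pcie_bw, hbm_bw = 25e9, 2000e9    # bytes/s: ~PCIe 4.0 x16 vs. A100 HBM
print(optimizer_bytes / pcie_bw)  # ~3.4 s to cross PCIe once
print(optimizer_bytes / hbm_bw)   # ~0.04 s from GPU memory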
NVMe Offload (ZeRO-Infinity)¶
Extreme offloading to SSD storage.
Use case: Trillion-parameter models (research, not production)
Trade-off: GPU memory ↓↓↓, Training speed ↓↓↓ (NVMe: ~7 GB/s)
6. DeepSpeed: The Library¶
DeepSpeed is Microsoft's implementation of ZeRO with many optimizations.
Key Features¶
- ZeRO Stages 1-3: As described above
- ZeRO-Offload: CPU and NVMe offloading
- Optimized Kernels: Custom CUDA kernels for:
  - Fused Adam optimizer
  - Fused layer norm
- Communication-computation overlap
- Mixed Precision: FP16, BF16, FP8 support
- Gradient Accumulation: Built-in with ZeRO
- Activation Checkpointing: Integrated memory optimization
Configuration Example¶
{
"train_batch_size": 256,
"gradient_accumulation_steps": 16,
"gradient_clipping": 1.0,
"fp16": {
"enabled": true,
"loss_scale": 0,
"initial_scale_power": 16
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"reduce_bucket_size": 500000000,
"stage3_prefetch_bucket_size": 500000000,
"stage3_param_persistence_threshold": 1000000
}
}
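A minimal training loop consuming this config through the DeepSpeed engine API (model and dataloader are assumed to be defined elsewhere):
import deepspeed

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,                      # your torch.nn.Module
    model_parameters=model.parameters(),
    config="ds_config.json",          # the JSON above
)
for batch in dataloader:
    loss = model_engine(batch)        # forward (engine gathers ZeRO-3 params)
    model_engine.backward(loss)       # backward + gradient Reduce-Scatter
    model_engine.step()               # sharded optimizer step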
7. Memory Calculation Example¶
7B model, 8 GPUs, FP16, Adam optimizer
Without ZeRO (Standard DP)¶
Per GPU:
- Parameters: 14 GB
- Gradients: 14 GB
- Optimizer: 84 GB
- Total: 112 GB ❌ (doesn't fit on 80GB A100)
With ZeRO-1¶
Per GPU:
- Parameters: 14 GB
- Gradients: 14 GB
- Optimizer: 84/8 = 10.5 GB
- Total: 38.5 GB ✅
With ZeRO-2¶
Per GPU:
- Parameters: 14 GB
- Gradients: 14/8 = 1.75 GB
- Optimizer: 84/8 = 10.5 GB
- Total: 26.25 GB ✅✅
With ZeRO-3¶
Per GPU:
- Parameters: 14/8 = 1.75 GB
- Gradients: 14/8 = 1.75 GB
- Optimizer: 84/8 = 10.5 GB
- Total: 14 GB ✅✅✅
Note: These exclude activations! With batch_size=32, sequence_length=2048, activations can be 10-20 GB.
8. Practical Tips¶
1. Start with ZeRO-2¶
# Usually sufficient for most models
# Less communication overhead than ZeRO-3
ds_config = {
"zero_optimization": {"stage": 2}
}
2. Monitor Memory¶
# Check memory at key points with DeepSpeed's utility
from deepspeed.runtime.utils import see_memory_usage
see_memory_usage("After forward", force=True)
3. Tune Bucket Sizes¶
# Larger buckets: fewer collective launches (less overhead), coarser overlap
# Smaller buckets: more launches, finer-grained overlap
{
"reduce_bucket_size": 500_000_000, # 500MB
"allgather_bucket_size": 500_000_000
}
4. Enable Communication Overlap¶
{
    "overlap_comm": True,            # essential for ZeRO-3
    "contiguous_gradients": True     # avoids memory fragmentation
}
9. Integration with Hugging Face¶
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
output_dir="./output",
per_device_train_batch_size=4,
gradient_accumulation_steps=16,
fp16=True,
deepspeed="ds_config.json", # DeepSpeed config file
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
)
trainer.train()
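The script is then started with a distributed launcher; with DeepSpeed's own launcher, for example:
deepspeed --num_gpus=8 train.py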
Key Takeaways¶
- ZeRO eliminates redundancy in Data Parallelism by sharding model states
- Three stages: Optimizer (Z1), + Gradients (Z2), + Parameters (Z3)
- ZeRO-2 is the sweet spot for most use cases (good memory savings, low overhead)
- ZeRO-3 for extreme scale (70B+) - trades communication for memory
- Offloading enables training on limited GPUs but significantly slows training
- DeepSpeed is the reference implementation with optimized kernels
- Communication-computation overlap is critical for ZeRO-3 efficiency
- Memory savings scale linearly with GPUs (Z3: 112GB → 14GB on 8 GPUs)