Pipeline Parallelism¶
1. Core Concept¶
Pipeline Parallelism splits the model by layer depth across devices.
Key Idea: Each GPU owns a contiguous block of layers (a "stage").
GPU 0: Layers 1-3   ─┐
GPU 1: Layers 4-6    ├─ Forward: 0→1→2→3
GPU 2: Layers 7-9    │  Backward: 3→2→1→0
GPU 3: Layers 10-12 ─┘
When to use: Very deep models where depth is the bottleneck.
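The stage assignment itself is simple index arithmetic. A minimal sketch in plain Python (the helper name is my own; real frameworks also balance stages by compute cost, not just layer count):

```python
def partition_layers(num_layers, num_stages):
    """Assign contiguous blocks of layer indices (1-based) to stages,
    giving earlier stages one extra layer when the split is uneven."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 1
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages

# 12 layers over 4 GPUs, as in the diagram above:
print(partition_layers(12, 4))  # [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
```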
2. The Pipeline Bubble Problem¶
Naive Pipeline (No Micro-Batching)¶
Timeline for 4 GPUs processing 1 batch (Fw = forward, Bw = backward, -- = idle):
GPU 0: Fw -- -- -- -- -- -- Bw
GPU 1: -- Fw -- -- -- -- Bw --
GPU 2: -- -- Fw -- -- Bw -- --
GPU 3: -- -- -- Fw Bw -- -- --
Problem: each GPU is busy for only 2 of 8 time slots → 75% idle time!
This idle time is called the "pipeline bubble".
3. Micro-Batching Solution¶
Split the global batch into M micro-batches that flow through the pipeline.
Example: 4 Micro-Batches¶
Timeline with 4 micro-batches (F = forward, B = backward, -- = idle):
GPU 0: F1 F2 F3 F4 -- -- -- -- -- -- B4 B3 B2 B1
GPU 1: -- F1 F2 F3 F4 -- -- -- -- B4 B3 B2 B1 --
GPU 2: -- -- F1 F2 F3 F4 -- -- B4 B3 B2 B1 -- --
GPU 3: -- -- -- F1 F2 F3 F4 B4 B3 B2 B1 -- -- --
Result: bubbles remain only while the pipeline fills and drains. Utilization rises from 25% (2 of 8 slots) to 8/14 ≈ 57% here, and keeps improving as the micro-batch count grows.
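The split itself is mechanical. A sketch with plain lists (the function name is my own; with tensors this is typically a one-liner like `global_batch.chunk(M)` in PyTorch):

```python
def split_into_microbatches(batch, m):
    """Split a global batch into m micro-batches of (near-)equal size."""
    size, extra = divmod(len(batch), m)
    micro_batches, start = [], 0
    for i in range(m):
        end = start + size + (1 if i < extra else 0)
        micro_batches.append(batch[start:end])
        start = end
    return micro_batches

mbs = split_into_microbatches(list(range(32)), 4)
print([len(mb) for mb in mbs])  # [8, 8, 8, 8]
```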
4. Pipeline Bubble Calculation¶
For P pipeline stages and M micro-batches (assuming equal forward/backward cost per micro-batch):
- Bubble overhead = bubble time / ideal compute time = (P − 1) / M
- Idle fraction = bubble time / total time = (P − 1) / (M + P − 1)
Example (4 stages):
- 1 micro-batch: 3/4 = 75% of time idle ❌
- 8 micro-batches: 3/11 ≈ 27% idle ⚠️
- 16 micro-batches: 3/19 ≈ 16% idle ✅
Rule of thumb: use at least 4× as many micro-batches as pipeline stages.
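The two quantities above are easy to tabulate; a small sketch (function name is my own):

```python
def bubble_stats(p, m):
    """Pipeline bubble for p stages and m micro-batches,
    assuming equal forward/backward cost per micro-batch.

    Returns (overhead, idle_fraction):
      overhead      = bubble time / ideal compute time = (p - 1) / m
      idle_fraction = bubble time / total time         = (p - 1) / (m + p - 1)
    """
    overhead = (p - 1) / m
    idle_fraction = (p - 1) / (m + p - 1)
    return overhead, idle_fraction

for m in (1, 8, 16):
    _, idle = bubble_stats(4, m)
    print(f"4 stages, {m:>2} micro-batches: {idle:.1%} of time idle")
```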
5. Pipeline Schedules¶
1. GPipe (Fill-Drain)¶
Simple schedule: Fill pipeline, then drain it.
F=Forward, B=Backward
Stage 0: F1 F2 F3 F4 -- -- -- -- -- -- B4 B3 B2 B1
Stage 1: -- F1 F2 F3 F4 -- -- -- -- B4 B3 B2 B1 --
Stage 2: -- -- F1 F2 F3 F4 -- -- B4 B3 B2 B1 -- --
Stage 3: -- -- -- F1 F2 F3 F4 B4 B3 B2 B1 -- -- --
Pros: Simple to implement
Cons: Large bubbles at start and end
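The fill-drain grid above can be generated mechanically, which makes the bubble structure explicit. A sketch (function name is my own; backward runs in reverse micro-batch order so activations pop off a LIFO stack):

```python
def gpipe_timeline(p, m):
    """Per-stage slot grid for a fill-drain (GPipe) schedule.
    Stage s runs F1..Fm starting at slot s, then Bm..B1 so that
    B_i on stage s ends s slots before the overall finish."""
    total = 2 * (m + p - 1)
    rows = []
    for s in range(p):
        row = ["--"] * total
        for i in range(m):
            row[s + i] = f"F{i + 1}"            # forward wave moves down-right
        for i in range(m):
            row[total - 1 - s - i] = f"B{i + 1}"  # backward wave moves up-right
        rows.append(row)
    return rows

for s, row in enumerate(gpipe_timeline(4, 4)):
    print(f"Stage {s}: " + " ".join(row))
```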
2. PipeDream-Flush (1F1B)¶
One-Forward-One-Backward: Interleave forward and backward.
Stage 0: F1 F2 F3 F4 -- -- -- B1 -- B2 -- B3 -- B4
Stage 1: -- F1 F2 F3 -- -- B1 F4 B2 -- B3 -- B4 --
Stage 2: -- -- F1 F2 -- B1 F3 B2 F4 B3 -- B4 -- --
Stage 3: -- -- -- F1 B1 F2 B2 F3 B3 F4 B4 -- -- --
Pros: much lower activation memory (each stage holds at most P − stage in-flight micro-batches instead of all M); gradients start flowing earlier
Cons: more complex synchronization; the bubble itself is the same size as GPipe's
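Each stage's 1F1B operation order follows a warmup / steady-state / cooldown pattern: stage s issues P − 1 − s forwards up front, then alternates one forward with one backward, then drains the remaining backwards. A sketch (function name is my own; idle slots and timing are omitted):

```python
def one_f_one_b_ops(stage, p, m):
    """Operation order for one stage under PipeDream-Flush (1F1B)."""
    warmup = min(p - 1 - stage, m)
    ops = [f"F{i + 1}" for i in range(warmup)]      # warmup forwards
    next_b = 0
    for i in range(warmup, m):                      # steady state: 1F then 1B
        ops.append(f"F{i + 1}")
        ops.append(f"B{next_b + 1}")
        next_b += 1
    ops += [f"B{i + 1}" for i in range(next_b, m)]  # cooldown backwards
    return ops

print(one_f_one_b_ops(3, 4, 4))  # ['F1', 'B1', 'F2', 'B2', 'F3', 'B3', 'F4', 'B4']
print(one_f_one_b_ops(0, 4, 4))  # ['F1', 'F2', 'F3', 'F4', 'B1', 'B2', 'B3', 'B4']
```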
3. Interleaved 1F1B (Virtual Pipeline Stages)¶
Each GPU handles multiple non-contiguous stages.
Model split into 8 virtual stages on 4 GPUs:
GPU 0: Stages 1, 5
GPU 1: Stages 2, 6
GPU 2: Stages 3, 7
GPU 3: Stages 4, 8
Pros: Further reduces bubble
Cons: More complex, higher communication
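The table above is a round-robin mapping of virtual stages to GPUs; a one-line sketch (helper name is my own):

```python
def interleaved_assignment(num_gpus, chunks_per_gpu):
    """Round-robin assignment of virtual stages (1-based) to GPUs:
    GPU g owns stages g+1, g+1+num_gpus, g+1+2*num_gpus, ..."""
    return {
        g: [g + 1 + r * num_gpus for r in range(chunks_per_gpu)]
        for g in range(num_gpus)
    }

print(interleaved_assignment(4, 2))
# {0: [1, 5], 1: [2, 6], 2: [3, 7], 3: [4, 8]}
```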
6. Memory Trade-offs¶
Activation Memory¶
Unlike DP or TP, PP increases activation memory.
Why?
- Must store activations for every in-flight micro-batch
- Each micro-batch's activations are held from its forward pass until its backward pass
Memory: O(M × activation_size) under fill-drain (GPipe); 1F1B caps this at O(P × activation_size)
Where M = number of micro-batches and P = number of stages.
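Under the usual equal-cost assumption, the peak in-flight count per stage can be sketched as follows (helper name is mine; the 1F1B bound is its warmup depth of P − stage − 1 forwards plus the current one):

```python
def peak_inflight_microbatches(schedule, p, m, stage):
    """Peak number of micro-batches whose activations a stage must hold.
    Fill-drain (GPipe) stores all m; 1F1B caps it at p - stage."""
    if schedule == "gpipe":
        return m
    if schedule == "1f1b":
        return min(m, p - stage)
    raise ValueError(f"unknown schedule: {schedule}")

# 4 stages, 16 micro-batches:
print(peak_inflight_microbatches("gpipe", 4, 16, 0))  # 16
print(peak_inflight_microbatches("1f1b", 4, 16, 0))   # 4
```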
Memory vs Utilization Trade-off¶
- More micro-batches: Better utilization, higher memory
- Fewer micro-batches: Lower memory, worse utilization
Example: GPT-3 175B
- 8 pipeline stages
- 32 micro-batches → Good utilization but ~4GB activations per stage
- 8 micro-batches → Poor utilization but ~1GB activations per stage
7. Communication Pattern¶
Forward Pass¶
- Send activations from stage i to stage i+1
- Point-to-point communication (Send/Recv)
Backward Pass¶
- Send gradients from stage i+1 to stage i
- Reverse order of forward pass
Bandwidth: Lower than TP (only inter-stage, not all-to-all)
Latency: Higher than DP (sequential dependency)
8. Practical Implementation¶
GPipe Style (PyTorch)¶
```python
import torch
import torch.distributed as dist

# Sketch only: assumes the process group is initialized and each
# stage runs in its own process with known tensor shapes.
class PipelineStage:
    def __init__(self, module, stage_id, num_stages):
        self.module = module
        self.stage_id = stage_id
        self.num_stages = num_stages

    def run(self, micro_batches):
        inputs, outputs = [], []
        # Fill: forward pass for all micro-batches
        for mb in micro_batches:
            if self.stage_id > 0:
                # Receive activations from the previous stage
                dist.recv(mb, src=self.stage_id - 1)
                mb.requires_grad_()
            out = self.module(mb)
            if self.stage_id < self.num_stages - 1:
                # Send activations to the next stage
                dist.send(out, dst=self.stage_id + 1)
            inputs.append(mb)
            outputs.append(out)
        # Drain: backward pass in reverse micro-batch order
        for mb, out in zip(reversed(inputs), reversed(outputs)):
            if self.stage_id < self.num_stages - 1:
                # Receive dL/d(out) from the next stage
                grad = torch.empty_like(out)
                dist.recv(grad, src=self.stage_id + 1)
                out.backward(grad)
            else:
                # Last stage computes the loss, so `out` is a scalar
                out.backward()
            if self.stage_id > 0:
                # Send dL/d(input) back to the previous stage
                dist.send(mb.grad, dst=self.stage_id - 1)
        return outputs
```
9. Gradient Accumulation vs Micro-batching¶
Gradient Accumulation (DP)¶
```python
optimizer.zero_grad()
for micro_batch in batches:
    loss = model(micro_batch)
    loss.backward()   # Accumulate gradients across micro-batches
optimizer.step()      # One parameter update
```
Micro-batching (PP)¶
# Different micro-batches on different stages simultaneously
GPU 0: micro_batch_1 → GPU 1 → GPU 2 → GPU 3
GPU 0: micro_batch_2 → ...
Key difference: PP micro-batches are in-flight simultaneously across stages.
10. Debugging Tips¶
Issue: Poor Utilization¶
Symptoms: GPUs idle most of the time
Solutions:
- Increase number of micro-batches (4× pipeline stages minimum)
- Use 1F1B schedule instead of GPipe
- Profile bubble time
Issue: Out of Memory¶
Symptoms: OOM during training
Solutions:
- Reduce number of micro-batches (less activation memory)
- Use activation checkpointing
- Reduce batch size per micro-batch
Issue: Slow Communication¶
Symptoms: High time in Send/Recv
Solutions:
- Check inter-node bandwidth
- Ensure balanced stage sizes
- Consider hybrid TP+PP (TP within node, PP across nodes)
11. Advanced: Hybrid PP + TP¶
Most efficient setup for large models:
┌──────────────────┐
│ Pipeline Stage 0 │ ← 8-way Tensor Parallel
│  (Layers 1-12)   │
│    [GPU 0-7]     │
└──────────────────┘
         ↓ activation
┌──────────────────┐
│ Pipeline Stage 1 │ ← 8-way Tensor Parallel
│  (Layers 13-24)  │
│    [GPU 8-15]    │
└──────────────────┘
Benefits:
- TP reduces per-stage memory (use NVLink within node)
- PP reduces total parameter memory (across nodes)
- Minimizes cross-node communication (only inter-stage activations)
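The GPU numbering above follows from simple rank arithmetic. A sketch assuming TP groups are contiguous ranks within a node (function name is my own):

```python
def parallel_coords(rank, tp_size):
    """Map a global rank to (pp_stage, tp_rank), with TP groups as
    contiguous rank blocks (one block per node, PP across nodes)."""
    return rank // tp_size, rank % tp_size

# 16 GPUs, 8-way TP, 2 pipeline stages:
for rank in (0, 7, 8, 15):
    print(rank, parallel_coords(rank, 8))
```

Keeping the TP group contiguous means its heavy all-reduce traffic stays on intra-node NVLink, while only the lighter stage-to-stage activations cross nodes.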
12. Key Takeaways¶
- PP splits model by depth - each GPU owns contiguous layers
- Pipeline bubble is unavoidable - use micro-batching to minimize (aim for <20%)
- Memory trade-off: ↓ Parameters, ↑ Activations
- More micro-batches → better utilization but higher memory
- 1F1B schedule is better than naive fill-drain
- Not ideal for inference due to latency overhead
- Often combined with TP - TP within stages, PP across stages
- Rule: M ≥ 4P (micro-batches ≥ 4× pipeline stages)