Pipeline Parallelism¶
1. Core Concept¶
Pipeline Parallelism splits the model by layer depth across devices.
Key Idea: Each GPU owns a contiguous block of layers (a "stage").
GPU 0: Layers 1-3   ─┐
GPU 1: Layers 4-6    ├─ Forward: 0→1→2→3
GPU 2: Layers 7-9    │  Backward: 3→2→1→0
GPU 3: Layers 10-12 ─┘
When to use: Very deep models where depth is the bottleneck.
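The stage assignment itself is simple index arithmetic. A minimal sketch in plain Python (the helper name is my own; real frameworks also balance stages by compute cost, not just layer count):

```python
def partition_layers(num_layers, num_stages):
    """Assign contiguous blocks of layer indices (1-based) to stages,
    giving earlier stages one extra layer when the split is uneven."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 1
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages

# 12 layers over 4 GPUs, as in the diagram above:
print(partition_layers(12, 4))  # [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
```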
2. The Pipeline Bubble Problem¶
Naive Pipeline (No Micro-Batching)¶
Timeline for 4 GPUs processing 1 batch (Fw = forward, Bw = backward, -- = idle):
GPU 0: Fw -- -- -- -- -- -- Bw
GPU 1: -- Fw -- -- -- -- Bw --
GPU 2: -- -- Fw -- -- Bw -- --
GPU 3: -- -- -- Fw Bw -- -- --
Problem: each GPU is busy for only 2 of 8 time slots → 75% idle time!
This idle time is called the "pipeline bubble".
3. Micro-Batching Solution¶
Split the global batch into M micro-batches that flow through the pipeline.
Example: 4 Micro-Batches¶
Timeline with 4 micro-batches (F = forward, B = backward, -- = idle):
GPU 0: F1 F2 F3 F4 -- -- -- -- -- -- B4 B3 B2 B1
GPU 1: -- F1 F2 F3 F4 -- -- -- -- B4 B3 B2 B1 --
GPU 2: -- -- F1 F2 F3 F4 -- -- B4 B3 B2 B1 -- --
GPU 3: -- -- -- F1 F2 F3 F4 B4 B3 B2 B1 -- -- --
Result: bubbles remain only while the pipeline fills and drains. Utilization rises from 25% (2 of 8 slots) to 8/14 ≈ 57% here, and keeps improving as the micro-batch count grows.
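The split itself is mechanical. A sketch with plain lists (the function name is my own; with tensors this is typically a one-liner like `global_batch.chunk(M)` in PyTorch):

```python
def split_into_microbatches(batch, m):
    """Split a global batch into m micro-batches of (near-)equal size."""
    size, extra = divmod(len(batch), m)
    micro_batches, start = [], 0
    for i in range(m):
        end = start + size + (1 if i < extra else 0)
        micro_batches.append(batch[start:end])
        start = end
    return micro_batches

mbs = split_into_microbatches(list(range(32)), 4)
print([len(mb) for mb in mbs])  # [8, 8, 8, 8]
```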
4. Pipeline Bubble Calculation¶
For P pipeline stages and M micro-batches (assuming equal forward/backward cost per micro-batch):
- Bubble overhead = bubble time / ideal compute time = (P − 1) / M
- Idle fraction = bubble time / total time = (P − 1) / (M + P − 1)
Example (4 stages):
- 1 micro-batch: 3/4 = 75% of time idle ❌
- 8 micro-batches: 3/11 ≈ 27% idle ⚠️
- 16 micro-batches: 3/19 ≈ 16% idle ✅
Rule of thumb: use at least 4× as many micro-batches as pipeline stages.
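The two quantities above are easy to tabulate; a small sketch (function name is my own):

```python
def bubble_stats(p, m):
    """Pipeline bubble for p stages and m micro-batches,
    assuming equal forward/backward cost per micro-batch.

    Returns (overhead, idle_fraction):
      overhead      = bubble time / ideal compute time = (p - 1) / m
      idle_fraction = bubble time / total time         = (p - 1) / (m + p - 1)
    """
    overhead = (p - 1) / m
    idle_fraction = (p - 1) / (m + p - 1)
    return overhead, idle_fraction

for m in (1, 8, 16):
    _, idle = bubble_stats(4, m)
    print(f"4 stages, {m:>2} micro-batches: {idle:.1%} of time idle")
```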
5. Pipeline Schedules¶
1. GPipe (Fill-Drain)¶
Simple schedule: Fill pipeline, then drain it.
F=Forward, B=Backward
Stage 0: F1 F2 F3 F4 -- -- -- -- -- -- B4 B3 B2 B1
Stage 1: -- F1 F2 F3 F4 -- -- -- -- B4 B3 B2 B1 --
Stage 2: -- -- F1 F2 F3 F4 -- -- B4 B3 B2 B1 -- --
Stage 3: -- -- -- F1 F2 F3 F4 B4 B3 B2 B1 -- -- --
Pros: Simple to implement
Cons: Large bubbles at start and end
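The fill-drain grid above can be generated mechanically, which makes the bubble structure explicit. A sketch (function name is my own; backward runs in reverse micro-batch order so activations pop off a LIFO stack):

```python
def gpipe_timeline(p, m):
    """Per-stage slot grid for a fill-drain (GPipe) schedule.
    Stage s runs F1..Fm starting at slot s, then Bm..B1 so that
    B_i on stage s ends s slots before the overall finish."""
    total = 2 * (m + p - 1)
    rows = []
    for s in range(p):
        row = ["--"] * total
        for i in range(m):
            row[s + i] = f"F{i + 1}"            # forward wave moves down-right
        for i in range(m):
            row[total - 1 - s - i] = f"B{i + 1}"  # backward wave moves up-right
        rows.append(row)
    return rows

for s, row in enumerate(gpipe_timeline(4, 4)):
    print(f"Stage {s}: " + " ".join(row))
```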
2. PipeDream-Flush (1F1B)¶
One-Forward-One-Backward: Interleave forward and backward.
Stage 0: F1 F2 F3 F4 -- -- -- B1 -- B2 -- B3 -- B4
Stage 1: -- F1 F2 F3 -- -- B1 F4 B2 -- B3 -- B4 --
Stage 2: -- -- F1 F2 -- B1 F3 B2 F4 B3 -- B4 -- --
Stage 3: -- -- -- F1 B1 F2 B2 F3 B3 F4 B4 -- -- --
Pros: much lower activation memory (each stage holds at most P − stage in-flight micro-batches instead of all M); gradients start flowing earlier
Cons: more complex synchronization; the bubble itself is the same size as GPipe's
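Each stage's 1F1B operation order follows a warmup / steady-state / cooldown pattern: stage s issues P − 1 − s forwards up front, then alternates one forward with one backward, then drains the remaining backwards. A sketch (function name is my own; idle slots and timing are omitted):

```python
def one_f_one_b_ops(stage, p, m):
    """Operation order for one stage under PipeDream-Flush (1F1B)."""
    warmup = min(p - 1 - stage, m)
    ops = [f"F{i + 1}" for i in range(warmup)]      # warmup forwards
    next_b = 0
    for i in range(warmup, m):                      # steady state: 1F then 1B
        ops.append(f"F{i + 1}")
        ops.append(f"B{next_b + 1}")
        next_b += 1
    ops += [f"B{i + 1}" for i in range(next_b, m)]  # cooldown backwards
    return ops

print(one_f_one_b_ops(3, 4, 4))  # ['F1', 'B1', 'F2', 'B2', 'F3', 'B3', 'F4', 'B4']
print(one_f_one_b_ops(0, 4, 4))  # ['F1', 'F2', 'F3', 'F4', 'B1', 'B2', 'B3', 'B4']
```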
3. Interleaved 1F1B (Virtual Pipeline Stages)¶
Each GPU handles multiple non-contiguous stages.
Model split into 8 virtual stages on 4 GPUs:
GPU 0: Stages 1, 5
GPU 1: Stages 2, 6
GPU 2: Stages 3, 7
GPU 3: Stages 4, 8
Pros: Further reduces bubble
Cons: More complex, higher communication
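The table above is a round-robin mapping of virtual stages to GPUs; a one-line sketch (helper name is my own):

```python
def interleaved_assignment(num_gpus, chunks_per_gpu):
    """Round-robin assignment of virtual stages (1-based) to GPUs:
    GPU g owns stages g+1, g+1+num_gpus, g+1+2*num_gpus, ..."""
    return {
        g: [g + 1 + r * num_gpus for r in range(chunks_per_gpu)]
        for g in range(num_gpus)
    }

print(interleaved_assignment(4, 2))
# {0: [1, 5], 1: [2, 6], 2: [3, 7], 3: [4, 8]}
```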
6. Memory Trade-offs¶
Activation Memory¶
Unlike DP or TP, PP increases activation memory.
Why?
- Must store activations for every in-flight micro-batch
- Each micro-batch's activations are held from its forward pass until its backward pass
Memory: O(M × activation_size) under fill-drain (GPipe); 1F1B caps this at O(P × activation_size)
Where M = number of micro-batches and P = number of stages.
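Under the usual equal-cost assumption, the peak in-flight count per stage can be sketched as follows (helper name is mine; the 1F1B bound is its warmup depth of P − stage − 1 forwards plus the current one):

```python
def peak_inflight_microbatches(schedule, p, m, stage):
    """Peak number of micro-batches whose activations a stage must hold.
    Fill-drain (GPipe) stores all m; 1F1B caps it at p - stage."""
    if schedule == "gpipe":
        return m
    if schedule == "1f1b":
        return min(m, p - stage)
    raise ValueError(f"unknown schedule: {schedule}")

# 4 stages, 16 micro-batches:
print(peak_inflight_microbatches("gpipe", 4, 16, 0))  # 16
print(peak_inflight_microbatches("1f1b", 4, 16, 0))   # 4
```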
Memory vs Utilization Trade-off¶
- More micro-batches: Better utilization, higher memory
- Fewer micro-batches: Lower memory, worse utilization
Example: GPT-3 175B
- 8 pipeline stages
- 32 micro-batches → Good utilization but ~4GB activations per stage
- 8 micro-batches → Poor utilization but ~1GB activations per stage
7. Communication Pattern¶
Forward Pass¶
- Send activations from stage i to stage i+1
- Point-to-point communication (Send/Recv)
Backward Pass¶
- Send gradients from stage i+1 to stage i
- Reverse order of forward pass
Bandwidth: Lower than TP (only inter-stage, not all-to-all)
Latency: Higher than DP (sequential dependency)
8. Practical Implementation¶
GPipe Style (PyTorch)¶
```python
import torch
import torch.distributed as dist

# Sketch only: assumes the process group is initialized and each
# stage runs in its own process with known tensor shapes.
class PipelineStage:
    def __init__(self, module, stage_id, num_stages):
        self.module = module
        self.stage_id = stage_id
        self.num_stages = num_stages

    def run(self, micro_batches):
        inputs, outputs = [], []
        # Fill: forward pass for all micro-batches
        for mb in micro_batches:
            if self.stage_id > 0:
                # Receive activations from the previous stage
                dist.recv(mb, src=self.stage_id - 1)
                mb.requires_grad_()
            out = self.module(mb)
            if self.stage_id < self.num_stages - 1:
                # Send activations to the next stage
                dist.send(out, dst=self.stage_id + 1)
            inputs.append(mb)
            outputs.append(out)
        # Drain: backward pass in reverse micro-batch order
        for mb, out in zip(reversed(inputs), reversed(outputs)):
            if self.stage_id < self.num_stages - 1:
                # Receive dL/d(out) from the next stage
                grad = torch.empty_like(out)
                dist.recv(grad, src=self.stage_id + 1)
                out.backward(grad)
            else:
                # Last stage computes the loss, so `out` is a scalar
                out.backward()
            if self.stage_id > 0:
                # Send dL/d(input) back to the previous stage
                dist.send(mb.grad, dst=self.stage_id - 1)
        return outputs
```
9. Gradient Accumulation vs Micro-batching¶
Gradient Accumulation (DP)¶
```python
optimizer.zero_grad()
for micro_batch in batches:
    loss = model(micro_batch)
    loss.backward()   # Accumulate gradients across micro-batches
optimizer.step()      # One parameter update
```
Micro-batching (PP)¶
# Different micro-batches on different stages simultaneously
GPU 0: micro_batch_1 → GPU 1 → GPU 2 → GPU 3
GPU 0: micro_batch_2 → ...
Key difference: PP micro-batches are in-flight simultaneously across stages.
10. Debugging Tips¶
Issue: Poor Utilization¶
Symptoms: GPUs idle most of the time
Solutions:
- Increase number of micro-batches (4× pipeline stages minimum)
- Use 1F1B schedule instead of GPipe
- Profile bubble time
Issue: Out of Memory¶
Symptoms: OOM during training
Solutions:
- Reduce number of micro-batches (less activation memory)
- Use activation checkpointing
- Reduce batch size per micro-batch
Issue: Slow Communication¶
Symptoms: High time in Send/Recv
Solutions:
- Check inter-node bandwidth
- Ensure balanced stage sizes
- Consider hybrid TP+PP (TP within node, PP across nodes)
11. Advanced: Hybrid PP + TP¶
Most efficient setup for large models:
┌──────────────────┐
│ Pipeline Stage 0 │ ← 8-way Tensor Parallel
│  (Layers 1-12)   │
│    [GPU 0-7]     │
└──────────────────┘
         ↓ activation
┌──────────────────┐
│ Pipeline Stage 1 │ ← 8-way Tensor Parallel
│  (Layers 13-24)  │
│    [GPU 8-15]    │
└──────────────────┘
Benefits:
- TP reduces per-stage memory (use NVLink within node)
- PP reduces total parameter memory (across nodes)
- Minimizes cross-node communication (only inter-stage activations)
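The GPU numbering above follows from simple rank arithmetic. A sketch assuming TP groups are contiguous ranks within a node (function name is my own):

```python
def parallel_coords(rank, tp_size):
    """Map a global rank to (pp_stage, tp_rank), with TP groups as
    contiguous rank blocks (one block per node, PP across nodes)."""
    return rank // tp_size, rank % tp_size

# 16 GPUs, 8-way TP, 2 pipeline stages:
for rank in (0, 7, 8, 15):
    print(rank, parallel_coords(rank, 8))
```

Keeping the TP group contiguous means its heavy all-reduce traffic stays on intra-node NVLink, while only the lighter stage-to-stage activations cross nodes.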
12. Key Takeaways¶
- PP splits model by depth - each GPU owns contiguous layers
- Pipeline bubble is unavoidable - use micro-batching to minimize (aim for <20%)
- Memory trade-off: ↓ Parameters, ↑ Activations
- More micro-batches → better utilization but higher memory
- 1F1B schedule is better than naive fill-drain
- Not ideal for inference due to latency overhead
- Often combined with TP - TP within stages, PP across stages
- Rule: M ≥ 4P (micro-batches ≥ 4× pipeline stages)