Bottleneck Analysis

1. Understanding Bottlenecks

The Three Primary Bottlenecks

1. Compute-Bound

GPU cores saturated with arithmetic
Arithmetic throughput (FLOPS) is the limit, not data movement
Common in: Prefill phase, large batches

2. Memory-Bound

GPU cores waiting for data
Memory bandwidth saturated
Common in: Decode phase, small batches

3. Overhead-Bound

Framework/system overhead dominates
Kernel launch latency
Common in: Very small models, batch=1

2. Roofline Model

Attainable Performance = min(Peak Compute, Arithmetic Intensity × Memory Bandwidth)

Arithmetic Intensity = FLOPs / Bytes Transferred

If Arithmetic Intensity < Compute/Bandwidth ratio → Memory-Bound
If Arithmetic Intensity > Compute/Bandwidth ratio → Compute-Bound

Example: H100 GPU

Peak FP16 Compute: 989 TFLOPS (dense Tensor Core, SXM)
Memory Bandwidth: 3,350 GB/s
Ratio: 295 FLOP/Byte

Operation with AI=100 FLOP/Byte → Memory-bound
Operation with AI=500 FLOP/Byte → Compute-bound
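
The roofline check above can be sketched in a few lines of Python. This is a minimal illustration rather than a profiling tool: the function names are made up for this example, and the defaults are simply the H100 figures quoted above.

```python
# Roofline sketch. Function names are illustrative; defaults are the
# H100 numbers from the example above (989 TFLOPS, 3,350 GB/s).

def ridge_point(peak_tflops=989.0, bandwidth_gbs=3350.0):
    """Arithmetic intensity (FLOP/Byte) where the two roofs meet."""
    return peak_tflops * 1e12 / (bandwidth_gbs * 1e9)


def attainable_tflops(ai, peak_tflops=989.0, bandwidth_gbs=3350.0):
    """min(Peak Compute, Arithmetic Intensity x Memory Bandwidth), in TFLOPS."""
    memory_roof_tflops = ai * bandwidth_gbs * 1e9 / 1e12
    return min(peak_tflops, memory_roof_tflops)


def classify(ai, peak_tflops=989.0, bandwidth_gbs=3350.0):
    """Label an operation by which roof limits it."""
    if ai < ridge_point(peak_tflops, bandwidth_gbs):
        return "memory-bound"
    return "compute-bound"


for ai in (1, 100, 500):
    print(f"AI={ai:>3} FLOP/Byte: {classify(ai)}, "
          f"up to {attainable_tflops(ai):.1f} TFLOPS")
```

At AI=100 the memory roof caps the operation at roughly a third of peak compute (335 of 989 TFLOPS), which is why raising arithmetic intensity matters more than raw FLOPS in that regime.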

3. Identifying Bottlenecks

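Method 1: GPU Utilization Metrics

Compute Utilization

nvidia-smi dmon -s u
# "sm" column: % of time at least one kernel was running on the SMs
# (a coarse signal; it does not show how busy the cores were)

High SM% (>80%) → Likely compute-bound (confirm with a profiler)
Low SM% (<40%) → Memory or overhead-bound

Memory Utilization

nvidia-smi dmon -s u
# "mem" column: % of time the memory controller was busy
# (-s m reports framebuffer usage in MB, not bandwidth)

High Mem% (>80%) → Memory-bound
Low Mem% (<40%) → Compute or overhead-bound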

Method 2: Profiling Tools

NVIDIA Nsight Compute

ncu --set full -o profile python inference.py

Shows compute vs memory bottleneck per kernel
Identifies optimization opportunities

PyTorch Profiler

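import torch
from torch.profiler import profile, ProfilerActivity

# `model` and `inputs` are assumed to be on the GPU already
with torch.no_grad(), profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(inputs)
    torch.cuda.synchronize()  # let all kernels finish inside the profiled region

# Top operators by total GPU time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))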

Key Metrics to Check:

Kernel time distribution
Memory copy overhead
CPU-GPU sync points

Method 3: Microbenchmarks

Isolate Operations

# benchmark_prefill, benchmark_decode and benchmark are placeholders
# for your own timing helpers

# Test prefill vs decode separately
prefill_time = benchmark_prefill(prompt_tokens)
decode_time = benchmark_decode(num_output_tokens)

# Test different batch sizes
throughput = {}
for batch_size in [1, 4, 8, 16, 32]:
    throughput[batch_size] = benchmark(batch_size)

Expected Results:

Decode: Throughput grows almost linearly with batch while step latency stays flat → Memory-bound
Prefill: Throughput plateaus early and latency grows with batch → Compute-bound
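
One way to read a batch-size sweep is to compare the throughput gain with the batch-size gain at each step. A memory-bound workload still has spare compute, so extra sequences ride along with the same weight reads and throughput keeps climbing; a compute-bound workload is already saturated and flattens out. The sketch below assumes you already have a dict of batch size to tokens/s. The helper names, the 0.5 threshold, and the sample numbers are all made up for illustration.

```python
# Batch-scaling diagnostic sketch. Helper names, threshold and the
# sample throughput numbers are hypothetical.

def scaling_efficiency(throughput):
    """For each batch-size step, (throughput gain) / (batch-size gain).

    ~1.0: throughput grows in proportion to batch size.
    ~0.0: throughput has plateaued.
    """
    sizes = sorted(throughput)
    result = {}
    for small, large in zip(sizes, sizes[1:]):
        gain = throughput[large] / throughput[small]
        result[(small, large)] = (gain - 1) / (large / small - 1)
    return result


def diagnose(throughput, threshold=0.5):
    """Rough label from the mean scaling efficiency across the sweep."""
    eff = scaling_efficiency(throughput)
    mean_eff = sum(eff.values()) / len(eff)
    if mean_eff >= threshold:
        return "memory- or overhead-bound (batching still pays off)"
    return "compute-bound (throughput already saturated)"


# Made-up tokens/s measurements, keyed by batch size
decode_tps = {1: 50, 4: 195, 8: 380, 16: 720, 32: 1300}
prefill_tps = {1: 9000, 4: 11000, 8: 11500, 16: 11700, 32: 11800}

print("decode: ", diagnose(decode_tps))
print("prefill:", diagnose(prefill_tps))
```

Good scaling alone cannot separate memory-bound from overhead-bound, since both leave compute idle; the profiler output from Method 2 is what tells them apart.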

4. Common Bottleneck Patterns

Pattern 1: Decode Phase (Memory-Bound)

Symptoms:

Low GPU compute utilization (20-40%)
High memory bandwidth usage
TPOT tracks bytes read per step: it improves with weight quantization, not with more FLOPS

Root Cause:

Single token generation = Read every weight matrix from GPU memory
Arithmetic Intensity ≈ 1-2 FLOP/Byte at batch=1 (very low)

Solutions:

Weight quantization (INT8/INT4) → Reduce bytes transferred
Increase batch size → Amortize weight loading
Use higher memory bandwidth GPU (H100 vs A100)
Speculative decoding → Verify several draft tokens per weight read
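
A back-of-the-envelope estimate shows why these solutions work. If every weight is read once per decode step, the time per output token cannot be lower than weight bytes divided by memory bandwidth. The sketch below assumes a hypothetical 7B-parameter model and the H100 bandwidth quoted earlier, and it ignores KV cache and activation traffic; the function names are illustrative.

```python
# Decode-phase estimate. Assumptions: weights are read exactly once per
# step, KV cache and activation traffic are ignored, names are illustrative.

def min_tpot_ms(num_params, bytes_per_weight, bandwidth_gbs):
    """Bandwidth-limited lower bound on time per output token, in ms."""
    weight_bytes = num_params * bytes_per_weight
    return weight_bytes / (bandwidth_gbs * 1e9) * 1e3


def decode_ai(batch_size, bytes_per_weight):
    """Approximate FLOP/Byte of the weight matmuls.

    Each weight contributes one multiply-add (2 FLOPs) per sequence
    in the batch and is read once for the whole batch.
    """
    return 2 * batch_size / bytes_per_weight


PARAMS = 7e9          # hypothetical 7B-parameter model
H100_GBS = 3350.0     # bandwidth figure from the roofline example

for name, nbytes in (("FP16", 2), ("INT8", 1), ("INT4", 0.5)):
    print(f"{name}: >= {min_tpot_ms(PARAMS, nbytes, H100_GBS):.2f} ms/token, "
          f"AI at batch=1: {decode_ai(1, nbytes):.0f} FLOP/Byte")
```

Halving the bytes per weight halves the bound, and each extra sequence in the batch raises arithmetic intensity without adding weight traffic, which is the amortization the batch-size solution refers to.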

Pattern 2: Prefill Phase (Compute-Bound)

Symptoms:

High GPU compute utilization (70-90%)
Attention computation dominates at long sequence lengths
Throughput saturates at small batch sizes; latency grows with batch size

Root Cause:

Attention: O(n²d) operations
Long sequences = Quadratic compute growth

Solutions:

FlashAttention → Fused kernel, reduce memory access
Tensor parallelism → Split across GPUs
Reduce sequence length if possible
Use models with sliding window attention (Mistral)
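
The quadratic growth in the root cause can be made concrete with a rough FLOP count. The sketch below counts only the two attention matmuls (QK^T and scores times V) and ignores softmax, the linear projections, and any savings from causal masking; the function name and the hidden size of 4096 are assumptions for illustration.

```python
# Attention FLOP estimate. Counts only QK^T and scores @ V; softmax,
# projections and causal-mask savings are ignored. Name is illustrative.

def attention_flops(seq_len, hidden_dim, num_layers=1):
    """Approximate FLOPs for full self-attention over seq_len tokens.

    QK^T costs 2 * n * n * d and scores @ V costs another 2 * n * n * d,
    where d is the hidden size summed over all heads.
    """
    return 4 * seq_len * seq_len * hidden_dim * num_layers


for n in (512, 1024, 2048, 4096, 8192):
    gflops = attention_flops(n, hidden_dim=4096) / 1e9
    print(f"seq_len={n:>5}: {gflops:>8.1f} GFLOPs per layer")
```

Doubling the sequence length quadruples the attention cost, which is why shortening the prompt or limiting the attention window has such a large effect at long context.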

Pattern 3: KV Cache Transfer (Memory-Bound)

Symptoms:

Performance degrades with sequence length
Memory copy time visible in profiler

Root Cause:

KV cache size = 2 (K and V) × batch × seq_len × layers × kv_heads × head_dim × bytes per element
Long sequences = Large cache to read or copy at every decode step

Solutions:

GQA/MQA → Reduce KV cache size
KV cache quantization (INT8) → 2x reduction vs FP16
Paged attention (vLLM) → Better memory management
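
The cache-size formula is easy to turn into a calculator. The sketch below uses LLaMA-2-7B-like dimensions (32 layers, 32 KV heads, head dimension 128) as an assumed example; the 8-KV-head variant is a hypothetical GQA configuration, not a real checkpoint, and the function name is illustrative.

```python
# KV cache size calculator. Model dimensions are assumptions for
# illustration; the GQA variant is hypothetical.

def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to hold keys and values for every layer and token."""
    return (2 * batch_size * seq_len * num_layers
            * num_kv_heads * head_dim * bytes_per_elem)


GIB = 1024 ** 3

mha_fp16 = kv_cache_bytes(4096, num_layers=32, num_kv_heads=32, head_dim=128)
gqa_fp16 = kv_cache_bytes(4096, num_layers=32, num_kv_heads=8, head_dim=128)
mha_int8 = kv_cache_bytes(4096, num_layers=32, num_kv_heads=32, head_dim=128,
                          bytes_per_elem=1)

print(f"MHA, FP16: {mha_fp16 / GIB:.2f} GiB")
print(f"GQA (8 KV heads), FP16: {gqa_fp16 / GIB:.2f} GiB")
print(f"MHA, INT8: {mha_int8 / GIB:.2f} GiB")
```

With these dimensions each token costs 512 KiB of cache in FP16, so cache traffic grows linearly with both context length and batch size.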

Pattern 4: Kernel Launch Overhead

Symptoms:

Low utilization despite small workload
Many small kernels in profiler
Latency stays roughly constant as model size changes

Root Cause:

Each operation launches separate kernel
Overhead: ~5-20μs per kernel launch

Solutions:

Kernel fusion (FlashAttention, torch.compile)
Larger batch sizes
Use CUDA graphs → Eliminate launch overhead
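
A quick estimate helps judge whether launch overhead is worth chasing. The sketch below uses a deliberately pessimistic model in which launches are not hidden behind GPU work, which is roughly what happens when kernels finish faster than the CPU can issue them. The kernel count, the 10 μs launch cost, and the GPU work time are assumed values, and the function names are illustrative.

```python
# Launch-overhead estimate. Worst-case model: launches are serialized and
# not overlapped with GPU execution. All inputs below are assumed values.

def launch_overhead_ms(num_kernels, launch_us):
    """Total launch cost for one step, in milliseconds."""
    return num_kernels * launch_us / 1000.0


def overhead_fraction(num_kernels, launch_us, gpu_work_ms):
    """Share of step time spent on launches under the worst-case model."""
    overhead = launch_overhead_ms(num_kernels, launch_us)
    return overhead / (overhead + gpu_work_ms)


# Hypothetical decode step: 32 layers x 12 kernels, 10 us per launch,
# 4.2 ms of actual GPU work
kernels = 32 * 12
print(f"launch cost: {launch_overhead_ms(kernels, 10):.2f} ms")
print(f"share of step: {overhead_fraction(kernels, 10, 4.2):.0%}")
```

Fusion lowers `num_kernels`, while CUDA graphs replace many individual launches with a single graph launch; either way the goal is to shrink the first term.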

Pattern 5: CPU-GPU Synchronization

Symptoms:

GPU idle time between operations
High "cudaDeviceSynchronize" time
Low pipeline parallelism

Root Cause:

Explicit sync points or implicit Python overhead
GPU waits for CPU to issue next operation

Solutions:

Asynchronous operations (CUDA streams)
Reduce Python overhead (torch.compile, C++ inference)
Pipeline parallelism

5. Systematic Analysis Framework

Step 1: Measure Baseline

Metrics to collect:
- Total latency (TTFT + decode time)
- Tokens per second (throughput)
- GPU utilization (SM%, Mem%)
- Memory usage (weights, KV cache, activations)

Step 2: Profile Critical Path

Use profiler to identify:
1. Which operations take most time?
2. Are they compute or memory-bound?
3. Where are sync points?

Step 3: Apply Targeted Optimizations

If memory-bound → Reduce data movement
If compute-bound → Optimize kernels or reduce ops
If overhead-bound → Fuse kernels or increase batch

Step 4: Validate Improvement

Measure again and compare
Check for regressions in quality
Ensure optimization applies to production workload

6. Profiling Example: LLaMA-2-7B

Baseline (Batch=1, Seq=512)

Operation          | Time (ms) | % Total | Bottleneck
-------------------|-----------|---------|------------
Attention          | 8.2       | 45%     | Memory
FFN                | 6.5       | 35%     | Memory
Layer Norm         | 1.8       | 10%     | Overhead
KV Cache Update    | 1.2       | 7%      | Memory
Misc               | 0.5       | 3%      | -
-------------------|-----------|---------|------------
Total              | 18.2      | 100%    | Memory-bound

After Optimization

Applied: FlashAttention, INT8 quantization, kernel fusion

Operation          | Time (ms) | % Total | Change
-------------------|-----------|---------|--------
Attention (Flash)  | 4.1       | 40%     | -50%
FFN (INT8)         | 3.8       | 37%     | -42%
Layer Norm (fused) | 0.9       | 9%      | -50%
KV Cache Update    | 1.0       | 10%     | -17%
Misc               | 0.4       | 4%      | -20%
-------------------|-----------|---------|--------
Total              | 10.2      | 100%    | -44%
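
The "Change" column and the overall speedup can be recomputed directly from the two tables, which is a useful habit when validating before/after numbers. The sketch below copies the per-operation times from the tables above; the variable and function names are illustrative.

```python
# Recompute the change column from the two tables above.
# Times are in milliseconds, copied from the tables.

baseline = {"Attention": 8.2, "FFN": 6.5, "Layer Norm": 1.8,
            "KV Cache Update": 1.2, "Misc": 0.5}
optimized = {"Attention": 4.1, "FFN": 3.8, "Layer Norm": 0.9,
             "KV Cache Update": 1.0, "Misc": 0.4}


def pct_change(before, after):
    """Relative change in percent, rounded to the nearest integer."""
    return round((after - before) / before * 100)


total_before = sum(baseline.values())
total_after = sum(optimized.values())
speedup = total_before / total_after

for op in baseline:
    print(f"{op:<16} {pct_change(baseline[op], optimized[op]):>4}%")
print(f"{'Total':<16} {pct_change(total_before, total_after):>4}%")
print(f"Speedup: {speedup:.2f}x")
```

A 44% drop in latency corresponds to a speedup of about 1.78x; quoting both avoids the common confusion between "percent faster" and "percent less time".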

7. Common Interview Questions

Q: How do you determine if inference is compute or memory-bound?

1. Check GPU metrics (SM% vs Mem%)
2. Profile with Nsight Compute (SOL Compute vs SOL Memory)
3. Test batch size scaling:
   - Compute-bound: Throughput plateaus quickly; latency grows with batch
   - Memory-bound: Throughput keeps scaling with batch at near-constant latency
4. Calculate arithmetic intensity vs hardware ratio

Q: GPU shows 100% utilization but throughput is low. Why?

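nvidia-smi utilization only reports the share of time a kernel was running, not how efficiently the SMs were used
Could be memory-bound: SMs stall on loads while memory bandwidth is saturated
Verify you're looking at the right metric (compute vs memory)
Could be inefficient kernels (high utilization, low throughput)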

Q: Describe how you'd optimize a memory-bound decode phase

1. Profile to confirm bottleneck (low SM%, high Mem%)
2. Quantize weights (FP16 → INT8) → 2x less data to transfer
3. Increase batch size → Amortize each weight read across more tokens
4. Use H100 instead of A100 → ~1.6x more bandwidth (3.35 TB/s vs 2.0 TB/s on A100 80GB)
5. Consider speculative decoding → Reduce number of decode steps

Q: What's the impact of FlashAttention on prefill vs decode?

Prefill (Compute-bound):
- Reduces memory access (no full attention matrix)
- Enables longer sequences without OOM
- 2-4x speedup typical

Decode (Memory-bound):
- Smaller benefit (already memory-bound on weights)
- Still helpful for very long context
- ~20-30% speedup

Q: How do you profile Python overhead vs GPU computation?

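# Method 1: Compare wall-clock time with CUDA event time
import time
import torch

# `model` and `inputs` are assumed to be on the GPU and already warmed up
torch.cuda.synchronize()  # drain pending GPU work before timing

# Wall clock with sync (GPU time + CPU-side launch/Python overhead)
t0 = time.perf_counter()
output = model(inputs)
torch.cuda.synchronize()
t1 = time.perf_counter()

# CUDA events (time measured on the GPU stream, in ms)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
output = model(inputs)
end.record()
torch.cuda.synchronize()
gpu_time = start.elapsed_time(end)

# Event time still includes GPU idle gaps between kernels,
# so this difference is a lower bound on the CPU-side overhead
python_overhead = (t1 - t0) - (gpu_time / 1000)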

Q: Explain the roofline model for LLM inference

Roofline: Performance = min(Compute Peak, Bandwidth × Arithmetic Intensity)

Example: Decode single token on H100
- Matmul: [1, 4096] × [4096, 4096]
- FLOPs: 2 × 1 × 4096 × 4096 ≈ 33.5M
- Bytes: (4096×4096 weights + 4096 inputs + 4096 outputs) × 2 (FP16) ≈ 33.6MB
- AI: 33.5M / 33.6M ≈ 1 FLOP/Byte

H100: 989 TFLOPS, 3350 GB/s → 295 FLOP/Byte ratio
AI (1) << Ratio (295) → Memory-bound
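
The same arithmetic extends to larger batches and shows where this matmul would cross the roofline ridge. The sketch below assumes each operand (activations, weights, outputs) moves through memory exactly once in FP16 and that peak compute is achievable; real kernels fall short of both, so treat the crossover as an optimistic estimate. Function names are illustrative.

```python
# Arithmetic intensity of [batch, k] x [k, n] in FP16, assuming each
# operand is moved exactly once. Names are illustrative.

def matmul_ai(batch, k=4096, n=4096, bytes_per_elem=2):
    """FLOP/Byte for a [batch, k] x [k, n] matmul."""
    flops = 2 * batch * k * n
    bytes_moved = (batch * k + k * n + batch * n) * bytes_per_elem
    return flops / bytes_moved


H100_RIDGE = 989e12 / 3350e9  # ~295 FLOP/Byte, from the figures above


def crossover_batch(ridge=H100_RIDGE, k=4096, n=4096, max_batch=100_000):
    """Smallest batch whose AI reaches the ridge, or None if never reached."""
    for batch in range(1, max_batch + 1):
        if matmul_ai(batch, k, n) >= ridge:
            return batch
    return None


for b in (1, 8, 64, 512):
    print(f"batch={b:>3}: AI = {matmul_ai(b):7.1f} FLOP/Byte")
print("crossover batch:", crossover_batch())
```

Under these assumptions the matmul stays memory-bound until a batch of roughly 345 tokens, which is why decode at small batch sizes sits so far to the left of the ridge.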

8. Advanced Profiling

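Tensor Core Utilization

ncu --metrics sm__sass_thread_inst_executed_op_dmma_inst,sm__sass_thread_inst_executed_op_hmma_inst python inference.py
# Metric names vary by GPU architecture and Nsight Compute version;
# list the ones available on your system with: ncu --query-metrics

Check if matmuls use tensor cores (TC)
Low TC usage → Not using FP16/BF16, or matrix dimensions not aligned for tensor cores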

Memory Transaction Efficiency

ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum python inference.py

Sectors per request = Sectors / Requests (ideal is 4 for fully coalesced 32-bit loads)
Efficiency ≈ (Requests × 4) / Sectors
Low efficiency → Uncoalesced memory access
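
The two counters can be combined into a single number once the profile is collected. The sketch below takes efficiency as ideal sectors over actual sectors, assuming 32-bit loads where a fully coalesced warp request touches 4 sectors of 32 bytes; wider loads have a different ideal, so the default is an assumption to adjust. The function name and the counter values are made up for illustration.

```python
# Coalescing efficiency from Nsight Compute sector/request counters.
# Assumes 32-bit loads (4 sectors per fully coalesced warp request).

def coalescing_efficiency(sectors, requests, ideal_sectors_per_request=4):
    """Ideal sectors divided by actual sectors, capped at 1.0."""
    if sectors <= 0 or requests <= 0:
        raise ValueError("sectors and requests must be positive")
    return min(1.0, ideal_sectors_per_request * requests / sectors)


# Hypothetical counter values for two kernels
print(f"coalesced kernel: {coalescing_efficiency(4_000, 1_000):.0%}")
print(f"strided kernel:   {coalescing_efficiency(32_000, 1_000):.1%}")
```

A request that touches 32 sectors, one per thread, is the worst case for 32-bit loads and yields 12.5% under this definition.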

9. Key Takeaways

Always profile before optimizing - Don't guess the bottleneck
Different phases have different bottlenecks: Prefill (compute), Decode (memory)
Use the right metric: SM% for compute, Mem% for memory
Batch size is a key diagnostic: Scaling behavior reveals bottleneck
Optimization must target the actual bottleneck: Memory optimization won't help compute-bound workload
Modern GPUs shift bottlenecks: H100's higher compute/bandwidth ratio changes optimization strategy