Latency vs Throughput
1. Core Concepts¶
Latency¶
- Time to complete a single request
- Measured in seconds or milliseconds
- Critical for interactive applications (chatbots, code completion)
- Key metrics: TTFT, TPOT, E2E latency
Throughput¶
- Number of requests processed per unit time
- Measured in tokens/sec or requests/sec
- Critical for batch processing, high-traffic services
- Maximize GPU utilization
2. The Fundamental Tradeoff¶
Latency ↑ as Throughput ↑
Higher batch size → Higher throughput, Higher latency per request
Lower batch size → Lower latency, Lower throughput
Why They Conflict¶
Batching Increases Throughput
- Process multiple requests simultaneously
- Better GPU utilization (more parallel work)
- Amortize weight loading overhead
But Hurts Latency
- Requests wait for entire batch to complete
- Queueing delays increase
- Stragglers slow down entire batch
3. Key Metrics¶
Latency = Queue Time + Processing Time
Throughput = Batch Size / Processing Time (ignoring queue)
Utilization = (Actual Throughput) / (Max Theoretical Throughput)
4. Optimization Strategies¶
For Low Latency (< 100ms)¶
Batch Size = 1 or Small
- Minimize queueing delay
- Accept lower GPU utilization
- Use smaller models (7B vs 70B)
- Quantization (INT8/INT4) for faster decode
Prefill Optimization
- FlashAttention for faster attention
- Tensor parallelism to split model across GPUs
Infrastructure
- Low-latency network
- GPU with high memory bandwidth (H100 > A100)
- Close to users (edge deployment)
For High Throughput¶
Large Batch Sizes
- Batch 32-128+ requests
- Maximize GPU compute utilization
- Accept seconds of latency per request
Continuous Batching
- Don't wait for all sequences to finish
- Insert new requests as others complete
- Used by vLLM, TensorRT-LLM
Paged Attention (vLLM)
- Reduce memory fragmentation
- Pack more sequences in memory
- Enable larger effective batch size
Chunked Prefill
- Split long prefills into chunks
- Interleave with decode steps
- Balance latency and throughput
5. Request-Level Batching Strategies¶
Static Batching¶
- Wait for batch to fill before processing
- Simple but high latency variance
- Wasted time if batch doesn't fill
Continuous Batching¶
t=0: Start batch [A, B, C]
t=1: A finishes → add D → [B, C, D]
t=2: B finishes → add E → [C, D, E]
- Dynamic batch composition
- Much better GPU utilization
- Lower average latency
Priority Queuing¶
- Process short/urgent requests first
- Separate queues for interactive vs batch
- SLO-aware scheduling
6. Hardware Considerations¶
A100 (80GB)¶
- 1,935 GB/s memory bandwidth
- Good for batch inference
- Throughput: ~2000 tokens/sec (LLaMA-2-7B, batch=32)
H100 (80GB)¶
- 3,350 GB/s memory bandwidth (1.7x A100)
- Better for both latency and throughput
- FlashAttention-3 support
- Throughput: ~3500 tokens/sec (same setup)
L40S / L4¶
- Lower cost, lower bandwidth
- Good for latency-optimized serving (small batch)
- Not ideal for high throughput
7. Common Interview Questions¶
Q: You have 1000 QPS (queries per second). Optimize for p99 latency < 200ms. How?
- Use continuous batching (vLLM)
- Target small effective batch (4-8)
- Replica scaling with load balancer
- Monitor queue depth, scale if needed
Q: Batch size 1 vs 32: compare latency and throughput
Batch=1:
- Latency: ~50ms
- Throughput: ~20 tokens/sec
- GPU utilization: ~15%
Batch=32:
- Latency: ~800ms (includes queueing)
- Throughput: ~500 tokens/sec
- GPU utilization: ~80%
Q: How does continuous batching improve over static?
- No waiting for batch to fill
- No wasted cycles when sequences finish at different times
- Typically, 2-3x better throughput at similar latency
Q: When would you choose latency over throughput?
- Real-time chat applications
- Code completion (100-200ms target)
- Interactive agents
- Premium API tiers
Q: When would you choose throughput over latency?
- Offline batch processing
- Data labeling/annotation
- Embedding generation
- Document summarization at scale
8. Production Patterns (2024-2025)¶
Multi-Tier Serving¶
Tier 1 (Latency): Small models, batch=1-4, edge deployment
Tier 2 (Balanced): Medium models, continuous batching, batch=8-16
Tier 3 (Throughput): Large models, large batches, datacenter
Speculative Decoding¶
- Draft model generates multiple tokens
- Target model verifies in parallel
- 2-3x speedup with same latency
- Best for latency-sensitive scenarios
Disaggregated Serving (Splitwise)¶
- Separate prefill and decode clusters
- Prefill: GPU compute optimized (A100)
- Decode: Memory bandwidth optimized (H100)
- Transfer KV cache between clusters
9. Key Metrics to Monitor¶
P50, P95, P99 Latency - Distribution matters
Throughput (tokens/sec) - Absolute capacity
Queue Depth - Leading indicator of overload
GPU Utilization - Efficiency metric
Cost per 1M tokens - Business metric
10. Benchmarking Tips¶
- Measure real production traffic patterns
- Include cold start times if relevant
- Test at different concurrency levels
- Monitor long-tail latency (p99, p99.9)
- Account for sequence length variance
11. Key Takeaways¶
- Latency and throughput are inversely related via batching
- Continuous batching is standard for production (vLLM, TRT-LLM)
- Different use cases need different optimization targets
- Hardware choice matters: H100 better for both metrics vs A100
- Monitor distributions (p99), not just averages