Latency vs Throughput

1. Core Concepts

Latency

Time to complete a single request
Measured in seconds or milliseconds
Critical for interactive applications (chatbots, code completion)
Key metrics: time to first token (TTFT), time per output token (TPOT), end-to-end (E2E) latency
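The three latency metrics can be derived from per-token arrival timestamps. A minimal sketch, assuming a hypothetical `RequestTrace` record that the client fills in while streaming (the names are illustrative and not taken from any serving framework):

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    sent_at: float            # time the request was sent (seconds)
    token_times: list[float]  # arrival time of each output token (seconds)

def ttft(trace: RequestTrace) -> float:
    """Time to first token: queueing plus prefill, as seen by the client."""
    return trace.token_times[0] - trace.sent_at

def tpot(trace: RequestTrace) -> float:
    """Time per output token: average gap between tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)

def e2e_latency(trace: RequestTrace) -> float:
    """End-to-end latency: from send until the last token arrives."""
    return trace.token_times[-1] - trace.sent_at

# Hypothetical trace: first token after 250 ms, then one token every 50 ms.
trace = RequestTrace(sent_at=0.0, token_times=[0.25, 0.30, 0.35, 0.40, 0.45])
print(ttft(trace), tpot(trace), e2e_latency(trace))
```

With these definitions, E2E latency equals TTFT plus TPOT times the number of tokens after the first.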

Throughput

Number of requests processed per unit time
Measured in tokens/sec or requests/sec
Critical for batch processing, high-traffic services
Maximize GPU utilization

2. The Fundamental Tradeoff

Latency ↑ as Throughput ↑

Higher batch size → Higher throughput, Higher latency per request
Lower batch size → Lower latency, Lower throughput

Why They Conflict

Batching Increases Throughput

Process multiple requests simultaneously
Better GPU utilization (more parallel work)
Amortize weight loading overhead

But Hurts Latency

With static batching, requests wait for the entire batch to complete
Queueing delays increase
Stragglers slow down entire batch
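The conflict can be illustrated with a toy cost model for one decode step. This is a sketch under stated assumptions, not a measurement: the weights are streamed once per step regardless of batch size, per-sequence compute grows linearly, and the step takes whichever is longer. The default timings are hypothetical.

```python
def decode_step_ms(batch_size, weight_load_ms=10.0, compute_ms_per_seq=0.5):
    """Time for one decode step under the toy model.

    weight_load_ms: time to stream the weights once (shared by the whole batch).
    compute_ms_per_seq: compute time added by each sequence in the batch.
    Both values are made up for illustration.
    """
    return max(weight_load_ms, batch_size * compute_ms_per_seq)

def tokens_per_sec(batch_size, **kwargs):
    """Aggregate throughput: one token per sequence per step."""
    return batch_size * 1000.0 / decode_step_ms(batch_size, **kwargs)

for b in (1, 8, 32, 128):
    print(f"batch={b:4d}  step={decode_step_ms(b):6.1f} ms  "
          f"throughput={tokens_per_sec(b):7.1f} tok/s")
```

In this model, small batches add throughput almost for free because the weight load is amortized. Past the point where compute dominates, throughput stops improving and only per-token latency grows.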

3. Key Metrics

Latency = Queue Time + Processing Time
Throughput = Batch Size / Processing Time (ignoring queue)

Utilization = (Actual Throughput) / (Max Theoretical Throughput)
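A worked example of the three formulas, using hypothetical numbers (batch of 32, 0.5 s processing time, 0.25 s queue time, 80 requests/sec theoretical maximum):

```python
def latency_s(queue_time_s, processing_time_s):
    """Latency = Queue Time + Processing Time."""
    return queue_time_s + processing_time_s

def throughput_rps(batch_size, processing_time_s):
    """Throughput = Batch Size / Processing Time (queueing ignored)."""
    return batch_size / processing_time_s

def utilization(actual_throughput, max_theoretical_throughput):
    """Fraction of the theoretical peak that is actually achieved."""
    return actual_throughput / max_theoretical_throughput

lat = latency_s(queue_time_s=0.25, processing_time_s=0.5)
thr = throughput_rps(batch_size=32, processing_time_s=0.5)
util = utilization(thr, max_theoretical_throughput=80.0)
print(lat, thr, util)
```

Note that queue time affects latency but not this throughput formula, which is why an overloaded system can report healthy throughput while latency degrades.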

4. Optimization Strategies

For Low Latency (< 100ms)

Batch Size = 1 or Small

Minimize queueing delay
Accept lower GPU utilization
Use smaller models (7B vs 70B)
Quantization (INT8/INT4) for faster decode

Prefill Optimization

FlashAttention for faster attention
Tensor parallelism to split model across GPUs

Infrastructure

Low-latency network
GPU with high memory bandwidth (H100 > A100)
Close to users (edge deployment)

For High Throughput

Large Batch Sizes

Batch 32-128+ requests
Maximize GPU compute utilization
Accept seconds of latency per request

Continuous Batching

Don't wait for all sequences to finish
Insert new requests as others complete
Used by vLLM, TensorRT-LLM

Paged Attention (vLLM)

Reduce memory fragmentation
Pack more sequences in memory
Enable larger effective batch size
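The idea behind paged KV-cache allocation can be sketched with a toy block allocator. This is an illustration of the concept only; the class and method names are hypothetical and do not reflect vLLM's internal API.

```python
class PagedKVAllocator:
    """Toy allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token; False if memory is exhausted."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:  # first token, or current block is full
            if not self.free_blocks:
                return False
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return True

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("a")  # 17 tokens need two 16-token blocks
alloc.append_token("b")      # 1 token needs one block
print(alloc.block_tables, len(alloc.free_blocks))
```

Because blocks are allocated on demand, each sequence wastes at most one partially filled block instead of a full pre-reserved maximum-length buffer.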

Chunked Prefill

Split long prefills into chunks
Interleave with decode steps
Balance latency and throughput
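A minimal sketch of how a scheduler might interleave prefill chunks with decode work. The function and the step layout are hypothetical; real engines budget by tokens per step and handle many prompts at once.

```python
def schedule_steps(prompt_len, chunk_size, decode_seqs):
    """Plan engine steps for one long prompt arriving while others decode.

    Each step processes one slice of the prompt plus one decode token for
    every running sequence, so decodes are never stalled by the full prefill.
    """
    steps = []
    for start in range(0, prompt_len, chunk_size):
        end = min(start + chunk_size, prompt_len)
        steps.append({"prefill_tokens": (start, end),
                      "decode_seqs": list(decode_seqs)})
    return steps

# A 1000-token prompt split into 256-token chunks while A and B keep decoding.
for step in schedule_steps(1000, 256, ["A", "B"]):
    print(step)
```

Smaller chunks keep per-step time (and thus TPOT for running sequences) low, at the cost of a longer TTFT for the new prompt.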

5. Request-Level Batching Strategies

Static Batching

Wait for batch to fill before processing
Simple but high latency variance
Wasted time if batch doesn't fill

Continuous Batching

t=0: Start batch [A, B, C]
t=1: A finishes → add D → [B, C, D]
t=2: B finishes → add E → [C, D, E]

Dynamic batch composition
Much better GPU utilization
Lower average latency
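The difference between the two strategies can be seen in a small step-count simulation. This is a sketch assuming every sequence produces one token per step and that a freed slot is refilled immediately; the workload is hypothetical.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total

def continuous_batching_steps(lengths, batch_size):
    """Finished sequences are replaced from the queue at every step."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each active sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # Generate one token per active sequence, then drop finished ones.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps

# Alternating short (2-token) and long (8-token) outputs, two batch slots.
lengths = [2, 8, 2, 8, 2, 8]
print(static_batching_steps(lengths, 2), continuous_batching_steps(lengths, 2))
```

On this workload static batching needs 24 steps and continuous batching 18, because short sequences no longer hold a slot idle while a long one finishes.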

Priority Queuing

Process short/urgent requests first
Separate queues for interactive vs batch
SLO-aware scheduling
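One possible way to implement this ordering is a heap keyed on request class and estimated length. A minimal sketch; the `PriorityScheduler` class and its fields are hypothetical, and the token estimate is assumed to come from the caller.

```python
import heapq
import itertools

class PriorityScheduler:
    def __init__(self):
        self._heap = []
        # Monotonic counter breaks ties so equal priorities stay FIFO.
        self._counter = itertools.count()

    def submit(self, request_id, interactive, est_tokens):
        # Lower tuples pop first: interactive before batch, then shorter first.
        priority = (0 if interactive else 1, est_tokens)
        heapq.heappush(self._heap, (priority, next(self._counter), request_id))

    def next_request(self):
        """Return the most urgent request id, or None if the queue is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None

sched = PriorityScheduler()
sched.submit("batch-job", interactive=False, est_tokens=50)
sched.submit("chat-long", interactive=True, est_tokens=400)
sched.submit("chat-short", interactive=True, est_tokens=20)
order = [sched.next_request() for _ in range(3)]
print(order)
```

A strict priority order like this can starve batch work under sustained interactive load, so a production scheduler would typically add aging or a reserved share for the lower tier.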

6. Hardware Considerations

A100 (80GB)

1,935 GB/s memory bandwidth (PCIe variant; the SXM variant reaches 2,039 GB/s)
Good for batch inference
Throughput: ~2000 tokens/sec (LLaMA-2-7B, batch=32)

H100 (80GB)

3,350 GB/s memory bandwidth (SXM variant; about 1.7x the 1,935 GB/s A100)
Better for both latency and throughput
FlashAttention-3 support
Throughput: ~3500 tokens/sec (same setup)

L40S / L4

Lower cost, lower bandwidth
Good for latency-optimized serving (small batch)
Not ideal for high throughput
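Memory bandwidth matters because batch=1 decode must stream every weight once per token. A back-of-the-envelope upper bound, assuming FP16 weights (2 bytes per parameter) and ignoring KV-cache reads, compute, and kernel overheads:

```python
def max_decode_tokens_per_sec(bandwidth_gb_s, params_billion, bytes_per_param=2):
    """Bandwidth-bound ceiling for single-sequence decode.

    Each generated token requires reading all model weights once, so the
    ceiling is memory bandwidth divided by model size.
    """
    model_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / model_gb

a100 = max_decode_tokens_per_sec(1935, 7)  # 7B model, A100 bandwidth from above
h100 = max_decode_tokens_per_sec(3350, 7)  # same model, H100 bandwidth
print(f"A100: {a100:.0f} tok/s, H100: {h100:.0f} tok/s, ratio {h100 / a100:.2f}x")
```

The bound is per sequence. Batching reuses each weight read across many sequences, which is how aggregate throughput at batch=32 can far exceed this single-stream ceiling.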

7. Common Interview Questions

Q: You have 1000 QPS (queries per second). Optimize for p99 latency < 200ms. How?

Use continuous batching (vLLM)
Target small effective batch (4-8)
Replica scaling with load balancer
Monitor queue depth, scale if needed
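A rough replica count for this scenario follows from Little's law (requests in flight = arrival rate × time in system). A sketch with hypothetical inputs: 150 ms average latency, effective batch of 8 per replica, and a 70% headroom factor so queues stay short at peak.

```python
import math

def replicas_needed(qps, avg_latency_s, batch_per_replica, headroom=0.7):
    """Estimate replicas from Little's law.

    headroom is the fraction of each replica's slots we plan to use;
    the rest absorbs bursts so p99 latency does not blow up.
    """
    in_flight = qps * avg_latency_s               # L = lambda * W
    usable_slots = batch_per_replica * headroom
    return math.ceil(in_flight / usable_slots)

print(replicas_needed(1000, 0.15, 8))                # with 70% headroom
print(replicas_needed(1000, 0.15, 8, headroom=1.0))  # no headroom
```

The estimate is a starting point only. The p99 target has to be validated under load, since tail latency depends on burstiness and sequence-length variance, not just the averages used here.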

Q: Batch size 1 vs 32: compare latency and throughput

Batch=1:
- Latency: ~50ms per token
- Throughput: ~20 tokens/sec (1 token / 50ms)
- GPU utilization: ~15%

Batch=32:
- Latency: ~800ms (includes queueing)
- Throughput: ~500 tokens/sec
- GPU utilization: ~80%

Q: How does continuous batching improve over static?

No waiting for batch to fill
No wasted cycles when sequences finish at different times
Typically, 2-3x better throughput at similar latency

Q: When would you choose latency over throughput?

Real-time chat applications
Code completion (100-200ms target)
Interactive agents
Premium API tiers

Q: When would you choose throughput over latency?

Offline batch processing
Data labeling/annotation
Embedding generation
Document summarization at scale

8. Production Patterns (2024-2025)

Multi-Tier Serving

Tier 1 (Latency): Small models, batch=1-4, edge deployment
Tier 2 (Balanced): Medium models, continuous batching, batch=8-16
Tier 3 (Throughput): Large models, large batches, datacenter

Speculative Decoding

Draft model generates multiple tokens
Target model verifies in parallel
2-3x faster decoding with the same output distribution
Best for latency-sensitive scenarios
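The achievable speedup depends on how often the target model accepts drafted tokens. A sketch of the standard estimate, under the simplifying assumption that each drafted token is accepted independently with probability `accept_rate`; `draft_cost_ratio` (cost of one draft step relative to one target step) is a hypothetical input.

```python
def expected_tokens_per_verify(accept_rate, draft_len):
    """Expected tokens produced per target-model pass.

    Geometric series: (1 - a^(k+1)) / (1 - a) for acceptance rate a and
    k drafted tokens; the +1 is the token the target model always yields.
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)

def estimated_speedup(accept_rate, draft_len, draft_cost_ratio):
    """Speedup vs plain decoding: tokens gained over time spent per cycle."""
    cycle_cost = draft_len * draft_cost_ratio + 1  # k draft steps + 1 verify
    return expected_tokens_per_verify(accept_rate, draft_len) / cycle_cost

print(expected_tokens_per_verify(0.8, 4))
print(estimated_speedup(0.8, 4, draft_cost_ratio=0.1))
```

With an 80% acceptance rate, 4 drafted tokens, and a draft model costing a tenth of the target, the estimate lands at roughly 2.4x. A low acceptance rate or an expensive draft model can erase the gain.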

Disaggregated Serving (Splitwise)

Separate prefill and decode clusters
Prefill: compute-bound, so it benefits from the fastest GPUs (H100)
Decode: memory-bandwidth-bound and needs little compute, so cheaper GPUs (A100) can be cost-effective
Transfer KV cache between clusters
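The cost of the KV cache transfer can be estimated from the model shape. A sketch assuming FP16 storage and the LLaMA-2-7B shape (32 layers, 32 KV heads, head dimension 128); the 50 GB/s link speed is a hypothetical value.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache for one sequence.

    The factor 2 covers one key and one value vector per layer per token.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len

def transfer_ms(num_bytes, link_gb_s):
    """Time to move num_bytes over a link with the given GB/s bandwidth."""
    return num_bytes / (link_gb_s * 1e9) * 1e3

per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=32, head_dim=128)
prompt_2k = kv_cache_bytes(2048, num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token, prompt_2k, round(transfer_ms(prompt_2k, link_gb_s=50), 2))
```

At about 0.5 MiB per token, a 2048-token prompt carries 1 GiB of KV cache, so the interconnect between the prefill and decode clusters directly adds to TTFT.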

9. Key Metrics to Monitor

P50, P95, P99 Latency - Distribution matters
Throughput (tokens/sec) - Absolute capacity
Queue Depth - Leading indicator of overload
GPU Utilization - Efficiency metric
Cost per 1M tokens - Business metric
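Two of these metrics are easy to compute from raw samples. A minimal sketch using the nearest-rank percentile definition; the latency samples and the $2/hour GPU price are hypothetical.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_sec):
    """Serving cost for 1M tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# 98 fast requests and 2 slow ones (milliseconds).
latencies_ms = [100] * 98 + [900, 2000]
mean_ms = sum(latencies_ms) / len(latencies_ms)
print(mean_ms, percentile(latencies_ms, 50), percentile(latencies_ms, 99))
print(round(cost_per_million_tokens(2.0, 2000), 4))
```

Here the mean (127 ms) and p50 (100 ms) look healthy while p99 is 900 ms, which is why the distribution matters more than the average.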

10. Benchmarking Tips

Measure real production traffic patterns
Include cold start times if relevant
Test at different concurrency levels
Monitor long-tail latency (p99, p99.9)
Account for sequence length variance
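Testing at different concurrency levels can be scripted with a small asyncio harness. This is a sketch only: `fake_endpoint` is a stand-in for a real client call, and the timer starts once a request gets a concurrency slot, so it measures service latency at that load rather than client-side queueing.

```python
import asyncio
import time

async def fake_endpoint(prompt):
    """Hypothetical stand-in for a real inference request."""
    await asyncio.sleep(0.01)
    return "ok"

async def run_level(send, prompts, concurrency):
    """Send all prompts with at most `concurrency` requests in flight."""
    sem = asyncio.Semaphore(concurrency)
    latencies = []

    async def one(prompt):
        async with sem:
            start = time.perf_counter()
            await send(prompt)
            latencies.append(time.perf_counter() - start)

    await asyncio.gather(*(one(p) for p in prompts))
    return latencies

def sweep(send, prompts, levels=(1, 4, 16)):
    """Return {concurrency: [latency_s, ...]} for each level."""
    return {c: asyncio.run(run_level(send, prompts, c)) for c in levels}

results = sweep(fake_endpoint, ["hello"] * 16)
for level, lats in results.items():
    print(level, f"max={max(lats) * 1000:.1f} ms")
```

Replaying prompts sampled from production traffic, rather than a fixed string, keeps the sequence-length variance in the measurement.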

11. Key Takeaways

Latency and throughput trade off via batching: larger batches raise both
Continuous batching is standard for production (vLLM, TRT-LLM)
Different use cases need different optimization targets
Hardware choice matters: H100 better for both metrics vs A100
Monitor distributions (p99), not just averages