Memory-Compute Tradeoffs
1. The Core Tradeoff¶
Memory Savings → Compute Overhead (Usually)
Techniques that reduce memory often require:
- Additional computation (quantization/dequantization)
- Recomputation instead of caching
- More complex kernels
2. Memory Bottlenecks in LLM Inference¶
1. Model Weights (Static)¶
- 70B model in FP16: 140 GB
- Must fit in GPU memory
- Loaded repeatedly during decode (memory bandwidth bound)
2. KV Cache (Dynamic)¶
- Grows with sequence length and batch size
- Often largest memory consumer in production
- Formula:
  2 × B × S × L × H × D × bytes_per_element
  where B = batch size, S = sequence length, L = layers, H = KV heads, D = head dimension
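A quick sanity check of this formula as a minimal Python sketch (the model dimensions in the example are illustrative, not taken from any specific model):

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size: 2 (keys + values) x B x S x L x H x D x bytes per element."""
    return 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_elem

# Illustrative 32-layer model with 8 KV heads, head_dim 128, FP16 cache (2 bytes)
size = kv_cache_bytes(batch=8, seq_len=2048, layers=32, kv_heads=8, head_dim=128)
print(f"{size / 1e9:.2f} GB")  # ~2.15 GB
```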
3. Activations (Temporary)¶
- Intermediate tensors during forward pass
- Transient in inference: freed after each layer's forward pass (no backprop, so nothing needs to be stored)
- ~5-10% of total memory
3. Quantization: Trading Precision for Memory¶
Weight Quantization¶
FP16 → INT8 (8-bit)
- 2x memory reduction (2 bytes → 1 byte)
- Minimal accuracy loss (<1% typically)
- Faster on hardware with INT8 support (Tensor Cores)
- Compute: Dequantize to FP16 for matmul (overhead ~10%)
FP16 → INT4 (4-bit)
- 4x memory reduction
- Quality degradation possible (1-3% on benchmarks)
- Requires calibration data
- Compute: More dequant overhead (~20-30%)
Techniques:
Per-Tensor: Single scale for entire tensor
Per-Channel: Scale per output channel (better quality)
Group Quantization: One scale per group of weights (e.g., 128 elements), as in GPTQ and AWQ
GPTQ: Layer-wise quantization, minimizes error
AWQ: Activation-aware, protects important weights
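To make group quantization concrete, here is a minimal NumPy sketch of symmetric per-group INT8 quantization (the group size of 128 and the symmetric scheme are illustrative choices, not GPTQ's or AWQ's exact algorithm):

```python
import numpy as np

def quantize_groupwise(w, group_size=128, n_bits=8):
    """Symmetric per-group quantization: one scale per group of `group_size` weights."""
    qmax = 2 ** (n_bits - 1) - 1                      # 127 for INT8, 7 for INT4
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales = np.maximum(scales, 1e-8)                 # avoid division by zero
    q = np.clip(np.round(groups / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequantize_groupwise(q, scales, shape):
    """Dequantize back to float for the matmul; this step is the compute overhead."""
    return (q.astype(np.float32) * scales).reshape(shape)

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, s, w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```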
KV Cache Quantization¶
- KV cache in INT8 instead of FP16
- 2x memory savings → 2x larger batch or sequence length
- Quality loss typically <0.5%
- Growing adoption in production (2024+)
Mixed Precision¶
- Keep critical layers in FP16 (first/last, attention)
- Quantize FFN layers to INT4
- Balance quality and memory
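One way such a policy can be expressed, as a hypothetical sketch (the layer-name patterns and the choice of bit widths are assumptions for illustration, not any framework's API):

```python
def precision_for(layer_name: str, num_layers: int = 32) -> str:
    """Illustrative mixed-precision policy: sensitive layers in FP16, FFN weights in INT4."""
    layer_idx = int(layer_name.split(".")[1]) if layer_name.startswith("layers.") else None
    if layer_name in ("embed_tokens", "lm_head"):
        return "fp16"                      # first/last layers stay high precision
    if layer_idx in (0, num_layers - 1):
        return "fp16"
    if "attn" in layer_name:
        return "fp16"                      # attention kept in FP16
    if "mlp" in layer_name:
        return "int4"                      # FFN weights quantized hardest
    return "int8"                          # everything else: middle ground

print(precision_for("layers.5.mlp.up_proj"))    # int4
print(precision_for("layers.0.attn.q_proj"))    # fp16
```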
4. KV Cache Optimization¶
Multi-Query Attention (MQA)¶
Standard: num_kv_heads = num_query_heads (e.g., 32)
MQA: num_kv_heads = 1
Memory reduction: 32x smaller KV cache (1 KV head instead of 32)
Tradeoff: Slight quality degradation
Used in: Falcon, StarCoder
Grouped Query Attention (GQA)¶
num_kv_heads < num_query_heads
Example: 8 KV heads, 32 query heads (4 queries per KV)
Memory reduction: 4x smaller KV cache
Tradeoff: Minimal quality loss
Used in: LLaMA-2, Mistral, GPT-4 (rumored)
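A minimal PyTorch sketch of the GQA mechanism, assuming 32 query heads and 8 KV heads (shapes and sizes are illustrative): only the small KV tensors need to be cached, and each KV head is broadcast to its group of query heads at attention time.

```python
import torch

batch, seq, n_q_heads, n_kv_heads, head_dim = 2, 128, 32, 8, 64
group = n_q_heads // n_kv_heads                      # 4 query heads share each KV head

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)    # only this small tensor is cached
v = torch.randn(batch, n_kv_heads, seq, head_dim)    # only this small tensor is cached

# Broadcast each KV head across its group of query heads before attention
k_exp = k.repeat_interleave(group, dim=1)            # (batch, 32, seq, head_dim)
v_exp = v.repeat_interleave(group, dim=1)

scores = q @ k_exp.transpose(-2, -1) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v_exp
print(out.shape)                                     # torch.Size([2, 32, 128, 64])
```

Setting n_kv_heads = 1 in the same sketch gives MQA.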
Paged Attention (vLLM)¶
- KV cache in non-contiguous "pages" (like OS virtual memory)
- Eliminates fragmentation
- Enables ~2x higher batch size for same memory
- Compute: Slight overhead for page lookup
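The bookkeeping idea behind paging, as a toy sketch (the page size and data structures are assumptions for illustration, not vLLM's actual implementation): each sequence gets a block table mapping logical token positions to physical pages allocated on demand, so no large contiguous region has to be reserved up front.

```python
PAGE_SIZE = 16  # tokens per KV-cache page (illustrative)

class PagedKVAllocator:
    """Toy block-table bookkeeping: logical token position -> (physical page, offset)."""
    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))
        self.block_tables = {}                        # seq_id -> list of physical page ids

    def append_token(self, seq_id: int, position: int):
        table = self.block_tables.setdefault(seq_id, [])
        if position // PAGE_SIZE >= len(table):       # this sequence needs a new page
            table.append(self.free_pages.pop())
        return table[position // PAGE_SIZE], position % PAGE_SIZE

alloc = PagedKVAllocator(num_pages=64)
for pos in range(40):
    page, offset = alloc.append_token(seq_id=0, position=pos)
print(alloc.block_tables[0])                          # 3 pages cover 40 tokens
```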
Prefix Caching¶
- Cache prefixes for common prompts
- Reduces redundant computation
- Memory: Store prompt KV cache (shared across requests)
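An illustrative sketch of the lookup, assuming the cache is keyed by the literal token prefix (production systems typically hash fixed-size blocks or use a radix tree; this is a simplification):

```python
prefix_cache = {}   # token-prefix tuple -> previously computed KV cache (placeholder here)

def get_or_compute_kv(prompt_tokens):
    """Reuse the KV cache of the longest already-cached prefix of the prompt."""
    for end in range(len(prompt_tokens), 0, -1):
        prefix = tuple(prompt_tokens[:end])
        if prefix in prefix_cache:
            # Only the tokens after the cached prefix still need a prefill pass
            return prefix_cache[prefix], prompt_tokens[end:]
    return None, prompt_tokens                        # nothing cached: full prefill

system_prompt = [1, 2, 3, 4]
prefix_cache[tuple(system_prompt)] = "kv-for-system-prompt"   # stored by an earlier request
kv, remaining = get_or_compute_kv(system_prompt + [9, 8, 7])
print(kv, remaining)                                  # kv-for-system-prompt [9, 8, 7]
```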
5. Recomputation vs Caching¶
Activation Checkpointing (Training)¶
- Not used in inference (no backprop)
- Mentioned for completeness
Selective Recomputation¶
- Recompute cheap operations instead of storing
- Example: Recompute layer norm instead of caching
- Memory savings: ~10-20%
- Compute overhead: ~5-10%
6. Model Architecture Choices¶
Width vs Depth¶
Wide: Larger hidden dimension, fewer layers
- Weight memory grows quadratically with hidden size
- Less memory for KV cache (KV cache scales with layer count)
Deep: More layers, smaller hidden dimension
- Weight memory grows only linearly with depth
- More memory for KV cache (more layers)
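A back-of-the-envelope comparison of the two choices, as a sketch under simplifying assumptions (standard MHA, roughly 12 × d_model² weight parameters per transformer layer, FP16 everywhere):

```python
def rough_memory_gb(layers, d_model, seq_len=8192, batch=1, bytes_per_elem=2):
    """Rough weight memory (~12 * d_model^2 params per layer) vs KV cache memory, MHA."""
    weight_bytes = 12 * layers * d_model ** 2 * bytes_per_elem   # attention ~4d^2 + FFN ~8d^2
    kv_bytes = 2 * batch * seq_len * layers * d_model * bytes_per_elem
    return weight_bytes / 1e9, kv_bytes / 1e9

print("wide (32 layers, d=8192):", rough_memory_gb(32, 8192))   # ~same weight memory,
print("deep (64 layers, d=5792):", rough_memory_gb(64, 5792))   # ~40% more KV cache
```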
FFN Expansion Ratio¶
- Standard: d_ff = 4 × d_model
- Smaller ratio (2x or 3x): Less memory, potential quality loss
- MoE: Sparse activation, more parameters but same compute
7. Hardware-Specific Tradeoffs¶
Memory Bandwidth vs Compute¶
A100: 1,935 GB/s bandwidth, 312 TFLOPS (FP16)
H100: 3,350 GB/s bandwidth, 989 TFLOPS (FP16)
Bandwidth-to-Compute Ratio:
A100: 6.2 GB/s per TFLOP
H100: 3.4 GB/s per TFLOP
Implication: The H100 offers less bandwidth per unit of compute, so it is more often limited by memory traffic; it absorbs quantization's dequantization overhead more easily and benefits more from the reduced bytes moved
Tensor Core Utilization¶
- FP16: Full tensor core speed
- INT8: ~2x tensor-core throughput vs FP16 on Ampere/Hopper
- INT4: 4x faster (requires specialized kernels)
Tradeoff: Quantization compute overhead offset by faster matmul
8. Memory-Compute Decision Matrix¶
| Technique | Memory Saved | Compute Overhead | Quality Impact |
|---|---|---|---|
| INT8 Quantization | 2x | +10% | <1% |
| INT4 Quantization | 4x | +30% | 1-3% |
| GQA (4:1) | 4x KV cache | Minimal | <0.5% |
| MQA | 32x KV cache | Minimal | 1-2% |
| KV Cache INT8 | 2x KV cache | +5% | <0.5% |
| FlashAttention | Avoids O(S²) attention matrix | Negative (~30% faster) | None |
9. Common Interview Questions¶
Q: You have a 70B model but only 40GB GPU memory. What do you do?
Options:
1. INT4 quantization: 140GB → 35GB ✓
2. INT8 + model parallelism across 2 GPUs
3. Offload layers to CPU (slow, not recommended)
4. Use smaller model variant (13B/7B)
Q: Explain the tradeoff in GQA (Grouped Query Attention)
- Save memory: Fewer KV heads → smaller KV cache
- Minimal compute overhead: each KV head is simply shared (broadcast) across its group of query heads
- Quality: Negligible impact (<0.5% on benchmarks)
- Production: Widely adopted (Mistral, LLaMA-2)
Q: Why is decode phase memory-bound?
- Single token generation: Low arithmetic intensity
- Must fetch entire weight matrix from memory
- Memory bandwidth saturated, compute underutilized
- Arithmetic Intensity: FLOPs / Bytes Loaded ≈ 1-2 (very low)
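To make the intensity estimate concrete, a quick sketch for a single decode-step matrix-vector product (dimensions are illustrative):

```python
def decode_gemv_intensity(d_in=8192, d_out=8192, bytes_per_weight=2):
    """Arithmetic intensity of applying one weight matrix to a single token (GEMV)."""
    flops = 2 * d_in * d_out                        # one multiply-add per weight
    bytes_moved = d_in * d_out * bytes_per_weight   # the whole weight matrix must be read
    return flops / bytes_moved

print(decode_gemv_intensity())                      # 1.0 FLOP/byte with FP16 weights
print(decode_gemv_intensity(bytes_per_weight=1))    # 2.0 with INT8: fewer bytes per FLOP
```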
Q: When does quantization hurt performance?
- Large batch sizes / prefill: The workload is already compute-bound, so the bandwidth savings matter less
- Compute-bound workloads: Adding dequantization compute makes them slower
- Old hardware: No INT8 tensor core support
- Conversely: The memory-bound decode phase on modern GPUs (e.g., H100) generally benefits from quantization
Q: Calculate KV cache size: LLaMA-2-70B, batch=16, seq=4096, FP16
GQA with 8 KV heads (70B uses this)
2 × 16 × 4096 × 80 layers × 8 KV_heads × 128 head_dim × 2 bytes
= 21,474,836,480 bytes ≈ 21.5 GB total (≈ 1.34 GB per sequence)
(If standard MHA with all 64 heads cached: 8× more, ≈ 172 GB, impractical!)
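This can be checked with the kv_cache_bytes helper sketched in section 2:

```python
total = kv_cache_bytes(batch=16, seq_len=4096, layers=80, kv_heads=8, head_dim=128)
print(f"{total / 1e9:.1f} GB")      # ~21.5 GB with GQA (8 KV heads)
print(f"{total * 8 / 1e9:.1f} GB")  # ~171.8 GB if all 64 heads were cached (standard MHA)
```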
Q: How does FlashAttention affect memory-compute tradeoff?
- Reduces memory: Avoids materializing full attention matrix
- Reduces compute time: Fused kernel, better cache locality
- Win-win: Memory AND compute improvement
- No quality impact (mathematically equivalent)
10. Modern Techniques (2024-2025)¶
AWQ (Activation-Aware Weight Quantization)¶
- Protects the small fraction of weights that see high-magnitude activations
- Better quality than naive INT4
- Used in production (Hugging Face TGI)
SmoothQuant¶
- Migrates quantization difficulty from activations (which have outliers) to weights
- Enables better INT8 quantization
- Particularly for older models not trained for quantization
FP8 (H100)¶
- Native FP8 support on Hopper
- 2x memory saving vs FP16
- Minimal quality loss
- Compute: Faster than FP16 (2x with tensor cores)
QuIP# / AQLM¶
- Extreme quantization (2-3 bits)
- Lattice-based, better than naive 2-bit
- Research stage, not production yet
11. Practical Guidelines¶
- Start with INT8: Minimal quality loss, 2x memory saving
- Use GQA architecture: If designing new models
- Enable KV cache quantization: Production-ready in vLLM
- FlashAttention is mandatory: No downside
- INT4 for large models: When GPU memory is the constraint
- Monitor quality: Always benchmark on your task
12. Key Takeaways¶
- Many memory optimizations have negligible compute cost (GQA, FlashAttention)
- Quantization is a clear win on modern hardware (INT8 tensor cores)
- KV cache often dominates memory in long-context scenarios
- Decode phase is memory-bound: Reducing memory access helps latency
- Hardware matters: H100 handles quantization overhead better than A100