Memory Considerations for LLM Training and Inference¶
1. GPU and CPU Memory and Transfer Rates¶
1.1 GPU Memory¶
GPU memory, or HBM (High Bandwidth Memory), is stacked DRAM physically attached to the GPU package, designed to feed thousands of parallel compute cores efficiently.
Key characteristics:
- Very high bandwidth
- Low access latency relative to CPU memory
- Limited capacity compared to system RAM
Typical values for NVIDIA A100:
- Capacity: 40GB or 80GB HBM2e
- Peak bandwidth: ~1.6 TB/s
GPU memory stores:
- Model weights
- Activations
- Gradients
- Optimizer states
- KV cache during inference
1.2 CPU Memory¶
CPU memory refers to system RAM (DDR4 or DDR5) located on the motherboard.
Key characteristics:
- Much larger capacity (64GB to 1TB+)
- Significantly lower bandwidth (~50 to 100 GB/s per socket)
- Higher latency compared to GPU memory
CPU memory is used for:
- Data loading and preprocessing
- Checkpoint storage before transfer
- Offloaded parameters or optimizer states (ZeRO-Offload, Paged Adam)
1.3 GPU to CPU Transfer Rates¶
Data movement between GPU and CPU happens over interconnects.
Approximate peak transfer rates:
- PCIe Gen4: ~32 GB/s
- PCIe Gen5: ~64 GB/s
- NVLink (A100): ~300 GB/s per direction (primarily GPU-to-GPU; CPU-side NVLink exists only on select platforms)
Key insight: Even with NVLink, transfer bandwidth is far lower than on-device GPU memory bandwidth (1.6 TB/s), making frequent transfers expensive.
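To make the gap concrete, here is a back-of-envelope sketch in plain Python. The bandwidth figures are the peak rates quoted above, `transfer_seconds` is an illustrative helper, and real transfers always fall short of peak:

```python
# Idealized time to move a 14 GB BF16 weight set (7B params x 2 bytes)
# at the peak rates quoted above. Real transfers achieve less than peak.
WEIGHTS_GB = 7e9 * 2 / 1e9  # 14.0 GB

def transfer_seconds(size_gb: float, bandwidth_gb_s: float) -> float:
    """Best-case transfer time at a given peak bandwidth."""
    return size_gb / bandwidth_gb_s

pcie4 = transfer_seconds(WEIGHTS_GB, 32)    # PCIe Gen4: ~0.44 s
nvlink = transfer_seconds(WEIGHTS_GB, 300)  # NVLink:    ~0.05 s
hbm = transfer_seconds(WEIGHTS_GB, 1600)    # HBM read:  ~0.009 s
print(f"PCIe4 {pcie4:.3f}s, NVLink {nvlink:.3f}s, HBM {hbm:.4f}s")
```

At these idealized rates, PCIe Gen4 is ~50× slower than streaming the same bytes from HBM, which is why frequent host-device transfers dominate runtime.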
2. Bandwidth, Latency, and Compute¶
2.1 Bandwidth vs Latency vs Compute¶
| Metric | Definition | Typical Values (A100) | Matters For |
|---|---|---|---|
| Bandwidth | Data transfer rate | ~1.6 TB/s (HBM) | Large tensor operations |
| Latency | Time to first byte | ~100-200 ns (HBM) | Small tensor ops, kernel launch |
| Compute | Arithmetic capability | ~312 TFLOPS (BF16) | Matrix multiplications |
In practice:
- Many LLM workloads are memory bandwidth bound, not compute bound
- Compute units often wait on memory due to limited data reuse
2.2 Compute vs Memory Bound Regimes¶
Arithmetic intensity determines the bottleneck: $$ \text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes transferred}} $$
Transformers are often memory bound during:
- Attention (especially long sequences)
- Layer normalization
- Optimizer updates
- Embedding lookups
Compute bound operations:
- Large matrix multiplications (when batch size and dimensions are large)
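This split can be checked numerically. The sketch below (plain Python; the A100 figures come from the table in Section 2.1, and `matmul_intensity` is an illustrative helper) compares a matmul's arithmetic intensity against the hardware "ridge point" above which an op becomes compute-bound:

```python
# Roofline sketch: an op is compute-bound when its arithmetic intensity
# (FLOPs per byte moved) exceeds the hardware ridge point.
PEAK_FLOPS = 312e12  # A100 BF16 peak
PEAK_BW = 1.6e12     # A100 HBM bytes/s
RIDGE = PEAK_FLOPS / PEAK_BW  # ~195 FLOPs/byte

def matmul_intensity(m: int, n: int, k: int, bytes_per_el: int = 2) -> float:
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n] in BF16."""
    flops = 2 * m * n * k                              # multiply-accumulates
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)  # read A, B; write C
    return flops / bytes_moved

big = matmul_intensity(4096, 4096, 4096)  # ~1365 -> compute-bound
small = matmul_intensity(1, 4096, 4096)   # ~1    -> memory-bound
```

A 4096³ matmul sits far above the ridge (compute-bound), while the batch-1 matrix-vector product typical of autoregressive decoding sits near 1 FLOP/byte, deep in the memory-bound regime.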
3. Training Memory for a 7B Model on a Single A100¶
3.1 Memory Components During Training¶
Training requires four major components in GPU memory:
- Model weights
- Gradients
- Optimizer states
- Activations
Quantifying each component:
Let \(P = 7 \times 10^9\) parameters, BF16 for weights/gradients (2 bytes), FP32 for optimizer (4 bytes).
3.1.1 Model Weights¶
BF16 weights take 2 bytes per parameter: $$ M_{\text{weights}} = P \times 2 \text{ bytes} = 7 \times 10^9 \times 2 = 14 \text{ GB} $$
3.1.2 Gradients¶
One BF16 gradient per parameter: $$ M_{\text{grads}} = P \times 2 \text{ bytes} = 14 \text{ GB} $$
3.1.3 Optimizer States (Adam/AdamW)¶
Adam maintains two FP32 states per parameter (momentum and variance): $$ M_{\text{optim}} = P \times 2 \times 4 \text{ bytes} = 56 \text{ GB} $$
Running total: 14 + 14 + 56 = 84 GB (already exceeds the 80 GB A100!)
3.1.4 Activations¶
Activation memory depends on the number of layers \(L\), batch size \(B\), sequence length \(S\), and hidden dimension \(d\): $$ M_{\text{act}} \approx C \times L \times B \times S \times d \times 2 \text{ bytes} $$
Where \(C\) is a constant (typically 10-20 without gradient checkpointing).
For 7B model:
- Without checkpointing: ~32 GB
- With checkpointing: ~2-5 GB
3.1.5 Total Training Memory¶
Without gradient checkpointing: ~116 GB
With gradient checkpointing: ~89 GB
This explains why full fine-tuning of a 7B model doesn't fit on a single A100.
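The accounting above can be wrapped in a small estimator. This is a sketch using the byte counts from Section 3.1; `training_memory_gb` is a hypothetical helper, not a library API:

```python
def training_memory_gb(params: float = 7e9, weight_bytes: int = 2,
                       grad_bytes: int = 2, optim_bytes: int = 8,
                       activations_gb: float = 32.0) -> float:
    """Sum the four training-time components from Section 3.1.
    optim_bytes=8 models Adam's two FP32 states per parameter."""
    weights = params * weight_bytes / 1e9
    grads = params * grad_bytes / 1e9
    optim = params * optim_bytes / 1e9
    return weights + grads + optim + activations_gb

no_ckpt = training_memory_gb(activations_gb=32.0)  # ~116 GB
ckpt = training_memory_gb(activations_gb=5.0)      # ~89 GB
```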
3.2 Techniques That Enable Feasible Training¶
Common strategies:
| Technique | Memory Savings | Trade-off |
|---|---|---|
| LoRA | 100-1000× (trainable params) | Slightly lower quality |
| Gradient checkpointing | 80-90% (activations) | +25-33% compute time |
| 8-bit optimizers | 75% (optimizer states) | Minimal quality loss |
| Gradient accumulation | 50-75% (activations) | Longer iteration time |
| Mixed precision (BF16) | 50% (weights/grads/acts) | Requires compatible hardware |
Typical combination for 7B on 80GB A100:
- BF16 mixed precision
- Gradient checkpointing
- 8-bit Adam
- Batch size 1-2 with gradient accumulation
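A rough budget for this combination, assuming an 8-bit Adam keeps two 1-byte states per parameter and checkpointed activations take ~5 GB (both figures from the sections above; this is arithmetic, not a measured profile):

```python
# Hypothetical budget: BF16 weights + BF16 grads + 8-bit Adam states
# (two 1-byte states per param) + ~5 GB of checkpointed activations.
P = 7e9
budget_gb = (P * 2 + P * 2 + P * 2 * 1) / 1e9 + 5.0  # 14 + 14 + 14 + 5 = 47
fits_a100_80gb = budget_gb < 80  # True, with headroom for buffers
```

The ~47 GB total leaves meaningful headroom on an 80 GB A100 for fragmentation, temporary buffers, and a modest batch size.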
3.3 CPU Offloading¶
What can be offloaded:
- Optimizer states (ZeRO-Offload, Paged Adam)
- Parameters (ZeRO-Infinity with NVMe)
Trade-offs:
✅ Enables fitting larger models
❌ Severely reduces training throughput (50-100× slower optimizer step)
❌ Often impractical for production pretraining
When to use: Single-GPU fine-tuning with memory constraints.
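A quick estimate shows why offloaded optimizer steps are slow. Assume the 56 GB of FP32 Adam states from Section 3.1 must cross PCIe Gen4 in both directions each step (a simplification; real systems overlap transfers with compute and prefetch pages):

```python
# Transfer cost of one offloaded optimizer step: read the Adam states
# from host memory, then write the updated states back.
OPTIM_GB = 7e9 * 8 / 1e9  # 56 GB of FP32 momentum + variance
PCIE4_GBS = 32
round_trip_s = 2 * OPTIM_GB / PCIE4_GBS  # 3.5 s of pure transfer per step
```

Several seconds of pure transfer per step dwarfs the optimizer arithmetic itself, which explains the throughput collapse noted above.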
4. Memory Differences Between Training and Inference¶
4.1 Training vs Inference Memory Profile¶
| Component | Training | Inference | Notes |
|---|---|---|---|
| Weights | 14 GB | 14 GB | Same |
| Gradients | 14 GB | 0 GB | No backward pass |
| Optimizer | 56 GB | 0 GB | Not needed |
| Activations | 5-32 GB | 1-2 GB | No need to store for backprop |
| KV cache | 0 GB | 5-20 GB | Autoregressive decoding |
| Total | 89-116 GB | 20-36 GB | 3-5× difference |
Key insight: Training requires 3-5× more memory than inference for the same model.
Why inference fits models that training cannot:
- No gradients
- No optimizer states
- Minimal activation storage
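The worst-case columns of the table above, restated as a sketch (values copied from the table; the dict names are illustrative):

```python
# Worst-case memory totals (GB) from the training-vs-inference table.
train = {"weights": 14, "grads": 14, "optim": 56, "acts": 32}
infer = {"weights": 14, "acts": 2, "kv_cache": 20}
ratio = sum(train.values()) / sum(infer.values())  # ~3.2x
```

The best-case pair (89 GB vs 20 GB) gives ~4.5×, bracketing the 3-5× range quoted above.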
5. KV Cache Memory Usage¶
KV cache is an inference-time concern. For KV cache optimizations (PagedAttention, Prefix Caching, GQA/MQA), see the LLM Inference Speed repo. This section covers only the memory accounting relevant for hardware planning.
5.1 What Is the KV Cache¶
The KV cache stores key and value tensors from attention for previously processed tokens during autoregressive decoding.
Purpose: Avoid recomputing attention over the full context for each new token.
Speed-up: per-token attention cost drops from \(O(S^2)\) (recomputing K/V over the whole prefix) to \(O(S)\) at context length \(S\).
5.2 Which Memory Does KV Cache Use¶
KV cache resides in:
✅ GPU memory (during standard inference)
❌ CPU memory (only in specialized offloading setups)
Why GPU? Attention computation requires fast access; CPU storage would destroy throughput.
5.3 KV Cache Memory Scaling¶
KV cache memory formula: $$ M_{\text{KV}} = 2 \times L \times B \times S \times d \times \text{bytes per element} $$
Where:
- Factor of 2: both K and V
- \(L\): number of layers
- \(B\): batch size
- \(S\): sequence length
- \(d\): hidden dimension
Example (7B LLaMA, BF16):
For a single request (\(B=1\)) at sequence length 4096: $$ M_{\text{KV}} = 2 \times 32 \times 1 \times 4096 \times 4096 \times 2 \text{ bytes} \approx 2.15 \text{ GB} $$
For batch size 8: $$ M_{\text{KV}} = 2.15 \times 8 = 17.2 \text{ GB} $$
This makes KV cache the dominant memory consumer during long-context inference.
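The formula as a helper, with defaults set to the LLaMA-7B shapes from the example (\(L=32\), \(d=4096\), BF16); `kv_cache_gb` is an illustrative name:

```python
def kv_cache_gb(layers: int = 32, batch: int = 1, seq: int = 4096,
                hidden: int = 4096, bytes_per_el: int = 2) -> float:
    """M_KV = 2 (K and V) x L x B x S x d x bytes per element."""
    return 2 * layers * batch * seq * hidden * bytes_per_el / 1e9

single = kv_cache_gb()           # ~2.15 GB
batched = kv_cache_gb(batch=8)   # ~17.2 GB
```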
5.4 KV Cache Management Techniques¶
Problem: KV cache grows linearly with generated tokens and batch size.
Solutions:
| Technique | Memory Reduction | Notes |
|---|---|---|
| Sliding window | Caps at fixed size | Only keep last W tokens |
| PagedAttention (vLLM) | 10-30% | Better packing, shared prefixes |
| Grouped-Query Attention (GQA) | 4-32× | Fewer KV heads (requires model change) |
| KV cache quantization (INT8) | 50% | Store K/V in INT8 |
GQA example (7B model, seq=4096):
- Multi-Head (32 KV heads): 2.15 GB
- GQA (8 KV heads): 0.54 GB (4× reduction)
- MQA (1 KV head): 0.07 GB (32× reduction)
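The same accounting with the KV-head count made explicit, assuming 32 query heads with head_dim 128 (so \(d = 32 \times 128 = 4096\), matching the 7B example); the function name is illustrative:

```python
def kv_cache_gqa_gb(kv_heads: int, layers: int = 32, batch: int = 1,
                    seq: int = 4096, head_dim: int = 128,
                    bytes_per_el: int = 2) -> float:
    """KV cache scales with KV heads, not query heads."""
    return 2 * layers * batch * seq * kv_heads * head_dim * bytes_per_el / 1e9

mha = kv_cache_gqa_gb(32)  # ~2.15 GB (multi-head)
gqa = kv_cache_gqa_gb(8)   # ~0.54 GB (4x reduction)
mqa = kv_cache_gqa_gb(1)   # ~0.07 GB (32x reduction)
```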
6. Multi-GPU Memory Considerations¶
6.1 Data Parallelism (DP)¶
Memory per GPU:
- Each GPU holds full model copy
- Activations/gradients split across batch
Limitation: Model must fit on single GPU.
6.2 ZeRO (Zero Redundancy Optimizer)¶
ZeRO partitions optimizer states, gradients, and parameters:
| Stage | Partitions | Memory Savings (vs DP) |
|---|---|---|
| ZeRO-1 | Optimizer states | 4× |
| ZeRO-2 | + Gradients | 8× |
| ZeRO-3 | + Parameters | Up to \(N_{\text{GPUs}}\)× |
Example (7B model, 8 GPUs, ZeRO-3):
Per GPU memory: $$ \frac{14 + 14 + 56}{8} = 10.5 \text{ GB} $$
Enables training on GPUs with <16 GB memory.
Trade-off: More communication (all-gather parameters each layer) vs Data Parallel.
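The per-GPU figure as a sketch: weights and gradients in BF16 (2 bytes each) plus Adam states (8 bytes) per parameter, all sharded across GPUs. Activations are not partitioned by ZeRO and are excluded here:

```python
def zero3_per_gpu_gb(params: float = 7e9, n_gpus: int = 8) -> float:
    """ZeRO-3: weights (2B) + grads (2B) + Adam states (8B), sharded."""
    return params * (2 + 2 + 8) / 1e9 / n_gpus

per_gpu = zero3_per_gpu_gb()  # 84 GB / 8 = 10.5 GB (activations extra)
```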
6.3 Tensor and Pipeline Parallelism¶
Tensor Parallelism (TP):
- Model weights sharded across GPUs
- Each GPU computes a slice (e.g., split attention heads)
Pipeline Parallelism (PP):
- Layers distributed across GPUs
- Sequential processing with pipeline stages
Both reduce per-GPU memory but require careful implementation.
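For intuition, the weight footprint under tensor parallelism of degree \(t\) is roughly the BF16 weight bytes divided by \(t\) (a sketch that ignores activations, communication buffers, and any unsharded layers such as embeddings):

```python
# Approximate per-GPU weight memory under t-way tensor parallelism.
def tp_weights_gb(params: float = 7e9, bytes_per_el: int = 2,
                  tp_degree: int = 4) -> float:
    return params * bytes_per_el / 1e9 / tp_degree

four_way = tp_weights_gb(tp_degree=4)  # 14 GB / 4 = 3.5 GB per GPU
```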
7. Summary¶
Key takeaways:
Memory Hierarchy¶
- GPU HBM: ~1.6 TB/s, 40-80 GB capacity (keep hot data here)
- CPU RAM: ~50-100 GB/s, 64 GB-1 TB+ capacity (offload cold data)
- PCIe: ~32 GB/s (minimize transfers)
Training vs Inference¶
- Training: 3-5× more memory (gradients + optimizer states dominate)
- Inference: KV cache dominates for long contexts
- Can infer models that won't fit for training
Compute vs Memory Bound¶
- Most Transformer ops are memory-bound (attention, LayerNorm, embeddings)
- Only large matrix multiplications are compute-bound
- Optimize for memory bandwidth, not just FLOPs
Key Optimization Techniques¶
- Gradient checkpointing: 80-90% activation memory reduction
- 8-bit optimizers: 75% optimizer state reduction
- LoRA: 100-1000× trainable parameter reduction
- ZeRO-3: up to \(N_{\text{GPUs}}\)× memory reduction in multi-GPU training
- GQA/MQA: 4-32× KV cache reduction
- PagedAttention: 10-30% better KV cache utilization
Design Principles¶
- Keep hot data on GPU (avoid CPU-GPU transfers)
- Use gradient checkpointing for memory-constrained training
- Consider LoRA/PEFT before full fine-tuning
- For serving: use vLLM, GQA models, and batching
- For multi-GPU: ZeRO-3 if memory-limited, DP if throughput-critical