Inference Basics
1. Core Concepts¶
Autoregressive Generation¶
- LLMs generate tokens sequentially: P(token_t | token_1, ..., token_{t-1})
- Each token requires full model forward pass
- Output of step t becomes input for step t+1
Two-Phase Inference¶
Prefill Phase (Prompt Processing)
- Process entire input prompt in parallel
- Compute KV cache for all input tokens
- Computationally intensive, compute-bound
- Time complexity: O(n²d) for n tokens, d dimensions
Decode Phase (Token Generation)
- Generate one token at a time
- Reuse cached KV from previous tokens
- Memory-bound operation (fetching weights/KV cache)
- Continues until EOS token or max length
Key Metrics¶
Time to First Token (TTFT) = Prefill time
Time Per Output Token (TPOT) = Average decode time per token
Total Latency = TTFT + (num_output_tokens × TPOT)
2. Model Architecture Components¶
Transformer Blocks¶
- Multi-head self-attention: O(n²d) complexity
- Feed-forward network: O(nd_ff) where d_ff ≈ 4d
- Layer normalization
- Residual connections
KV Cache¶
- Stores key/value matrices from previous tokens
- Size per layer: 2 × batch_size × seq_len × num_heads × head_dim × 2 bytes (FP16)
- Example: LLaMA-2-7B with seq_len=2048, batch=1
- Per layer: 2 × 1 × 2048 × 32 × 128 × 2 ≈ 33 MB
- Total (32 layers): ~1 GB
3. Memory Requirements¶
Total Memory = Model Weights + KV Cache + Activations + Overhead
Model Weights = num_params × bytes_per_param
- FP32: 4 bytes, FP16: 2 bytes, INT8: 1 byte, INT4: 0.5 bytes
Activations = temporary tensors during forward pass
Overhead = CUDA context, fragmentation (~10-20%)
Example: LLaMA-2-7B (FP16)
- Weights: 7B × 2 = 14 GB
- KV cache (batch=1, seq=2048): ~1 GB
- Activations: ~0.5-1 GB
- Total: ~16-17 GB
4. Common Interview Questions¶
Q: Why is prefill compute-bound and decode memory-bound?
- Prefill: Process many tokens in parallel → high arithmetic intensity, GPU cores saturated
- Decode: Generate 1 token → fetch entire model weights from memory, low compute utilization
Q: How does batch size affect inference?
- Prefill: Higher batch increases compute, remains compute-bound
- Decode: Higher batch increases memory for KV cache, can become compute-bound with large batches
- Sweet spot: Balance between throughput and latency
Q: What limits maximum sequence length?
- KV cache memory grows linearly with sequence length
- Attention computation grows quadratically O(n²)
- GPU memory capacity is primary constraint
Q: Calculate memory for Mistral-7B (FP16) with batch=4, seq=4096?
Weights: 7B × 2 = 14 GB
KV cache: 2 × 4 × 4096 × 32 × 128 × 2 × 32 layers ≈ 8 GB
Total: ~22-24 GB
Q: Why can't we parallelize token generation?
- Each token depends on all previous tokens
- Autoregressive dependency prevents parallelization
- Speculative decoding attempts to work around this
5. Modern Optimizations (2024-2025)¶
- Grouped Query Attention (GQA): Reduce KV cache by sharing KV heads
- Multi-Query Attention (MQA): Single KV head for all queries
- FlashAttention-3: Fused attention kernel, 2x faster on H100
- Paged Attention (vLLM): Non-contiguous KV cache storage
- Continuous Batching: Dynamic batch assembly for throughput
6. Key Takeaways¶
- Inference has distinct prefill (parallel) and decode (sequential) phases
- KV cache is crucial for avoiding recomputation but consumes significant memory
- Memory bandwidth is often the bottleneck during decode
- Model size, sequence length, and batch size determine memory requirements
- Understanding the compute vs memory bound distinction is critical