Quantization Basics
1. Core Concept
Quantization reduces model precision from FP32/FP16 to lower-bit representations (INT8, INT4) to decrease memory footprint and increase inference speed.
Key Formula: Q(x) = round(x/S) + Z, where S = scale and Z = zero-point; dequantization recovers x ≈ S · (Q(x) − Z)
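A minimal NumPy sketch of that round trip (the `quantize`/`dequantize` helpers and the sample scale/zero-point values are illustrative, not a library API):

```python
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map float values to INT8: Q(x) = round(x/S) + Z, clipped to the INT8 range."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    """Recover an approximation of the original floats: x ≈ S * (Q(x) - Z)."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-1.2, 0.0, 0.7, 2.5], dtype=np.float32)
scale, zero_point = 0.05, 10          # toy values chosen by hand for illustration
q = quantize(x, scale, zero_point)    # [-14, 10, 24, 60]
x_hat = dequantize(q, scale, zero_point)
print(q, x_hat)                       # x_hat matches x up to rounding error
```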
2. Types
Post-Training Quantization (PTQ)
- Applied after training
- No retraining needed
- Calibration dataset required
- Common methods: MinMax, Percentile, MSE (see the MinMax sketch below)
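A minimal sketch of the calibration step, assuming MinMax calibration for asymmetric UINT8 (the `minmax_calibrate` helper and the random batches are illustrative, not a specific library API):

```python
import numpy as np

def minmax_calibrate(calibration_batches, qmin=0, qmax=255):
    """Derive an asymmetric scale/zero-point from observed min/max activation values."""
    lo = min(float(b.min()) for b in calibration_batches)
    hi = max(float(b.max()) for b in calibration_batches)
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point

# Toy calibration data standing in for activations collected over a few batches.
# A Percentile variant would use np.percentile(...) instead of raw min/max to clip outliers.
batches = [np.random.randn(32, 64).astype(np.float32) for _ in range(8)]
scale, zero_point = minmax_calibrate(batches)
print(scale, zero_point)
```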
Quantization-Aware Training (QAT)
- Simulates quantization during training
- Better accuracy but requires full training
- Fake quantization (quantize-dequantize) nodes in the forward pass (see the sketch below)
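A minimal sketch of a fake-quantization node with a straight-through estimator in PyTorch, assuming symmetric per-tensor INT8 (this mirrors the idea, not any framework's exact internals):

```python
import torch

def fake_quantize(x, num_bits=8):
    """Quantize-dequantize in the forward pass; gradients pass straight through."""
    qmax = 2 ** (num_bits - 1) - 1                    # 127 for INT8
    scale = x.detach().abs().max() / qmax             # symmetric per-tensor scale
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # Straight-through estimator: forward uses the quantized value,
    # backward treats the quantization step as the identity.
    return x + (x_q - x).detach()

w = torch.randn(4, 4, requires_grad=True)
loss = fake_quantize(w).sum()
loss.backward()
print(w.grad)  # gradients flow as if quantization were the identity
```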
3. Quantization Schemes
Symmetric: Zero-point = 0, range = [-127, 127] for INT8
Asymmetric: Zero-point ≠ 0, range = [0, 255] for UINT8
Per-Tensor: Single scale for entire tensor
Per-Channel: Different scale per output channel (better accuracy; see the comparison sketch below)
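A small sketch contrasting per-tensor and per-channel symmetric scales on a random weight matrix (illustrative; assumes output channels lie along dim 0):

```python
import numpy as np

w = np.random.randn(8, 64).astype(np.float32)        # (out_channels, in_features)

# Per-tensor: a single symmetric scale for the whole matrix.
scale_tensor = np.abs(w).max() / 127

# Per-channel: one symmetric scale per output channel (row).
scale_channel = np.abs(w).max(axis=1, keepdims=True) / 127

def quant_error(w, scale):
    """Mean absolute error after a symmetric INT8 quantize-dequantize round trip."""
    q = np.clip(np.round(w / scale), -127, 127)
    return np.abs(w - q * scale).mean()

print(quant_error(w, scale_tensor), quant_error(w, scale_channel))
# Per-channel error is typically lower because each row uses its own range.
```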
4. Memory Savings
- FP32 → INT8: 4× reduction
- FP32 → INT4: 8× reduction (a worked size example follows this list)
- Attention and FFN layers: Primary targets
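A quick worked example of those ratios, assuming a hypothetical 7B-parameter model and ignoring the small overhead of scales and zero-points:

```python
# Bytes per parameter = bits / 8; total size = parameter count * bytes per parameter.
params = 7e9
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = params * bits / 8 / 1e9
    print(f"{name}: {gb:.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB
```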
5. Common Interview Questions
Q1: Why does quantization work for LLMs? A: LLM weight and activation distributions are roughly bell-shaped and cluster near zero, so a small set of quantization levels covers most of the mass. Most information is carried in relative magnitudes rather than absolute precision.
Q2: What's the difference between static and dynamic quantization? A: Static uses a calibration dataset to fix activation scales offline. Dynamic computes activation scales at runtime for each input, which adds per-inference overhead but adapts better to varying activation ranges.
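As a concrete illustration, PyTorch ships a dynamic-quantization API that stores INT8 weights and computes activation scales per forward pass (a minimal sketch; supported modules and dtypes vary by version):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Dynamic quantization: Linear weights become INT8 offline,
# activation scales are computed at runtime for each input.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
out = quantized(torch.randn(1, 512))
```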
Q3: Which layers are hardest to quantize? A: Layer normalization and first/last layers are most sensitive. Activations often need higher precision than weights.
Q4: How do you measure quantization quality? A: Perplexity on validation set, task-specific metrics (accuracy, F1), and activation distribution analysis (KL divergence).
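A minimal sketch of the distribution check, comparing FP32 activations against their quantize-dequantize round trip with a histogram-based KL divergence (all values here are synthetic):

```python
import numpy as np

acts_fp32 = np.random.randn(10_000).astype(np.float32)            # reference activations
scale = np.abs(acts_fp32).max() / 127
acts_deq = np.round(acts_fp32 / scale).clip(-127, 127) * scale     # quantize-dequantize

# Histogram both distributions over the same bins and compute KL(p || q).
bins = np.linspace(acts_fp32.min(), acts_fp32.max(), 129)
p, _ = np.histogram(acts_fp32, bins=bins)
q, _ = np.histogram(acts_deq, bins=bins)
p = p / p.sum() + 1e-10
q = q / q.sum() + 1e-10
kl = np.sum(p * np.log(p / q))
print(kl)  # lower KL divergence = quantized activations closer to the FP32 distribution
```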
Q5: What's the typical accuracy drop for INT8 quantization? A: Well-executed INT8 PTQ: <1% degradation. INT4: 2-5% depending on model size and method.