
Quantization Basics

1. Core Concept

Quantization reduces model precision from FP32/FP16 to lower-bit integer representations (INT8, INT4), shrinking memory footprint and speeding up inference.

Key Formulas: quantize Q(x) = round(x/S) + Z; dequantize x̂ = S · (Q(x) − Z), where S = scale, Z = zero-point
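The two formulas above can be sketched directly in NumPy. This is a minimal illustration with toy scale/zero-point values, not a production kernel:

```python
import numpy as np

def quantize(x, S, Z):
    """Map float values to INT8 codes: Q(x) = round(x/S) + Z."""
    q = np.round(x / S) + Z
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q, S, Z):
    """Approximate recovery: x_hat = S * (Q(x) - Z)."""
    return S * (q.astype(np.float32) - Z)

x = np.array([0.1, -0.5, 1.2], dtype=np.float32)
S, Z = 0.01, 0  # toy scale and zero-point
q = quantize(x, S, Z)
x_hat = dequantize(q, S, Z)
```

Note the round trip is lossy in general; the reconstruction error is bounded by S/2 for in-range values.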



2. Types

Post-Training Quantization (PTQ)

  • Applied after training
  • No retraining needed
  • Calibration dataset required
  • Common methods: MinMax, Percentile, MSE
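MinMax, the simplest of the calibration methods listed above, can be sketched as follows. The function name and the use of synthetic Gaussian "activations" are illustrative assumptions, not a library API:

```python
import numpy as np

def minmax_calibrate(samples, qmin=0, qmax=255):
    """Asymmetric MinMax calibration: choose S and Z so the observed
    [min, max] of the calibration data maps onto [qmin, qmax]."""
    x_min, x_max = float(samples.min()), float(samples.max())
    S = (x_max - x_min) / (qmax - qmin)
    Z = int(round(qmin - x_min / S))
    return S, Z

# stand-in for activations collected on a calibration dataset
rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=10_000).astype(np.float32)
S, Z = minmax_calibrate(acts)
```

Percentile and MSE calibration differ only in how they pick the clipping range: percentile discards outliers before the min/max step, and MSE searches for the range minimizing reconstruction error.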

Quantization-Aware Training (QAT)

  • Simulates quantization during training
  • Better accuracy but requires full training
  • Fake quantization nodes in forward pass
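A fake-quantization node simply quantizes and immediately dequantizes, so the forward pass experiences rounding error while all tensors stay in float. A minimal sketch of the forward computation (real QAT frameworks additionally use a straight-through estimator so gradients pass through the round as identity):

```python
import numpy as np

def fake_quant(x, S, Z, qmin=-128, qmax=127):
    """Quantize-then-dequantize: the output is float, but carries
    the same rounding/clipping error a real INT8 pass would."""
    q = np.clip(np.round(x / S) + Z, qmin, qmax)
    return S * (q - Z)

x = np.linspace(-1.0, 1.0, 11)
y = fake_quant(x, S=1 / 127, Z=0)
```

During training, these nodes wrap weights and activations so the network learns parameters that are robust to quantization error.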


3. Quantization Schemes

Symmetric: Zero-point = 0, range = [-127, 127] for INT8
Asymmetric: Zero-point ≠ 0, range = [0, 255] for UINT8

Per-Tensor: Single scale for entire tensor
Per-Channel: Different scale per output channel (better accuracy)
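The per-tensor vs. per-channel trade-off can be shown with symmetric scales on a toy weight matrix whose rows differ widely in magnitude (the row-per-output-channel layout here is an assumption for illustration):

```python
import numpy as np

def symmetric_scales(w, per_channel=False):
    """Symmetric scheme: zero-point is 0, S = max|w| / 127."""
    if per_channel:
        # one scale per output channel (row), kept broadcastable
        return np.abs(w).max(axis=1, keepdims=True) / 127.0
    return np.abs(w).max() / 127.0

def quant_dequant(w, S):
    """Round-trip through INT8 codes to expose the rounding error."""
    q = np.clip(np.round(w / S), -127, 127)
    return q * S

# rows with very different magnitudes favour per-channel scales
w = np.array([[0.01, -0.02],
              [1.00, -2.00]], dtype=np.float32)
err_tensor = np.abs(quant_dequant(w, symmetric_scales(w)) - w).mean()
err_channel = np.abs(quant_dequant(w, symmetric_scales(w, per_channel=True)) - w).mean()
```

With a single per-tensor scale, the small-magnitude row is forced onto the same grid as the large one and loses most of its precision; per-channel scales avoid this.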



4. Memory Savings

  • FP32 → INT8: 4× reduction
  • FP32 → INT4: 8× reduction
  • Attention and FFN layers: Primary targets
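The savings above are straightforward arithmetic over the parameter count. A quick sanity check, using a hypothetical 7B-parameter model and ignoring the (small) overhead of scales and zero-points:

```python
def model_bytes(n_params, bits):
    """Approximate weight storage in bytes, ignoring the per-group
    scale/zero-point metadata that real formats also store."""
    return n_params * bits / 8

n = 7_000_000_000  # e.g. a 7B-parameter model
fp32 = model_bytes(n, 32)  # 28 GB
int8 = model_bytes(n, 8)   #  7 GB
int4 = model_bytes(n, 4)   #  3.5 GB
```

In practice the metadata and any layers kept in higher precision make the realized ratio slightly below the ideal 4× / 8×.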


5. Common Interview Questions

Q1: Why does quantization work for LLMs? A: Weight and activation distributions are roughly bell-shaped and concentrated near zero, so a small number of integer levels covers most of the probability mass. Model quality depends more on relative magnitudes than on absolute precision.


Q2: What's the difference between static and dynamic quantization? A: Static uses calibration data to determine activation scales offline; dynamic computes activation scales at runtime, adding overhead but adapting better to varied inputs. Weights are quantized offline in both cases.
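The dynamic case can be sketched as an integer matmul where the weight scale was fixed offline but the activation scale is derived from the live batch (function names here are illustrative, not a framework API):

```python
import numpy as np

def quantize_weights(w):
    """Offline symmetric INT8 weight quantization."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dynamic_quant_matmul(x, w_q, w_scale):
    """Dynamic quantization: the activation scale is computed from
    the current batch, then the matmul runs in integer arithmetic."""
    x_scale = np.abs(x).max() / 127.0
    x_q = np.clip(np.round(x / x_scale), -127, 127).astype(np.int8)
    # accumulate in int32 to avoid overflow, then rescale to float
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
    return acc.astype(np.float32) * (x_scale * w_scale)

w = np.array([[1.0, -1.0], [0.5, 2.0]], dtype=np.float32)
x = np.array([[0.3, -0.6]], dtype=np.float32)
w_q, w_scale = quantize_weights(w)
y = dynamic_quant_matmul(x, w_q, w_scale)  # approximates x @ w
```

The per-batch `abs().max()` call is the runtime cost that static quantization avoids by fixing scales during calibration.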


Q3: Which layers are hardest to quantize? A: Layer normalization and first/last layers are most sensitive. Activations often need higher precision than weights.


Q4: How do you measure quantization quality? A: Perplexity on validation set, task-specific metrics (accuracy, F1), and activation distribution analysis (KL divergence).


Q5: What's the typical accuracy drop for INT8 quantization? A: Well-executed INT8 PTQ typically loses <1% accuracy; INT4 loses 2-5%, depending on model size and method.