INT8 Quantization
1. Overview
Maps FP16/FP32 values to 8-bit integers (256 discrete values). Standard for production LLM deployment.
2. Quantization Process
Weight Quantization
import numpy as np

# Per-channel symmetric (absmax) quantization; W follows PyTorch's
# (out_features, in_features) layout, so axis=1 yields one scale per output channel
scale = np.abs(W).max(axis=1, keepdims=True) / 127
W_int8 = np.clip(np.round(W / scale), -128, 127).astype(np.int8)
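A quick sanity check on the arrays above (a sketch, not part of the standard recipe): dequantizing should recover W to within half a quantization step of each channel's scale.

# Dequantize and bound the per-channel rounding error
W_dequant = W_int8.astype(np.float32) * scale
assert np.abs(W - W_dequant).max() <= scale.max() / 2 + 1e-6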
Activation Quantization
# Calibration phase: run 100-1000 representative samples through the model
# and record each activation tensor's running min/max (collect_statistics is
# a placeholder for that bookkeeping)
min_val, max_val = collect_statistics(calibration_data)
scale = (max_val - min_val) / 255       # asymmetric: spread the range over uint8 [0, 255]
zero_point = round(-min_val / scale)    # integer offset so real 0.0 maps to an exact code
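Applying the calibrated parameters to an activation tensor x is then (a minimal NumPy sketch, assuming numpy is imported as np):

# Quantize to uint8, then dequantize back to float
x_uint8 = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
x_dequant = (x_uint8.astype(np.float32) - zero_point) * scale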
3. LLM.int8() (Dettmers et al., 2022)
Key Innovation: Mixed-precision decomposition for outliers
Process (sketched below):
- Detect outlier feature dimensions (activation magnitude above a fixed threshold; the paper defaults to 6.0)
- Split the matrix multiplication: outlier columns run in FP16, everything else in INT8
- Typically <0.1% of features are outliers, but they carry a disproportionate share of model accuracy
Memory: ~2× reduction vs. FP16 with minimal accuracy loss
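A minimal NumPy sketch of the decomposition (llm_int8_matmul is a hypothetical name; the production kernels live in the bitsandbytes library). It uses the paper's vector-wise absmax scales on the INT8 path:

import numpy as np

def llm_int8_matmul(X, W, threshold=6.0):
    # Hypothetical helper; X: (tokens, features), W: (features, out_features)
    outliers = np.abs(X).max(axis=0) > threshold        # outlier feature dimensions
    out_fp16 = X[:, outliers] @ W[outliers, :]          # tiny high-precision matmul
    Xs, Ws = X[:, ~outliers], W[~outliers, :]
    sx = np.maximum(np.abs(Xs).max(axis=1, keepdims=True), 1e-8) / 127  # row-wise scales
    sw = np.maximum(np.abs(Ws).max(axis=0, keepdims=True), 1e-8) / 127  # column-wise scales
    Xq = np.round(Xs / sx).astype(np.int8)
    Wq = np.round(Ws / sw).astype(np.int8)
    out_int8 = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * (sx * sw)  # dequantize
    return out_fp16 + out_int8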
SmoothQuant Bridge
INT8 pipelines are often combined with SmoothQuant, which migrates quantization difficulty from activations to weights via a per-channel rescaling before INT8 conversion.
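A minimal sketch of that rescaling (smoothquant_rescale is a hypothetical name; alpha is the migration strength, with 0.5 a common default):

import numpy as np

def smoothquant_rescale(X, W, alpha=0.5):
    # Hypothetical helper; X: (tokens, in_features), W: (out_features, in_features)
    ax = np.maximum(np.abs(X).max(axis=0), 1e-8)   # per-input-channel activation max
    aw = np.maximum(np.abs(W).max(axis=0), 1e-8)   # per-input-channel weight max
    s = ax ** alpha / aw ** (1 - alpha)            # migration factor per input channel
    return X / s, W * s                            # (X/s) @ (W*s).T == X @ W.T, output unchanged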
Hardware Support
- NVIDIA Tensor Cores: INT8 GEMM operations
- Intel VNNI: Vector Neural Network Instructions
- ARM: INT8 dot-product instructions (SDOT/UDOT) for GEMM on modern CPUs
Speedup: 2-4× on modern hardware
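One contract all of these units share: INT8 products are accumulated in INT32 so the sums cannot overflow. A tiny NumPy illustration of why the widening matters:

import numpy as np

A = np.random.randint(-128, 128, (4, 64), dtype=np.int8)
B = np.random.randint(-128, 128, (64, 4), dtype=np.int8)
C = A.astype(np.int32) @ B.astype(np.int32)   # widen before multiply-accumulate
# Worst case per output element: 64 * 127 * 128 ≈ 1.0e6, comfortably inside int32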
Common Interview Questions
Q1: Why is INT8 considered the "sweet spot"? A: Best balance of compression (4× from FP32), hardware support, and accuracy preservation. INT4 needs more careful handling.
Q2: What's the bottleneck in INT8 inference? A: Quantize/dequantize overhead and memory bandwidth. At small batch sizes decoding is memory-bound, so the INT8 compute units aren't fully saturated.
Q3: How does LLM.int8() handle outliers? A: Detects outlier feature dimensions (activation magnitude above a fixed threshold, 6.0 by default), processes them in FP16, and applies vector-wise INT8 quantization to the remaining ~99.9% of values.
Q4: Can you quantize all layers to INT8? A: No. Embedding layers, layer norms, and sometimes first/last layers stay in FP16 for stability.
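A sketch of how that exclusion is often expressed when converting a PyTorch model (KEEP_FP16 and quantize_to_int8 are hypothetical; libraries such as bitsandbytes expose similar skip lists):

KEEP_FP16 = ("embed", "norm", "lm_head")        # hypothetical name patterns to skip
for name, module in model.named_modules():      # standard PyTorch module traversal
    if not any(pattern in name for pattern in KEEP_FP16):
        quantize_to_int8(module)                # hypothetical conversion helper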
Q5: What's absmax quantization? A: Symmetric quantization using the absolute maximum: scale = max(|W|) / 127. Simple, but it can waste range if the distribution is skewed.
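A quick illustration of that wasted range (a sketch; x stands in for a skewed, non-negative activation tensor):

import numpy as np

x = np.random.uniform(0, 10, 1000)     # strictly non-negative, i.e. skewed
scale = np.abs(x).max() / 127          # absmax (symmetric) scale
q = np.round(x / scale)
print(q.min(), q.max())                # ~0 and 127: all 128 negative codes go unused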
Q6: Calibration dataset size? A: 100-1000 samples from training distribution. More doesn't always help; diversity matters more than quantity.