SmoothQuant
1. Paper¶
Xiao et al., 2022 - "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models"
2. Problem Statement¶
Activation outliers make INT8 quantization difficult: weights quantize well, but activations don't.
Observation¶
- Weight range: typically within [-3σ, 3σ]
- Activation range: 10-100× larger due to systematic outliers in specific channels
3. Core Idea: Smoothing¶
Migrate quantization difficulty from activations to weights via a mathematically equivalent transformation.
Key Transformation¶
Y = XW = (X · diag(s)⁻¹) · (diag(s) · W)
Where s is a per-input-channel smoothing factor: activation channel j is divided by s_j and the matching weight row j is multiplied by s_j, so the product is mathematically unchanged (see the numerical check below the list).
- Divide activations by s → reduces outliers
- Multiply weights by s → increases the weight range (but weights are easier to quantize)
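A minimal numerical check of this equivalence (NumPy sketch; shapes, seeds, and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))            # activations: (tokens, d_in)
W = rng.normal(size=(8, 16))           # weights: (d_in, d_out)
s = rng.uniform(0.5, 4.0, size=8)      # per-input-channel smoothing factors

Y = X @ W
Y_smooth = (X / s) @ (W * s[:, None])  # divide columns of X, scale rows of W

print(np.allclose(Y, Y_smooth))        # True: the transformation is exact
```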
4. Algorithm¶
1. Identify Outlier Channels¶
# Collect calibration statistics (per input channel)
# X: (num_tokens, d_in) activations, W: (d_in, d_out) weights
act_scales = X.abs().amax(dim=0)      # per-channel activation range, shape (d_in,)
weight_scales = W.abs().amax(dim=1)   # per-channel weight range, shape (d_in,)
2. Compute Smoothing Scales¶
# Migration strength α ∈ [0, 1]
# α = 0: no migration (all difficulty stays in the activations)
# α = 1: full migration (all difficulty pushed into the weights)
s = act_scales**alpha / weight_scales**(1 - alpha)
3. Apply Smoothing¶
# Offline transformation: fold the scales into the weights
W_smooth = W * s[:, None]   # scale weight row j (input channel j) by s[j]
# At runtime: X_smooth = X / s (usually fused into the preceding LayerNorm)
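Putting the three steps together on a toy example (all shapes and values are made up) shows the intended effect: the activation outlier shrinks, the weight range grows modestly, and the layer output is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 8))
X[:, 3] *= 50.0                          # inject a systematic outlier channel
W = rng.normal(scale=0.02, size=(8, 16))

act_scales = np.abs(X).max(axis=0)       # per-input-channel activation range
weight_scales = np.abs(W).max(axis=1)    # per-input-channel weight range

alpha = 0.5
s = act_scales**alpha / weight_scales**(1 - alpha)
X_smooth, W_smooth = X / s, W * s[:, None]

print(np.abs(X).max(), "->", np.abs(X_smooth).max())   # activation range shrinks
print(np.abs(W).max(), "->", np.abs(W_smooth).max())   # weight range grows
print(np.allclose(X @ W, X_smooth @ W_smooth))         # output unchanged
```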
5. Migration Strength α¶
α = 0.5: Balanced migration (default)
- Geometric mean of activation and weight ranges
- Empirically optimal for most models
α = 0.75: More aggressive activation smoothing
- Better for models with especially severe activation outliers (e.g., GLM-130B in the paper)
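Why α = 0.5 acts as a geometric mean: for each channel j, s_j = sqrt(act_range_j / weight_range_j), so after smoothing the activation and weight ranges both equal sqrt(act_range_j · weight_range_j). A quick check with made-up numbers:

```python
act_range, weight_range = 40.0, 0.5     # hypothetical per-channel ranges

alpha = 0.5
s = act_range**alpha / weight_range**(1 - alpha)   # = sqrt(act_range / weight_range)

# Both smoothed ranges equal sqrt(40 * 0.5) ≈ 4.47
print(act_range / s, weight_range * s)
```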
6. Per-Token vs. Per-Tensor¶
- Per-Tensor Dynamic: single scale per activation tensor (fast, less accurate)
- Per-Token Dynamic: one scale per token in the sequence (better accuracy, slower)
SmoothQuant enables per-tensor quantization by smoothing outliers beforehand.
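For reference, a rough sketch of the two activation-quantization granularities (symmetric INT8 in PyTorch; function names and the eps clamp are illustrative, not any library's API):

```python
import torch

def quantize_per_tensor(x):
    # One scale for the entire activation tensor
    scale = x.abs().max().clamp(min=1e-8) / 127
    return torch.round(x / scale).clamp(-127, 127).to(torch.int8), scale

def quantize_per_token(x):
    # x: (tokens, d_model) -> one scale per token (row)
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127
    return torch.round(x / scale).clamp(-127, 127).to(torch.int8), scale
```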
7. Performance¶
| Model | W8A8 Accuracy | vs FP16 |
|---|---|---|
| OPT-175B | 66.7% | -0.1% |
| BLOOM-176B | 68.4% | -0.3% |
| LLaMA-65B | 69.2% | -0.2% |
Speedup: 1.5-2× on A100 (INT8 Tensor Cores)
8. Integration with Other Methods¶
SmoothQuant + AWQ:
- SmoothQuant for activation INT8
- AWQ for weight INT4
- Hybrid W4A8 quantization
SmoothQuant + LLM.int8():
- SmoothQuant pre-processing
- LLM.int8() for outlier handling
- Complementary techniques
9. Implementation¶
from smoothquant.smooth import smooth_lm

# Apply the smoothing transformation offline.
# smooth_lm takes per-channel activation statistics collected on calibration data
# and rewrites the weights in place (the exact signature may differ across versions).
smooth_lm(
    model,
    act_scales,   # calibration statistics (see the repo's calibration utilities)
    alpha=0.5,    # migration strength
)

# Then quantize with standard W8A8 tooling (quantize_model is a placeholder here)
quantized = quantize_model(model, w_bit=8, a_bit=8)
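For intuition only, here is a minimal fake-quant sketch of what a per-tensor W8A8 smoothed linear layer computes numerically; this is plain PyTorch, not the SmoothQuant repo's API, and quantize_model above stands in for whatever INT8 backend is actually used:

```python
import torch

def fake_quant_int8(t):
    # Symmetric per-tensor INT8 fake quantization (quantize, then dequantize)
    scale = t.abs().max().clamp(min=1e-8) / 127
    return torch.round(t / scale).clamp(-127, 127) * scale

def w8a8_linear(x, w_smooth, s):
    # x: (tokens, d_in), w_smooth: (d_in, d_out) already multiplied by s, s: (d_in,)
    x = fake_quant_int8(x / s)      # smooth, then quantize activations per-tensor
    w = fake_quant_int8(w_smooth)   # quantize the smoothed weights per-tensor
    return x @ w
```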
10. Common Interview Questions¶
Q1: Why do LLM activations have outliers?
A: Systematic outliers in specific channels across all tokens, likely due to attention patterns and positional encodings. Some channels accumulate large values.
Q2: How does smoothing preserve mathematical equivalence?
A: Matrix multiplication property: (X/s) @ (W*s) = X @ W. Division and multiplication by the same per-channel scales cancel out.
Q3: Why can't we just clip outliers?
A: Clipping loses information and degrades quality significantly (5-10%). Smoothing redistributes dynamic range without information loss.
Q4: What's the overhead of smoothing at runtime?
A: Negligible. The weight-side scales are folded in offline, and the activation-side division X/s is usually fused into the preceding LayerNorm or linear layer, so there is no extra runtime op (at worst a cheap element-wise division before the matmul). See the sketch below.
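Concretely, when the linear layer is fed by a LayerNorm, the division can be folded into the LayerNorm's affine parameters offline; a sketch with hypothetical ln and fc modules (fc.weight in PyTorch's (out_features, in_features) layout):

```python
# s: (d_in,) smoothing factors for the channels feeding fc
ln.weight.data /= s    # LayerNorm now emits pre-divided activations
ln.bias.data /= s
fc.weight.data *= s    # compensate in the weights: input-channel column j scaled by s[j]
```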
Q5: Does SmoothQuant help with KV cache quantization?
A: Yes! KV cache contains activations. Smoothing reduces outliers, enabling INT8 KV cache with minimal quality loss.
Q6: Why is α=0.5 optimal?
A: It balances the difficulty migration: too low and the activations stay hard to quantize; too high and the weights become hard. The geometric mean (α = 0.5) is the empirical sweet spot for most models.
Q7: Can SmoothQuant be applied per-layer?
A: Yes, α can be tuned per-layer. Some layers benefit from more aggressive smoothing (α=0.7), others from less (α=0.3).
Q8: SmoothQuant vs. absmax/percentile scaling?
A: Those methods only pick a quantization scale (or clip) after the outliers are already present, trading precision against outlier information. SmoothQuant reshapes the activation distribution itself, migrating outlier magnitude into the weights before quantization, which is a more fundamental fix.