AWQ
1. Paper¶
Lin et al., 2023 - "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration"
2. Core Insight¶
Not all weights are equal: roughly 1% of weight channels (the salient ones) matter disproportionately for model quality.
3. Key Observation¶
Salient weight channels correlate with large activation magnitudes. Protect these during quantization.
4. Algorithm¶
1. Identify Salient Channels¶
# Collect per-input-channel activation statistics from calibration data
# X: [num_samples, in_features] activations feeding this linear layer
salient_scores = X.abs().mean(dim=0)                       # mean |activation| per channel
num_salient = max(1, int(0.01 * salient_scores.numel()))   # top 1% of channels
salient_idx = salient_scores.topk(num_salient).indices
2. Per-Channel Scaling¶
# Scale up salient weights BEFORE quantization
s = compute_optimal_scale(W, X)      # per-input-channel scale from activation stats (see step 3)
W_scaled = W * s                     # scale each input channel of W up by s
W_quantized = quantize(W_scaled)     # low-bit quantization of the scaled weights
# At inference, the scale is folded into the activations (cheap):
#   output = W @ X ≈ (W_quantized / s) @ X = W_quantized @ (X / s)
# The division by s is typically fused into the preceding op (e.g. LayerNorm),
# so there is no extra runtime cost.
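To make the algebra concrete, here is a minimal sketch of the equivalence, using an illustrative round-to-nearest quantizer (quantize_rtn) and made-up shapes and scale values; it is not the library implementation:

import torch

def quantize_rtn(w, n_bits=4):
    # illustrative symmetric round-to-nearest quantization, per output row
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

torch.manual_seed(0)
W = torch.randn(8, 16)        # [out_features, in_features]
X = torch.randn(16, 4)        # [in_features, num_tokens]
s = torch.rand(16) + 0.5      # per-input-channel scales (placeholder values)

ref = W @ X                                  # full-precision output
Wq = quantize_rtn(W * s)                     # quantize the scaled weights
out = Wq @ (X / s.unsqueeze(1))              # fold 1/s into the activations
print((ref - out).abs().max())               # small residual quantization error

Without the quantizer, (W·s) @ (X/s) reproduces W @ X exactly; quantization only adds the usual rounding error.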
3. Search Optimal Scales¶
Minimize: ||W·X - Q(W·s)·(X/s)||
Grid search over a single per-layer exponent α ∈ [0, 1], with s = s_X^α (s_X = mean activation magnitude per channel); larger α scales salient channels up more.
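A hedged sketch of the search loop, reusing quantize_rtn from the snippet above (the exact search space and normalization in the official implementation may differ):

def search_awq_scale(W, X, n_grid=20):
    # W: [out_features, in_features], X: [in_features, num_tokens]
    s_x = X.abs().mean(dim=1)                         # per-channel activation magnitude
    best_err, best_s = float("inf"), None
    for i in range(n_grid):
        alpha = i / (n_grid - 1)                      # candidate exponent in [0, 1]
        s = s_x.clamp(min=1e-4) ** alpha
        s = s / (s.max() * s.min()).sqrt()            # keep scales centered around 1
        out = quantize_rtn(W * s) @ (X / s.unsqueeze(1))
        err = (W @ X - out).pow(2).mean()
        if err < best_err:
            best_err, best_s = err.item(), s
    return best_s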
5. Why It Works¶
- Increases effective resolution for important weights
- Shifts dynamic range to match activation distribution
- Quantization error on salient weights reduced by 2-4× (see the sketch below)
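A small numeric illustration of that last point, again reusing quantize_rtn (hypothetical values; the salient channel is picked by hand here rather than from activation statistics):

torch.manual_seed(0)
W = torch.randn(8, 16)
salient = 3                                  # pretend channel 3 is salient
W[:, salient] *= 0.3                         # salient weights are small but important
s = torch.ones(16)
s[salient] = 2.0                             # scale up only the salient channel

err_plain = (quantize_rtn(W) - W)[:, salient].abs().mean()
err_scaled = (quantize_rtn(W * s) / s - W)[:, salient].abs().mean()
print(err_plain, err_scaled)                 # error on the salient channel shrinks by roughly s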
6. Specifications¶
- Calibration: 128 samples
- Quantization time: ~10 minutes for a 7B model (much faster than GPTQ)
- Group size: 128 typical
- Bits: optimized for 4-bit, works for 3-bit
7. Performance Comparison¶
| Method | LLaMA-7B 4-bit perplexity (lower is better) | Quantization Time |
|---|---|---|
| RTN | 73.2 | seconds |
| GPTQ | 68.4 | 4 hours |
| AWQ | 68.1 | 10 min |
8. Advantages over GPTQ¶
- Speed: 20-30× faster quantization
- Simplicity: No Hessian computation
- Hardware-friendly: Simple per-channel scales
9. TinyChat Integration¶
AWQ includes custom CUDA kernels for efficient INT4 inference:
- Fused dequantization + GEMM
- 3-4× speedup over FP16 on consumer GPUs
10. Implementation¶
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "model_name"                       # HF model id or local path
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {
    "w_bit": 4,            # weight bit-width
    "q_group_size": 128,   # group size for quantization scales
    "zero_point": True,    # asymmetric (zero-point) quantization
    "version": "GEMM",     # inference kernel
}
model.quantize(tokenizer, quant_config=quant_config, calib_data="wikitext")
model.save_quantized("model_name-awq")
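For inference, the quantized checkpoint can be reloaded with AutoAWQ's from_quantized; a brief sketch (the output path "model_name-awq" above is a placeholder, and a CUDA device is assumed):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized("model_name-awq", fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained("model_name")

inputs = tokenizer("AWQ protects salient channels by", return_tensors="pt").to("cuda")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))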
11. Common Interview Questions¶
Q1: How does AWQ differ from GPTQ philosophically?
A: GPTQ compensates errors across all weights. AWQ protects important weights from error in the first place. Prevention vs. compensation.
Q2: Why is AWQ faster to quantize?
A: No iterative weight updates or Hessian computation. Just statistics collection + grid search for scales. Embarrassingly parallel.
Q3: What's the "1% salient weights" finding?
A: 1% of weight channels (those with highest activation magnitude) contribute disproportionately. Protecting them preserves 90%+ of model quality.
Q4: How are scales applied at inference?
A: The scaled weights are quantized offline: W·X = (W·s)·(X/s) ≈ Q(W·s)·(X/s). At inference only the cheap division of activations by s remains, and it is typically fused into the preceding op (e.g. LayerNorm), so there is no extra matmul cost.
Q5: Can AWQ work for INT8?
A: Yes, but less beneficial. INT8 already preserves most weights well. AWQ's advantage is strongest at 3-4 bits where bit budget is tight.
Q6: What's the memory overhead of scales?
A: Per-channel FP16 scales: 0.1-0.2% overhead. Negligible compared to the ~4× weight reduction from FP16 to INT4.
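A quick back-of-the-envelope check of that overhead for a hypothetical 4096×4096 linear layer:

out_features, in_features = 4096, 4096        # hypothetical layer shape
weight_bits = 4 * out_features * in_features  # INT4 weight storage
scale_bits = 16 * in_features                 # one FP16 scale per input channel
print(100 * scale_bits / weight_bits)         # ~0.1% overhead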
Q7: AWQ vs. SmoothQuant?
A: SmoothQuant smooths activations for easier quantization. AWQ protects important weights. Can be combined: SmoothQuant for activation quantization, AWQ for weights.
Q8: Why grid search for scales?
A: There is no closed-form solution for the optimal scales, and the quantization function is not differentiable. A fast grid search (in the paper, over an exponent α ∈ [0, 1] with s = s_X^α) is cheap and effective; gradient-based search is possible but slower.