
AWQ

1. Paper

Lin et al., 2023 - "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration"



2. Core Insight

Not all weights are equal - roughly 1% of weight channels are salient and matter disproportionately for model quality.



3. Key Observation

Salient weight channels correlate with large activation magnitudes. Protect these during quantization.



4. Algorithm

1. Identify Salient Channels

# Collect per-channel activation statistics from calibration data
# X: [num_samples, in_features] calibration activations (PyTorch tensor)
salient_scores = X.abs().mean(dim=0)          # mean |activation| per input channel
k = max(1, int(0.01 * salient_scores.numel()))
salient_idx = salient_scores.topk(k).indices  # top ~1% channels by activation magnitude

2. Per-Channel Scaling

# Scale up salient input channels BEFORE quantization
s = compute_optimal_scale(W, X)   # per-input-channel scales; see step 3
W_scaled = W * s                  # scales the columns (input channels) of W
W_quantized = quantize(W_scaled)  # e.g., group-wise INT4 round-to-nearest

# At inference (mathematically equivalent):
#   output = (W_quantized / s) @ X = W_quantized @ (X / s)
# The 1/s factor is folded into the activations and fused with the preceding
# op (e.g., LayerNorm), so the scaling is effectively free

3. Search Optimal Scales

Minimize:  L(s) = ‖ Q(W · diag(s)) · (diag(s)⁻¹ · X) − W · X ‖

Grid search over a single exponent α ∈ [0, 1], with s = s_X^α, where s_X is the
per-channel mean activation magnitude (α = 0: no scaling; α = 1: scale fully
with activation magnitude)
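
A minimal sketch of this search in PyTorch. Assumptions: W is [out_features, in_features], X is [samples, in_features] calibration activations, and fake_quantize is a simplified symmetric per-row RTN standing in for AWQ's group-wise quantizer:

import torch

def fake_quantize(W, w_bit=4):
    # Simplified symmetric round-to-nearest, one step size per output row
    q_max = 2 ** (w_bit - 1) - 1
    step = W.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / q_max
    return (W / step).round().clamp(-q_max - 1, q_max) * step

def search_scale(W, X, n_grid=20, w_bit=4):
    s_x = X.abs().mean(dim=0).clamp(min=1e-5)  # per-channel activation magnitude
    y_ref = X @ W.t()                          # full-precision reference output
    best_alpha, best_err = 0.0, float("inf")
    for i in range(n_grid + 1):
        alpha = i / n_grid
        s = s_x ** alpha                       # candidate per-channel scales
        W_q = fake_quantize(W * s, w_bit)      # quantize the scaled weights
        err = ((X / s) @ W_q.t() - y_ref).pow(2).mean().item()
        if err < best_err:
            best_alpha, best_err = alpha, err
    return s_x ** best_alpha                   # best per-channel scales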



5. Why It Works

  • Increases effective resolution for important weights
  • Shifts dynamic range to match activation distribution
  • Quantization error on salient weights reduced by 2-4×
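
A toy numeric illustration of the error reduction. Key assumption (taken from the paper's argument): the quantization step stays fixed because the group's max is set by the many unscaled channels, so scaling one salient channel barely changes it:

import torch

torch.manual_seed(0)
w = torch.randn(4096) * 0.01   # weights of one salient channel (toy values)
x = torch.randn(4096) * 5.0    # large activations hitting that channel

def rtn(v, step=0.02):
    # Round-to-nearest with a FIXED step size (group max set elsewhere)
    return (v / step).round().clamp(-8, 7) * step

err_plain = (rtn(w) @ x - w @ x).abs().item()
err_scaled = ((rtn(w * 2) / 2) @ x - w @ x).abs().item()  # scale by s=2, fold 1/2 back
print(err_plain, err_scaled)  # the scaled error is typically about half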


6. Specifications

  • Calibration: 128 samples
  • Quantization time: ~10 minutes for a 7B model (much faster than GPTQ)
  • Group size: 128 typical
  • Bits: optimized for 4-bit; also works at 3-bit
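
To make "group size 128" concrete, a sketch of group-wise asymmetric round-to-nearest in PyTorch (an illustration only; quantize_groupwise is hypothetical, and AutoAWQ additionally packs the 4-bit integers):

import torch

def quantize_groupwise(W, w_bit=4, group_size=128):
    # One (scale, zero-point) pair per group of `group_size` input channels
    out_f, in_f = W.shape  # assumes in_f is divisible by group_size
    Wg = W.reshape(out_f, in_f // group_size, group_size)
    w_min = Wg.amin(dim=-1, keepdim=True)
    w_max = Wg.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / (2 ** w_bit - 1)
    zero = (-w_min / scale).round()
    q = (Wg / scale + zero).round().clamp(0, 2 ** w_bit - 1)
    return ((q - zero) * scale).reshape(out_f, in_f)  # dequantized for clarity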



7. Performance Comparison

Method   PPL (LLaMA-7B, 4-bit)   Quantization Time
RTN      73.2                    seconds
GPTQ     68.4                    ~4 hours
AWQ      68.1                    ~10 min


8. Advantages over GPTQ

  1. Speed: 20-30× faster quantization
  2. Simplicity: No Hessian computation
  3. Hardware-friendly: Simple per-channel scales


9. TinyChat Integration

AWQ ships with TinyChat, custom CUDA kernels for efficient INT4 inference:

  • Fused dequantization + GEMM (weights stay INT4 in memory and are dequantized in registers)
  • 3-4× speedup over FP16 on consumer GPUs



10. Implementation

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("model_name")
tokenizer = AutoTokenizer.from_pretrained("model_name")

quant_config = {
    "w_bit": 4,           # 4-bit weights
    "q_group_size": 128,  # quantization group size
    "zero_point": True,   # asymmetric quantization
    "version": "GEMM",    # inference kernel
}
model.quantize(tokenizer, quant_config=quant_config, calib_data="wikitext")
model.save_quantized("model_name-awq")
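
The quantized model can then be reloaded for inference ("model_name-awq" is the save path used above):

model = AutoAWQForCausalLM.from_quantized("model_name-awq", fuse_layers=True)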


11. Common Interview Questions

Q1: How does AWQ differ from GPTQ philosophically?
A: GPTQ compensates errors across all weights. AWQ protects important weights from error in the first place. Prevention vs. compensation.


Q2: Why is AWQ faster to quantize?
A: No iterative weight updates or Hessian computation. Just statistics collection + grid search for scales. Embarrassingly parallel.


Q3: What's the "1% salient weights" finding?
A: 1% of weight channels (those with highest activation magnitude) contribute disproportionately. Protecting them preserves 90%+ of model quality.


Q4: How are scales applied at inference?
A: Undoing the scale on the weight side at runtime would be expensive; instead, the inverse scale is applied to the activations: (Q(W·s) / s) @ X = Q(W·s) @ (X / s). AWQ fuses the 1/s into the preceding op (e.g., LayerNorm), so it costs nothing at inference.
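
A quick numeric check of this identity (toy shapes, PyTorch):

import torch

torch.manual_seed(0)
W = torch.randn(8, 16)    # [out_features, in_features]
X = torch.randn(4, 16)    # [batch, in_features]
s = torch.rand(16) + 0.5  # per-input-channel scales

y_ref = X @ W.t()                # original computation
y_fused = (X / s) @ (W * s).t()  # inverse scale folded into the activations
print(torch.allclose(y_ref, y_fused, atol=1e-5))  # True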


Q5: Can AWQ work for INT8?
A: Yes, but less beneficial. INT8 already preserves most weights well. AWQ's advantage is strongest at 3-4 bits where bit budget is tight.


Q6: What's the memory overhead of scales?
A: Per-channel FP16 scales add roughly 0.1-0.2% memory overhead. Negligible compared to the 4× weight reduction from FP16 to INT4.
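
The arithmetic for a hypothetical 4096×4096 layer:

in_f = out_f = 4096
int4_weight_bytes = in_f * out_f // 2        # 4 bits per weight -> 8 MiB
fp16_scale_bytes = in_f * 2                  # one FP16 scale per channel -> 8 KiB
print(fp16_scale_bytes / int4_weight_bytes)  # ~0.001 -> about 0.1% overhead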


Q7: AWQ vs. SmoothQuant?
A: SmoothQuant smooths activations for easier quantization. AWQ protects important weights. Can be combined: SmoothQuant for activation quantization, AWQ for weights.


Q8: Why grid search for scales?
A: The rounding operator makes the objective non-differentiable, so no closed-form optimum exists. AWQ grid-searches a single exponent α ∈ [0, 1] (s = s_X^α, typically 20 grid points), which is fast and effective. Gradient-based search can do slightly better but is slower.