# GPTQ
## 1. Paper

Frantar et al., 2022, "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers"
## 2. Core Idea

Quantize weights one column at a time, compensating each quantization error by updating the remaining (not-yet-quantized) weights, using second-order information (the Hessian of the layer-wise reconstruction loss).
## 3. Algorithm (Simplified)

```python
import numpy as np

# For each layer's weight matrix W (columns = input channels)
H = 2 * X.T @ X / n_samples    # Hessian of the layer-wise reconstruction loss
Hinv = np.linalg.inv(H)        # GPTQ actually uses a damped Cholesky factorization of H^-1

for i in range(n_columns):
    # Quantize column i
    w_q = quantize(W[:, i])
    error = (W[:, i] - w_q) / Hinv[i, i]
    # Distribute the error onto the not-yet-quantized columns via the inverse Hessian
    W[:, i+1:] -= np.outer(error, Hinv[i, i+1:])
    W[:, i] = w_q
```
## 4. Foundation: Optimal Brain Quantization (OBQ)

- Uses a second-order Taylor expansion of the layer loss to minimize quantization error
- OBQ quantizes weights greedily, always picking the weight with the smallest Hessian-weighted error next; GPTQ's key simplification is that a fixed left-to-right column order works almost as well at scale and is far cheaper
- Compensates each quantization error in the remaining weights before the next step (closed form below)
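Concretely, following the OBS formulation the paper builds on: quantizing weight $w_q$ increases the layer loss by the first expression, and the second is the compensating update applied to the remaining weights $F$:

$$
\Delta L = \frac{\left(\operatorname{quant}(w_q) - w_q\right)^2}{2\,[\mathbf{H}_F^{-1}]_{qq}},
\qquad
\boldsymbol{\delta}_F = -\,\frac{w_q - \operatorname{quant}(w_q)}{[\mathbf{H}_F^{-1}]_{qq}}\,\left(\mathbf{H}_F^{-1}\right)_{:,q}
$$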
## 5. Lazy Batch Updates (Efficiency)

- Applying every column's update to the full matrix individually is memory-bandwidth bound
- Instead, quantize a block of 128 columns with updates kept local to the block, then apply one batched update to all later columns
- Dramatic speedup with an identical result: column i's update only affects columns after i, so deferring the global update is exact (sketch below)
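A minimal sketch of the blocking scheme, reusing the per-column update from section 3 (`quantize` and `Hinv` as above; the helper name is mine):

```python
import numpy as np

def gptq_blocked(W, Hinv, quantize, block=128):
    """GPTQ-style loop with lazy batch updates: local updates inside each
    block, then one batched update to the remaining columns per block."""
    n = W.shape[1]
    for start in range(0, n, block):
        end = min(start + block, n)
        E = np.zeros((W.shape[0], end - start))   # scaled errors for this block
        for i in range(start, end):
            w_q = quantize(W[:, i])
            E[:, i - start] = (W[:, i] - w_q) / Hinv[i, i]
            # cheap, cache-friendly update within the current block only
            W[:, i + 1:end] -= np.outer(E[:, i - start], Hinv[i, i + 1:end])
            W[:, i] = w_q
        # the expensive global update, amortized over the whole block
        W[:, end:] -= E @ Hinv[start:end, end:]
    return W
```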
## 6. Specifications

- Calibration: ~128 samples is typical
- Time: 3-4 GPU hours for a 175B model on a single GPU (A100-class)
- Group size: 128 by default; smaller groups give better quality at the cost of storing more scales (see sketch below)
- Bits: designed for 3-4 bit; works for INT8 too
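What group size means in practice: each group of consecutive weights gets its own quantization scale and zero-point. A minimal sketch, using an asymmetric min/max scheme (the scheme choice here is mine, for illustration):

```python
import numpy as np

def quantize_group(w, bits=4):
    """Asymmetric min/max quantization of one group; returns dequantized values."""
    qmax = 2 ** bits - 1
    scale = max((w.max() - w.min()) / qmax, 1e-8)  # guard against constant groups
    zero = np.round(-w.min() / scale)
    q = np.clip(np.round(w / scale) + zero, 0, qmax)
    return (q - zero) * scale

def quantize_row(w_row, group_size=128, bits=4):
    # one scale/zero-point per group of `group_size` consecutive weights
    return np.concatenate([
        quantize_group(w_row[i:i + group_size], bits)
        for i in range(0, len(w_row), group_size)
    ])
```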
## 7. Accuracy

| Model | Bits | Group Size | Perplexity Δ (vs. FP16) |
|---|---|---|---|
| LLaMA-7B | 4 | 128 | +0.2 |
| LLaMA-13B | 3 | 128 | +0.9 |
| OPT-175B | 4 | 128 | +0.1 |
## 8. vs. Round-To-Nearest (RTN)

- RTN: fast (no calibration or compensation), 5-10% degradation at 4-bit
- GPTQ: slow quantization, <2% degradation at 4-bit (the one-liner below shows what RTN leaves out)
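RTN in one line, for contrast, using the same `quantize` helper as the section 3 sketch: every column is rounded independently, with no error feedback.

```python
# RTN: round each column on its own; no Hessian, no compensation loop
W_rtn = np.stack([quantize(W[:, i]) for i in range(W.shape[1])], axis=1)
```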
## 9. Implementation

```python
# AutoGPTQ library
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,           # quantized weight precision
    group_size=128,   # one scale/zero-point per 128 weights
    desc_act=False,   # activation ordering; see Q4 below
)

model = AutoGPTQForCausalLM.from_pretrained("model_name", quantize_config)
model.quantize(calibration_examples)  # list of {"input_ids": ..., "attention_mask": ...} dicts
model.save_quantized("model_name-4bit-gptq")
```
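Loading the result back for inference uses `from_quantized` (the directory name is the one from the save call above):

```python
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized("model_name-4bit-gptq", device="cuda:0")
```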
## 10. Common Interview Questions
Q1: Why does GPTQ outperform naive quantization?
A: It compensates each quantization error by adjusting remaining weights based on input correlations (Hessian), preventing error accumulation.
Q2: What's the computational complexity?
A: For a d_row × d_col weight matrix, roughly O(d_col³) for the inverse-Hessian (Cholesky) work plus O(d_row · d_col²) for the weight updates, down from OBQ's O(d_row · d_col³). Lazy batching doesn't change the asymptotics; it turns many small global updates into a few large ones so the GPU stays compute-bound.
Q3: Why use the Hessian instead of just the gradient?
A: Second-order information captures interactions between weights; first-order (gradient) only gives the local slope, not the curvature of the loss landscape. Moreover, at a trained model's weights the gradient is near zero, so the quadratic term dominates the loss change (see expansion below).
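The expansion behind this answer, for a weight perturbation $\delta\mathbf{w}$ around trained weights:

$$
\Delta L \approx \mathbf{g}^{\top}\delta\mathbf{w} + \tfrac{1}{2}\,\delta\mathbf{w}^{\top}\mathbf{H}\,\delta\mathbf{w}
\;\approx\; \tfrac{1}{2}\,\delta\mathbf{w}^{\top}\mathbf{H}\,\delta\mathbf{w}
\quad\text{since } \mathbf{g}\approx\mathbf{0} \text{ at a minimum.}
$$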
Q4: What's "desc_act" in GPTQ?
A: Activation ordering ("act-order"): quantize columns in order of decreasing activation magnitude (the Hessian diagonal), so the most impactful weights are quantized first, while the most compensation capacity remains. Improves quality, especially at 3-bit, but complicates inference kernels; often disabled for speed.
Q5: Can GPTQ quantize activations?
A: No, GPTQ is weight-only. Activations typically stay in FP16 or use runtime INT8 quantization.
Q6: GPTQ vs. AWQ - when to use which?
A: GPTQ: Better for extreme compression (3-bit). AWQ: Faster quantization, better at 4-bit, protects important weights rather than compensating errors.
Q7: Why is calibration data needed?
A: To compute the Hessian (H ∝ X^T X), which captures input statistics. Representative samples are needed to estimate weight importance accurately, as sketched below.
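A sketch of the accumulation over calibration batches (the 0.01 damping factor matches the reference implementation's default; the function name and shapes are mine):

```python
import numpy as np

def accumulate_hessian(batches, d_in, percdamp=0.01):
    """Estimate H = (2/N) * sum(x x^T) over calibration inputs, with damping."""
    H = np.zeros((d_in, d_in))
    n = 0
    for X in batches:          # X: (n_tokens, d_in) inputs to the layer
        H += 2.0 * (X.T @ X)
        n += X.shape[0]
    H /= n
    # damping keeps H well-conditioned before inversion
    H += percdamp * np.mean(np.diag(H)) * np.eye(d_in)
    return H
```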