Skip to content

GPTQ

1. Paper

Frantar et al., 2022 - "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers"



2. Core Idea

Quantize weights one-by-one while compensating errors in remaining weights using second-order information (Hessian).



3. Algorithm (Simplified)

# For each layer's weight matrix W
H = 2 * X^T @ X / n_samples  # Hessian (input correlations)

for i in range(n_columns):
    # Quantize column i
    w_q = quantize(W[:, i])
    error = W[:, i] - w_q

    # Distribute error to remaining columns using Hessian
    W[:, i+1:] -= error * H[i, i+1:] / H[i, i]

    W[:, i] = w_q


4. Key Innovation: Optimal Brain Quantization (OBQ)

  • Uses second-order Taylor expansion to minimize quantization loss
  • Iteratively quantizes weights in order that minimizes Hessian-weighted error
  • Compensates each quantization error before next step


Lazy Batch Updates (Efficiency)

  • Don't update all weights individually
  • Process in blocks of 128 columns
  • Dramatic speedup with minimal quality loss


Specifications

  • Calibration: 128 samples typical
  • Time: 3-4 hours for 175B model on single GPU
  • Group Size: 128 (default), lower for better quality
  • Bits: Designed for 3-4 bit, works for INT8 too


Accuracy

Model Bits Group Size Perplexity Δ
LLaMA-7B 4 128 +0.2
LLaMA-13B 3 128 +0.9
OPT-175B 4 128 +0.1


vs. Round-To-Nearest (RTN)

  • RTN: Fast, 5-10% degradation at 4-bit
  • GPTQ: Slow quantization, <2% degradation at 4-bit


Implementation

# AutoGPTQ library
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_pretrained(
    "model_name",
    quantize_config={
        "bits": 4,
        "group_size": 128,
        "desc_act": False  # Activation ordering
    }
)


Common Interview Questions

Q1: Why does GPTQ outperform naive quantization?
A: It compensates each quantization error by adjusting remaining weights based on input correlations (Hessian), preventing error accumulation.


Q2: What's the computational complexity?
A: O(n²) in weight dimensions due to Hessian computation and updates. Lazy batching reduces this to O(n²/b) where b = block size.


Q3: Why use Hessian instead of just gradient?
A: Second-order information captures weight interactions. First-order (gradient) only gives local slope, not curvature of loss landscape.


Q4: What's "desc_act" in GPTQ?
A: Reorders activation channels by importance before quantization. Helps but adds complexity; often disabled for speed.


Q5: Can GPTQ quantize activations?
A: No, GPTQ is weight-only. Activations typically stay FP16 or use runtime INT8 quantization.


Q6: GPTQ vs. AWQ - when to use which?
A: GPTQ: Better for extreme compression (3-bit). AWQ: Faster quantization, better at 4-bit, protects important weights rather than compensating errors.


Q7: Why is calibration data needed?
A: To compute Hessian (H = X^T X), which captures input statistics. Need representative samples to estimate weight importance accurately.