INT4 Quantization

1. Overview

4-bit quantization represents each weight with one of 16 discrete values, an 8× memory reduction relative to FP32 (4× relative to FP16). Maintaining quality at this precision requires careful techniques.

2. Key Challenge

With only 16 representable values, naive quantization is lossy. More sophisticated methods are needed: GPTQ, AWQ, or group quantization.

3. Group Quantization

Concept: A separate scale for each group of weights instead of one scale for the entire layer

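import numpy as np

# W: float weight array; group size is typically 32-128 and must divide W.size
groups = W.reshape(-1, 128)                              # one row per group
scale = np.abs(groups).max(axis=1, keepdims=True) / 7    # 4-bit signed: -8 to 7
scale = np.where(scale == 0, 1.0, scale)                 # guard all-zero groups
W_int4 = np.round(groups / scale).clip(-8, 7).astype(np.int8)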

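Tradeoff: Better accuracy vs. more scales to store. One FP16 scale per 128-weight group adds 16/128 = 0.125 bits per weight, roughly 3% on top of the 4-bit weights; smaller groups cost proportionally more.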
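To make the accuracy side of the tradeoff concrete, here is a minimal NumPy sketch that quantizes the same tensor once with a single scale and once with one scale per 128 weights, then compares the reconstruction error. The helper names (`quantize_groups`, `dequantize_groups`), the synthetic weights, and the injected outlier are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def quantize_groups(w, group_size=128):
    """Symmetric 4-bit round-to-nearest with one scale per group (illustrative helper)."""
    groups = w.reshape(-1, group_size)            # requires w.size % group_size == 0
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)      # avoid dividing by zero for all-zero groups
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_groups(q, scale, shape):
    """Map the integer codes back to floats using the stored per-group scales."""
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)  # synthetic weights
w[0, 0] = 1.0                                                  # one hypothetical outlier

# One scale for the whole tensor vs. one scale per 128 weights
q_t, s_t = quantize_groups(w, group_size=w.size)
q_g, s_g = quantize_groups(w, group_size=128)

err_tensor = float(np.abs(w - dequantize_groups(q_t, s_t, w.shape)).mean())
err_group = float(np.abs(w - dequantize_groups(q_g, s_g, w.shape)).mean())
print(f"per-tensor MAE: {err_tensor:.5f}  per-group MAE: {err_group:.5f}")
```

With a single scale, the outlier stretches the quantization grid for every weight; with groups, only the one group containing the outlier pays that cost.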

NormalFloat (NF4) - QLoRA

Innovation: Non-uniform quantization levels matched to a normal distribution

Standard INT4: [-8, -7, ..., 0, ..., 7]
NF4: [-1.0, -0.6962, -0.5251, -0.3949, ...]

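Why it works: Pre-trained weights are approximately N(0, σ²), and NF4 places its levels at quantiles of the normal distribution, so each level is used about equally often

Usage: QLoRA stores the frozen base model in NF4 and trains LoRA adapters on top (parameter-efficient fine-tuning)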
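A rough sketch of how lookup-table quantization of this kind can work: normalize a block by its absolute maximum, then store the index of the nearest NF4 level. The 16 levels below are written to four decimals and are assumed to match the values published with QLoRA; check them against your library before relying on them. The function names are hypothetical.

```python
import numpy as np

# NF4 levels rounded to four decimals (assumed from the QLoRA paper; verify before use)
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
], dtype=np.float32)

def nf4_quantize(block):
    """Return 4-bit level indices plus the absmax scale for one block of weights."""
    absmax = float(np.abs(block).max())
    if absmax == 0.0:
        absmax = 1.0                                  # all-zero block: any scale works
    normalized = block / absmax                       # now in [-1, 1]
    # Nearest-level search: distance from every value to every level
    idx = np.abs(normalized[:, None] - NF4_LEVELS[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), absmax

def nf4_dequantize(idx, absmax):
    """Look up each index in the level table and rescale."""
    return NF4_LEVELS[idx] * absmax

rng = np.random.default_rng(0)
block = rng.normal(0.0, 0.02, size=64).astype(np.float32)  # synthetic block of 64 weights
idx, absmax = nf4_quantize(block)
recon = nf4_dequantize(idx, absmax)
print(idx[:8], float(np.abs(block - recon).max()))
```

Unlike uniform INT4, the stored integers here are indices into a table, so dequantization is a lookup followed by a multiply rather than a plain multiply.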

Double Quantization

Quantize the quantization scales themselves (QLoRA technique)

FP32 scales → 8-bit scales (plus a small set of second-level FP32 constants)
Saves roughly 0.37 bits per parameter (about 0.5 → 0.127 with block size 64)
Minimal accuracy impact
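The per-parameter savings follow from simple arithmetic. The sketch below assumes the configuration described in the QLoRA paper (first-level block size 64 with 32-bit scales, then 8-bit scales with a second-level block size of 256 and 32-bit second-level constants); other implementations may use different sizes, and the function names are illustrative.

```python
def scale_bits_per_param(block_size, scale_bits):
    """Storage cost of one scale per block, amortized over the block's weights."""
    return scale_bits / block_size

def double_quant_bits_per_param(block_size=64, scale_bits=8,
                                second_block=256, second_scale_bits=32):
    """Cost of 8-bit first-level scales plus the second-level constants."""
    first_level = scale_bits / block_size
    second_level = second_scale_bits / (block_size * second_block)
    return first_level + second_level

single = scale_bits_per_param(64, 32)        # 32-bit scale per 64 weights
double = double_quant_bits_per_param()       # 8/64 + 32/(64*256)
print(f"single: {single:.3f}  double: {double:.3f}  saved: {single - double:.3f}")
```

Under these assumptions the scale overhead falls from 0.5 to about 0.127 bits per parameter.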

Inference Kernels

Challenge: No native INT4 arithmetic on most hardware

Solution: Pack two INT4 values per byte, unpack during compute

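byte = ((val1 & 0xF) << 4) | (val2 & 0xF)  # Pack: val1 in high nibble, val2 in low
val1 = (byte >> 4) & 0xF                   # Unpack high nibble (unsigned 0..15)
val2 = byte & 0xF                          # Unpack low nibble (unsigned 0..15)
val1 = val1 - 16 if val1 >= 8 else val1    # Sign-extend to -8..7 (same for val2)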
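For whole arrays, the same idea can be sketched in NumPy. This is a CPU-side illustration of the storage layout only, not how a production GPU kernel is written; `pack_int4` and `unpack_int4` are hypothetical helper names, and placing even-indexed values in the high nibble is an arbitrary choice.

```python
import numpy as np

def pack_int4(values):
    """Pack signed 4-bit integers (-8..7) two per byte; expects an even count."""
    v = np.asarray(values, dtype=np.int8)
    nibbles = (v & 0x0F).astype(np.uint8)        # keep the two's-complement low 4 bits
    return (nibbles[0::2] << 4) | nibbles[1::2]  # even index -> high nibble

def unpack_int4(packed):
    """Recover the signed values from packed bytes."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2] = (packed >> 4).astype(np.int8)    # high nibble, 0..15
    out[1::2] = (packed & 0x0F).astype(np.int8)  # low nibble, 0..15
    return np.where(out >= 8, out - 16, out).astype(np.int8)  # sign-extend

vals = [-8, -1, 0, 7, 3, -4]
packed = pack_int4(vals)
restored = unpack_int4(packed)
print(packed, restored)
```

Masking with `0x0F` before shifting matters for negative values: without it, the sign bits of the low value would overwrite the high nibble.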

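Performance

Memory: 8× reduction vs. FP32 (about 3.5 GB of weights for a 7B model, vs. 28 GB in FP32 and 14 GB in FP16)
Speed: up to 1.5-2× faster than INT8 in memory-bound scenarios, where halving the bytes read per weight is the main win
Accuracy drop: 3-7% with naive methods, <2% with GPTQ/AWQ (varies by model and benchmark)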
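The memory figure is easy to check by hand. The sketch below counts weight storage only (no scales, activations, or KV cache) and uses decimal gigabytes; `weight_memory_gb` is an illustrative helper.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Weight storage in decimal GB: parameters * bits / 8 bits per byte / 1e9."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 7e9  # a 7B-parameter model
fp32 = weight_memory_gb(n_params, 32)
fp16 = weight_memory_gb(n_params, 16)
int8 = weight_memory_gb(n_params, 8)
int4 = weight_memory_gb(n_params, 4)

for name, gb in [("FP32", fp32), ("FP16", fp16), ("INT8", int8), ("INT4", int4)]:
    print(f"{name}: {gb:.1f} GB ({fp32 / gb:.0f}x smaller than FP32)")
```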

Common Interview Questions

Q1: Why not INT4 everywhere if it's 8× smaller?
A: Quality degradation becomes significant. Activations especially need higher precision, so they typically stay in FP16 or INT8 even when weights are INT4.

Q2: What's the typical group size for INT4?
A: 32-128. Smaller groups give better accuracy but more scale overhead. 128 is a common sweet spot.
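A quick way to reason about the overhead side of this answer is to compute the effective bits per weight for each group size. This sketch assumes one FP16 (16-bit) scale per group and no zero-points or double quantization; `effective_bits` is an illustrative helper.

```python
def effective_bits(group_size, weight_bits=4, scale_bits=16):
    """Bits per weight once the per-group scale is amortized over the group."""
    return weight_bits + scale_bits / group_size

overheads = {}
for g in (32, 64, 128):
    bits = effective_bits(g)
    overheads[g] = (bits - 4) / 4          # fraction added on top of the 4-bit weights
    print(f"group_size={g}: {bits:.3f} bits/weight, overhead {overheads[g]:.4f}")
```

Halving the group size doubles the scale overhead, which is why very small groups are usually paired with compressed scales.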

Q3: How does NF4 differ from uniform INT4?
A: NF4 places its 16 levels at quantiles of a normal distribution instead of spacing them uniformly. Since pre-trained weights are approximately normally distributed, this reduces quantization error compared with uniform INT4.

Q4: Can you do INT4 quantization without GPTQ/AWQ?
A: Yes, but expect a 5-10% accuracy drop from plain round-to-nearest with a single scale per tensor or channel. Round-to-nearest with group quantization gets you to a ~3-5% drop. GPTQ/AWQ optimize this to <2%.

Q5: What's the memory breakdown for INT4 model?
A: Weights: 4 bits per parameter. Scales: ~0.13 bits per parameter (with double quantization). KV cache: still FP16/INT8 (separate issue).

Q6: Why is INT4 harder than INT8 for activations?
A: Activations have a wider dynamic range and more outliers than weights. INT4's 16 values can't cover that range without severe clipping or poor resolution.
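The effect of a single outlier can be illustrated with synthetic data. This sketch uses simulated ("fake") quantization with one absmax scale per tensor; the activation distribution and the outlier value of 60 are made-up numbers chosen for illustration, and `fake_quant` is a hypothetical helper.

```python
import numpy as np

def fake_quant(x, bits):
    """Symmetric absmax quantization followed by dequantization (simulated)."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8, 7 for INT4
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=4096).astype(np.float32)  # synthetic activations
acts[0] = 60.0                                             # hypothetical outlier

results = {}
for bits in (8, 4):
    dq = fake_quant(acts, bits)
    mae = float(np.abs(acts - dq).mean())      # mean absolute error
    zero_frac = float(np.mean(dq == 0))        # share of values collapsed to zero
    results[bits] = (mae, zero_frac)
    print(f"INT{bits}: MAE={mae:.3f}, fraction quantized to zero={zero_frac:.3f}")
```

In this setup the INT4 step size is larger than almost every non-outlier activation, so nearly all of them round to zero, while INT8 still has enough levels to resolve most of them.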