Quantization Tradeoffs

Memory vs. Quality Spectrum

| Precision | Memory (7B) | Typical PPL Δ | Use Case |
|---|---|---|---|
| FP16 | 14 GB | 0.0 (baseline) | Training, high-quality inference |
| INT8 | 7 GB | +0.1-0.5 | Production standard |
| INT4 (GPTQ/AWQ) | 3.5 GB | +0.5-1.5 | Commodity GPU inference |
| 3-bit | 2.6 GB | +1.5-3.0 | Extreme compression |
| Q2_K | 2 GB | +3.0-5.0 | Last resort |
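
As a sanity check on the memory column: weight memory scales roughly linearly with bit width, plus a small overhead for scales and zero-points in sub-8-bit formats. A minimal sketch (the parameter count and the "no metadata" simplification are assumptions):

```python
# Approximate weight memory for a 7B-parameter model at various bit widths.
# Real sub-8-bit formats (GPTQ, AWQ, Q4_K_M, Q2_K) store extra scales/zero-points,
# which is why e.g. Q2_K lands near 2 GB rather than the raw 1.75 GB.
PARAMS = 7e9

def weight_memory_gb(bits_per_weight: float) -> float:
    """Raw weight memory in GB (1 GB = 1e9 bytes), ignoring quantization metadata."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("3-bit", 3), ("2-bit", 2)]:
    print(f"{name:5s} ~{weight_memory_gb(bits):.1f} GB")
```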


Speed vs. Quality

Inference Latency (7B model, batch=1)

| Method | GPU (A100) | CPU (32-core) |
|---|---|---|
| FP16 | 20 ms/token | N/A (OOM) |
| INT8 | 10 ms/token | N/A (OOM) |
| INT4 (AWQ) | 7 ms/token | 80 ms/token |
| GGUF Q4_K_M | 8 ms/token | 35 ms/token |

Key insight: CPU competitive for quantized models, especially with optimized kernels.
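
The reason the CPU closes the gap is that batch-1 decoding is memory-bandwidth-bound: every weight is read once per generated token, so per-token latency is bounded below by model size divided by memory bandwidth. A back-of-the-envelope sketch; the bandwidth figures below are rough assumptions, not measurements:

```python
# Lower-bound per-token latency for memory-bound, batch=1 decoding:
# latency >= (bytes of weights read per token) / (memory bandwidth).
def min_latency_ms(model_gb: float, bandwidth_gb_s: float) -> float:
    return model_gb / bandwidth_gb_s * 1000

# Assumed bandwidths: A100 HBM ~2000 GB/s; typical server DDR ~150 GB/s.
for device, bw in [("A100", 2000), ("32-core CPU", 150)]:
    for fmt, size_gb in [("FP16", 14.0), ("INT4", 3.5)]:
        print(f"{device:12s} {fmt:5s} >= {min_latency_ms(size_gb, bw):5.1f} ms/token")
```

Going from FP16 to INT4 cuts the bytes moved per token by 4×, which is where most of the CPU speedup in the table comes from.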



Quantization Method Selection

Decision Tree

Need extreme compression (2-3 bit)? → GPTQ (best quality at extreme compression)

Standard 4-bit, fast quantization needed? → AWQ (10 min vs 4 hours for GPTQ, similar quality)

CPU deployment? → GGUF with llama.cpp (optimized CPU kernels)

GPU deployment, production quality? → INT8 with SmoothQuant (robust, well-supported)

Fine-tuning on limited memory? → QLoRA with NF4 (efficient training)
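
A small selector function makes the branching explicit. This is only a sketch of the tree above; the function and its parameters are illustrative, not from any library:

```python
def pick_quantization(
    target_bits: int,
    deploy_on_cpu: bool = False,
    fine_tuning: bool = False,
    quality_critical: bool = False,
) -> str:
    """Map deployment constraints to a quantization approach (mirrors the decision tree)."""
    if fine_tuning:
        return "QLoRA with NF4"        # memory-efficient fine-tuning
    if deploy_on_cpu:
        return "GGUF (llama.cpp)"      # optimized CPU kernels
    if target_bits <= 3:
        return "GPTQ"                  # best quality at extreme compression
    if quality_critical and target_bits >= 8:
        return "INT8 + SmoothQuant"    # robust production default
    return "AWQ"                       # fast 4-bit quantization, similar quality to GPTQ

print(pick_quantization(target_bits=4))                         # AWQ
print(pick_quantization(target_bits=8, quality_critical=True))  # INT8 + SmoothQuant
```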



Layer-wise Quantization Strategy

Typical Configuration

Embeddings: FP16 (critical for semantic space)
Attention Weights (Q, K, V): INT4/INT8
Attention Output: INT8
FFN Weights: INT4 (largest, most compressible)
FFN Activations: INT8
Layer Norm: FP16 (small, sensitive)
Final Layer: FP16 or INT8

Rationale

  • FFN: 66% of parameters, less sensitive → aggressive INT4
  • Attention: 33% of parameters, more sensitive → INT8 or careful INT4
  • Norms/Embeddings: <1% of parameters → keep FP16
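
In practice this policy is usually expressed as a per-module precision map that the quantization tool consumes. A hedged sketch using generic Llama-style module names; the exact keys, and how the map is consumed, depend on your model and toolkit:

```python
# Per-module precision map mirroring the configuration above.
# Module name patterns are illustrative (Llama-style); adapt to your model.
LAYER_PRECISION = {
    "embed_tokens":             "fp16",  # embeddings: critical for semantic space
    "self_attn.q_proj":         "int4",
    "self_attn.k_proj":         "int4",
    "self_attn.v_proj":         "int4",
    "self_attn.o_proj":         "int8",  # attention output: slightly more sensitive
    "mlp.gate_proj":            "int4",  # FFN: largest, most compressible
    "mlp.up_proj":              "int4",
    "mlp.down_proj":            "int4",
    "input_layernorm":          "fp16",  # norms: small, sensitive
    "post_attention_layernorm": "fp16",
    "lm_head":                  "fp16",  # final layer: fp16 or int8
}

def precision_for(module_name: str) -> str:
    """Return the target precision for a module, defaulting to int8."""
    for pattern, precision in LAYER_PRECISION.items():
        if pattern in module_name:
            return precision
    return "int8"
```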


Mixed Precision Strategies

W4A8 (Weight 4-bit, Activation 8-bit)

  • Best of both worlds for many use cases
  • Weights: AWQ/GPTQ 4-bit
  • Activations: SmoothQuant INT8
  • 6-8× memory reduction, <1% quality loss
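
To make W4A8 concrete, here is a minimal fake-quantization sketch: symmetric per-output-channel 4-bit weights and per-tensor 8-bit activations. Real kernels keep the integer values and fuse the rescaling into the matmul; this only illustrates the rounding and the scales.

```python
import torch

def quantize_symmetric(x: torch.Tensor, n_bits: int, dim=None):
    """Symmetric quantization: returns (int values, scale). dim=None -> per-tensor."""
    qmax = 2 ** (n_bits - 1) - 1
    if dim is None:
        scale = x.abs().max() / qmax
    else:
        scale = x.abs().amax(dim=dim, keepdim=True) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q, scale

def w4a8_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Fake-quantized linear layer: 4-bit weights (per output channel), 8-bit activations."""
    wq, w_scale = quantize_symmetric(weight, n_bits=4, dim=1)  # per-row (output channel)
    xq, x_scale = quantize_symmetric(x, n_bits=8)              # per-tensor
    # Integer-valued matmul, then rescale back to floating point.
    return (xq @ wq.t()) * x_scale * w_scale.t()

x = torch.randn(2, 512)         # activations
w = torch.randn(1024, 512)      # weight [out_features, in_features]
print(w4a8_linear(x, w).shape)  # torch.Size([2, 1024])
```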

W8A8 (Both 8-bit)

  • Production standard for quality-critical apps
  • 4× memory reduction
  • Hardware-accelerated on all modern platforms
  • <0.5% quality loss with SmoothQuant


Hardware Considerations

NVIDIA GPUs

  • Tensor Cores: INT8 (Turing+), INT4 (Hopper)
  • Recommendation: INT8 for A100, INT4 for H100
  • Custom kernels: AWQ's TinyChat, ExLlamaV2 for GPTQ

AMD GPUs

  • ROCm: INT8 support
  • Recommendation: INT8, limited INT4 optimization
  • Ecosystem: Less mature than NVIDIA

Apple Silicon

  • Metal: INT8, INT4 via llama.cpp
  • Recommendation: GGUF Q4_K_M or Q6_K
  • Strength: Unified memory architecture

CPU (x86)

  • VNNI (Cascade Lake+): INT8 acceleration
  • AVX512: INT8/INT4 kernels
  • Recommendation: GGUF with llama.cpp, Q4_K_M sweet spot


Calibration Data Tradeoffs

Size

  • 100 samples: Usually sufficient, fast
  • 1000 samples: Marginal quality improvement
  • 10000 samples: No additional benefit, waste of time

Diversity vs. Representativeness

  • In-domain: Better for specialized models
  • General (WikiText): Better for general models
  • Mixed: Best for production
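
A hedged sketch of building such a mixed calibration set, assuming the Hugging Face `datasets` library and WikiText-2 as the general corpus; `my_domain_docs` is a hypothetical in-domain text list you supply:

```python
import random
from datasets import load_dataset

def build_calibration_set(in_domain_texts, n_samples=128, mix_ratio=0.5, seed=0):
    """Mix general (WikiText) and in-domain text for calibration.

    mix_ratio is the fraction of samples drawn from the general corpus.
    ~100-200 samples is usually enough; more gives diminishing returns.
    """
    rng = random.Random(seed)
    wiki = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    general = [t for t in wiki["text"] if len(t) > 200]  # drop headings/blank lines

    n_general = int(n_samples * mix_ratio)
    samples = rng.sample(general, n_general)
    samples += rng.sample(list(in_domain_texts), n_samples - n_general)
    rng.shuffle(samples)
    return samples

# calib = build_calibration_set(my_domain_docs, n_samples=128)  # my_domain_docs: your corpus
```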


Dynamic vs. Static Quantization

Static (PTQ)

  • Pros: Faster inference, lower memory
  • Cons: Fixed scales, may underfit outliers
  • Best for: Stable input distributions


Dynamic

  • Pros: Adapts to inputs, better quality
  • Cons: Runtime overhead (scale computation)
  • Best for: Varied input distributions, activation quantization
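
The distinction is easiest to see in where the activation scale comes from: static quantization fixes it ahead of time from calibration data, while dynamic quantization recomputes it from each incoming tensor at runtime. A minimal PyTorch sketch:

```python
import torch

QMAX = 127  # int8 symmetric

def static_scale(calibration_batches):
    """Static: fix the scale once from calibration data (cheap at inference, may miss outliers)."""
    max_abs = max(batch.abs().max() for batch in calibration_batches)
    return max_abs / QMAX

def quantize(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return torch.clamp(torch.round(x / scale), -QMAX - 1, QMAX)

# Static: one scale, reused for every input.
calib = [torch.randn(8, 512) for _ in range(16)]
s = static_scale(calib)
x = torch.randn(8, 512)
x_q_static = quantize(x, s)

# Dynamic: recompute the scale per input at runtime (extra work, adapts to outliers).
s_dyn = x.abs().max() / QMAX
x_q_dynamic = quantize(x, s_dyn)
```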



Common Interview Questions

Q1: When would you use INT8 over INT4? A: (1) Quality-critical applications where even 0.5% degradation matters, (2) hardware with INT8 acceleration but no INT4 support, (3) activation quantization (INT4 activations are too lossy; INT4 is generally reserved for weights).


Q2: What's the minimum model size for effective quantization? A: ~1B parameters. Smaller models have less redundancy, so quantization hurts more; below ~500M parameters, stick to FP16.


Q3: How do you decide between GPTQ and AWQ? A: GPTQ for 3-bit or when quality is paramount. AWQ for 4-bit, faster iteration, production deadlines. Quality difference minimal at 4-bit.


Q4: What's the biggest failure mode of quantization? A: Outlier channels not handled properly. SmoothQuant, AWQ, or mixed-precision decomposition (LLM.int8()) essential for robust quantization.


Q5: Can you quantize a fine-tuned model? A: Yes, but it is often better to fine-tune with QLoRA (training LoRA adapters on top of an NF4-quantized base). Fine-tuned models can be more sensitive to post-hoc quantization than their base models.


Q6: What's the practical lower bound for useful quantization? A: 2-bit with current methods. Below that, quality degrades unacceptably even for large models (70B+). Active research on sub-2-bit.


Q7: How much quality loss is acceptable? A: Domain-dependent. Chatbots: 2-3% acceptable. Code generation: <1%. Reasoning tasks: <0.5%. Benchmark on your specific use case.


Q8: Should you quantize KV cache? A: Yes for long context (4K+). INT8 KV cache with SmoothQuant: 2× memory savings, <0.5% quality loss. Critical for 32K+ context.
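
For illustration, a rough sketch of INT8 KV-cache quantization with per-head scales (a simplification; production implementations handle outlier channels and often use per-token or per-group scales):

```python
import torch

def quantize_kv(kv: torch.Tensor):
    """Quantize a KV-cache tensor [batch, heads, seq, head_dim] to int8 with per-head scales."""
    scale = kv.abs().amax(dim=(0, 2, 3), keepdim=True) / 127  # one scale per head
    kv_int8 = torch.clamp(torch.round(kv / scale), -128, 127).to(torch.int8)
    return kv_int8, scale

def dequantize_kv(kv_int8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return kv_int8.float() * scale

k = torch.randn(1, 32, 4096, 128)  # 4K-context key cache, 32 heads, head_dim 128
k_q, s = quantize_kv(k)
print(k.element_size() * k.nelement() / 2**20,      # FP32 cache size in MiB
      k_q.element_size() * k_q.nelement() / 2**20)  # INT8: 4x smaller (2x vs the usual FP16 cache)
```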


Q9: What's the ROI of quantization engineering time? A: High. 4 hours of GPTQ quantization → 4× weight-memory reduction (FP16 → INT4) → roughly 4× more concurrent users per GPU → roughly 4× lower serving cost. One-time cost, continuous savings.


Q10: Biggest misconception about quantization? A: "Fewer bits always means faster." Reality: memory-bound scenarios (small batches, autoregressive decoding) see a speedup; compute-bound scenarios often don't, since weights are dequantized before the matmul. Batch size and context length matter more than bit width for speed.