⚙️ Block-wise k-bit Quantization¶
1. Overview¶
Block-wise k-bit quantization is a technique that compresses model weights into low-bit representations (e.g., 4-bit, 8-bit) while preserving performance and minimizing quantization error. Instead of quantizing each value independently, block-wise quantization divides the weight matrix into smaller blocks (chunks) and quantizes each block using its own local statistics, from which a per-block scale and zero-point are derived.
This local normalization significantly reduces quantization error caused by outlier values — a common issue in transformer weights.
2. Motivation¶
🧠 Problem: Outliers in Weight Distributions¶
Weights in large models (especially in attention layers) often follow heavy-tailed distributions — a few large values coexist with many small ones.
In global quantization, a single scale \( s_{\text{global}} = \frac{\max(|W|)}{2^{k-1}-1} \) is used for all weights.
Large outliers force the scale up, making most small weights collapse to zero after quantization.
Example
Consider: $$ W = [0.01, 0.02, -0.03, 0.05, 3.0] $$
With 4-bit global quantization: $$ s_{\text{global}} = \frac{3.0}{7} \approx 0.43 $$
Quantized weights → [0, 0, 0, 0, 7] — almost all small weights vanish due to the single large outlier.
✅ Solution: Block-wise Quantization¶
Split weights into small blocks (e.g., 64–256 values each), and compute a separate scale per block: $$ s_b = \frac{\max(|W_b|)}{2^{k-1}-1} $$ Each block adapts to its local range, so quantization follows the local statistics of the weights and preserves fine detail while still compressing efficiently, as the sketch below illustrates.
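A small NumPy sketch makes the contrast concrete, reusing the five weights from the example above; the block size of 4 is chosen purely to separate the outlier into its own block:

```python
import numpy as np

w = np.array([0.01, 0.02, -0.03, 0.05, 3.0])

# Global symmetric 4-bit quantization: one scale for all weights.
s_global = np.abs(w).max() / 7
print(np.round(w / s_global))          # [0. 0. -0. 0. 7.] -> the small weights vanish

# Block-wise: a separate scale per block (block size 4 splits off the outlier).
for block in (w[:4], w[4:]):
    s_b = np.abs(block).max() / 7
    q = np.round(block / s_b)
    print(q, s_b * q)                  # small weights survive within their own block
```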
3. Mathematical Formulation¶
📘 Steps for block quantization
Let:
- \(W \in \mathbb{R}^{d \times m}\): full-precision weight matrix
- \(B_i \subset W\): the i-th block of size \(n_b\)
- \(k\): number of bits used for quantization (e.g., 4 or 8)
Step 1: Compute Local Scale and Zero-Point¶
For each block \(B_i\):
$$ s_i = \frac{\max(B_i) - \min(B_i)}{2^{k}-1}, \qquad z_i = \operatorname{round}\!\left(\frac{-\min(B_i)}{s_i}\right) $$
Where:
- \(s_i\): scale factor for block \(i\)
- \(z_i\): zero-point (offset), which shifts \(\min(B_i)\) onto the integer code 0
Step 2: Quantization¶
Quantized integer representation of each weight \(w \in B_i\):
$$ q = \operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{w}{s_i} + z_i\right),\; 0,\; 2^{k}-1\right) $$
Step 3: Dequantization (Reconstruction)¶
Each weight is reconstructed from its integer code by inverting the mapping:
$$ \hat{w} = s_i \, (q - z_i) $$
Applying this to every block yields the final reconstructed weight matrix \(\hat{W} \approx W\).
4. Double Quantization¶
Double quantization is a secondary compression layer designed to reduce the overhead of storing multiple block-wise scales. Instead of storing each block’s scaling factor \( s_j \) as a 16-bit or 32-bit float, these scale values themselves are quantized into a lower precision representation (e.g., 8-bit or 4-bit).
📘 Details
4.1. Concept¶
If there are \(N\) blocks, each with a scale \(s_j\), the scales themselves are quantized against a shared meta-scale:
$$ \tilde{s}_j = \operatorname{round}\!\left(\frac{s_j}{s_{\text{meta}}}\right) $$
Here, \(s_{\text{meta}}\) is a higher-level scale shared across a group of block-scales.
At dequantization, each block scale is first recovered as \(s_j \approx \tilde{s}_j \cdot s_{\text{meta}}\) and then used to reconstruct that block's weights as usual.
Because the block scales are themselves compressed, this noticeably reduces the memory spent on quantization constants, especially when small block sizes mean that a large number of scales must be stored (the QLoRA paper reports an average saving of roughly 0.37 bits per parameter).
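A minimal sketch of the idea, assuming the simplest variant in which each group of block scales shares one FP32 meta-scale; the function name and the group size of 256 are illustrative, not the exact bitsandbytes scheme:

```python
import numpy as np

def double_quantize_scales(scales, group_size=256, meta_bits=8):
    """Quantize per-block scales against one shared meta-scale per group."""
    qmax = 2 ** meta_bits - 1
    q_scales, meta_scales = [], []
    for i in range(0, len(scales), group_size):
        group = np.asarray(scales[i:i + group_size], dtype=np.float32)
        s_meta = group.max() / qmax                       # meta-scale for this group
        q_scales.append(np.round(group / s_meta).astype(np.uint8))
        meta_scales.append(float(s_meta))
    return q_scales, meta_scales
# At dequantization: s_j ≈ q_scales[g][j] * meta_scales[g]
```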
4.2. Example¶
Consider a model layer with 10,000 blocks of weights.
Each block has one scale \( s_i \).
| Parameter | Value |
|---|---|
| Number of blocks | 10,000 |
| Scale per block (FP16) | 2 bytes |
| Memory (without double quantization) | 20 KB |
| Quantized scale (8-bit) | 1 byte |
| Memory (with double quantization) | 10 KB |
So double quantization reduces metadata memory by 50% with negligible degradation (typically < 0.1% accuracy loss).
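The table's figures are easy to check directly; the single FP32 meta-scale per group of 256 block-scales assumed below adds only a negligible amount on top of the quantized scales:

```python
num_blocks = 10_000

fp16_scales = num_blocks * 2                   # 20,000 bytes ≈ 20 KB (no double quantization)
int8_scales = num_blocks * 1                   # 10,000 bytes of quantized scales
meta_scales = -(-num_blocks // 256) * 4        # 40 FP32 meta-scales ≈ 160 bytes (assumed group size 256)
print(fp16_scales, int8_scales + meta_scales)  # 20000 vs. 10160 bytes
```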
4.3. Implementation in Bitsandbytes¶
In bitsandbytes 0.39+, both block-wise quantization and double quantization are implemented jointly:
- Each weight block is quantized in NF4 format.
- Each block’s scale value is quantized using 8-bit quantization.
- The quantized scales are stored alongside the 4-bit codes.
- Dequantization happens transparently during forward passes.
This enables models like LLaMA-2 70B to be fine-tuned on a single 48 GB GPU.
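Through the Hugging Face transformers integration, the whole stack is switched on with a BitsAndBytesConfig; the model id below is only a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # block-wise 4-bit weight quantization
    bnb_4bit_quant_type="nf4",              # NF4 data type for the weight blocks
    bnb_4bit_use_double_quant=True,         # also quantize the per-block scales
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls after dequantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model id
    quantization_config=bnb_config,
)
```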
4.4. Key Notes¶
- Double quantization is orthogonal but complementary to block-wise quantization.
- It primarily targets metadata compression, not model accuracy.
- Used in QLoRA to compress per-block scales efficiently.
| Aspect | Effect |
|---|---|
| Memory Efficiency | Up to 2× reduction in metadata storage |
| Accuracy Impact | Negligible (< 0.1% degradation) |
| Computation Overhead | Minimal (scales dequantized once per block) |
| Compatibility | Fully supported in bitsandbytes & QLoRA stack |
5. Implementation Details (Pseudo-Code)¶
```python
import numpy as np

def blockwise_quantize(weights, block_size=64, num_bits=4):
    """Asymmetric block-wise quantization of a 1-D array of weights."""
    q_blocks, scales, zeros = [], [], []
    qmax = 2 ** num_bits - 1
    weights = np.asarray(weights, dtype=np.float32)
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        min_val, max_val = block.min(), block.max()
        # Per-block scale maps [min, max] onto the integer range [0, qmax].
        scale = (max_val - min_val) / qmax
        if scale == 0:                 # constant block: avoid division by zero
            scale = 1.0
        zero_point = int(np.round(-min_val / scale))
        q_block = np.clip(np.round(block / scale + zero_point), 0, qmax).astype(np.uint8)
        q_blocks.append(q_block)
        scales.append(float(scale))
        zeros.append(zero_point)
    return q_blocks, scales, zeros
```
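A matching dequantization helper and a quick round-trip check might look as follows (the helper name is ours, reusing numpy and blockwise_quantize from the block above):

```python
def blockwise_dequantize(q_blocks, scales, zeros):
    """Reconstruct an approximate full-precision array from block-wise codes."""
    return np.concatenate([
        s * (q.astype(np.float32) - z)
        for q, s, z in zip(q_blocks, scales, zeros)
    ])

w = np.random.randn(1024).astype(np.float32) * 0.05
q_blocks, scales, zeros = blockwise_quantize(w, block_size=64, num_bits=4)
w_hat = blockwise_dequantize(q_blocks, scales, zeros)
print("max abs error:", np.abs(w - w_hat).max())   # bounded by about half a scale step per block
```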
6. Example (4-bit Quantization)¶
📘 Working example
Consider a block of weights whose smallest value is \(-0.9\) and largest is \(1.0\).
For \(k = 4\) bits:
- \(\min = -0.9,\ \max = 1.0\)
- \(s_i = (1.0 - (-0.9)) / 15 \approx 0.1267\)
- \(z_i = \operatorname{round}(-(-0.9) / 0.1267) = \operatorname{round}(7.1) = 7\)
Quantized values: each weight \(w\) in the block is mapped to \(q = \operatorname{clip}(\operatorname{round}(w / 0.1267 + 7),\, 0,\, 15)\); for instance \(w = -0.9 \rightarrow q = 0\), \(w = 0 \rightarrow q = 7\), and \(w = 1.0 \rightarrow q = 15\).
Dequantized: \(\hat{w} = 0.1267\,(q - 7)\), giving approximately \(-0.887\), \(0.0\), and \(1.013\) for the three values above.
The reconstruction closely approximates the original block.
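These numbers can be reproduced with the blockwise_quantize helper from Section 5; the block contents below are illustrative, since the text only fixes the minimum and maximum:

```python
import numpy as np

# Illustrative block; only min = -0.9 and max = 1.0 are specified above.
block = np.array([-0.9, -0.4, 0.0, 0.3, 0.7, 1.0], dtype=np.float32)

q_blocks, scales, zeros = blockwise_quantize(block, block_size=len(block), num_bits=4)
q, s, z = q_blocks[0], scales[0], zeros[0]

reconstructed = s * (q.astype(np.float32) - z)
print("scale:", round(s, 4), "zero-point:", z)    # ~0.1267 and 7
print("quantized:", q)                            # integer codes in [0, 15]
print("max abs error:", np.abs(block - reconstructed).max())
```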
7. Advantages¶
| Aspect | Benefit |
|---|---|
| Local scaling | Reduces sensitivity to outliers |
| Memory | Lower storage cost (e.g., 4-bit gives ~8× compression vs. FP32) |
| Compute | Enables efficient GPU matrix-multiplication with custom kernels |
| Accuracy | Closer performance to full precision |
8. Hardware Implementation¶
- Most modern inference frameworks (e.g., bitsandbytes, TensorRT) store the scale and zero-point per block.
- For 4-bit quantization, typical block sizes: 32, 64, or 128.
- Scales are typically stored in FP16 or FP32 to balance precision and storage.
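The metadata overhead is easy to quantify: each block adds one scale (and, for asymmetric schemes, one zero-point) on top of the k-bit codes. The 8-bit zero-point assumed below is illustrative; symmetric formats such as NF4 store no zero-point at all.

```python
def effective_bits_per_weight(num_bits=4, block_size=64,
                              scale_bits=16, zero_point_bits=8):
    # Each block stores one scale and one zero-point in addition to the codes.
    return num_bits + (scale_bits + zero_point_bits) / block_size

for bs in (32, 64, 128):
    print(bs, effective_bits_per_weight(block_size=bs))
# 32 -> 4.75, 64 -> 4.375, 128 -> 4.1875 bits per weight
```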
9. Visualization¶
A conceptual diagram of block-wise quantization:
```
┌────────────────────────────┐
│        Weight Matrix       │
│      [w₁, w₂, …, wₙ]       │
└────────────────────────────┘
            ↓ Split into Blocks
┌──────────────┬──────────────┐
│   Block 1    │   Block 2    │ ...
└──────────────┴──────────────┘
       ↓              ↓
 Compute s₁,z₁   Compute s₂,z₂
       ↓              ↓
 Quantize each block separately
       ↓              ↓
 Store q₁,s₁,z₁, ..., qₙ,sₙ,zₙ
```
Each block retains its own quantization scale and offset, enabling more accurate low-bit representation.
10. Relationship to QLoRA¶
QLoRA uses 4-bit NormalFloat (NF4) quantization with block-wise statistics:
- Each block (typically 64 elements) is normalized by its local absolute maximum, mapping its values into [-1, 1].
- The normalized values are rounded to the nearest of 16 fixed NF4 code points, which are derived from the quantiles of a standard normal distribution and therefore match the roughly normal distribution of pretrained weights.
- This approach allows fine-tuning large LLMs on a single GPU without significant accuracy loss.
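To make the idea concrete, here is a simplified NF4-style quantizer, assuming SciPy is available; the code points are built from evenly spaced normal quantiles and are *not* the exact table used by bitsandbytes (which, among other details, guarantees an exact zero code):

```python
import numpy as np
from scipy.stats import norm

def nf4_like_codebook():
    # 16 code points from evenly spaced quantiles of N(0, 1), rescaled to [-1, 1].
    probs = np.linspace(0.5 / 16, 1 - 0.5 / 16, 16)
    q = norm.ppf(probs)
    return q / np.abs(q).max()

def nf4_like_quantize(block):
    codebook = nf4_like_codebook()
    absmax = np.abs(block).max()          # per-block scale: the only stored float
    normalized = block / absmax
    idx = np.abs(normalized[:, None] - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), absmax

def nf4_like_dequantize(idx, absmax):
    return nf4_like_codebook()[idx] * absmax

block = np.random.randn(64) * 0.02
idx, absmax = nf4_like_quantize(block)
print("max abs error:", np.abs(block - nf4_like_dequantize(idx, absmax)).max())
```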