⚙️ Block-wise k-bit Quantization


1. Overview

Block-wise k-bit quantization is a technique that compresses model weights into low-bit representations (e.g., 4-bit, 8-bit) while preserving performance and minimizing quantization error. Instead of quantizing the entire tensor with a single global scale, block-wise quantization divides the weight matrix into smaller blocks (chunks) and quantizes each block using its own local statistics, from which a per-block scale and zero-point are derived.

This local normalization significantly reduces quantization error caused by outlier values — a common issue in transformer weights.


2. Motivation

🧠 Problem: Outliers in Weight Distributions

Weights in large models (especially in attention layers) often follow heavy-tailed distributions — a few large values coexist with many small ones.
In global quantization, a single scale \( s_{\text{global}} = \frac{\max(|W|)}{2^{k-1}-1} \) is used for all weights.
Large outliers force the scale up, making most small weights collapse to zero after quantization.


Example

Consider: $$ W = [0.01, 0.02, -0.03, 0.05, 3.0] $$

With 4-bit global quantization: $$ s_{\text{global}} = \frac{3.0}{7} \approx 0.43 $$

Quantized weights → [0, 0, 0, 0, 7] — almost all small weights vanish due to the single large outlier.


✅ Solution: Block-wise Quantization

Split weights into small blocks (e.g., 64–256 values each), and compute a separate scale per block: $$ s_b = \frac{\max(|W_b|)}{2^{k-1}-1} $$ Each block adapts to its local range, preserving fine details while still compressing efficiently.

By partitioning weights into blocks and computing scale/offset per block, quantization adapts to local statistics and better preserves precision.
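A minimal numpy sketch (illustrative only, not the bitsandbytes implementation) that contrasts one global absmax scale with per-block scales on the toy vector from the example above:

import numpy as np

def absmax_quantize(x, num_bits=4):
    # Symmetric (absmax) quantization: one scale, no zero-point.
    scale = np.max(np.abs(x)) / (2 ** (num_bits - 1) - 1)
    q = np.round(x / scale)
    return q * scale  # dequantize immediately to inspect the error

w = np.array([0.01, 0.02, -0.03, 0.05, 3.0])

# Global: a single scale for the whole vector, so small weights round to 0
# and only the outlier survives.
print(absmax_quantize(w))

# Block-wise: each block gets its own scale, so small weights survive.
print(np.concatenate([absmax_quantize(b) for b in (w[:4], w[4:])]))
# ~[0.0071, 0.0214, -0.0286, 0.05, 3.0]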


3. Mathematical Formulation

📘 Steps for block quantization

Let:

  • \(W \in \mathbb{R}^{m \times n}\): full-precision weight matrix
  • \(B_i \subset W\): the i-th block, containing \(n_b\) values
  • \(k\): number of bits used for quantization (e.g., 4 or 8)

Step 1: Compute Local Scale and Zero-Point

For each block \(B_i\):

\[ s_i = \frac{\max(B_i) - \min(B_i)}{2^k - 1} \]
\[ z_i = \text{round}\left(-\frac{\min(B_i)}{s_i}\right) \]

Where:

  • \(s_i\): scale factor for block \(i\)
  • \(z_i\): zero-point (offset)

Step 2: Quantization

Quantized integer representation:

\[ q_i = \text{clip}\left(\text{round}\left(\frac{B_i}{s_i}\right) + z_i, 0, 2^k - 1\right) \]

Step 3: Dequantization (Reconstruction)

\[ \hat{B_i} = s_i \times (q_i - z_i) \]

The final reconstructed weight matrix:

\[ \hat{W} = \bigcup_i \hat{B_i} \]

4. Double Quantization

Double quantization is a secondary compression layer designed to reduce the overhead of storing multiple block-wise scales. Instead of storing each block’s scaling factor \( s_j \) as a 16-bit or 32-bit float, these scale values themselves are quantized into a lower precision representation (e.g., 8-bit or 4-bit).

📘 Details

4.1. Concept

If there are \(N\) blocks, each with a scale \(s_j\):

\[ \tilde{s_j} = \text{quantize}(s_j, s_{\text{meta}}, q_{\min}, q_{\max}) \]

Here, \(s_{\text{meta}}\) is a higher-level scale shared across a group of block-scales.

At dequantization (written here for symmetric absmax quantization, which needs no zero-point):

\[ s_j = s_{\text{meta}} \cdot \tilde{s_j} \]
\[ \hat{x_i} = s_j \cdot q_i \]

This substantially reduces the memory spent on quantization constants, especially when small block sizes are used and the number of stored scales is large; in QLoRA, for example, the per-parameter overhead of the constants drops from about 0.5 bits to roughly 0.127 bits.
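A minimal sketch of this idea (hypothetical helper names; bitsandbytes actually stores the scales in an 8-bit float format with a per-group offset, while this sketch uses plain linear 8-bit quantization to show the structure):

import numpy as np

def double_quantize_scales(scales, meta_bits=8):
    # Quantize the per-block scales against one shared meta-scale,
    # so each block scale is stored as a single unsigned byte.
    s_meta = scales.max() / (2 ** meta_bits - 1)
    q_scales = np.clip(np.round(scales / s_meta), 0, 2 ** meta_bits - 1).astype(np.uint8)
    return q_scales, s_meta

def dequantize_scales(q_scales, s_meta):
    # s_j ~= s_meta * quantized s_j, matching the formula above.
    return s_meta * q_scales.astype(np.float32)

scales = np.abs(np.random.randn(10_000)).astype(np.float32) * 0.01
q_scales, s_meta = double_quantize_scales(scales)
print(np.abs(dequantize_scales(q_scales, s_meta) - scales).max())  # small error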

4.2. Example

Consider a model layer with 10,000 blocks of weights.
Each block has one scale \( s_i \).

| Parameter                             | Value   |
|---------------------------------------|---------|
| Number of blocks                      | 10,000  |
| Scale per block (FP16)                | 2 bytes |
| Memory (without double quantization)  | 20 KB   |
| Quantized scale (8-bit)               | 1 byte  |
| Memory (with double quantization)     | 10 KB   |

So in this example, double quantization halves the metadata memory, with negligible accuracy degradation (typically < 0.1%).

4.3. Implementation in Bitsandbytes

In bitsandbytes 0.39+, both block-wise quantization and double quantization are implemented jointly:

  • Each weight block is quantized in NF4 format.
  • Each block’s scale value is quantized using 8-bit quantization.
  • The quantized scales are stored alongside the 4-bit codes.
  • Dequantization happens transparently during forward passes.

This enables models like LLaMA-2 70B to be fine-tuned with QLoRA on a single 48GB GPU.
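For example, with the Hugging Face transformers integration (assuming transformers, accelerate, and bitsandbytes are installed; the checkpoint name is just an illustration), NF4 block-wise quantization and double quantization are enabled together via BitsAndBytesConfig:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # block-wise 4-bit weight quantization
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,         # also quantize the per-block scales
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls at runtime
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)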


4.4 Key Notes

  • Double quantization builds on block-wise quantization: it compresses the per-block scales that block-wise quantization produces.
  • It targets metadata (scale) storage, not the weight codes themselves, so model accuracy is essentially unchanged.
  • QLoRA uses it to store per-block scales efficiently.
| Aspect               | Effect                                            |
|----------------------|---------------------------------------------------|
| Memory efficiency    | Up to 2× reduction in metadata storage            |
| Accuracy impact      | Negligible (< 0.1% degradation)                   |
| Computation overhead | Minimal (scales dequantized once per block)       |
| Compatibility        | Fully supported in the bitsandbytes & QLoRA stack |

5. Implementation Details (Pseudo-Code)

import numpy as np

def blockwise_quantize(weights, block_size=64, num_bits=4):
    """Asymmetric block-wise quantization of a 1-D float array (Section 3)."""
    q_blocks, scales, zeros = [], [], []
    n = len(weights)
    for i in range(0, n, block_size):
        block = weights[i:i + block_size]
        min_val, max_val = block.min(), block.max()
        # Step 1: local scale and zero-point (fall back to 1.0 for constant blocks).
        scale = (max_val - min_val) / (2 ** num_bits - 1) or 1.0
        zero_point = int(np.round(-min_val / scale))
        # Step 2: quantize to integers in [0, 2^k - 1].
        q_block = np.clip(np.round(block / scale) + zero_point,
                          0, 2 ** num_bits - 1).astype(np.uint8)
        q_blocks.append(q_block)
        scales.append(scale)
        zeros.append(zero_point)
    return q_blocks, scales, zeros
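A matching reconstruction routine and a quick round-trip check (a sketch that follows the formulas in Section 3, not the bitsandbytes CUDA kernels):

def blockwise_dequantize(q_blocks, scales, zeros):
    # Step 3: reconstruct each block as s_i * (q_i - z_i) and concatenate.
    return np.concatenate([s * (q.astype(np.float32) - z)
                           for q, s, z in zip(q_blocks, scales, zeros)])

w = np.random.randn(256).astype(np.float32)
q_blocks, scales, zeros = blockwise_quantize(w, block_size=64, num_bits=4)
w_hat = blockwise_dequantize(q_blocks, scales, zeros)
print(np.abs(w - w_hat).max())  # maximum per-element reconstruction error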

6. Example (4-bit Quantization)

📘 Working example

Consider a block of weights:

\[ B_i = [-0.9, -0.3, 0.2, 0.5, 1.0] \]

For \(k = 4\) bits:

  • \(\min = -0.9, \max = 1.0\)
  • \(s_i = (1.0 - (-0.9)) / 15 = 0.1267\)
  • \(z_i = -(-0.9) / 0.1267 = 7.1 \approx 7\)

Quantized values:

\[ q_i = \text{round}(B_i / s_i + z_i) = [0, 5, 9, 11, 15] \]

Dequantized:

\[ \hat{B_i} = s_i \times (q_i - 7) = [-0.89, -0.25, 0.25, 0.51, 1.01] \]

The reconstruction closely approximates the original block.
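The same numbers can be reproduced with the routines defined in Section 5 above (an illustrative check, with outputs rounded to two decimals):

block = np.array([-0.9, -0.3, 0.2, 0.5, 1.0])
q_blocks, scales, zeros = blockwise_quantize(block, block_size=5, num_bits=4)
print(q_blocks[0])                                      # [ 0  5  9 11 15]
print(round(scales[0], 4), zeros[0])                    # 0.1267 7
print(blockwise_dequantize(q_blocks, scales, zeros).round(2))
# [-0.89 -0.25  0.25  0.51  1.01]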


7. Advantages

| Aspect        | Benefit                                                            |
|---------------|--------------------------------------------------------------------|
| Local scaling | Reduces sensitivity to outliers                                    |
| Memory        | Lower storage cost (e.g., 4-bit weights are 8× smaller than FP32)  |
| Compute       | Enables efficient GPU matrix multiplication with custom kernels    |
| Accuracy      | Performance closer to full precision                               |

8. Hardware Implementation

  • Most modern inference frameworks (e.g., bitsandbytes, TensorRT) store the scale and zero-point per block.
  • For 4-bit quantization, typical block sizes: 32, 64, or 128.
  • Scales are typically stored in FP16 or FP32 to balance precision and storage; double quantization can compress them further.

9. Visualization

A conceptual diagram of block-wise quantization:

┌────────────────────────────┐
│         Weight Matrix      │
│  [w₁, w₂, …, wₙ]          │
└────────────────────────────┘
          ↓ Split into Blocks
┌──────────────┬──────────────┐
│ Block 1      │ Block 2      │ ...
└──────────────┴──────────────┘
     ↓                 ↓
Compute s₁,z₁      Compute s₂,z₂
     ↓                 ↓
Quantize each block separately
     ↓                 ↓
Store q₁,s₁,z₁,...,qₙ,sₙ,zₙ

Each block retains its own quantization scale and offset, enabling more accurate low-bit representation.


10. Relationship to QLoRA

QLoRA uses 4-bit NormalFloat (NF4) quantization with block-wise statistics:

  • Each block (typically 64 elements) is normalized by its absolute maximum (absmax), which serves as the block's scale; no zero-point is needed.
  • The normalized values in [-1, 1] are mapped to 16 fixed NF4 levels, chosen as quantiles of a standard normal distribution so that each level is used about equally often (see the sketch below).
  • Combined with LoRA adapters trained in higher precision, this allows fine-tuning large LLMs on a single GPU without significant accuracy loss.
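A rough sketch of NF4-style block quantization (the 16 levels below are the NF4 code values from the QLoRA paper and the bitsandbytes source, rounded to four decimals; the real kernels also pack two 4-bit codes per byte and run on the GPU):

import numpy as np

# Approximate NF4 levels: quantiles of a standard normal, rescaled to [-1, 1].
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def nf4_quantize_block(block):
    absmax = np.abs(block).max()            # per-block scale (no zero-point)
    codes = np.abs(block[:, None] / absmax - NF4_LEVELS).argmin(axis=1)
    return codes.astype(np.uint8), absmax

def nf4_dequantize_block(codes, absmax):
    return NF4_LEVELS[codes] * absmax

block = np.random.randn(64).astype(np.float32) * 0.02
codes, absmax = nf4_quantize_block(block)
print(np.abs(nf4_dequantize_block(codes, absmax) - block).max())  # small error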