GGUF & GGML

1. GGML (Georgi Gerganov Machine Learning)

C library for machine learning inference, optimized for CPU execution of LLMs.

Key Features:
- Pure C/C++ (no Python runtime)
- CPU-optimized kernels (AVX2, AVX512, NEON)
- 4-bit to 16-bit quantization
- Memory-mapped model loading
- Apple Metal, CUDA, and OpenCL backends



2. GGUF (GGML Universal Format)

The successor to the now-deprecated GGML file format. A single-file model container.

File Structure

[Header]
- Magic number: "GGUF"
- Version
- Tensor count, KV metadata count

[Metadata]
- Model hyperparameters
- Tokenizer config
- Quantization scheme
- Author, license, etc.

[Tensor Info]
- Name, dimensions, type, offset

[Tensor Data]
- Actual quantized weights (memory-mapped)
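
A minimal sketch of reading the fixed-size header fields above with Python's struct module, assuming the little-endian GGUF v2/v3 layout (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata KV count). The variable-length metadata and tensor-info sections that follow are omitted here.

import struct

def read_gguf_header(path):
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        version, = struct.unpack("<I", f.read(4))            # format version
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    return {"version": version,
            "tensor_count": tensor_count,
            "metadata_kv_count": kv_count}

# e.g. print(read_gguf_header("model.gguf"))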

Advantages

  • Single file: All model data + config in one .gguf
  • Memory mapping: Load multi-GB models instantly, use minimal RAM
  • Extensible: KV metadata for any additional info
  • Safe versioning: the "GGUF" magic and version field let older GGML loaders fail cleanly instead of misparsing newer files


Quantization Types

K-Quantization (K-quants)

Optimized 2-6 bit quantization schemes:

Type      Bits (bpw)  Description            Use Case
Q2_K      ~2.5        Extreme compression    Large models on limited RAM
Q3_K_S    ~3.4        Small, less accurate   Acceptable quality loss
Q3_K_M    ~3.7        Medium quality         Balanced
Q4_K_S    ~4.0        Small, good quality    Recommended default
Q4_K_M    ~4.5        Medium, best quality   Best 4-bit option
Q5_K_S    ~5.0        Small, very good       Low loss
Q6_K      ~6.0        High quality           Near-FP16 quality

Legacy Quantization

  • Q4_0: Original 4-bit (group size 32)
  • Q4_1: 4-bit with per-group min (better than Q4_0)
  • Q5_0, Q5_1: 5-bit variants
  • Q8_0: INT8 quantization
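
A simplified NumPy sketch of the group-quantization idea behind Q4_0 (32 weights share one scale) and Q4_1 (scale plus a per-group minimum). This illustrates the concept only; the exact llama.cpp block layouts and rounding rules differ.

import numpy as np

GROUP = 32

def quantize_q4_0_like(w):
    # symmetric: one scale per group of 32, values mapped to [-8, 7]
    w = w.reshape(-1, GROUP)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale                      # dequantize: q * scale

def quantize_q4_1_like(w):
    # asymmetric: scale plus per-group minimum, values mapped to [0, 15]
    w = w.reshape(-1, GROUP)
    lo = w.min(axis=1, keepdims=True)
    scale = (w.max(axis=1, keepdims=True) - lo) / 15.0 + 1e-12
    q = np.clip(np.round((w - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo                  # dequantize: q * scale + lo

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q4_0_like(w)
print("mean abs error:", np.abs(q * s - w.reshape(-1, GROUP)).mean())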

Importance Matrix (I-quants)

Uses importance scores (computed from calibration data) to allocate more precision to salient weights:

  • IQ3_XXS: ~3-bit with importance weighting
  • IQ4_XS: ~4-bit with importance weighting
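
A rough sketch of the importance-matrix idea: instead of taking the plain abs-max scale, search for the per-group scale that minimizes an importance-weighted reconstruction error. Purely illustrative; the real IQ formats also use different bit packings and codebooks.

import numpy as np

def quantize_group_imatrix(w, importance, bits=4, n_candidates=32):
    qmax = 2 ** (bits - 1) - 1                        # e.g. 7 for 4-bit
    base = np.abs(w).max() / qmax + 1e-12
    best_scale, best_err = base, np.inf
    for f in np.linspace(0.7, 1.1, n_candidates):     # search around the abs-max scale
        s = base * f
        q = np.clip(np.round(w / s), -qmax - 1, qmax)
        err = np.sum(importance * (q * s - w) ** 2)   # importance-weighted error
        if err < best_err:
            best_scale, best_err = s, err
    return np.clip(np.round(w / best_scale), -qmax - 1, qmax), best_scale

w = np.random.randn(32).astype(np.float32)
imp = np.random.rand(32).astype(np.float32)           # e.g. mean squared activations
q, s = quantize_group_imatrix(w, imp)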



llama.cpp

Reference implementation for GGUF inference.

# Convert an HF model to GGUF (FP16), then quantize to Q4_K_M
python convert-hf-to-gguf.py model_path --outtype f16 --outfile model-f16.gguf
./quantize model-f16.gguf model-q4_K_M.gguf Q4_K_M

# Inference
./main -m model-q4_K_M.gguf -p "Prompt" \
  -n 512 \
  -c 4096 \
  -t 8 \
  --mlock
# -n: max tokens to generate, -c: context length, -t: CPU threads,
# --mlock: lock model pages in RAM (prevent swapping)


CPU Optimizations

Kernel Fusion

attention = softmax(QK^T/√d) @ V
Fused into a single kernel instead of three separate ops (scores, softmax, weighted sum), avoiding materializing the intermediate matrices.
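
A NumPy stand-in for what fusion means here: the naive version writes the full score and probability matrices to memory; the "fused" version streams one query row at a time and never materializes them. Not the real llama.cpp kernel, just the shape of the transformation.

import numpy as np

def attention_unfused(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])            # intermediate 1: scores
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)            # intermediate 2: probabilities
    return P @ V                                  # op 3: weighted sum

def attention_fused_rowwise(Q, K, V):
    out = np.empty((Q.shape[0], V.shape[1]), dtype=Q.dtype)
    for i, q in enumerate(Q):                     # one query row at a time
        s = (K @ q) / np.sqrt(Q.shape[-1])        # scores for this row only
        p = np.exp(s - s.max()); p /= p.sum()
        out[i] = p @ V                            # accumulate without storing full P
    return out

Q = np.random.randn(16, 64); K = np.random.randn(16, 64); V = np.random.randn(16, 64)
assert np.allclose(attention_unfused(Q, K, V), attention_fused_rowwise(Q, K, V))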


Cache-Friendly Layout

Tensors are reordered so inner loops read memory sequentially rather than with large strides, which alone gives a dramatic speedup on CPU.
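
A small illustration (not a llama.cpp benchmark) of why sequential access matters: copying the same data from a contiguous source versus from a transposed, strided view. The strided copy touches a new cache line for almost every element.

import time
import numpy as np

A = np.random.randn(8192, 8192).astype(np.float32)   # C-contiguous (row-major)

def bench(src, reps=5):
    t0 = time.perf_counter()
    for _ in range(reps):
        np.ascontiguousarray(src)        # forces an element-by-element copy
    return (time.perf_counter() - t0) / reps

print("sequential source:", bench(A), "s")
print("strided source   :", bench(A.T), "s")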


Quantized Matrix Multiply

Custom AVX2/AVX512 kernels for INT4/INT8 GEMM. 4-8× faster than naive C.
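
A NumPy sketch of what a quantized (Q8_0-style) dot product computes: int8 operands with one float scale per 32-element group, int32 accumulation inside each group, and one float multiply per group at the end. The real kernels do the same arithmetic with AVX2/AVX512/NEON intrinsics.

import numpy as np

GROUP = 32

def quantize_q8(x):
    x = x.reshape(-1, GROUP)
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0 + 1e-12
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8), scale

def dot_q8(aq, ascale, bq, bscale):
    # int8 x int8 -> int32 accumulation per group, then scale back to float
    acc = (aq.astype(np.int32) * bq.astype(np.int32)).sum(axis=1, keepdims=True)
    return float((acc * ascale * bscale).sum())

a = np.random.randn(4096).astype(np.float32)
b = np.random.randn(4096).astype(np.float32)
aq, asc = quantize_q8(a); bq, bsc = quantize_q8(b)
print("quantized:", dot_q8(aq, asc, bq, bsc), " float32:", float(a @ b))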



Performance

M2 Max (Metal):
- 7B Q4_K_M: ~40 tokens/sec
- 13B Q4_K_M: ~25 tokens/sec

AMD 5950X (16-core):
- 7B Q4_K_M: ~30 tokens/sec
- 13B Q4_K_M: ~15 tokens/sec



Common Interview Questions

Q1: Why GGUF instead of safetensors or PyTorch?
A: GGUF designed for inference, not training. Memory-mapped loading, quantization metadata embedded, optimized for llama.cpp ecosystem.


Q2: What's memory mapping and why does it matter?
A: The OS maps the file directly into virtual memory. The model "loads" instantly because data stays on disk until the pages are actually accessed. Enables running models larger than available RAM (at the cost of paging).
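
A minimal illustration with Python's mmap module ("model.gguf" is a placeholder path): the mapping call returns immediately, and bytes are paged in from disk only when touched. llama.cpp does the equivalent via mmap()/MapViewOfFile.

import mmap

with open("model.gguf", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # no bulk copy into RAM
    print(mm[:4])        # touching bytes faults in just those pages -> b'GGUF'
    mm.close()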


Q3: Why CPU inference for LLMs?
A: Consumer hardware accessibility. Most users don't have high-end GPUs, and CPUs have access to much larger (and cheaper) memory, so bigger models fit once quantized.


Q4: What's the quality difference between Q4_K_M and Q8_0?
A: Q8_0: ~99% of FP16 quality. Q4_K_M: ~95-97%. For chat, Q4_K_M usually indistinguishable. For precise tasks, Q8_0 safer.


Q5: How do K-quants improve on original Q4_0?
A: Better block structure, per-block scales and mins, optimized for specific bit rates. Q4_K_M beats Q4_0 quality while being similar size.


Q6: Can GGUF models be used outside llama.cpp?
A: Yes. Libraries: llama-cpp-python (Python), whisper.cpp (audio), GPT4All, Ollama, LM Studio all support GGUF.
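
A minimal sketch of using a GGUF model from Python via the llama-cpp-python bindings (pip install llama-cpp-python); the model path and generation parameters are placeholders.

from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=4096, n_threads=8)   # memory-maps the GGUF file
out = llm("Q: Name the planets in the solar system. A:", max_tokens=128)
print(out["choices"][0]["text"])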


Q7: What's the tradeoff between Q4_K_S and Q4_K_M?
A: Q4_K_S: Smaller (~4.0 bpw), faster. Q4_K_M: Slightly larger (~4.5 bpw), better quality. Difference: ~0.3 PPL for 7B models.


Q8: Why multiple quantization types instead of just one?
A: Different hardware, use cases, quality requirements. Q2_K for extreme memory constraints, Q6_K for quality-critical applications, Q4_K_M for general use.


Q9: What's "bpw" (bits per weight)?
A: Effective bits including quantization metadata overhead. Q4_K_M is labeled "4-bit" but actually ~4.5 bpw due to scales/mins storage.
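
Where the ~4.5 bpw figure for Q4_K comes from, assuming the block_q4_K layout in the llama.cpp source (per super-block of 256 weights: two fp16 scales, 12 bytes of packed 6-bit sub-block scales/mins, 128 bytes of packed 4-bit values):

# bytes per 256-weight super-block: 2*2 (d, dmin fp16) + 12 (scales/mins) + 128 (4-bit weights)
block_bytes = 2 * 2 + 12 + 256 // 2      # = 144
print(block_bytes * 8 / 256)             # -> 4.5 bits per weight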