GGUF & GGML
1. GGML (Georgi Gerganov Machine Learning)¶
A tensor library written in C for machine-learning inference, optimized for running LLMs on CPU.
Key Features:
- Pure C/C++ (no Python runtime)
- CPU-optimized kernels (AVX2, AVX512, NEON)
- 4-bit to 16-bit quantization
- Memory-mapped model loading
- Apple Metal, CUDA, OpenCL backends
GGUF (GGML Universal Format)¶
Successor to the (now deprecated) GGML file format; a single-file model container.
File Structure¶
[Header]
- Magic number: "GGUF"
- Version
- Tensor count, KV metadata count
[Metadata]
- Model hyperparameters
- Tokenizer config
- Quantization scheme
- Author, license, etc.
[Tensor Info]
- Name, dimensions, type, offset
[Tensor Data]
- Actual quantized weights (memory-mapped)
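Those fixed header fields can be read with a few struct unpacks. A minimal Python sketch, assuming the GGUF v2+ field widths (4-byte magic, uint32 version, uint64 tensor count, uint64 KV count) and a placeholder `model.gguf` path:

```python
import struct

# Read just the fixed-size GGUF header: magic, version, tensor/KV counts.
with open("model.gguf", "rb") as f:                # placeholder path
    magic = f.read(4)
    assert magic == b"GGUF", "not a GGUF file"
    version,   = struct.unpack("<I", f.read(4))    # uint32, little-endian
    n_tensors, = struct.unpack("<Q", f.read(8))    # uint64 tensor count
    n_kv,      = struct.unpack("<Q", f.read(8))    # uint64 metadata KV count

print(f"GGUF v{version}: {n_tensors} tensors, {n_kv} metadata entries")
```

The tensor-info table and KV metadata follow immediately after these fields, so a full reader keeps parsing from this offset.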
Advantages¶
- Single file: All model data + config in one .gguf
- Memory mapping: multi-GB models "load" almost instantly; pages are read from disk on demand rather than copied into RAM up front
- Extensible: KV metadata for any additional info
- Safe versioning: the magic number and version field let older loaders reject incompatible files cleanly
Quantization Types¶
K-Quantization (K-quants)¶
Optimized 2-6 bit quantization schemes:
| Type | Bits | Description | Use Case |
|---|---|---|---|
| Q2_K | ~2.5 | Extreme compression | Large models on limited RAM |
| Q3_K_S | ~3.4 | Small, less accurate | Acceptable quality loss |
| Q3_K_M | ~3.7 | Medium quality | Balanced |
| Q4_K_S | ~4.0 | Small, good quality | Recommended default |
| Q4_K_M | ~4.5 | Medium, best quality | Best 4-bit option |
| Q5_K_S | ~5.0 | Small, very good | Low loss |
| Q6_K | ~6.0 | High quality | Near-FP16 quality |
Legacy Quantization¶
- Q4_0: Original 4-bit (group size 32)
- Q4_1: 4-bit with per-group min (better than Q4_0)
- Q5_0, Q5_1: 5-bit variants
- Q8_0: INT8 quantization
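To make the block structure concrete, here is a simplified NumPy sketch of Q4_0-style quantization: 32 weights share one scale and each weight is stored as a 4-bit code. This illustrates the idea, not ggml's exact bit packing:

```python
import numpy as np

def quantize_q4_0_block(x: np.ndarray):
    """Quantize one block of 32 float weights to 4-bit codes + one scale."""
    assert x.shape == (32,)
    amax = x[np.argmax(np.abs(x))]            # signed value with largest magnitude
    d = amax / -8.0 if amax != 0 else 1.0     # per-block scale
    q = np.clip(np.round(x / d) + 8, 0, 15).astype(np.uint8)  # codes in [0, 15]
    return d, q

def dequantize_q4_0_block(d: float, q: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) - 8) * d

x = np.random.randn(32).astype(np.float32)
d, q = quantize_q4_0_block(x)
err = np.abs(x - dequantize_q4_0_block(d, q)).mean()
print(f"scale={d:.4f}, mean abs error={err:.4f}")
```

Q4_1 adds a per-block minimum on top of the scale, and the K-quants extend this to super-blocks with their own second-level scales and mins.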
Importance Matrix (I-quants)¶
Uses an importance matrix (computed from calibration data) to weight quantization error toward the most salient weights:
- IQ3_XXS: 3-bit with importance weighting
- IQ4_XS: 4-bit with importance weighting
llama.cpp¶
Reference implementation for GGUF inference.
# Convert HF model to GGUF (FP16)
python convert-hf-to-gguf.py model_path --outfile model-f16.gguf --outtype f16
# Quantize the FP16 GGUF to Q4_K_M
./quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
# Inference (-n: max new tokens, -c: context length, -t: CPU threads,
#  --mlock: lock model pages in RAM to prevent swapping)
./main -m model.gguf -p "Prompt" \
    -n 512 \
    -c 4096 \
    -t 8 \
    --mlock
CPU Optimizations¶
Kernel Fusion¶
Fuse adjacent operations into a single pass so intermediate tensors never round-trip through memory, e.g. computing scaled dot-product attention as one kernel:
attention = softmax(QK^T/√d) @ V
Cache-Friendly Layout¶
Reorder tensors so inner loops walk memory sequentially; keeping data in cache gives a large speedup over strided access on CPU.
Quantized Matrix Multiply¶
Custom AVX2/AVX512 kernels for INT4/INT8 GEMM. 4-8× faster than naive C.
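Conceptually, these kernels compute the dot product directly on the quantized blocks, dequantizing on the fly instead of expanding the whole tensor to FP32 first. A scalar NumPy sketch of the idea (real kernels also quantize the activations to INT8 and use SIMD integer instructions):

```python
import numpy as np

def q4_dot(d_scales, q_blocks, y):
    """Dot product of a Q4_0-style quantized row with a float vector y.

    d_scales: (n_blocks,) per-block scales
    q_blocks: (n_blocks, 32) uint8 codes in [0, 15]
    y:        (n_blocks * 32,) float activations
    """
    y_blocks = y.reshape(len(d_scales), 32)
    # Integer-centered dot product per block, scaled once at the end.
    acc = ((q_blocks.astype(np.int32) - 8) * y_blocks).sum(axis=1)
    return float((d_scales * acc).sum())

# One-block example: 0.05 * (12 - 8) * 32 = 6.4
d, q = 0.05, np.full((1, 32), 12, dtype=np.uint8)
print(q4_dot(np.array([d]), q, np.ones(32, dtype=np.float32)))
```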
Performance¶
M2 Max (Metal):
- 7B Q4_K_M: ~40 tokens/sec
- 13B Q4_K_M: ~25 tokens/sec
AMD 5950X (16-core):
- 7B Q4_K_M: ~30 tokens/sec
- 13B Q4_K_M: ~15 tokens/sec
Common Interview Questions¶
Q1: Why GGUF instead of safetensors or PyTorch?
A: GGUF is designed for inference, not training: memory-mapped loading, quantization metadata embedded in the file, and native support across the llama.cpp ecosystem.
Q2: What's memory mapping and why does it matter?
A: OS maps file directly to virtual memory. Model loads "instantly" because data stays on disk until accessed. Enables running larger models than RAM.
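A minimal illustration of the mechanism with Python's mmap module (`model.gguf` is a placeholder path; llama.cpp does the equivalent in C via mmap()):

```python
import mmap

with open("model.gguf", "rb") as f:                       # placeholder path
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Nothing is copied into RAM yet; pages fault in on first access.
    print(mm[:4])                                         # b'GGUF' — touches one page only
    mm.close()
```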
Q3: Why CPU inference for LLMs?
A: Consumer-hardware accessibility. Most users don't have high-end GPUs, and system RAM is larger and cheaper than VRAM, so quantized models that won't fit on a GPU can still run on the CPU.
Q4: What's the quality difference between Q4_K_M and Q8_0?
A: Q8_0: ~99% of FP16 quality. Q4_K_M: ~95-97%. For chat, Q4_K_M usually indistinguishable. For precise tasks, Q8_0 safer.
Q5: How do K-quants improve on original Q4_0?
A: Better block structure, per-block scales and mins, optimized for specific bit rates. Q4_K_M beats Q4_0 quality while being similar size.
Q6: Can GGUF models be used outside llama.cpp?
A: Yes. llama-cpp-python (Python bindings), GPT4All, Ollama, and LM Studio all load GGUF; whisper.cpp is part of the same ggml ecosystem.
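For example, loading a GGUF model from Python with llama-cpp-python looks roughly like this (parameter names as exposed by that library; the model path is a placeholder):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="model-q4_k_m.gguf",  # placeholder path to a GGUF file
    n_ctx=4096,                      # context length
    n_threads=8,                     # CPU threads
)
out = llm("Explain memory mapping in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```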
Q7: What's the tradeoff between Q4_K_S and Q4_K_M?
A: Q4_K_S: Smaller (~4.0 bpw), faster. Q4_K_M: Slightly larger (~4.5 bpw), better quality. Difference: ~0.3 PPL for 7B models.
Q8: Why multiple quantization types instead of just one?
A: Different hardware, use cases, quality requirements. Q2_K for extreme memory constraints, Q6_K for quality-critical applications, Q4_K_M for general use.
Q9: What's "bpw" (bits per weight)?
A: Effective bits including quantization metadata overhead. Q4_K_M is labeled "4-bit" but actually ~4.5 bpw due to scales/mins storage.
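As a concrete check, based on ggml's block_q4_K layout (a super-block of 256 weights stores two FP16 super-scales, 12 bytes of packed 6-bit sub-block scales/mins, and 128 bytes of 4-bit codes):

```python
d_dmin     = 2 * 2      # two FP16 super-block scales (d, dmin), in bytes
sub_scales = 12         # packed 6-bit scales/mins for the 8 sub-blocks, in bytes
codes      = 256 // 2   # 256 weights at 4 bits each, in bytes
bpw = (d_dmin + sub_scales + codes) * 8 / 256
print(bpw)              # 4.5 bits per weight
```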