Paged Attention
1. Overview
PagedAttention is a memory management technique for efficient LLM serving that stores KV cache in non-contiguous memory blocks (pages). It's the core innovation behind vLLM, enabling near-zero waste in KV cache memory and higher throughput.
Key insight: Treat KV cache like virtual memory in operating systems—use paging to eliminate fragmentation and enable flexible memory sharing.
2. The KV Cache Memory Problem
2.1 What is KV Cache?
In autoregressive generation, transformers reuse Key and Value tensors from previous tokens:
- Without cache: Recompute K, V for all previous tokens at each step (wasteful)
- With cache: Store K, V tensors and only compute for new token
For a sequence of length \(N\) with \(L\) layers, \(H\) heads, and per-head dimension \(d\): $$ \text{KV cache elements} = 2 \times N \times L \times H \times d $$ At 2 bytes per element (fp16), the size in bytes is twice that.
Example: LLaMA-13B (40 layers, 40 heads, head dimension 128) with 2048 tokens ≈ 1.7 GB per sequence in fp16
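A quick sanity check of that figure (the layer/head counts are LLaMA-13B's published shapes):

```python
N, L, H, d = 2048, 40, 40, 128       # tokens, layers, heads, head dim
bytes_per_elem = 2                    # fp16
kv_bytes = 2 * N * L * H * d * bytes_per_elem
print(f"{kv_bytes / 1e9:.1f} GB")     # -> 1.7 GB
```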
2.2 Problems with Traditional KV Cache
Problem 1: Memory Fragmentation
Issue: Must pre-allocate contiguous memory for maximum sequence length
```
Sequence 1: [████████░░░░]  (800 tokens, allocated for 2048)
Sequence 2: [███░░░░░░░░░]  (300 tokens, allocated for 2048)
Sequence 3: [██████░░░░░░]  (600 tokens, allocated for 2048)
```
- Actual usage: 1700 tokens
- Allocated: 6144 slots (3 × 2048)
- Waste: 72% of memory unused! (quick check below)
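Verifying those numbers:

```python
allocated = 3 * 2048            # three sequences, each reserved at max length
used = 800 + 300 + 600          # tokens actually in use
print(f"waste = {1 - used / allocated:.0%}")   # -> waste = 72%
```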
Problem 2: No Memory Sharing
- Cannot share KV cache across sequences (even with identical prompts)
- Parallel sampling requires duplicating entire cache
- Beam search creates multiple copies
Problem 3: Static Allocation
- Must allocate for worst case (max sequence length)
- Can't dynamically adjust based on actual needs
- Limits batch size and throughput
3. How PagedAttention Works
3.1 Core Concept: Paging
Divide KV cache into fixed-size blocks (pages), similar to OS virtual memory:
```
Logical sequence: [Token 0, Token 1, ..., Token N]
                          ↓
Physical memory:  [Block 0] → [Block 5] → [Block 2]   (non-contiguous)
```
Key properties:
- Block size: Typically 16-64 tokens
- Blocks can be anywhere in physical memory
- Mapping tracked via block table (like OS page table)
3.2 Block Table
Each sequence has a block table mapping logical blocks to physical blocks:
```
Sequence 1:
  Logical Block 0 → Physical Block 3
  Logical Block 1 → Physical Block 7
  Logical Block 2 → Physical Block 1

Sequence 2:
  Logical Block 0 → Physical Block 3   (shared with Seq 1!)
  Logical Block 1 → Physical Block 9
```
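A minimal sketch of the lookup itself, assuming a fixed block size (`BLOCK_SIZE` and `locate` are illustrative names):

```python
BLOCK_SIZE = 16  # tokens per block

def locate(block_table: list[int], token_idx: int) -> tuple[int, int]:
    """Translate a logical token index to (physical block, offset),
    exactly like a page-table walk in an OS."""
    return block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

# Token 20 of Sequence 1 above: logical block 1, offset 4
print(locate([3, 7, 1], 20))  # -> (7, 4)
```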
3.3 Dynamic Allocation
Blocks are allocated on-demand as sequences grow:
```python
# Conceptual allocation (pseudocode: Q, allocate_free_block, and the
# sequence object stand in for engine internals)
def generate_token(sequence):
    # Grow the sequence by one block only when the current block fills up
    if sequence.last_block_is_full():
        new_block = allocate_free_block()
        sequence.block_table.append(new_block)
    # Attention reads K/V through the block-table indirection
    output = paged_attention(Q, sequence.block_table)
    return output
```
Benefits:
- Only allocate what's actually used
- No pre-allocation for max length
- Memory freed immediately when a sequence completes (see the allocator sketch below)
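A toy free-list allocator showing both behaviors (names are hypothetical; real engines such as vLLM layer ref-counting and scheduling on top):

```python
class BlockAllocator:
    """Hands out fixed-size physical block IDs from a free list."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free blocks: preempt, swap, or wait")
        return self.free_blocks.pop()

    def free(self, block_table: list[int]) -> None:
        # Called when a sequence finishes: its blocks return immediately
        self.free_blocks.extend(block_table)

allocator = BlockAllocator(num_blocks=1024)
seq_blocks = [allocator.allocate() for _ in range(3)]  # sequence grows on demand
allocator.free(seq_blocks)                             # sequence completes
```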
3.4 Memory Sharing via Copy-on-Write
Multiple sequences can share blocks (read-only):
```
Prompt: "Translate to French: "
            ↓
[Block 0: "Translate to French: "]  ← shared by all sequences

Seq 1: [Block 0] → [Block 3: "Hello → "]
Seq 2: [Block 0] → [Block 5: "Goodbye → "]
```
When a sequence writes to a shared block, copy-on-write kicks in:
- Allocate new physical block
- Copy contents
- Update block table
- Modify the copy (a sketch of all four steps follows)
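A minimal sketch of those steps using per-block reference counts (all names are hypothetical; `allocator` is the free-list allocator idea from section 3.3):

```python
from collections import defaultdict

ref_count = defaultdict(int)   # physical block ID -> number of sequences using it

def write_to_block(seq_table, logical_idx, allocator, storage):
    """Copy-on-write: give this sequence a private copy of a shared block."""
    phys = seq_table[logical_idx]
    if ref_count[phys] > 1:                      # block is shared
        new_phys = allocator.allocate()          # 1. allocate a new physical block
        storage[new_phys] = list(storage[phys])  # 2. copy contents
        ref_count[phys] -= 1                     # old block has one less reader
        ref_count[new_phys] = 1
        seq_table[logical_idx] = new_phys        # 3. update this sequence's block table
        phys = new_phys
    return phys                                  # 4. caller now modifies the copy
```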
4. Attention Computation with Paging
Standard attention accesses KV cache contiguously. PagedAttention accesses via block table:
```python
# Simplified PagedAttention for a single query vector q
import numpy as np

def paged_attention(q, block_table, K_blocks, V_blocks, d):
    # Online softmax: process one physical block at a time, so the full
    # attention row is never materialized (a per-block softmax summed
    # naively would NOT equal a global softmax).
    m, denom = -np.inf, 0.0                  # running max and denominator
    out = np.zeros(V_blocks[0].shape[-1])
    for physical_idx in block_table:         # block-table indirection
        K_blk = K_blocks[physical_idx]       # fetch K from physical block
        V_blk = V_blocks[physical_idx]       # fetch V from physical block
        scores = K_blk @ q / np.sqrt(d)      # (block_size,)
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)            # rescale earlier accumulators
        p = np.exp(scores - m_new)
        denom = denom * scale + p.sum()
        out = out * scale + p @ V_blk
        m = m_new
    return out / denom
```
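For concreteness, a toy invocation (shapes are illustrative: 10 physical blocks of 4 tokens each, head dimension 8):

```python
rng = np.random.default_rng(0)
K_blocks = rng.standard_normal((10, 4, 8))   # 10 physical blocks × 4 tokens × dim 8
V_blocks = rng.standard_normal((10, 4, 8))
q = rng.standard_normal(8)
# Logical blocks 0 and 1 live at physical slots 3 and 7
out = paged_attention(q, block_table=[3, 7], K_blocks=K_blocks, V_blocks=V_blocks, d=8)
```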
Key insight: The indirection (block table lookup) has minimal overhead compared to memory savings.
5. Performance Impact
Memory Efficiency
- Traditional: 20-40% KV cache utilization (60-80% waste)
- PagedAttention: 80-95% utilization (5-20% waste)
- 2-3x more sequences in same memory
Throughput Improvement
- vLLM with PagedAttention: 2-4x higher throughput vs. traditional serving
- Batch size is limited by memory, so less waste directly enables bigger batches
Latency
- Minimal overhead from block table lookups (<5%)
- Often better latency due to higher batch efficiency
6. Interview Questions
Q1: What problem does PagedAttention solve?
Answer: PagedAttention solves memory fragmentation and waste in KV cache management. Traditional approaches pre-allocate contiguous memory for max sequence length, wasting 60-80% of memory. PagedAttention uses non-contiguous blocks allocated on-demand, achieving 80-95% utilization and enabling 2-4x higher throughput.
Q2: How is PagedAttention similar to OS virtual memory?
Answer: Both use paging:
- Virtual memory: Maps virtual addresses to physical pages via page table
- PagedAttention: Maps logical KV cache positions to physical blocks via block table
Both enable non-contiguous allocation, on-demand paging, and copy-on-write sharing.
Q3: What's a typical block size and why?
Answer: Typically 16-64 tokens. Trade-offs:
- Too small: High block table overhead, more lookups during attention
- Too large: Internal fragmentation (wasted space within partially-filled blocks)
- Sweet spot: 16-64 balances overhead vs. fragmentation, similar in spirit to OS page sizes like 4KB (quantified below)
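A quick look at the trade-off for a 2048-token sequence (illustrative arithmetic, assuming the last block is half full on average):

```python
seq_len = 2048
for block_size in (4, 16, 64, 256):
    table_entries = -(-seq_len // block_size)   # ceil: blocks needed per sequence
    wasted_tokens = block_size / 2              # expected internal fragmentation
    print(f"block={block_size:3d}: {table_entries:4d} table entries, "
          f"~{wasted_tokens:5.1f} tokens wasted")
```

Small blocks shrink waste but inflate the block table and lookup count; large blocks do the reverse.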
Q4: How does PagedAttention enable memory sharing?
Answer: Multiple sequences can point to the same physical blocks (read-only). Common use cases:
- Shared prompts: All sequences share blocks containing the same prompt
- Parallel sampling: Multiple outputs from same prompt share prefix blocks
- Beam search: Different beams share common prefix
When a shared block needs modification → copy-on-write: allocate new block, copy contents, update that sequence's block table.
Q5: What's the overhead of block table lookups?
Answer: Minimal (typically <5%) because:
- Block tables are small (they fit in cache)
- Lookups are simple integer indexing
- Attention computation dominates (matrix ops on blocks)
- Modern GPUs handle indirection efficiently
The memory savings far outweigh this small overhead.
Q6: How does PagedAttention improve throughput?
Answer: By reducing memory waste:
- Traditional: Can fit 10 sequences (60% waste)
- PagedAttention: Can fit 25 sequences (10% waste) in same memory
- Bigger batches → better GPU utilization → higher throughput
Typical improvement: 2-4x more requests/second.
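Those example numbers are roughly self-consistent, as a back-of-envelope check shows (memory budget normalized to 1.0):

```python
budget = 1.0
per_seq_useful = (budget * 0.40) / 10            # traditional: 10 seqs, ~40% of memory useful
paged_seqs = (budget * 0.90) / per_seq_useful    # paged: ~90% of the same memory is useful
print(f"{paged_seqs:.0f} sequences")             # -> 22, in the ballpark of the 25 above
```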
Q7: What happens when we run out of physical blocks?
Answer: Memory management strategies:
- Preemption: Evict lower-priority sequences, save their state
- Swapping: Move blocks to CPU memory (like OS swap)
- Recomputation: Drop blocks and recompute if needed
- Blocking: Wait until blocks free up
vLLM typically uses preemption for fairness and efficiency (a minimal sketch follows).
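A sketch of the allocate-or-preempt decision, reusing the `BlockAllocator` idea from section 3.3 (the `priority` and `block_table` attributes are hypothetical; vLLM's actual scheduler is more involved):

```python
def allocate_or_preempt(allocator, running, waiting):
    """Try to allocate a block; if none are free, preempt the
    lowest-priority running sequence and reclaim its blocks."""
    try:
        return allocator.allocate()
    except MemoryError:
        victim = min(running, key=lambda seq: seq.priority)
        running.remove(victim)
        allocator.free(victim.block_table)  # reclaims all of the victim's blocks
        waiting.append(victim)              # victim is swapped out or recomputed later
        return allocator.allocate()
```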
Q8: Can PagedAttention work with FlashAttention?
Answer: Yes! They're complementary:
- FlashAttention: Optimizes attention computation (tiling, kernel fusion)
- PagedAttention: Optimizes KV cache memory management (paging, sharing)
You can use both together: FlashAttention for fast computation, PagedAttention for efficient memory. vLLM does exactly this.
Q9: What's the difference between block size and tile size?
Answer:
- Block size (PagedAttention): memory management granularity (16-64 tokens)
  - Determines the allocation unit for KV cache storage
- Tile size (FlashAttention): computation granularity (128-256 tokens)
  - Determines how much data is loaded into shared memory at once
They're independent concepts operating at different levels (memory management vs computation).
Q10: What are the limitations of PagedAttention?
Answer:
- Complexity: More complex implementation than contiguous allocation
- Indirection overhead: Small cost from block table lookups
- GPU kernel changes: Requires custom attention kernels that understand block tables
- Internal fragmentation: Last block in sequence may be partially empty
Despite these, benefits (2-4x throughput) far outweigh costs for LLM serving.
7. PagedAttention in Practice (vLLM)
Key Features
```python
# vLLM with PagedAttention (model ID assumes the Hugging Face hub name)
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
prompts = ["Translate to French: Hello"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# Automatic memory management:
# - blocks allocated on-demand
# - shared prompts reuse blocks
# - memory freed automatically when requests finish
outputs = llm.generate(prompts, sampling_params)
```
Use Cases Where It Shines
✅ High-throughput serving (many concurrent requests)
✅ Long sequences (less pre-allocation waste)
✅ Parallel sampling / beam search (shared prefixes)
✅ Shared system prompts across requests
❌ Single-sequence inference (no sharing benefits)
❌ Very short sequences (overhead not amortized)
8. Key Takeaways for Interviews
- Main idea: Treat KV cache like OS virtual memory—use paging for efficient, flexible allocation
- Problem solved: Memory fragmentation (60-80% waste → 5-20% waste)
- Mechanism: Block table maps logical positions to physical memory blocks
- Sharing: Copy-on-write enables multiple sequences to share read-only blocks
- Performance: 2-4x throughput improvement in LLM serving
- Complementary: Works alongside FlashAttention (computation vs memory optimization)
9. Comparison Table
| Aspect | Traditional KV Cache | PagedAttention |
|---|---|---|
| Allocation | Contiguous, pre-allocated | Non-contiguous, on-demand |
| Memory waste | 60-80% | 5-20% |
| Max sequences | Limited by pre-allocation | 2-4x more in same memory |
| Sharing | No sharing | Copy-on-write sharing |
| Complexity | Simple | More complex |
| Overhead | None | <5% (block lookups) |
| Throughput | Baseline | 2-4x higher |