# vLLM
## 1. Core Innovation: PagedAttention
**Problem:** traditional engines pre-allocate contiguous KV-cache memory for the maximum sequence length
- 60-80% of GPU memory wasted on over-reservation
- Internal fragmentation from unused allocated space

**Solution:** paged memory management for the KV cache
- KV cache split into fixed-size blocks (pages)
- Non-contiguous physical memory mapped via per-sequence block tables
- Reduces waste to <4%, enabling 2-3x larger batch sizes
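The block-table idea can be sketched in a few lines. This is a toy model with hypothetical names (`BlockAllocator`, `append_token`), not vLLM's actual internals: a sequence's logical blocks map to arbitrary physical block ids, and a new physical block is allocated only when the last one fills up.

```python
BLOCK_SIZE = 16  # tokens per KV block

class BlockAllocator:
    """Toy allocator: hands out physical block ids from a free pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

def append_token(block_table, num_tokens, allocator):
    """Grow a sequence by one token, allocating a new physical
    block only when the current last block is full."""
    if num_tokens % BLOCK_SIZE == 0:
        block_table.append(allocator.alloc())
    return num_tokens + 1

allocator = BlockAllocator(num_blocks=64)
table = []           # logical block i -> physical block id
n = 0
for _ in range(40):  # 40 tokens -> ceil(40/16) = 3 blocks
    n = append_token(table, n, allocator)

print(len(table))    # 3 blocks, and they need not be contiguous
```

Because growth happens one block at a time, the only waste is the unused tail of the final block (< one block per sequence), which is where the <4% figure comes from.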
**Memory formula** (L layers, T tokens, H KV heads of dimension D_h, B bytes per element; the factor 2 covers keys and values):

KV Memory ≈ 2 × L × T × H × D_h × B

For Llama-3 8B (L=32, H × D_h = 4096, FP16 so B=2): ~0.5 MB per token
- 2k tokens → ~1 GB
- 8k tokens → ~4 GB

(These figures assume all 32 attention heads are cached; Llama-3 8B actually uses GQA with 8 KV heads, which cuts the KV cache by 4x.)
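The formula above is easy to check numerically. A minimal calculator, plugging in the Llama-3 8B geometry under the same all-heads-cached assumption as the figures above:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    # 2x for keys and values
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama-3 8B geometry, FP16, assuming all 32 heads are cached:
per_token = kv_bytes_per_token(layers=32, kv_heads=32,
                               head_dim=128, dtype_bytes=2)
print(per_token / 2**20)         # 0.5 MB per token
print(2048 * per_token / 2**30)  # 1.0 GB for a 2k-token context
```

Swapping `kv_heads=8` in for the GQA configuration drops both numbers by 4x.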
## 2. Continuous Batching
**vs static batching:** a static batcher waits for the entire batch to complete before accepting new requests.
**vLLM approach:** iteration-level scheduling
- New requests fill slots freed by completed sequences immediately
- Eliminates GPU idle time ("bubbles")
- Increases throughput by 20-30%
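Iteration-level scheduling can be sketched as a loop in which every running sequence decodes one token per step, and freed slots are refilled from the waiting queue in the same iteration (hypothetical names, not vLLM's scheduler):

```python
from collections import deque

def continuous_batching(waiting, max_batch, steps_needed):
    """Toy iteration-level scheduler: each step, every running sequence
    decodes one token; finished sequences free their slot, which is
    refilled from the waiting queue at the next iteration boundary."""
    waiting = deque(waiting)
    running = {}             # request id -> tokens still to generate
    completed = []
    while waiting or running:
        # admit new requests into freed slots immediately
        while waiting and len(running) < max_batch:
            rid = waiting.popleft()
            running[rid] = steps_needed[rid]
        # one decode iteration for the whole batch
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                del running[rid]
                completed.append(rid)
    return completed

done = continuous_batching(["a", "b", "c"], max_batch=2,
                           steps_needed={"a": 1, "b": 3, "c": 1})
print(done)  # ['a', 'c', 'b'] — "c" starts as soon as "a" frees a slot
```

A static batcher would instead have made "c" wait until both "a" and "b" finished, leaving its slot idle for two iterations.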
## 3. Prefill vs Decode Phases
| Phase | Processing | Bottleneck | vLLM Optimization |
|---|---|---|---|
| Prefill | Parallel over tokens | Compute-bound | Chunked prefill |
| Decode | Sequential per token | Memory-bandwidth | PagedAttention |
**Chunked prefill:** splits large prompts into chunks so a long prefill does not stall ongoing decode iterations
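The chunking itself is simple: each scheduler iteration then mixes one prompt chunk with the batch's decode work. A minimal sketch (hypothetical helper, with an assumed 4096-token chunk size):

```python
def chunk_prompt(prompt_len, chunk_size):
    """Split a long prompt into (start, end) prefill chunks so each
    scheduler iteration processes at most one chunk of prompt tokens
    alongside ongoing decode requests."""
    chunks = []
    start = 0
    while start < prompt_len:
        end = min(start + chunk_size, prompt_len)
        chunks.append((start, end))
        start = end
    return chunks

print(chunk_prompt(10_000, 4096))  # [(0, 4096), (4096, 8192), (8192, 10000)]
```

Without chunking, a 10k-token prompt would occupy an entire iteration's compute budget, stalling every decoding sequence in the batch for that step.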
## 4. Modern Features (2025-2026)
### Speculative Decoding
- Small draft model generates k tokens
- Large target model verifies in single forward pass
- 2-3x latency reduction for heavy models
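One round of the draft-then-verify loop, in the greedy case, looks like this. A toy sketch with stand-in `draft_next`/`target_next` functions in place of real models (the real verification is a single batched forward pass over all k proposed positions):

```python
def speculative_step(draft_next, target_next, context, k):
    """One round of greedy speculative decoding: the draft proposes
    k tokens; the target checks them and keeps the longest agreeing
    prefix, then appends its own correction token."""
    proposal = []
    ctx = list(context)
    for _ in range(k):             # k cheap draft steps
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    accepted = []
    ctx = list(context)
    for tok in proposal:           # target verifies in one pass
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    accepted.append(target_next(ctx))  # target's own token always lands
    return accepted

# Toy models: they agree on the first two tokens, then diverge.
draft  = lambda ctx: len(ctx) * 2
target = lambda ctx: len(ctx) * 2 if len(ctx) < 2 else 99
out = speculative_step(draft, target, context=[], k=4)
print(out)  # [0, 2, 99]: two accepted draft tokens plus the correction
```

Each round emits at least one token and up to k+1, which is where the latency reduction comes from when the draft's acceptance rate is high.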
### Automatic Prefix Caching (APC)
- Shared KV blocks for common prefixes (system prompts, RAG contexts)
- Multiple requests reference same physical memory
- Critical for multi-turn chat and RAG applications
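The sharing mechanism can be sketched by keying each KV block on the hash of all tokens up to and including that block; two requests with the same prefix then resolve to the same physical block ids. A toy model (hypothetical `PrefixCache` class, not vLLM's implementation):

```python
import hashlib

class PrefixCache:
    """Toy automatic prefix cache: a KV block is keyed by the hash of
    ALL tokens up to and including that block, so requests sharing a
    prefix map to the same physical blocks."""
    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}           # prefix hash -> physical block id
        self.next_id = 0
        self.hits = 0

    def map_sequence(self, tokens):
        table = []
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            key = hashlib.sha256(str(tokens[:end]).encode()).hexdigest()
            if key in self.blocks:
                self.hits += 1     # reuse an existing block
            else:
                self.blocks[key] = self.next_id
                self.next_id += 1
            table.append(self.blocks[key])
        return table

cache = PrefixCache(block_size=4)
a = cache.map_sequence([1, 2, 3, 4, 5, 6, 7, 8])  # system prompt + query A
b = cache.map_sequence([1, 2, 3, 4, 9, 9, 9, 9])  # same prefix, query B
print(a, b, cache.hits)  # first block is shared: a[0] == b[0]
```

Hashing the full prefix (not just the block's own tokens) is what makes reuse safe: a KV block is only valid given everything that preceded it.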
### Multi-LoRA Support
- Serve base model + hundreds of LoRA adapters simultaneously
- SGMV kernels enable batched computation across different adapters
- Ideal for multi-tenant SaaS deployments
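The scheduling side of SGMV-style batching amounts to segmenting one batch by adapter, so each adapter's low-rank update is applied to its contiguous segment in a single kernel launch. A toy sketch of that grouping step (hypothetical names):

```python
from collections import defaultdict

def group_by_adapter(requests):
    """Toy multi-LoRA scheduling step: segment one batch by adapter id
    so an SGMV-style kernel can apply each adapter's low-rank update
    to its slice of the batch in one launch."""
    segments = defaultdict(list)
    for req_id, adapter_id in requests:
        segments[adapter_id].append(req_id)
    return dict(segments)

batch = [("r1", "lora-support"), ("r2", "lora-sales"),
         ("r3", "lora-support"), ("r4", None)]  # None = base model only
print(group_by_adapter(batch))
```

The base model's weights are shared across all segments; only the small A/B matrices differ per adapter, which is why hundreds of adapters fit alongside one base model.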
## 5. Memory Pressure Handling
**Preemption strategies:**
1. **Swap:** move KV blocks to CPU memory (slower to restore, but preserves the computed cache)
2. **Recompute:** drop the blocks and recalculate them on resume (often faster on modern GPUs)
Strategy selection based on GPU compute vs memory bandwidth ratio.
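That trade-off can be framed as a back-of-the-envelope cost model: restore time via PCIe transfer versus re-running prefill. A hypothetical heuristic (not vLLM's actual policy), with illustrative numbers:

```python
def choose_preemption(swap_bandwidth_gbps, recompute_tflops,
                      kv_bytes, prompt_flops):
    """Toy cost model: pick whichever path restores the preempted
    sequence faster."""
    swap_time = kv_bytes / (swap_bandwidth_gbps * 1e9)        # PCIe round trip
    recompute_time = prompt_flops / (recompute_tflops * 1e12)  # re-run prefill
    return "recompute" if recompute_time < swap_time else "swap"

# High-compute GPU: recompute wins for a 1 GB KV cache
print(choose_preemption(swap_bandwidth_gbps=25, recompute_tflops=500,
                        kv_bytes=1e9, prompt_flops=5e12))   # recompute
# Same workload on a 10x slower GPU: swapping wins
print(choose_preemption(swap_bandwidth_gbps=25, recompute_tflops=50,
                        kv_bytes=1e9, prompt_flops=5e12))   # swap
```

The bandwidth and FLOP figures here are illustrative only; the point is that as compute grows faster than interconnect bandwidth, the balance tilts toward recompute.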
## 6. Interview Q&A
Q: Why does PagedAttention improve throughput?
A: Eliminates memory fragmentation, allowing more concurrent requests to fit in GPU memory. With 60-80% waste reduced to <4%, effective batch size increases 2-3x.
Q: When is prefill the bottleneck vs decode?
A: Prefill dominates for short outputs with long prompts (summarization). Decode dominates for long generations (creative writing). vLLM uses chunked prefill to balance both.
Q: How does vLLM handle variable-length sequences in a batch?
A: Continuous batching removes completed sequences and adds new ones at iteration boundaries. Block tables allow each sequence to use non-contiguous memory independently.
Q: Why use recompute over swap for preemption?
A: On H100/A100 GPUs with high compute, recomputing KV cache is faster than PCIe transfer to CPU. Swap preferred for older GPUs or when CPU memory is abundant.
Q: How does APC differ from traditional caching?
A: Traditional caching stores entire request results. APC caches KV blocks at sub-request granularity, enabling partial reuse across different requests with shared prefixes.