vLLM

1. Core Innovation: PagedAttention

Problem: Traditional engines pre-allocate contiguous memory for max sequence length

  • 60-80% GPU memory wasted on over-reservation
  • Internal fragmentation from unused allocated space

Solution: Paged memory management for KV cache

  • KV cache split into fixed-size blocks (pages)
  • Non-contiguous physical memory mapped via block tables
  • Reduces waste to <4%, enables 2-3x larger batch sizes
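The block-table idea above can be sketched as a toy allocator. This is a minimal illustration, not vLLM's implementation: `BlockAllocator`, `Sequence`, and the block size of 16 are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (assumed value for illustration)


@dataclass
class BlockAllocator:
    """Toy pool of physical KV blocks shared by all sequences."""
    num_blocks: int
    free: list = field(default_factory=list)

    def __post_init__(self):
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("out of KV blocks")
        return self.free.pop()

    def release(self, block_ids):
        self.free.extend(block_ids)


@dataclass
class Sequence:
    """Tracks one request's logical-to-physical block mapping."""
    allocator: BlockAllocator
    num_tokens: int = 0
    block_table: list = field(default_factory=list)

    def append_token(self):
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def locate(self, token_idx: int):
        """Return (physical block id, offset inside the block)."""
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

    def wasted_slots(self) -> int:
        # Waste is bounded by one partially filled block per sequence.
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens


pool = BlockAllocator(num_blocks=8)
a, b = Sequence(pool), Sequence(pool)
for _ in range(20):
    a.append_token()
for _ in range(5):
    b.append_token()
for _ in range(20):
    a.append_token()

print(a.block_table, b.block_table)  # a's blocks are not contiguous
print(a.locate(35), a.wasted_slots())
```

Because the two sequences grow in interleaved fashion, sequence `a` ends up with physical blocks that are not adjacent, and the only unused space is the tail of each sequence's last block.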

Memory Formula:

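KV Memory ≈ 2 × L × T × H × D_h × B

where 2 counts the K and V tensors, L = layers, T = tokens, H = KV heads, D_h = head dimension, B = bytes per element (2 for FP16).

For a model with L=32, D = H × D_h = 4096 and full multi-head attention in FP16 (e.g. Llama-2 7B): ~0.5 MB per token

  • 2k tokens → ~1 GB
  • 8k tokens → ~4 GB

Llama-3 8B has the same L and D but uses grouped-query attention with 8 KV heads (D_h=128), so its KV cache is 4x smaller: ~0.125 MB per token

  • 2k tokens → ~0.25 GB
  • 8k tokens → ~1 GB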
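A quick way to check the formula is to plug numbers in. The sketch below assumes FP16 (2 bytes per element) and a head dimension of 128; the first configuration uses 32 KV heads (so H × D_h = 4096, matching the 0.5 MB figure), the second uses 8 KV heads as in grouped-query attention, which is what Llama-3 8B's published config specifies. The function name is made up for the example.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    # Factor 2 accounts for storing both K and V at every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


MIB = 1024 ** 2
GIB = 1024 ** 3

mha = kv_bytes_per_token(layers=32, kv_heads=32, head_dim=128, bytes_per_elem=2)
gqa = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2)

for name, per_token in [("32 KV heads", mha), ("8 KV heads", gqa)]:
    for tokens in (2048, 8192):
        total = per_token * tokens / GIB
        print(f"{name}: {per_token / MIB:.3f} MiB/token, {tokens} tokens -> {total:.2f} GiB")
```

With 32 KV heads this gives 0.5 MiB per token (1 GiB at 2k tokens, 4 GiB at 8k); with 8 KV heads every figure is 4x smaller.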


2. Continuous Batching

Static Batching: Waits for the entire batch to complete before accepting new requests

vLLM Approach: Iteration-level scheduling

  • New requests fill slots freed by completed sequences immediately
  • Eliminates GPU idle time ("bubbles")
  • Increases throughput by 20-30%
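The difference between the two schedulers can be seen in a toy step counter. This is a simplified sketch: it models decode only, one token per sequence per iteration, and the request lengths are made-up values.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """A waiting request takes over a slot as soon as one frees up."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each active sequence
    steps = 0
    while waiting or running:
        # Admit new requests at the iteration boundary.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        # One decode iteration produces one token per running sequence;
        # sequences that just finished drop out of the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


lengths = [2, 8, 3, 8, 2, 2, 3, 2]  # output lengths in tokens (illustrative)
print(static_batching_steps(lengths, slots=4))
print(continuous_batching_steps(lengths, slots=4))
```

In the static case the short sequences sit idle until the 8-token ones finish; in the continuous case their slots are refilled immediately, so the same work completes in fewer iterations.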



3. Prefill vs Decode Phases

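| Phase   | Processing           | Bottleneck       | vLLM Optimization |
|---------|----------------------|------------------|-------------------|
| Prefill | Parallel over tokens | Compute-bound    | Chunked prefill   |
| Decode  | Sequential per token | Memory-bandwidth | PagedAttention    |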

Chunked Prefill: Breaks large prompts into chunks to prevent blocking decode operations
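One way to picture chunked prefill is a per-iteration token budget shared between decodes and the pending prompt. The sketch below assumes a decode-first policy; `token_budget`, `schedule_step`, and `run_prefill` are hypothetical names for illustration, not vLLM's scheduler API.

```python
def schedule_step(decoding: int, prefill_remaining: int, token_budget: int):
    """Return (decode_tokens, prefill_tokens) for one iteration.

    Decodes are scheduled first (one token each); whatever budget is left
    goes to the next chunk of the pending prompt.
    """
    decode_tokens = min(decoding, token_budget)
    prefill_tokens = min(prefill_remaining, token_budget - decode_tokens)
    return decode_tokens, prefill_tokens


def run_prefill(prompt_len: int, decoding: int, token_budget: int):
    """Plan the iterations needed to prefill one prompt alongside decodes."""
    plan = []
    remaining = prompt_len
    while remaining > 0:
        d, p = schedule_step(decoding, remaining, token_budget)
        if p == 0:
            raise ValueError("token budget leaves no room for prefill")
        plan.append((d, p))
        remaining -= p
    return plan


# A 1000-token prompt arrives while 8 sequences are decoding.
plan = run_prefill(prompt_len=1000, decoding=8, token_budget=256)
for step, (d, p) in enumerate(plan):
    print(f"step {step}: {d} decode tokens + {p} prefill tokens")
```

The 8 running sequences keep producing a token every iteration while the long prompt is absorbed over several steps, instead of stalling for one large prefill.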



4. Modern Features (2025-2026)

Speculative Decoding

  • Small draft model generates k tokens
  • Large target model verifies in single forward pass
  • 2-3x latency reduction for heavy models
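The draft-then-verify loop can be sketched for the greedy case. This is a toy under stated assumptions: the "models" are plain functions from a token list to the next token, and verification is written as a loop even though a real engine scores all draft positions in one batched forward pass.

```python
def speculative_step(target_next, draft_next, context, k):
    """One round of greedy speculative decoding.

    Returns the tokens accepted this round (always at least one).
    """
    # 1. Draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    # 2. Target model checks each position in order.
    accepted = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # correct the first mismatch, then stop
            return accepted
        accepted.append(tok)

    # 3. All k matched: the target's pass also yields one bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


# Toy models: the target counts upward mod 10; the draft agrees
# except right after a 4, where it guesses wrong.
def target_next(ctx):
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(speculative_step(target_next, draft_next, [1], k=4))
print(speculative_step(target_next, draft_next, [5], k=3))
```

In the greedy setting the output is identical to decoding with the target model alone; the gain comes from accepting several tokens per target pass when the draft guesses well.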

Automatic Prefix Caching (APC)

  • Shared KV blocks for common prefixes (system prompts, RAG contexts)
  • Multiple requests reference same physical memory
  • Critical for multi-turn chat and RAG applications
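Prefix sharing can be sketched by keying each full block on a hash of its tokens chained with the previous block's key, so a block is reused only when everything before it matches too. `PrefixCache`, `block_keys`, and the block size of 4 are illustrative assumptions, not vLLM's internals.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block (small, for illustration)


def block_keys(token_ids):
    """Return one chained hash key per full block of the sequence."""
    keys, parent = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        chunk = ",".join(map(str, token_ids[i:i + BLOCK_SIZE])).encode()
        parent = hashlib.sha256(parent + b"|" + chunk).digest()
        keys.append(parent)
    return keys


class PrefixCache:
    def __init__(self):
        self.blocks = {}  # chained key -> physical block id
        self.next_id = 0

    def lookup_or_insert(self, token_ids):
        """Return (block ids, number of blocks served from cache)."""
        ids, hits = [], 0
        for key in block_keys(token_ids):
            if key in self.blocks:
                hits += 1
            else:
                self.blocks[key] = self.next_id
                self.next_id += 1
            ids.append(self.blocks[key])
        return ids, hits


cache = PrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
first = cache.lookup_or_insert(system_prompt + [20, 21, 22, 23])
second = cache.lookup_or_insert(system_prompt + [30, 31, 32, 33, 34])
print(first, second)
```

The second request reuses the two blocks covering the shared system prompt and allocates a new block only for its own suffix; the trailing partial block is not cached in this sketch.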

Multi-LoRA Support

  • Serve base model + hundreds of LoRA adapters simultaneously
  • SGMV kernels enable batched computation across different adapters
  • Ideal for multi-tenant SaaS deployments
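The arithmetic behind mixed-adapter batches is simple: every row shares the base projection, and each row adds its own low-rank update B(Ax). The sketch below shows only that math in plain Python with made-up 2x2 weights; an SGMV-style kernel would perform the per-adapter gather and matmuls in a fused GPU launch rather than a Python loop.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def multi_lora_forward(base_w, adapters, batch):
    """batch: list of (adapter name or None, input vector).

    The base weight is shared; the low-rank update B @ (A @ x) is
    looked up per row by adapter name.
    """
    outputs = []
    for name, x in batch:
        y = matvec(base_w, x)
        if name is not None:
            a, b = adapters[name]
            delta = matvec(b, matvec(a, x))
            y = [yi + di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs


base_w = [[1, 0], [0, 1]]  # identity, for easy checking
adapters = {
    # name: (A with shape r x d, B with shape d x r), rank r = 1
    "tenant_a": ([[1, 1]], [[1], [0]]),
    "tenant_b": ([[1, -1]], [[0], [2]]),
}
batch = [("tenant_a", [1, 2]), ("tenant_b", [1, 2]), (None, [1, 2])]
outputs = multi_lora_forward(base_w, adapters, batch)
print(outputs)
```

The same input produces a different output per tenant, while the base weights are stored and applied only once.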


5. Memory Pressure Handling

Preemption Strategies:

  1. Swap: Move KV blocks to CPU memory (slower, preserves compute)
  2. Recompute: Drop blocks and recalculate later (faster on modern GPUs)

Strategy selection depends on the ratio of GPU compute throughput to CPU-GPU (PCIe) transfer bandwidth.
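A back-of-envelope comparison makes the trade-off concrete. Everything numeric below is an assumption for illustration, not a measurement: 0.5 MiB of KV per token, a 7B-parameter model, roughly 2 FLOPs per parameter per token for the forward pass, 300 TFLOP/s of achieved compute, and two hypothetical effective PCIe bandwidths (many small block copies typically achieve less than the link's peak).

```python
def swap_seconds(kv_bytes: float, pcie_bytes_per_s: float) -> float:
    # Blocks are copied out on preemption and back in on resume.
    return 2 * kv_bytes / pcie_bytes_per_s


def recompute_seconds(tokens: int, params: float, flops_per_s: float) -> float:
    # Rule of thumb: about 2 FLOPs per parameter per token (forward only).
    return 2 * params * tokens / flops_per_s


def choose(tokens, kv_bytes_per_token, params, pcie_bytes_per_s, flops_per_s):
    """Pick the cheaper strategy under this simplified cost model."""
    swap = swap_seconds(tokens * kv_bytes_per_token, pcie_bytes_per_s)
    recompute = recompute_seconds(tokens, params, flops_per_s)
    return ("recompute" if recompute < swap else "swap"), swap, recompute


common = dict(tokens=4096, kv_bytes_per_token=524288, params=7e9, flops_per_s=300e12)
fast_link = choose(pcie_bytes_per_s=25e9, **common)  # assumed effective bandwidth
slow_link = choose(pcie_bytes_per_s=5e9, **common)   # assumed effective bandwidth

for label, (strategy, swap, recompute) in [("25 GB/s", fast_link), ("5 GB/s", slow_link)]:
    print(f"{label}: swap {swap:.3f}s, recompute {recompute:.3f}s -> {strategy}")
```

Under these assumed numbers the choice flips with the effective transfer bandwidth alone, which is the ratio the selection rule refers to.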



6. Interview Q&A

Q: Why does PagedAttention improve throughput?
A: Eliminates memory fragmentation, allowing more concurrent requests to fit in GPU memory. With 60-80% waste reduced to <4%, effective batch size increases 2-3x.


Q: When is prefill the bottleneck vs decode?
A: Prefill dominates for short outputs with long prompts (summarization). Decode dominates for long generations (creative writing). vLLM uses chunked prefill to balance both.


Q: How does vLLM handle variable-length sequences in a batch?
A: Continuous batching removes completed sequences and adds new ones at iteration boundaries. Block tables allow each sequence to use non-contiguous memory independently.


Q: Why use recompute over swap for preemption?
A: On high-compute GPUs such as H100/A100, recomputing the KV cache can be faster than a PCIe round trip to CPU memory. Swap is preferred on older GPUs or when CPU memory is abundant.


Q: How does APC differ from traditional caching?
A: Traditional caching stores entire request results. APC caches KV blocks at sub-request granularity, enabling partial reuse across different requests with shared prefixes.