DeepSpeed Inference
1. Overview¶
Microsoft's inference optimization library
- Part of the larger DeepSpeed training ecosystem
- Focus: Multi-GPU inference, kernel optimizations, quantization
- Integrated with DeepSpeed-MII (Model Implementations for Inference)
2. Core Innovations¶
DeepSpeed-MII¶
High-level serving framework built on DeepSpeed-Inference
- REST API server
- Dynamic batching
- Multi-GPU tensor parallelism
- Lower-level alternative to vLLM/TGI
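A minimal usage sketch, assuming the pipeline API shown in the MII README (checkpoint name and arguments are placeholders and vary by version):

import mii

# Non-persistent pipeline: loads the model in-process and generates directly.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")   # placeholder checkpoint
response = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(response)

# For the persistent REST/gRPC deployment, MII also exposes mii.serve(...),
# which returns a client whose generate(...) call mirrors the pipeline above.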
ZeRO-Inference¶
- Adapts ZeRO training optimizations for inference
- Offloading strategies for large models
- CPU/NVMe offloading when GPU memory insufficient
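A sketch of the offload configuration, assuming DeepSpeed's zero_optimization config schema (paths and batch size are placeholders; exact keys can differ by release):

import deepspeed

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                      # ZeRO stage 3: parameters are partitioned
        "offload_param": {               # spill parameters out of GPU memory
            "device": "nvme",            # "cpu" or "nvme"
            "nvme_path": "/local_nvme",  # hypothetical NVMe mount point
            "pin_memory": True,
        },
    },
    "train_micro_batch_size_per_gpu": 1,
}

# engine, *_ = deepspeed.initialize(model=model, config=ds_config)
# engine.module.eval()   # parameters are fetched into GPU memory on demand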
3. Kernel Optimizations¶
Custom CUDA Kernels¶
- Optimized Transformer layers
- Attention mechanisms (pre-FlashAttention era)
- Fused operations (LayerNorm+Residual, etc.)
Note: Some kernels now superseded by FlashAttention and newer libraries
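To make the fusion idea concrete, here is the unfused residual-add + LayerNorm pair in plain PyTorch; a fused kernel computes both in a single pass over the activations instead of two separate launches (illustration only, not DeepSpeed's kernel):

import torch
import torch.nn.functional as F

def residual_layernorm_unfused(x, residual, weight, bias, eps=1e-5):
    # Two separate ops -> two kernel launches and an extra round trip to HBM.
    y = x + residual
    return F.layer_norm(y, y.shape[-1:], weight, bias, eps)

# A fused kernel performs the add and the normalization together, which is
# the kind of op pair DeepSpeed's injected Transformer kernels fuse.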
Inference-Specialized Ops¶
- KV cache management (simpler than vLLM's paging)
- Optimized softmax for long sequences
- Custom GEMM operations
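A toy contiguous per-sequence KV cache to illustrate the bookkeeping (this is the simpler pre-allocated layout contrasted with vLLM's paged blocks, not DeepSpeed's actual code):

import torch

class SimpleKVCache:
    """Contiguous per-sequence KV cache (toy illustration)."""

    def __init__(self, max_len, n_heads, head_dim, dtype=torch.float16, device="cpu"):
        shape = (max_len, n_heads, head_dim)
        self.k = torch.empty(shape, dtype=dtype, device=device)
        self.v = torch.empty(shape, dtype=dtype, device=device)
        self.len = 0

    def append(self, k_new, v_new):
        # k_new / v_new: (n_heads, head_dim) for the newly generated token
        self.k[self.len] = k_new
        self.v[self.len] = v_new
        self.len += 1

    def view(self):
        # Attention reads all cached positions for this sequence
        return self.k[: self.len], self.v[: self.len]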
4. Quantization Support¶
INT8 Quantization¶
- Symmetric/asymmetric quantization
- Per-channel or per-tensor
- ZeroQuant for activation quantization
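A minimal sketch of symmetric INT8 weight quantization with per-tensor vs. per-channel scales (illustration of the idea, not ZeroQuant itself):

import torch

def quantize_int8_symmetric(w: torch.Tensor, per_channel: bool = True):
    if per_channel:
        # One scale per output channel (row of the weight matrix)
        max_abs = w.abs().amax(dim=1, keepdim=True)
    else:
        # A single scale for the whole tensor
        max_abs = w.abs().amax()
    scale = max_abs.clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

w = torch.randn(4096, 4096)
q, scale = quantize_int8_symmetric(w, per_channel=True)
w_hat = q.float() * scale        # dequantize for use in higher-precision matmuls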
Mixed Precision¶
- FP16/BF16 computation
- INT8 weights with FP16 activations
- Automatic mixed precision selection
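A rough sketch of the INT8-weight / FP16-activation pattern; production kernels fuse the dequantization into the GEMM rather than materializing an FP16 weight copy as done here:

import torch

def linear_w8a16(x_fp16, q_weight_int8, scale, bias=None):
    # q_weight_int8: (out_features, in_features) INT8 weights
    # scale: per-channel (out_features, 1) or scalar dequantization scale tensor
    w_fp16 = q_weight_int8.to(torch.float16) * scale.to(torch.float16)
    y = x_fp16 @ w_fp16.t()
    return y if bias is None else y + bias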
5. Model Parallelism¶
Tensor Parallelism¶
- Column/row parallelism for linear layers
- Optimized communication patterns
- Can be combined with pipeline parallelism
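A single-process toy of Megatron-style column-then-row parallelism for a two-layer MLP; on real hardware each shard lives on its own GPU and torch.distributed collectives (all-gather / all-reduce) replace the concatenation and sum below:

import torch

x = torch.randn(8, 1024)            # activations, replicated on every rank
w1 = torch.randn(4096, 1024)        # first linear, split by output columns
w2 = torch.randn(1024, 4096)        # second linear, split by input rows

w1_shards = w1.chunk(2, dim=0)      # column parallelism: each rank owns half the outputs
w2_shards = w2.chunk(2, dim=1)      # row parallelism: each rank consumes its half of the inputs

h_shards = [torch.relu(x @ w.t()) for w in w1_shards]        # no communication needed here
partials = [h @ w.t() for h, w in zip(h_shards, w2_shards)]  # each rank produces a partial sum
y = partials[0] + partials[1]       # all-reduce in a real multi-GPU setup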
Pipeline Parallelism¶
- Micro-batching for throughput
- 1F1B (one-forward-one-backward) scheduling adapted for inference
- Good for extremely large models (>100B parameters)
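A toy schedule showing why micro-batching keeps pipeline stages busy (prints which stage handles which micro-batch at each tick; not DeepSpeed's scheduler):

# 3-stage inference pipeline, 4 micro-batches: stages overlap instead of
# sitting idle while one large batch moves through the whole pipeline.
stages, micro_batches = 3, 4
for tick in range(stages + micro_batches - 1):
    active = [(s, tick - s) for s in range(stages) if 0 <= tick - s < micro_batches]
    print(f"t={tick}: " + ", ".join(f"stage{s}<-mb{m}" for s, m in active))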
6. DeepSpeed-FastGen (2024+)¶
Latest addition: Dynamic SplitFuse scheduling
- Combines prefill and decode in single batch
- Similar to vLLM's chunked prefill concept
- Claimed improvements over naive continuous batching
SplitFuse Algorithm¶
- Split long prefills into chunks
- Fuse with decode operations
- Balance compute resources dynamically
Benefit: Reduces tail latency for long prompts
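A toy sketch of SplitFuse-style batch composition under a per-iteration token budget (the budget value and data structures are made up for illustration):

def compose_batch(decode_reqs, prefill_queue, token_budget=512):
    # Decode requests contribute one token each; remaining budget is filled
    # with chunks of pending prefills so long prompts never monopolize a step.
    batch = [("decode", r, 1) for r in decode_reqs]
    budget = token_budget - len(decode_reqs)
    for req, remaining in prefill_queue:
        if budget <= 0:
            break
        chunk = min(remaining, budget)      # split long prompts into chunks
        batch.append(("prefill", req, chunk))
        budget -= chunk
    return batch

print(compose_batch(decode_reqs=["a", "b", "c"],
                    prefill_queue=[("d", 2000), ("e", 300)]))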
7. Inference Engine Initialization¶
Simplified API:
import torch
import deepspeed

# model: an already-loaded PyTorch/Hugging Face model
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 4},     # shard weights across 4 GPUs
    dtype=torch.float16,
    replace_with_kernel_inject=True,    # inject DeepSpeed's fused kernels
)
replace_with_kernel_inject: Swaps model ops with DeepSpeed optimized kernels
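End-to-end usage sketch with a Hugging Face causal LM (checkpoint name is a placeholder; exact integration details vary by DeepSpeed/transformers version):

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/opt-1.3b"                        # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

engine = deepspeed.init_inference(model,
                                  tensor_parallel={"tp_size": 1},
                                  dtype=torch.float16,
                                  replace_with_kernel_inject=True)

# engine.module is the original model with injected kernels; generate as usual.
inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(engine.module.device)
outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))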
8. Performance Characteristics¶
Strengths:
- Good for research/prototyping
- Integrated training-to-inference workflow
- Strong multi-GPU support
Limitations:
- Less production-hardened than TGI/vLLM
- Smaller community/ecosystem
- Kernel optimizations lag behind latest research
9. Interview Q&A¶
Q: When to use DeepSpeed-Inference vs vLLM?
A: DeepSpeed-Inference for research environments with existing DeepSpeed training pipelines. vLLM for production serving with better memory efficiency and throughput.
Q: What is ZeRO-Inference's offloading strategy?
A: Hierarchical offloading: GPU → CPU RAM → NVMe SSD. Brings parameters into GPU on-demand. Enables inference of models larger than GPU memory but with latency penalty.
Q: How does DeepSpeed-FastGen compare to vLLM's continuous batching?
A: Both use iteration-level scheduling. FastGen adds SplitFuse for better prefill/decode balance. vLLM has more mature PagedAttention for memory efficiency. Performance similar in practice.
Q: Why isn't DeepSpeed-Inference as popular as vLLM for serving?
A: Later entry to production serving space, less focus on ease-of-use, smaller ecosystem. Primarily adopted by users already in DeepSpeed training ecosystem.
Q: What's the role of kernel injection?
A: Automatically replaces PyTorch operations with optimized DeepSpeed kernels at runtime. Transparent acceleration without model code changes. Trade-off: may have compatibility issues with custom model architectures.