DeepSpeed Inference
1. Overview¶
Microsoft's inference optimization library
- Part of the larger DeepSpeed training ecosystem
- Focus: Multi-GPU inference, kernel optimizations, quantization
- Integrated with DeepSpeed-MII (Model Implementations for Inference)
2. Core Innovations¶
DeepSpeed-MII¶
High-level serving framework built on DeepSpeed-Inference
- REST API server
- Dynamic batching
- Multi-GPU tensor parallelism
- Lower-level alternative to vLLM/TGI
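A minimal usage sketch, assuming the pipeline API shown in the MII README (checkpoint name and arguments are placeholders and vary by version):

import mii

# Non-persistent pipeline: loads the model in-process and generates directly.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")   # placeholder checkpoint
response = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
print(response)

# For the persistent REST/gRPC deployment, MII also exposes mii.serve(...),
# which returns a client whose generate(...) call mirrors the pipeline above.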
ZeRO-Inference¶
- Adapts ZeRO training optimizations for inference
- Offloading strategies for large models
- CPU/NVMe offloading when GPU memory insufficient
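A sketch of the offload configuration, assuming DeepSpeed's zero_optimization config schema (paths and batch size are placeholders; exact keys can differ by release):

import deepspeed

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                      # ZeRO stage 3: parameters are partitioned
        "offload_param": {               # spill parameters out of GPU memory
            "device": "nvme",            # "cpu" or "nvme"
            "nvme_path": "/local_nvme",  # hypothetical NVMe mount point
            "pin_memory": True,
        },
    },
    "train_micro_batch_size_per_gpu": 1,
}

# engine, *_ = deepspeed.initialize(model=model, config=ds_config)
# engine.module.eval()   # parameters are fetched into GPU memory on demand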
3. Kernel Optimizations¶
Custom CUDA Kernels¶
- Optimized Transformer layers
- Attention mechanisms (pre-FlashAttention era)
- Fused operations (LayerNorm+Residual, etc.)
Note: Some kernels now superseded by FlashAttention and newer libraries
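To make the fusion idea concrete, here is the unfused residual-add + LayerNorm pair in plain PyTorch; a fused kernel computes both in a single pass over the activations instead of two separate launches (illustration only, not DeepSpeed's kernel):

import torch
import torch.nn.functional as F

def residual_layernorm_unfused(x, residual, weight, bias, eps=1e-5):
    # Two separate ops -> two kernel launches and an extra round trip to HBM.
    y = x + residual
    return F.layer_norm(y, y.shape[-1:], weight, bias, eps)

# A fused kernel performs the add and the normalization together, which is
# the kind of op pair DeepSpeed's injected Transformer kernels fuse.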
Inference-Specialized Ops¶
- KV cache management (simpler than vLLM's paging)
- Optimized softmax for long sequences
- Custom GEMM operations
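A toy contiguous per-sequence KV cache to illustrate the bookkeeping (this is the simpler pre-allocated layout contrasted with vLLM's paged blocks, not DeepSpeed's actual code):

import torch

class SimpleKVCache:
    """Contiguous per-sequence KV cache (toy illustration)."""

    def __init__(self, max_len, n_heads, head_dim, dtype=torch.float16, device="cpu"):
        shape = (max_len, n_heads, head_dim)
        self.k = torch.empty(shape, dtype=dtype, device=device)
        self.v = torch.empty(shape, dtype=dtype, device=device)
        self.len = 0

    def append(self, k_new, v_new):
        # k_new / v_new: (n_heads, head_dim) for the newly generated token
        self.k[self.len] = k_new
        self.v[self.len] = v_new
        self.len += 1

    def view(self):
        # Attention reads all cached positions for this sequence
        return self.k[: self.len], self.v[: self.len]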
4. Quantization Support¶
INT8 Quantization¶
- Symmetric/asymmetric quantization
- Per-channel or per-tensor
- ZeroQuant for activation quantization
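A minimal sketch of symmetric INT8 weight quantization with per-tensor vs. per-channel scales (illustration of the idea, not ZeroQuant itself):

import torch

def quantize_int8_symmetric(w: torch.Tensor, per_channel: bool = True):
    if per_channel:
        # One scale per output channel (row of the weight matrix)
        max_abs = w.abs().amax(dim=1, keepdim=True)
    else:
        # A single scale for the whole tensor
        max_abs = w.abs().amax()
    scale = max_abs.clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

w = torch.randn(4096, 4096)
q, scale = quantize_int8_symmetric(w, per_channel=True)
w_hat = q.float() * scale        # dequantize for use in higher-precision matmuls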
Mixed Precision¶
- FP16/BF16 computation
- INT8 weights with FP16 activations
- Automatic mixed precision selection
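A rough sketch of the INT8-weight / FP16-activation pattern; production kernels fuse the dequantization into the GEMM rather than materializing an FP16 weight copy as done here:

import torch

def linear_w8a16(x_fp16, q_weight_int8, scale, bias=None):
    # q_weight_int8: (out_features, in_features) INT8 weights
    # scale: per-channel (out_features, 1) or scalar dequantization scale tensor
    w_fp16 = q_weight_int8.to(torch.float16) * scale.to(torch.float16)
    y = x_fp16 @ w_fp16.t()
    return y if bias is None else y + bias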
5. Model Parallelism¶
Tensor Parallelism¶
- Column/row parallelism for linear layers
- Optimized communication patterns
- Can be combined with pipeline parallelism
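A single-process toy of Megatron-style column-then-row parallelism for a two-layer MLP; on real hardware each shard lives on its own GPU and torch.distributed collectives (all-gather / all-reduce) replace the concatenation and sum below:

import torch

x = torch.randn(8, 1024)            # activations, replicated on every rank
w1 = torch.randn(4096, 1024)        # first linear, split by output columns
w2 = torch.randn(1024, 4096)        # second linear, split by input rows

w1_shards = w1.chunk(2, dim=0)      # column parallelism: each rank owns half the outputs
w2_shards = w2.chunk(2, dim=1)      # row parallelism: each rank consumes its half of the inputs

h_shards = [torch.relu(x @ w.t()) for w in w1_shards]        # no communication needed here
partials = [h @ w.t() for h, w in zip(h_shards, w2_shards)]  # each rank produces a partial sum
y = partials[0] + partials[1]       # all-reduce in a real multi-GPU setup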
Pipeline Parallelism¶
- Micro-batching for throughput
- 1F1B (one-forward-one-backward) scheduling adapted for inference
- Good for extremely large models (>100B parameters)
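A toy schedule showing why micro-batching keeps pipeline stages busy (prints which stage handles which micro-batch at each tick; not DeepSpeed's scheduler):

# 3-stage inference pipeline, 4 micro-batches: stages overlap instead of
# sitting idle while one large batch moves through the whole pipeline.
stages, micro_batches = 3, 4
for tick in range(stages + micro_batches - 1):
    active = [(s, tick - s) for s in range(stages) if 0 <= tick - s < micro_batches]
    print(f"t={tick}: " + ", ".join(f"stage{s}<-mb{m}" for s, m in active))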
6. DeepSpeed-FastGen (2024+)¶
Latest addition: Dynamic SplitFuse scheduling
- Combines prefill and decode in single batch
- Similar to vLLM's chunked prefill concept
- Claimed improvements over naive continuous batching
SplitFuse Algorithm¶
- Split long prefills into chunks
- Fuse with decode operations
- Balance compute resources dynamically
Benefit: Reduces tail latency for long prompts
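A toy sketch of SplitFuse-style batch composition under a per-iteration token budget (the budget value and data structures are made up for illustration):

def compose_batch(decode_reqs, prefill_queue, token_budget=512):
    # Decode requests contribute one token each; remaining budget is filled
    # with chunks of pending prefills so long prompts never monopolize a step.
    batch = [("decode", r, 1) for r in decode_reqs]
    budget = token_budget - len(decode_reqs)
    for req, remaining in prefill_queue:
        if budget <= 0:
            break
        chunk = min(remaining, budget)      # split long prompts into chunks
        batch.append(("prefill", req, chunk))
        budget -= chunk
    return batch

print(compose_batch(decode_reqs=["a", "b", "c"],
                    prefill_queue=[("d", 2000), ("e", 300)]))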
7. Inference Engine Initialization¶
Simplified API:
import torch
import deepspeed

# model: an already-loaded PyTorch/Hugging Face model
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 4},     # shard weights across 4 GPUs
    dtype=torch.float16,
    replace_with_kernel_inject=True,    # inject DeepSpeed's fused kernels
)
replace_with_kernel_inject: Swaps model ops with DeepSpeed optimized kernels
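End-to-end usage sketch with a Hugging Face causal LM (checkpoint name is a placeholder; exact integration details vary by DeepSpeed/transformers version):

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/opt-1.3b"                        # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16)

engine = deepspeed.init_inference(model,
                                  tensor_parallel={"tp_size": 1},
                                  dtype=torch.float16,
                                  replace_with_kernel_inject=True)

# engine.module is the original model with injected kernels; generate as usual.
inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(engine.module.device)
outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))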
8. Performance Characteristics¶
Strengths:
- Good for research/prototyping
- Integrated training-to-inference workflow
- Strong multi-GPU support
Limitations:
- Less production-hardened than TGI/vLLM
- Smaller community/ecosystem
- Kernel optimizations lag behind latest research
9. Interview Q&A¶
Q: When to use DeepSpeed-Inference vs vLLM?
A: DeepSpeed-Inference for research environments with existing DeepSpeed training pipelines. vLLM for production serving with better memory efficiency and throughput.
Q: What is ZeRO-Inference's offloading strategy?
A: Hierarchical offloading: GPU → CPU RAM → NVMe SSD. Brings parameters into GPU on-demand. Enables inference of models larger than GPU memory but with latency penalty.
Q: How does DeepSpeed-FastGen compare to vLLM's continuous batching?
A: Both use iteration-level scheduling. FastGen adds SplitFuse for better prefill/decode balance. vLLM has more mature PagedAttention for memory efficiency. Performance similar in practice.
Q: Why isn't DeepSpeed-Inference as popular as vLLM for serving?
A: Later entry to production serving space, less focus on ease-of-use, smaller ecosystem. Primarily adopted by users already in DeepSpeed training ecosystem.
Q: What's the role of kernel injection?
A: Automatically replaces PyTorch operations with optimized DeepSpeed kernels at runtime. Transparent acceleration without model code changes. Trade-off: may have compatibility issues with custom model architectures.