DeepSpeed Inference

1. Overview

Microsoft's inference optimization library
  • Part of the larger DeepSpeed training ecosystem
  • Focus: multi-GPU inference, kernel optimizations, quantization
  • Integrated with DeepSpeed-MII (Model Implementations for Inference)



2. Core Innovations

DeepSpeed-MII

High-level serving framework built on DeepSpeed-Inference (usage sketch below)
  • REST API server
  • Dynamic batching
  • Multi-GPU tensor parallelism
  • Alternative to vLLM/TGI that exposes DeepSpeed's lower-level inference engine through a simple serving API
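
A minimal usage sketch of the MII pipeline API (assuming a recent DeepSpeed-MII release; the model name and generation arguments are placeholders):

import mii

# Non-persistent pipeline: loads the model in-process with DeepSpeed inference kernels.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
responses = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=64)
print(responses)

MII also supports a persistent deployment mode that serves the same engine behind a REST endpoint.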

ZeRO-Inference

  • Adapts ZeRO training optimizations for inference
  • Offloading strategies for large models
  • CPU/NVMe offloading when GPU memory insufficient
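
A minimal config sketch for ZeRO-Inference-style offloading (keys follow the standard DeepSpeed JSON config; the device, path, and batch-size values are placeholders):

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                          # partition/offload parameters
        "offload_param": {
            "device": "nvme",                # or "cpu" to offload to host RAM
            "nvme_path": "/local_nvme"       # placeholder path
        }
    },
    "train_micro_batch_size_per_gpu": 1      # required by the engine even for inference-only use
}

The config is passed when constructing the DeepSpeed engine; parameters are then fetched into GPU memory on demand during the forward pass, trading latency for capacity.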


3. Kernel Optimizations

Custom CUDA Kernels

  • Optimized Transformer layers
  • Attention mechanisms (pre-FlashAttention era)
  • Fused operations (LayerNorm+Residual, etc.)

Note: Some kernels now superseded by FlashAttention and newer libraries
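
For intuition, a fused LayerNorm+Residual kernel computes the same result as this unfused PyTorch reference, but in a single pass over memory:

import torch

def residual_layernorm_reference(x, residual, weight, bias, eps=1e-5):
    # Unfused reference: the add and the normalization each read/write the full tensor.
    # A fused CUDA kernel performs both in one memory pass.
    h = x + residual
    return torch.nn.functional.layer_norm(h, h.shape[-1:], weight, bias, eps)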

Inference-Specialized Ops

  • KV cache management (simpler than vLLM's paging)
  • Optimized softmax for long sequences
  • Custom GEMM operations
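
A rough sketch of a contiguous, preallocated KV cache of the kind implied above (shapes and names are illustrative; contrast with vLLM's paged blocks):

import torch

batch, heads, head_dim, max_len = 4, 32, 128, 2048
k_cache = torch.empty(batch, heads, max_len, head_dim, dtype=torch.float16, device="cuda")
v_cache = torch.empty_like(k_cache)

def append_kv(k_cache, v_cache, k_new, v_new, pos):
    # Write the new token's K/V at position `pos` for every sequence in the batch.
    k_cache[:, :, pos] = k_new
    v_cache[:, :, pos] = v_new

Because each sequence owns a full max_len slot up front, short sequences waste memory, which is the gap paged KV caches were designed to close.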


4. Quantization Support

INT8 Quantization

  • Symmetric/asymmetric quantization
  • Per-channel or per-tensor
  • ZeroQuant for activation quantization
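
A sketch of symmetric per-channel/per-tensor INT8 weight quantization (reference math only, not DeepSpeed's kernels):

import torch

def quantize_int8_symmetric(w, per_channel=True):
    # Symmetric: zero-point is 0, scale maps max|w| to 127.
    if per_channel:
        scale = w.abs().amax(dim=1, keepdim=True) / 127.0   # one scale per output channel
    else:
        scale = w.abs().amax() / 127.0                       # one scale per tensor
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale   # dequantize with q.float() * scale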

Mixed Precision

  • FP16/BF16 computation
  • INT8 weights with FP16 activations
  • Automatic mixed precision selection
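
The "INT8 weights with FP16 activations" pattern, written as unfused reference math (real kernels fold the dequantization into the GEMM):

import torch

def w8a16_linear(x_fp16, q_weight_int8, scale, bias=None):
    # Dequantize INT8 weights to FP16 on the fly; activations and accumulation stay FP16.
    w_fp16 = q_weight_int8.to(torch.float16) * scale.to(torch.float16)
    out = x_fp16 @ w_fp16.t()
    return out + bias if bias is not None else out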


5. Model Parallelism

Tensor Parallelism

  • Column/row parallelism for linear layers
  • Optimized communication patterns
  • Can be combined with pipeline parallelism
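
A single-process sketch of column/row parallel linear layers (in practice each shard lives on its own GPU and the summation is an all-reduce over NCCL):

import torch

def column_parallel(x, w_shards):
    # Each rank holds a column slice of W; local outputs are concatenated along the feature dim.
    return torch.cat([x @ w.t() for w in w_shards], dim=-1)

def row_parallel(x_shards, w_shards):
    # Each rank holds a row slice of W and the matching slice of x;
    # partial outputs are summed (the all-reduce step).
    return sum(x @ w.t() for x, w in zip(x_shards, w_shards))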

Pipeline Parallelism

  • Micro-batching for throughput
  • 1F1B (one-forward-one-backward) scheduling adapted for inference
  • Good for extremely large models (>100B parameters)
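
A conceptual sketch of micro-batched pipeline inference (stage functions are placeholders; a real scheduler overlaps micro-batches across devices instead of running them sequentially):

def pipeline_forward(batch, stages, num_micro_batches=4):
    micro_batches = batch.chunk(num_micro_batches)   # split the batch to keep all stages busy
    outputs = []
    for mb in micro_batches:
        h = mb
        for stage in stages:                         # each stage would live on its own GPU
            h = stage(h)
        outputs.append(h)
    return outputs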


6. DeepSpeed-FastGen (2024+)

Latest addition: Dynamic SplitFuse scheduling
  • Combines prefill and decode in a single batch
  • Similar to vLLM's chunked prefill concept
  • Claimed improvements over naive continuous batching

SplitFuse Algorithm

  1. Split long prefills into chunks
  2. Fuse with decode operations
  3. Balance compute resources dynamically

Benefit: Reduces tail latency for long prompts
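
A conceptual sketch of SplitFuse-style batch composition (not DeepSpeed's actual scheduler; remaining_prompt_tokens is a hypothetical per-sequence field):

def compose_batch(pending_prefills, decoding_seqs, token_budget=2048, chunk=512):
    batch, used = [], 0
    for seq in decoding_seqs:                      # decode steps cost one token each
        if used + 1 > token_budget:
            break
        batch.append(("decode", seq, 1))
        used += 1
    for seq in pending_prefills:                   # fill the remaining budget with prefill chunks
        take = min(chunk, seq.remaining_prompt_tokens, token_budget - used)
        if take <= 0:
            break
        batch.append(("prefill", seq, take))
        used += take
    return batch

Keeping every forward pass near the same token budget is what smooths out the latency spikes that long prefills cause under naive continuous batching.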



7. Inference Engine Initialization

Simplified API:

import torch
import deepspeed

# model: an already-instantiated PyTorch/Hugging Face model
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 4},      # shard the model across 4 GPUs
    dtype=torch.float16,
    replace_with_kernel_inject=True
)

replace_with_kernel_inject: Swaps model ops with DeepSpeed optimized kernels
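
Usage after initialization, as a hedged sketch (assumes model is a Hugging Face causal LM and tokenizer is its tokenizer, neither defined above); the engine wraps the original model, so the usual generate() call still works:

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to("cuda")
outputs = engine.module.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))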



8. Performance Characteristics

Strengths:
  • Good for research/prototyping
  • Integrated training-to-inference workflow
  • Strong multi-GPU support

Limitations:
  • Less production-hardened than TGI/vLLM
  • Smaller community/ecosystem
  • Kernel optimizations lag behind the latest research



9. Interview Q&A

Q: When to use DeepSpeed-Inference vs vLLM?
A: DeepSpeed-Inference for research environments with existing DeepSpeed training pipelines. vLLM for production serving with better memory efficiency and throughput.


Q: What is ZeRO-Inference's offloading strategy?
A: Hierarchical offloading: GPU → CPU RAM → NVMe SSD. Brings parameters into GPU on-demand. Enables inference of models larger than GPU memory but with latency penalty.


Q: How does DeepSpeed-FastGen compare to vLLM's continuous batching?
A: Both use iteration-level scheduling. FastGen adds SplitFuse for better prefill/decode balance. vLLM has more mature PagedAttention for memory efficiency. Performance is similar in practice.


Q: Why isn't DeepSpeed-Inference as popular as vLLM for serving?
A: Later entry to production serving space, less focus on ease-of-use, smaller ecosystem. Primarily adopted by users already in DeepSpeed training ecosystem.


Q: What's the role of kernel injection?
A: Automatically replaces PyTorch operations with optimized DeepSpeed kernels at runtime. Transparent acceleration without model code changes. Trade-off: may have compatibility issues with custom model architectures.