Triton Inference Server
1. Overview¶
NVIDIA's general-purpose inference server
- Framework-agnostic (PyTorch, TensorFlow, ONNX, TensorRT)
- Not LLM-specific, but increasingly optimized for them
- Focus: Enterprise deployment, multi-model serving, complex pipelines
2. Core Architecture¶
Backend System¶
Pluggable backends for different frameworks:
- Python backend (custom inference logic)
- PyTorch backend (TorchScript models)
- TensorRT backend (TensorRT engines)
- vLLM backend (integration added 2024)
Benefit: Mix different model types in same server
Model Repository¶
- Centralized model storage (local/S3/GCS/Azure)
- Version management
- Hot-reloading of model versions
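A typical repository layout looks like the sketch below (model names are placeholders; the file inside each numbered version directory depends on the backend, e.g. `model.json` for the vLLM backend or `model.onnx` for ONNX):

```
model_repository/
├── llama3_8b/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.json
│   └── 2/
│       └── model.json
└── embedder/
    ├── config.pbtxt
    └── 1/
        └── model.onnx
```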
3. LLM-Specific Features (2024-2025)¶
vLLM Backend Integration¶
- Uses vLLM engine under the hood
- Triton API layer on top
- Get vLLM's PagedAttention + Triton's enterprise features
TensorRT-LLM Backend¶
- Native integration with TensorRT-LLM engines
- Maximum performance for NVIDIA GPUs
- Requires pre-built TensorRT-LLM engines
4. Advanced Serving Capabilities¶
Model Ensembles¶
Multi-stage pipelines exposed as a single endpoint:
- Preprocessing → Embedding → LLM → Postprocessing
- Automatic scheduling between stages
- Example: RAG pipeline with retrieval + generation
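A minimal ensemble `config.pbtxt` sketch, assuming two hypothetical member models (`retriever`, `generator`) and made-up tensor names:

```
name: "rag_pipeline"
platform: "ensemble"
input [
  { name: "QUERY" data_type: TYPE_STRING dims: [ 1 ] }
]
output [
  { name: "RESPONSE" data_type: TYPE_STRING dims: [ 1 ] }
]
ensemble_scheduling {
  step [
    {
      model_name: "retriever"
      model_version: -1
      # map the ensemble input QUERY to the retriever's input tensor
      input_map { key: "QUERY_IN" value: "QUERY" }
      output_map { key: "CONTEXT_OUT" value: "retrieved_context" }
    },
    {
      model_name: "generator"
      model_version: -1
      # the intermediate tensor never leaves the server
      input_map { key: "PROMPT" value: "retrieved_context" }
      output_map { key: "TEXT" value: "RESPONSE" }
    }
  ]
}
```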
Dynamic Batching¶
- Accumulates requests up to max batch size
- Timeout-based flushing
- Simpler than the iteration-level continuous batching in vLLM/TGI
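Configured per model in `config.pbtxt`; the batch sizes and delay below are illustrative only:

```
dynamic_batching {
  # try to form batches of these sizes first
  preferred_batch_size: [ 8, 16, 32 ]
  # flush a partial batch after this long in the queue
  max_queue_delay_microseconds: 500
}
```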
Sequence Batching¶
- For stateful models (e.g., streaming LLMs)
- Maintains state across multiple requests
- Useful for chat applications
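A rough `sequence_batching` sketch; the control-tensor convention (a START flag fed to the model) follows Triton's model-config schema, but the values are illustrative:

```
sequence_batching {
  # drop a sequence's state if it stays idle this long
  max_sequence_idle_microseconds: 5000000
  # oldest-first scheduling across active sequences
  oldest { max_candidate_sequences: 4 }
  control_input [
    {
      name: "START"
      control [
        {
          kind: CONTROL_SEQUENCE_START
          fp32_false_true: [ 0, 1 ]
        }
      ]
    }
  ]
}
```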
5. Model Configuration¶
Model config.pbtxt example:
```
backend: "vllm"
max_batch_size: 32
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
parameters: {
  key: "max_tokens"
  value: { string_value: "2048" }
}
```
6. Scaling & Deployment¶
Kubernetes Native¶
- Official Helm charts
- Horizontal Pod Autoscaler support
- Integration with Istio/Envoy for traffic management
Multi-Instance Serving¶
- Multiple model instances per GPU
- Rate limiting and priority queues
- Request routing based on model version
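As a sketch of how rate limiting and priorities map to `config.pbtxt` (field names taken from the model-config schema; the resource name and numbers are made up):

```
dynamic_batching {
  # two request priority levels; unmarked requests get level 2
  priority_levels: 2
  default_priority_level: 2
}
instance_group [
  {
    count: 1
    kind: KIND_GPU
    # rate-limit this instance via a named, counted resource
    rate_limiter {
      resources [ { name: "inference_slots" count: 1 } ]
      priority: 1
    }
  }
]
```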
7. Metrics & Observability¶
Comprehensive Monitoring:
- Prometheus metrics (latency, throughput, queue depth)
- Per-model and per-version metrics
- GPU utilization tracking
- Inference count, batch statistics
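Metrics are scraped from the `/metrics` endpoint (port 8002 by default); a few representative families, with made-up label values and counts:

```
nv_inference_request_success{model="llama3_8b",version="1"}    1042
nv_inference_count{model="llama3_8b",version="1"}              4096
nv_inference_exec_count{model="llama3_8b",version="1"}          512
nv_inference_queue_duration_us{model="llama3_8b",version="1"} 83021
nv_gpu_utilization{gpu_uuid="GPU-xxxx"}                         0.87
```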
Tracing:
- OpenTelemetry support
- Request-level tracing through pipeline stages
8. Performance Optimization¶
Concurrent Model Execution¶
- Multiple models on same GPU (if memory allows)
- Scheduler balances execution
- Useful for A/B testing
Instance Groups¶
- Multiple instances of same model
- Load balancing across instances
- Can specify different GPUs per instance
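Sketch of an `instance_group` that pins instances to specific GPUs (counts and device indices are arbitrary):

```
instance_group [
  {
    # two copies of the model on GPU 0
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    # one copy on GPU 1
    count: 1
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]
```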
9. Interview Q&A¶
Q: When to use Triton over vLLM/TGI?
A: When you need multi-framework support, complex model pipelines, or enterprise features (model versioning, ensembles). For pure LLM serving, vLLM/TGI are simpler.
Q: How does Triton's vLLM backend differ from standalone vLLM?
A: Same core engine, but Triton adds: model versioning, ensemble pipelines, enterprise monitoring, multi-framework support. Trade-off: extra abstraction layer with slight overhead.
Q: What's the benefit of model ensembles?
A: Single API call for multi-stage pipelines. Triton handles scheduling, batching, and data passing between stages. Reduces latency vs multiple network hops.
Q: How does dynamic batching work in Triton?
A: Requests are queued until the batch reaches max_batch_size or the configured queue delay (max_queue_delay_microseconds) expires. Simpler than continuous batching (no iteration-level scheduling); better suited to CV/audio models than to LLMs.
Q: Why use Triton for LLMs when vLLM exists?
A: Multi-model serving (embeddings + LLM + reranker), existing NVIDIA infrastructure, need for A/B testing across model versions, enterprise governance requirements.
Q: How does Triton handle model updates?
A: Model repository polling (or an explicit load request in explicit model-control mode) detects new versions, which can be loaded without stopping the server. Serving multiple versions side by side supports gradual rollout (e.g., 90% v1, 10% v2), with the traffic split handled at the client or gateway layer.
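Which versions stay loaded is governed by `version_policy` in `config.pbtxt`; two hedged examples (version numbers are illustrative):

```
# keep only the two most recent versions loaded
version_policy: { latest { num_versions: 2 } }

# ...or pin explicit versions during a rollout:
# version_policy: { specific { versions: [ 1, 2 ] } }
```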