Triton Inference Server

1. Overview

NVIDIA's general-purpose inference server
  • Framework-agnostic (PyTorch, TensorFlow, ONNX, TensorRT)
  • Not LLM-specific, but increasingly optimized for LLMs
  • Focus: enterprise deployment, multi-model serving, complex pipelines



2. Core Architecture

Backend System

Pluggable backends for different frameworks:
  • Python backend (custom inference logic)
  • PyTorch backend (TorchScript models)
  • TensorRT backend (TensorRT engines)
  • vLLM backend (first shipped in the 23.10 release, late 2023)

Benefit: mix different model types in the same server


Model Repository

  • Centralized model storage (local/S3/GCS/Azure)
  • Version management
  • Hot-reloading of model versions
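
The repository layout can be sketched with a throwaway directory. This is a minimal sketch, assuming the commonly documented layout of one directory per model that holds a config.pbtxt plus numeric version subdirectories. The model name text_encoder and the empty model.onnx file are placeholders, not part of the original notes.

```python
import tempfile
from pathlib import Path
from typing import List


def create_model_version(repo: Path, model: str, version: int) -> Path:
    """Create <repo>/<model>/<version>/ with a placeholder model file."""
    version_dir = repo / model / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    (version_dir / "model.onnx").write_bytes(b"")  # placeholder, not a real model
    return version_dir


def list_versions(repo: Path, model: str) -> List[int]:
    """Return the numeric version directories of a model, sorted ascending."""
    model_dir = repo / model
    return sorted(
        int(p.name) for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()
    )


def build_demo_repo() -> List[int]:
    """Build a tiny repository in a temp dir and report the versions found."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        (repo / "text_encoder").mkdir()
        (repo / "text_encoder" / "config.pbtxt").write_text('backend: "onnxruntime"\n')
        for version in (1, 2, 10):
            create_model_version(repo, "text_encoder", version)
        return list_versions(repo, "text_encoder")


print(build_demo_repo())  # [1, 2, 10]
```

Versions are compared as integers, so 10 sorts after 2 rather than before it.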


3. LLM-Specific Features (2024-2025)

vLLM Backend Integration

  • Uses vLLM engine under the hood
  • Triton API layer on top
  • Combines vLLM's PagedAttention with Triton's enterprise features (versioning, ensembles, metrics)
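
To make the "Triton API layer on top" concrete, here is a sketch of how a client might build a text generation request. It assumes the server exposes Triton's HTTP generate endpoint on localhost:8000 and that the model is deployed under the name vllm_model; both are assumptions for illustration. The code only builds the URL and JSON body and sends nothing.

```python
import json
from typing import Tuple


def build_generate_request(
    model_name: str,
    prompt: str,
    max_tokens: int = 64,
    base_url: str = "http://localhost:8000",  # assumed default HTTP port
) -> Tuple[str, str]:
    """Return (url, json_body) for a non-streaming generation request."""
    url = f"{base_url}/v2/models/{model_name}/generate"
    body = {
        "text_input": prompt,
        # Sampling parameters travel with each request, not in config.pbtxt.
        "parameters": {"stream": False, "temperature": 0, "max_tokens": max_tokens},
    }
    return url, json.dumps(body)


url, body = build_generate_request("vllm_model", "What is Triton Inference Server?")
print(url)
print(body)
```

The same body could be sent with any HTTP client. Check the exact parameter names against the backend version you deploy.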

TensorRT-LLM Backend

  • Native integration with TensorRT-LLM engines
  • Maximum performance for NVIDIA GPUs
  • Requires pre-built TensorRT-LLM engines


4. Advanced Serving Capabilities

Model Ensembles

Multi-stage pipelines exposed as a single endpoint:
  • Preprocessing → Embedding → LLM → Postprocessing
  • Automatic scheduling between stages
  • Example: RAG pipeline with retrieval + generation
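
The idea of chaining stages behind one call can be sketched in plain Python. This is a toy, in-process illustration and not Triton's ensemble scheduler. In Triton the wiring is declared in the ensemble's config by tensor name, whereas here each stage is a hypothetical function that takes and returns a dict of named values.

```python
from typing import Callable, Dict, List

Tensors = Dict[str, object]
Stage = Callable[[Tensors], Tensors]


def preprocess(t: Tensors) -> Tensors:
    return {"tokens": str(t["raw_text"]).lower().split()}


def embed(t: Tensors) -> Tensors:
    # Stand-in for an embedding model: one number per token.
    return {"tokens": t["tokens"], "embedding": [len(tok) for tok in t["tokens"]]}


def generate(t: Tensors) -> Tensors:
    # Stand-in for the LLM stage.
    return {"draft": " ".join(t["tokens"]) + f" ({sum(t['embedding'])})"}


def postprocess(t: Tensors) -> Tensors:
    return {"answer": str(t["draft"]).capitalize()}


def run_ensemble(stages: List[Stage], request: Tensors) -> Tensors:
    """Feed each stage's named outputs into the next stage, in order."""
    tensors = request
    for stage in stages:
        tensors = stage(tensors)
    return tensors


result = run_ensemble(
    [preprocess, embed, generate, postprocess], {"raw_text": "Hello Triton"}
)
print(result)  # {'answer': 'Hello triton (11)'}
```

The caller makes one request and sees only the final output. The intermediate values never leave the server, which is where the latency saving over separate network calls comes from.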

Dynamic Batching

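  • Accumulates queued requests into one batch, up to the model's max batch size
  • Flushes a partial batch once the configured maximum queue delay expires
  • Works at the request level, so it is more basic than the iteration-level (continuous) batching in vLLM/TGI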
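
The size and timeout rules can be sketched as a small simulation over arrival timestamps. This is a simplified model, assuming batches depend only on arrival times. It ignores execution time, instance availability, and preferred batch sizes, all of which a real scheduler also considers. The function and parameter names are illustrative.

```python
from typing import List


def form_batches(
    arrivals_us: List[int], max_batch_size: int, max_queue_delay_us: int
) -> List[List[int]]:
    """Group arrival timestamps (microseconds) into batches.

    A batch is dispatched when it is full, or when the next request arrives
    after the oldest queued request has already waited max_queue_delay_us.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    for t in sorted(arrivals_us):
        if current and t - current[0] > max_queue_delay_us:
            batches.append(current)  # timeout flush of a partial batch
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append(current)  # size flush of a full batch
            current = []
    if current:
        batches.append(current)  # flush whatever is left at the end
    return batches


arrivals = [0, 100, 200, 5000, 5100, 5200, 5300, 5400, 20000]
print(form_batches(arrivals, max_batch_size=4, max_queue_delay_us=1000))
# [[0, 100, 200], [5000, 5100, 5200, 5300], [5400], [20000]]
```

The first batch leaves with three requests because of the timeout, and the second leaves as soon as it is full. Each request stays in its batch until the whole batch finishes, which is the main difference from iteration-level scheduling.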

Sequence Batching

  • For stateful models (e.g., streaming LLMs)
  • Routes every request of a sequence to the same model instance, so state persists across requests
  • Useful for chat applications
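
A toy sketch of the bookkeeping behind this: requests carry a sequence (correlation) ID plus start and end flags, and state is kept per ID. The SequenceRouter class below is hypothetical and runs in-process. It only illustrates the idea and is not Triton's sequence batcher.

```python
from typing import Dict, List


class SequenceRouter:
    """Keep per-sequence state keyed by a correlation ID."""

    def __init__(self) -> None:
        self._state: Dict[int, List[str]] = {}

    def infer(
        self, sequence_id: int, text: str, start: bool = False, end: bool = False
    ) -> str:
        if start:
            self._state[sequence_id] = []  # fresh state for a new sequence
        if sequence_id not in self._state:
            raise KeyError(f"sequence {sequence_id} was never started")
        history = self._state[sequence_id]
        history.append(text)
        reply = f"turn {len(history)}: {' | '.join(history)}"
        if end:
            del self._state[sequence_id]  # release the slot for reuse
        return reply

    def active_sequences(self) -> List[int]:
        return sorted(self._state)


router = SequenceRouter()
print(router.infer(7, "hi", start=True))          # turn 1: hi
print(router.infer(42, "hello", start=True))      # turn 1: hello
print(router.infer(7, "how are you?", end=True))  # turn 2: hi | how are you?
print(router.active_sequences())                  # [42]
```

Interleaved requests from different sequences do not see each other's state, and ending a sequence frees its slot.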


5. Model Configuration

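Model config.pbtxt example (illustrative; with the vLLM backend, engine arguments normally live in a separate model.json):

# Backend that runs this model
backend: "vllm"
# Largest batch Triton's scheduler may form; the vLLM backend batches
# internally, so check whether your version accepts a non-zero value
max_batch_size: 32

# vLLM manages device placement itself, so its sample configs use KIND_MODEL
instance_group [{
  count: 1
  kind: KIND_MODEL
}]

# Free-form string key/value pair passed to the backend (illustrative)
parameters: {
  key: "max_tokens"
  value: { string_value: "2048" }
}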



6. Scaling & Deployment

Kubernetes Native

  • Official Helm charts
  • Horizontal Pod Autoscaler support
  • Integration with Istio/Envoy for traffic management

Multi-Instance Serving

  • Multiple model instances per GPU
  • Rate limiting and priority queues
  • Request routing based on model version


7. Metrics & Observability

Comprehensive Monitoring:
  • Prometheus metrics (latency, throughput, queue depth)
  • Per-model and per-version metrics
  • GPU utilization tracking
  • Inference count, batch statistics
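
As a sketch of what can be derived from these counters, the snippet below parses a Prometheus-format sample and computes the average batch size and the average queue time per request. The metric names follow Triton's nv_inference_* naming. The sample values are made up for illustration, and the parser is deliberately minimal: it only handles lines whose labels are exactly model and version.

```python
import re
from typing import Dict, Tuple

# Made-up sample in Prometheus text format (not real measurements).
SAMPLE_METRICS = """\
# HELP nv_inference_count Number of inferences performed
# TYPE nv_inference_count counter
nv_inference_count{model="llm",version="1"} 1200
nv_inference_exec_count{model="llm",version="1"} 300
nv_inference_request_success{model="llm",version="1"} 1200
nv_inference_queue_duration_us{model="llm",version="1"} 600000
"""

LINE = re.compile(r'^(\w+)\{model="([^"]+)",version="([^"]+)"\}\s+(\S+)$')


def parse_metrics(text: str) -> Dict[Tuple[str, str, str], float]:
    """Map (metric, model, version) to its value; skip comments and other lines."""
    values: Dict[Tuple[str, str, str], float] = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            name, model, version, value = match.groups()
            values[(name, model, version)] = float(value)
    return values


def summarize(text: str, model: str, version: str) -> Dict[str, float]:
    metrics = parse_metrics(text)

    def get(name: str) -> float:
        return metrics[(name, model, version)]

    return {
        # Inferences per execution: how well batching is working.
        "avg_batch_size": get("nv_inference_count") / get("nv_inference_exec_count"),
        # Cumulative queue time divided by successful requests.
        "avg_queue_us": get("nv_inference_queue_duration_us")
        / get("nv_inference_request_success"),
    }


print(summarize(SAMPLE_METRICS, "llm", "1"))
# {'avg_batch_size': 4.0, 'avg_queue_us': 500.0}
```

In practice the text would come from the server's metrics endpoint and the ratios would be computed in Prometheus itself. The arithmetic is the same.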

Tracing:
  • OpenTelemetry support
  • Request-level tracing through pipeline stages



8. Performance Optimization

Concurrent Model Execution

  • Multiple models on the same GPU (if memory allows)
  • Triton schedules their executions so they can overlap on the GPU
  • Useful for A/B testing

Instance Groups

  • Multiple instances of same model
  • Load balancing across instances
  • Can specify different GPUs per instance


9. Interview Q&A

Q: When to use Triton over vLLM/TGI?
A: When you need multi-framework support, complex model pipelines, or enterprise features (model versioning, ensembles). For pure LLM serving, vLLM/TGI are simpler.


Q: How does Triton's vLLM backend differ from standalone vLLM?
A: Same core engine, but Triton adds: model versioning, ensemble pipelines, enterprise monitoring, multi-framework support. Trade-off: extra abstraction layer with slight overhead.


Q: What's the benefit of model ensembles?
A: Single API call for multi-stage pipelines. Triton handles scheduling, batching, and data passing between stages. Reduces latency vs multiple network hops.


Q: How does dynamic batching work in Triton?
A: Queues requests until a batch reaches max_batch_size or the oldest request has waited max_queue_delay_microseconds, then runs the batch. Simpler than continuous batching (no iteration-level scheduling), so it suits CV/audio models better than LLMs.


Q: Why use Triton for LLMs when vLLM exists?
A: Multi-model serving (embeddings + LLM + reranker), existing NVIDIA infrastructure, need for A/B testing across model versions, enterprise governance requirements.


Q: How does Triton handle model updates?
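A: With the server started in poll mode (--model-control-mode=poll), Triton detects repository changes and loads new versions without a restart; the version policy decides which versions are served. Triton does not split traffic by percentage itself: clients can pin a version, and a gradual rollout (e.g., 90% v1, 10% v2) is done by the caller or by a traffic layer such as Istio/Envoy.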
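
A percentage split between versions can be sketched on the caller side. This is a minimal sketch, assuming the split is done by the client (or a gateway) and that both versions are loaded. The function names, the request-ID hashing scheme, and the localhost URL are illustrative assumptions. The versioned path follows the v2 inference protocol.

```python
import hashlib


def pick_version(
    request_id: str,
    stable_version: str = "1",
    canary_version: str = "2",
    canary_percent: int = 10,
) -> str:
    """Deterministically map roughly canary_percent% of request IDs to the canary."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return canary_version if bucket < canary_percent else stable_version


def infer_url(model: str, version: str, base_url: str = "http://localhost:8000") -> str:
    """Build the versioned inference URL for a model."""
    return f"{base_url}/v2/models/{model}/versions/{version}/infer"


version = pick_version("user-123")
print(infer_url("reranker", version))
```

Hashing the request or user ID keeps the assignment stable across retries, which makes comparing the two versions cleaner than random sampling per call.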