TensorRT-LLM
1. Core Architecture¶
NVIDIA's open-source library for optimizing LLM inference on NVIDIA GPUs
- Built on TensorRT for kernel-level optimization
- Focuses on extracting maximum performance from NVIDIA hardware
- Trade-off: Complex setup vs peak performance
2. Key Technologies¶
1. In-flight Batching (Continuous Batching)¶
- Similar to vLLM's approach
- Dynamically adds/removes requests during execution
- Optimized specifically for NVIDIA GPU scheduling
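The scheduling idea above can be sketched in a few lines. This is a toy illustration of in-flight batching, not the real TensorRT-LLM scheduler: requests join and leave the running batch at token granularity instead of waiting for the whole batch to drain.

```python
# Toy in-flight (continuous) batching loop. All names are illustrative.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    tokens_left: int  # decode steps remaining

def run_inflight_batching(requests, max_batch=2):
    waiting = deque(requests)
    running, finished_order = [], []
    while waiting or running:
        # Admit new requests as soon as a slot frees up (no batch barrier).
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step for every in-flight request.
        for r in running:
            r.tokens_left -= 1
        # Retire finished requests immediately, freeing their slots.
        for r in [r for r in running if r.tokens_left == 0]:
            running.remove(r)
            finished_order.append(r.rid)
    return finished_order

order = run_inflight_batching([Request(0, 3), Request(1, 1), Request(2, 2)])
```

Note how request 1 finishes and request 2 is admitted while request 0 is still decoding; static batching would have stalled both until the longest request completed.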
2. Paged KV Cache¶
- Inspired by vLLM's PagedAttention
- NVIDIA-optimized memory management
- Custom CUDA kernels for memory operations
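A minimal sketch of the paged idea (block size and pool size are made-up numbers, and the real implementation lives in CUDA): the cache is carved into fixed-size blocks, and each sequence holds a table of block ids instead of one contiguous slab.

```python
# Toy paged KV cache allocator; illustrative only.
BLOCK_SIZE = 4   # tokens per block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}  # seq_id -> list of block ids

    def append_token(self, seq_id, pos):
        """Allocate a new block only when the sequence crosses a block boundary."""
        if pos % BLOCK_SIZE == 0:
            self.tables.setdefault(seq_id, []).append(self.free.pop())

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id))

cache = PagedKVCache(num_blocks=8)
for pos in range(6):          # a 6-token sequence needs ceil(6/4) = 2 blocks
    cache.append_token("seq0", pos)
```

Because blocks are allocated on demand and returned on completion, memory is wasted only in the last, partially filled block of each sequence.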
3. Kernel Fusion¶
- Combines multiple operations into single kernels
- Reduces memory transfers between GPU operations
- Examples: LayerNorm+Residual, QKV projection fusion
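The LayerNorm+Residual example can be made concrete. NumPy stands in for CUDA kernels here: the unfused version runs two "kernels" and materializes an intermediate tensor between them, while the fused version does the add and normalization in one pass, which is where the memory-traffic saving comes from.

```python
# Conceptual fused vs. unfused residual-add + LayerNorm; illustrative only.
import numpy as np

def layernorm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def unfused(x, residual):
    tmp = x + residual          # "kernel 1": writes tmp to global memory
    return layernorm(tmp)       # "kernel 2": reads tmp back in

def fused(x, residual, eps=1e-5):
    # single "kernel": add and normalize without a round-trip through memory
    s = x + residual
    mu = s.mean(-1, keepdims=True)
    var = s.var(-1, keepdims=True)
    return (s - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))
r = rng.normal(size=(2, 8))
```

The two paths are numerically identical; the difference on a GPU is one global-memory write/read pair per element eliminated.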
4. FlashAttention & FP8 Support¶
- Integrated FlashAttention-2 for memory-efficient attention
- Native FP8 quantization on Hopper GPUs (H100)
- Up to 2x throughput vs FP16 with minimal accuracy loss
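The per-tensor scaling behind FP8 can be sketched numerically. This is a hedged simulation, not real FP8 hardware arithmetic: values are scaled so the tensor's max magnitude maps to the E4M3 maximum (448), clipped to that range, and a float16 round-trip stands in for the precision loss of the actual 8-bit cast.

```python
# Simulated per-tensor FP8 (E4M3-style) scaling round-trip; illustrative only.
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_roundtrip(x):
    scale = np.abs(x).max() / E4M3_MAX        # per-tensor scale factor
    q = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)
    # stand-in for the precision loss of a real FP8 cast:
    q = q.astype(np.float16).astype(np.float32)
    return q * scale

rng = np.random.default_rng(0)
x = rng.normal(size=1024).astype(np.float32)
err = np.abs(fp8_roundtrip(x) - x).max()
```

The point of the per-tensor scale is that the quantization grid tracks the tensor's dynamic range, which is why calibration (choosing good scales) matters for accuracy.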
3. Quantization Support¶
Weight-Only Quantization:
- INT8/INT4 weights, FP16 activations
- 2-4x memory reduction
- GPTQ, AWQ methods supported
Activation Quantization:
- FP8 (Hopper GPUs only)
- SmoothQuant for INT8 activations
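The SmoothQuant idea mentioned above can be shown in a few lines (alpha, shapes, and data are illustrative): activation outliers are migrated into the weights with a per-channel scale s, so that (X / s) @ (s * W) equals X @ W exactly, but the scaled activations have a much smaller dynamic range and quantize cleanly to INT8.

```python
# Sketch of SmoothQuant's outlier migration; illustrative parameters.
import numpy as np

def smooth(X, W, alpha=0.5):
    act_max = np.abs(X).max(axis=0)          # per-input-channel activation max
    w_max = np.abs(W).max(axis=1)            # per-input-channel weight max
    s = act_max**alpha / w_max**(1 - alpha)  # migration scale per channel
    return X / s, W * s[:, None]             # math is unchanged, range moves

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 3] *= 50                                # channel 3 carries outliers
W = rng.normal(size=(8, 16))
Xs, Ws = smooth(X, W)
```

After smoothing, the outlier channel's magnitude is split between activations and weights, so neither tensor needs an extreme INT8 quantization range.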
4. Model Parallelism¶
Tensor Parallelism¶
- Splits the weight matrices within each layer across GPUs
- Low-latency (intra-node communication)
- Best for latency-sensitive serving
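Tensor parallelism on a single matmul can be sketched with array slices standing in for GPUs: the weight matrix is split column-wise, each shard computes a slice of the output locally, and the slices are concatenated (an all-gather in a real multi-GPU setup).

```python
# Column-parallel matmul sketch; NumPy slices stand in for per-GPU shards.
import numpy as np

def column_parallel_matmul(X, W, num_gpus):
    shards = np.split(W, num_gpus, axis=1)    # each "GPU" holds a column block
    partials = [X @ Ws for Ws in shards]      # local matmuls, no communication
    return np.concatenate(partials, axis=1)   # all-gather of output slices

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 16))
W = rng.normal(size=(16, 32))
Y = column_parallel_matmul(X, W, num_gpus=4)
```

Each GPU stores only 1/num_gpus of the weights, which is why TP both shrinks per-GPU memory and cuts per-layer latency, at the cost of a collective after (some) layers.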
Pipeline Parallelism¶
- Splits the model into sequential stages of consecutive layers
- Higher throughput for large batches
- Micro-batching to reduce bubbles
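Why micro-batching helps can be quantified with the standard pipeline-bubble formula: with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1), so more micro-batches shrink the bubble (the numbers below are just examples).

```python
# Pipeline bubble fraction: (stages - 1) / (micro_batches + stages - 1).
def bubble_fraction(stages, micro_batches):
    return (stages - 1) / (micro_batches + stages - 1)

few = bubble_fraction(stages=4, micro_batches=1)    # 3/4 of time idle
many = bubble_fraction(stages=4, micro_batches=16)  # 3/19 of time idle
```

With a single micro-batch, a 4-stage pipeline is idle 75% of the time; splitting the batch into 16 micro-batches cuts that to about 16%.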
Combined TP+PP¶
- Multi-dimensional parallelism
- Example: 8-way TP × 4-way PP for 32 GPUs
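For the 8-way TP × 4-way PP example, each of the 32 GPUs gets a coordinate in a 2-D grid. A minimal sketch of one common layout convention (the actual rank ordering is configurable and varies by framework):

```python
# Map a flat GPU rank to (tp_rank, pp_rank) for 8-way TP x 4-way PP.
# The TP-major layout here is one convention, not the only one.
def rank_to_coords(rank, tp_size=8, pp_size=4):
    assert 0 <= rank < tp_size * pp_size
    return rank % tp_size, rank // tp_size  # (tp_rank, pp_rank)

# GPUs 0..7 form the TP group of pipeline stage 0, 8..15 stage 1, and so on.
```

Keeping each TP group within one node (8 GPUs on NVLink) and letting PP cross nodes matches the latency profile of the two schemes: TP is communication-heavy, PP is not.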
5. Engine Building Process¶
Two-Step Workflow:
1. Build: Model → Optimized TensorRT engine (slow, one-time)
2. Runtime: Load engine → Inference (fast)
Key considerations:
- Engines are GPU-specific (H100 engine ≠ A100 engine)
- Rebuild required to exceed the maximum batch size or sequence length baked into the engine
- Trade flexibility for maximum performance
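A hedged sketch of the two-step workflow as a CLI session (exact flag names and the checkpoint-conversion step vary across TensorRT-LLM releases; check `trtllm-build --help` for your version):

```
# 1. Build: compile a converted checkpoint into a GPU-specific engine
#    (slow, one-time; paths and limits below are examples)
trtllm-build --checkpoint_dir ./model_ckpt \
             --output_dir ./model_engine \
             --max_batch_size 16 \
             --max_input_len 2048

# 2. Runtime: load ./model_engine and serve (e.g. via the Python runtime
#    or Triton); the engine only runs on the GPU architecture it was
#    built for, and only within the shape limits set at build time.
```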
6. Multi-GPU Inference Modes¶
KV Cache Transfer Optimization:
- Custom NCCL/NVLink operations for KV cache
- Overlaps communication with computation
- Critical for tensor-parallel setups
7. Interview Q&A¶
Q: When to choose TensorRT-LLM over vLLM?
A: When you need absolute maximum throughput on NVIDIA GPUs and can handle complex setup. vLLM for ease of use and flexibility; TensorRT-LLM for peak performance.
Q: Why is engine building necessary?
A: TensorRT optimizes compute graphs at compile time (kernel selection, fusion, memory layout). This specialization achieves maximum performance but loses runtime flexibility.
Q: How does TensorRT-LLM handle dynamic shapes?
A: Uses optimization profiles with min/max ranges during build. Runtime performance varies by how well actual inputs match the profile. Too wide a range reduces optimization effectiveness.
Q: What's the FP8 accuracy impact?
A: On Hopper GPUs with proper calibration, <1% accuracy degradation for most models. Requires per-tensor scaling and careful quantization of outlier features.
Q: Why does TensorRT-LLM require specific CUDA versions?
A: Tightly integrated with CUDA toolkit for custom kernel launches, memory management, and GPU-specific optimizations. Newer releases exploit latest CUDA features (e.g., Hopper's Tensor Memory Accelerator).