TensorRT-LLM

1. Core Architecture

NVIDIA's optimization stack for LLM inference on their GPUs.
- Built on TensorRT for kernel-level optimization
- Focuses on extracting maximum performance from NVIDIA hardware
- Trade-off: complex setup vs peak performance



2. Key Technologies

1. In-flight Batching (Continuous Batching)

  • Similar to vLLM's approach
  • Dynamically adds/removes requests during execution
  • Optimized specifically for NVIDIA GPU scheduling
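
A minimal sketch of the scheduling idea in Python (the `Request` and `model.decode_step` interface here is hypothetical, not TensorRT-LLM's API):

```python
from dataclasses import dataclass

@dataclass
class Request:
    tokens: list
    max_len: int = 128
    done: bool = False

def serve_step(model, active, pending, max_batch=8, eos=0):
    # Admit new requests the moment slots free up; no waiting for the
    # whole batch to drain (the "in-flight" part).
    while pending and len(active) < max_batch:
        active.append(pending.pop(0))
    # One decode step across whatever is currently in the batch.
    next_tokens = model.decode_step([r.tokens for r in active])
    for req, tok in zip(active, next_tokens):
        req.tokens.append(tok)
        req.done = tok == eos or len(req.tokens) >= req.max_len
    # Retire finished requests immediately, freeing slots for `pending`.
    active[:] = [r for r in active if not r.done]
```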

2. Paged KV Cache

  • Inspired by vLLM's PagedAttention
  • NVIDIA-optimized memory management
  • Custom CUDA kernels for memory operations
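
The bookkeeping behind paging, in toy Python form (block size and names are illustrative; the real lookups happen inside custom CUDA kernels):

```python
class PagedKVCache:
    """Toy block table: logical token positions -> fixed-size physical blocks."""
    def __init__(self, num_blocks, block_size=64):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # pool of physical block ids
        self.tables = {}                      # req_id -> [physical block ids]
        self.lengths = {}                     # req_id -> tokens written so far

    def append(self, req_id):
        """Reserve space for one new token, allocating a block on demand."""
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:          # current block full (or none yet)
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1

    def locate(self, req_id, pos):
        """Map a logical position to (physical block, offset)."""
        return (self.tables[req_id][pos // self.block_size],
                pos % self.block_size)

    def release(self, req_id):
        """Return a finished request's blocks to the pool; no fragmentation."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)
```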

3. Kernel Fusion

  • Combines multiple operations into single kernels
  • Reduces memory transfers between GPU operations
  • Examples: LayerNorm+Residual, QKV projection fusion
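
NumPy cannot actually fuse kernels, but the sketch below shows the semantics: the unfused path materializes an intermediate tensor between two kernels, while a fused LayerNorm+residual kernel does all the work in one pass over memory:

```python
import numpy as np

def layernorm(y, eps=1e-5):
    mu = y.mean(-1, keepdims=True)
    var = y.var(-1, keepdims=True)
    return (y - mu) / np.sqrt(var + eps)

def unfused(x, residual):
    y = x + residual      # kernel 1: writes y to global memory
    return layernorm(y)   # kernel 2: reads y back, writes the output

def fused_semantics(x, residual, eps=1e-5):
    # Same math; a real fused kernel computes the sum and its statistics
    # while the data sits in registers/shared memory: one read of the
    # inputs, one write of the result, no intermediate in global memory.
    y = x + residual
    return (y - y.mean(-1, keepdims=True)) / np.sqrt(y.var(-1, keepdims=True) + eps)
```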

4. FlashAttention & FP8 Support

  • Integrated FlashAttention-2 for memory-efficient attention
  • Native FP8 quantization on Hopper GPUs (H100)
  • Up to ~2x throughput vs FP16 with minimal accuracy loss


3. Quantization Support

Weight-Only Quantization:
- INT8/INT4 weights, FP16 activations
- 2-4x memory reduction
- GPTQ and AWQ methods supported
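
A toy version of the storage/dequant scheme (per-output-channel INT8; GPTQ and AWQ refine *which* values get rounded, but the layout is the same idea):

```python
import numpy as np

def quantize_weights(w):                      # w: [out_features, in_features]
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0   # per-output-channel
    w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_int8, scale.astype(np.float16)   # 1 byte/weight + a tiny scale vector

def wq_matmul(x, w_int8, scale):
    # Activations stay FP16; weights are dequantized on the fly in the GEMM.
    return x @ (w_int8.astype(np.float16) * scale).T
```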

Activation Quantization:
- FP8 (Hopper GPUs only)
- SmoothQuant for INT8 activations
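
SmoothQuant's core trick fits in a few lines: divide activation outliers down by a per-channel factor s and fold s into the weights, leaving the product unchanged (alpha trades quantization difficulty between the two tensors):

```python
import numpy as np

def smooth(x, w, alpha=0.5):                  # x: [tokens, in], w: [out, in]
    # Per-input-channel factor from the SmoothQuant paper:
    # s_j = max|x_j|^alpha / max|w_j|^(1-alpha)
    s = np.abs(x).max(0) ** alpha / np.abs(w).max(0) ** (1 - alpha)
    return x / s, w * s                       # (x/s) @ (w*s).T == x @ w.T

x, w = np.random.randn(4, 16), np.random.randn(8, 16)
xs, ws = smooth(x, w)
assert np.allclose(xs @ ws.T, x @ w.T)        # product unchanged, outliers tamed
```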



4. Model Parallelism

Tensor Parallelism

  • Splits each layer's weight matrices across GPUs
  • Low-latency (intra-node communication)
  • Best for latency-sensitive serving
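
Column-parallel splitting of a single linear layer, sketched in NumPy (each shard would live on a different GPU; the final concatenate is the all-gather):

```python
import numpy as np

def column_parallel_linear(x, w, tp=2):       # x: [tokens, in], w: [out, in]
    shards = np.split(w, tp, axis=0)          # each rank holds out/tp rows
    partials = [x @ s.T for s in shards]      # computed on separate GPUs
    return np.concatenate(partials, axis=-1)  # all-gather of output slices

# Sanity check: identical to the unsplit layer.
x, w = np.random.randn(4, 16), np.random.randn(8, 16)
assert np.allclose(column_parallel_linear(x, w), x @ w.T)
```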

Pipeline Parallelism

  • Splits the model depth-wise into sequential stages
  • Higher throughput for large batches
  • Micro-batching to reduce bubbles
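
The micro-batching payoff is easy to quantify: with p stages and m micro-batches, a GPipe-style schedule idles for a bubble fraction of (p-1)/(m+p-1):

```python
def bubble_fraction(p, m):
    """Idle fraction of a p-stage pipeline fed m micro-batches (GPipe schedule)."""
    return (p - 1) / (m + p - 1)

for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches -> {bubble_fraction(4, m):.0%} bubble")
# 1 micro-batch leaves stages idle 75% of the time; 64 cuts that to ~4%.
```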

Combined TP+PP

  • Multi-dimensional parallelism
  • Example: 8-way TP × 4-way PP for 32 GPUs
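
One conventional rank layout for that example, with TP groups on consecutive ranks so they stay on the same NVLink-connected node (illustrative; frameworks differ in their mapping):

```python
TP, PP = 8, 4                                  # 8-way TP x 4-way PP = 32 GPUs
for rank in range(TP * PP):
    tp_rank, pp_stage = rank % TP, rank // TP  # TP groups on consecutive ranks
    if rank in (0, 7, 8, 31):
        print(f"rank {rank:2d}: TP member {tp_rank} of pipeline stage {pp_stage}")
# TP traffic stays on intra-node NVLink; only PP activations cross nodes.
```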


5. Engine Building Process

Two-Step Workflow:
1. Build: model → optimized TensorRT engine (slow, one-time)
2. Runtime: load engine → inference (fast)

Key considerations:
- Engines are GPU-specific (an H100 engine ≠ an A100 engine)
- Rebuilds are required for different maximum batch sizes or sequence lengths
- Trades flexibility for maximum performance
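
A hedged sketch of the two steps using the high-level `LLM` API from recent TensorRT-LLM releases (names may differ in your version; the older flow used the `trtllm-build` CLI plus a separate runtime):

```python
from tensorrt_llm import LLM, SamplingParams

# Step 1 (slow, one-time): the constructor builds an engine for this exact
# GPU the first time it sees the model.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Step 2 (fast, repeated): run inference against the built engine.
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```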



6. Multi-GPU Inference Modes

KV Cache Transfer Optimization:
- Custom NCCL/NVLink operations for KV cache
- Overlaps communication with computation
- Critical for tensor parallel setups
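
The overlap pattern, sketched with PyTorch CUDA streams (TensorRT-LLM does this with custom NCCL kernels; the destination buffer here stands in for a peer GPU's memory):

```python
import torch

comm = torch.cuda.Stream()                     # dedicated communication stream
x = torch.randn(4096, 4096, device="cuda")
kv = torch.randn(1024, 4096, device="cuda")
kv_dst = torch.empty_like(kv)                  # stand-in for a peer GPU buffer

with torch.cuda.stream(comm):
    kv_dst.copy_(kv, non_blocking=True)        # KV "transfer" on the comm stream
y = x @ x                                      # compute runs concurrently on the
                                               # default stream
torch.cuda.current_stream().wait_stream(comm)  # sync before consuming kv_dst
```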



7. Interview Q&A

Q: When to choose TensorRT-LLM over vLLM?
A: When you need absolute maximum throughput on NVIDIA GPUs and can handle complex setup. vLLM for ease of use and flexibility; TensorRT-LLM for peak performance.


Q: Why is engine building necessary?
A: TensorRT optimizes compute graphs at compile time (kernel selection, fusion, memory layout). This specialization achieves maximum performance but loses runtime flexibility.


Q: How does TensorRT-LLM handle dynamic shapes?
A: Uses optimization profiles with min/max ranges during build. Runtime performance varies by how well actual inputs match the profile. Too wide a range reduces optimization effectiveness.
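
What a profile looks like at the TensorRT layer (real `tensorrt` Python API; the tensor name is hypothetical, and TensorRT-LLM normally sets this up for you from flags like `--max_batch_size`):

```python
import tensorrt as trt

builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
config = builder.create_builder_config()

# The engine is tuned for `opt` and only accepts shapes within [min, max];
# a very wide range forces kernels that work everywhere but excel nowhere.
profile = builder.create_optimization_profile()
profile.set_shape("input_ids",            # (batch, seq_len)
                  (1, 1),                 # min
                  (8, 512),               # opt: tune for this shape
                  (16, 2048))             # max
config.add_optimization_profile(profile)
```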


Q: What's the FP8 accuracy impact?
A: On Hopper GPUs with proper calibration, <1% accuracy degradation for most models. Requires per-tensor scaling and careful quantization of outlier features.
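
Per-tensor scaling in miniature (pure NumPy; real FP8 uses E4M3 rounding in hardware, so this only shows how a single outlier distorts the scale):

```python
import numpy as np

E4M3_MAX = 448.0                       # max magnitude representable in FP8 E4M3

def per_tensor_scale(x):
    return E4M3_MAX / np.abs(x).max()  # map the largest |value| to E4M3_MAX

x = np.random.randn(4096).astype(np.float32)
print(per_tensor_scale(x))             # healthy scale, good use of FP8 range
x[0] = 100.0                           # one outlier feature...
print(per_tensor_scale(x))             # ...shrinks the scale dramatically,
                                       # squeezing typical values into the
                                       # bottom of E4M3's range
```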


Q: Why does TensorRT-LLM require specific CUDA versions?
A: Tightly integrated with CUDA toolkit for custom kernel launches, memory management, and GPU-specific optimizations. Newer releases exploit latest CUDA features (e.g., Hopper's Tensor Memory Accelerator).