# TensorRT-LLM

NVIDIA's open-source library for optimizing LLM inference, built on TensorRT for maximum throughput and low latency on NVIDIA GPUs. Use it for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than stock PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

## When to use TensorRT-LLM

Use TensorRT-LLM when:

- Deploying on NVIDIA GPUs (A100, H100, GB200)
- You need maximum throughput (24,000+ tokens/sec on Llama 3)
- You require low latency for real-time applications
- Working with quantized models (FP8, INT4, FP4)
- Scaling across multiple GPUs or nodes

Use vLLM instead when:

- You need a simpler setup and a Python-first API
- You want PagedAttention without TensorRT compilation
- Working with AMD GPUs or other non-NVIDIA hardware

Use llama.cpp instead when:

- Deploying on CPU or Apple Silicon
- You need edge deployment without NVIDIA GPUs
- You want the simpler GGUF quantization format

## Quick start

### Installation

```bash
# Docker (recommended)
docker pull nvidia/tensorrt_llm:latest

# pip install
pip install tensorrt_llm==1.2.0rc3

# Requires CUDA 13.0.0, TensorRT 10.13.2, Python 3.10-3.12
```
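After installing, a quick import check catches most environment problems, since mismatched CUDA or TensorRT versions typically fail at import time; `tensorrt_llm.__version__` is assumed to exist here, as it does for most Python packages:

```python
# Hedged sanity check: a broken CUDA/TensorRT setup usually surfaces as an
# ImportError at this point rather than later during engine build.
import tensorrt_llm

print(tensorrt_llm.__version__)  # assumed attribute; verify for your release
```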

### Basic inference

```python
from tensorrt_llm import LLM, SamplingParams

# Initialize the model (downloads weights and builds the engine on first use)
llm = LLM(model="meta-llama/Meta-Llama-3-8B")

# Configure sampling
sampling_params = SamplingParams(
    max_tokens=100,
    temperature=0.7,
    top_p=0.9,
)

# Generate
prompts = ["Explain quantum computing"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    # Each request output holds one or more completions; the generated
    # text lives on the completion, not on the request output itself.
    print(output.outputs[0].text)
```

### Serving with trtllm-serve

```bash
# Start server (automatic model download and compilation).
# --tp_size 4 shards the model across 4 GPUs with tensor parallelism.
trtllm-serve meta-llama/Meta-Llama-3-8B \
    --tp_size 4 \
    --max_batch_size 256 \
    --max_num_tokens 4096

# Client request (OpenAI-compatible endpoint)
curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3-8B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "temperature": 0.7,
        "max_tokens": 100
    }'
```
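Because trtllm-serve exposes an OpenAI-compatible endpoint, the standard `openai` Python client works against it as well; a minimal sketch, assuming the default port above (the `api_key` value is a placeholder, since the local server does not validate it unless you put auth in front):

```python
from openai import OpenAI

# Point the stock OpenAI client at the local trtllm-serve endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)
```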

## Key features

### Performance optimizations

- In-flight batching: Dynamic batching during generation
- Paged KV cache: Efficient memory management
- Flash Attention: Optimized attention kernels
- Quantization: FP8, INT4, FP4 for 2-4× faster inference
- CUDA graphs: Reduced kernel launch overhead
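Several of these knobs are exposed when constructing the `LLM`; here is a minimal sketch of tuning the paged KV cache, assuming the `KvCacheConfig` class and field names shown below (verify them against the docs for your release):

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig  # assumed import path

# Hedged sketch: reserve most of the free VRAM for paged KV blocks and let
# requests that share a prompt prefix reuse cached blocks.
kv_cache = KvCacheConfig(
    free_gpu_memory_fraction=0.9,  # assumed field: fraction of free VRAM for KV
    enable_block_reuse=True,       # assumed field: prefix sharing across requests
)

llm = LLM(model="meta-llama/Meta-Llama-3-8B", kv_cache_config=kv_cache)
```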

### Parallelism

- Tensor parallelism (TP): Split model across GPUs
- Pipeline parallelism (PP): Layer-wise distribution
- Expert parallelism: For Mixture-of-Experts models
- Multi-node: Scale beyond a single machine
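TP and PP compose; a hedged sketch of one common layout, with 8-way tensor parallelism inside each node and 2-way pipeline parallelism across two nodes (`pipeline_parallel_size` mirrors the `tensor_parallel_size` argument used elsewhere in this skill, but confirm the exact name for your release):

```python
from tensorrt_llm import LLM

# Hedged sketch: 16 GPUs = 8-way TP x 2-way PP. TP shards every layer's
# weights across the GPUs of one node (fast NVLink traffic); PP places
# contiguous layer groups on different nodes (slower inter-node traffic).
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    pipeline_parallel_size=2,  # assumed parameter name
)
```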

### Advanced features

- Speculative decoding: Faster generation with draft models
- LoRA serving: Efficient multi-adapter deployment
- Disaggregated serving: Separate prefill and generation
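To make the speculative-decoding idea concrete, here is a toy, library-independent sketch of the propose/verify loop using stand-in next-token functions (TensorRT-LLM's real implementation operates on logits and is enabled through configuration, not hand-written like this):

```python
# Toy speculative decoding with greedy verification. Both "models" are
# deterministic stand-ins; in practice the draft is a small fast LLM and
# the target is the large model being served.
def draft_next(ctx):
    return (sum(ctx) * 31 + 7) % 100

def target_next(ctx):
    s = sum(ctx)
    return (s * 31 + 7) % 100 if s % 5 else (s + 1) % 100

def speculative_step(ctx, k=4):
    # 1) Draft proposes k tokens autoregressively (cheap).
    proposal, tmp = [], list(ctx)
    for _ in range(k):
        tok = draft_next(tmp)
        proposal.append(tok)
        tmp.append(tok)
    # 2) Target verifies; in a real system all k positions are checked in
    #    one batched forward pass, which is where the speedup comes from.
    accepted = []
    for tok in proposal:
        want = target_next(ctx + accepted)
        accepted.append(want)
        if want != tok:  # first mismatch: keep the target's token, stop
            break
    return accepted      # always >= 1 token per expensive target pass

ctx = [1, 2, 3]
for _ in range(5):
    ctx += speculative_step(ctx)
print(ctx)
```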

## Common patterns

### Quantized model (FP8)

```python
from tensorrt_llm import LLM

# Load an FP8-quantized model (~2× faster, ~50% of the FP16 memory)
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B",
    dtype="fp8",
    max_num_tokens=8192,
)

# Inference is the same as before
outputs = llm.generate(["Summarize this article..."])
```

### Multi-GPU deployment

```python
from tensorrt_llm import LLM

# Tensor parallelism across 8 GPUs
llm = LLM(
    model="meta-llama/Meta-Llama-3-405B",
    tensor_parallel_size=8,
    dtype="fp8",
)
```

### Batch inference

```python
from tensorrt_llm import SamplingParams

# Process 100 prompts efficiently (llm constructed as in Basic inference)
prompts = [f"Question {i}: ..." for i in range(100)]
outputs = llm.generate(
    prompts,
    sampling_params=SamplingParams(max_tokens=200),
)
# In-flight batching handles scheduling automatically for maximum throughput
```

## Performance benchmarks

Meta Llama 3-8B (H100 GPU):

- Throughput: 24,000 tokens/sec
- Latency: ~10 ms per token
- vs PyTorch: up to 100× faster

Llama 3-70B (8× A100 80GB):

- FP8 quantization: 2× faster than FP16
- Memory: 50% reduction with FP8
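Numbers like these are straightforward to sanity-check on your own hardware; a hedged sketch that times a batch and reports tokens/sec (`token_ids` on each completion is an assumption borrowed from vLLM-style result objects; adjust if your release differs):

```python
import time

from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B")
prompts = ["Explain quantum computing"] * 64
params = SamplingParams(max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# token_ids per completion is an assumed field; count words from .text
# as a rough fallback if it is not present in your version.
total = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{total} tokens in {elapsed:.1f}s -> {total / elapsed:.0f} tok/s")
```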

## Supported models

- LLaMA family: Llama 2, Llama 3, CodeLlama
- GPT family: GPT-2, GPT-J, GPT-NeoX
- Qwen: Qwen, Qwen2, QwQ
- DeepSeek: DeepSeek-V2, DeepSeek-V3
- Mixtral: Mixtral-8x7B, Mixtral-8x22B
- Vision: LLaVA, Phi-3-vision
- 100+ models on Hugging Face

## References

- [Optimization Guide](references/optimization.md) - Quantization, batching, KV cache tuning
- [Multi-GPU Setup](references/multi-gpu.md) - Tensor/pipeline parallelism, multi-node
- [Serving Guide](references/serving.md) - Production deployment, monitoring, autoscaling

## Resources

- Docs: https://nvidia.github.io/TensorRT-LLM/
- GitHub: https://github.com/NVIDIA/TensorRT-LLM
- Models: https://huggingface.co/models?library=tensorrt_llm