Sourcify
Effortlessly find and manage open-source dependencies for your projects.

The world's fastest deep learning inference optimizer and runtime for NVIDIA GPUs.

NVIDIA TensorRT is a high-performance deep learning inference SDK designed to deliver low latency and high throughput for production applications. As of 2026, it remains the industry standard for optimizing models trained in frameworks like PyTorch and TensorFlow for deployment on NVIDIA's Blackwell and Hopper architectures.

Its architecture revolves around a specialized optimizer that performs layer and tensor fusion, kernel auto-tuning, and precision calibration (including FP8, INT8, and FP16). By converting models into highly optimized runtime engines, TensorRT maximizes the utilization of Tensor Cores.

With the integration of TensorRT-LLM, the SDK has also become a foundational layer for generative AI, offering state-of-the-art techniques such as in-flight batching and paged attention. These allow developers to scale large language models (LLMs) with up to 8x better efficiency than standard framework-native inference. TensorRT is essential for low-latency workloads such as autonomous systems, real-time video analytics, and large-scale cloud AI services, providing a unified path from training to hyper-scale deployment.
Layer and Tensor Fusion: Combines nodes in a computation graph to reduce memory bandwidth requirements and kernel launch overhead.
Precision Calibration: Uses symmetric/asymmetric quantization to convert high-precision weights to lower bit-widths without losing significant accuracy.
Multi-Stream Execution: Scalable architecture that processes multiple input streams in parallel using CUDA streams.
Kernel Auto-Tuning: Selects the fastest algorithm for a specific GPU architecture from a library of optimized kernels.
Dynamic Shapes: Enables the engine to handle variable input dimensions (e.g., different image sizes or sequence lengths) at runtime.
In-Flight Batching: A TensorRT-LLM feature that allows new requests to be added to a batch even while others are still being processed.
Weight Refitting: Allows updating model weights in an already built engine without re-running the full optimization process.
1. Install the NVIDIA driver and CUDA Toolkit (v12.x or later recommended).
2. Install TensorRT via Debian/RPM package or Python wheel (pip install tensorrt).
3. Export your trained model to ONNX format, or use the TensorRT-LLM build scripts.
4. Use the trtexec command-line tool to establish a baseline performance profile.
5. Configure precision settings: choose between FP32, FP16, INT8, or FP8 (Blackwell/Hopper).
6. For INT8 quantization, run calibration with a representative dataset to retain accuracy.
7. Build the optimized TensorRT engine using the Builder API.
8. Serialize the engine to a local file (.engine or .plan).
9. Deserialize the engine in your production C++ or Python application.
10. Deploy via NVIDIA Triton Inference Server for multi-model management and scaling.
Verified feedback from other users.
"Highly praised for unparalleled performance on NVIDIA hardware, though the learning curve for quantization and C++ integration is noted as steep."