
ONNX Runtime

Accelerate machine learning inference and training across any hardware, framework, and platform.

ONNX Runtime (ORT) is a high-performance engine designed to accelerate machine learning models across a vast spectrum of hardware and operating systems. Originally developed by Microsoft, it serves as the industry-standard execution engine for models exported in the Open Neural Network Exchange (ONNX) format.

By 2026, ORT has solidified its position as the critical middleware between high-level frameworks like PyTorch or TensorFlow and hardware-specific accelerators. Its architecture uses Execution Providers (EPs) to interface with hardware-specific libraries such as NVIDIA CUDA, TensorRT, Intel OpenVINO, and Apple CoreML. This modularity lets developers "write once, deploy anywhere" without sacrificing performance.

Beyond inference, ORT Training enables accelerated distributed training on edge devices and in the cloud. With the rise of generative AI, ORT has evolved to include specific optimizations for Large Language Models (LLMs) via DirectML and specialized kernel fusions, making it the preferred choice for local LLM execution in browser environments (WebAssembly) and mobile applications. Its 2026 market position is defined by its ubiquity in production-grade AI pipelines where latency, throughput, and hardware flexibility are non-negotiable requirements.
Execution Providers (EPs): A pluggable architecture that allows the runtime to interface dynamically with hardware-specific libraries (CUDA, ROCm, oneDNN, etc.).
Quantization: Built-in tools to convert FP32 models to INT8 or UINT8, significantly reducing model size and increasing speed on mobile/IoT.
Graph optimizations: Automatically performs constant folding, redundant-node elimination, and node fusion at runtime.
On-device training (ORT Training): Enables gradient-based training directly within the runtime, optimized for edge devices.
ONNX Runtime Web (WebAssembly): Compiles the runtime for browser execution, leveraging WebGL or WebGPU for acceleration.
IOBinding: Allows zero-copy data transfer between the application and the inference engine.
Custom operators: Allows developers to register custom C++ kernels for operations not defined in the standard ONNX spec.
Export your model from PyTorch or TensorFlow to the .onnx format using torch.onnx.export or tf2onnx.
Install the ONNX Runtime package compatible with your hardware (e.g., pip install onnxruntime-gpu).
Select and configure the appropriate Execution Provider (CUDA, TensorRT, or CPU).
Initialize an InferenceSession pointing to your .onnx model file.
Inspect model metadata to determine required input shapes and data types.
Prepare input tensors using NumPy or specialized data loaders.
Call the session's run method to perform high-speed inference.
Apply post-processing logic to the output tensors.
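For a classifier, post-processing typically means turning the raw logits tensor into probabilities and a predicted label id. A small NumPy sketch (the logits values are made up for illustration):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the row max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Pretend this came back from session.run(): one row of 3 class logits.
logits = np.array([[2.0, 1.0, 0.1]], dtype=np.float32)
probs = softmax(logits)
top_class = int(probs.argmax(axis=-1)[0])
print(top_class)  # → 0
```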
Profile the session using the built-in profiling tool to identify bottlenecks.
Deploy the optimized binary to target environments like Linux, Windows, Android, or WebAssembly.
Verified feedback from other users.
"Users praise ONNX Runtime for its incredible performance gains and hardware flexibility. It is considered the gold standard for moving ML from research to production, though some note the learning curve for complex C++ deployments."
