ONNX Runtime is a cross-platform, high-performance inference engine for Open Neural Network Exchange (ONNX) models. Developed by Microsoft, it provides optimized inference across CPUs, GPUs, and specialized accelerators through pluggable execution providers. For CUDA developers, ONNX Runtime offers a flexible deployment path that spans NVIDIA GPUs, AMD GPUs, Intel hardware, and edge devices. Its CUDA execution provider builds on cuDNN and cuBLAS, and a separate TensorRT execution provider can hand supported subgraphs to TensorRT, all while preserving model portability. This guide covers ONNX Runtime's CUDA execution provider, graph optimizations, quantization, TensorRT integration, and best practices for production ML deployment.
CUDA Integration: ONNX Runtime's CUDA execution provider uses cuDNN for convolutions and cuBLAS for matrix operations; the TensorRT execution provider can take over supported subgraphs for maximum performance. ONNX Runtime also applies graph optimizations such as operator fusion, constant folding, and layout transformation for efficient GPU execution.
Install ONNX Runtime with GPU support.
# Install ONNX Runtime with CUDA support
pip install onnxruntime-gpu
# Or install latest nightly
pip install ort-nightly-gpu
# Verify installation
python -c "import onnxruntime as ort; print(f'ONNX Runtime {ort.__version__}')"
# Check available execution providers
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
# Should include: ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
# Install additional tools
pip install onnx # For model manipulation
pip install onnxmltools # For model conversion
Load and run an ONNX model on the GPU.
import onnxruntime as ort
import numpy as np
# Create inference session with CUDA provider
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
session = ort.InferenceSession('model.onnx', providers=providers)
# Check which provider is being used
print(f"Execution provider: {session.get_providers()}")
# Get input/output names and shapes
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
print(f"Input: {input_name}, shape: {session.get_inputs()[0].shape}")
print(f"Output: {output_name}, shape: {session.get_outputs()[0].shape}")
# Prepare input data
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
# Run inference
outputs = session.run([output_name], {input_name: input_data})
result = outputs[0]
print(f"Output shape: {result.shape}")
# Configure CUDA provider with options
cuda_provider_options = {
    'device_id': 0,
    'arena_extend_strategy': 'kSameAsRequested',
    'gpu_mem_limit': 2 * 1024 * 1024 * 1024,  # 2GB
    'cudnn_conv_algo_search': 'EXHAUSTIVE',
    'do_copy_in_default_stream': True,
}
session = ort.InferenceSession(
    'model.onnx',
    providers=[
        ('CUDAExecutionProvider', cuda_provider_options),
        'CPUExecutionProvider'
    ]
)
# Batch inference
batch_size = 16
batch_data = np.random.randn(batch_size, 3, 224, 224).astype(np.float32)
batch_outputs = session.run([output_name], {input_name: batch_data})
Use the TensorRT execution provider and model quantization.
import onnxruntime as ort
import numpy as np
from onnxruntime.quantization import quantize_dynamic, QuantType
# Dynamic quantization to INT8
quantize_dynamic(
    model_input='model.onnx',
    model_output='model_quantized.onnx',
    weight_type=QuantType.QInt8
)
# TensorRT execution provider for maximum performance
trt_provider_options = {
    'device_id': 0,
    'trt_max_workspace_size': 2147483648,  # 2GB
    'trt_fp16_enable': True,
    'trt_int8_enable': False,
    'trt_engine_cache_enable': True,
    'trt_engine_cache_path': './trt_cache'
}
session = ort.InferenceSession(
    'model.onnx',
    providers=[
        ('TensorrtExecutionProvider', trt_provider_options),
        ('CUDAExecutionProvider', {}),
        'CPUExecutionProvider'
    ]
)
# Session with graph optimizations
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.optimized_model_filepath = 'model_optimized.onnx'
# Enable profiling
sess_options.enable_profiling = True
session = ort.InferenceSession(
    'model.onnx',
    sess_options,
    providers=['CUDAExecutionProvider']
)
# Run with profiling
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
output = session.run(None, {session.get_inputs()[0].name: input_data})
# Get profiling results
prof_file = session.end_profiling()
print(f"Profiling data saved to: {prof_file}")
# IO Binding for zero-copy inference
import torch
# Create PyTorch tensors on GPU
device = torch.device('cuda:0')
input_tensor = torch.randn(1, 3, 224, 224).to(device)
output_tensor = torch.empty(1, 1000).to(device)
# Create IO binding
io_binding = session.io_binding()
# Bind input
io_binding.bind_input(
    name=session.get_inputs()[0].name,
    device_type='cuda',
    device_id=0,
    element_type=np.float32,
    shape=input_tensor.shape,
    buffer_ptr=input_tensor.data_ptr()
)
# Bind output
io_binding.bind_output(
    name=session.get_outputs()[0].name,
    device_type='cuda',
    device_id=0,
    element_type=np.float32,
    shape=output_tensor.shape,
    buffer_ptr=output_tensor.data_ptr()
)
# Run with IO binding (zero-copy)
session.run_with_iobinding(io_binding)
# Result is directly in output_tensor
print(f"Output on GPU: {output_tensor.shape}")
# Multi-model inference
class MultiModelInference:
    def __init__(self, model_paths):
        self.sessions = []
        for path in model_paths:
            session = ort.InferenceSession(
                path,
                providers=['CUDAExecutionProvider']
            )
            self.sessions.append(session)

    def run_pipeline(self, input_data):
        result = input_data
        for session in self.sessions:
            input_name = session.get_inputs()[0].name
            output_name = session.get_outputs()[0].name
            result = session.run([output_name], {input_name: result})[0]
        return result
pipeline = MultiModelInference(['model1.onnx', 'model2.onnx', 'model3.onnx'])
final_output = pipeline.run_pipeline(input_data)
The TensorRT provider is typically faster than the CUDA provider. Enable FP16 on GPUs with Tensor Cores. The first run builds the TensorRT engine (slow); subsequent runs are fast.
Set graph_optimization_level to ORT_ENABLE_ALL for operator fusion, constant folding, and layout optimization.
IO binding eliminates CPU-GPU transfers when working with GPU tensors from PyTorch/TensorFlow.
Dynamic or static quantization reduces model size and speeds up inference with minimal accuracy loss; a static-quantization sketch follows these tips.
Enable trt_engine_cache_enable to avoid rebuilding TensorRT engines; this can save minutes on startup.
For cudnn_conv_algo_search, EXHAUSTIVE finds the fastest convolution algorithms but takes longer on the first run; DEFAULT gives faster startup.
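The earlier quantization example is dynamic; static quantization, which also quantizes activations using calibration data, is not shown above. Here is a minimal hedged sketch using onnxruntime.quantization.quantize_static; the calibration reader class, the input name 'input', and the random calibration batches are placeholders for your own model and representative data.
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static
# Calibration reader: feeds a few representative batches to the quantizer.
class RandomCalibrationReader(CalibrationDataReader):
    def __init__(self, input_name, batches):
        self.input_name = input_name  # model input name (assumed to be 'input' here)
        self.batches = iter(batches)
    def get_next(self):
        # Return {input_name: ndarray} per batch, or None when calibration data runs out.
        batch = next(self.batches, None)
        return None if batch is None else {self.input_name: batch}
# Random arrays stand in for real calibration samples.
calibration_batches = [np.random.randn(1, 3, 224, 224).astype(np.float32) for _ in range(8)]
quantize_static(
    model_input='model.onnx',
    model_output='model_static_int8.onnx',
    calibration_data_reader=RandomCalibrationReader('input', calibration_batches),
    weight_type=QuantType.QInt8
)
Static quantization generally preserves accuracy better than dynamic quantization for convolution-heavy models because activation ranges are calibrated offline on representative data.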
| Task | Performance | Notes |
|---|---|---|
| ResNet-50 (CUDA provider) | 5ms | Batch=1, FP32, RTX 4090 |
| ResNet-50 (TensorRT provider) | 2ms | Batch=1, FP16, RTX 4090 |
| BERT-Base (TensorRT FP16) | 3ms | Seq len=128, RTX 4090 |
| TensorRT vs CUDA speedup | 1.5-3x | Varies by model |
The CUDA provider runs the graph with cuDNN/cuBLAS kernels directly. The TensorRT provider converts supported subgraphs into TensorRT engines for additional optimizations (layer fusion, precision calibration). TensorRT is usually faster but has a longer startup time.
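To check how large that gap is for a specific model, a simple timing comparison between the two providers is usually enough. This is a rough sketch rather than a rigorous benchmark; 'model.onnx' and the input shape are placeholders, and the timings include host-to-device copies.
import time
import numpy as np
import onnxruntime as ort
def average_latency_ms(providers, runs=100):
    session = ort.InferenceSession('model.onnx', providers=providers)
    input_name = session.get_inputs()[0].name
    data = np.random.randn(1, 3, 224, 224).astype(np.float32)
    # Warm-up: triggers graph optimization and, for TensorRT, engine building.
    for _ in range(10):
        session.run(None, {input_name: data})
    start = time.perf_counter()
    for _ in range(runs):
        session.run(None, {input_name: data})
    return (time.perf_counter() - start) / runs * 1000
print(f"CUDA EP:     {average_latency_ms(['CUDAExecutionProvider', 'CPUExecutionProvider']):.2f} ms")
print(f"TensorRT EP: {average_latency_ms(['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']):.2f} ms")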
PyTorch: use torch.onnx.export(). TensorFlow: use tf2onnx. Most models convert cleanly, but custom ops may need special handling.
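As a concrete example of the PyTorch path, the sketch below exports a torchvision ResNet-50 with a dynamic batch dimension; the model choice, file name, and opset version are illustrative, and a recent torchvision (with the weights argument) is assumed.
import torch
import torchvision
# Any torch.nn.Module can be exported the same way; ResNet-50 is just an example.
model = torchvision.models.resnet50(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    'resnet50.onnx',
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch'}, 'output': {0: 'batch'}},  # allow variable batch size
    opset_version=17
)
The exported file can then be loaded with ort.InferenceSession exactly as in the earlier examples.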
ONNX Runtime supports multi-GPU inference by creating a separate session per GPU (selected via the CUDA provider's device_id). For data parallelism, manually shard batches across those sessions; there is no built-in multi-GPU mechanism comparable to PyTorch DDP.
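A hedged sketch of that manual data-parallel pattern, assuming two GPUs and a model exported with a dynamic batch dimension; the device IDs, batch size, and thread-based dispatch are illustrative.
import numpy as np
import onnxruntime as ort
from concurrent.futures import ThreadPoolExecutor
# One session per GPU, selected through the CUDA provider's device_id option.
sessions = [
    ort.InferenceSession(
        'model.onnx',
        providers=[('CUDAExecutionProvider', {'device_id': gpu}), 'CPUExecutionProvider']
    )
    for gpu in (0, 1)
]
input_name = sessions[0].get_inputs()[0].name
def run_shard(session, shard):
    return session.run(None, {input_name: shard})[0]
# Shard the batch across GPUs and run the shards concurrently.
batch = np.random.randn(32, 3, 224, 224).astype(np.float32)
shards = np.array_split(batch, len(sessions))
with ThreadPoolExecutor(max_workers=len(sessions)) as pool:
    outputs = list(pool.map(run_shard, sessions, shards))
result = np.concatenate(outputs, axis=0)
Plain threads are sufficient here because ONNX Runtime releases the GIL while a run executes on the GPU.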
The first run includes graph optimization and, with the TensorRT provider, engine building. Enable engine caching and save the optimized graph to avoid repeating this overhead on every startup.
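One way to pay that cost only once is sketched below: run full graph optimization in an offline step, write the optimized graph to disk, then load the pre-optimized file at serving time with the optimizer disabled. This assumes the serving machine matches the hardware used for the offline step; the file names are placeholders, and for the TensorRT provider the engine cache shown earlier plays the analogous role.
import onnxruntime as ort
# Offline step: optimize once and save the result next to the original model.
build_options = ort.SessionOptions()
build_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
build_options.optimized_model_filepath = 'model_optimized.onnx'
ort.InferenceSession('model.onnx', build_options, providers=['CUDAExecutionProvider'])
# Serving step: load the pre-optimized graph and skip re-running the optimizer.
serve_options = ort.SessionOptions()
serve_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
session = ort.InferenceSession(
    'model_optimized.onnx',
    serve_options,
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)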
Compared with alternatives: standalone TensorRT is NVIDIA-only and faster but less portable; PyTorch is a training framework with less optimized inference; TensorFlow is a training framework, with TF Lite covering mobile.
Optimize your ONNX Runtime CUDA code with RightNow AI - get real-time performance suggestions and memory analysis.