ONNX Runtime is a cross-platform, high-performance inference engine for Open Neural Network Exchange (ONNX) models. Developed by Microsoft, it provides optimized inference across CPUs, GPUs, and specialized accelerators through pluggable execution providers. For CUDA developers, ONNX Runtime offers a flexible deployment path that spans NVIDIA GPUs, AMD GPUs, Intel hardware, and edge devices. Its CUDA execution provider builds on cuDNN and cuBLAS, and a separate TensorRT execution provider can hand supported subgraphs to TensorRT, all while preserving model portability. This guide covers ONNX Runtime's CUDA execution provider, graph optimizations, quantization, TensorRT integration, and best practices for production ML deployment.
CUDA Integration: ONNX Runtime's CUDA execution provider uses cuDNN for convolutions and cuBLAS for matrix operations; the TensorRT execution provider can take over supported subgraphs for maximum performance. ONNX Runtime also applies graph optimizations such as operator fusion, constant folding, and layout transformation for efficient GPU execution.
Install ONNX Runtime with GPU support.
# Install ONNX Runtime with CUDA support
pip install onnxruntime-gpu
# Or install latest nightly
pip install ort-nightly-gpu
# Verify installation
python -c "import onnxruntime as ort; print(f'ONNX Runtime {ort.__version__}')"
# Check available execution providers
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
# Should include: ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
# Install additional tools
pip install onnx # For model manipulation
pip install onnxmltools # For model conversion
Load and run an ONNX model on the GPU.
import onnxruntime as ort
import numpy as np
# Create inference session with CUDA provider
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
session = ort.InferenceSession('model.onnx', providers=providers)
# Check which provider is being used
print(f"Execution provider: {session.get_providers()}")
# Get input/output names and shapes
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
print(f"Input: {input_name}, shape: {session.get_inputs()[0].shape}")
print(f"Output: {output_name}, shape: {session.get_outputs()[0].shape}")
# Prepare input data
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
# Run inference
outputs = session.run([output_name], {input_name: input_data})
result = outputs[0]
print(f"Output shape: {result.shape}")
# Configure CUDA provider with options
cuda_provider_options = {
    'device_id': 0,
    'arena_extend_strategy': 'kSameAsRequested',
    'gpu_mem_limit': 2 * 1024 * 1024 * 1024,  # 2GB
    'cudnn_conv_algo_search': 'EXHAUSTIVE',
    'do_copy_in_default_stream': True,
}
session = ort.InferenceSession(
    'model.onnx',
    providers=[
        ('CUDAExecutionProvider', cuda_provider_options),
        'CPUExecutionProvider'
    ]
)
# Batch inference
batch_size = 16
batch_data = np.random.randn(batch_size, 3, 224, 224).astype(np.float32)
batch_outputs = session.run([output_name], {input_name: batch_data})
Use the TensorRT execution provider and model quantization.
import onnxruntime as ort
import numpy as np
from onnxruntime.quantization import quantize_dynamic, QuantType
# Dynamic quantization to INT8
quantize_dynamic(
    model_input='model.onnx',
    model_output='model_quantized.onnx',
    weight_type=QuantType.QInt8
)
# TensorRT execution provider for maximum performance
trt_provider_options = {
    'device_id': 0,
    'trt_max_workspace_size': 2147483648,  # 2GB
    'trt_fp16_enable': True,
    'trt_int8_enable': False,
    'trt_engine_cache_enable': True,
    'trt_engine_cache_path': './trt_cache'
}
session = ort.InferenceSession(
    'model.onnx',
    providers=[
        ('TensorrtExecutionProvider', trt_provider_options),
        ('CUDAExecutionProvider', {}),
        'CPUExecutionProvider'
    ]
)
# Session with graph optimizations
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.optimized_model_filepath = 'model_optimized.onnx'
# Enable profiling
sess_options.enable_profiling = True
session = ort.InferenceSession(
    'model.onnx',
    sess_options,
    providers=['CUDAExecutionProvider']
)
# Run with profiling
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
output = session.run(None, {session.get_inputs()[0].name: input_data})
# Get profiling results
prof_file = session.end_profiling()
print(f"Profiling data saved to: {prof_file}")
# IO Binding for zero-copy inference
import torch
# Create PyTorch tensors on GPU
device = torch.device('cuda:0')
input_tensor = torch.randn(1, 3, 224, 224).to(device)
output_tensor = torch.empty(1, 1000).to(device)
# Create IO binding
io_binding = session.io_binding()
# Bind input
io_binding.bind_input(
    name=session.get_inputs()[0].name,
    device_type='cuda',
    device_id=0,
    element_type=np.float32,
    shape=input_tensor.shape,
    buffer_ptr=input_tensor.data_ptr()
)
# Bind output
io_binding.bind_output(
    name=session.get_outputs()[0].name,
    device_type='cuda',
    device_id=0,
    element_type=np.float32,
    shape=output_tensor.shape,
    buffer_ptr=output_tensor.data_ptr()
)
# Run with IO binding (zero-copy)
session.run_with_iobinding(io_binding)
# Result is directly in output_tensor
print(f"Output on GPU: {output_tensor.shape}")
# Multi-model inference
class MultiModelInference:
    def __init__(self, model_paths):
        self.sessions = []
        for path in model_paths:
            session = ort.InferenceSession(
                path,
                providers=['CUDAExecutionProvider']
            )
            self.sessions.append(session)

    def run_pipeline(self, input_data):
        result = input_data
        for session in self.sessions:
            input_name = session.get_inputs()[0].name
            output_name = session.get_outputs()[0].name
            result = session.run([output_name], {input_name: result})[0]
        return result
pipeline = MultiModelInference(['model1.onnx', 'model2.onnx', 'model3.onnx'])
final_output = pipeline.run_pipeline(input_data)
The TensorRT provider is typically faster than the CUDA provider. Enable FP16 on GPUs with Tensor Cores. The first run builds the TensorRT engine (slow); subsequent runs are fast.
Set graph_optimization_level to ORT_ENABLE_ALL for operator fusion, constant folding, and layout optimization.
IO binding eliminates CPU-GPU transfers when working with GPU tensors from PyTorch/TensorFlow.
Dynamic or static quantization reduces model size and speeds up inference with minimal accuracy loss; a static-quantization sketch follows these tips.
Enable trt_engine_cache_enable to avoid rebuilding TensorRT engines; this can save minutes on startup.
For cudnn_conv_algo_search, EXHAUSTIVE finds the fastest convolution algorithms but takes longer on the first run; DEFAULT gives faster startup.
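The earlier quantization example is dynamic; static quantization, which also quantizes activations using calibration data, is not shown above. Here is a minimal hedged sketch using onnxruntime.quantization.quantize_static; the calibration reader class, the input name 'input', and the random calibration batches are placeholders for your own model and representative data.
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static
# Calibration reader: feeds a few representative batches to the quantizer.
class RandomCalibrationReader(CalibrationDataReader):
    def __init__(self, input_name, batches):
        self.input_name = input_name  # model input name (assumed to be 'input' here)
        self.batches = iter(batches)
    def get_next(self):
        # Return {input_name: ndarray} per batch, or None when calibration data runs out.
        batch = next(self.batches, None)
        return None if batch is None else {self.input_name: batch}
# Random arrays stand in for real calibration samples.
calibration_batches = [np.random.randn(1, 3, 224, 224).astype(np.float32) for _ in range(8)]
quantize_static(
    model_input='model.onnx',
    model_output='model_static_int8.onnx',
    calibration_data_reader=RandomCalibrationReader('input', calibration_batches),
    weight_type=QuantType.QInt8
)
Static quantization generally preserves accuracy better than dynamic quantization for convolution-heavy models because activation ranges are calibrated offline on representative data.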
| Task | Performance | Notes |
|---|---|---|
| ResNet-50 (CUDA provider) | 5ms | Batch=1, FP32, RTX 4090 |
| ResNet-50 (TensorRT provider) | 2ms | Batch=1, FP16, RTX 4090 |
| BERT-Base (TensorRT FP16) | 3ms | Seq len=128, RTX 4090 |
| TensorRT vs CUDA speedup | 1.5-3x | Varies by model |
The CUDA provider runs the graph with cuDNN/cuBLAS kernels directly. The TensorRT provider converts supported subgraphs into TensorRT engines for additional optimizations (layer fusion, precision calibration). TensorRT is usually faster but has a longer startup time.
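To check how large that gap is for a specific model, a simple timing comparison between the two providers is usually enough. This is a rough sketch rather than a rigorous benchmark; 'model.onnx' and the input shape are placeholders, and the timings include host-to-device copies.
import time
import numpy as np
import onnxruntime as ort
def average_latency_ms(providers, runs=100):
    session = ort.InferenceSession('model.onnx', providers=providers)
    input_name = session.get_inputs()[0].name
    data = np.random.randn(1, 3, 224, 224).astype(np.float32)
    # Warm-up: triggers graph optimization and, for TensorRT, engine building.
    for _ in range(10):
        session.run(None, {input_name: data})
    start = time.perf_counter()
    for _ in range(runs):
        session.run(None, {input_name: data})
    return (time.perf_counter() - start) / runs * 1000
print(f"CUDA EP:     {average_latency_ms(['CUDAExecutionProvider', 'CPUExecutionProvider']):.2f} ms")
print(f"TensorRT EP: {average_latency_ms(['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']):.2f} ms")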
PyTorch: use torch.onnx.export(). TensorFlow: use tf2onnx. Most models convert cleanly, but custom ops may need special handling.
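As a concrete example of the PyTorch path, the sketch below exports a torchvision ResNet-50 with a dynamic batch dimension; the model choice, file name, and opset version are illustrative, and a recent torchvision (with the weights argument) is assumed.
import torch
import torchvision
# Any torch.nn.Module can be exported the same way; ResNet-50 is just an example.
model = torchvision.models.resnet50(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    'resnet50.onnx',
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch'}, 'output': {0: 'batch'}},  # allow variable batch size
    opset_version=17
)
The exported file can then be loaded with ort.InferenceSession exactly as in the earlier examples.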
ONNX Runtime supports multi-GPU inference by creating a separate session per GPU (selected via the CUDA provider's device_id). For data parallelism, manually shard batches across those sessions; there is no built-in multi-GPU mechanism comparable to PyTorch DDP.
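A hedged sketch of that manual data-parallel pattern, assuming two GPUs and a model exported with a dynamic batch dimension; the device IDs, batch size, and thread-based dispatch are illustrative.
import numpy as np
import onnxruntime as ort
from concurrent.futures import ThreadPoolExecutor
# One session per GPU, selected through the CUDA provider's device_id option.
sessions = [
    ort.InferenceSession(
        'model.onnx',
        providers=[('CUDAExecutionProvider', {'device_id': gpu}), 'CPUExecutionProvider']
    )
    for gpu in (0, 1)
]
input_name = sessions[0].get_inputs()[0].name
def run_shard(session, shard):
    return session.run(None, {input_name: shard})[0]
# Shard the batch across GPUs and run the shards concurrently.
batch = np.random.randn(32, 3, 224, 224).astype(np.float32)
shards = np.array_split(batch, len(sessions))
with ThreadPoolExecutor(max_workers=len(sessions)) as pool:
    outputs = list(pool.map(run_shard, sessions, shards))
result = np.concatenate(outputs, axis=0)
Plain threads are sufficient here because ONNX Runtime releases the GIL while a run executes on the GPU.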
The first run includes graph optimization and, with the TensorRT provider, engine building. Enable engine caching and save the optimized graph to avoid repeating this overhead on every startup.
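One way to pay that cost only once is sketched below: run full graph optimization in an offline step, write the optimized graph to disk, then load the pre-optimized file at serving time with the optimizer disabled. This assumes the serving machine matches the hardware used for the offline step; the file names are placeholders, and for the TensorRT provider the engine cache shown earlier plays the analogous role.
import onnxruntime as ort
# Offline step: optimize once and save the result next to the original model.
build_options = ort.SessionOptions()
build_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
build_options.optimized_model_filepath = 'model_optimized.onnx'
ort.InferenceSession('model.onnx', build_options, providers=['CUDAExecutionProvider'])
# Serving step: load the pre-optimized graph and skip re-running the optimizer.
serve_options = ort.SessionOptions()
serve_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
session = ort.InferenceSession(
    'model_optimized.onnx',
    serve_options,
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)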
Compared with alternatives: standalone TensorRT is NVIDIA-only and faster but less portable; PyTorch is a training framework with less optimized inference; TensorFlow is a training framework, with TF Lite covering mobile.
Optimize your ONNX Runtime CUDA code with RightNow AI - get real-time performance suggestions and memory analysis.