Apache TVM is an open-source machine learning compiler stack that optimizes deep learning models for CPUs, GPUs, and specialized accelerators. It automatically generates efficient code from high-level model descriptions, with advanced auto-tuning to find optimal implementations for your specific hardware. For CUDA developers, TVM provides a unique approach: instead of using hand-coded kernels, it generates optimized CUDA code through a compilation pipeline with auto-scheduling. This enables portable performance across different GPU architectures without manual optimization per device. This guide covers TVM's Relay IR, AutoTVM/AutoScheduler for auto-tuning, CUDA code generation, operator fusion strategies, and deployment workflows for production ML systems.
CUDA Integration: TVM generates CUDA code through its compilation pipeline: Relay IR → TE (Tensor Expression) → Schedule → CUDA/PTX. It uses cuBLAS and cuDNN when beneficial but can also generate custom kernels. AutoScheduler explores the optimization space to find optimal tiling, unrolling, and vectorization parameters.
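To make the pipeline concrete, here is a minimal sketch that lowers a hand-written Tensor Expression to CUDA and prints the generated kernel (the vector-add and the name vadd are illustrative, not tied to any model):
import tvm
from tvm import te
# Define the computation as a Tensor Expression (TE)
n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")
# Schedule: tile the loop and map tiles to CUDA blocks/threads
s = te.create_schedule(C.op)
bx, tx = s[C].split(C.op.axis[0], factor=128)
s[C].bind(bx, te.thread_axis("blockIdx.x"))
s[C].bind(tx, te.thread_axis("threadIdx.x"))
# Lower and compile for CUDA, then inspect the generated source
fadd = tvm.build(s, [A, B, C], target="cuda", name="vadd")
print(fadd.imported_modules[0].get_source())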
Install TVM from pip or build from source.
# Install from pip (basic)
pip install apache-tvm
# For GPU support, install a CUDA-enabled tlcpack wheel
# (pick the suffix matching your CUDA version, e.g. cu116)
pip install tlcpack-nightly-cu116 -f https://tlcpack.ai/wheels
# Or build from source for latest features
git clone --recursive https://github.com/apache/tvm tvm
cd tvm
mkdir build && cd build
cp ../cmake/config.cmake .
# Edit config.cmake: set USE_CUDA ON, set CUDA path
cmake ..
make -j$(nproc)
# Set Python path
export PYTHONPATH=$PYTHONPATH:~/tvm/python
# Verify installation
python -c "import tvm; print(f'TVM {tvm.__version__}')"
python -c "import tvm; print(f'CUDA available: {tvm.cuda().exist}')"Basic model compilation and inference.
import tvm
from tvm import relay
from tvm.contrib import graph_executor
import torch
import torchvision
import numpy as np
# Load PyTorch model
model = torchvision.models.resnet18(pretrained=True)
model.eval()
# Trace the model
input_shape = (1, 3, 224, 224)
input_data = torch.randn(input_shape)
scripted_model = torch.jit.trace(model, input_data).eval()
# Convert to TVM Relay
input_name = "input0"
shape_list = [(input_name, input_shape)]
mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)
# Compile for CUDA
target = tvm.target.cuda()
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)
# Create runtime module
dev = tvm.cuda(0)
module = graph_executor.GraphModule(lib["default"](dev))
# Run inference
input_data = np.random.randn(*input_shape).astype(np.float32)
module.set_input(input_name, input_data)
module.run()
output = module.get_output(0).numpy()
print(f"Output shape: {output.shape}")
# Save compiled module
lib.export_library("resnet18_cuda.so")
# Load and use later
loaded_lib = tvm.runtime.load_module("resnet18_cuda.so")
module = graph_executor.GraphModule(loaded_lib["default"](dev))
Use AutoTVM to auto-tune kernels for your GPU.
import tvm
from tvm import relay, autotvm
from tvm.contrib import graph_executor
import torch
import torchvision
# Load model
model = torchvision.models.resnet50(pretrained=True).eval()
input_shape = (1, 3, 224, 224)
input_data = torch.randn(input_shape)
scripted_model = torch.jit.trace(model, input_data).eval()
# Convert to Relay
mod, params = relay.frontend.from_pytorch(
    scripted_model,
    [("input0", input_shape)]
)
# AutoTVM tuning configuration
target = tvm.target.cuda()
log_file = "resnet50_cuda.log"
# Extract tuning tasks
tasks = autotvm.task.extract_from_program(
    mod["main"],
    target=target,
    params=params
)
print(f"Found {len(tasks)} tuning tasks")
# Configure tuner
tuning_option = {
    'log_filename': log_file,
    'tuner': 'xgb',           # XGBoost-based tuner
    'n_trial': 2000,          # Number of trials per task
    'early_stopping': 600,
    'measure_option': autotvm.measure_option(
        builder=autotvm.LocalBuilder(timeout=10),
        runner=autotvm.LocalRunner(
            number=20,
            repeat=3,
            timeout=4,
            min_repeat_ms=150
        ),
    ),
}
# Run tuning
for i, task in enumerate(tasks):
    prefix = f"[Task {i+1}/{len(tasks)}] "
    tuner_obj = autotvm.tuner.XGBTuner(task, loss_type='rank')
    tuner_obj.tune(
        n_trial=min(tuning_option['n_trial'], len(task.config_space)),
        early_stopping=tuning_option['early_stopping'],
        measure_option=tuning_option['measure_option'],
        callbacks=[
            autotvm.callback.progress_bar(tuning_option['n_trial'], prefix=prefix),
            autotvm.callback.log_to_file(log_file)
        ]
    )
# Compile with tuned kernels
with autotvm.apply_history_best(log_file):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
# Benchmark
dev = tvm.cuda(0)
module = graph_executor.GraphModule(lib["default"](dev))
import timeit
input_np = torch.randn(input_shape).numpy()
module.set_input("input0", input_np)
# Warmup
for _ in range(10):
    module.run()
# Benchmark
timer = module.module.time_evaluator("run", dev, number=100, repeat=3)
prof_res = timer()
print(f"Mean inference time: {prof_res.mean * 1000:.2f} ms")
print(f"Std dev: {prof_res.std * 1000:.2f} ms")
# Using AutoScheduler (Ansor): the newer, template-free tuner
from tvm import auto_scheduler
# AutoScheduler records are JSON and incompatible with AutoTVM logs,
# so use a separate file
log_file = "resnet50_autoscheduler.json"
# Extract tasks for auto-scheduler
tasks, task_weights = auto_scheduler.extract_tasks(
    mod["main"],
    params,
    target
)
# Tuning (num_measure_trials is the total budget across all tasks)
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=20000,
    # enable_cpu_cache_flush is a CPU-only option; for GPU measurements
    # rely on min_repeat_ms for stable timings
    runner=auto_scheduler.LocalRunner(repeat=10, min_repeat_ms=300),
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
tuner.tune(tune_option)
# Compile with auto-scheduled kernels (the PassContext config is required
# for relay.build to pick up AutoScheduler records)
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(opt_level=3, config={"relay.backend.use_auto_scheduler": True}):
        lib = relay.build(mod, target=target, params=params)
Default schedules are generic: tuning finds GPU-specific tiling, unrolling, and vectorization, and typically yields a 2-10x speedup over the untuned build, which makes it essential for production.
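A minimal sketch of measuring that gain, reusing mod, params, target, dev, and input_np from the AutoTVM example above (the helper bench_ms is illustrative):
# Benchmark the same model built with default vs tuned schedules
def bench_ms(lib):
    m = graph_executor.GraphModule(lib["default"](dev))
    m.set_input("input0", input_np)
    return m.module.time_evaluator("run", dev, number=100)().mean * 1000
with tvm.transform.PassContext(opt_level=3):
    base_lib = relay.build(mod, target=target, params=params)  # default schedules
with autotvm.apply_history_best("resnet50_cuda.log"):
    with tvm.transform.PassContext(opt_level=3):
        tuned_lib = relay.build(mod, target=target, params=params)  # tuned schedules
print(f"default: {bench_ms(base_lib):.2f} ms, tuned: {bench_ms(tuned_lib):.2f} ms")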
AutoScheduler (Ansor) is newer and often finds better schedules automatically without manual template design.
Set opt_level=3 to enable operator fusion, constant folding, and dead code elimination; fusion in particular reduces kernel launches.
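For example, you can print the optimized Relay module to see what fusion produced (a sketch, assuming mod, params, and target from the examples above):
# Run Relay's graph-level pipeline and inspect the result; fused operator
# groups appear as nested functions marked Primitive=1
with tvm.transform.PassContext(opt_level=3):
    opt_mod, _ = relay.optimize(mod, target=target, params=params)
print(opt_mod)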
Optimal kernel parameters depend on batch size. Tune separately for batch=1 (latency) and larger batches (throughput).
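Input shapes are baked into the compiled module, so the usual pattern is one build (and ideally one tuning log) per batch size; a sketch reusing the PyTorch model from above:
# Compile one module per batch size; each becomes its own artifact
for batch in (1, 8):
    shape = (batch, 3, 224, 224)
    traced = torch.jit.trace(model, torch.randn(shape)).eval()
    mod_b, params_b = relay.frontend.from_pytorch(traced, [("input0", shape)])
    with tvm.transform.PassContext(opt_level=3):
        lib_b = relay.build(mod_b, target=target, params=params_b)
    lib_b.export_library(f"resnet50_batch{batch}.so")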
Let TVM use cuDNN/cuBLAS for standard ops while generating custom code for fused operators.
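This is controlled through the target string (requires a TVM build with USE_CUDNN and USE_CUBLAS enabled in config.cmake):
# Offload matching ops (conv2d, dense, ...) to cuDNN/cuBLAS; TVM still
# generates and fuses custom CUDA code for everything else
target = tvm.target.Target("cuda -libs=cudnn,cublas")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)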
Tuning takes hours. Save logs and reuse for similar models. Transfer learning of schedules is possible.
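One way to make AutoTVM logs easy to reuse is to compact them to the best record per workload before sharing (a sketch using autotvm.record.pick_best; file names are illustrative):
# Keep only the best configuration per workload from a large tuning log
from tvm import autotvm
autotvm.record.pick_best("resnet50_cuda.log", "resnet50_cuda.best.log")
# Later builds can apply the compact log directly:
# with autotvm.apply_history_best("resnet50_cuda.best.log"): ...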
| Task | Result | Notes |
|---|---|---|
| ResNet-50 (tuned) | 3 ms | Batch=1, RTX 3090, AutoScheduler |
| BERT-Base (tuned) | 8 ms | Seq=128, RTX 3090 |
| Tuning speedup over default | 3-8x | Varies by model |
| vs PyTorch eager | 1.5-3x | After tuning |
How does TVM compare to TensorRT? TensorRT is usually faster on NVIDIA GPUs but is NVIDIA-only; TVM is cross-platform and also targets AMD, ARM, RISC-V, and more. Use TensorRT for NVIDIA-only deployment and TVM for portability.
How long does tuning take? It depends on model complexity and n_trial; expect 1-8 hours for full model tuning. Use pre-tuned schedules from the TVM community when available.
Can TVM train models? No, TVM is inference-only. Use torch.compile for PyTorch training; TVM is for deploying trained models with optimal inference speed.
What is Relay? Relay is TVM's high-level intermediate representation for neural networks; it enables graph-level optimizations before lowering to hardware-specific code.
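A small sketch of the Relay text format for a two-op graph:
import tvm
from tvm import relay
# Build a tiny graph by hand and print its IR
x = relay.var("x", shape=(1, 64), dtype="float32")
w = relay.var("w", shape=(32, 64), dtype="float32")
y = relay.nn.relu(relay.nn.dense(x, w))
mod = tvm.IRModule.from_expr(relay.Function([x, w], y))
print(mod)  # human-readable Relay IR, before optimization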
Alternatives at a glance:
- TensorRT: NVIDIA-only, faster, less portable
- ONNX Runtime: easier to use, less optimization control
- Triton: for custom kernels, not full model compilation
Optimize your TVM CUDA code with RightNow AI - get real-time performance suggestions and memory analysis.