Apache TVM is an open-source machine learning compiler stack that optimizes deep learning models for CPUs, GPUs, and specialized accelerators. It automatically generates efficient code from high-level model descriptions, with advanced auto-tuning to find optimal implementations for your specific hardware. For CUDA developers, TVM provides a unique approach: instead of using hand-coded kernels, it generates optimized CUDA code through a compilation pipeline with auto-scheduling. This enables portable performance across different GPU architectures without manual optimization per device. This guide covers TVM's Relay IR, AutoTVM/AutoScheduler for auto-tuning, CUDA code generation, operator fusion strategies, and deployment workflows for production ML systems.
CUDA Integration: TVM generates CUDA code through its compilation pipeline: Relay IR → TE (Tensor Expression) → Schedule → CUDA/PTX. It uses cuBLAS and cuDNN when beneficial but can also generate custom kernels. AutoScheduler explores the optimization space to find optimal tiling, unrolling, and vectorization parameters.
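To make the pipeline concrete, here is a minimal sketch that lowers a hand-written Tensor Expression to CUDA and prints the generated kernel (the vector-add and the name vadd are illustrative, not tied to any model):
import tvm
from tvm import te
# Define the computation as a Tensor Expression (TE)
n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")
# Schedule: tile the loop and map tiles to CUDA blocks/threads
s = te.create_schedule(C.op)
bx, tx = s[C].split(C.op.axis[0], factor=128)
s[C].bind(bx, te.thread_axis("blockIdx.x"))
s[C].bind(tx, te.thread_axis("threadIdx.x"))
# Lower and compile for CUDA, then inspect the generated source
fadd = tvm.build(s, [A, B, C], target="cuda", name="vadd")
print(fadd.imported_modules[0].get_source())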
Install TVM from pip or build from source.
# Install from pip (basic)
pip install apache-tvm
# For GPU support, install a CUDA-enabled tlcpack wheel
# (pick the suffix matching your CUDA version, e.g. cu116)
pip install tlcpack-nightly-cu116 -f https://tlcpack.ai/wheels
# Or build from source for latest features
git clone --recursive https://github.com/apache/tvm tvm
cd tvm
mkdir build && cd build
cp ../cmake/config.cmake .
# Edit config.cmake: set USE_CUDA ON, set CUDA path
cmake ..
make -j$(nproc)
# Set Python path
export PYTHONPATH=$PYTHONPATH:~/tvm/python
# Verify installation
python -c "import tvm; print(f'TVM {tvm.__version__}')"
python -c "import tvm; print(f'CUDA available: {tvm.cuda().exist}')"Basic model compilation and inference.
import tvm
from tvm import relay
from tvm.contrib import graph_executor
import torch
import torchvision
import numpy as np
# Load PyTorch model
model = torchvision.models.resnet18(pretrained=True)
model.eval()
# Trace the model
input_shape = (1, 3, 224, 224)
input_data = torch.randn(input_shape)
scripted_model = torch.jit.trace(model, input_data).eval()
# Convert to TVM Relay
input_name = "input0"
shape_list = [(input_name, input_shape)]
mod, params = relay.frontend.from_pytorch(scripted_model, shape_list)
# Compile for CUDA
target = tvm.target.cuda()
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)
# Create runtime module
dev = tvm.cuda(0)
module = graph_executor.GraphModule(lib["default"](dev))
# Run inference
input_data = np.random.randn(*input_shape).astype(np.float32)
module.set_input(input_name, input_data)
module.run()
output = module.get_output(0).numpy()
print(f"Output shape: {output.shape}")
# Save compiled module
lib.export_library("resnet18_cuda.so")
# Load and use later
loaded_lib = tvm.runtime.load_module("resnet18_cuda.so")
module = graph_executor.GraphModule(loaded_lib["default"](dev))
Use AutoTVM to auto-tune kernels for your GPU.
import tvm
from tvm import relay, autotvm
from tvm.contrib import graph_executor
import torch
import torchvision
# Load model
model = torchvision.models.resnet50(pretrained=True).eval()
input_shape = (1, 3, 224, 224)
input_data = torch.randn(input_shape)
scripted_model = torch.jit.trace(model, input_data).eval()
# Convert to Relay
mod, params = relay.frontend.from_pytorch(
    scripted_model,
    [("input0", input_shape)]
)
# AutoTVM tuning configuration
target = tvm.target.cuda()
log_file = "resnet50_cuda.log"
# Extract tuning tasks
tasks = autotvm.task.extract_from_program(
    mod["main"],
    target=target,
    params=params
)
print(f"Found {len(tasks)} tuning tasks")
# Configure tuner
tuning_option = {
    'log_filename': log_file,
    'tuner': 'xgb',           # XGBoost-based tuner
    'n_trial': 2000,          # Number of trials per task
    'early_stopping': 600,
    'measure_option': autotvm.measure_option(
        builder=autotvm.LocalBuilder(timeout=10),
        runner=autotvm.LocalRunner(
            number=20,
            repeat=3,
            timeout=4,
            min_repeat_ms=150
        ),
    ),
}
# Run tuning
for i, task in enumerate(tasks):
    prefix = f"[Task {i+1}/{len(tasks)}] "
    tuner_obj = autotvm.tuner.XGBTuner(task, loss_type='rank')
    tuner_obj.tune(
        n_trial=min(tuning_option['n_trial'], len(task.config_space)),
        early_stopping=tuning_option['early_stopping'],
        measure_option=tuning_option['measure_option'],
        callbacks=[
            autotvm.callback.progress_bar(tuning_option['n_trial'], prefix=prefix),
            autotvm.callback.log_to_file(log_file)
        ]
    )
# Compile with tuned kernels
with autotvm.apply_history_best(log_file):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
# Benchmark
dev = tvm.cuda(0)
module = graph_executor.GraphModule(lib["default"](dev))
import timeit
input_np = torch.randn(input_shape).numpy()
module.set_input("input0", input_np)
# Warmup
for _ in range(10):
    module.run()
# Benchmark
timer = module.module.time_evaluator("run", dev, number=100, repeat=3)
prof_res = timer()
print(f"Mean inference time: {prof_res.mean * 1000:.2f} ms")
print(f"Std dev: {prof_res.std * 1000:.2f} ms")
# Using AutoScheduler (Ansor): the newer, template-free tuner
from tvm import auto_scheduler
# AutoScheduler records are JSON and incompatible with AutoTVM logs,
# so use a separate file
log_file = "resnet50_autoscheduler.json"
# Extract tasks for auto-scheduler
tasks, task_weights = auto_scheduler.extract_tasks(
    mod["main"],
    params,
    target
)
# Tuning (num_measure_trials is the total budget across all tasks)
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=20000,
    # enable_cpu_cache_flush is a CPU-only option; for GPU measurements
    # rely on min_repeat_ms for stable timings
    runner=auto_scheduler.LocalRunner(repeat=10, min_repeat_ms=300),
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
tuner.tune(tune_option)
# Compile with auto-scheduled kernels (the PassContext config is required
# for relay.build to pick up AutoScheduler records)
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(opt_level=3, config={"relay.backend.use_auto_scheduler": True}):
        lib = relay.build(mod, target=target, params=params)
Default schedules are generic: tuning finds GPU-specific tiling, unrolling, and vectorization, and typically yields a 2-10x speedup over the untuned build, which makes it essential for production.
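A minimal sketch of measuring that gain, reusing mod, params, target, dev, and input_np from the AutoTVM example above (the helper bench_ms is illustrative):
# Benchmark the same model built with default vs tuned schedules
def bench_ms(lib):
    m = graph_executor.GraphModule(lib["default"](dev))
    m.set_input("input0", input_np)
    return m.module.time_evaluator("run", dev, number=100)().mean * 1000
with tvm.transform.PassContext(opt_level=3):
    base_lib = relay.build(mod, target=target, params=params)  # default schedules
with autotvm.apply_history_best("resnet50_cuda.log"):
    with tvm.transform.PassContext(opt_level=3):
        tuned_lib = relay.build(mod, target=target, params=params)  # tuned schedules
print(f"default: {bench_ms(base_lib):.2f} ms, tuned: {bench_ms(tuned_lib):.2f} ms")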
AutoScheduler (Ansor) is newer and often finds better schedules automatically without manual template design.
Set opt_level=3 to enable operator fusion, constant folding, and dead code elimination; fusion in particular reduces kernel launches.
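For example, you can print the optimized Relay module to see what fusion produced (a sketch, assuming mod, params, and target from the examples above):
# Run Relay's graph-level pipeline and inspect the result; fused operator
# groups appear as nested functions marked Primitive=1
with tvm.transform.PassContext(opt_level=3):
    opt_mod, _ = relay.optimize(mod, target=target, params=params)
print(opt_mod)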
Optimal kernel parameters depend on batch size. Tune separately for batch=1 (latency) and larger batches (throughput).
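Input shapes are baked into the compiled module, so the usual pattern is one build (and ideally one tuning log) per batch size; a sketch reusing the PyTorch model from above:
# Compile one module per batch size; each becomes its own artifact
for batch in (1, 8):
    shape = (batch, 3, 224, 224)
    traced = torch.jit.trace(model, torch.randn(shape)).eval()
    mod_b, params_b = relay.frontend.from_pytorch(traced, [("input0", shape)])
    with tvm.transform.PassContext(opt_level=3):
        lib_b = relay.build(mod_b, target=target, params=params_b)
    lib_b.export_library(f"resnet50_batch{batch}.so")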
Let TVM use cuDNN/cuBLAS for standard ops while generating custom code for fused operators.
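This is controlled through the target string (requires a TVM build with USE_CUDNN and USE_CUBLAS enabled in config.cmake):
# Offload matching ops (conv2d, dense, ...) to cuDNN/cuBLAS; TVM still
# generates and fuses custom CUDA code for everything else
target = tvm.target.Target("cuda -libs=cudnn,cublas")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)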
Tuning takes hours. Save logs and reuse for similar models. Transfer learning of schedules is possible.
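One way to make AutoTVM logs easy to reuse is to compact them to the best record per workload before sharing (a sketch using autotvm.record.pick_best; file names are illustrative):
# Keep only the best configuration per workload from a large tuning log
from tvm import autotvm
autotvm.record.pick_best("resnet50_cuda.log", "resnet50_cuda.best.log")
# Later builds can apply the compact log directly:
# with autotvm.apply_history_best("resnet50_cuda.best.log"): ...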
| Task | Result | Notes |
|---|---|---|
| ResNet-50 (tuned) | 3 ms | Batch=1, RTX 3090, AutoScheduler |
| BERT-Base (tuned) | 8 ms | Seq=128, RTX 3090 |
| Tuning speedup over default | 3-8x | Varies by model |
| vs PyTorch eager | 1.5-3x | After tuning |
How does TVM compare to TensorRT? TensorRT is usually faster on NVIDIA GPUs but is NVIDIA-only; TVM is cross-platform and also targets AMD, ARM, RISC-V, and more. Use TensorRT for NVIDIA-only deployment and TVM for portability.
How long does tuning take? It depends on model complexity and n_trial; expect 1-8 hours for full model tuning. Use pre-tuned schedules from the TVM community when available.
Can TVM train models? No, TVM is inference-only. Use torch.compile for PyTorch training; TVM is for deploying trained models with optimal inference speed.
What is Relay? Relay is TVM's high-level intermediate representation for neural networks; it enables graph-level optimizations before lowering to hardware-specific code.
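A small sketch of the Relay text format for a two-op graph:
import tvm
from tvm import relay
# Build a tiny graph by hand and print its IR
x = relay.var("x", shape=(1, 64), dtype="float32")
w = relay.var("w", shape=(32, 64), dtype="float32")
y = relay.nn.relu(relay.nn.dense(x, w))
mod = tvm.IRModule.from_expr(relay.Function([x, w], y))
print(mod)  # human-readable Relay IR, before optimization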
Alternatives at a glance:
- TensorRT: NVIDIA-only, faster, less portable
- ONNX Runtime: easier to use, less optimization control
- Triton: for custom kernels, not full model compilation
Optimize your TVM CUDA code with RightNow AI - get real-time performance suggestions and memory analysis.