PyTorch is the leading deep learning framework for research and production, with first-class CUDA support for GPU acceleration. Its dynamic computation graph and Pythonic API make it ideal for rapid prototyping, while features like TorchScript and torch.compile enable production-grade performance. For CUDA developers, PyTorch provides multiple levels of GPU integration: high-level APIs for automatic GPU utilization, mid-level control with explicit device management, and low-level access through custom CUDA extensions. Understanding these layers is key to maximizing GPU performance. This guide covers PyTorch's CUDA integration, memory management strategies, mixed precision training, custom kernel development, and optimization techniques for both training and inference.
CUDA Integration: PyTorch's CUDA backend builds on cuDNN, cuBLAS, and NCCL for optimized operations, and the framework automatically selects efficient algorithms for convolutions and matrix operations. PyTorch 2.0 introduced torch.compile, whose default TorchInductor backend generates Triton kernels, enabling automatic kernel fusion and optimization without manual CUDA coding.
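As a minimal, hedged sketch of the high-level path (the module name and layer sizes below are illustrative placeholders, not from the original examples), torch.compile wraps an ordinary nn.Module and falls back to eager CPU execution when no GPU is present:

```python
import torch
import torch.nn as nn

# Small module used only to illustrate torch.compile; sizes are arbitrary.
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(128, 256), nn.GELU(), nn.Linear(256, 10))

    def forward(self, x):
        return self.net(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MLP().to(device)

# torch.compile traces the model; on CUDA devices the TorchInductor backend
# lowers fused operations to Triton kernels.
compiled = torch.compile(model)

x = torch.randn(32, 128, device=device)
y = compiled(x)  # first call triggers compilation; later calls reuse cached kernels
print(y.shape)
```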
Install PyTorch with the appropriate CUDA version for your system.
```bash
# Check your CUDA version
nvidia-smi

# Install PyTorch with CUDA 12.1 (recommended)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Or with CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Verify installation
python -c "import torch; print(f'PyTorch {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}')"

# Check GPU
python -c "import torch; print(torch.cuda.get_device_name(0))"
```

A simple example of training a model on GPU with proper device management.
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
).to(device)

# Optimizer and loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training loop
def train_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for batch_idx, (data, target) in enumerate(loader):
        # Move data to GPU
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    return total_loss / len(loader)

# Inference with no_grad for memory efficiency
@torch.no_grad()
def evaluate(model, loader, device):
    model.eval()
    correct = 0
    for data, target in loader:
        data, target = data.to(device), target.to(device)
        output = model(data)
        pred = output.argmax(dim=1)
        correct += pred.eq(target).sum().item()
    return correct / len(loader.dataset)
```

Production-ready training with AMP, gradient scaling, and torch.compile for maximum performance.
```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# Enable TF32 for faster matmuls on Ampere+ GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

class TransformerModel(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        x = self.transformer(x)
        return self.fc(x)

# Create model and compile for optimization
model = TransformerModel(vocab_size=50000).cuda()
model = torch.compile(model, mode="reduce-overhead")  # or "max-autotune"

# Mixed precision training setup
scaler = GradScaler()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(model, data, target, optimizer, scaler):
    optimizer.zero_grad(set_to_none=True)  # More efficient than zero_grad()

    # Automatic mixed precision
    with autocast(dtype=torch.float16):
        output = model(data)
        loss = nn.functional.cross_entropy(output.view(-1, output.size(-1)), target.view(-1))

    # Scaled backward pass
    scaler.scale(loss).backward()

    # Gradient clipping (unscale first)
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Optimizer step with scaler
    scaler.step(optimizer)
    scaler.update()

    return loss.item()

# Memory-efficient inference
@torch.inference_mode()
def generate(model, prompt, max_length=100):
    model.eval()
    with autocast(dtype=torch.float16):
        for _ in range(max_length):
            output = model(prompt)
            next_token = output[:, -1, :].argmax(dim=-1, keepdim=True)
            prompt = torch.cat([prompt, next_token], dim=1)
    return prompt
```

PyTorch 2.0+ can automatically fuse operations and optimize kernels. Use model = torch.compile(model) for a 20-40% speedup with zero code changes.
TF32 provides roughly 3x faster matmuls on Ampere and newer GPUs with minimal accuracy loss. Enable it with torch.backends.cuda.matmul.allow_tf32 = True.
Pinned memory enables faster CPU-to-GPU transfers. Set pin_memory=True on the DataLoader and pass non_blocking=True to .to(device).
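A short sketch of that pattern, assuming a standard map-style dataset (the dataset and sizes below are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Dummy dataset purely for illustration.
dataset = TensorDataset(torch.randn(10_000, 784), torch.randint(0, 10, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    pin_memory=True,   # allocate host batches in page-locked memory
    num_workers=2,
)

for data, target in loader:
    # non_blocking=True lets the host-to-device copy overlap with GPU work,
    # but only when the source tensor is pinned.
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)
    # ... forward/backward would go here ...
    break
```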
Use model.to(memory_format=torch.channels_last) for 10-30% faster convolutions on modern GPUs.
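A minimal sketch, assuming a convolutional model (the layers below are placeholders): both the module and its inputs need the channels-last layout, and the gains mostly show up for convolutions, especially under AMP.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Small convolutional stack for illustration only.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
).to(device)
model = model.to(memory_format=torch.channels_last)

# Inputs must also use channels-last layout to avoid hidden layout conversions.
x = torch.randn(16, 3, 224, 224, device=device).contiguous(memory_format=torch.channels_last)
out = model(x)
print(out.is_contiguous(memory_format=torch.channels_last))
```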
optimizer.zero_grad(set_to_none=True) avoids a memset by setting gradients to None instead of zeroing them in place; since PyTorch 2.0 it is the default behavior.
Set torch.backends.cudnn.benchmark = True to auto-tune convolution algorithms. Only use with fixed input sizes.
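A brief sketch of how the flag behaves (the conv layer below is just for illustration): the first forward pass for a given input shape is slower while cuDNN profiles candidate algorithms, and later passes with that shape reuse the fastest one.

```python
import torch
import torch.nn as nn

# Auto-tune cuDNN convolution algorithms. Worth it only when input shapes are
# fixed; with varying shapes every new shape triggers a fresh benchmark pass.
torch.backends.cudnn.benchmark = True

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1).to(device)
x = torch.randn(32, 3, 224, 224, device=device)

_ = conv(x)  # first call: cuDNN benchmarks algorithms for this shape
_ = conv(x)  # subsequent calls with the same shape reuse the winner
```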
| Task | Performance | Notes |
|---|---|---|
| ResNet-50 Training (imgs/sec) | 1,850 | RTX 4090, batch=64, AMP enabled |
| BERT-Large Inference (sentences/sec) | 320 | RTX 4090, batch=32, torch.compile |
| GPT-2 Generation (tokens/sec) | 145 | RTX 4090, KV-cache enabled |
| torch.compile speedup | 1.3-2x | Varies by model architecture |
Use torch.cuda.is_available() to check CUDA availability, and torch.cuda.current_device() to see the active GPU. During training, nvidia-smi should show GPU memory usage and utilization.
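A small sketch of those checks from Python, including the memory counters PyTorch exposes (nvidia-smi reports the caching allocator's reserved memory, which is larger than what live tensors actually occupy):

```python
import torch

print(torch.cuda.is_available())             # True if a CUDA device is usable
if torch.cuda.is_available():
    print(torch.cuda.current_device())       # index of the active GPU (default 0)
    print(torch.cuda.get_device_name(0))     # GPU model string
    # Memory held by tensors vs. memory reserved by the caching allocator.
    print(torch.cuda.memory_allocated() / 1e9, "GB allocated")
    print(torch.cuda.memory_reserved() / 1e9, "GB reserved")
```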
.cuda() always moves tensors and modules to the GPU (defaulting to cuda:0). .to(device) is more flexible and works with any device, including the CPU. Best practice is to define device = torch.device("cuda" if torch.cuda.is_available() else "cpu") once and use .to(device) everywhere.
Use gradient checkpointing, mixed precision (AMP), smaller batch sizes, gradient accumulation, or del intermediate tensors. For inference, use torch.inference_mode() and consider quantization.
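As one concrete example of these options, here is a hedged sketch of gradient accumulation (accum_steps and the function name are placeholders, not from the original guide); it preserves the effective batch size while cutting per-step activation memory:

```python
import torch

accum_steps = 4  # effective batch size = loader batch size * accum_steps

def train_epoch_accumulated(model, loader, optimizer, criterion, device):
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for step, (data, target) in enumerate(loader):
        data, target = data.to(device), target.to(device)
        # Scale the loss so accumulated gradients average over micro-batches.
        loss = criterion(model(data), target) / accum_steps
        loss.backward()  # gradients accumulate across micro-batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```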
torch.compile works best on models with static shapes and standard operations. It may not help (or could slow down) models with heavy dynamic control flow. Always benchmark before and after.
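A hedged benchmarking sketch, assuming a CUDA device: warm-up iterations absorb torch.compile's one-time compilation cost, and torch.cuda.synchronize() is required because kernel launches are asynchronous.

```python
import time
import torch

def benchmark(fn, x, iters=50, warmup=10):
    for _ in range(warmup):          # warm-up also triggers torch.compile's compilation
        fn(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()         # wait for queued kernels before stopping the clock
    return (time.perf_counter() - start) / iters

# Usage sketch (model and x are placeholders from earlier examples):
# eager_time    = benchmark(model, x)
# compiled_time = benchmark(torch.compile(model), x)
# print(f"speedup: {eager_time / compiled_time:.2f}x")
```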
TensorFlow: more production-focused, better mobile/edge deployment.
JAX: functional paradigm, better XLA optimization, research-focused.
Triton: for writing custom GPU kernels in Python.