PyTorch is the leading deep learning framework for research and production, with first-class CUDA support for GPU acceleration. Its dynamic computation graph and Pythonic API make it ideal for rapid prototyping, while features like TorchScript and torch.compile enable production-grade performance. For CUDA developers, PyTorch provides multiple levels of GPU integration: high-level APIs for automatic GPU utilization, mid-level control with explicit device management, and low-level access through custom CUDA extensions. Understanding these layers is key to maximizing GPU performance. This guide covers PyTorch's CUDA integration, memory management strategies, mixed precision training, custom kernel development, and optimization techniques for both training and inference.
CUDA Integration: PyTorch's CUDA backend builds on cuDNN, cuBLAS, and NCCL for optimized operations, and the framework automatically selects efficient algorithms for convolutions and matrix operations. PyTorch 2.0 introduced torch.compile, whose default TorchInductor backend generates Triton kernels, enabling automatic kernel fusion and optimization without manual CUDA coding.
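As a minimal, hedged sketch of the high-level path (the module name and layer sizes below are illustrative placeholders, not from the original examples), torch.compile wraps an ordinary nn.Module and falls back to eager CPU execution when no GPU is present:

```python
import torch
import torch.nn as nn

# Small module used only to illustrate torch.compile; sizes are arbitrary.
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(128, 256), nn.GELU(), nn.Linear(256, 10))

    def forward(self, x):
        return self.net(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MLP().to(device)

# torch.compile traces the model; on CUDA devices the TorchInductor backend
# lowers fused operations to Triton kernels.
compiled = torch.compile(model)

x = torch.randn(32, 128, device=device)
y = compiled(x)  # first call triggers compilation; later calls reuse cached kernels
print(y.shape)
```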
Install PyTorch with the appropriate CUDA version for your system.
```bash
# Check your CUDA version
nvidia-smi

# Install PyTorch with CUDA 12.1 (recommended)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Or with CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Verify installation
python -c "import torch; print(f'PyTorch {torch.__version__}'); print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}')"

# Check GPU
python -c "import torch; print(torch.cuda.get_device_name(0))"
```

A simple example of training a model on GPU with proper device management.
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
).to(device)

# Optimizer and loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training loop
def train_epoch(model, loader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for batch_idx, (data, target) in enumerate(loader):
        # Move data to GPU
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    return total_loss / len(loader)

# Inference with no_grad for memory efficiency
@torch.no_grad()
def evaluate(model, loader, device):
    model.eval()
    correct = 0
    for data, target in loader:
        data, target = data.to(device), target.to(device)
        output = model(data)
        pred = output.argmax(dim=1)
        correct += pred.eq(target).sum().item()
    return correct / len(loader.dataset)
```

Production-ready training with AMP, gradient scaling, and torch.compile for maximum performance.
```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# Enable TF32 for faster matmuls on Ampere+ GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

class TransformerModel(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        x = self.embedding(x)
        x = self.transformer(x)
        return self.fc(x)

# Create model and compile for optimization
model = TransformerModel(vocab_size=50000).cuda()
model = torch.compile(model, mode="reduce-overhead")  # or "max-autotune"

# Mixed precision training setup
scaler = GradScaler()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(model, data, target, optimizer, scaler):
    optimizer.zero_grad(set_to_none=True)  # More efficient than zero_grad()

    # Automatic mixed precision
    with autocast(dtype=torch.float16):
        output = model(data)
        loss = nn.functional.cross_entropy(output.view(-1, output.size(-1)), target.view(-1))

    # Scaled backward pass
    scaler.scale(loss).backward()

    # Gradient clipping (unscale first)
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Optimizer step with scaler
    scaler.step(optimizer)
    scaler.update()

    return loss.item()

# Memory-efficient inference
@torch.inference_mode()
def generate(model, prompt, max_length=100):
    model.eval()
    with autocast(dtype=torch.float16):
        for _ in range(max_length):
            output = model(prompt)
            next_token = output[:, -1, :].argmax(dim=-1, keepdim=True)
            prompt = torch.cat([prompt, next_token], dim=1)
    return prompt
```

PyTorch 2.0+ can automatically fuse operations and optimize kernels. Use model = torch.compile(model) for a 20-40% speedup with zero code changes.
TF32 provides roughly 3x faster matmuls on Ampere and newer GPUs with minimal accuracy loss. Enable it with torch.backends.cuda.matmul.allow_tf32 = True.
Pinned memory enables faster CPU-to-GPU transfers. Set pin_memory=True on the DataLoader and pass non_blocking=True to .to(device).
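A short sketch of that pattern, assuming a standard map-style dataset (the dataset and sizes below are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Dummy dataset purely for illustration.
dataset = TensorDataset(torch.randn(10_000, 784), torch.randint(0, 10, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    pin_memory=True,   # allocate host batches in page-locked memory
    num_workers=2,
)

for data, target in loader:
    # non_blocking=True lets the host-to-device copy overlap with GPU work,
    # but only when the source tensor is pinned.
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)
    # ... forward/backward would go here ...
    break
```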
Use model.to(memory_format=torch.channels_last) for 10-30% faster convolutions on modern GPUs.
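A minimal sketch, assuming a convolutional model (the layers below are placeholders): both the module and its inputs need the channels-last layout, and the gains mostly show up for convolutions, especially under AMP.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Small convolutional stack for illustration only.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
).to(device)
model = model.to(memory_format=torch.channels_last)

# Inputs must also use channels-last layout to avoid hidden layout conversions.
x = torch.randn(16, 3, 224, 224, device=device).contiguous(memory_format=torch.channels_last)
out = model(x)
print(out.is_contiguous(memory_format=torch.channels_last))
```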
optimizer.zero_grad(set_to_none=True) avoids a memset by setting gradients to None instead of zeroing them in place; since PyTorch 2.0 it is the default behavior.
Set torch.backends.cudnn.benchmark = True to auto-tune convolution algorithms. Only use with fixed input sizes.
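A brief sketch of how the flag behaves (the conv layer below is just for illustration): the first forward pass for a given input shape is slower while cuDNN profiles candidate algorithms, and later passes with that shape reuse the fastest one.

```python
import torch
import torch.nn as nn

# Auto-tune cuDNN convolution algorithms. Worth it only when input shapes are
# fixed; with varying shapes every new shape triggers a fresh benchmark pass.
torch.backends.cudnn.benchmark = True

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1).to(device)
x = torch.randn(32, 3, 224, 224, device=device)

_ = conv(x)  # first call: cuDNN benchmarks algorithms for this shape
_ = conv(x)  # subsequent calls with the same shape reuse the winner
```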
| Task | Performance | Notes |
|---|---|---|
| ResNet-50 Training (imgs/sec) | 1,850 | RTX 4090, batch=64, AMP enabled |
| BERT-Large Inference (sentences/sec) | 320 | RTX 4090, batch=32, torch.compile |
| GPT-2 Generation (tokens/sec) | 145 | RTX 4090, KV-cache enabled |
| torch.compile speedup | 1.3-2x | Varies by model architecture |
Use torch.cuda.is_available() to check CUDA availability, and torch.cuda.current_device() to see the active GPU. During training, nvidia-smi should show GPU memory usage and utilization.
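A small sketch of those checks from Python, including the memory counters PyTorch exposes (nvidia-smi reports the caching allocator's reserved memory, which is larger than what live tensors actually occupy):

```python
import torch

print(torch.cuda.is_available())             # True if a CUDA device is usable
if torch.cuda.is_available():
    print(torch.cuda.current_device())       # index of the active GPU (default 0)
    print(torch.cuda.get_device_name(0))     # GPU model string
    # Memory held by tensors vs. memory reserved by the caching allocator.
    print(torch.cuda.memory_allocated() / 1e9, "GB allocated")
    print(torch.cuda.memory_reserved() / 1e9, "GB reserved")
```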
.cuda() always moves tensors and modules to the GPU (defaulting to cuda:0). .to(device) is more flexible and works with any device, including the CPU. Best practice is to define device = torch.device("cuda" if torch.cuda.is_available() else "cpu") once and use .to(device) everywhere.
Use gradient checkpointing, mixed precision (AMP), smaller batch sizes, gradient accumulation, or del intermediate tensors. For inference, use torch.inference_mode() and consider quantization.
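As one concrete example of these options, here is a hedged sketch of gradient accumulation (accum_steps and the function name are placeholders, not from the original guide); it preserves the effective batch size while cutting per-step activation memory:

```python
import torch

accum_steps = 4  # effective batch size = loader batch size * accum_steps

def train_epoch_accumulated(model, loader, optimizer, criterion, device):
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for step, (data, target) in enumerate(loader):
        data, target = data.to(device), target.to(device)
        # Scale the loss so accumulated gradients average over micro-batches.
        loss = criterion(model(data), target) / accum_steps
        loss.backward()  # gradients accumulate across micro-batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```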
torch.compile works best on models with static shapes and standard operations. It may not help (or could slow down) models with heavy dynamic control flow. Always benchmark before and after.
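A hedged benchmarking sketch, assuming a CUDA device: warm-up iterations absorb torch.compile's one-time compilation cost, and torch.cuda.synchronize() is required because kernel launches are asynchronous.

```python
import time
import torch

def benchmark(fn, x, iters=50, warmup=10):
    for _ in range(warmup):          # warm-up also triggers torch.compile's compilation
        fn(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()         # wait for queued kernels before stopping the clock
    return (time.perf_counter() - start) / iters

# Usage sketch (model and x are placeholders from earlier examples):
# eager_time    = benchmark(model, x)
# compiled_time = benchmark(torch.compile(model), x)
# print(f"speedup: {eager_time / compiled_time:.2f}x")
```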
TensorFlow: more production-focused, better mobile/edge deployment.
JAX: functional paradigm, better XLA optimization, research-focused.
Triton: for writing custom GPU kernels in Python.