cudaErrorMemoryAllocation (2)

cudaErrorMemoryAllocation (error code 2) is the most common CUDA error, raised when the GPU cannot allocate the requested memory. It typically happens when your application requests more GPU memory than is available, whether because of memory fragmentation, other processes holding GPU memory, or simply asking for more than the GPU has. In deep learning frameworks such as PyTorch and TensorFlow it frequently surfaces as "CUDA out of memory" or "RuntimeError: CUDA error: out of memory". Understanding and resolving this error is essential for any GPU developer working with large models or datasets. This guide covers the root causes, step-by-step solutions, and best practices for preventing memory allocation failures in your CUDA applications.
Common error messages:
CUDA error: out of memory
cudaErrorMemoryAllocation: out of memory
RuntimeError: CUDA error: out of memory. Tried to allocate X.XX GiB
cudaMalloc failed: out of memory
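If you want to handle the failure in code instead of crashing, recent PyTorch versions expose it as torch.cuda.OutOfMemoryError (a subclass of RuntimeError). The snippet below is a minimal sketch assuming a PyTorch workload; the function name and fallback behavior are illustrative, not part of any framework API.

import torch

def run_step(model, batch):
    # Attempt the forward pass; on OOM, release cached blocks and re-raise
    # with a hint so the caller can retry with a smaller batch.
    try:
        return model(batch)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        raise RuntimeError("CUDA out of memory: reduce the batch size or model size")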
First, determine how much memory is available and what is consuming it.
# Command line
nvidia-smi
# In Python
import torch
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
print(f"Max allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")The simplest solution is often reducing the batch size. Memory usage scales linearly with batch size.
# Before: batch_size = 32, uses ~12GB
# After: batch_size = 16, uses ~6GB
# For PyTorch DataLoader
train_loader = DataLoader(dataset, batch_size=16) # Reduced from 32
# Gradient accumulation to maintain effective batch size
accumulation_steps = 2
for i, (inputs, labels) in enumerate(train_loader):
    outputs = model(inputs)
    loss = criterion(outputs, labels) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

PyTorch and TensorFlow cache memory allocations, so explicitly clearing the cache can return unused blocks to the GPU and free memory.
# PyTorch
import torch
torch.cuda.empty_cache()
import gc
gc.collect()
# TensorFlow
import tensorflow as tf
tf.keras.backend.clear_session()
# After each training epoch
def clear_memory():
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

Mixed-precision (FP16) training stores activations and gradients in half precision, roughly halving their memory footprint compared with FP32, usually with minimal accuracy loss.
# PyTorch Automatic Mixed Precision
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
for inputs, labels in train_loader:
    optimizer.zero_grad()
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
# TensorFlow Mixed Precision
from tensorflow.keras import mixed_precision
mixed_precision.set_global_policy('mixed_float16')

Gradient checkpointing trades compute for memory: activations of checkpointed layers are discarded during the forward pass and recomputed during the backward pass.
# PyTorch gradient checkpointing
from torch.utils.checkpoint import checkpoint
import torch.nn as nn

class CheckpointedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(1024, 1024)  # placeholder memory-heavy layers
        self.layer2 = nn.Linear(1024, 1024)
        self.layer3 = nn.Linear(1024, 1024)

    def forward(self, x):
        # Checkpoint memory-heavy layers
        x = checkpoint(self.layer1, x)
        x = checkpoint(self.layer2, x)
        return self.layer3(x)
# For transformers
model.gradient_checkpointing_enable()
# TensorFlow
tf.recompute_grad(layer)

Other processes may be consuming GPU memory. Identify and terminate them.
# Find processes using GPU
nvidia-smi
# Or more detailed
fuser -v /dev/nvidia*
# Kill specific process
kill -9 <PID>
# In Python, ensure exclusive access
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Use only GPU 0Storing loss tensors keeps the entire computation graph in memory, causing OOM.
# Memory leak - tensors accumulate on GPU
losses = []
for epoch in range(100):
    for batch in train_loader:
        loss = model(batch)
        losses.append(loss)  # Keeps the whole gradient graph alive!
        loss.backward()

Calling .item() extracts the scalar value as a Python float without keeping the graph. Clearing gradients each step and emptying the cache between epochs prevents accumulation.
# Proper memory management
losses = []
for epoch in range(100):
    for batch in train_loader:
        loss = model(batch)
        losses.append(loss.item())  # .item() detaches and converts to a Python float
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()  # Clear gradients
    torch.cuda.empty_cache()  # Clear cache after each epoch

CUDA memory allocation can fail due to fragmentation even when the total free memory looks sufficient, because the allocator needs contiguous blocks. Try torch.cuda.empty_cache(), or restart the Python process to fully defragment.
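PyTorch's caching allocator can also be tuned to reduce fragmentation through the PYTORCH_CUDA_ALLOC_CONF environment variable. The sketch below sets max_split_size_mb, which limits how cached blocks may be split; the value 128 is illustrative, not a recommendation, so benchmark it on your own workload.

import os

# Must be set before the first CUDA allocation (safest: before importing torch,
# or export it in the shell before launching the process).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # first CUDA use after this point picks up the setting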
To see where memory is going, use torch.cuda.memory_summary() for a detailed breakdown of the allocator's state. To track down specific tensors, iterate through gc.get_objects() and check tensor sizes, or use a memory profiling tool like pytorch_memlab.
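As a rough illustration of the gc.get_objects() approach (a minimal sketch, not part of any profiling tool's API), the following lists live CUDA tensors and roughly how much memory each holds:

import gc
import torch

# Print every live CUDA tensor with its shape and approximate size in MB.
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) and obj.is_cuda:
            size_mb = obj.element_size() * obj.nelement() / 1e6
            print(type(obj).__name__, tuple(obj.shape), f"{size_mb:.1f} MB")
    except Exception:
        pass  # some objects raise on attribute access during inspection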
Models larger than available GPU memory can still be trained: techniques like CPU offloading (DeepSpeed ZeRO-Offload), gradient checkpointing, or quantization libraries like bitsandbytes can help. However, they significantly impact performance.
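As one example of the quantization route, bitsandbytes ships 8-bit optimizers whose state tensors take roughly a quarter of the memory of their FP32 counterparts. The sketch below assumes bitsandbytes is installed and that Adam8bit is available in your version; the model is a placeholder.

import torch.nn as nn
import bitsandbytes as bnb  # assumes the bitsandbytes package is installed

model = nn.Linear(1024, 1024).cuda()  # placeholder model

# 8-bit Adam keeps optimizer state in int8, cutting optimizer memory sharply.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)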
An out-of-memory error that only appears after many iterations usually indicates a memory leak: tensors accumulating on the GPU over time. Check for tensors being appended to lists, losses that are never detached, or gradients that are never cleared.
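A quick way to confirm a suspected leak (a minimal sketch, assuming a standard PyTorch training loop; the helper name is illustrative) is to log allocated memory every N steps and watch whether it climbs steadily across epochs:

import torch

def log_gpu_memory(step, every=100):
    # Steadily growing "allocated" across epochs usually means tensors are
    # being retained somewhere (lists, logging, stray references).
    if step % every == 0:
        allocated = torch.cuda.memory_allocated() / 1e9
        reserved = torch.cuda.memory_reserved() / 1e9
        print(f"step {step}: allocated {allocated:.2f} GB, reserved {reserved:.2f} GB")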
Related errors often appear alongside this one: errors raised when GPU selection fails, errors that follow earlier memory issues, and memory-access errors triggered after a failed allocation.
Need help debugging CUDA errors? Download RightNow AI for intelligent error analysis and optimization suggestions.