cudaErrorInvalidDevice (101)

cudaErrorInvalidDevice (error code 101) occurs when your CUDA application tries to use a GPU device that doesn't exist or isn't accessible. This commonly happens in multi-GPU systems, cloud environments, or when CUDA_VISIBLE_DEVICES is misconfigured. The error message typically appears as "invalid device ordinal" and often surfaces in containerized environments (Docker, Kubernetes) or when switching between machines with different GPU configurations. This guide explains the causes, provides step-by-step solutions, and shows best practices for robust GPU device handling in your CUDA applications.
CUDA error: invalid device ordinal
cudaErrorInvalidDevice: invalid device ordinal
RuntimeError: CUDA error: invalid device ordinal
CUDA_ERROR_INVALID_DEVICE
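A minimal way to reproduce the error (a sketch, assuming CUDA is available and the machine has fewer than nine GPUs):

import torch
# Requesting a device index that doesn't exist triggers the error
x = torch.ones(3, device="cuda:8")
# RuntimeError: CUDA error: invalid device ordinal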
First, verify which GPUs are actually available to your application.
# Check system GPUs
nvidia-smi -L
# Check CUDA-visible GPUs
python -c "import torch; print(f'GPUs: {torch.cuda.device_count()}')"
# List all visible devices
import torch
for i in range(torch.cuda.device_count()):
print(f"GPU {i}: {torch.cuda.get_device_name(i)}")This environment variable controls which GPUs CUDA can see. It remaps device indices.
# Check current setting
echo $CUDA_VISIBLE_DEVICES
# Make all GPUs visible
unset CUDA_VISIBLE_DEVICES
# Or set specific GPUs (0-indexed)
export CUDA_VISIBLE_DEVICES=0,1 # Only GPUs 0 and 1
# In Python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Must be set BEFORE importing torch
# Important: After setting CUDA_VISIBLE_DEVICES=2,3
# Those GPUs become device 0 and 1 in CUDA!
# torch.cuda.device(0) refers to physical GPU 2

Never hardcode device IDs. Always check availability first.
import torch
def get_device():
    if torch.cuda.is_available():
        device_count = torch.cuda.device_count()
        if device_count > 0:
            return torch.device("cuda:0")
    return torch.device("cpu")

# Or use automatic device selection
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# For multi-GPU, validate device index
def get_gpu(device_id=0):
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA not available")
    if device_id >= torch.cuda.device_count():
        raise RuntimeError(f"GPU {device_id} not found. Available: {torch.cuda.device_count()}")
    return torch.device(f"cuda:{device_id}")

Docker containers need explicit GPU access.
# Run with all GPUs
docker run --gpus all your-image
# Run with specific GPUs
docker run --gpus '"device=0,1"' your-image
# Docker Compose
services:
  ml-service:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

# Verify inside container
nvidia-smi

Ensure NVIDIA drivers and CUDA toolkit are properly installed.
# Check driver
nvidia-smi
# Check CUDA version
nvcc --version
# Verify PyTorch CUDA
python -c "import torch; print(torch.cuda.is_available())"
python -c "import torch; print(torch.version.cuda)"
# If driver not loaded, try
sudo modprobe nvidia
# Check for driver issues
dmesg | grep -i nvidia

The following hardcoded pattern fails if the system has fewer than 3 GPUs or if CUDA_VISIBLE_DEVICES hides GPU 2.
# Hardcoded device - breaks on systems with fewer GPUs
model = model.to("cuda:2")
data = data.to("cuda:2")

The corrected version below checks availability, validates the device index, provides a fallback, and reads the GPU ID from an environment variable for flexibility.
import torch
import os
def setup_device(preferred_gpu=0):
"""Robust device setup with fallback."""
if not torch.cuda.is_available():
print("CUDA not available, using CPU")
return torch.device("cpu")
gpu_count = torch.cuda.device_count()
if preferred_gpu >= gpu_count:
print(f"GPU {preferred_gpu} not found, using GPU 0")
preferred_gpu = 0
device = torch.device(f"cuda:{preferred_gpu}")
print(f"Using: {torch.cuda.get_device_name(preferred_gpu)}")
return device
device = setup_device(int(os.environ.get("GPU_ID", 0)))
model = model.to(device)

Why can't my Docker container see any GPUs?
Docker containers are isolated from host GPUs by default. Use the --gpus all flag when running the container, and ensure nvidia-container-toolkit is installed on the host.
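For a quick sanity check inside the container (a sketch, assuming PyTorch is installed in the image), confirm that the device nodes are mounted and that CUDA can see them:

import glob
import torch
# If --gpus was omitted, /dev/nvidia* is typically empty inside the container
print("Device nodes:", glob.glob("/dev/nvidia*"))
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())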
What does CUDA_VISIBLE_DEVICES actually do?
It filters which physical GPUs are visible to CUDA and remaps their indices. If you set CUDA_VISIBLE_DEVICES=2,3, physical GPU 2 becomes cuda:0 and GPU 3 becomes cuda:1 in your application.
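A small demonstration of the remapping (a sketch, assuming a host with at least four GPUs):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"  # must be set before importing torch
import torch
print(torch.cuda.device_count())      # prints 2, not 4
print(torch.cuda.get_device_name(0))  # reports the name of physical GPU 2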
How do I force my application to use a specific GPU?
Set CUDA_VISIBLE_DEVICES=N before your script, or use torch.cuda.set_device(N) in code. The environment variable is preferred because it prevents other GPUs from being initialized.
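Both approaches side by side (a sketch; the GPU index 1 is an arbitrary example):

import os
# Option 1 (preferred): hide all other GPUs before torch is imported
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # physical GPU 1 becomes cuda:0
import torch

# Option 2: keep all GPUs visible and select one in code
# torch.cuda.set_device(1)  # validate against torch.cuda.device_count() first
x = torch.zeros(4, device="cuda")  # allocates on the selected GPU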
nvidia-smi shows my GPUs, but CUDA can't see them. What should I check?
Check whether CUDA_VISIBLE_DEVICES is set restrictively, verify that your CUDA toolkit version matches your driver, and ensure there are no permission issues with the /dev/nvidia* devices.
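To rule out permission problems programmatically (a sketch; the paths assume a standard Linux driver installation):

import glob
import os
for dev in glob.glob("/dev/nvidia*"):
    ok = os.access(dev, os.R_OK | os.W_OK)
    print(f"{dev}: {'read/write OK' if ok else 'permission problem'}")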
Related errors:
cudaErrorNoDevice (100): occurs when no GPUs are available at all
cudaErrorInvalidValue: can occur after selecting the wrong device
cudaErrorInsufficientDriver: driver version mismatch
Need help debugging CUDA errors? Download RightNow AI for intelligent error analysis and optimization suggestions.