cudaErrorLaunchTimeout (702)
cudaErrorLaunchTimeout (error code 702) occurs when a CUDA kernel runs longer than the operating system's watchdog timer allows. On Windows, this is the TDR (Timeout Detection and Recovery) mechanism, with a default of 2 seconds. On Linux with X11, a similar timeout applies to GPUs that drive a display. The error is common when a single kernel launch processes a large dataset, or when the GPU doing compute work is also driving a display. The fix is either to modify the timeout settings or to restructure your code to work in smaller chunks. This guide covers timeout management and strategies for long-running compute workloads.
CUDA error: the launch timed out and was terminated
cudaErrorLaunchTimeout: the launch timed out and was terminated
CUDA_ERROR_LAUNCH_TIMEOUT
Display driver stopped responding and has recovered (TDR)
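In practice the error usually surfaces not at the launch itself but at the next synchronizing call. A minimal sketch (long_running_kernel and its launch configuration are placeholders, not part of any real API):
// Sketch: where error 702 typically shows up
long_running_kernel<<<grid, block>>>(data, n);  // launch returns immediately
cudaError_t err = cudaDeviceSynchronize();      // watchdog kill is reported here
if (err == cudaErrorLaunchTimeout) {
    fprintf(stderr, "Watchdog timeout: %s\n", cudaGetErrorString(err));
}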
Modify Windows registry to allow longer kernel execution.
# Windows Registry Editor (regedit)
# Navigate to:
# HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
# Add or modify these DWORD values:
# TdrDelay = 60 (timeout in seconds, default is 2)
# TdrDdiDelay = 60 (DDI timeout)
# TdrLevel = 0 (0 to disable TDR entirely - not recommended)
# After changes, restart the computer
# Warning: Disabling TDR can cause system hangs if a kernel truly hangs
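After changing the registry and rebooting, you can confirm from CUDA whether the watchdog still applies to a device. A small self-contained check (assumes device 0; adjust the index as needed):
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("%s: kernel execution timeout %s\n",
           prop.name, prop.kernelExecTimeoutEnabled ? "ENABLED" : "disabled");
    return 0;
}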
For headless compute servers, disable the X11 timeout.
# For headless servers (no display), set:
sudo nvidia-smi -pm 1 # Enable persistence mode
sudo nvidia-smi -c EXCLUSIVE_PROCESS # Compute mode
# If X11 is running, you can try:
# Option 1: Run without X11 (headless)
sudo systemctl stop gdm # or lightdm
# Option 2: Use a separate compute GPU
export CUDA_VISIBLE_DEVICES=1 # Use non-display GPU
# Option 3: Interactive timeout (less common)
# Add to /etc/X11/xorg.conf in Device section:
# Option "Interactive" "0"
Refactor to process data incrementally.
// BAD: Process everything in one kernel
process_all<<<1, 1024>>>(huge_data, 10000000);
// GOOD: Process in chunks that fit within the timeout
int block = 256;       // threads per block
int chunk_size = 100000;
for (int offset = 0; offset < total; offset += chunk_size) {
    int current_chunk = std::min(chunk_size, total - offset);  // needs <algorithm>
    int grid = (current_chunk + block - 1) / block;            // blocks for this chunk
    process_chunk<<<grid, block>>>(data, offset, current_chunk);
    // Allow the system to process display updates between launches
    cudaDeviceSynchronize();
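    // Sketch (added for illustration): detect a watchdog kill per chunk
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Chunk at offset %d failed: %s\n",
                offset, cudaGetErrorString(err));
        break;  // stop here, or shrink chunk_size and retry
    }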
}
If the kernel is just slow, optimize the algorithm.
// Profile to find the bottleneck
// Use Nsight Compute or nvprof
// Common optimizations:
// 1. Use shared memory for repeated access
// 2. Ensure coalesced memory access
// 3. Reduce thread divergence
// 4. Use appropriate precision (FP16 vs FP32)
// 5. Balance grid/block dimensions
// Example: Move hot data to shared memory
__global__ void optimized_kernel(float* data, int n) {
    __shared__ float shared[256];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    // Load to shared memory once
    if (gid < n) shared[tid] = data[gid];
    __syncthreads();
    // Multiple operations on the shared copy (fast)
    // ...
}
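To confirm an optimization actually brings the kernel under the watchdog budget, time it with CUDA events. A sketch; grid, block, data, and n are assumed to be defined as above:
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
optimized_kernel<<<grid, block>>>(data, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("Kernel time: %.1f ms (keep well under the ~2000 ms default TDR)\n", ms);
cudaEventDestroy(start);
cudaEventDestroy(stop);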
Separate display and compute to different GPUs.
# List available GPUs
nvidia-smi -L
# In multi-GPU system, use non-display GPU for compute
export CUDA_VISIBLE_DEVICES=1 # GPU 1 for compute
# Or in code:
import torch
torch.cuda.set_device(1) # Use second GPU
# In CUDA C++:
cudaSetDevice(1); // Use non-display GPU
# Display GPU (typically GPU 0) has stricter timeouts
# Compute-only GPUs can run kernels much longer
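Rather than hard-coding a device index, you can select the first GPU without a watchdog programmatically. A minimal sketch using the kernel-execution-timeout device attribute:
int count = 0;
cudaGetDeviceCount(&count);
for (int dev = 0; dev < count; ++dev) {
    int timeout = 1;
    cudaDeviceGetAttribute(&timeout, cudaDevAttrKernelExecTimeout, dev);
    if (timeout == 0) {        // no run-time limit on kernels
        cudaSetDevice(dev);    // use this GPU for long-running work
        break;
    }
}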
Single-threaded work on the GPU will definitely time out.
__global__ void process_huge_dataset(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx == 0) {
        // All work on one thread - takes forever!
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                data[i * n + j] = expensive_computation(i, j);
            }
        }
    }
}
Parallel processing in chunks allows the system to stay responsive.
__global__ void process_chunk(float* data, int n, int chunk_start, int chunk_size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x + chunk_start;
    if (idx < chunk_start + chunk_size && idx < n * n) {
        int i = idx / n, j = idx % n;
        data[idx] = expensive_computation(i, j);
    }
}
// Host code: process in chunks sized to finish well under the watchdog
for (int i = 0; i < n * n; i += CHUNK_SIZE) {
    process_chunk<<<grid, block>>>(data, n, i, CHUNK_SIZE);
    cudaDeviceSynchronize(); // Yield to the system between chunks
}
The timeout prevents GPU hangs from freezing your display. Without it, a buggy kernel could make your system completely unresponsive. It is a safety feature, not a bug.
Disabling the timeout is not recommended on development machines. If a kernel has a bug such as an infinite loop, the system will freeze and require a hard reboot. Only disable TDR on headless compute servers.
A kernel that used to finish in time can start timing out: larger input data, a different GPU, or OS and driver updates can all change timing. Other GPU load (browsers, video playback) can also contribute. Profile your kernel to understand its actual runtime.
Related issues:
- A timeout is one type of launch failure
- Memory errors can cause kernel hangs
- Asserts can cause unexpected termination