PyCUDA provides direct access to NVIDIA's CUDA API from Python. You write kernels in CUDA C and launch them from Python, which gives you maximum control over the GPU.
CUDA Integration: PyCUDA is built on the CUDA Driver API. Kernels are written in CUDA C, compiled by nvcc at runtime, and cached for reuse. GPUArray provides a NumPy-like array that lives on the device, and argument handlers such as cuda.In and cuda.Out take care of host-device transfers.
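As a quick illustration of the GPUArray workflow (a minimal sketch):

```python
import pycuda.autoinit  # creates a CUDA context on the first device
import pycuda.gpuarray as gpuarray
import numpy as np

# to_gpu copies the host array to the device; arithmetic runs on the GPU
a_gpu = gpuarray.to_gpu(np.random.randn(4).astype(np.float32))
doubled = (2 * a_gpu).get()  # .get() copies the result back to the host
print(doubled)
```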
Install PyCUDA via pip.
```bash
pip install pycuda
```
```python
# Verify installation
import pycuda.driver as cuda
import pycuda.autoinit  # creates a context on the default device
print(f"Device: {cuda.Device(0).name()}")
```
Writing and calling a CUDA C kernel.

```python
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np

# CUDA C kernel: one thread per element
mod = SourceModule("""
__global__ void add(float *a, float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}
""")
add_kernel = mod.get_function("add")

# Data
a = np.random.randn(1000).astype(np.float32)
b = np.random.randn(1000).astype(np.float32)
c = np.zeros_like(a)

# Launch kernel: cuda.In/cuda.Out copy to and from the device automatically;
# 4 blocks x 256 threads = 1024 threads, enough to cover 1000 elements
add_kernel(
    cuda.In(a), cuda.In(b), cuda.Out(c), np.int32(len(a)),
    block=(256, 1, 1), grid=(4, 1, 1),
)
```
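cuda.In and cuda.Out copy the arrays on every call. To keep data resident on the device between launches, the same kernel can take GPUArrays directly; a minimal sketch reusing a, b, add_kernel, and the numpy import from above, with the grid computed from the problem size instead of hardcoded:

```python
import pycuda.gpuarray as gpuarray

# One-time host-to-device copies; results stay on the GPU
a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)
c_gpu = gpuarray.empty_like(a_gpu)

# Derive the grid from n instead of hardcoding it
block_size = 256
grid_size = (len(a) + block_size - 1) // block_size

add_kernel(a_gpu, b_gpu, c_gpu, np.int32(len(a)),
           block=(block_size, 1, 1), grid=(grid_size, 1, 1))
result = c_gpu.get()  # single copy back to the host
```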
Parallel reduction with shared memory.

```python
mod = SourceModule("""
__global__ void reduce_sum(float *input, float *output, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread into shared memory
    sdata[tid] = (i < n) ? input[i] : 0;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) output[blockIdx.x] = sdata[0];
}
""")
reduce_kernel = mod.get_function("reduce_sum")

# Launch with dynamic shared memory; block size must be a power of two
# for the halving loop above
n = 1 << 20
data = np.random.randn(n).astype(np.float32)
block_size = 256
num_blocks = (n + block_size - 1) // block_size
output = np.zeros(num_blocks, dtype=np.float32)
shared_size = block_size * 4  # sizeof(float)

reduce_kernel(cuda.In(data), cuda.Out(output), np.int32(n),
              block=(block_size, 1, 1), grid=(num_blocks, 1, 1),
              shared=shared_size)
total = output.sum()  # one partial sum per block; finish on the host
```

Performance tips:
- Set PYCUDA_CACHE_DIR to reuse compiled modules across runs.
- Avoid manual memory management; prefer GPUArray and the cuda.In/cuda.Out argument handlers.
- Keep data on the GPU between kernel launches to avoid repeated host-device transfers.
- Overlap compute and data transfer using streams, as sketched below.
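On the last point: asynchronous copies require page-locked host memory and a stream; a minimal sketch (the kernel launch line is a placeholder):

```python
import pycuda.autoinit
import pycuda.driver as cuda
import numpy as np

n = 1 << 20
stream = cuda.Stream()

# Async copies require page-locked (pinned) host memory
h_in = cuda.pagelocked_empty(n, dtype=np.float32)
h_in[:] = np.random.randn(n)
d_in = cuda.mem_alloc(h_in.nbytes)

# Work queued on the stream returns control to the host immediately,
# so the next batch can be prepared while the GPU is busy
cuda.memcpy_htod_async(d_in, h_in, stream)
# some_kernel(d_in, block=(256, 1, 1), grid=(n // 256, 1, 1), stream=stream)
stream.synchronize()  # block until the queued work completes
```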
| Task | Performance | Notes |
|---|---|---|
| Matrix multiply | Near cuBLAS | With optimization |
| Kernel compile | 100-500ms | First compile |
| Cached kernel | <1ms | Subsequent calls |
Choose PyCUDA for full CUDA control; choose Numba to write kernels in pure Python (see the sketch below).
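For contrast, here is the same elementwise add as a Numba kernel, written entirely in Python (a sketch, assuming numba is installed):

```python
from numba import cuda
import numpy as np

@cuda.jit
def add(a, b, c):
    i = cuda.grid(1)  # global thread index
    if i < c.size:
        c[i] = a[i] + b[i]

a = np.random.randn(1000).astype(np.float32)
b = np.random.randn(1000).astype(np.float32)
c = np.zeros_like(a)
add[4, 256](a, b, c)  # 4 blocks of 256 threads; transfers are implicit
```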
To debug kernels, use cuda-gdb or add printf calls inside the kernel (slow, but often enough), as illustrated below.
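For the printf route, device-side output is buffered and appears once the context synchronizes; a minimal sketch:

```python
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
#include <cstdio>
__global__ void debug_kernel() {
    // Every thread prints; keep launches tiny or the output is overwhelming
    printf("block %d thread %d\\n", blockIdx.x, threadIdx.x);
}
""")
mod.get_function("debug_kernel")(block=(4, 1, 1), grid=(1, 1, 1))
cuda.Context.synchronize()  # flush device-side printf output
```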
CUDA libraries such as cuBLAS can be used alongside PyCUDA through scikit-cuda, or called directly via ctypes.
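As an example of the scikit-cuda route (a sketch; assumes the skcuda package is installed):

```python
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import numpy as np
import skcuda.linalg as linalg

linalg.init()  # initializes the cuBLAS context

a_gpu = gpuarray.to_gpu(np.random.randn(64, 64).astype(np.float32))
b_gpu = gpuarray.to_gpu(np.random.randn(64, 64).astype(np.float32))
c_gpu = linalg.dot(a_gpu, b_gpu)  # matrix multiply via cuBLAS
```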