CuPy is a NumPy-compatible library for GPU-accelerated computing. It implements the NumPy API on CUDA, allowing you to run existing NumPy code on GPU with minimal changes - often just replacing "import numpy" with "import cupy". For CUDA developers doing scientific computing, signal processing, or numerical simulations, CuPy provides immediate GPU acceleration without rewriting algorithms. It also offers low-level CUDA access through raw kernels and cuBLAS/cuFFT integration for when you need maximum performance. This guide covers CuPy's NumPy compatibility, performance optimization, custom kernels, and integration with deep learning frameworks.
CUDA Integration: CuPy directly uses CUDA libraries: cuBLAS for linear algebra, cuFFT for Fourier transforms, cuRAND for random numbers, and cuSPARSE for sparse matrices. It manages GPU memory with a memory pool to reduce allocation overhead and supports custom CUDA kernels through RawKernel and ElementwiseKernel APIs.
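These libraries surface through the regular array API. A minimal sketch of how they show up in practice (array sizes here are arbitrary): cp.fft dispatches to cuFFT, dense matmul goes through cuBLAS, and cupyx.scipy.sparse matrices are backed by cuSPARSE.
import cupy as cp
import cupyx.scipy.sparse as sparse
# FFT dispatches to cuFFT
signal = cp.random.randn(2**20).astype(cp.float32)
spectrum = cp.fft.fft(signal)
# Dense matrix multiply dispatches to cuBLAS
a = cp.random.randn(512, 512).astype(cp.float32)
gram = a @ a.T
# Sparse matrices are backed by cuSPARSE
m = sparse.random(1000, 1000, density=0.01, format='csr', dtype=cp.float32)
v = cp.random.randn(1000).astype(cp.float32)
y = m @ v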
Install CuPy with the appropriate CUDA version.
# For CUDA 12.x
pip install cupy-cuda12x
# For CUDA 11.x
pip install cupy-cuda11x
# Or build from source for a specific CUDA setup
pip install cupy
# Verify installation
python -c "import cupy as cp; print(f'CuPy {cp.__version__}'); x = cp.array([1,2,3]); print(f'GPU: {x.device}')"
# Check available memory
python -c "import cupy as cp; print(f'Free memory: {cp.cuda.runtime.memGetInfo()[0] / 1e9:.1f} GB')"Convert NumPy code to run on GPU with minimal changes.
import cupy as cp
import numpy as np
# Create arrays on GPU
x = cp.array([1, 2, 3, 4, 5]) # Like np.array but on GPU
y = cp.random.randn(1000, 1000) # Random on GPU
# NumPy-like operations - all run on GPU
z = cp.dot(y, y.T)
result = cp.linalg.svd(z)
# Transfer between CPU and GPU
cpu_array = np.array([1, 2, 3])
gpu_array = cp.asarray(cpu_array) # CPU -> GPU
back_to_cpu = cp.asnumpy(gpu_array) # GPU -> CPU
# Or use .get()
back_to_cpu = gpu_array.get()
# Context manager for device selection
with cp.cuda.Device(0):
    a = cp.zeros((1000, 1000))
# Memory pooling (enabled by default)
mempool = cp.get_default_memory_pool()
print(f"Used memory: {mempool.used_bytes() / 1e6:.1f} MB")
# Clear memory pool
mempool.free_all_blocks()
# Example: Matrix operations
def solve_linear_system(A, b):
    # All operations run on GPU
    x = cp.linalg.solve(A, b)
    return x
A = cp.random.randn(1000, 1000)
b = cp.random.randn(1000)
x = solve_linear_system(A, b)
# Verify
residual = cp.linalg.norm(cp.dot(A, x) - b)
print(f"Residual: {residual}")Write custom CUDA kernels for operations not in CuPy.
import cupy as cp
# ElementwiseKernel - easiest for element-by-element operations
squared_diff = cp.ElementwiseKernel(
    'float32 x, float32 y',   # Input types
    'float32 z',              # Output types
    'z = (x - y) * (x - y)',  # Operation
    'squared_diff'            # Kernel name
)
x = cp.random.randn(10000).astype(cp.float32)
y = cp.random.randn(10000).astype(cp.float32)
result = squared_diff(x, y)
# ReductionKernel - for reduce operations
sum_squared = cp.ReductionKernel(
    'float32 x',     # Input type
    'float32 y',     # Output type
    'x * x',         # Map expression
    'a + b',         # Reduce expression
    'y = a',         # Post-reduction
    '0',             # Identity value
    'sum_squared'    # Name
)
x = cp.random.randn(1000000).astype(cp.float32)
result = sum_squared(x) # Sum of squares
# RawKernel - full CUDA control
kernel_code = '''
extern "C" __global__
void fused_add_relu(const float* x, const float* y, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float val = x[tid] + y[tid];
        out[tid] = val > 0 ? val : 0;  // ReLU
    }
}
'''
fused_add_relu = cp.RawKernel(kernel_code, 'fused_add_relu')
n = 1000000
x = cp.random.randn(n).astype(cp.float32)
y = cp.random.randn(n).astype(cp.float32)
out = cp.empty(n, dtype=cp.float32)
# Launch configuration
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
fused_add_relu((blocks,), (threads_per_block,), (x, y, out, cp.int32(n)))  # pass n as a 32-bit int to match the kernel's int parameter
# Interoperability with PyTorch (zero-copy!)
import torch
# CuPy -> PyTorch (zero-copy)
cupy_array = cp.random.randn(1000, 1000).astype(cp.float32)
torch_tensor = torch.as_tensor(cupy_array, device='cuda')
# PyTorch -> CuPy (zero-copy)
torch_tensor = torch.randn(1000, 1000, device='cuda')
cupy_array = cp.asarray(torch_tensor)
- Keep data on GPU as long as possible. Use cp.asarray() once at the start and cp.asnumpy() once at the end.
- CuPy's memory pool reduces allocation overhead. For long-running jobs, periodically call mempool.free_all_blocks() to release unused memory.
- Use in-place operators such as += and *= instead of + and * to avoid allocating new arrays and reduce memory pressure.
- For repeated operation sequences, write a custom ElementwiseKernel to fuse them and reduce memory traffic (see the sketch after this list).
- Prefer float32: it is at least 2x faster than float64, and often far more on consumer GPUs, where FP64 throughput is heavily reduced. Only use float64 when precision is critical.
- For arrays larger than GPU memory, use cp.cuda.MemoryPool(cp.cuda.malloc_managed) for unified memory.
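A minimal sketch applying several of these tips together - the managed-memory pool, in-place float32 updates, and a fused ElementwiseKernel; the kernel name and array sizes here are illustrative:
import cupy as cp
# Optional: allocate from a managed (unified) memory pool so arrays can exceed GPU memory
pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
cp.cuda.set_allocator(pool.malloc)
# Prefer float32 and in-place updates to limit allocations
x = cp.random.randn(1_000_000).astype(cp.float32)
y = cp.random.randn(1_000_000).astype(cp.float32)
x += y    # in-place add, no new array allocated
x *= 0.5  # in-place scale
# Fuse a repeated sequence (scale, shift, clamp) into a single kernel
scale_shift_clamp = cp.ElementwiseKernel(
    'float32 v, float32 a, float32 b',
    'float32 out',
    'out = fminf(fmaxf(a * v + b, 0.0f), 1.0f)',
    'scale_shift_clamp'
)
result = scale_shift_clamp(x, cp.float32(0.1), cp.float32(0.5))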
| Task | Speedup | Notes |
|---|---|---|
| Matrix multiply (4096x4096) | 15x | vs NumPy on CPU |
| FFT (2^24 elements) | 50x | vs NumPy on CPU |
| SVD (1000x1000) | 8x | vs NumPy on CPU |
| Element-wise operations | 10-100x | Scales with array size |
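Measured speedups vary with the GPU, array size, and dtype, and kernel launches are asynchronous, so benchmark with proper synchronization - for example with cupyx.profiler.benchmark (available in recent CuPy versions). A minimal sketch:
import cupy as cp
from cupyx.profiler import benchmark
a = cp.random.randn(4096, 4096).astype(cp.float32)
def matmul():
    return a @ a.T
# benchmark() warms up, synchronizes the GPU, and reports CPU and GPU times separately
print(benchmark(matmul, n_repeat=10))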
Can I run existing NumPy code on the GPU?
Yes! Most NumPy code works by just replacing import numpy as np with import cupy as cp. Some functions may need adjustment, but the API is highly compatible.
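For code that should accept either NumPy or CuPy arrays, cp.get_array_module() returns the matching module. A minimal sketch:
import numpy as np
import cupy as cp
def normalize(x):
    # Picks numpy or cupy based on where the array lives
    xp = cp.get_array_module(x)
    return (x - xp.mean(x)) / xp.std(x)
print(normalize(np.arange(10.0)))  # runs on CPU
print(normalize(cp.arange(10.0)))  # runs on GPU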
CuPy includes cupyx.scipy with GPU implementations of many SciPy functions. Import from cupyx.scipy instead of scipy for GPU acceleration.
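For example, the ndimage filters have GPU counterparts under cupyx.scipy.ndimage; a minimal sketch (the image here is just random data):
import cupy as cp
import cupyx.scipy.ndimage as ndi
image = cp.random.rand(2048, 2048).astype(cp.float32)
blurred = ndi.gaussian_filter(image, sigma=3)  # GPU equivalent of scipy.ndimage.gaussian_filter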
Can I use multiple GPUs?
Yes - select a device with the cp.cuda.Device(n) context manager (a with cp.cuda.Device(n): block). Each device has its own memory, so you need to explicitly transfer data between GPUs.
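A minimal sketch of using a second GPU (assumes at least two devices are visible):
import cupy as cp
with cp.cuda.Device(0):
    a = cp.random.randn(1000, 1000).astype(cp.float32)
with cp.cuda.Device(1):
    b = cp.asarray(a)  # explicit copy from GPU 0 to GPU 1
    c = b @ b.T        # computed on GPU 1
print(c.device)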
CuPy focuses on NumPy compatibility for scientific computing. JAX provides automatic differentiation and XLA compilation for ML. Use CuPy for numerical code, JAX for ML research.
| Alternative | Notes |
|---|---|
| JAX | Better for ML, has autodiff, XLA compilation |
| PyTorch | Deep learning focused, dynamic graphs |
| Numba | JIT compiler for Python, supports CUDA |
Optimize your CuPy CUDA code with RightNow AI - get real-time performance suggestions and memory analysis.