Numba is a just-in-time (JIT) compiler for Python that translates Python and NumPy code to optimized machine code at runtime. Its CUDA capabilities allow you to write GPU kernels directly in Python without learning CUDA C/C++, making GPU programming accessible to Python developers. For CUDA developers, Numba provides a middle ground between high-level frameworks and low-level CUDA programming. You can write custom kernels in Python syntax while maintaining fine-grained control over memory, thread organization, and kernel execution - all without leaving the Python ecosystem. This guide covers Numba's CUDA JIT compilation, kernel development, memory management, vectorization, and integration with NumPy for high-performance scientific computing.
CUDA Integration: Numba uses LLVM to compile Python to PTX (Parallel Thread Execution) code, which NVIDIA drivers JIT-compile to native GPU code. It provides explicit control over CUDA programming model elements: thread indexing, block dimensions, shared memory, and synchronization - all in Python syntax.
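As a small illustration of that mapping, the familiar CUDA C indexing expressions are exposed as attributes of the cuda module (scale_kernel is a hypothetical example, not part of this guide's later code):

from numba import cuda

@cuda.jit
def scale_kernel(arr, factor):
    # Same index CUDA C computes as threadIdx.x + blockIdx.x * blockDim.x
    i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    # cuda.grid(1) is shorthand for this flattened index
    if i < arr.size:
        arr[i] *= factor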
Install Numba and a CUDA toolkit for GPU programming.
# Install via conda (recommended for CUDA support)
conda install numba cudatoolkit
# Or via pip (requires existing CUDA installation)
pip install numba
# Verify installation
python -c "from numba import cuda; print(f'Numba CUDA available: {cuda.is_available()}')"
# Check CUDA version
python -c "from numba import cuda; print(f'CUDA version: {cuda.runtime.get_version()}')"
# List GPUs
python -c "from numba import cuda; print(f'GPUs: {cuda.gpus}')"Write a simple CUDA kernel using Numba.
Write a simple CUDA kernel using Numba.

from numba import cuda
import numpy as np
import math
# CUDA kernel for vector addition
@cuda.jit
def add_kernel(x, y, out):
    # Thread indexing - like CUDA C
    idx = cuda.grid(1)  # flattened 1D grid index
    # Bounds check
    if idx < x.shape[0]:
        out[idx] = x[idx] + y[idx]
# Host function
def add_arrays(a, b):
    n = a.shape[0]
    # Allocate device memory
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_out = cuda.device_array(n, dtype=a.dtype)
    # Configure kernel launch
    threads_per_block = 256
    blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
    # Launch kernel
    add_kernel[blocks_per_grid, threads_per_block](d_a, d_b, d_out)
    # Copy result back to host
    result = d_out.copy_to_host()
    return result
# Test
a = np.arange(1000000, dtype=np.float32)
b = np.arange(1000000, dtype=np.float32)
result = add_arrays(a, b)
print(f"Result matches: {np.allclose(result, a + b)}")
# Using @vectorize for simpler element-wise operations
from numba import vectorize
@vectorize(['float32(float32, float32)'], target='cuda')
def add_ufunc(x, y):
    return x + y
# Automatically handles GPU transfer and kernel launch
result = add_ufunc(a, b)
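Because host-device transfers dominate for element-wise work, a CUDA ufunc can also be fed device arrays directly; it then computes and returns a device array without any host copies. A small sketch reusing a and b from above:

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_sum = add_ufunc(d_a, d_b)   # stays on the GPU, no implicit transfers
print(f"Device ufunc matches: {np.allclose(d_sum.copy_to_host(), a + b)}")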
Implement tiled matrix multiplication using shared memory.

from numba import cuda, float32
import numpy as np

# Tile width must be a compile-time constant for cuda.shared.array
TILE_SIZE = 16

# Matrix multiplication with shared memory tiling
@cuda.jit
def matmul_shared(A, B, C):
    # Shared memory for tiles
    sA = cuda.shared.array(shape=(TILE_SIZE, TILE_SIZE), dtype=float32)
    sB = cuda.shared.array(shape=(TILE_SIZE, TILE_SIZE), dtype=float32)
    # Thread and block indices
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    # Calculate global row and column
    row = by * TILE_SIZE + ty
    col = bx * TILE_SIZE + tx
    # Accumulator
    tmp = 0.0
    # Loop over tiles along the shared dimension
    for tile in range((A.shape[1] + TILE_SIZE - 1) // TILE_SIZE):
        # Load tile into shared memory (zero-pad out-of-bounds elements)
        if row < A.shape[0] and tile * TILE_SIZE + tx < A.shape[1]:
            sA[ty, tx] = A[row, tile * TILE_SIZE + tx]
        else:
            sA[ty, tx] = 0.0
        if col < B.shape[1] and tile * TILE_SIZE + ty < B.shape[0]:
            sB[ty, tx] = B[tile * TILE_SIZE + ty, col]
        else:
            sB[ty, tx] = 0.0
        # Synchronize threads in block before using the tiles
        cuda.syncthreads()
        # Compute partial product
        for k in range(TILE_SIZE):
            tmp += sA[ty, k] * sB[k, tx]
        # Synchronize before loading next tile
        cuda.syncthreads()
    # Write result
    if row < C.shape[0] and col < C.shape[1]:
        C[row, col] = tmp
# Matrix multiplication wrapper
def matmul_gpu(A, B):
    # Ensure contiguous float32 arrays
    A = np.ascontiguousarray(A.astype(np.float32))
    B = np.ascontiguousarray(B.astype(np.float32))
    # Allocate output
    C = np.zeros((A.shape[0], B.shape[1]), dtype=np.float32)
    # Transfer to device
    d_A = cuda.to_device(A)
    d_B = cuda.to_device(B)
    d_C = cuda.to_device(C)
    # Configure launch using the same TILE_SIZE as the kernel
    threads_per_block = (TILE_SIZE, TILE_SIZE)
    blocks_per_grid_x = (B.shape[1] + TILE_SIZE - 1) // TILE_SIZE
    blocks_per_grid_y = (A.shape[0] + TILE_SIZE - 1) // TILE_SIZE
    blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
    # Launch kernel
    matmul_shared[blocks_per_grid, threads_per_block](d_A, d_B, d_C)
    # Copy back
    d_C.copy_to_host(C)
    return C
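A quick correctness check against NumPy keeps the kernel honest; the sizes below are arbitrary and deliberately non-square to exercise the bounds checks:

A = np.random.rand(512, 384).astype(np.float32)
B = np.random.rand(384, 256).astype(np.float32)
C = matmul_gpu(A, B)
print(f"GPU result matches NumPy: {np.allclose(C, A @ B, rtol=1e-3, atol=1e-3)}")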
# Advanced: Stream and async operations
from numba import cuda
@cuda.jit
def process_kernel(data, out):
    idx = cuda.grid(1)
    if idx < data.shape[0]:
        out[idx] = data[idx] ** 2 + 2 * data[idx] + 1
# Using streams for concurrent execution
def process_in_streams(data, n_streams=4):
    stream_list = [cuda.stream() for _ in range(n_streams)]
    chunk_size = len(data) // n_streams
    results = []
    for i, stream in enumerate(stream_list):
        start = i * chunk_size
        end = start + chunk_size if i < n_streams - 1 else len(data)
        chunk = data[start:end]
        # Note: overlapping copies with compute requires pinned host memory
        # (cuda.pinned_array); with pageable NumPy arrays the copies serialize.
        d_chunk = cuda.to_device(chunk, stream=stream)
        d_out = cuda.device_array(len(chunk), dtype=chunk.dtype, stream=stream)
        threads = 256
        blocks = (len(chunk) + threads - 1) // threads
        # Third launch-configuration argument binds the kernel to this stream
        process_kernel[blocks, threads, stream](d_chunk, d_out)
        results.append(d_out)
    # Synchronize all streams and collect results on the host
    cuda.synchronize()
    return np.concatenate([r.copy_to_host() for r in results])
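Usage mirrors the single-stream path; each chunk runs on its own stream and the partial results are gathered on the host at the end (a small sketch reusing process_in_streams):

data = np.random.rand(4_000_000).astype(np.float32)
out = process_in_streams(data)
print(f"Streams result matches: {np.allclose(out, data**2 + 2*data + 1)}")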
Performance tips:
- nopython mode avoids Python interpreter overhead; use it for all hot paths (e.g. @njit or @jit(nopython=True)). Older Numba releases silently fall back to slow object mode when compilation fails, so fix compilation errors rather than relying on the fallback.
- Keep data on the GPU with cuda.to_device() and reuse device arrays. Host-device transfers are expensive, so batch operations on the GPU.
- cuda.shared.array() provides fast on-chip memory, essential for algorithms like matrix multiply and convolution.
- Block sizes that are multiples of 32 (the warp size) work best; 128-512 threads per block is usually optimal. Profile to find the best value.
- @vectorize automatically generates GPU kernels for element-wise operations and is much easier than writing cuda.jit kernels.
- Specify types in the decorator, e.g. @jit("float32(float32, float32)"), to avoid JIT overhead on the first call (see the kernel example after this list).
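The same eager-compilation idea applies to CUDA kernels: passing a signature string to cuda.jit compiles at definition time instead of on the first call. A minimal sketch (saxpy_kernel is a hypothetical name; eager compilation needs a visible GPU when the module loads):

from numba import cuda

# Signature is known up front, so compilation happens immediately
@cuda.jit("void(float32, float32[:], float32[:], float32[:])")
def saxpy_kernel(alpha, x, y, out):
    i = cuda.grid(1)
    if i < out.size:
        out[i] = alpha * x[i] + y[i]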
| Task | Speedup / cost | Notes |
|---|---|---|
| Vector addition (10M elements) | 100x | vs pure Python |
| Matrix multiply (1024x1024) | 50x | vs NumPy on CPU |
| Monte Carlo simulation | 200x | Embarrassingly parallel workload |
| JIT compilation overhead | 100-500ms | One-time cost per function |
Use Numba when you need custom kernels with fine-grained control over threads and memory. Use CuPy when existing NumPy operations are sufficient. Numba gives more control, CuPy is easier for standard operations.
Numba also works alongside deep learning frameworks: convert tensors to NumPy arrays, process with Numba, then convert back. For PyTorch: tensor.cpu().numpy() and torch.from_numpy(). Note: this involves CPU transfers unless you use the CUDA array interface.
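For a zero-copy path, Numba can wrap a CUDA tensor directly through the CUDA array interface. A sketch assuming PyTorch with a CUDA build (double_kernel is a hypothetical example kernel):

import torch
from numba import cuda

@cuda.jit
def double_kernel(x):
    i = cuda.grid(1)
    if i < x.size:
        x[i] *= 2.0

t = torch.arange(1024, dtype=torch.float32, device="cuda")
d_t = cuda.as_cuda_array(t)   # zero-copy view of the tensor's GPU memory
double_kernel[4, 256](d_t)
cuda.synchronize()
print(t[:4])                  # the tensor sees the in-place update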
Set the NUMBA_ENABLE_CUDASIM=1 environment variable to run kernels in Numba's CUDA simulator on the CPU, where you can use ordinary Python debugging tools. print() also works inside real kernels (with limited formatting), and cuda-gdb can be used for actual GPU debugging.
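The simulator is enabled entirely through the environment variable, so no code changes are required (my_kernel_script.py is a placeholder):

# Run a script under the CUDA simulator - kernels execute on the CPU
NUMBA_ENABLE_CUDASIM=1 python my_kernel_script.py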
Numba compiles on first call (JIT). Subsequent calls are fast. Use eager compilation with signatures or warm up the function before timing.
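One way to keep compilation and launch asynchrony out of your timings is to warm the kernel up once and synchronize around the measured call; a sketch reusing add_kernel and the arrays from the first example:

import time

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_out = cuda.device_array_like(d_a)
blocks = (a.size + 255) // 256

add_kernel[blocks, 256](d_a, d_b, d_out)   # warm-up call triggers JIT compilation
cuda.synchronize()

start = time.perf_counter()
add_kernel[blocks, 256](d_a, d_b, d_out)
cuda.synchronize()                          # kernel launches are asynchronous
print(f"Kernel time: {time.perf_counter() - start:.6f} s")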
Compared with other GPU programming options:
- CuPy: higher-level NumPy-on-GPU, less kernel control
- Triton: more modern, better optimization, block-based programming model
- PyTorch: deep learning focus, autograd, larger ecosystem
Optimize your Numba CUDA code with RightNow AI - get real-time performance suggestions and memory analysis.