Numba is a just-in-time (JIT) compiler for Python that translates Python and NumPy code to optimized machine code at runtime. Its CUDA capabilities allow you to write GPU kernels directly in Python without learning CUDA C/C++, making GPU programming accessible to Python developers. For CUDA developers, Numba provides a middle ground between high-level frameworks and low-level CUDA programming. You can write custom kernels in Python syntax while maintaining fine-grained control over memory, thread organization, and kernel execution - all without leaving the Python ecosystem. This guide covers Numba's CUDA JIT compilation, kernel development, memory management, vectorization, and integration with NumPy for high-performance scientific computing.
CUDA Integration: Numba uses LLVM to compile Python to PTX (Parallel Thread Execution) code, which NVIDIA drivers JIT-compile to native GPU code. It provides explicit control over CUDA programming model elements: thread indexing, block dimensions, shared memory, and synchronization - all in Python syntax.
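As a small illustration of that mapping, the familiar CUDA C indexing expressions are exposed as attributes of the cuda module (scale_kernel is a hypothetical example, not part of this guide's later code):

from numba import cuda

@cuda.jit
def scale_kernel(arr, factor):
    # Same index CUDA C computes as threadIdx.x + blockIdx.x * blockDim.x
    i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    # cuda.grid(1) is shorthand for this flattened index
    if i < arr.size:
        arr[i] *= factor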
Install Numba and a CUDA toolkit for GPU programming.
# Install via conda (recommended for CUDA support)
conda install numba cudatoolkit
# Or via pip (requires existing CUDA installation)
pip install numba
# Verify installation
python -c "from numba import cuda; print(f'Numba CUDA available: {cuda.is_available()}')"
# Check CUDA version
python -c "from numba import cuda; print(f'CUDA version: {cuda.runtime.get_version()}')"
# List GPUs
python -c "from numba import cuda; print(f'GPUs: {cuda.gpus}')"Write a simple CUDA kernel using Numba.
Write a simple CUDA kernel using Numba.

from numba import cuda
import numpy as np
import math
# CUDA kernel for vector addition
@cuda.jit
def add_kernel(x, y, out):
    # Thread indexing - like CUDA C
    idx = cuda.grid(1)  # flattened 1D grid index
    # Bounds check
    if idx < x.shape[0]:
        out[idx] = x[idx] + y[idx]
# Host function
def add_arrays(a, b):
    n = a.shape[0]
    # Allocate device memory
    d_a = cuda.to_device(a)
    d_b = cuda.to_device(b)
    d_out = cuda.device_array(n, dtype=a.dtype)
    # Configure kernel launch
    threads_per_block = 256
    blocks_per_grid = (n + threads_per_block - 1) // threads_per_block
    # Launch kernel
    add_kernel[blocks_per_grid, threads_per_block](d_a, d_b, d_out)
    # Copy result back to host
    result = d_out.copy_to_host()
    return result
# Test
a = np.arange(1000000, dtype=np.float32)
b = np.arange(1000000, dtype=np.float32)
result = add_arrays(a, b)
print(f"Result matches: {np.allclose(result, a + b)}")
# Using @vectorize for simpler element-wise operations
from numba import vectorize
@vectorize(['float32(float32, float32)'], target='cuda')
def add_ufunc(x, y):
    return x + y
# Automatically handles GPU transfer and kernel launch
result = add_ufunc(a, b)
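Because host-device transfers dominate for element-wise work, a CUDA ufunc can also be fed device arrays directly; it then computes and returns a device array without any host copies. A small sketch reusing a and b from above:

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_sum = add_ufunc(d_a, d_b)   # stays on the GPU, no implicit transfers
print(f"Device ufunc matches: {np.allclose(d_sum.copy_to_host(), a + b)}")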
Implement tiled matrix multiplication using shared memory.

from numba import cuda, float32
import numpy as np

# Tile width must be a compile-time constant for cuda.shared.array
TILE_SIZE = 16

# Matrix multiplication with shared memory tiling
@cuda.jit
def matmul_shared(A, B, C):
    # Shared memory for tiles
    sA = cuda.shared.array(shape=(TILE_SIZE, TILE_SIZE), dtype=float32)
    sB = cuda.shared.array(shape=(TILE_SIZE, TILE_SIZE), dtype=float32)
    # Thread and block indices
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    # Calculate global row and column
    row = by * TILE_SIZE + ty
    col = bx * TILE_SIZE + tx
    # Accumulator
    tmp = 0.0
    # Loop over tiles along the shared dimension
    for tile in range((A.shape[1] + TILE_SIZE - 1) // TILE_SIZE):
        # Load tile into shared memory (zero-pad out-of-bounds elements)
        if row < A.shape[0] and tile * TILE_SIZE + tx < A.shape[1]:
            sA[ty, tx] = A[row, tile * TILE_SIZE + tx]
        else:
            sA[ty, tx] = 0.0
        if col < B.shape[1] and tile * TILE_SIZE + ty < B.shape[0]:
            sB[ty, tx] = B[tile * TILE_SIZE + ty, col]
        else:
            sB[ty, tx] = 0.0
        # Synchronize threads in block before using the tiles
        cuda.syncthreads()
        # Compute partial product
        for k in range(TILE_SIZE):
            tmp += sA[ty, k] * sB[k, tx]
        # Synchronize before loading next tile
        cuda.syncthreads()
    # Write result
    if row < C.shape[0] and col < C.shape[1]:
        C[row, col] = tmp
# Matrix multiplication wrapper
def matmul_gpu(A, B):
    # Ensure contiguous float32 arrays
    A = np.ascontiguousarray(A.astype(np.float32))
    B = np.ascontiguousarray(B.astype(np.float32))
    # Allocate output
    C = np.zeros((A.shape[0], B.shape[1]), dtype=np.float32)
    # Transfer to device
    d_A = cuda.to_device(A)
    d_B = cuda.to_device(B)
    d_C = cuda.to_device(C)
    # Configure launch using the same TILE_SIZE as the kernel
    threads_per_block = (TILE_SIZE, TILE_SIZE)
    blocks_per_grid_x = (B.shape[1] + TILE_SIZE - 1) // TILE_SIZE
    blocks_per_grid_y = (A.shape[0] + TILE_SIZE - 1) // TILE_SIZE
    blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
    # Launch kernel
    matmul_shared[blocks_per_grid, threads_per_block](d_A, d_B, d_C)
    # Copy back
    d_C.copy_to_host(C)
    return C
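A quick correctness check against NumPy keeps the kernel honest; the sizes below are arbitrary and deliberately non-square to exercise the bounds checks:

A = np.random.rand(512, 384).astype(np.float32)
B = np.random.rand(384, 256).astype(np.float32)
C = matmul_gpu(A, B)
print(f"GPU result matches NumPy: {np.allclose(C, A @ B, rtol=1e-3, atol=1e-3)}")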
# Advanced: Stream and async operations
from numba import cuda
@cuda.jit
def process_kernel(data, out):
    idx = cuda.grid(1)
    if idx < data.shape[0]:
        out[idx] = data[idx] ** 2 + 2 * data[idx] + 1
# Using streams for concurrent execution
def process_in_streams(data, n_streams=4):
    stream_list = [cuda.stream() for _ in range(n_streams)]
    chunk_size = len(data) // n_streams
    results = []
    for i, stream in enumerate(stream_list):
        start = i * chunk_size
        end = start + chunk_size if i < n_streams - 1 else len(data)
        chunk = data[start:end]
        # Note: overlapping copies with compute requires pinned host memory
        # (cuda.pinned_array); with pageable NumPy arrays the copies serialize.
        d_chunk = cuda.to_device(chunk, stream=stream)
        d_out = cuda.device_array(len(chunk), dtype=chunk.dtype, stream=stream)
        threads = 256
        blocks = (len(chunk) + threads - 1) // threads
        # Third launch-configuration argument binds the kernel to this stream
        process_kernel[blocks, threads, stream](d_chunk, d_out)
        results.append(d_out)
    # Synchronize all streams and collect results on the host
    cuda.synchronize()
    return np.concatenate([r.copy_to_host() for r in results])
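Usage mirrors the single-stream path; each chunk runs on its own stream and the partial results are gathered on the host at the end (a small sketch reusing process_in_streams):

data = np.random.rand(4_000_000).astype(np.float32)
out = process_in_streams(data)
print(f"Streams result matches: {np.allclose(out, data**2 + 2*data + 1)}")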
Performance tips:
- nopython mode avoids Python interpreter overhead; use it for all hot paths (e.g. @njit or @jit(nopython=True)). Older Numba releases silently fall back to slow object mode when compilation fails, so fix compilation errors rather than relying on the fallback.
- Keep data on the GPU with cuda.to_device() and reuse device arrays. Host-device transfers are expensive, so batch operations on the GPU.
- cuda.shared.array() provides fast on-chip memory, essential for algorithms like matrix multiply and convolution.
- Block sizes that are multiples of 32 (the warp size) work best; 128-512 threads per block is usually optimal. Profile to find the best value.
- @vectorize automatically generates GPU kernels for element-wise operations and is much easier than writing cuda.jit kernels.
- Specify types in the decorator, e.g. @jit("float32(float32, float32)"), to avoid JIT overhead on the first call (see the kernel example after this list).
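The same eager-compilation idea applies to CUDA kernels: passing a signature string to cuda.jit compiles at definition time instead of on the first call. A minimal sketch (saxpy_kernel is a hypothetical name; eager compilation needs a visible GPU when the module loads):

from numba import cuda

# Signature is known up front, so compilation happens immediately
@cuda.jit("void(float32, float32[:], float32[:], float32[:])")
def saxpy_kernel(alpha, x, y, out):
    i = cuda.grid(1)
    if i < out.size:
        out[i] = alpha * x[i] + y[i]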
| Task | Speedup / cost | Notes |
|---|---|---|
| Vector addition (10M elements) | 100x | vs pure Python |
| Matrix multiply (1024x1024) | 50x | vs NumPy on CPU |
| Monte Carlo simulation | 200x | Embarrassingly parallel workload |
| JIT compilation overhead | 100-500ms | One-time cost per function |
Use Numba when you need custom kernels with fine-grained control over threads and memory. Use CuPy when existing NumPy operations are sufficient. Numba gives more control, CuPy is easier for standard operations.
Numba also works alongside deep learning frameworks: convert tensors to NumPy arrays, process with Numba, then convert back. For PyTorch: tensor.cpu().numpy() and torch.from_numpy(). Note: this involves CPU transfers unless you use the CUDA array interface.
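For a zero-copy path, Numba can wrap a CUDA tensor directly through the CUDA array interface. A sketch assuming PyTorch with a CUDA build (double_kernel is a hypothetical example kernel):

import torch
from numba import cuda

@cuda.jit
def double_kernel(x):
    i = cuda.grid(1)
    if i < x.size:
        x[i] *= 2.0

t = torch.arange(1024, dtype=torch.float32, device="cuda")
d_t = cuda.as_cuda_array(t)   # zero-copy view of the tensor's GPU memory
double_kernel[4, 256](d_t)
cuda.synchronize()
print(t[:4])                  # the tensor sees the in-place update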
Set the NUMBA_ENABLE_CUDASIM=1 environment variable to run kernels in Numba's CUDA simulator on the CPU, where you can use ordinary Python debugging tools. print() also works inside real kernels (with limited formatting), and cuda-gdb can be used for actual GPU debugging.
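The simulator is enabled entirely through the environment variable, so no code changes are required (my_kernel_script.py is a placeholder):

# Run a script under the CUDA simulator - kernels execute on the CPU
NUMBA_ENABLE_CUDASIM=1 python my_kernel_script.py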
Numba compiles on first call (JIT). Subsequent calls are fast. Use eager compilation with signatures or warm up the function before timing.
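One way to keep compilation and launch asynchrony out of your timings is to warm the kernel up once and synchronize around the measured call; a sketch reusing add_kernel and the arrays from the first example:

import time

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_out = cuda.device_array_like(d_a)
blocks = (a.size + 255) // 256

add_kernel[blocks, 256](d_a, d_b, d_out)   # warm-up call triggers JIT compilation
cuda.synchronize()

start = time.perf_counter()
add_kernel[blocks, 256](d_a, d_b, d_out)
cuda.synchronize()                          # kernel launches are asynchronous
print(f"Kernel time: {time.perf_counter() - start:.6f} s")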
Compared with other GPU programming options:
- CuPy: higher-level NumPy-on-GPU, less kernel control
- Triton: more modern, better optimization, block-based programming model
- PyTorch: deep learning focus, autograd, larger ecosystem
Optimize your Numba CUDA code with RightNow AI - get real-time performance suggestions and memory analysis.