PyCUDA provides direct access to NVIDIA's CUDA API from Python. You write kernels in CUDA C and launch them from Python, which gives you maximum control over the GPU.
CUDA Integration: PyCUDA is built on the CUDA Driver API. Kernels are written in CUDA C, compiled by nvcc at runtime, and cached for reuse. GPUArray provides a NumPy-like array that lives on the device, and argument handlers such as cuda.In and cuda.Out take care of host-device transfers.
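As a quick illustration of the GPUArray workflow (a minimal sketch):

```python
import pycuda.autoinit  # creates a CUDA context on the first device
import pycuda.gpuarray as gpuarray
import numpy as np

# to_gpu copies the host array to the device; arithmetic runs on the GPU
a_gpu = gpuarray.to_gpu(np.random.randn(4).astype(np.float32))
doubled = (2 * a_gpu).get()  # .get() copies the result back to the host
print(doubled)
```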
Install PyCUDA via pip.
```bash
pip install pycuda
```
```python
# Verify installation
import pycuda.driver as cuda
import pycuda.autoinit  # creates a context on the default device
print(f"Device: {cuda.Device(0).name()}")
```
Writing and calling a CUDA C kernel.

```python
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np

# CUDA C kernel: one thread per element
mod = SourceModule("""
__global__ void add(float *a, float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) c[idx] = a[idx] + b[idx];
}
""")
add_kernel = mod.get_function("add")

# Data
a = np.random.randn(1000).astype(np.float32)
b = np.random.randn(1000).astype(np.float32)
c = np.zeros_like(a)

# Launch kernel: cuda.In/cuda.Out copy to and from the device automatically;
# 4 blocks x 256 threads = 1024 threads, enough to cover 1000 elements
add_kernel(
    cuda.In(a), cuda.In(b), cuda.Out(c), np.int32(len(a)),
    block=(256, 1, 1), grid=(4, 1, 1),
)
```
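cuda.In and cuda.Out copy the arrays on every call. To keep data resident on the device between launches, the same kernel can take GPUArrays directly; a minimal sketch reusing a, b, add_kernel, and the numpy import from above, with the grid computed from the problem size instead of hardcoded:

```python
import pycuda.gpuarray as gpuarray

# One-time host-to-device copies; results stay on the GPU
a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)
c_gpu = gpuarray.empty_like(a_gpu)

# Derive the grid from n instead of hardcoding it
block_size = 256
grid_size = (len(a) + block_size - 1) // block_size

add_kernel(a_gpu, b_gpu, c_gpu, np.int32(len(a)),
           block=(block_size, 1, 1), grid=(grid_size, 1, 1))
result = c_gpu.get()  # single copy back to the host
```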
Parallel reduction with shared memory.

```python
mod = SourceModule("""
__global__ void reduce_sum(float *input, float *output, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread into shared memory
    sdata[tid] = (i < n) ? input[i] : 0;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) output[blockIdx.x] = sdata[0];
}
""")
reduce_kernel = mod.get_function("reduce_sum")

# Launch with dynamic shared memory; block size must be a power of two
# for the halving loop above
n = 1 << 20
data = np.random.randn(n).astype(np.float32)
block_size = 256
num_blocks = (n + block_size - 1) // block_size
output = np.zeros(num_blocks, dtype=np.float32)
shared_size = block_size * 4  # sizeof(float)

reduce_kernel(cuda.In(data), cuda.Out(output), np.int32(n),
              block=(block_size, 1, 1), grid=(num_blocks, 1, 1),
              shared=shared_size)
total = output.sum()  # one partial sum per block; finish on the host
```

Performance tips:
- Set PYCUDA_CACHE_DIR to reuse compiled modules across runs.
- Avoid manual memory management; prefer GPUArray and the cuda.In/cuda.Out argument handlers.
- Keep data on the GPU between kernel launches to avoid repeated host-device transfers.
- Overlap compute and data transfer using streams, as sketched below.
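On the last point: asynchronous copies require page-locked host memory and a stream; a minimal sketch (the kernel launch line is a placeholder):

```python
import pycuda.autoinit
import pycuda.driver as cuda
import numpy as np

n = 1 << 20
stream = cuda.Stream()

# Async copies require page-locked (pinned) host memory
h_in = cuda.pagelocked_empty(n, dtype=np.float32)
h_in[:] = np.random.randn(n)
d_in = cuda.mem_alloc(h_in.nbytes)

# Work queued on the stream returns control to the host immediately,
# so the next batch can be prepared while the GPU is busy
cuda.memcpy_htod_async(d_in, h_in, stream)
# some_kernel(d_in, block=(256, 1, 1), grid=(n // 256, 1, 1), stream=stream)
stream.synchronize()  # block until the queued work completes
```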
| Task | Performance | Notes |
|---|---|---|
| Matrix multiply | Near cuBLAS | With optimization |
| Kernel compile | 100-500ms | First compile |
| Cached kernel | <1ms | Subsequent calls |
Choose PyCUDA for full CUDA control; choose Numba to write kernels in pure Python (see the sketch below).
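For contrast, here is the same elementwise add as a Numba kernel, written entirely in Python (a sketch, assuming numba is installed):

```python
from numba import cuda
import numpy as np

@cuda.jit
def add(a, b, c):
    i = cuda.grid(1)  # global thread index
    if i < c.size:
        c[i] = a[i] + b[i]

a = np.random.randn(1000).astype(np.float32)
b = np.random.randn(1000).astype(np.float32)
c = np.zeros_like(a)
add[4, 256](a, b, c)  # 4 blocks of 256 threads; transfers are implicit
```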
To debug kernels, use cuda-gdb or add printf calls inside the kernel (slow, but often enough), as illustrated below.
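For the printf route, device-side output is buffered and appears once the context synchronizes; a minimal sketch:

```python
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
#include <cstdio>
__global__ void debug_kernel() {
    // Every thread prints; keep launches tiny or the output is overwhelming
    printf("block %d thread %d\\n", blockIdx.x, threadIdx.x);
}
""")
mod.get_function("debug_kernel")(block=(4, 1, 1), grid=(1, 1, 1))
cuda.Context.synchronize()  # flush device-side printf output
```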
CUDA libraries such as cuBLAS can be used alongside PyCUDA through scikit-cuda, or called directly via ctypes.
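As an example of the scikit-cuda route (a sketch; assumes the skcuda package is installed):

```python
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import numpy as np
import skcuda.linalg as linalg

linalg.init()  # initializes the cuBLAS context

a_gpu = gpuarray.to_gpu(np.random.randn(64, 64).astype(np.float32))
b_gpu = gpuarray.to_gpu(np.random.randn(64, 64).astype(np.float32))
c_gpu = linalg.dot(a_gpu, b_gpu)  # matrix multiply via cuBLAS
```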