CuPy is a NumPy-compatible library for GPU-accelerated computing. It implements the NumPy API on CUDA, allowing you to run existing NumPy code on GPU with minimal changes - often just replacing "import numpy" with "import cupy". For CUDA developers doing scientific computing, signal processing, or numerical simulations, CuPy provides immediate GPU acceleration without rewriting algorithms. It also offers low-level CUDA access through raw kernels and cuBLAS/cuFFT integration for when you need maximum performance. This guide covers CuPy's NumPy compatibility, performance optimization, custom kernels, and integration with deep learning frameworks.
CUDA Integration: CuPy directly uses CUDA libraries: cuBLAS for linear algebra, cuFFT for Fourier transforms, cuRAND for random numbers, and cuSPARSE for sparse matrices. It manages GPU memory with a memory pool to reduce allocation overhead and supports custom CUDA kernels through RawKernel and ElementwiseKernel APIs.
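These libraries surface through the regular array API. A minimal sketch of how they show up in practice (array sizes here are arbitrary): cp.fft dispatches to cuFFT, dense matmul goes through cuBLAS, and cupyx.scipy.sparse matrices are backed by cuSPARSE.
import cupy as cp
import cupyx.scipy.sparse as sparse
# FFT dispatches to cuFFT
signal = cp.random.randn(2**20).astype(cp.float32)
spectrum = cp.fft.fft(signal)
# Dense matrix multiply dispatches to cuBLAS
a = cp.random.randn(512, 512).astype(cp.float32)
gram = a @ a.T
# Sparse matrices are backed by cuSPARSE
m = sparse.random(1000, 1000, density=0.01, format='csr', dtype=cp.float32)
v = cp.random.randn(1000).astype(cp.float32)
y = m @ v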
Install CuPy with the appropriate CUDA version.
# For CUDA 12.x
pip install cupy-cuda12x
# For CUDA 11.x
pip install cupy-cuda11x
# Or build from source for a specific CUDA setup
pip install cupy
# Verify installation
python -c "import cupy as cp; print(f'CuPy {cp.__version__}'); x = cp.array([1,2,3]); print(f'GPU: {x.device}')"
# Check available memory
python -c "import cupy as cp; print(f'Free memory: {cp.cuda.runtime.memGetInfo()[0] / 1e9:.1f} GB')"Convert NumPy code to run on GPU with minimal changes.
import cupy as cp
import numpy as np
# Create arrays on GPU
x = cp.array([1, 2, 3, 4, 5]) # Like np.array but on GPU
y = cp.random.randn(1000, 1000) # Random on GPU
# NumPy-like operations - all run on GPU
z = cp.dot(y, y.T)
result = cp.linalg.svd(z)
# Transfer between CPU and GPU
cpu_array = np.array([1, 2, 3])
gpu_array = cp.asarray(cpu_array) # CPU -> GPU
back_to_cpu = cp.asnumpy(gpu_array) # GPU -> CPU
# Or use .get()
back_to_cpu = gpu_array.get()
# Context manager for device selection
with cp.cuda.Device(0):
    a = cp.zeros((1000, 1000))
# Memory pooling (enabled by default)
mempool = cp.get_default_memory_pool()
print(f"Used memory: {mempool.used_bytes() / 1e6:.1f} MB")
# Clear memory pool
mempool.free_all_blocks()
# Example: Matrix operations
def solve_linear_system(A, b):
    # All operations run on GPU
    x = cp.linalg.solve(A, b)
    return x
A = cp.random.randn(1000, 1000)
b = cp.random.randn(1000)
x = solve_linear_system(A, b)
# Verify
residual = cp.linalg.norm(cp.dot(A, x) - b)
print(f"Residual: {residual}")Write custom CUDA kernels for operations not in CuPy.
import cupy as cp
# ElementwiseKernel - easiest for element-by-element operations
squared_diff = cp.ElementwiseKernel(
    'float32 x, float32 y',   # Input types
    'float32 z',              # Output types
    'z = (x - y) * (x - y)',  # Operation
    'squared_diff'            # Kernel name
)
x = cp.random.randn(10000).astype(cp.float32)
y = cp.random.randn(10000).astype(cp.float32)
result = squared_diff(x, y)
# ReductionKernel - for reduce operations
sum_squared = cp.ReductionKernel(
    'float32 x',     # Input type
    'float32 y',     # Output type
    'x * x',         # Map expression
    'a + b',         # Reduce expression
    'y = a',         # Post-reduction
    '0',             # Identity value
    'sum_squared'    # Name
)
x = cp.random.randn(1000000).astype(cp.float32)
result = sum_squared(x) # Sum of squares
# RawKernel - full CUDA control
kernel_code = '''
extern "C" __global__
void fused_add_relu(const float* x, const float* y, float* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float val = x[tid] + y[tid];
        out[tid] = val > 0 ? val : 0;  // ReLU
    }
}
'''
fused_add_relu = cp.RawKernel(kernel_code, 'fused_add_relu')
n = 1000000
x = cp.random.randn(n).astype(cp.float32)
y = cp.random.randn(n).astype(cp.float32)
out = cp.empty(n, dtype=cp.float32)
# Launch configuration
threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
fused_add_relu((blocks,), (threads_per_block,), (x, y, out, cp.int32(n)))  # pass n as a 32-bit int to match the kernel's int parameter
# Interoperability with PyTorch (zero-copy!)
import torch
# CuPy -> PyTorch (zero-copy)
cupy_array = cp.random.randn(1000, 1000).astype(cp.float32)
torch_tensor = torch.as_tensor(cupy_array, device='cuda')
# PyTorch -> CuPy (zero-copy)
torch_tensor = torch.randn(1000, 1000, device='cuda')
cupy_array = cp.asarray(torch_tensor)
- Keep data on GPU as long as possible. Use cp.asarray() once at the start and cp.asnumpy() once at the end.
- CuPy's memory pool reduces allocation overhead. For long-running jobs, periodically call mempool.free_all_blocks() to release unused memory.
- Use in-place operators such as += and *= instead of + and * to avoid allocating new arrays and reduce memory pressure.
- For repeated operation sequences, write a custom ElementwiseKernel to fuse them and reduce memory traffic (see the sketch after this list).
- Prefer float32: it is at least 2x faster than float64, and often far more on consumer GPUs, where FP64 throughput is heavily reduced. Only use float64 when precision is critical.
- For arrays larger than GPU memory, use cp.cuda.MemoryPool(cp.cuda.malloc_managed) for unified memory.
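A minimal sketch applying several of these tips together - the managed-memory pool, in-place float32 updates, and a fused ElementwiseKernel; the kernel name and array sizes here are illustrative:
import cupy as cp
# Optional: allocate from a managed (unified) memory pool so arrays can exceed GPU memory
pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
cp.cuda.set_allocator(pool.malloc)
# Prefer float32 and in-place updates to limit allocations
x = cp.random.randn(1_000_000).astype(cp.float32)
y = cp.random.randn(1_000_000).astype(cp.float32)
x += y    # in-place add, no new array allocated
x *= 0.5  # in-place scale
# Fuse a repeated sequence (scale, shift, clamp) into a single kernel
scale_shift_clamp = cp.ElementwiseKernel(
    'float32 v, float32 a, float32 b',
    'float32 out',
    'out = fminf(fmaxf(a * v + b, 0.0f), 1.0f)',
    'scale_shift_clamp'
)
result = scale_shift_clamp(x, cp.float32(0.1), cp.float32(0.5))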
| Task | Speedup | Notes |
|---|---|---|
| Matrix multiply (4096x4096) | 15x | vs NumPy on CPU |
| FFT (2^24 elements) | 50x | vs NumPy on CPU |
| SVD (1000x1000) | 8x | vs NumPy on CPU |
| Element-wise operations | 10-100x | Scales with array size |
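Measured speedups vary with the GPU, array size, and dtype, and kernel launches are asynchronous, so benchmark with proper synchronization - for example with cupyx.profiler.benchmark (available in recent CuPy versions). A minimal sketch:
import cupy as cp
from cupyx.profiler import benchmark
a = cp.random.randn(4096, 4096).astype(cp.float32)
def matmul():
    return a @ a.T
# benchmark() warms up, synchronizes the GPU, and reports CPU and GPU times separately
print(benchmark(matmul, n_repeat=10))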
Can I run existing NumPy code on the GPU?
Yes! Most NumPy code works by just replacing import numpy as np with import cupy as cp. Some functions may need adjustment, but the API is highly compatible.
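For code that should accept either NumPy or CuPy arrays, cp.get_array_module() returns the matching module. A minimal sketch:
import numpy as np
import cupy as cp
def normalize(x):
    # Picks numpy or cupy based on where the array lives
    xp = cp.get_array_module(x)
    return (x - xp.mean(x)) / xp.std(x)
print(normalize(np.arange(10.0)))  # runs on CPU
print(normalize(cp.arange(10.0)))  # runs on GPU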
CuPy includes cupyx.scipy with GPU implementations of many SciPy functions. Import from cupyx.scipy instead of scipy for GPU acceleration.
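For example, the ndimage filters have GPU counterparts under cupyx.scipy.ndimage; a minimal sketch (the image here is just random data):
import cupy as cp
import cupyx.scipy.ndimage as ndi
image = cp.random.rand(2048, 2048).astype(cp.float32)
blurred = ndi.gaussian_filter(image, sigma=3)  # GPU equivalent of scipy.ndimage.gaussian_filter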
Can I use multiple GPUs?
Yes - select a device with the cp.cuda.Device(n) context manager (a with cp.cuda.Device(n): block). Each device has its own memory, so you need to explicitly transfer data between GPUs.
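A minimal sketch of using a second GPU (assumes at least two devices are visible):
import cupy as cp
with cp.cuda.Device(0):
    a = cp.random.randn(1000, 1000).astype(cp.float32)
with cp.cuda.Device(1):
    b = cp.asarray(a)  # explicit copy from GPU 0 to GPU 1
    c = b @ b.T        # computed on GPU 1
print(c.device)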
CuPy focuses on NumPy compatibility for scientific computing. JAX provides automatic differentiation and XLA compilation for ML. Use CuPy for numerical code, JAX for ML research.
| Alternative | Notes |
|---|---|
| JAX | Better for ML, has autodiff, XLA compilation |
| PyTorch | Deep learning focused, dynamic graphs |
| Numba | JIT compiler for Python, supports CUDA |
Optimize your CuPy CUDA code with RightNow AI - get real-time performance suggestions and memory analysis.