JAX is a high-performance numerical computing library that combines NumPy's API with automatic differentiation, JIT compilation via XLA, and seamless parallelization. Developed by Google, it's become the framework of choice for cutting-edge research at DeepMind and Google Brain. For CUDA developers, JAX offers a unique functional programming approach where transformations like jit, grad, vmap, and pmap compose naturally. XLA compiles your Python code into optimized GPU kernels without manual CUDA coding, while automatic vectorization and parallelization enable efficient multi-GPU and TPU scaling. This guide covers JAX's GPU setup, JIT compilation, vectorization, multi-device parallelism, and best practices for high-performance GPU computing.
CUDA Integration: JAX uses XLA (Accelerated Linear Algebra) to compile Python functions into optimized GPU code. Unlike PyTorch or TensorFlow, JAX rarely requires explicit device placement: arrays live on devices, and operations execute wherever their data already resides. XLA handles kernel fusion, memory layout optimization, and operation scheduling automatically.
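As a rough illustration of what kernel fusion buys you, the sketch below times a chain of element-wise operations with and without jit; the function and array sizes are ours, not from the guide, and exact numbers depend on your GPU.

```python
import time
import jax
import jax.numpy as jnp

def chain(x):
    # Three element-wise ops; under jit, XLA fuses them into a single GPU kernel
    return jnp.tanh(x) * 2.0 + jnp.sin(x)

chain_jit = jax.jit(chain)

x = jax.random.normal(jax.random.PRNGKey(0), (4096, 4096))
chain_jit(x).block_until_ready()  # warm-up: triggers compilation

start = time.perf_counter()
chain(x).block_until_ready()      # eager: one kernel launch per op
print(f"eager:  {time.perf_counter() - start:.4f}s")

start = time.perf_counter()
chain_jit(x).block_until_ready()  # jitted: single fused kernel
print(f"jitted: {time.perf_counter() - start:.4f}s")
```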
Install JAX with GPU support using pip.
```bash
# Install JAX with CUDA 12 support
pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

# Or for CUDA 11
pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

# Verify installation
python -c "import jax; print(f'JAX {jax.__version__}'); print(f'Devices: {jax.devices()}')"

# Check GPU backend
python -c "import jax; print(jax.default_backend())"  # Should print 'gpu'
```

Basic JAX usage with JIT compilation and gradients.
```python
import jax
import jax.numpy as jnp
from jax import grad, jit, vmap

# Simple function - automatically runs on GPU if available
def predict(params, x):
    w, b = params
    return jnp.dot(x, w) + b

def loss_fn(params, x, y):
    pred = predict(params, x)
    return jnp.mean((pred - y) ** 2)

# JIT compile for GPU optimization
loss_fn_jit = jit(loss_fn)

# Automatic differentiation
grad_fn = jit(grad(loss_fn))

# Initialize parameters on GPU
key = jax.random.PRNGKey(0)
w = jax.random.normal(key, (784, 10))
b = jnp.zeros(10)
params = (w, b)

# Training step
@jit
def train_step(params, x, y, lr=0.01):
    grads = grad(loss_fn)(params, x, y)
    new_params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return new_params

# Batch processing with vmap (automatic vectorization)
batched_predict = vmap(predict, in_axes=(None, 0))

# Synthetic data so the example runs end to end
data_key, label_key = jax.random.split(key)
x_train = jax.random.normal(data_key, (256, 784))
y_train = jax.random.normal(label_key, (256, 10))

# Training loop
for epoch in range(100):
    params = train_step(params, x_train, y_train)
    if epoch % 10 == 0:
        loss = loss_fn_jit(params, x_train, y_train)
        print(f"Epoch {epoch}, Loss: {loss:.4f}")
```

Distributed training across multiple GPUs using pmap.
```python
import jax
import jax.numpy as jnp
from functools import partial
from jax import pmap, grad

# Check available devices
devices = jax.devices()
n_devices = len(devices)
print(f"Training on {n_devices} GPUs")

# Model definition
def mlp(params, x):
    for w, b in params[:-1]:
        x = jax.nn.relu(jnp.dot(x, w) + b)
    w, b = params[-1]
    return jnp.dot(x, w) + b

def loss_fn(params, x, y):
    # y is one-hot; softmax cross-entropy via log_softmax
    logits = mlp(params, x)
    return -jnp.mean(jnp.sum(y * jax.nn.log_softmax(logits), axis=-1))

# Parallelized training step; axis_name must match the pmean call below
@partial(pmap, axis_name='devices')
def train_step(params, x, y):
    lr = 0.001
    grads = grad(loss_fn)(params, x, y)
    # All-reduce gradients across devices
    grads = jax.lax.pmean(grads, axis_name='devices')
    new_params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return new_params

# Initialize and replicate params across devices
def init_params(key, layers):
    params = []
    for in_dim, out_dim in zip(layers[:-1], layers[1:]):
        key, subkey = jax.random.split(key)
        w = jax.random.normal(subkey, (in_dim, out_dim)) * 0.01
        b = jnp.zeros(out_dim)
        params.append((w, b))
    return params

key = jax.random.PRNGKey(0)
params = init_params(key, [784, 512, 256, 10])

# Replicate params to all devices
params = jax.device_put_replicated(params, devices)

# Split batch across devices
def split_batch(x, y):
    batch_size = x.shape[0] // n_devices
    x = x.reshape(n_devices, batch_size, -1)
    y = y.reshape(n_devices, batch_size, -1)
    return x, y

# Synthetic data for illustration: one-hot labels, batch divisible by n_devices
data_key, label_key = jax.random.split(key)
x_train = jax.random.normal(data_key, (128 * n_devices, 784))
y_train = jax.nn.one_hot(jax.random.randint(label_key, (128 * n_devices,), 0, 10), 10)

# Training loop
for epoch in range(100):
    x_split, y_split = split_batch(x_train, y_train)
    params = train_step(params, x_split, y_split)
```

JIT compiles Python functions to XLA. The first call is slow (compilation); subsequent calls are fast. JIT everything in your training loop.
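A minimal sketch of that compile-then-cache behavior (the function, shapes, and timings here are illustrative):

```python
import time
import jax
import jax.numpy as jnp

@jax.jit
def step(x):
    # Arbitrary compute-heavy body to make compilation time visible
    return jnp.tanh(x @ x.T).sum()

x = jnp.ones((2048, 2048))

t0 = time.perf_counter()
step(x).block_until_ready()   # first call: traces + compiles with XLA
print(f"first call (compile): {time.perf_counter() - t0:.3f}s")

t0 = time.perf_counter()
step(x).block_until_ready()   # same shapes/dtypes: reuses the cached executable
print(f"second call (cached): {time.perf_counter() - t0:.3f}s")
```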
jax.vmap automatically vectorizes functions across a batch dimension. It's much faster than Python for loops and generates efficient GPU code.
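For instance, a function written for a single example can be batched with vmap rather than a Python loop; the pairwise-distance function below is our own illustration:

```python
import jax
import jax.numpy as jnp

def pairwise_distance(a, b):
    # Distance between two single vectors
    return jnp.sqrt(jnp.sum((a - b) ** 2))

# Vectorize over a batch of `a` vectors while broadcasting a single `b`
batched_distance = jax.vmap(pairwise_distance, in_axes=(0, None))

key = jax.random.PRNGKey(0)
a_batch = jax.random.normal(key, (128, 64))
b = jnp.zeros(64)
print(batched_distance(a_batch, b).shape)  # (128,)
```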
Use jax.lax.cond, jax.lax.fori_loop, and jax.lax.scan instead of if/for. Python control flow causes recompilation.
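A small sketch of swapping a Python loop for jax.lax.scan; the cumulative-sum body is only an illustration:

```python
import jax
import jax.numpy as jnp

@jax.jit
def cumulative_sum(xs):
    def body(carry, x):
        carry = carry + x
        return carry, carry          # (new carry, per-step output)
    _, out = jax.lax.scan(body, 0.0, xs)
    return out

print(cumulative_sum(jnp.arange(5.0)))  # [ 0.  1.  3.  6. 10.]
```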
jnp.bfloat16 provides good numerical stability for training while using half the memory of float32.
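One common pattern, sketched here with made-up shapes, is to store weights in bfloat16 and cast results back to float32 where extra precision matters:

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
w_f32 = jax.random.normal(key, (1024, 1024), dtype=jnp.float32)

# Store in bfloat16: same exponent range as float32, half the memory
w_bf16 = w_f32.astype(jnp.bfloat16)
print(w_bf16.dtype, w_bf16.nbytes, "bytes vs", w_f32.nbytes)

# Matmuls accept bfloat16 inputs; cast the result to float32 for stability
x = jax.random.normal(key, (1024,), dtype=jnp.bfloat16)
y = jnp.dot(x, w_bf16).astype(jnp.float32)
```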
Use donate_argnums in jit to tell XLA it can reuse input buffers, e.g. jit(train_step, donate_argnums=(0,)) or functools.partial(jit, donate_argnums=(0,)) as a decorator.
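A sketch of buffer donation with a hypothetical SGD-style update (the functools.partial form is the usual way to pass jit options to a decorator):

```python
from functools import partial
import jax
import jax.numpy as jnp

# Donating the params argument lets XLA write the updated params into its buffer
@partial(jax.jit, donate_argnums=(0,))
def sgd_update(params, grads, lr=0.01):
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

params = {"w": jnp.ones((512, 512))}
grads = {"w": jnp.full((512, 512), 0.1)}
params = sgd_update(params, grads)  # the old `params` buffer may be invalidated
```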
Use jax.profiler to create traces viewable in TensorBoard or Perfetto for detailed GPU analysis.
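A minimal way to capture a trace; the output directory name here is arbitrary:

```python
import jax
import jax.numpy as jnp

x = jax.random.normal(jax.random.PRNGKey(0), (4096, 4096))

# Write a trace to ./jax-trace, then open it in TensorBoard or Perfetto
with jax.profiler.trace("./jax-trace"):
    y = jnp.dot(x, x)
    y.block_until_ready()  # make sure the GPU work is captured in the trace
```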
| Task | Performance | Notes |
|---|---|---|
| Transformer Training (steps/sec) | 42 | RTX 4090, sequence length 512 |
| vmap vs for loop speedup | 50-100x | Batch size 128 |
| Multi-GPU scaling efficiency | 0.95 | 8x A100, near-linear scaling |
| XLA compilation overhead | 2-30s | One-time cost per unique input shape |
JAX uses XLA to compile entire computation graphs into optimized GPU code. It fuses operations, eliminates intermediate allocations, and optimizes memory layout. The functional design enables aggressive optimization.
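If you want to see the whole-program view that XLA receives, you can print the traced jaxpr for a small illustrative function:

```python
import jax
import jax.numpy as jnp

def f(x, w):
    return jnp.tanh(x @ w).sum()

# Inspect the traced program that XLA compiles and optimizes as one graph
x = jnp.ones((8, 16))
w = jnp.ones((16, 4))
print(jax.make_jaxpr(f)(x, w))
```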
Remove @jit temporarily to run in eager mode. Use jax.debug.print() inside jitted functions. Enable JAX_DEBUG_NANS=True to detect NaN issues.
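A short sketch combining both tips; the normalize function is made up for illustration:

```python
import jax
import jax.numpy as jnp

# Equivalent to setting the JAX_DEBUG_NANS=True environment variable
jax.config.update("jax_debug_nans", True)

@jax.jit
def normalize(x):
    total = jnp.sum(x)
    # Prints from inside the compiled function at runtime
    jax.debug.print("sum = {s}", s=total)
    return x / total

print(normalize(jnp.array([1.0, 2.0, 3.0])))
```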
jit compiles for a single device. pmap compiles for parallel execution across multiple devices, automatically handling data sharding and gradient synchronization.
Use jax.device_put(x, device) to move to GPU and jax.device_get(x) to move to CPU. JAX automatically executes operations where the data lives.
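A brief sketch of explicit placement (device index and shapes are arbitrary):

```python
import jax
import jax.numpy as jnp

x = jnp.arange(1_000_000, dtype=jnp.float32)

# Explicitly place the array on the first default-backend device (GPU if available)
gpu = jax.devices()[0]
x_gpu = jax.device_put(x, gpu)

y = jnp.sin(x_gpu) * 2.0        # runs on the device holding x_gpu
y_host = jax.device_get(y)      # NumPy array back on the host
print(type(y_host), y_host[:3])
```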
PyTorch: imperative style, larger ecosystem, easier debugging.
TensorFlow: more production tools, better mobile support.
Raw CUDA: for custom kernel development.
Optimize your JAX CUDA code with RightNow AI - get real-time performance suggestions and memory analysis.