JAX is a high-performance numerical computing library that combines NumPy's API with automatic differentiation, JIT compilation via XLA, and seamless parallelization. Developed by Google, it's become the framework of choice for cutting-edge research at DeepMind and Google Brain. For CUDA developers, JAX offers a unique functional programming approach where transformations like jit, grad, vmap, and pmap compose naturally. XLA compiles your Python code into optimized GPU kernels without manual CUDA coding, while automatic vectorization and parallelization enable efficient multi-GPU and TPU scaling. This guide covers JAX's GPU setup, JIT compilation, vectorization, multi-device parallelism, and best practices for high-performance GPU computing.
CUDA Integration: JAX uses XLA (Accelerated Linear Algebra) to compile Python functions into optimized GPU code. Unlike PyTorch or TensorFlow, JAX rarely requires explicit device placement: arrays live on devices, and operations execute wherever their data already resides. XLA handles kernel fusion, memory layout optimization, and operation scheduling automatically.
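As a rough illustration of what kernel fusion buys you, the sketch below times a chain of element-wise operations with and without jit; the function and array sizes are ours, not from the guide, and exact numbers depend on your GPU.

```python
import time
import jax
import jax.numpy as jnp

def chain(x):
    # Three element-wise ops; under jit, XLA fuses them into a single GPU kernel
    return jnp.tanh(x) * 2.0 + jnp.sin(x)

chain_jit = jax.jit(chain)

x = jax.random.normal(jax.random.PRNGKey(0), (4096, 4096))
chain_jit(x).block_until_ready()  # warm-up: triggers compilation

start = time.perf_counter()
chain(x).block_until_ready()      # eager: one kernel launch per op
print(f"eager:  {time.perf_counter() - start:.4f}s")

start = time.perf_counter()
chain_jit(x).block_until_ready()  # jitted: single fused kernel
print(f"jitted: {time.perf_counter() - start:.4f}s")
```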
Install JAX with GPU support using pip.
```bash
# Install JAX with CUDA 12 support
pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

# Or for CUDA 11
pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

# Verify installation
python -c "import jax; print(f'JAX {jax.__version__}'); print(f'Devices: {jax.devices()}')"

# Check GPU backend
python -c "import jax; print(jax.default_backend())"  # Should print 'gpu'
```

Basic JAX usage with JIT compilation and gradients.
```python
import jax
import jax.numpy as jnp
from jax import grad, jit, vmap

# Simple function - automatically runs on GPU if available
def predict(params, x):
    w, b = params
    return jnp.dot(x, w) + b

def loss_fn(params, x, y):
    pred = predict(params, x)
    return jnp.mean((pred - y) ** 2)

# JIT compile for GPU optimization
loss_fn_jit = jit(loss_fn)

# Automatic differentiation
grad_fn = jit(grad(loss_fn))

# Initialize parameters on GPU
key = jax.random.PRNGKey(0)
w = jax.random.normal(key, (784, 10))
b = jnp.zeros(10)
params = (w, b)

# Training step
@jit
def train_step(params, x, y, lr=0.01):
    grads = grad(loss_fn)(params, x, y)
    new_params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return new_params

# Batch processing with vmap (automatic vectorization)
batched_predict = vmap(predict, in_axes=(None, 0))

# Synthetic data so the example runs end to end
data_key, label_key = jax.random.split(key)
x_train = jax.random.normal(data_key, (256, 784))
y_train = jax.random.normal(label_key, (256, 10))

# Training loop
for epoch in range(100):
    params = train_step(params, x_train, y_train)
    if epoch % 10 == 0:
        loss = loss_fn_jit(params, x_train, y_train)
        print(f"Epoch {epoch}, Loss: {loss:.4f}")
```

Distributed training across multiple GPUs using pmap.
```python
import jax
import jax.numpy as jnp
from functools import partial
from jax import pmap, grad

# Check available devices
devices = jax.devices()
n_devices = len(devices)
print(f"Training on {n_devices} GPUs")

# Model definition
def mlp(params, x):
    for w, b in params[:-1]:
        x = jax.nn.relu(jnp.dot(x, w) + b)
    w, b = params[-1]
    return jnp.dot(x, w) + b

def loss_fn(params, x, y):
    # y is one-hot; softmax cross-entropy via log_softmax
    logits = mlp(params, x)
    return -jnp.mean(jnp.sum(y * jax.nn.log_softmax(logits), axis=-1))

# Parallelized training step; axis_name must match the pmean call below
@partial(pmap, axis_name='devices')
def train_step(params, x, y):
    lr = 0.001
    grads = grad(loss_fn)(params, x, y)
    # All-reduce gradients across devices
    grads = jax.lax.pmean(grads, axis_name='devices')
    new_params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return new_params

# Initialize and replicate params across devices
def init_params(key, layers):
    params = []
    for in_dim, out_dim in zip(layers[:-1], layers[1:]):
        key, subkey = jax.random.split(key)
        w = jax.random.normal(subkey, (in_dim, out_dim)) * 0.01
        b = jnp.zeros(out_dim)
        params.append((w, b))
    return params

key = jax.random.PRNGKey(0)
params = init_params(key, [784, 512, 256, 10])

# Replicate params to all devices
params = jax.device_put_replicated(params, devices)

# Split batch across devices
def split_batch(x, y):
    batch_size = x.shape[0] // n_devices
    x = x.reshape(n_devices, batch_size, -1)
    y = y.reshape(n_devices, batch_size, -1)
    return x, y

# Synthetic data for illustration: one-hot labels, batch divisible by n_devices
data_key, label_key = jax.random.split(key)
x_train = jax.random.normal(data_key, (128 * n_devices, 784))
y_train = jax.nn.one_hot(jax.random.randint(label_key, (128 * n_devices,), 0, 10), 10)

# Training loop
for epoch in range(100):
    x_split, y_split = split_batch(x_train, y_train)
    params = train_step(params, x_split, y_split)
```

JIT compiles Python functions to XLA. The first call is slow (compilation); subsequent calls are fast. JIT everything in your training loop.
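A minimal sketch of that compile-then-cache behavior (the function, shapes, and timings here are illustrative):

```python
import time
import jax
import jax.numpy as jnp

@jax.jit
def step(x):
    # Arbitrary compute-heavy body to make compilation time visible
    return jnp.tanh(x @ x.T).sum()

x = jnp.ones((2048, 2048))

t0 = time.perf_counter()
step(x).block_until_ready()   # first call: traces + compiles with XLA
print(f"first call (compile): {time.perf_counter() - t0:.3f}s")

t0 = time.perf_counter()
step(x).block_until_ready()   # same shapes/dtypes: reuses the cached executable
print(f"second call (cached): {time.perf_counter() - t0:.3f}s")
```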
jax.vmap automatically vectorizes functions across a batch dimension. It's much faster than Python for loops and generates efficient GPU code.
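For instance, a function written for a single example can be batched with vmap rather than a Python loop; the pairwise-distance function below is our own illustration:

```python
import jax
import jax.numpy as jnp

def pairwise_distance(a, b):
    # Distance between two single vectors
    return jnp.sqrt(jnp.sum((a - b) ** 2))

# Vectorize over a batch of `a` vectors while broadcasting a single `b`
batched_distance = jax.vmap(pairwise_distance, in_axes=(0, None))

key = jax.random.PRNGKey(0)
a_batch = jax.random.normal(key, (128, 64))
b = jnp.zeros(64)
print(batched_distance(a_batch, b).shape)  # (128,)
```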
Use jax.lax.cond, jax.lax.fori_loop, and jax.lax.scan instead of if/for. Python control flow causes recompilation.
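A small sketch of swapping a Python loop for jax.lax.scan; the cumulative-sum body is only an illustration:

```python
import jax
import jax.numpy as jnp

@jax.jit
def cumulative_sum(xs):
    def body(carry, x):
        carry = carry + x
        return carry, carry          # (new carry, per-step output)
    _, out = jax.lax.scan(body, 0.0, xs)
    return out

print(cumulative_sum(jnp.arange(5.0)))  # [ 0.  1.  3.  6. 10.]
```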
jnp.bfloat16 provides good numerical stability for training while using half the memory of float32.
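One common pattern, sketched here with made-up shapes, is to store weights in bfloat16 and cast results back to float32 where extra precision matters:

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
w_f32 = jax.random.normal(key, (1024, 1024), dtype=jnp.float32)

# Store in bfloat16: same exponent range as float32, half the memory
w_bf16 = w_f32.astype(jnp.bfloat16)
print(w_bf16.dtype, w_bf16.nbytes, "bytes vs", w_f32.nbytes)

# Matmuls accept bfloat16 inputs; cast the result to float32 for stability
x = jax.random.normal(key, (1024,), dtype=jnp.bfloat16)
y = jnp.dot(x, w_bf16).astype(jnp.float32)
```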
Use donate_argnums in jit to tell XLA it can reuse input buffers, e.g. jit(train_step, donate_argnums=(0,)) or functools.partial(jit, donate_argnums=(0,)) as a decorator.
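A sketch of buffer donation with a hypothetical SGD-style update (the functools.partial form is the usual way to pass jit options to a decorator):

```python
from functools import partial
import jax
import jax.numpy as jnp

# Donating the params argument lets XLA write the updated params into its buffer
@partial(jax.jit, donate_argnums=(0,))
def sgd_update(params, grads, lr=0.01):
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

params = {"w": jnp.ones((512, 512))}
grads = {"w": jnp.full((512, 512), 0.1)}
params = sgd_update(params, grads)  # the old `params` buffer may be invalidated
```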
Use jax.profiler to create traces viewable in TensorBoard or Perfetto for detailed GPU analysis.
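A minimal way to capture a trace; the output directory name here is arbitrary:

```python
import jax
import jax.numpy as jnp

x = jax.random.normal(jax.random.PRNGKey(0), (4096, 4096))

# Write a trace to ./jax-trace, then open it in TensorBoard or Perfetto
with jax.profiler.trace("./jax-trace"):
    y = jnp.dot(x, x)
    y.block_until_ready()  # make sure the GPU work is captured in the trace
```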
| Task | Performance | Notes |
|---|---|---|
| Transformer Training (steps/sec) | 42 | RTX 4090, sequence length 512 |
| vmap vs for loop speedup | 50-100x | Batch size 128 |
| Multi-GPU scaling efficiency | 0.95 | 8x A100, near-linear scaling |
| XLA compilation overhead | 2-30s | One-time cost per unique input shape |
JAX uses XLA to compile entire computation graphs into optimized GPU code. It fuses operations, eliminates intermediate allocations, and optimizes memory layout. The functional design enables aggressive optimization.
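If you want to see the whole-program view that XLA receives, you can print the traced jaxpr for a small illustrative function:

```python
import jax
import jax.numpy as jnp

def f(x, w):
    return jnp.tanh(x @ w).sum()

# Inspect the traced program that XLA compiles and optimizes as one graph
x = jnp.ones((8, 16))
w = jnp.ones((16, 4))
print(jax.make_jaxpr(f)(x, w))
```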
Remove @jit temporarily to run in eager mode. Use jax.debug.print() inside jitted functions. Enable JAX_DEBUG_NANS=True to detect NaN issues.
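A short sketch combining both tips; the normalize function is made up for illustration:

```python
import jax
import jax.numpy as jnp

# Equivalent to setting the JAX_DEBUG_NANS=True environment variable
jax.config.update("jax_debug_nans", True)

@jax.jit
def normalize(x):
    total = jnp.sum(x)
    # Prints from inside the compiled function at runtime
    jax.debug.print("sum = {s}", s=total)
    return x / total

print(normalize(jnp.array([1.0, 2.0, 3.0])))
```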
jit compiles for a single device. pmap compiles for parallel execution across multiple devices, automatically handling data sharding and gradient synchronization.
Use jax.device_put(x, device) to move to GPU and jax.device_get(x) to move to CPU. JAX automatically executes operations where the data lives.
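A brief sketch of explicit placement (device index and shapes are arbitrary):

```python
import jax
import jax.numpy as jnp

x = jnp.arange(1_000_000, dtype=jnp.float32)

# Explicitly place the array on the first default-backend device (GPU if available)
gpu = jax.devices()[0]
x_gpu = jax.device_put(x, gpu)

y = jnp.sin(x_gpu) * 2.0        # runs on the device holding x_gpu
y_host = jax.device_get(y)      # NumPy array back on the host
print(type(y_host), y_host[:3])
```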
PyTorch: imperative style, larger ecosystem, easier debugging.
TensorFlow: more production tools, better mobile support.
Raw CUDA: for custom kernel development.
Optimize your JAX CUDA code with RightNow AI - get real-time performance suggestions and memory analysis.