Triton is a programming language and compiler for writing highly efficient GPU kernels in Python. Developed by OpenAI, it bridges the gap between high-level frameworks and low-level CUDA - you write Python-like code and Triton compiles it to optimized GPU assembly. For CUDA developers, Triton eliminates much of the complexity of GPU programming. Instead of managing thread blocks, shared memory, and memory coalescing manually, you express algorithms at a higher level and Triton handles the optimization. It's particularly powerful for custom attention mechanisms, quantization kernels, and operations not well-supported by cuDNN. This guide covers Triton's programming model, kernel development, integration with PyTorch, and optimization techniques for writing production-quality GPU kernels.
CUDA Integration: Triton compiles Python functions to GPU code that runs alongside CUDA kernels. It can directly operate on PyTorch tensors and integrates with the CUDA ecosystem. Triton-generated kernels often match or exceed hand-written CUDA performance for many operations, especially matrix operations and attention.
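For instance, a Triton kernel can consume the output of a cuBLAS matmul directly and runs on the same CUDA stream as the surrounding PyTorch ops. The sketch below is illustrative only; scale_inplace_kernel is a made-up name, not part of Triton or of this guide's examples.

import torch
import triton
import triton.language as tl

@triton.jit
def scale_inplace_kernel(x_ptr, n_elements, scale, BLOCK_SIZE: tl.constexpr):
    # Scale a contiguous tensor in place, one BLOCK_SIZE chunk per program
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(x_ptr + offsets, x * scale, mask=mask)

a = torch.randn(512, 512, device='cuda')
b = torch.randn(512, 512, device='cuda')
c = a @ b  # cuBLAS kernel launched by PyTorch
grid = (triton.cdiv(c.numel(), 1024),)
scale_inplace_kernel[grid](c, c.numel(), 0.5, BLOCK_SIZE=1024)  # Triton kernel on the same stream
d = torch.relu(c)  # back to regular PyTorch ops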
Triton is included with PyTorch 2.0+ or can be installed separately.
# Triton comes with PyTorch 2.0+
pip install torch # Includes triton
# Or install standalone
pip install triton
# Verify installation
python -c "import triton; print(f'Triton {triton.__version__}')"
# Test basic kernel
python -c "
import triton
import triton.language as tl
import torch
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n):
    pid = tl.program_id(0)
    offsets = pid * 1024 + tl.arange(0, 1024)
    mask = offsets < n
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
x = torch.randn(10000, device='cuda')
y = torch.randn(10000, device='cuda')
out = torch.empty_like(x)
add_kernel[(10,)](x, y, out, x.numel())
print('Triton kernel works!')
"A simple Triton kernel demonstrating the basic programming model.
import torch
import triton
import triton.language as tl
@triton.jit
def add_kernel(
    x_ptr,  # Pointer to first input tensor
    y_ptr,  # Pointer to second input tensor
    out_ptr,  # Pointer to output tensor
    n_elements,  # Total number of elements
    BLOCK_SIZE: tl.constexpr,  # Compile-time constant
):
    # Each program handles BLOCK_SIZE elements
    pid = tl.program_id(axis=0)  # Which block am I?
    # Calculate offsets for this block
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Mask for bounds checking
    mask = offsets < n_elements
    # Load data (masked to handle edge cases)
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Compute
    output = x + y
    # Store result
    tl.store(out_ptr + offsets, output, mask=mask)
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    assert x.is_cuda and y.is_cuda
    output = torch.empty_like(x)
    n_elements = x.numel()
    # Calculate grid size
    BLOCK_SIZE = 1024
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    # Launch kernel
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=BLOCK_SIZE)
    return output
# Test
x = torch.randn(100000, device='cuda')
y = torch.randn(100000, device='cuda')
output = add(x, y)
assert torch.allclose(output, x + y)
print("Kernel verified!")An optimized fused softmax kernel with automatic tuning for best performance.
import torch
import triton
import triton.language as tl
@triton.autotune(
    configs=[
        # Tune only num_warps here: BLOCK_SIZE must cover the whole row,
        # so the wrapper below derives it from n_cols instead
        triton.Config({}, num_warps=4),
        triton.Config({}, num_warps=8),
        triton.Config({}, num_warps=16),
    ],
    key=['n_cols'],  # Retune when n_cols changes
)
@triton.jit
def fused_softmax_kernel(
    output_ptr, input_ptr,
    input_row_stride, output_row_stride,
    n_cols,
    BLOCK_SIZE: tl.constexpr,
):
    # Each program handles one row
    row_idx = tl.program_id(0)
    # Pointer to current row
    row_start_ptr = input_ptr + row_idx * input_row_stride
    # Load the whole row in one block (BLOCK_SIZE must be >= n_cols)
    col_offsets = tl.arange(0, BLOCK_SIZE)
    input_ptrs = row_start_ptr + col_offsets
    mask = col_offsets < n_cols
    # Load row; out-of-bounds lanes get -inf so they don't affect the max
    row = tl.load(input_ptrs, mask=mask, other=-float('inf'))
    # Compute softmax
    row_max = tl.max(row, axis=0)
    row = row - row_max  # Numerical stability
    numerator = tl.exp(row)
    denominator = tl.sum(numerator, axis=0)
    softmax_output = numerator / denominator
    # Store result
    output_row_start_ptr = output_ptr + row_idx * output_row_stride
    output_ptrs = output_row_start_ptr + col_offsets
    tl.store(output_ptrs, softmax_output, mask=mask)
def fused_softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    output = torch.empty_like(x)
    # The kernel loads one full row per program, so BLOCK_SIZE must be >= n_cols;
    # round up to the next power of two (tl.arange requires a power-of-two size)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    # Launch one program per row
    grid = (n_rows,)
    fused_softmax_kernel[grid](
        output, x,
        x.stride(0), output.stride(0),
        n_cols,
        BLOCK_SIZE=BLOCK_SIZE,
    )
    return output
# Benchmark against PyTorch
x = torch.randn(4096, 4096, device='cuda')
# Warmup
for _ in range(10):
    _ = fused_softmax(x)
    _ = torch.softmax(x, dim=-1)
torch.cuda.synchronize()
import time
start = time.time()
for _ in range(100):
    _ = fused_softmax(x)
torch.cuda.synchronize()
triton_time = time.time() - start
start = time.time()
for _ in range(100):
    _ = torch.softmax(x, dim=-1)
torch.cuda.synchronize()
torch_time = time.time() - start
print(f"Triton: {triton_time*10:.2f}ms, PyTorch: {torch_time*10:.2f}ms")
print(f"Speedup: {torch_time/triton_time:.2f}x")@triton.autotune tests multiple configurations and selects the best one. Include various BLOCK_SIZE and num_warps combinations.
Triton shines when you fuse multiple operations (like softmax) into a single kernel, eliminating intermediate memory reads/writes.
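For example, a bias-add followed by ReLU can be fused so the intermediate sum never leaves registers. This is an illustrative sketch (bias_relu_kernel is a made-up name, not one of the examples above):

import torch
import triton
import triton.language as tl

@triton.jit
def bias_relu_kernel(x_ptr, bias_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    b = tl.load(bias_ptr + offsets, mask=mask)
    # Both the add and the ReLU happen in registers; the intermediate
    # (x + b) is never written to global memory
    tl.store(out_ptr + offsets, tl.maximum(x + b, 0.0), mask=mask)

x = torch.randn(1 << 20, device='cuda')
bias = torch.randn(1 << 20, device='cuda')
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
bias_relu_kernel[grid](x, bias, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, torch.relu(x + bias))

Run as separate PyTorch ops, x + bias and torch.relu would each read and write the full tensor in global memory; the fused kernel reads each input once and writes the output once.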
Mark block sizes and other constants as tl.constexpr to enable compiler optimizations like loop unrolling.
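As a sketch of why this matters (the kernel and its parameter names are made up for illustration), the loop below has a compile-time trip count because ITEMS_PER_BLOCK is a tl.constexpr, so the compiler can unroll it:

import torch
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, scale,
                 BLOCK_SIZE: tl.constexpr, ITEMS_PER_BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    base = pid * BLOCK_SIZE * ITEMS_PER_BLOCK
    # Trip count is known at compile time, so this loop can be unrolled
    for i in range(ITEMS_PER_BLOCK):
        offsets = base + i * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x * scale, mask=mask)

x = torch.randn(1 << 20, device='cuda')
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024 * 4),)
scale_kernel[grid](x, out, x.numel(), 2.0, BLOCK_SIZE=1024, ITEMS_PER_BLOCK=4)
assert torch.allclose(out, x * 2.0)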
Choose BLOCK_SIZE as a multiple of 32 (the warp size), and size it so each warp's load covers at least 128 bytes (for example, 32 fp32 elements) for well-coalesced memory access.
tl.dot uses Tensor Cores automatically when shapes are compatible. Ensure dimensions are multiples of 16.
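As an illustrative sketch (the kernel name and shapes are assumptions, not the guide's code), here is a single-program matmul that loads one tile of each operand and multiplies them with tl.dot; the tile sides are multiples of 16 so the dot can map to Tensor Core instructions:

import torch
import triton
import triton.language as tl

@triton.jit
def tile_matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                       stride_am, stride_ak, stride_bk, stride_bn,
                       stride_cm, stride_cn,
                       BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Single program: assumes M <= BLOCK_M, N <= BLOCK_N, K <= BLOCK_K
    offs_m = tl.arange(0, BLOCK_M)
    offs_n = tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a = tl.load(a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak,
                mask=(offs_m[:, None] < M) & (offs_k[None, :] < K), other=0.0)
    b = tl.load(b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn,
                mask=(offs_k[:, None] < K) & (offs_n[None, :] < N), other=0.0)
    acc = tl.dot(a, b)  # accumulates in fp32; uses Tensor Cores for fp16 inputs
    tl.store(c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn, acc,
             mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))

a = torch.randn(64, 64, device='cuda', dtype=torch.float16)
b = torch.randn(64, 64, device='cuda', dtype=torch.float16)
c = torch.empty(64, 64, device='cuda', dtype=torch.float32)
tile_matmul_kernel[(1,)](a, b, c, 64, 64, 64,
                         a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                         c.stride(0), c.stride(1),
                         BLOCK_M=64, BLOCK_N=64, BLOCK_K=64)
assert torch.allclose(c, a.float() @ b.float(), atol=1e-2)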
Set the TRITON_PRINT_AUTOTUNING=1 environment variable to see which configuration the autotuner selected.
| Task | Performance | Notes |
|---|---|---|
| Fused Softmax speedup | 1.5-3x | vs torch.softmax |
| Flash Attention speedup | 2-4x | vs vanilla attention |
| Quantized MatMul | 3-5x | INT8 vs FP16 |
| Compilation time | 1-5s | First call per kernel config |
Use Triton when: 1) You need to fuse multiple operations PyTorch does separately, 2) PyTorch doesn't have an efficient implementation for your operation, 3) You need custom quantization or precision handling. Stick with PyTorch for standard ops like matmul and conv.
Triton is 3-10x faster to develop and often achieves 80-100% of hand-tuned CUDA performance. CUDA gives more control but requires managing threads, shared memory, and synchronization manually. Triton is preferred unless you need that low-level control.
Triton kernels work with autograd: wrap your Triton kernel in a torch.autograd.Function and define forward() and backward() methods. The backward pass can also be a Triton kernel.
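A minimal sketch, assuming the add() wrapper defined earlier is in scope (the class name TritonAdd is made up); since addition's gradient just passes through, backward here needs no extra kernel, but it could launch one:

import torch

class TritonAdd(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        # Forward pass runs the Triton-backed add() defined earlier
        return add(x, y)

    @staticmethod
    def backward(ctx, grad_output):
        # d(x + y)/dx = d(x + y)/dy = 1, so the upstream gradient passes through;
        # a more complex op would launch a second Triton kernel here
        return grad_output, grad_output

x = torch.randn(4096, device='cuda', requires_grad=True)
y = torch.randn(4096, device='cuda', requires_grad=True)
TritonAdd.apply(x, y).sum().backward()
assert torch.allclose(x.grad, torch.ones_like(x))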
Triton JIT compiles kernels on first use. Subsequent calls are fast. For benchmarking, always do warmup iterations. In production, you can cache compiled kernels.
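For timing, triton.testing.do_bench handles warmup and repeated measurement for you; the sketch below reuses the fused_softmax wrapper defined above:

import torch
import triton

x = torch.randn(4096, 4096, device='cuda')
# do_bench compiles on the first call, then warms up and times repeated runs,
# returning milliseconds per call
ms_triton = triton.testing.do_bench(lambda: fused_softmax(x))
ms_torch = triton.testing.do_bench(lambda: torch.softmax(x, dim=-1))
print(f"Triton: {ms_triton:.3f} ms, PyTorch: {ms_torch:.3f} ms")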
| Alternative | Compared with Triton |
|---|---|
| torch.compile | Higher-level, no kernel writing needed for standard ops |
| CuPy | NumPy-like GPU arrays, less optimization control |
| Numba | JIT for Python, supports CUDA but less GPU-optimized |