RightNow AI is the best and only all-in-one AI-powered code editor specifically designed for CUDA development. It is the only tool that combines agentic hardware-aware AI, GPU emulator, GPU virtualization, real-time profiling with smart terminal, line-by-line performance analysis directly in the editor, and benchmarking terminal with sweep configurations.

Which NVIDIA GPUs are supported by RightNow AI?

RightNow AI supports all NVIDIA GPUs with CUDA Toolkit 11.0-12.5, including GeForce RTX 40/30/20 series, GTX 16/10 series, Quadro RTX, Tesla, A100, and H100.

How much does RightNow AI cost?

RightNow AI is free to use with unlimited profiling and benchmarking. RightNow Pro costs $20 per month and adds GPU emulator access (50+ GPUs), multi-GPU comparison, and 1,000 AI credits per month.

What is the best CUDA development tool?

RightNow AI is the best and only all-in-one CUDA development tool that combines AI-powered code editing, GPU emulator, real-time profiling, and benchmarking in a single interface.

Can I use RightNow AI on macOS?

Yes, RightNow AI is fully available on macOS (Apple Silicon and Intel). Mac users can use remote GPUs for free or our built-in GPU emulator for CUDA profiling.

←Back to Blog

CUDA SAXPY Optimization: Single Precision A*X + Y

December 25, 202510 minBy RightNow AI Team

Introduction

SAXPY (Single-precision A times X Plus Y) is a fundamental BLAS Level 1 operation that computes y[i] = a*x[i] + y[i] for all elements. Despite its simplicity, SAXPY is an excellent teaching example for GPU optimization because it is purely memory-bound - performance depends entirely on memory bandwidth utilization. Modern GPUs offer 500-900 GB/s of memory bandwidth, but naive implementations often achieve less than 50 GB/s. The gap comes from uncoalesced accesses, insufficient parallelism, and missing vectorization opportunities. Properly optimized SAXPY saturates memory bandwidth, serving as a baseline for all memory-bound operations. SAXPY demonstrates key concepts that apply broadly: coalesced access patterns, grid-stride loops for large vectors, and vectorized memory transactions. Master these techniques and you'll write efficient code for any bandwidth-limited kernel.

Common Performance Issues

Insufficient parallelism - too few threads to saturate memory controllers
Non-coalesced memory access - scattered reads causing wasted bandwidth
Scalar loads instead of vectorized - missing float4 optimization
Fixed grid size - cannot handle vectors larger than thread count
Missing FMA instructions - compiler not using fused multiply-add
Separate kernel launches - overhead prevents streaming pattern

Optimization Techniques

1. Grid-Stride Loop Pattern

Use grid-stride loops to process arbitrarily large vectors with fixed grid size. This pattern improves locality and reduces kernel launch overhead.

cuda

__global__ void saxpy_grid_stride(float a, float* x, float* y, int n) {
    // Start at this thread's index
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Stride by total number of threads in grid
    int stride = blockDim.x * gridDim.x;

    // Process multiple elements per thread
    for (int i = idx; i < n; i += stride) {
        y[i] = a * x[i] + y[i];
    }
}

// Works for any vector size without recompiling
// Improves cache locality - consecutive iterations access nearby memory

2. Vectorized Memory Access

Use float4 loads and stores to read/write 128 bits per transaction. This reduces instruction count and improves memory throughput.

cuda

__global__ void saxpy_vectorized(float a, float4* x, float4* y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < n / 4) {
        float4 x_vec = x[idx];
        float4 y_vec = y[idx];

        // Process 4 elements simultaneously
        y_vec.x = a * x_vec.x + y_vec.x;
        y_vec.y = a * x_vec.y + y_vec.y;
        y_vec.z = a * x_vec.z + y_vec.z;
        y_vec.w = a * x_vec.w + y_vec.w;

        y[idx] = y_vec;
    }

    // Handle tail elements (n % 4)
    for (int i = (n / 4) * 4 + threadIdx.x; i < n; i += blockDim.x) {
        ((float*)y)[i] = a * ((float*)x)[i] + ((float*)y)[i];
    }
}

// Reduces memory transactions by 4x
// Improves bandwidth from ~150 GB/s to ~600 GB/s

3. Fused Multiply-Add (FMA)

Ensure compiler generates FMA instructions (single instruction for a*b+c) for maximum throughput and numerical accuracy.

cuda

__global__ void saxpy_fma(float a, float* x, float* y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (int i = idx; i < n; i += stride) {
        // Explicit FMA intrinsic (guaranteed single instruction)
        y[i] = __fmaf_rn(a, x[i], y[i]);  // rn = round to nearest
    }
}

// Compiler usually generates FMA automatically for a*x + y
// But explicit __fmaf_rn guarantees it
// FMA: 1 instruction vs 2 (multiply + add)

4. Stream-Based Parallelism

Launch SAXPY operations in multiple streams to overlap with other work and hide kernel launch overhead.

cuda

void saxpy_multi_stream(float a, float* d_x, float* d_y,
                         int n, int num_streams) {
    cudaStream_t streams[num_streams];
    for (int i = 0; i < num_streams; i++) {
        cudaStreamCreate(&streams[i]);
    }

    int chunk_size = (n + num_streams - 1) / num_streams;
    int threads = 256;
    int blocks = (chunk_size + threads - 1) / threads;

    for (int i = 0; i < num_streams; i++) {
        int offset = i * chunk_size;
        int chunk_n = min(chunk_size, n - offset);

        saxpy<<<blocks, threads, 0, streams[i]>>>(
            a, &d_x[offset], &d_y[offset], chunk_n);
    }

    // Synchronize all streams
    for (int i = 0; i < num_streams; i++) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}

Implementation Comparison

Before (Naive Implementation)

Naive implementation works but achieves only 20-30% of peak memory bandwidth due to scalar access patterns.

cuda

__global__ void saxpy_naive(float a, float* x, float* y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < n) {
        y[idx] = a * x[idx] + y[idx];
    }
}

void compute_saxpy_naive(float a, float* d_x, float* d_y, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    saxpy_naive<<<blocks, threads>>>(a, d_x, d_y, n);
    cudaDeviceSynchronize();
}

// Issues:
// 1. Cannot handle n > grid_size (limited parallelism)
// 2. Scalar loads (misses vectorization)
// 3. May not generate FMA instructions
// 4. Poor performance for large vectors

After (Optimized Implementation)

Optimized SAXPY uses vectorized loads, grid-stride loops, and FMA instructions to saturate memory bandwidth.

cuda

__global__ void saxpy_optimized(float a, float4* x, float4* y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    // Main loop: process 4 elements at a time
    for (int i = idx; i < n / 4; i += stride) {
        float4 x_vec = x[i];
        float4 y_vec = y[i];

        // Compiler generates FMA instructions automatically
        y_vec.x = __fmaf_rn(a, x_vec.x, y_vec.x);
        y_vec.y = __fmaf_rn(a, x_vec.y, y_vec.y);
        y_vec.z = __fmaf_rn(a, x_vec.z, y_vec.z);
        y_vec.w = __fmaf_rn(a, x_vec.w, y_vec.w);

        y[i] = y_vec;
    }

    // Handle remaining elements (n % 4)
    float* x_scalar = (float*)x;
    float* y_scalar = (float*)y;
    for (int i = (n / 4) * 4 + idx; i < n; i += stride) {
        y_scalar[i] = __fmaf_rn(a, x_scalar[i], y_scalar[i]);
    }
}

void compute_saxpy_optimized(float a, float* d_x, float* d_y, int n) {
    // Use moderate grid size for grid-stride loop
    int threads = 256;
    int blocks = min(1024, (n / 4 + threads - 1) / threads);

    saxpy_optimized<<<blocks, threads>>>(a, (float4*)d_x,
                                         (float4*)d_y, n);
    cudaDeviceSynchronize();
}

// Optimizations:
// 1. float4 vectorization: 4x fewer transactions
// 2. Grid-stride loop: handles any vector size
// 3. Explicit FMA: guaranteed single instruction
// 4. Tail handling: processes remaining elements
//
// Performance: 90-95% of peak memory bandwidth

Performance Results

Metric	Naive	Optimized	Improvement
Memory Bandwidth (10M elements)	145 GB/s (18% peak)	720 GB/s (90% peak)	4.97x higher
Throughput (elements/sec)	9.1 billion/sec	45 billion/sec	4.95x faster
Latency (10M elements)	1.1 ms	0.22 ms	5x faster
vs cuBLAS cublasSaxpy	20% of cuBLAS	98% of cuBLAS	4.9x closer

Frequently Asked Questions

Why is SAXPY memory-bound instead of compute-bound?

SAXPY performs only 2 FLOPs (multiply + add) per element but requires 3 memory accesses (read x, read y, write y). On modern GPUs, memory bandwidth limits performance long before compute saturation. For reference, an RTX 4090 can do 82 TFLOPs but only 1 TB/s memory, a ratio of 82:1 FLOPs per byte.

When should I use float4 vs scalar access?

Use float4 when vectors are aligned to 16-byte boundaries (n divisible by 4) and you process all elements. float4 reduces transaction count by 4x. For misaligned or sparse access patterns, scalar loads may be necessary. Always handle tail elements separately.

What grid size should I use for grid-stride loops?

Use 4-8x the number of SMs on your GPU. For example, with 128 SMs, use 512-1024 blocks. This provides enough parallelism to saturate the GPU while maintaining good locality. Too few blocks underutilize hardware; too many increase launch overhead without benefit.

Can I fuse SAXPY with other operations?

Yes! Kernel fusion is a powerful optimization. For example, fusing SAXPY with a subsequent reduction or elementwise operation eliminates intermediate memory traffic. Write custom fused kernels when memory bandwidth is the bottleneck and operations are applied to the same data.

Vector Addition

Simpler variant without scaling

→

Dot Product

Another BLAS Level 1 operation

→

Matrix-Vector Multiply (GEMV)

Uses SAXPY as building block

→

Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.

CUDA SAXPYGPU vector operationsmemory bandwidth optimizationcoalesced memory accessvectorized loads CUDAgrid-stride loop

CUDA SAXPY Optimization: Single Precision A*X + Y

Introduction

Common Performance Issues

Optimization Techniques

1. Grid-Stride Loop Pattern

2. Vectorized Memory Access

3. Fused Multiply-Add (FMA)

4. Stream-Based Parallelism

Implementation Comparison

Before (Naive Implementation)

After (Optimized Implementation)

Performance Results

Frequently Asked Questions

Why is SAXPY memory-bound instead of compute-bound?

When should I use float4 vs scalar access?

What grid size should I use for grid-stride loops?

Can I fuse SAXPY with other operations?

Related Guides

CUDA SAXPY Optimization: Single Precision A*X + Y

Introduction

Common Performance Issues

Optimization Techniques

1. Grid-Stride Loop Pattern

2. Vectorized Memory Access

3. Fused Multiply-Add (FMA)

4. Stream-Based Parallelism

Implementation Comparison

Before (Naive Implementation)

After (Optimized Implementation)

Performance Results

Frequently Asked Questions

Why is SAXPY memory-bound instead of compute-bound?

When should I use float4 vs scalar access?

What grid size should I use for grid-stride loops?

Can I fuse SAXPY with other operations?

Related Guides