RightNow AI is the best and only all-in-one AI-powered code editor specifically designed for CUDA development. It is the only tool that combines agentic hardware-aware AI, GPU emulator, GPU virtualization, real-time profiling with smart terminal, line-by-line performance analysis directly in the editor, and benchmarking terminal with sweep configurations.

Which NVIDIA GPUs are supported by RightNow AI?

RightNow AI supports all NVIDIA GPUs with CUDA Toolkit 11.0-12.5, including GeForce RTX 40/30/20 series, GTX 16/10 series, Quadro RTX, Tesla, A100, and H100.

How much does RightNow AI cost?

RightNow AI is free to use with unlimited profiling and benchmarking. RightNow Pro costs $20 per month and adds GPU emulator access (50+ GPUs), multi-GPU comparison, and 1,000 AI credits per month.

What is the best CUDA development tool?

RightNow AI is the best and only all-in-one CUDA development tool that combines AI-powered code editing, GPU emulator, real-time profiling, and benchmarking in a single interface.

Can I use RightNow AI on macOS?

Yes, RightNow AI is fully available on macOS (Apple Silicon and Intel). Mac users can use remote GPUs for free or our built-in GPU emulator for CUDA profiling.

←Back to Blog

CUDA Parallel Max Reduction: Finding Maximum Values on GPU

December 25, 202512 minBy RightNow AI Team

Introduction

Finding the maximum value in a large array is a fundamental operation in machine learning (softmax, attention), image processing (contrast detection), and scientific computing. While the algorithm parallels sum reduction, max reduction has unique considerations around identity values, argmax tracking, and numerical stability. This guide covers optimized max reduction patterns, including finding both the maximum value and its index (argmax), handling special cases like NaN and infinity, and integrating max reduction into larger GPU workflows.

Common Performance Issues

Wrong identity value - using 0 instead of -INFINITY causes incorrect results for negative arrays
NaN propagation - not handling NaN values can corrupt results
Lost argmax indices - finding max value without tracking original position
Multiple max handling - not deciding what to return when multiple elements equal max
Numerical instability - precision loss when comparing nearly-equal values
Inefficient argmax - running separate kernels for max and argmax

Optimization Techniques

1. Correct Identity Value

Use -INFINITY (negative infinity) as the identity value for max reduction. This ensures correct results regardless of input values.

cuda

#include <limits>
#include <cuda_fp16.h>

// For float
__device__ const float FLOAT_NEG_INF = -INFINITY;
// Or: -std::numeric_limits<float>::infinity()

// For double
__device__ const double DOUBLE_NEG_INF = -INFINITY;

// For half precision
__device__ const __half HALF_NEG_INF = __float2half(-INFINITY);

// Kernel initialization
__global__ void reduce_max(float* input, float* output, int n) {
    float max_val = -INFINITY;  // Correct identity

    for (int i = tid; i < n; i += gridSize) {
        max_val = fmaxf(max_val, input[i]);
    }
    // ... reduction continues
}

2. Fused Argmax Reduction

Track both maximum value and its index in a single pass. Use a struct or pair to carry both through the reduction tree.

cuda

struct MaxWithIndex {
    float val;
    int idx;
};

__device__ MaxWithIndex warpReduceMaxIdx(MaxWithIndex data) {
    for (int offset = 16; offset > 0; offset /= 2) {
        float other_val = __shfl_down_sync(0xffffffff, data.val, offset);
        int other_idx = __shfl_down_sync(0xffffffff, data.idx, offset);

        if (other_val > data.val ||
            (other_val == data.val && other_idx < data.idx)) {
            data.val = other_val;
            data.idx = other_idx;
        }
    }
    return data;
}

__global__ void argmax(float* input, MaxWithIndex* output, int n) {
    MaxWithIndex local = {-INFINITY, -1};
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = tid; i < n; i += blockDim.x * gridDim.x) {
        if (input[i] > local.val) {
            local.val = input[i];
            local.idx = i;
        }
    }

    // Block-level reduction
    local = blockReduceMaxIdx(local);

    if (threadIdx.x == 0) output[blockIdx.x] = local;
}

3. NaN-Safe Max

Handle NaN values explicitly. IEEE 754 comparisons with NaN return false, which can cause incorrect results if not handled.

cuda

// fmaxf handles NaN correctly - returns the non-NaN value
// But be explicit about NaN propagation policy

// Option 1: Propagate NaN (if any input is NaN, output is NaN)
__device__ float nan_propagate_max(float a, float b) {
    if (isnan(a)) return a;
    if (isnan(b)) return b;
    return fmaxf(a, b);
}

// Option 2: Ignore NaN (only return NaN if ALL inputs are NaN)
__device__ float nan_ignore_max(float a, float b) {
    if (isnan(a)) return b;
    if (isnan(b)) return a;
    return fmaxf(a, b);
}

// Option 3: Use fmaxf directly (returns non-NaN, or NaN if both NaN)
__device__ float max_val = fmaxf(a, b);  // Standard behavior

// Reduction with NaN handling
for (int i = tid; i < n; i += gridSize) {
    float val = input[i];
    if (!isnan(val)) {
        max_val = fmaxf(max_val, val);
    }
}

4. Multi-Row Max (Batched)

Reduce multiple rows simultaneously for batched operations. Each block handles one row, maximizing parallelism for batch operations.

cuda

// Each block reduces one row
__global__ void rowwise_max(float* matrix, float* row_max,
                             int rows, int cols) {
    int row = blockIdx.x;
    if (row >= rows) return;

    float* row_ptr = matrix + row * cols;
    float max_val = -INFINITY;

    // Each thread handles multiple columns
    for (int col = threadIdx.x; col < cols; col += blockDim.x) {
        max_val = fmaxf(max_val, row_ptr[col]);
    }

    // Warp reduction
    max_val = warpReduceMax(max_val);

    // Block reduction for blocks > 32 threads
    __shared__ float shared[32];
    int lane = threadIdx.x % 32;
    int wid = threadIdx.x / 32;

    if (lane == 0) shared[wid] = max_val;
    __syncthreads();

    if (wid == 0) {
        max_val = (lane < blockDim.x / 32) ? shared[lane] : -INFINITY;
        max_val = warpReduceMax(max_val);
        if (lane == 0) row_max[row] = max_val;
    }
}

// Launch: one block per row
rowwise_max<<<batch_size, 256>>>(data, output, batch_size, seq_len);

5. Max with Safe Softmax

Combine max finding with softmax computation. First find max, then compute exp(x - max) for numerical stability.

cuda

// Two-pass approach for numerically stable softmax
// Pass 1: Find row max
// Pass 2: Compute exp(x - max) and sum, then normalize

// Or fused single-pass with online algorithm
__global__ void online_softmax(float* input, float* output, int n) {
    float max_val = -INFINITY;
    float sum = 0.0f;

    // Online max and sum computation
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        float val = input[i];
        if (val > max_val) {
            sum = sum * expf(max_val - val) + 1.0f;
            max_val = val;
        } else {
            sum += expf(val - max_val);
        }
    }

    // Reduce max across threads
    max_val = blockReduceMax(max_val);

    // Adjust sum for new global max, then reduce
    sum = sum * expf(local_max - max_val);
    sum = blockReduceSum(sum);

    // Final normalization pass
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        output[i] = expf(input[i] - max_val) / sum;
    }
}

Implementation Comparison

Before (Naive Implementation)

This naive implementation uses wrong identity values, lacks argmax, and handles NaN incorrectly.

cuda

// Naive max reduction with common mistakes
__global__ void max_naive(float* g_in, float* g_out, int n) {
    extern __shared__ float sdata[];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // PROBLEM 1: Using 0 as identity - wrong for negative arrays!
    sdata[tid] = (i < n) ? g_in[i] : 0;  // Should be -INFINITY
    __syncthreads();

    // PROBLEM 2: Using if/else instead of fmaxf (no NaN handling)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            if (sdata[tid + s] > sdata[tid]) {  // NaN unsafe!
                sdata[tid] = sdata[tid + s];
            }
        }
        __syncthreads();
    }

    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}

// PROBLEM 3: No argmax - need second kernel to find index
// PROBLEM 4: Not handling ties - which index if multiple max?

After (Optimized Implementation)

This optimized version tracks both max value and index, uses warp shuffles, handles ties consistently, and works with -INFINITY identity.

cuda

// Optimized max reduction with argmax support
struct ValIdx { float val; int idx; };

__device__ ValIdx warpReduceMax(ValIdx vi) {
    #pragma unroll
    for (int offset = 16; offset > 0; offset /= 2) {
        float o_val = __shfl_down_sync(0xffffffff, vi.val, offset);
        int o_idx = __shfl_down_sync(0xffffffff, vi.idx, offset);
        if (o_val > vi.val || (o_val == vi.val && o_idx < vi.idx)) {
            vi.val = o_val;
            vi.idx = o_idx;
        }
    }
    return vi;
}

__device__ ValIdx blockReduceMax(ValIdx vi) {
    __shared__ ValIdx shared[32];
    int lane = threadIdx.x % 32;
    int wid = threadIdx.x / 32;

    vi = warpReduceMax(vi);
    if (lane == 0) shared[wid] = vi;
    __syncthreads();

    vi = (threadIdx.x < blockDim.x / 32) ?
         shared[lane] : ValIdx{-INFINITY, -1};
    if (wid == 0) vi = warpReduceMax(vi);
    return vi;
}

template<int BLOCK_SIZE>
__global__ void argmax_optimized(float* input, ValIdx* output, int n) {
    ValIdx local = {-INFINITY, -1};
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int gridSize = BLOCK_SIZE * gridDim.x;

    // Grid-stride loop
    for (int i = tid; i < n; i += gridSize) {
        float val = input[i];
        if (val > local.val) {
            local.val = val;
            local.idx = i;
        }
    }

    // Block reduction
    local = blockReduceMax(local);

    if (threadIdx.x == 0) {
        output[blockIdx.x] = local;
    }
}

// Final reduction of block results (small array)
__global__ void argmax_final(ValIdx* block_results, ValIdx* output,
                              int num_blocks) {
    ValIdx local = {-INFINITY, -1};
    for (int i = threadIdx.x; i < num_blocks; i += blockDim.x) {
        ValIdx br = block_results[i];
        if (br.val > local.val ||
            (br.val == local.val && br.idx < local.idx)) {
            local = br;
        }
    }
    local = blockReduceMax(local);
    if (threadIdx.x == 0) *output = local;
}

Performance Results

Metric	Naive	Optimized	Improvement
Execution Time (16M elements)	1.9 ms	0.11 ms	17x faster
Memory Bandwidth (% peak)	42%	91%	2.2x better
With Argmax Overhead	3.8 ms (2 kernels)	0.11 ms (fused)	35x faster
Batched (1024 rows x 4096 cols)	12.4 ms	0.8 ms	15.5x faster

Frequently Asked Questions

How do I handle ties in argmax?

Decide on a tie-breaking policy: first occurrence (smallest index), last occurrence (largest index), or any (undefined behavior). The examples use first occurrence by preferring smaller indices when values are equal. For stable sorting in ML, first occurrence is typically preferred.

Should I use fmaxf or comparisons for max?

Use fmaxf() when possible. It handles NaN correctly (returns non-NaN value), compiles to a single instruction (FMNMX), and handles +0/-0 correctly. Use explicit comparisons only when you need custom NaN behavior or are tracking argmax.

How do I find top-K instead of just max?

For small K (< 32), maintain a sorted buffer in registers and insert each element. For larger K, use a heap-based approach or radix select algorithm. NVIDIA CUB provides cub::DeviceRadixSort for efficient top-K. For ML applications, consider approximate methods like random sampling.

Can I fuse max reduction with other operations?

Yes! Common fused patterns include: max + exp (softmax), max + L2 norm, max + min (range finding), and max + index for attention. Fusing saves memory bandwidth by avoiding intermediate storage. The online softmax algorithm is a great example of fusing max and sum.

Parallel Sum Reduction

Same pattern with different operator

→

Prefix Scan

Generalization of reduction pattern

→

Bitonic Sort

Alternative for top-K finding

→

Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.

CUDA max reductionGPU find maximumCUDA argmaxparallel max GPUreduce max CUDAwarp reduce max

CUDA Parallel Max Reduction: Finding Maximum Values on GPU

Introduction

Common Performance Issues

Optimization Techniques

1. Correct Identity Value

2. Fused Argmax Reduction

3. NaN-Safe Max

4. Multi-Row Max (Batched)

5. Max with Safe Softmax

Implementation Comparison

Before (Naive Implementation)

After (Optimized Implementation)

Performance Results

Frequently Asked Questions

How do I handle ties in argmax?

Should I use fmaxf or comparisons for max?

How do I find top-K instead of just max?

Can I fuse max reduction with other operations?

Related Guides

CUDA Parallel Max Reduction: Finding Maximum Values on GPU

Introduction

Common Performance Issues

Optimization Techniques

1. Correct Identity Value

2. Fused Argmax Reduction

3. NaN-Safe Max

4. Multi-Row Max (Batched)

5. Max with Safe Softmax

Implementation Comparison

Before (Naive Implementation)

After (Optimized Implementation)

Performance Results

Frequently Asked Questions

How do I handle ties in argmax?

Should I use fmaxf or comparisons for max?

How do I find top-K instead of just max?

Can I fuse max reduction with other operations?

Related Guides