Top-k selection finds the k largest elements of an array, an operation at the heart of beam search decoding and nucleus (top-p) sampling in LLMs. A full sort costs O(n log n); efficient top-k uses radix select (O(n)) or a size-k heap (O(n log k)), and sorting just the selected k elements afterwards adds only O(k log k).
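For intuition, the same O(n + k log k) bound is easy to see on the host: a linear-time selection (std::nth_element here) isolates the k largest, and only those k get sorted. The sketch below is a hypothetical CPU reference, assuming 0 < k ≤ n, useful mainly as a correctness check for the GPU paths; it is not part of the original code.

// Host reference, requires <algorithm>, <functional>, <vector>.
// O(n) selection followed by an O(k log k) sort of the winners.
std::vector<float> topk_cpu(std::vector<float> vals, int k) {
    // After this call the k largest values occupy [0, k), unordered.
    std::nth_element(vals.begin(), vals.begin() + k, vals.end(),
                     std::greater<float>());
    // Sort only those k values in descending order.
    std::sort(vals.begin(), vals.begin() + k, std::greater<float>());
    vals.resize(k);
    return vals;
}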
Radix select: a binary search on the bit patterns of the values finds the k-th largest element, and a second pass keeps everything at or above that threshold.
// Find threshold, then filter
__global__ void topk_radix(float* vals, int n, int k,
                           float* topk_vals, int* topk_idx) {
    // Single-block kernel: one shared counter for the output slots.
    __shared__ int count;
    if (threadIdx.x == 0) count = 0;
    __syncthreads();

    // 1. Radix select to find the k-th largest value (every thread computes
    //    the same threshold; see the device helper sketched after this kernel).
    float threshold = radix_select_kth(vals, n, k);

    // 2. Filter: keep elements >= threshold, claiming output slots atomically.
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        if (vals[i] >= threshold) {
            int pos = atomicAdd(&count, 1);
            if (pos < k) {  // ties at the threshold can exceed k; drop the extras
                topk_vals[pos] = vals[i];
                topk_idx[pos] = i;
            }
        }
    }
}
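radix_select_kth is not defined in the snippet above; below is one possible device-side sketch of the bit-pattern binary search, not a tuned implementation. It maps each float to an unsigned integer with the same ordering (the standard sign-flip transform, so negatives are handled), then greedily builds, from the most significant bit down, the largest threshold that at least k elements reach. It assumes 1 ≤ k ≤ n, costs O(32·n) per calling thread, and in a real translation unit would sit above the kernel that calls it.

__device__ __forceinline__ unsigned int float_to_ordered(float f) {
    unsigned int u = __float_as_uint(f);
    // Map float bits to an unsigned integer that sorts in the same order.
    return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
}

__device__ float radix_select_kth(const float* vals, int n, int k) {
    // Greedily build, MSB first, the largest threshold T such that
    // at least k elements are >= T. That T is the k-th largest value.
    unsigned int threshold = 0;
    for (int bit = 31; bit >= 0; --bit) {
        unsigned int candidate = threshold | (1u << bit);
        int count = 0;
        for (int i = 0; i < n; ++i)
            if (float_to_ordered(vals[i]) >= candidate) ++count;
        if (count >= k) threshold = candidate;
    }
    // Undo the ordering transform to recover the float value.
    unsigned int u = (threshold & 0x80000000u) ? (threshold & 0x7fffffffu)
                                               : ~threshold;
    return __uint_as_float(u);
}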
The naive baseline is an O(n log n) full sort, wasteful for small k:
void topk_naive(float* vals, int* idx, int n, int k,
                float* topk_vals, int* topk_idx) {
    // Sort everything in descending order, then copy the first k (device-to-device).
    thrust::sort_by_key(thrust::device, vals, vals + n, idx,
                        thrust::greater<float>());
    cudaMemcpy(topk_vals, vals, k * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaMemcpy(topk_idx, idx, k * sizeof(int), cudaMemcpyDeviceToDevice);
}
For k ≤ 32, a warp-cooperative scheme keeps the running top-k in registers, one element per lane; the lane holding the current minimum is replaced whenever a larger candidate arrives.
// For small k (≤ 32), each lane holds one element of the current top-k set
__device__ void warp_topk(const float* vals, int n, int k,
                          float* topk_vals, int* topk_idx) {
    int lane = threadIdx.x & 31;
    // Lanes >= k start at +INFINITY so they never hold the minimum slot
    float my_val = (lane < k) ? -INFINITY : INFINITY;
    int my_idx = -1;

    for (int base = 0; base < n; base += 32) {
        int i = base + lane;
        float v = (i < n) ? vals[i] : -INFINITY;

        // Insert the warp's 32 new candidates one at a time
        for (int src = 0; src < 32; ++src) {
            if (base + src >= n) break;
            float cand = __shfl_sync(0xffffffffu, v, src);

            // Warp-wide minimum of the currently held top-k values
            float min_val = my_val;
            for (int off = 16; off > 0; off >>= 1)
                min_val = fminf(min_val, __shfl_xor_sync(0xffffffffu, min_val, off));

            // Ballot to coordinate replacement: the lowest lane holding the
            // minimum takes the candidate if it is larger
            unsigned int holders = __ballot_sync(0xffffffffu, my_val == min_val);
            if (cand > min_val && lane == __ffs(holders) - 1) {
                my_val = cand;
                my_idx = base + src;
            }
        }
    }
    // Lanes 0..k-1 now hold the top-k values (unsorted)
    if (lane < k) {
        topk_vals[lane] = my_val;
        topk_idx[lane] = my_idx;
    }
}
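warp_topk is a __device__ function, so it needs a wrapping kernel to be launched. The wrapper below is a hypothetical example (not from the original code): one 32-thread warp handles one row of a score matrix, which matches how per-row top-k shows up in sampling.

// Hypothetical wrapper: one warp per row of a [rows x n] score matrix.
__global__ void topk_rows(const float* scores, int rows, int n, int k,
                          float* topk_vals, int* topk_idx) {
    int row = blockIdx.x;
    if (row < rows)
        warp_topk(scores + row * n, n, k,
                  topk_vals + row * k, topk_idx + row * k);
}

// Launch one 32-thread warp per row, e.g.:
//   topk_rows<<<rows, 32>>>(d_scores, rows, n, k, d_topk_vals, d_topk_idx);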
| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Latency (n=50000, k=50) | 1.2ms | 0.08ms | 15x faster |
Choosing a strategy: for k ≤ 32, use the warp-level heap; for k ≤ 1024, a block-level heap; for larger k, radix select; and for very large k (> n/10), a full sort is competitive. A dispatch sketch follows below.
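As a rough illustration of that decision rule, here is a hypothetical host-side dispatcher (not from the original post). It reuses topk_rows, topk_radix, and topk_naive from above; since a block-level heap kernel is not shown here, the 32 < k ≤ 1024 range falls back to radix select.

// Hypothetical dispatcher applying the rules of thumb above.
// All pointers are device pointers; d_idx is a 0..n-1 index array for the sort path.
void topk_dispatch(float* d_vals, int* d_idx, int n, int k,
                   float* topk_vals, int* topk_idx) {
    if (k <= 32) {
        // One warp scans the whole array.
        topk_rows<<<1, 32>>>(d_vals, /*rows=*/1, n, k, topk_vals, topk_idx);
    } else if (k <= n / 10) {
        // Radix select (a block-level heap, not shown, would also fit for k <= 1024).
        topk_radix<<<1, 256>>>(d_vals, n, k, topk_vals, topk_idx);
    } else {
        // Very large k: sorting everything is competitive.
        topk_naive(d_vals, d_idx, n, k, topk_vals, topk_idx);
    }
}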
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.