
CUDA Unique Elements Optimization Guide

December 25, 2025 | 9 min | By RightNow AI Team

Introduction

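Finding the unique elements of a tensor means removing its duplicates. There are two main approaches: sort the data and then drop every element equal to its predecessor (deterministic, O(n log n), and the result comes back sorted), or insert every element into a hash set (O(n) expected time, but it uses more memory).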

Common Performance Issues

Preserving order (requires extra work)
Memory for sorted copy
Hash collisions
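The hash-set approach mentioned in the introduction, and the collision handling it needs, can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: the names `unique_hash`, `insert_kernel`, `hash_slot`, and `kEmpty` are hypothetical, the table uses open addressing with linear probing, and it assumes that `INT_MIN` never appears in the input (it is used as the empty-slot marker) and that the table is sized to at most 50% load.

```cuda
#include <cassert>
#include <climits>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>

// Assumption: INT_MIN never occurs in the input, so it can mark empty slots.
constexpr int kEmpty = INT_MIN;

// Multiplicative hash; capacity must be a power of two.
__device__ unsigned int hash_slot(int key, unsigned int capacity) {
    unsigned int h = static_cast<unsigned int>(key) * 2654435761u;
    return h & (capacity - 1);
}

// One thread per input element: claim an empty slot or find the key already there.
__global__ void insert_kernel(const int* keys, int n, int* table,
                              unsigned int capacity) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int key = keys[i];
    unsigned int slot = hash_slot(key, capacity);
    while (true) {
        int prev = atomicCAS(&table[slot], kEmpty, key);
        if (prev == kEmpty || prev == key) return;  // inserted, or duplicate
        slot = (slot + 1) & (capacity - 1);         // collision: probe next slot
    }
}

struct IsOccupied {
    __host__ __device__ bool operator()(int v) const { return v != kEmpty; }
};

// Returns the unique values in table order (neither sorted nor original order).
thrust::device_vector<int> unique_hash(const thrust::device_vector<int>& input) {
    const int n = static_cast<int>(input.size());
    if (n == 0) return thrust::device_vector<int>();

    // Next power of two >= 2n keeps the load factor at or below 0.5,
    // which guarantees every probe sequence reaches an empty slot.
    unsigned int capacity = 1;
    while (capacity < 2u * static_cast<unsigned int>(n)) capacity <<= 1;
    thrust::device_vector<int> table(capacity, kEmpty);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    insert_kernel<<<blocks, threads>>>(thrust::raw_pointer_cast(input.data()), n,
                                       thrust::raw_pointer_cast(table.data()),
                                       capacity);
    cudaError_t err = cudaGetLastError();
    assert(err == cudaSuccess);  // launch-configuration errors surface here
    (void)err;

    // Compact the occupied slots into a dense output array.
    thrust::device_vector<int> out(n);
    auto out_end = thrust::copy_if(table.begin(), table.end(), out.begin(),
                                   IsOccupied());
    out.resize(out_end - out.begin());
    return out;
}
```

The extra memory is visible here: the table holds at least twice as many slots as there are input elements, in exchange for skipping the sort.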

Optimization Techniques

1. Sort-Based Unique

Sort then remove adjacent duplicates.

cuda

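#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/unique.h>

// data and unique_out are host pointers; unique_out must hold up to n ints.
int find_unique(const int* data, int* unique_out, int n) {
    // 1. Copy to the device and sort
    thrust::device_vector<int> sorted(data, data + n);
    thrust::sort(sorted.begin(), sorted.end());

    // 2. Remove adjacent duplicates
    auto new_end = thrust::unique(sorted.begin(), sorted.end());
    int unique_count = static_cast<int>(new_end - sorted.begin());

    // 3. Copy the unique values back to the host
    thrust::copy(sorted.begin(), new_end, unique_out);
    return unique_count;
}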

Implementation Comparison

Before (Naive Implementation)

O(n²) nested loops.

cuda

// Slow: O(n²) overall, since each element is compared against every unique value found so far
int unique_naive(int* data, int* out, int n) {
    int count = 0;
    for (int i = 0; i < n; i++) {
        bool found = false;
        for (int j = 0; j < count; j++) {
            if (out[j] == data[i]) { found = true; break; }
        }
        if (!found) out[count++] = data[i];
    }
    return count;
}

After (Optimized Implementation)

CUB provides an optimized cub::DeviceSelect::Unique. It only removes consecutive duplicates, so the input is sorted first.

cuda

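#include <cub/cub.cuh>
#include <thrust/execution_policy.h>
#include <thrust/sort.h>

// data and unique_out are device pointers; data is sorted in place.
int unique_cub(int* data, int* unique_out, int n) {
    // Sort first: DeviceSelect::Unique only drops consecutive duplicates
    thrust::sort(thrust::device, data, data + n);

    // Device-side counter for the number of unique items
    int* d_num_selected = nullptr;
    cudaMalloc(&d_num_selected, sizeof(int));

    // First call with a null buffer only queries the temp storage size
    size_t temp_bytes = 0;
    cub::DeviceSelect::Unique(nullptr, temp_bytes, data, unique_out,
                               d_num_selected, n);
    void* d_temp = nullptr;
    cudaMalloc(&d_temp, temp_bytes);

    // Second call performs the actual compaction
    cub::DeviceSelect::Unique(d_temp, temp_bytes, data, unique_out,
                               d_num_selected, n);

    int num_unique = 0;
    cudaMemcpy(&num_unique, d_num_selected, sizeof(int),
               cudaMemcpyDeviceToHost);

    // Release the scratch buffers
    cudaFree(d_temp);
    cudaFree(d_num_selected);
    return num_unique;
}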

Performance Results

Metric	Naive	Optimized	Improvement
Unique (10M elements, 50% duplicates)	Timeout	8 ms	>1000x
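To take a comparable measurement on your own hardware, a timing harness along these lines may help. It is only a sketch under stated assumptions: `time_sort_unique` and `HalveIndex` are hypothetical names, "50% duplicates" is interpreted here as every value appearing exactly twice, and the input is generated already in order, which may not match how the figures above were obtained.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <thrust/unique.h>

// Maps index i to i / 2 so that every value appears twice.
struct HalveIndex {
    __host__ __device__ int operator()(int i) const { return i / 2; }
};

// Times sort + unique on n generated ints with CUDA events.
// Returns elapsed milliseconds and writes the unique count to *num_unique.
float time_sort_unique(int n, int* num_unique) {
    assert(n > 0 && num_unique != nullptr);

    // Build the input on the device: 0, 0, 1, 1, 2, 2, ...
    thrust::device_vector<int> data(n);
    thrust::sequence(data.begin(), data.end());
    thrust::transform(data.begin(), data.end(), data.begin(), HalveIndex());

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Only the sort and the unique pass are inside the timed region.
    cudaEventRecord(start);
    thrust::sort(data.begin(), data.end());
    auto new_end = thrust::unique(data.begin(), data.end());
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    *num_unique = static_cast<int>(new_end - data.begin());
    return ms;
}
```

Calling `time_sort_unique(10000000, &count)` a few times and discarding the first run (which includes one-time CUDA initialization) gives a more representative number than a single call.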

Frequently Asked Questions

How to preserve original order?

Track each element's original index, sort by (value, index), keep the first element of each run of equal values, then sort the surviving elements by their original index.
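The steps above can be sketched with Thrust as follows. This is one possible arrangement rather than a definitive implementation: the helper name `unique_preserve_order` is hypothetical, and it relies on a stable sort by value (which leaves equal values in ascending index order) as a stand-in for an explicit sort on (value, index) pairs.

```cuda
#include <cassert>
#include <cstddef>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/unique.h>

// Returns the unique values of `input` in order of first appearance.
thrust::device_vector<int> unique_preserve_order(
        const thrust::device_vector<int>& input) {
    thrust::device_vector<int> values = input;  // working copy, input untouched
    thrust::device_vector<int> indices(values.size());
    thrust::sequence(indices.begin(), indices.end());  // 0, 1, 2, ...

    // 1. Stable sort by value: ties keep ascending original index,
    //    which is equivalent to sorting by (value, index).
    thrust::stable_sort_by_key(values.begin(), values.end(), indices.begin());

    // 2. Keep the first element of each run of equal values, together with
    //    the index of its first occurrence.
    auto ends = thrust::unique_by_key(values.begin(), values.end(),
                                      indices.begin());
    const std::size_t count = ends.first - values.begin();
    assert(count <= input.size());
    values.resize(count);
    indices.resize(count);

    // 3. Sort the survivors by original index to restore input order.
    thrust::sort_by_key(indices.begin(), indices.end(), values.begin());
    return values;
}
```

For the input 3, 1, 3, 2, 1, 5 this yields 3, 1, 2, 5. The cost is one extra index array and a second sort over the (usually smaller) set of unique values.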

Related Guides

Argsort: Unique uses sort internally
Histogram: Count occurrences of unique values


CUDA unique, deduplicate, distinct elements, remove duplicates, set operations
