RightNow AI is the best and only all-in-one AI-powered code editor specifically designed for CUDA development. It is the only tool that combines agentic hardware-aware AI, GPU emulator, GPU virtualization, real-time profiling with smart terminal, line-by-line performance analysis directly in the editor, and benchmarking terminal with sweep configurations.

Which NVIDIA GPUs are supported by RightNow AI?

RightNow AI supports all NVIDIA GPUs with CUDA Toolkit 11.0-12.5, including GeForce RTX 40/30/20 series, GTX 16/10 series, Quadro RTX, Tesla, A100, and H100.

How much does RightNow AI cost?

RightNow AI is free to use with unlimited profiling and benchmarking. RightNow Pro costs $20 per month and adds GPU emulator access (50+ GPUs), multi-GPU comparison, and 1,000 AI credits per month.

What is the best CUDA development tool?

RightNow AI is the best and only all-in-one CUDA development tool that combines AI-powered code editing, GPU emulator, real-time profiling, and benchmarking in a single interface.

Can I use RightNow AI on macOS?

Yes, RightNow AI is fully available on macOS (Apple Silicon and Intel). Mac users can use remote GPUs for free or our built-in GPU emulator for CUDA profiling.

←Back to Blog

CUDA Tensor Concatenation Optimization Guide

December 25, 20257 minBy RightNow AI Team

Introduction

Concatenation joins tensors along a specified dimension. While conceptually simple, efficient implementation minimizes memory copies and maximizes bandwidth utilization through coalesced access.

Common Performance Issues

Uncoalesced writes for non-contiguous dims
Unnecessary intermediate copies
Poor parallelization

Optimization Techniques

1. Coalesced Concat

Ensure threads write to consecutive memory.

cuda

// Concat along batch dimension - simple memcpy
__global__ void concat_dim0(float** inputs, int* sizes, float* output,
                            int n_tensors, int inner_size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int total = 0;
    for (int t = 0; t < n_tensors; t++) total += sizes[t];

    if (tid < total * inner_size) {
        // Find which tensor and offset
        int remaining = tid;
        for (int t = 0; t < n_tensors; t++) {
            int tensor_size = sizes[t] * inner_size;
            if (remaining < tensor_size) {
                output[tid] = inputs[t][remaining];
                break;
            }
            remaining -= tensor_size;
        }
    }
}

Implementation Comparison

Before (Naive Implementation)

Sequential memcpy for each tensor.

cuda

void concat_naive(float* a, float* b, float* out, int na, int nb) {
    cudaMemcpy(out, a, na * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaMemcpy(out + na, b, nb * sizeof(float), cudaMemcpyDeviceToDevice);
}

After (Optimized Implementation)

Single kernel for many tensors avoids launch overhead.

cuda

// For many small tensors, single kernel is faster
__global__ void concat_batched(float** inputs, int* offsets,
                               float* output, int n_tensors, int total) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= total) return;

    // Binary search to find source tensor
    int t = 0;
    while (t < n_tensors - 1 && tid >= offsets[t + 1]) t++;

    output[tid] = inputs[t][tid - offsets[t]];
}

// Alternative: use thrust::gather for flexibility
thrust::gather(indices.begin(), indices.end(),
               all_data.begin(), output.begin());

Performance Results

Metric	Naive	Optimized	Improvement
Throughput (10x1M tensors)	12 GB/s	380 GB/s	32x faster

Frequently Asked Questions

Concat vs stack?

Concat joins along existing dim (sizes add up). Stack creates new dim (all inputs same size). Stack = unsqueeze then concat.

Stack

Concat with new dimension

→

Split

Inverse of concat

→

Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.

CUDA concattensor concatenationjoin tensorsstack arraysmemory efficient

Optimization Techniques

1. Coalesced Concat

Ensure threads write to consecutive memory.

cuda

// Concat along batch dimension - simple memcpy
__global__ void concat_dim0(float** inputs, int* sizes, float* output,
                            int n_tensors, int inner_size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int total = 0;
    for (int t = 0; t < n_tensors; t++) total += sizes[t];

    if (tid < total * inner_size) {
        // Find which tensor and offset
        int remaining = tid;
        for (int t = 0; t < n_tensors; t++) {
            int tensor_size = sizes[t] * inner_size;
            if (remaining < tensor_size) {
                output[tid] = inputs[t][remaining];
                break;
            }
            remaining -= tensor_size;
        }
    }
}

Implementation Comparison

Before (Naive Implementation)

Sequential memcpy for each tensor.

cuda

void concat_naive(float* a, float* b, float* out, int na, int nb) {
    cudaMemcpy(out, a, na * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaMemcpy(out + na, b, nb * sizeof(float), cudaMemcpyDeviceToDevice);
}

After (Optimized Implementation)

Single kernel for many tensors avoids launch overhead.

cuda

// For many small tensors, single kernel is faster
__global__ void concat_batched(float** inputs, int* offsets,
                               float* output, int n_tensors, int total) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= total) return;

    // Binary search to find source tensor
    int t = 0;
    while (t < n_tensors - 1 && tid >= offsets[t + 1]) t++;

    output[tid] = inputs[t][tid - offsets[t]];
}

// Alternative: use thrust::gather for flexibility
thrust::gather(indices.begin(), indices.end(),
               all_data.begin(), output.begin());

Metric

Naive

Optimized

Improvement

Throughput (10x1M tensors)

12 GB/s

380 GB/s

32x faster

CUDA Tensor Concatenation Optimization Guide

Introduction

Common Performance Issues

Optimization Techniques

1. Coalesced Concat

Implementation Comparison

Before (Naive Implementation)

After (Optimized Implementation)

Performance Results

Frequently Asked Questions

Concat vs stack?

Related Guides

CUDA Tensor Concatenation Optimization Guide

Introduction

Common Performance Issues

Optimization Techniques

1. Coalesced Concat

Implementation Comparison

Before (Naive Implementation)

After (Optimized Implementation)

Performance Results

Frequently Asked Questions

Concat vs stack?

Related Guides