Image resizing is essential for data augmentation, preprocessing, and video processing. NPP (NVIDIA Performance Primitives) provides optimized implementations, but custom kernels offer more flexibility for batched operations. This guide covers interpolation methods, batch processing, and integration with training pipelines.
Key optimizations:

- Process multiple images in a single kernel launch.
- Use texture units for free hardware bilinear interpolation.
- Use nppiResize for optimized single-image resizes (see the sketch below).
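For the single-image path, NPP does the work in one call. A minimal sketch, assuming 8-bit RGB images already resident on the device; the `nppiResize_8u_C3R` signature here matches recent CUDA toolkits, so verify it against `nppi_geometry_transforms.h` for your version:

```cuda
#include <nppi_geometry_transforms.h>

// Resize one 8-bit RGB image on the device with NPP (bilinear).
// d_src/d_dst are device pointers; steps are row pitches in bytes.
NppStatus resize_npp(const Npp8u* d_src, int src_w, int src_h, int src_step,
                     Npp8u* d_dst, int dst_w, int dst_h, int dst_step) {
    NppiSize srcSize = {src_w, src_h};
    NppiRect srcROI  = {0, 0, src_w, src_h};
    NppiSize dstSize = {dst_w, dst_h};
    NppiRect dstROI  = {0, 0, dst_w, dst_h};
    return nppiResize_8u_C3R(d_src, src_step, srcSize, srcROI,
                             d_dst, dst_step, dstSize, dstROI,
                             NPPI_INTER_LINEAR);
}
```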
Start with basic bilinear interpolation, computing source coordinates manually and blending the four nearest texels. Note the border clamping: without it, edge pixels read out of bounds, since the floor of `src_x` can be -1 and its neighbor can be `src_w`.

```cuda
__global__ void resize_bilinear(const float* __restrict__ src, float* __restrict__ dst,
                                int src_h, int src_w,
                                int dst_h, int dst_w, int channels) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dst_w || y >= dst_h) return;

    float scale_x = (float)src_w / dst_w;
    float scale_y = (float)src_h / dst_h;
    // Half-pixel offsets align the centers of source and destination pixels
    float src_x = (x + 0.5f) * scale_x - 0.5f;
    float src_y = (y + 0.5f) * scale_y - 0.5f;

    int x0 = (int)floorf(src_x);
    int y0 = (int)floorf(src_y);
    float wx = src_x - x0, wy = src_y - y0;  // weights before clamping
    // Clamp sample coordinates to the image border
    int x1 = min(x0 + 1, src_w - 1);
    int y1 = min(y0 + 1, src_h - 1);
    x0 = max(x0, 0);
    y0 = max(y0, 0);

    for (int c = 0; c < channels; c++) {
        float v00 = src[(y0 * src_w + x0) * channels + c];
        float v01 = src[(y0 * src_w + x1) * channels + c];
        float v10 = src[(y1 * src_w + x0) * channels + c];
        float v11 = src[(y1 * src_w + x1) * channels + c];
        float val = (1 - wy) * ((1 - wx) * v00 + wx * v01)
                  + wy       * ((1 - wx) * v10 + wx * v11);
        dst[(y * dst_w + x) * channels + c] = val;
    }
}
```
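A typical launch covers the destination image with 16×16 blocks; `d_src` and `d_dst` are illustrative device-pointer names:

```cuda
dim3 block(16, 16);
dim3 grid((dst_w + block.x - 1) / block.x,
          (dst_h + block.y - 1) / block.y);
resize_bilinear<<<grid, block>>>(d_src, d_dst, src_h, src_w, dst_h, dst_w, 3);
```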
Texture units provide free hardware bilinear interpolation. Storing the batch in a layered array lets `blockIdx.z` select the image, so one launch resizes the whole batch. Texture references are deprecated and removed in CUDA 12+, so this version takes a `cudaTextureObject_t`:

```cuda
__global__ void resize_texture_batched(cudaTextureObject_t tex, float4* dst,
                                       int dst_h, int dst_w,
                                       float scale_x, float scale_y, int batch) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int b = blockIdx.z;  // one layer per image in the batch
    if (x >= dst_w || y >= dst_h || b >= batch) return;

    float src_x = (x + 0.5f) * scale_x;
    float src_y = (y + 0.5f) * scale_y;
    // Hardware bilinear interpolation via the texture unit -- free!
    float4 val = tex2DLayered<float4>(tex, src_x, src_y, b);
    dst[b * dst_h * dst_w + y * dst_w + x] = val;
}
```
Creating the texture object is a one-time host-side setup:

```cuda
cudaTextureObject_t createTexture(cudaArray_t array) {
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = array;  // a layered array for the batched kernel

    cudaTextureDesc texDesc = {};
    texDesc.filterMode = cudaFilterModeLinear;      // hardware bilinear
    texDesc.addressMode[0] = cudaAddressModeClamp;  // clamp out-of-range reads
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.readMode = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;                   // kernel uses pixel coordinates

    cudaTextureObject_t tex;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}
```
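Wiring it together for a batch: a minimal host-side sketch, assuming `batch` float4 (RGBA) images of identical size sit contiguously in a host buffer `h_src`, with error checking elided for brevity:

```cuda
// Allocate a layered cudaArray: width x height, one layer per image
cudaChannelFormatDesc ch = cudaCreateChannelDesc<float4>();
cudaArray_t arr;
cudaExtent extent = make_cudaExtent(src_w, src_h, batch);
cudaMalloc3DArray(&arr, &ch, extent, cudaArrayLayered);

// Copy all layers in one cudaMemcpy3D
cudaMemcpy3DParms p = {};
p.srcPtr = make_cudaPitchedPtr(h_src, src_w * sizeof(float4), src_w, src_h);
p.dstArray = arr;
p.extent = extent;
p.kind = cudaMemcpyHostToDevice;
cudaMemcpy3D(&p);

cudaTextureObject_t tex = createTexture(arr);

// One launch covers the whole batch via blockIdx.z
dim3 block(16, 16);
dim3 grid((dst_w + block.x - 1) / block.x,
          (dst_h + block.y - 1) / block.y,
          batch);
resize_texture_batched<<<grid, block>>>(tex, d_dst, dst_h, dst_w,
                                        (float)src_w / dst_w,
                                        (float)src_h / dst_h, batch);
```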
| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Batch resize throughput | 500 img/s | 5000 img/s | 10x |
Bicubic is sharper but roughly 4x slower, since it blends a 4×4 neighborhood (16 taps) instead of bilinear's 2×2 (4 taps). For training-data augmentation, bilinear is usually sufficient; bicubic matters when final image quality counts.
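If you do need bicubic, here is a minimal single-channel sketch using Catmull-Rom cubic weights (a = -0.5, the common choice); extending it to multiple channels follows the bilinear kernel above:

```cuda
// Catmull-Rom style cubic convolution weight, support |t| < 2
__device__ float cubic_w(float t) {
    const float a = -0.5f;
    t = fabsf(t);
    if (t <= 1.0f) return ((a + 2.0f) * t - (a + 3.0f)) * t * t + 1.0f;
    if (t <  2.0f) return ((a * t - 5.0f * a) * t + 8.0f * a) * t - 4.0f * a;
    return 0.0f;
}

__global__ void resize_bicubic(const float* __restrict__ src, float* __restrict__ dst,
                               int src_h, int src_w, int dst_h, int dst_w) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dst_w || y >= dst_h) return;

    float src_x = (x + 0.5f) * (float)src_w / dst_w - 0.5f;
    float src_y = (y + 0.5f) * (float)src_h / dst_h - 0.5f;
    int ix = (int)floorf(src_x), iy = (int)floorf(src_y);
    float fx = src_x - ix, fy = src_y - iy;

    float acc = 0.0f;
    // 4x4 neighborhood: 16 taps vs bilinear's 4, hence the ~4x cost
    for (int j = -1; j <= 2; j++) {
        int sy = min(max(iy + j, 0), src_h - 1);  // clamp to border
        float wy = cubic_w(fy - j);
        for (int i = -1; i <= 2; i++) {
            int sx = min(max(ix + i, 0), src_w - 1);
            acc += wy * cubic_w(fx - i) * src[sy * src_w + sx];
        }
    }
    dst[y * dst_w + x] = acc;
}
```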
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.