2D convolution is the computational backbone of Convolutional Neural Networks (CNNs), accounting for over 90% of inference time in models like ResNet and VGG. A single convolution layer may perform billions of floating-point operations, making GPU optimization essential for real-time inference. This guide covers the major approaches to implementing convolution on GPUs: direct convolution with shared memory tiling, the im2col transformation that converts convolution to matrix multiplication, and the Winograd algorithm that reduces arithmetic complexity for small filters. We also discuss when to use cuDNN versus custom kernels.
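To make that scale concrete with a rough, illustrative calculation (the layer dimensions here are chosen only as an example): a convolution layer performs about 2 × K × C × FH × FW × OH × OW floating-point operations per image, so a 3x3 layer with 256 input and 256 output channels producing a 56x56 output costs roughly 2 × 256 × 256 × 9 × 56 × 56 ≈ 3.7 GFLOPs, before multiplying by the batch size.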
Load input tiles and filter weights into shared memory to enable data reuse: each input element is read by multiple output positions because neighboring filter windows overlap.
#define TILE_SIZE 16
#define FILTER_SIZE 3
#define BLOCK_SIZE (TILE_SIZE + FILTER_SIZE - 1) // 18
// Single-channel convolution shown for clarity (the C parameter is unused here);
// launch with blockDim = (BLOCK_SIZE, BLOCK_SIZE) so the halo is loaded cooperatively.
__global__ void conv2d_tiled(float* output, float* input, float* filter,
int W, int H, int C) {
__shared__ float sh_input[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float sh_filter[FILTER_SIZE][FILTER_SIZE];
int tx = threadIdx.x, ty = threadIdx.y;
int ox = blockIdx.x * TILE_SIZE + tx;
int oy = blockIdx.y * TILE_SIZE + ty;
// Load filter into shared memory (first threads only)
if (tx < FILTER_SIZE && ty < FILTER_SIZE) {
sh_filter[ty][tx] = filter[ty * FILTER_SIZE + tx];
}
// Load input tile including halo region
int ix = blockIdx.x * TILE_SIZE + tx - FILTER_SIZE/2;
int iy = blockIdx.y * TILE_SIZE + ty - FILTER_SIZE/2;
if (tx < BLOCK_SIZE && ty < BLOCK_SIZE) {
if (ix >= 0 && ix < W && iy >= 0 && iy < H)
sh_input[ty][tx] = input[iy * W + ix];
else
sh_input[ty][tx] = 0; // Zero padding
}
__syncthreads();
// Compute convolution
if (tx < TILE_SIZE && ty < TILE_SIZE && ox < W && oy < H) {
float sum = 0;
for (int fy = 0; fy < FILTER_SIZE; fy++) {
for (int fx = 0; fx < FILTER_SIZE; fx++) {
sum += sh_input[ty + fy][tx + fx] * sh_filter[fy][fx];
}
}
output[oy * W + ox] = sum;
}
}

Convert convolution to matrix multiplication by rearranging input patches into columns. Leverage highly optimized GEMM routines for the actual computation.
// im2col: extract patches as columns
// Input: [C, H, W], Filter: [K, C, FH, FW]
// Output: [C*FH*FW, OH*OW] matrix
__global__ void im2col(float* data_col, float* data_im,
int C, int H, int W, int FH, int FW,
int OH, int OW, int pad, int stride) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int col_size = C * FH * FW * OH * OW;
if (idx < col_size) {
int w_out = idx % OW;
int h_out = (idx / OW) % OH;
int c_col = idx / (OW * OH);
int c_im = c_col / (FH * FW);
int h_off = (c_col / FW) % FH;
int w_off = c_col % FW;
int h_in = h_out * stride - pad + h_off;
int w_in = w_out * stride - pad + w_off;
data_col[idx] = (h_in >= 0 && h_in < H && w_in >= 0 && w_in < W)
? data_im[(c_im * H + h_in) * W + w_in]
: 0;
}
}
// Convolution via im2col + cuBLAS GEMM
// Filter: [K, C*FH*FW] @ im2col_matrix: [C*FH*FW, OH*OW]
// Result: [K, OH*OW] -> reshape to [K, OH, OW]
void conv2d_im2col(float* output, float* input, float* filter,
int N, int C, int H, int W, int K, int FH, int FW) {
int OH = H - FH + 1, OW = W - FW + 1;
// Allocate column buffer
float* col_buffer;
cudaMalloc(&col_buffer, C * FH * FW * OH * OW * sizeof(float));
// im2col transform
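// Assumed launch configuration (not part of the original sketch):
// one thread per element of the column buffer.
int threads = 256;
int blocks = (C * FH * FW * OH * OW + threads - 1) / threads;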
im2col<<<blocks, threads>>>(col_buffer, input, C, H, W, FH, FW, ...);
// GEMM: output = filter @ col_buffer
// Uses cuBLAS for maximum performance; assumes a cublasHandle_t `handle` created elsewhere
const float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
OW * OH, K, C * FH * FW,
&alpha, col_buffer, OW * OH,
filter, C * FH * FW,
&beta, output, OW * OH);
cudaFree(col_buffer);
}

For 3x3 filters, Winograd reduces the number of multiplications by 2.25x by transforming the computation into a different domain. The trade-off is extra additions and transform overhead.
// Winograd F(2,3): 2x2 output tile with 3x3 filter
// Reduces 36 multiplications to 16 (2.25x reduction)
// Transform matrices
// G: filter transform (4x3)
// B: input transform (4x4)
// A: output transform (2x4)
__constant__ float G[4][3] = {
{1, 0, 0},
{0.5, 0.5, 0.5},
{0.5, -0.5, 0.5},
{0, 0, 1}
};
__constant__ float B_T[4][4] = {
{1, 0, -1, 0},
{0, 1, 1, 0},
{0, -1, 1, 0},
{0, 1, 0, -1}
};
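// Illustrative helper (an assumption of this guide, not a library routine):
// the offline filter transform U = G @ g @ G^T for one 3x3 filter, producing
// the 4x4 Winograd-domain tile used in step 1 of the flow below.
void winograd_filter_transform(const float g[3][3], float U[4][4]) {
    // Host-side copy of G (same values as the __constant__ G above)
    const float Gh[4][3] = {{1, 0, 0},
                            {0.5f, 0.5f, 0.5f},
                            {0.5f, -0.5f, 0.5f},
                            {0, 0, 1}};
    float tmp[4][3];
    // tmp = G @ g   (4x3 = 4x3 * 3x3)
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 3; j++) {
            tmp[i][j] = 0;
            for (int k = 0; k < 3; k++) tmp[i][j] += Gh[i][k] * g[k][j];
        }
    // U = tmp @ G^T (4x4 = 4x3 * 3x4), where G^T[k][j] == G[j][k]
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            U[i][j] = 0;
            for (int k = 0; k < 3; k++) U[i][j] += tmp[i][k] * Gh[j][k];
        }
}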
// Winograd convolution flow:
// 1. Transform filter: U = G @ filter @ G^T (done once offline)
// 2. For each 4x4 input tile:
// a. Transform input: V = B^T @ input @ B
// b. Element-wise multiply: M = U ⊙ V
// c. Transform output: Y = A^T @ M @ A
// 3. Accumulate 2x2 output tiles
__global__ void winograd_conv2d(float* output, float* V, float* U,
int tiles_x, int tiles_y, int K, int C) {
// Each thread handles one output channel for one tile
int tile_idx = blockIdx.x;
int k = blockIdx.y * blockDim.x + threadIdx.x;
if (k >= K) return;
float M[4][4] = {0}; // Accumulator over input channels; must start at zero
// Element-wise multiplication (in transform domain)
for (int c = 0; c < C; c++) {
for (int i = 0; i < 4; i++) {
for (int j = 0; j < 4; j++) {
M[i][j] += V[...] * U[...];
}
}
}
// Output transform: A^T @ M @ A
// ... write 2x2 output tile
}

Use NHWC (channels-last) layout to enable Tensor Core acceleration. On modern GPUs, TF32/FP16 Tensor Cores can deliver 8x or more math throughput than FP32 CUDA cores.
// cuDNN with Tensor Core support
cudnnTensorDescriptor_t input_desc, output_desc;
cudnnFilterDescriptor_t filter_desc;
cudnnConvolutionDescriptor_t conv_desc;
// (cudnnCreate*Descriptor calls and convolution descriptor setup omitted for brevity)
// Set NHWC format for Tensor Core compatibility
cudnnSetTensor4dDescriptor(input_desc, CUDNN_TENSOR_NHWC,
CUDNN_DATA_HALF, N, C, H, W);
cudnnSetFilter4dDescriptor(filter_desc, CUDNN_DATA_HALF,
CUDNN_TENSOR_NHWC, K, C, FH, FW);
// Enable Tensor Cores
cudnnSetConvolutionMathType(conv_desc, CUDNN_TENSOR_OP_MATH);
// Find fastest algorithm
cudnnConvolutionFwdAlgoPerf_t perf;
int cnt = 0; // number of algorithms returned
cudnnFindConvolutionForwardAlgorithm(handle, input_desc, filter_desc,
conv_desc, output_desc, 1, &cnt, &perf);
// Use recommended algorithm (often IMPLICIT_PRECOMP_GEMM for Tensor Cores)
cudnnConvolutionForward(handle, &alpha, input_desc, input,
filter_desc, filter, conv_desc, perf.algo,
workspace, ws_size, &beta, output_desc, output);

Combine convolution with bias addition and activation (ReLU) in a single kernel; at inference time, batch normalization can be folded into the convolution weights and bias beforehand. Fusion significantly reduces memory traffic because intermediate results are never written to global memory.
// cuDNN fusion: Conv + BiasAdd + ReLU
cudnnActivationDescriptor_t activation_desc;
cudnnSetActivationDescriptor(activation_desc, CUDNN_ACTIVATION_RELU,
CUDNN_PROPAGATE_NAN, 0.0);
// Fused operation: output = ReLU(Conv(input) + bias)
cudnnConvolutionBiasActivationForward(
handle,
&alpha1,
input_desc, input,
filter_desc, filter,
conv_desc, algo,
workspace, ws_size,
&alpha2,
z_desc, z, // Optional residual add
bias_desc, bias,
activation_desc,
output_desc, output
);
// Custom fused kernel approach
__global__ void conv_bias_relu(float* output, float* input,
float* filter, float* bias, ...) {
// Compute convolution
float sum = 0;
for (...) sum += input[...] * filter[...];
// Fuse bias addition
sum += bias[output_channel];
// Fuse ReLU activation
sum = fmaxf(sum, 0.0f);
output[idx] = sum;
}

Naive convolution is severely memory-bandwidth bound because it exploits no data reuse.
// Naive direct convolution - highly inefficient
__global__ void conv2d_naive(float* output, float* input, float* filter,
int W, int H, int C, int K, int FW, int FH) {
int ox = blockIdx.x * blockDim.x + threadIdx.x;
int oy = blockIdx.y * blockDim.y + threadIdx.y;
int ok = blockIdx.z; // Output channel
if (ox >= W || oy >= H) return;
float sum = 0;
// For each input channel
for (int c = 0; c < C; c++) {
// For each filter position
for (int fy = 0; fy < FH; fy++) {
for (int fx = 0; fx < FW; fx++) {
int ix = ox + fx - FW/2;
int iy = oy + fy - FH/2;
if (ix >= 0 && ix < W && iy >= 0 && iy < H) {
// PROBLEM: No data reuse!
// Each thread loads filter and input independently
float in_val = input[(c * H + iy) * W + ix];
float f_val = filter[((ok * C + c) * FH + fy) * FW + fx];
sum += in_val * f_val;
}
}
}
}
output[(ok * H + oy) * W + ox] = sum;
}
// Problems:
// 1. Filter loaded C*FH*FW times per output (no reuse)
// 2. Input loaded multiple times for overlapping outputs
// 3. Strided global memory access patterns
// 4. No use of shared memory or Tensor Cores

Production convolution selects an algorithm based on the layer parameters and uses an optimized implementation for each case.
// Production convolution: choose algorithm based on parameters
enum ConvAlgo { DIRECT, IM2COL, WINOGRAD, CUDNN };
ConvAlgo select_algorithm(int C, int K, int FH, int FW, int H, int W) {
// Winograd: best for 3x3, stride=1, moderate channels
if (FH == 3 && FW == 3 && C >= 64 && K >= 64) {
return WINOGRAD;
}
// im2col + GEMM: best for larger filters, batch processing
if (FH > 3 || FW > 3 || (C * K > 10000)) {
return IM2COL;
}
// cuDNN: handles all cases well, auto-tunes
return CUDNN;
}
// Direct tiled convolution for small filters
template<int TILE, int FH, int FW>
__global__ void conv2d_direct_tiled(float* __restrict__ output,
const float* __restrict__ input,
const float* __restrict__ filter,
int C, int K, int H, int W) {
__shared__ float sh_input[TILE + FH - 1][TILE + FW - 1];
__shared__ float sh_filter[FH][FW];
int tx = threadIdx.x, ty = threadIdx.y;
int bx = blockIdx.x, by = blockIdx.y, bk = blockIdx.z;
float sum = 0;
// Iterate over input channels
for (int c = 0; c < C; c++) {
// Collaborative load of input tile with halo
// ... (see tiled example above)
// Load filter for this input->output channel pair
if (tx < FW && ty < FH) {
sh_filter[ty][tx] = filter[(bk * C + c) * FH * FW + ty * FW + tx];
}
__syncthreads();
// Compute partial sum
if (tx < TILE && ty < TILE) {
#pragma unroll
for (int fy = 0; fy < FH; fy++) {
#pragma unroll
for (int fx = 0; fx < FW; fx++) {
sum += sh_input[ty + fy][tx + fx] * sh_filter[fy][fx];
}
}
}
__syncthreads();
}
// Write output
int ox = bx * TILE + tx;
int oy = by * TILE + ty;
if (ox < W && oy < H && tx < TILE && ty < TILE) {
output[(bk * H + oy) * W + ox] = sum;
}
}
// Launch with template instantiation
void launch_conv2d(float* out, float* in, float* filter,
int C, int K, int FH, int FW, int H, int W) {
dim3 block(18, 18); // TILE + halo for 3x3
dim3 grid((W + 15) / 16, (H + 15) / 16, K);
if (FH == 3 && FW == 3) {
conv2d_direct_tiled<16, 3, 3><<<grid, block>>>(out, in, filter, ...);
}
// ... other filter sizes
}

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| ResNet-50 Conv Layer (256 channels) | 45 ms | 0.8 ms (cuDNN) | 56x faster |
| TFLOPS (3x3 conv, FP16) | 1.2 TFLOPS | 85 TFLOPS (Tensor Core) | 71x higher |
| Memory Bandwidth Utilization | 15% | 78% (tiled) | 5.2x better |
| 1x1 Pointwise Conv | 8.2 ms | 0.15 ms (im2col) | 55x faster |
Use cuDNN for standard convolutions—it auto-selects optimal algorithms and uses Tensor Cores. Write custom kernels when: (1) fusing multiple operations (conv+bn+relu), (2) using non-standard data types or layouts, (3) targeting specific hardware constraints, or (4) implementing novel architectures not supported by cuDNN.
Use NHWC for Tensor Core operations and modern inference (TensorRT, cuDNN 8+). NCHW was historically preferred but NHWC enables better memory access patterns for channel-wise operations and Tensor Core compatibility. Converting layout mid-network is expensive; pick one and stick with it.
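To illustrate what that conversion involves, here is a minimal transpose-kernel sketch (the kernel name and launch convention are assumptions for illustration, not a library routine); every element is read and written exactly once, so the pass is purely memory-bandwidth bound:
// Sketch: NCHW -> NHWC layout conversion, one thread per element
__global__ void nchw_to_nhwc(float* __restrict__ dst, const float* __restrict__ src,
                             int N, int C, int H, int W) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = N * C * H * W;
    if (idx >= total) return;
    // Decompose the NCHW linear index
    int w = idx % W;
    int h = (idx / W) % H;
    int c = (idx / (W * H)) % C;
    int n = idx / (W * H * C);
    // Recompose as NHWC: ((n * H + h) * W + w) * C + c
    dst[((n * H + h) * W + w) * C + c] = src[idx];
}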
For 1x1 filters, use GEMM directly (no im2col needed). For 3x3 filters with stride 1, Winograd typically gives the best performance. For 5x5 and larger, use im2col + GEMM or FFT-based approaches. Depthwise separable convolutions need specialized kernels. cuDNN handles all of these cases but may not be optimal for unusual sizes.
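For the 1x1 case, the mapping to a single GEMM looks like the sketch below (assuming one image, the input stored as a row-major [C, H*W] matrix, and the filters as row-major [K, C]; the function name is ours). It reuses the same column-major convention as the im2col GEMM above:
// 1x1 convolution for a single image as one GEMM, no im2col buffer needed.
// input:  [C, H*W] row-major, filter: [K, C] row-major, output: [K, H*W] row-major.
void conv1x1_gemm(cublasHandle_t handle, float* output,
                  const float* input, const float* filter,
                  int C, int K, int H, int W) {
    const float alpha = 1.0f, beta = 0.0f;
    int P = H * W; // number of spatial positions
    // Row-major O = F * I expressed via cuBLAS's column-major convention:
    // O^T (P x K) = I^T (P x C) * F^T (C x K)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                P, K, C,
                &alpha, input, P,
                filter, C,
                &beta, output, P);
}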
Depthwise convolution applies one filter per input channel, so its compute intensity is low. Use specialized kernels that maximize memory bandwidth utilization rather than arithmetic throughput. In cuDNN, depthwise convolution is expressed as a grouped convolution (cudnnSetConvolutionGroupCount with the group count equal to the channel count), or write custom kernels that process multiple spatial locations per thread.
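A sketch of the grouped-convolution route follows (descriptor names and the padding/stride values are illustrative assumptions; setting the group count equal to the channel count is what makes the convolution depthwise):
// Depthwise convolution expressed as a grouped convolution in cuDNN
cudnnConvolutionDescriptor_t dw_conv_desc;
cudnnFilterDescriptor_t dw_filter_desc;
cudnnCreateConvolutionDescriptor(&dw_conv_desc);
cudnnCreateFilterDescriptor(&dw_filter_desc);
cudnnSetConvolution2dDescriptor(dw_conv_desc,
                                /*pad_h=*/1, /*pad_w=*/1,
                                /*stride_h=*/1, /*stride_w=*/1,
                                /*dilation_h=*/1, /*dilation_w=*/1,
                                CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
// One group per input channel turns the convolution into a depthwise one
cudnnSetConvolutionGroupCount(dw_conv_desc, C);
// Filter shape [C, 1, FH, FW]: one single-channel filter per input channel
cudnnSetFilter4dDescriptor(dw_filter_desc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                           C, 1, FH, FW);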