2D convolution is the computational backbone of Convolutional Neural Networks (CNNs), accounting for over 90% of inference time in models like ResNet and VGG. A single convolution layer may perform billions of floating-point operations, making GPU optimization essential for real-time inference. This guide covers the major approaches to implementing convolution on GPUs: direct convolution with shared memory tiling, the im2col transformation that converts convolution to matrix multiplication, and the Winograd algorithm that reduces arithmetic complexity for small filters. We also discuss when to use cuDNN versus custom kernels.
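To make that scale concrete with a rough, illustrative calculation (the layer dimensions here are chosen only as an example): a convolution layer performs about 2 × K × C × FH × FW × OH × OW floating-point operations per image, so a 3x3 layer with 256 input and 256 output channels producing a 56x56 output costs roughly 2 × 256 × 256 × 9 × 56 × 56 ≈ 3.7 GFLOPs, before multiplying by the batch size.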
Load input tiles and filter weights into shared memory to enable data reuse: each input element is read by multiple output positions because neighboring filter windows overlap.
#define TILE_SIZE 16
#define FILTER_SIZE 3
#define BLOCK_SIZE (TILE_SIZE + FILTER_SIZE - 1) // 18
// Single-channel convolution shown for clarity (the C parameter is unused here);
// launch with blockDim = (BLOCK_SIZE, BLOCK_SIZE) so the halo is loaded cooperatively.
__global__ void conv2d_tiled(float* output, float* input, float* filter,
int W, int H, int C) {
__shared__ float sh_input[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float sh_filter[FILTER_SIZE][FILTER_SIZE];
int tx = threadIdx.x, ty = threadIdx.y;
int ox = blockIdx.x * TILE_SIZE + tx;
int oy = blockIdx.y * TILE_SIZE + ty;
// Load filter into shared memory (first threads only)
if (tx < FILTER_SIZE && ty < FILTER_SIZE) {
sh_filter[ty][tx] = filter[ty * FILTER_SIZE + tx];
}
// Load input tile including halo region
int ix = blockIdx.x * TILE_SIZE + tx - FILTER_SIZE/2;
int iy = blockIdx.y * TILE_SIZE + ty - FILTER_SIZE/2;
if (tx < BLOCK_SIZE && ty < BLOCK_SIZE) {
if (ix >= 0 && ix < W && iy >= 0 && iy < H)
sh_input[ty][tx] = input[iy * W + ix];
else
sh_input[ty][tx] = 0; // Zero padding
}
__syncthreads();
// Compute convolution
if (tx < TILE_SIZE && ty < TILE_SIZE && ox < W && oy < H) {
float sum = 0;
for (int fy = 0; fy < FILTER_SIZE; fy++) {
for (int fx = 0; fx < FILTER_SIZE; fx++) {
sum += sh_input[ty + fy][tx + fx] * sh_filter[fy][fx];
}
}
output[oy * W + ox] = sum;
}
}

Convert convolution to matrix multiplication by rearranging input patches into columns. Leverage highly optimized GEMM routines for the actual computation.
// im2col: extract patches as columns
// Input: [C, H, W], Filter: [K, C, FH, FW]
// Output: [C*FH*FW, OH*OW] matrix
__global__ void im2col(float* data_col, float* data_im,
int C, int H, int W, int FH, int FW,
int OH, int OW, int pad, int stride) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int col_size = C * FH * FW * OH * OW;
if (idx < col_size) {
int w_out = idx % OW;
int h_out = (idx / OW) % OH;
int c_col = idx / (OW * OH);
int c_im = c_col / (FH * FW);
int h_off = (c_col / FW) % FH;
int w_off = c_col % FW;
int h_in = h_out * stride - pad + h_off;
int w_in = w_out * stride - pad + w_off;
data_col[idx] = (h_in >= 0 && h_in < H && w_in >= 0 && w_in < W)
? data_im[(c_im * H + h_in) * W + w_in]
: 0;
}
}
// Convolution via im2col + cuBLAS GEMM
// Filter: [K, C*FH*FW] @ im2col_matrix: [C*FH*FW, OH*OW]
// Result: [K, OH*OW] -> reshape to [K, OH, OW]
void conv2d_im2col(float* output, float* input, float* filter,
int N, int C, int H, int W, int K, int FH, int FW) {
int OH = H - FH + 1, OW = W - FW + 1;
// Allocate column buffer
float* col_buffer;
cudaMalloc(&col_buffer, C * FH * FW * OH * OW * sizeof(float));
// im2col transform
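// Assumed launch configuration (not part of the original sketch):
// one thread per element of the column buffer.
int threads = 256;
int blocks = (C * FH * FW * OH * OW + threads - 1) / threads;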
im2col<<<blocks, threads>>>(col_buffer, input, C, H, W, FH, FW, ...);
// GEMM: output = filter @ col_buffer
// Uses cuBLAS for maximum performance; assumes a cublasHandle_t `handle` created elsewhere
const float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
OW * OH, K, C * FH * FW,
&alpha, col_buffer, OW * OH,
filter, C * FH * FW,
&beta, output, OW * OH);
cudaFree(col_buffer);
}

For 3x3 filters, Winograd reduces the number of multiplications by 2.25x by transforming the computation into a different domain. The trade-off is extra additions and transform overhead.
// Winograd F(2,3): 2x2 output tile with 3x3 filter
// Reduces 36 multiplications to 16 (2.25x reduction)
// Transform matrices
// G: filter transform (4x3)
// B: input transform (4x4)
// A: output transform (2x4)
__constant__ float G[4][3] = {
{1, 0, 0},
{0.5, 0.5, 0.5},
{0.5, -0.5, 0.5},
{0, 0, 1}
};
__constant__ float B_T[4][4] = {
{1, 0, -1, 0},
{0, 1, 1, 0},
{0, -1, 1, 0},
{0, 1, 0, -1}
};
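// Illustrative helper (an assumption of this guide, not a library routine):
// the offline filter transform U = G @ g @ G^T for one 3x3 filter, producing
// the 4x4 Winograd-domain tile used in step 1 of the flow below.
void winograd_filter_transform(const float g[3][3], float U[4][4]) {
    // Host-side copy of G (same values as the __constant__ G above)
    const float Gh[4][3] = {{1, 0, 0},
                            {0.5f, 0.5f, 0.5f},
                            {0.5f, -0.5f, 0.5f},
                            {0, 0, 1}};
    float tmp[4][3];
    // tmp = G @ g   (4x3 = 4x3 * 3x3)
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 3; j++) {
            tmp[i][j] = 0;
            for (int k = 0; k < 3; k++) tmp[i][j] += Gh[i][k] * g[k][j];
        }
    // U = tmp @ G^T (4x4 = 4x3 * 3x4), where G^T[k][j] == G[j][k]
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            U[i][j] = 0;
            for (int k = 0; k < 3; k++) U[i][j] += tmp[i][k] * Gh[j][k];
        }
}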
// Winograd convolution flow:
// 1. Transform filter: U = G @ filter @ G^T (done once offline)
// 2. For each 4x4 input tile:
// a. Transform input: V = B^T @ input @ B
// b. Element-wise multiply: M = U ⊙ V
// c. Transform output: Y = A^T @ M @ A
// 3. Accumulate 2x2 output tiles
__global__ void winograd_conv2d(float* output, float* V, float* U,
int tiles_x, int tiles_y, int K, int C) {
// Each thread handles one output channel for one tile
int tile_idx = blockIdx.x;
int k = blockIdx.y * blockDim.x + threadIdx.x;
if (k >= K) return;
float M[4][4] = {0}; // Accumulator over input channels; must start at zero
// Element-wise multiplication (in transform domain)
for (int c = 0; c < C; c++) {
for (int i = 0; i < 4; i++) {
for (int j = 0; j < 4; j++) {
M[i][j] += V[...] * U[...];
}
}
}
// Output transform: A^T @ M @ A
// ... write 2x2 output tile
}

Use NHWC (channels-last) layout to enable Tensor Core acceleration. On modern GPUs, TF32/FP16 Tensor Cores can deliver 8x or more math throughput than FP32 CUDA cores.
// cuDNN with Tensor Core support
cudnnTensorDescriptor_t input_desc, output_desc;
cudnnFilterDescriptor_t filter_desc;
cudnnConvolutionDescriptor_t conv_desc;
// (cudnnCreate*Descriptor calls and convolution descriptor setup omitted for brevity)
// Set NHWC format for Tensor Core compatibility
cudnnSetTensor4dDescriptor(input_desc, CUDNN_TENSOR_NHWC,
CUDNN_DATA_HALF, N, C, H, W);
cudnnSetFilter4dDescriptor(filter_desc, CUDNN_DATA_HALF,
CUDNN_TENSOR_NHWC, K, C, FH, FW);
// Enable Tensor Cores
cudnnSetConvolutionMathType(conv_desc, CUDNN_TENSOR_OP_MATH);
// Find fastest algorithm
cudnnConvolutionFwdAlgoPerf_t perf;
int cnt = 0; // number of algorithms returned
cudnnFindConvolutionForwardAlgorithm(handle, input_desc, filter_desc,
conv_desc, output_desc, 1, &cnt, &perf);
// Use recommended algorithm (often IMPLICIT_PRECOMP_GEMM for Tensor Cores)
cudnnConvolutionForward(handle, &alpha, input_desc, input,
filter_desc, filter, conv_desc, perf.algo,
workspace, ws_size, &beta, output_desc, output);

Combine convolution with bias addition and activation (ReLU) in a single kernel; at inference time, batch normalization can be folded into the convolution weights and bias beforehand. Fusion significantly reduces memory traffic because intermediate results are never written to global memory.
// cuDNN fusion: Conv + BiasAdd + ReLU
cudnnActivationDescriptor_t activation_desc;
cudnnSetActivationDescriptor(activation_desc, CUDNN_ACTIVATION_RELU,
CUDNN_PROPAGATE_NAN, 0.0);
// Fused operation: output = ReLU(Conv(input) + bias)
cudnnConvolutionBiasActivationForward(
handle,
&alpha1,
input_desc, input,
filter_desc, filter,
conv_desc, algo,
workspace, ws_size,
&alpha2,
z_desc, z, // Optional residual add
bias_desc, bias,
activation_desc,
output_desc, output
);
// Custom fused kernel approach
__global__ void conv_bias_relu(float* output, float* input,
float* filter, float* bias, ...) {
// Compute convolution
float sum = 0;
for (...) sum += input[...] * filter[...];
// Fuse bias addition
sum += bias[output_channel];
// Fuse ReLU activation
sum = fmaxf(sum, 0.0f);
output[idx] = sum;
}

Naive convolution is severely memory-bandwidth bound because it exploits no data reuse.
// Naive direct convolution - highly inefficient
__global__ void conv2d_naive(float* output, float* input, float* filter,
int W, int H, int C, int K, int FW, int FH) {
int ox = blockIdx.x * blockDim.x + threadIdx.x;
int oy = blockIdx.y * blockDim.y + threadIdx.y;
int ok = blockIdx.z; // Output channel
if (ox >= W || oy >= H) return;
float sum = 0;
// For each input channel
for (int c = 0; c < C; c++) {
// For each filter position
for (int fy = 0; fy < FH; fy++) {
for (int fx = 0; fx < FW; fx++) {
int ix = ox + fx - FW/2;
int iy = oy + fy - FH/2;
if (ix >= 0 && ix < W && iy >= 0 && iy < H) {
// PROBLEM: No data reuse!
// Each thread loads filter and input independently
float in_val = input[(c * H + iy) * W + ix];
float f_val = filter[((ok * C + c) * FH + fy) * FW + fx];
sum += in_val * f_val;
}
}
}
}
output[(ok * H + oy) * W + ox] = sum;
}
// Problems:
// 1. Filter loaded C*FH*FW times per output (no reuse)
// 2. Input loaded multiple times for overlapping outputs
// 3. Strided global memory access patterns
// 4. No use of shared memory or Tensor Cores

Production convolution selects an algorithm based on the layer parameters and uses an optimized implementation for each case.
// Production convolution: choose algorithm based on parameters
enum ConvAlgo { DIRECT, IM2COL, WINOGRAD, CUDNN };
ConvAlgo select_algorithm(int C, int K, int FH, int FW, int H, int W) {
// Winograd: best for 3x3, stride=1, moderate channels
if (FH == 3 && FW == 3 && C >= 64 && K >= 64) {
return WINOGRAD;
}
// im2col + GEMM: best for larger filters, batch processing
if (FH > 3 || FW > 3 || (C * K > 10000)) {
return IM2COL;
}
// cuDNN: handles all cases well, auto-tunes
return CUDNN;
}
// Direct tiled convolution for small filters
template<int TILE, int FH, int FW>
__global__ void conv2d_direct_tiled(float* __restrict__ output,
const float* __restrict__ input,
const float* __restrict__ filter,
int C, int K, int H, int W) {
__shared__ float sh_input[TILE + FH - 1][TILE + FW - 1];
__shared__ float sh_filter[FH][FW];
int tx = threadIdx.x, ty = threadIdx.y;
int bx = blockIdx.x, by = blockIdx.y, bk = blockIdx.z;
float sum = 0;
// Iterate over input channels
for (int c = 0; c < C; c++) {
// Collaborative load of input tile with halo
// ... (see tiled example above)
// Load filter for this input->output channel pair
if (tx < FW && ty < FH) {
sh_filter[ty][tx] = filter[(bk * C + c) * FH * FW + ty * FW + tx];
}
__syncthreads();
// Compute partial sum
if (tx < TILE && ty < TILE) {
#pragma unroll
for (int fy = 0; fy < FH; fy++) {
#pragma unroll
for (int fx = 0; fx < FW; fx++) {
sum += sh_input[ty + fy][tx + fx] * sh_filter[fy][fx];
}
}
}
__syncthreads();
}
// Write output
int ox = bx * TILE + tx;
int oy = by * TILE + ty;
if (ox < W && oy < H && tx < TILE && ty < TILE) {
output[(bk * H + oy) * W + ox] = sum;
}
}
// Launch with template instantiation
void launch_conv2d(float* out, float* in, float* filter,
int C, int K, int FH, int FW, int H, int W) {
dim3 block(18, 18); // TILE + halo for 3x3
dim3 grid((W + 15) / 16, (H + 15) / 16, K);
if (FH == 3 && FW == 3) {
conv2d_direct_tiled<16, 3, 3><<<grid, block>>>(out, in, filter, ...);
}
// ... other filter sizes
}

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| ResNet-50 Conv Layer (256 channels) | 45 ms | 0.8 ms (cuDNN) | 56x faster |
| TFLOPS (3x3 conv, FP16) | 1.2 TFLOPS | 85 TFLOPS (Tensor Core) | 71x higher |
| Memory Bandwidth Utilization | 15% | 78% (tiled) | 5.2x better |
| 1x1 Pointwise Conv | 8.2 ms | 0.15 ms (im2col) | 55x faster |
Use cuDNN for standard convolutions—it auto-selects optimal algorithms and uses Tensor Cores. Write custom kernels when: (1) fusing multiple operations (conv+bn+relu), (2) using non-standard data types or layouts, (3) targeting specific hardware constraints, or (4) implementing novel architectures not supported by cuDNN.
Use NHWC for Tensor Core operations and modern inference (TensorRT, cuDNN 8+). NCHW was historically preferred but NHWC enables better memory access patterns for channel-wise operations and Tensor Core compatibility. Converting layout mid-network is expensive; pick one and stick with it.
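To illustrate what that conversion involves, here is a minimal transpose-kernel sketch (the kernel name and launch convention are assumptions for illustration, not a library routine); every element is read and written exactly once, so the pass is purely memory-bandwidth bound:
// Sketch: NCHW -> NHWC layout conversion, one thread per element
__global__ void nchw_to_nhwc(float* __restrict__ dst, const float* __restrict__ src,
                             int N, int C, int H, int W) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = N * C * H * W;
    if (idx >= total) return;
    // Decompose the NCHW linear index
    int w = idx % W;
    int h = (idx / W) % H;
    int c = (idx / (W * H)) % C;
    int n = idx / (W * H * C);
    // Recompose as NHWC: ((n * H + h) * W + w) * C + c
    dst[((n * H + h) * W + w) * C + c] = src[idx];
}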
For 1x1 filters, use GEMM directly (no im2col needed). For 3x3 filters with stride 1, Winograd typically gives the best performance. For 5x5 and larger, use im2col + GEMM or FFT-based approaches. Depthwise separable convolutions need specialized kernels. cuDNN handles all of these cases but may not be optimal for unusual sizes.
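For the 1x1 case, the mapping to a single GEMM looks like the sketch below (assuming one image, the input stored as a row-major [C, H*W] matrix, and the filters as row-major [K, C]; the function name is ours). It reuses the same column-major convention as the im2col GEMM above:
// 1x1 convolution for a single image as one GEMM, no im2col buffer needed.
// input:  [C, H*W] row-major, filter: [K, C] row-major, output: [K, H*W] row-major.
void conv1x1_gemm(cublasHandle_t handle, float* output,
                  const float* input, const float* filter,
                  int C, int K, int H, int W) {
    const float alpha = 1.0f, beta = 0.0f;
    int P = H * W; // number of spatial positions
    // Row-major O = F * I expressed via cuBLAS's column-major convention:
    // O^T (P x K) = I^T (P x C) * F^T (C x K)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                P, K, C,
                &alpha, input, P,
                filter, C,
                &beta, output, P);
}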
Depthwise convolution applies one filter per input channel, so its compute intensity is low. Use specialized kernels that maximize memory bandwidth utilization rather than arithmetic throughput. In cuDNN, depthwise convolution is expressed as a grouped convolution (cudnnSetConvolutionGroupCount with the group count equal to the channel count), or write custom kernels that process multiple spatial locations per thread.
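A sketch of the grouped-convolution route follows (descriptor names and the padding/stride values are illustrative assumptions; setting the group count equal to the channel count is what makes the convolution depthwise):
// Depthwise convolution expressed as a grouped convolution in cuDNN
cudnnConvolutionDescriptor_t dw_conv_desc;
cudnnFilterDescriptor_t dw_filter_desc;
cudnnCreateConvolutionDescriptor(&dw_conv_desc);
cudnnCreateFilterDescriptor(&dw_filter_desc);
cudnnSetConvolution2dDescriptor(dw_conv_desc,
                                /*pad_h=*/1, /*pad_w=*/1,
                                /*stride_h=*/1, /*stride_w=*/1,
                                /*dilation_h=*/1, /*dilation_w=*/1,
                                CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);
// One group per input channel turns the convolution into a depthwise one
cudnnSetConvolutionGroupCount(dw_conv_desc, C);
// Filter shape [C, 1, FH, FW]: one single-channel filter per input channel
cudnnSetFilter4dDescriptor(dw_filter_desc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                           C, 1, FH, FW);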