Pooling operations reduce spatial dimensions in CNNs by computing max or average over local regions. While conceptually simple, efficient implementation requires careful attention to memory access patterns and thread assignment. This guide covers standard pooling, global pooling for classification heads, and the less common but important adaptive pooling.
Assign threads to output positions for coalesced writes.
Use warp shuffles for efficient spatial reduction.
Channels-last layout improves pooling memory access.
Basic pooling with channels-last layout.
__global__ void maxpool2d_naive(float* input, float* output,
int H, int W, int C, int pool_size) {
int c = blockIdx.x * blockDim.x + threadIdx.x;
int out_h = blockIdx.y;
int out_w = blockIdx.z;
if (c >= C) return;
float max_val = -INFINITY;
for (int kh = 0; kh < pool_size; kh++) {
for (int kw = 0; kw < pool_size; kw++) {
int h = out_h * pool_size + kh;
int w = out_w * pool_size + kw;
float val = input[(h * W + w) * C + c];
max_val = fmaxf(max_val, val);
}
}
output[(out_h * (W/pool_size) + out_w) * C + c] = max_val;
}Vectorized pooling processes 4 channels simultaneously.
__global__ void maxpool2d_vectorized(float4* input, float4* output,
int H, int W, int C4, int pool_size) {
// Process 4 channels at once with float4
int c4 = blockIdx.x * blockDim.x + threadIdx.x;
int out_h = blockIdx.y;
int out_w = blockIdx.z;
if (c4 >= C4) return;
float4 max_val = make_float4(-INFINITY, -INFINITY, -INFINITY, -INFINITY);
for (int kh = 0; kh < pool_size; kh++) {
for (int kw = 0; kw < pool_size; kw++) {
int h = out_h * pool_size + kh;
int w = out_w * pool_size + kw;
float4 val = input[(h * W + w) * C4 + c4];
max_val.x = fmaxf(max_val.x, val.x);
max_val.y = fmaxf(max_val.y, val.y);
max_val.z = fmaxf(max_val.z, val.z);
max_val.w = fmaxf(max_val.w, val.w);
}
}
output[(out_h * (W/pool_size) + out_w) * C4 + c4] = max_val;
}| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Throughput (images/sec) | 12000 | 38000 | 3.2x |
Max pooling is slightly faster as it avoids the division. Average pooling requires counting valid elements for boundary handling.
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.