Image resizing is essential for data augmentation, preprocessing, and video processing. NPP (NVIDIA Performance Primitives) provides optimized implementations, but custom kernels offer more flexibility for batched operations. This guide covers interpolation methods, batch processing, and integration with training pipelines.
Key optimizations:

- Process multiple images in a single kernel launch.
- Use texture units for free hardware bilinear interpolation.
- Use nppiResize for optimized single-image resizes (see the sketch below).
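For the single-image path, NPP does the work in one call. A minimal sketch, assuming 8-bit RGB images already resident on the device; the `nppiResize_8u_C3R` signature here matches recent CUDA toolkits, so verify it against `nppi_geometry_transforms.h` for your version:

```cuda
#include <nppi_geometry_transforms.h>

// Resize one 8-bit RGB image on the device with NPP (bilinear).
// d_src/d_dst are device pointers; steps are row pitches in bytes.
NppStatus resize_npp(const Npp8u* d_src, int src_w, int src_h, int src_step,
                     Npp8u* d_dst, int dst_w, int dst_h, int dst_step) {
    NppiSize srcSize = {src_w, src_h};
    NppiRect srcROI  = {0, 0, src_w, src_h};
    NppiSize dstSize = {dst_w, dst_h};
    NppiRect dstROI  = {0, 0, dst_w, dst_h};
    return nppiResize_8u_C3R(d_src, src_step, srcSize, srcROI,
                             d_dst, dst_step, dstSize, dstROI,
                             NPPI_INTER_LINEAR);
}
```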
Start with basic bilinear interpolation, computing source coordinates manually and blending the four nearest texels. Note the border clamping: without it, edge pixels read out of bounds, since the floor of `src_x` can be -1 and its neighbor can be `src_w`.

```cuda
__global__ void resize_bilinear(const float* __restrict__ src, float* __restrict__ dst,
                                int src_h, int src_w,
                                int dst_h, int dst_w, int channels) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dst_w || y >= dst_h) return;

    float scale_x = (float)src_w / dst_w;
    float scale_y = (float)src_h / dst_h;
    // Half-pixel offsets align the centers of source and destination pixels
    float src_x = (x + 0.5f) * scale_x - 0.5f;
    float src_y = (y + 0.5f) * scale_y - 0.5f;

    int x0 = (int)floorf(src_x);
    int y0 = (int)floorf(src_y);
    float wx = src_x - x0, wy = src_y - y0;  // weights before clamping
    // Clamp sample coordinates to the image border
    int x1 = min(x0 + 1, src_w - 1);
    int y1 = min(y0 + 1, src_h - 1);
    x0 = max(x0, 0);
    y0 = max(y0, 0);

    for (int c = 0; c < channels; c++) {
        float v00 = src[(y0 * src_w + x0) * channels + c];
        float v01 = src[(y0 * src_w + x1) * channels + c];
        float v10 = src[(y1 * src_w + x0) * channels + c];
        float v11 = src[(y1 * src_w + x1) * channels + c];
        float val = (1 - wy) * ((1 - wx) * v00 + wx * v01)
                  + wy       * ((1 - wx) * v10 + wx * v11);
        dst[(y * dst_w + x) * channels + c] = val;
    }
}
```
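A typical launch covers the destination image with 16×16 blocks; `d_src` and `d_dst` are illustrative device-pointer names:

```cuda
dim3 block(16, 16);
dim3 grid((dst_w + block.x - 1) / block.x,
          (dst_h + block.y - 1) / block.y);
resize_bilinear<<<grid, block>>>(d_src, d_dst, src_h, src_w, dst_h, dst_w, 3);
```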
Texture units provide free hardware bilinear interpolation. Storing the batch in a layered array lets `blockIdx.z` select the image, so one launch resizes the whole batch. Texture references are deprecated and removed in CUDA 12+, so this version takes a `cudaTextureObject_t`:

```cuda
__global__ void resize_texture_batched(cudaTextureObject_t tex, float4* dst,
                                       int dst_h, int dst_w,
                                       float scale_x, float scale_y, int batch) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int b = blockIdx.z;  // one layer per image in the batch
    if (x >= dst_w || y >= dst_h || b >= batch) return;

    float src_x = (x + 0.5f) * scale_x;
    float src_y = (y + 0.5f) * scale_y;
    // Hardware bilinear interpolation via the texture unit -- free!
    float4 val = tex2DLayered<float4>(tex, src_x, src_y, b);
    dst[b * dst_h * dst_w + y * dst_w + x] = val;
}
```
Creating the texture object is a one-time host-side setup:

```cuda
cudaTextureObject_t createTexture(cudaArray_t array) {
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = array;  // a layered array for the batched kernel

    cudaTextureDesc texDesc = {};
    texDesc.filterMode = cudaFilterModeLinear;      // hardware bilinear
    texDesc.addressMode[0] = cudaAddressModeClamp;  // clamp out-of-range reads
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.readMode = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;                   // kernel uses pixel coordinates

    cudaTextureObject_t tex;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}
```
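Wiring it together for a batch: a minimal host-side sketch, assuming `batch` float4 (RGBA) images of identical size sit contiguously in a host buffer `h_src`, with error checking elided for brevity:

```cuda
// Allocate a layered cudaArray: width x height, one layer per image
cudaChannelFormatDesc ch = cudaCreateChannelDesc<float4>();
cudaArray_t arr;
cudaExtent extent = make_cudaExtent(src_w, src_h, batch);
cudaMalloc3DArray(&arr, &ch, extent, cudaArrayLayered);

// Copy all layers in one cudaMemcpy3D
cudaMemcpy3DParms p = {};
p.srcPtr = make_cudaPitchedPtr(h_src, src_w * sizeof(float4), src_w, src_h);
p.dstArray = arr;
p.extent = extent;
p.kind = cudaMemcpyHostToDevice;
cudaMemcpy3D(&p);

cudaTextureObject_t tex = createTexture(arr);

// One launch covers the whole batch via blockIdx.z
dim3 block(16, 16);
dim3 grid((dst_w + block.x - 1) / block.x,
          (dst_h + block.y - 1) / block.y,
          batch);
resize_texture_batched<<<grid, block>>>(tex, d_dst, dst_h, dst_w,
                                        (float)src_w / dst_w,
                                        (float)src_h / dst_h, batch);
```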
| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Batch resize throughput | 500 img/s | 5000 img/s | 10x |
Bicubic is sharper but roughly 4x slower, since it blends a 4×4 neighborhood (16 taps) instead of bilinear's 2×2 (4 taps). For training-data augmentation, bilinear is usually sufficient; bicubic matters when final image quality counts.
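If you do need bicubic, here is a minimal single-channel sketch using Catmull-Rom cubic weights (a = -0.5, the common choice); extending it to multiple channels follows the bilinear kernel above:

```cuda
// Catmull-Rom style cubic convolution weight, support |t| < 2
__device__ float cubic_w(float t) {
    const float a = -0.5f;
    t = fabsf(t);
    if (t <= 1.0f) return ((a + 2.0f) * t - (a + 3.0f)) * t * t + 1.0f;
    if (t <  2.0f) return ((a * t - 5.0f * a) * t + 8.0f * a) * t - 4.0f * a;
    return 0.0f;
}

__global__ void resize_bicubic(const float* __restrict__ src, float* __restrict__ dst,
                               int src_h, int src_w, int dst_h, int dst_w) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dst_w || y >= dst_h) return;

    float src_x = (x + 0.5f) * (float)src_w / dst_w - 0.5f;
    float src_y = (y + 0.5f) * (float)src_h / dst_h - 0.5f;
    int ix = (int)floorf(src_x), iy = (int)floorf(src_y);
    float fx = src_x - ix, fy = src_y - iy;

    float acc = 0.0f;
    // 4x4 neighborhood: 16 taps vs bilinear's 4, hence the ~4x cost
    for (int j = -1; j <= 2; j++) {
        int sy = min(max(iy + j, 0), src_h - 1);  // clamp to border
        float wy = cubic_w(fy - j);
        for (int i = -1; i <= 2; i++) {
            int sx = min(max(ix + i, 0), src_w - 1);
            acc += wy * cubic_w(fx - i) * src[sy * src_w + sx];
        }
    }
    dst[y * dst_w + x] = acc;
}
```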
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.