2D FFT decomposes images into frequency components, enabling fast convolution, filtering, and compression. cuFFT provides highly optimized GPU implementations achieving near-peak performance. A 2D FFT is computed as 1D FFTs along rows, then 1D FFTs along columns (or vice versa).
Create plan once, execute many times for same-size transforms.
Exploit Hermitian symmetry for real data - half the storage.
Process multiple images simultaneously.
Creating and destroying plan each call wastes time.
void fft2d_naive(cufftComplex* d_data, int nx, int ny) {
cufftHandle plan;
cufftPlan2d(&plan, ny, nx, CUFFT_C2C); // Note: ny, nx order!
cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
cufftDestroy(plan); // Wasteful if called repeatedly!
}Reusable plans with batched execution and R2C optimization.
class FFT2D {
cufftHandle plan_c2c;
cufftHandle plan_r2c;
cufftHandle plan_c2r;
int nx, ny, batch;
public:
void init(int nx_, int ny_, int batch_ = 1) {
nx = nx_; ny = ny_; batch = batch_;
// Complex-to-complex
int n[2] = {ny, nx}; // Row-major: ny rows, nx cols
cufftPlanMany(&plan_c2c, 2, n,
NULL, 1, nx * ny, // Input: contiguous
NULL, 1, nx * ny, // Output: contiguous
CUFFT_C2C, batch);
// Real-to-complex (forward)
int inembed[2] = {ny, nx};
int onembed[2] = {ny, nx/2 + 1};
cufftPlanMany(&plan_r2c, 2, n,
inembed, 1, nx * ny,
onembed, 1, (nx/2 + 1) * ny,
CUFFT_R2C, batch);
// Complex-to-real (inverse)
cufftPlanMany(&plan_c2r, 2, n,
onembed, 1, (nx/2 + 1) * ny,
inembed, 1, nx * ny,
CUFFT_C2R, batch);
}
void forward(cufftComplex* d_in, cufftComplex* d_out) {
cufftExecC2C(plan_c2c, d_in, d_out, CUFFT_FORWARD);
}
void inverse(cufftComplex* d_in, cufftComplex* d_out) {
cufftExecC2C(plan_c2c, d_in, d_out, CUFFT_INVERSE);
// Normalize
int N = nx * ny * batch;
normalize<<<(N+255)/256, 256>>>(d_out, N, 1.0f / (nx * ny));
}
void forward_real(float* d_in, cufftComplex* d_out) {
cufftExecR2C(plan_r2c, d_in, d_out);
}
void inverse_real(cufftComplex* d_in, float* d_out) {
cufftExecC2R(plan_c2r, d_in, d_out);
normalize<<<...>>>(d_out, nx * ny * batch, 1.0f / (nx * ny));
}
};| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| 4096x4096 C2C | 15ms (plan+exec) | 4.2ms (exec only) | 3.6x faster |
| Memory (R2C vs C2C) | 128MB (C2C) | 68MB (R2C) | 1.9x less |
| Batch 100 512x512 | 420ms (sequential) | 85ms (batched) | 4.9x faster |
cuFFT follows FFTW convention with row-major storage. First dimension is number of rows (ny), second is row length (nx). This matches C array layout: data[row][col] = data[j][i] where j=0..ny-1, i=0..nx-1.
For convolution h = f * g: (1) FFT both signals, (2) pointwise multiply in frequency domain, (3) IFFT result. Pad to avoid circular convolution: pad to at least len(f) + len(g) - 1. FFT convolution is O(n log n) vs O(n²) direct.
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.