Multigrid achieves O(n) complexity by combining iterative smoothing with coarse-grid correction. Smoothers (Jacobi/GS) eliminate high-frequency errors; coarse grids handle low-frequency errors. Geometric multigrid uses grid hierarchy; algebraic multigrid (AMG) constructs hierarchy from matrix coefficients.
Use Jacobi or red-black GS for GPU parallelism.
Combine restriction/prolongation with smoothing.
Solve multiple systems simultaneously.
Basic V-cycle with sequential smoothers.
void v_cycle(float** x, float** b, float** r, int* n, int levels, float h) {
// Descend: smooth and restrict
for (int l = 0; l < levels - 1; l++) {
smooth(x[l], b[l], n[l], 2); // Pre-smooth
residual(r[l], b[l], x[l], n[l], h);
restrict(b[l+1], r[l], n[l]);
zero(x[l+1], n[l+1]);
}
// Coarse solve
direct_solve(x[levels-1], b[levels-1], n[levels-1]);
// Ascend: prolongate and smooth
for (int l = levels - 2; l >= 0; l--) {
prolongate_add(x[l], x[l+1], n[l]);
smooth(x[l], b[l], n[l], 2); // Post-smooth
}
}GPU multigrid with red-black smoothers and full-weighting restriction.
class GPUMultigrid {
int levels;
float **d_x, **d_b, **d_r;
int* n;
void v_cycle() {
// Descend with parallel smoothers
for (int l = 0; l < levels - 1; l++) {
// Red-black Gauss-Seidel smoother (2 sweeps)
for (int s = 0; s < 2; s++) {
smooth_red<<<grid[l], block>>>(d_x[l], d_b[l], n[l]);
smooth_black<<<grid[l], block>>>(d_x[l], d_b[l], n[l]);
}
// Compute residual: r = b - Ax
compute_residual<<<grid[l], block>>>(d_r[l], d_b[l], d_x[l], n[l]);
// Full weighting restriction
restrict_fw<<<grid[l+1], block>>>(d_b[l+1], d_r[l], n[l]);
// Zero initial guess on coarse grid
cudaMemset(d_x[l+1], 0, n[l+1] * sizeof(float));
}
// Coarse grid solve (small enough for direct solve or many iterations)
if (n[levels-1] <= 64) {
direct_solve_small(d_x[levels-1], d_b[levels-1], n[levels-1]);
} else {
for (int s = 0; s < 50; s++) {
smooth_red<<<1, 64>>>(d_x[levels-1], d_b[levels-1], n[levels-1]);
smooth_black<<<1, 64>>>(d_x[levels-1], d_b[levels-1], n[levels-1]);
}
}
// Ascend with parallel smoothers
for (int l = levels - 2; l >= 0; l--) {
// Bilinear prolongation and correction
prolongate_add<<<grid[l], block>>>(d_x[l], d_x[l+1], n[l]);
// Post-smoothing
for (int s = 0; s < 2; s++) {
smooth_red<<<grid[l], block>>>(d_x[l], d_b[l], n[l]);
smooth_black<<<grid[l], block>>>(d_x[l], d_b[l], n[l]);
}
}
}
};
__global__ void restrict_fw(float* coarse, float* fine, int n_fine) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
int n_coarse = n_fine / 2;
if (i > 0 && i < n_coarse-1 && j > 0 && j < n_coarse-1) {
int fi = 2 * i, fj = 2 * j;
// Full weighting stencil
coarse[j * n_coarse + i] = 0.0625f * (
fine[(fj-1)*n_fine + fi-1] + 2*fine[(fj-1)*n_fine + fi] + fine[(fj-1)*n_fine + fi+1] +
2*fine[fj*n_fine + fi-1] + 4*fine[fj*n_fine + fi] + 2*fine[fj*n_fine + fi+1] +
fine[(fj+1)*n_fine + fi-1] + 2*fine[(fj+1)*n_fine + fi] + fine[(fj+1)*n_fine + fi+1]
);
}
}| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Poisson 1024² V-cycle | 180ms (CG) | 8ms (MG) | 22x faster |
| Iterations to 1e-10 | 4500 (CG) | 8 (MG V-cycles) | 562x fewer |
| Scaling O(n) | O(n²) CG | O(n) MG | Optimal |
V-cycle: one coarse-grid correction per level. W-cycle: two corrections (recurse twice). F-cycle: full multigrid with nested iteration. V-cycle is simplest and often sufficient. W-cycle for harder problems. F-cycle for best initial guess.
Typically 1-3 pre and post smoothing steps. More smoothing = more work but better error reduction. Standard: 2 pre + 2 post. For harder problems or cheaper smoothers, increase to 3-4.
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.