Gauss-Seidel uses updated values as soon as they are available: x_i^(k+1) = (b_i - Σ_{j<i} a_ij * x_j^(k+1) - Σ_{j>i} a_ij * x_j^(k)) / a_ii. This converges faster than Jacobi, but the in-place updates create data dependencies between rows. Exposing GPU parallelism therefore requires graph coloring: nodes of the same color do not depend on each other, so they can update simultaneously. The usual strategies are:

- 2-coloring (red-black) for regular grids, which updates 50% of the unknowns in parallel per sweep.
- General graph coloring for unstructured meshes.
- Wavefront parallelism based on dependency levels (see the sketch after this list).
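Red-black and multi-color sweeps appear in code below; wavefront scheduling does not, so here is a minimal host-side sketch, assuming a CSR pattern whose strictly lower-triangular entries encode the forward-sweep dependencies. The helper name and layout are illustrative, not from any particular library.

```cuda
// Sketch: compute wavefront levels for a forward Gauss-Seidel sweep.
// Rows that end up on the same level have no dependencies between them.
#include <vector>
#include <algorithm>

std::vector<int> compute_levels(const std::vector<int>& rowPtr,
                                const std::vector<int>& colIdx, int n) {
    std::vector<int> level(n, 0);
    for (int i = 0; i < n; i++) {
        int lvl = 0;
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++) {
            int j = colIdx[k];
            if (j < i) lvl = std::max(lvl, level[j] + 1);  // depends on earlier row j
        }
        level[i] = lvl;  // rows sharing a level can be updated together
    }
    return level;
}
```

All rows that share a level can then be updated by one kernel launch, so a sweep becomes one launch per level.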
A naive, sequential Gauss-Seidel kernel completely fails to utilize GPU parallelism:

```cuda
// Completely sequential - terrible GPU utilization
__global__ void gauss_seidel_naive(float* A, float* b, float* x, int n) {
    // Only one thread does all the work
    if (threadIdx.x != 0 || blockIdx.x != 0) return;
    for (int i = 0; i < n; i++) {
        float sum = b[i];
        for (int j = 0; j < n; j++) {
            if (j != i) sum -= A[i * n + j] * x[j];
        }
        x[i] = sum / A[i * n + i]; // Update immediately used by the next row
    }
}
```

Red-black ordering recovers parallelism on regular grids: points of one color only depend on points of the other color, so half of the grid updates concurrently in each sweep (~50% parallel efficiency).
```cuda
// Red-black Gauss-Seidel for a 2D grid with a 5-point stencil
// (discrete laplacian(x) = b on a uniform grid with spacing h)
__global__ void gs_red_sweep(float* x, const float* b, float h2, int nx, int ny) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    // Red points: (i + j) even. They read only black neighbors,
    // so every red update is independent of the others.
    if (i >= 1 && i < nx - 1 && j >= 1 && j < ny - 1) {
        if ((i + j) % 2 == 0) {
            int idx = j * nx + i;
            // h2 = h*h, the grid spacing squared
            x[idx] = 0.25f * (x[idx-1] + x[idx+1] + x[idx-nx] + x[idx+nx] - h2 * b[idx]);
        }
    }
}
__global__ void gs_black_sweep(float* x, const float* b, float h2, int nx, int ny) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    // Black points: (i + j) odd. They read the freshly updated red values.
    if (i >= 1 && i < nx - 1 && j >= 1 && j < ny - 1) {
        if ((i + j) % 2 == 1) {
            int idx = j * nx + i;
            x[idx] = 0.25f * (x[idx-1] + x[idx+1] + x[idx-nx] + x[idx+nx] - h2 * b[idx]);
        }
    }
}
void gauss_seidel_rb(float* d_x, float* d_b, int nx, int ny, int sweeps, float h) {
    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    float h2 = h * h;  // grid spacing squared
    for (int s = 0; s < sweeps; s++) {
        // Same-stream launches: the black sweep sees the red updates
        // without an explicit synchronization.
        gs_red_sweep<<<grid, block>>>(d_x, d_b, h2, nx, ny);
        gs_black_sweep<<<grid, block>>>(d_x, d_b, h2, nx, ny);
    }
}
```
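The multi-color driver below relies on a per-color kernel, gs_color_sweep. A minimal sketch of such a kernel, assuming d_colors stores the row indices grouped (sorted) by color and d_diag_inv holds precomputed 1/a_ii:

```cuda
// Sketch: one Gauss-Seidel update per row of a single color.
// `rows` points at the row indices belonging to this color.
__global__ void gs_color_sweep(const int* rowPtr, const int* colIdx, const float* vals,
                               const float* diag_inv, const float* b, float* x,
                               const int* rows, int count) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= count) return;
    int i = rows[t];                        // row handled by this thread
    float sum = b[i];
    for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++) {
        int j = colIdx[k];
        if (j != i) sum -= vals[k] * x[j];  // off-diagonal contributions
    }
    x[i] = sum * diag_inv[i];               // diag_inv[i] = 1 / a_ii
}
```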
```cuda
// For general sparse matrices: multi-color Gauss-Seidel.
// color_offsets lives on the host (num_colors + 1 entries); d_colors holds the
// row indices sorted by color, so d_colors + start addresses one color group.
void gauss_seidel_multicolor(int* d_rowPtr, int* d_colIdx, float* d_vals,
                             float* d_diag_inv, float* d_b, float* d_x,
                             int* d_colors, const int* color_offsets,
                             int num_colors, int n) {
    for (int c = 0; c < num_colors; c++) {
        int start = color_offsets[c];
        int end = color_offsets[c + 1];
        int count = end - start;
        // Rows of one color are independent: each color is a fully parallel launch.
        gs_color_sweep<<<(count + 255) / 256, 256>>>(
            d_rowPtr, d_colIdx, d_vals, d_diag_inv, d_b, d_x, d_colors + start, count);
    }
}
```

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Convergence rate | ρ_GS ≈ ρ_J² | ρ_GS ≈ ρ_J² (unchanged) | ~2x fewer iterations than Jacobi |
| GPU efficiency (structured) | 0% (serial) | 50% (red-black) | Usable |
| vs parallel Jacobi | Better convergence | Trade-off | Problem-dependent |
Jacobi is fully parallel but converges more slowly; Gauss-Seidel converges roughly twice as fast but needs coloring. For the model 2D Poisson problem, ρ_J = cos(πh) and ρ_GS = cos²(πh) = ρ_J², which is where the factor-of-two iteration count comes from. For regular finite-difference grids, red-black GS works well; for unstructured meshes, Jacobi may win because its parallelism is simpler. Profile both.
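One way to compare them on your problem is to time a fixed number of sweeps with CUDA events. A minimal sketch (the helper name is illustrative) that times the red-black driver above; the same pattern applies to a Jacobi sweep:

```cuda
#include <cuda_runtime.h>

// Sketch: wall-clock time in milliseconds for `sweeps` red-black GS iterations.
float time_rb_sweeps(float* d_x, float* d_b, int nx, int ny, int sweeps, float h) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    gauss_seidel_rb(d_x, d_b, nx, ny, sweeps, h);  // kernel launches are asynchronous
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                    // wait for the last sweep to finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```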
Use greedy coloring: for each row, collect the colors already assigned to its neighbors and assign the smallest unused one (see the sketch below). FEM meshes typically need 5-20 colors. cuSPARSE provides csrcolor for automatic coloring. More colors mean more kernel launches per sweep and therefore lower efficiency.
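A minimal host-side sketch of greedy coloring over a CSR adjacency pattern (the helper name is illustrative):

```cuda
// Sketch: color[i] is the smallest color not used by any already-colored neighbor.
#include <vector>

std::vector<int> greedy_color(const std::vector<int>& rowPtr,
                              const std::vector<int>& colIdx, int n) {
    std::vector<int> color(n, -1);
    std::vector<char> used;                  // scratch: colors taken by neighbors
    for (int i = 0; i < n; i++) {
        used.assign(n, 0);                   // simple O(n) reset; fine as an illustration
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++) {
            int j = colIdx[k];
            if (j != i && color[j] >= 0) used[color[j]] = 1;
        }
        int c = 0;
        while (used[c]) c++;                 // smallest unused color
        color[i] = c;
    }
    return color;
}
```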
Related methods:

- Jacobi: fully parallel but slower convergence.
- SOR (successive over-relaxation): accelerated Gauss-Seidel with a relaxation factor.
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.