Iterative solvers find approximate solutions through successive refinement, requiring only matrix-vector products. Key methods include CG (for SPD matrices) and GMRES or BiCGSTAB (for general non-symmetric systems). Convergence depends heavily on preconditioning: transforming the system so it has better spectral properties.
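A standard way to see why the spectral properties matter: for CG applied to an SPD system with condition number $\kappa(A)$, the classical error bound in the $A$-norm is

$$
\|x_k - x_\ast\|_A \;\le\; 2\left(\frac{\sqrt{\kappa(A)}-1}{\sqrt{\kappa(A)}+1}\right)^{k} \|x_0 - x_\ast\|_A,
$$

so a preconditioner that shrinks $\kappa$ (or clusters the eigenvalues) cuts the iteration count roughly in proportion to $\sqrt{\kappa}$.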
Practical levers for iterative solvers on the GPU:

- Use incomplete factorization preconditioners (ILU/IC) for general-purpose acceleration.
- Fuse dot products and vector updates into single kernels to reduce kernel launches (see the sketch after this list).
- Use FP32 for the inner iterations and FP64 for final refinement (mixed precision).
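To illustrate the fusion point, here is a minimal sketch (an assumption for this article, not code from the solvers below) of a kernel that fuses the residual update r = r - alpha*Ap with the partial reduction for (r, r), so one launch replaces an axpy plus a dot product. The kernel name, the fixed block size of 256, and the atomicAdd-based accumulation into a pre-zeroed scalar are choices made for the example.

```cuda
// Sketch: fused update r -= alpha * Ap plus partial dot product (r, r).
// One launch replaces a cublasSaxpy followed by a cublasSdot/Snrm2.
__global__ void fused_axpy_dot(const float* __restrict__ Ap,
                               float* __restrict__ r,
                               float alpha,
                               float* __restrict__ rr_out,  // single float, pre-zeroed on device
                               int n) {
    __shared__ float partial[256];                 // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float ri = 0.0f;
    if (i < n) {
        ri = r[i] - alpha * Ap[i];                 // vector update
        r[i] = ri;
    }
    partial[threadIdx.x] = ri * ri;                // local contribution to (r, r)
    __syncthreads();

    // Block-level tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(rr_out, partial[0]);  // accumulate across blocks
}
```

After the launch, the host reads back the single float to obtain the squared residual norm. For the mixed-precision point, the usual pattern is iterative refinement: run the solver in FP32, compute the residual b - Ax in FP64, solve for a correction in FP32 again, and repeat until the FP64 residual meets the tolerance.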
Jacobi converges slowly (the rate is governed by the spectral radius of the iteration matrix), and the dense implementation below touches every matrix entry on every sweep, so treat it as a baseline.
```cuda
// Jacobi - simple but slow convergence
__global__ void jacobi_step(float* A, float* b, float* x, float* x_new, int n);

void jacobi_solve(float* d_A, float* d_b, float* d_x, int n,
                  int max_iter, float tol) {
    float* d_x_new;
    cudaMalloc(&d_x_new, n * sizeof(float));

    int block = 256;
    int grid = (n + block - 1) / block;

    for (int iter = 0; iter < max_iter; iter++) {
        jacobi_step<<<grid, block>>>(d_A, d_b, d_x, d_x_new, n);
        cudaMemcpy(d_x, d_x_new, n * sizeof(float), cudaMemcpyDeviceToDevice);

        // compute_residual (helper not shown) returns ||b - A x||
        float residual = compute_residual(d_A, d_b, d_x, n);
        if (residual < tol) break;
    }
    cudaFree(d_x_new);
}

__global__ void jacobi_step(float* A, float* b, float* x, float* x_new, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float sum = b[i];
    for (int j = 0; j < n; j++) {
        if (j != i) sum -= A[i * n + j] * x[j];   // off-diagonal terms
    }
    x_new[i] = sum / A[i * n + i];                // divide by diagonal entry
}
```

PCG with incomplete-factorization (ILU/IC) preconditioning is the workhorse for large sparse SPD systems.
```cuda
void pcg_solve(cusparseHandle_t sparse, cublasHandle_t blas,
               int n, int nnz, int* d_rowPtr, int* d_colIdx, float* d_vals,
               float* d_b, float* d_x, void* precond, int max_iter, float tol) {
    float *d_r, *d_p, *d_z, *d_Ap;
    cudaMalloc(&d_r,  n * sizeof(float));
    cudaMalloc(&d_p,  n * sizeof(float));
    cudaMalloc(&d_z,  n * sizeof(float));
    cudaMalloc(&d_Ap, n * sizeof(float));

    const float one = 1.0f, minus_one = -1.0f;

    // r = b - A*x  (spmv / apply_precond are helper wrappers, not shown here)
    spmv(sparse, d_rowPtr, d_colIdx, d_vals, d_x, d_Ap, n);             // Ap = A*x
    cudaMemcpy(d_r, d_b, n * sizeof(float), cudaMemcpyDeviceToDevice);  // r = b
    cublasSaxpy(blas, n, &minus_one, d_Ap, 1, d_r, 1);                  // r = b - A*x

    // z = M^{-1} r (apply preconditioner)
    apply_precond(precond, d_r, d_z, n);
    cudaMemcpy(d_p, d_z, n * sizeof(float), cudaMemcpyDeviceToDevice);  // p = z

    float rz_old;
    cublasSdot(blas, n, d_r, 1, d_z, 1, &rz_old);  // assumes CUBLAS_POINTER_MODE_HOST (the default)

    for (int iter = 0; iter < max_iter; iter++) {
        // Ap = A * p
        spmv(sparse, d_rowPtr, d_colIdx, d_vals, d_p, d_Ap, n);

        // alpha = (r, z) / (p, Ap)
        float pAp;
        cublasSdot(blas, n, d_p, 1, d_Ap, 1, &pAp);
        float alpha = rz_old / pAp;

        // x = x + alpha * p
        cublasSaxpy(blas, n, &alpha, d_p, 1, d_x, 1);

        // r = r - alpha * Ap
        float neg_alpha = -alpha;
        cublasSaxpy(blas, n, &neg_alpha, d_Ap, 1, d_r, 1);

        // Check convergence (absolute tolerance; tol * ||b|| is common in practice)
        float r_norm;
        cublasSnrm2(blas, n, d_r, 1, &r_norm);
        if (r_norm < tol) break;

        // z = M^{-1} r
        apply_precond(precond, d_r, d_z, n);

        // beta = (r_new, z_new) / (r_old, z_old)
        float rz_new;
        cublasSdot(blas, n, d_r, 1, d_z, 1, &rz_new);
        float beta = rz_new / rz_old;
        rz_old = rz_new;

        // p = z + beta * p
        cublasSscal(blas, n, &beta, d_p, 1);
        cublasSaxpy(blas, n, &one, d_z, 1, d_p, 1);
    }

    cudaFree(d_r);
    cudaFree(d_p);
    cudaFree(d_z);
    cudaFree(d_Ap);
}
```

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Iterations (n=1M SPD system) | 5000 (no preconditioner) | 150 (ILU) | 33x fewer iterations |
| Time per iteration | 12ms (Jacobi) | 8ms (CG) | 1.5x faster |
| Total solve time | 60s | 1.2s | 50x faster |
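The `spmv` helper used in the listings above is left abstract. As a rough sketch (not the exact wrapper used here), an FP32 CSR SpMV with the cuSPARSE generic API might look like the following; note that `cusparseCreateCsr` needs `nnz`, which the call sites above do not pass, so a real wrapper would carry it (and the descriptors and workspace) in a small setup struct created once and reused every iteration rather than rebuilt per call as done here.

```cuda
// Sketch of an spmv wrapper: y = A * x for a CSR matrix in FP32.
// Descriptors and the workspace buffer should be created once and reused;
// they are rebuilt per call here only to keep the example self-contained.
void spmv(cusparseHandle_t handle, int* d_rowPtr, int* d_colIdx, float* d_vals,
          float* d_x, float* d_y, int n, int nnz) {
    const float one = 1.0f, zero = 0.0f;

    cusparseSpMatDescr_t matA;
    cusparseDnVecDescr_t vecX, vecY;
    cusparseCreateCsr(&matA, n, n, nnz, d_rowPtr, d_colIdx, d_vals,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnVec(&vecX, n, d_x, CUDA_R_32F);
    cusparseCreateDnVec(&vecY, n, d_y, CUDA_R_32F);

    size_t bufSize = 0;
    void* d_buf = nullptr;
    cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            &one, matA, vecX, &zero, vecY,
                            CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT, &bufSize);
    cudaMalloc(&d_buf, bufSize);

    cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &one, matA, vecX, &zero, vecY,
                 CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT, d_buf);

    cudaFree(d_buf);
    cusparseDestroyDnVec(vecX);
    cusparseDestroyDnVec(vecY);
    cusparseDestroySpMat(matA);
}
```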
SPD matrices: CG (Conjugate Gradient). Symmetric indefinite: MINRES. General non-symmetric: GMRES or BiCGSTAB. GMRES is the most robust, but its memory use grows with the iteration count; use restarted GMRES(m) to bound it.
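For a sense of scale: un-restarted GMRES keeps every Krylov basis vector, so after k iterations it stores roughly (k+1)·n values. For n = 10^6 unknowns in FP32, 50 iterations already hold about 51 × 10^6 × 4 B ≈ 200 MB of basis vectors alone; GMRES(50) caps the storage at that level by discarding the basis and restarting.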
Preconditioner options:

- ILU(0): cheap, general purpose.
- ILU(k): more fill-in, better convergence.
- IC (Incomplete Cholesky): for SPD matrices.
- AMG (Algebraic Multigrid): best for elliptic PDEs.
- Domain-specific: physics-based preconditioners.
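As a sketch of how the `precond` object for the PCG routine above might be built, cuSPARSE's csrilu02 routines compute an ILU(0) factorization in place on a copy of the CSR values. This shows the setup step only; applying M^{-1} additionally needs two sparse triangular solves each iteration (e.g. via cuSPARSE's sparse triangular-solve routines), and error/zero-pivot checking is omitted. The function name `setup_ilu0` and the choice to factor a copy `d_valsILU` are assumptions for the example.

```cuda
// Sketch: ILU(0) factorization of a CSR matrix with cuSPARSE csrilu02.
// d_valsILU starts as a copy of the matrix values and is overwritten in
// place with the L and U factors (unit diagonal of L not stored).
void setup_ilu0(cusparseHandle_t handle, int n, int nnz,
                int* d_rowPtr, int* d_colIdx, float* d_valsILU) {
    cusparseMatDescr_t descrA;
    cusparseCreateMatDescr(&descrA);
    cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO);

    csrilu02Info_t info;
    cusparseCreateCsrilu02Info(&info);

    int bufSize = 0;
    cusparseScsrilu02_bufferSize(handle, n, nnz, descrA, d_valsILU,
                                 d_rowPtr, d_colIdx, info, &bufSize);
    void* d_buf;
    cudaMalloc(&d_buf, bufSize);

    // Analysis pass: builds the level/dependency information
    cusparseScsrilu02_analysis(handle, n, nnz, descrA, d_valsILU,
                               d_rowPtr, d_colIdx, info,
                               CUSPARSE_SOLVE_POLICY_USE_LEVEL, d_buf);

    // Numerical factorization: d_valsILU now holds the ILU(0) factors
    cusparseScsrilu02(handle, n, nnz, descrA, d_valsILU,
                      d_rowPtr, d_colIdx, info,
                      CUSPARSE_SOLVE_POLICY_USE_LEVEL, d_buf);

    cudaFree(d_buf);
    cusparseDestroyCsrilu02Info(info);
    cusparseDestroyMatDescr(descrA);
}
```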
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.