BiCGSTAB (Bi-Conjugate Gradient Stabilized) solves non-symmetric systems with fixed memory cost—unlike GMRES which grows with iterations. It combines BiCG with stabilization to smooth irregular convergence. BiCGSTAB requires two matrix-vector products per iteration but uses only O(n) memory regardless of iteration count.
Three refinements improve robustness in practice:
- Left or split preconditioning to accelerate convergence.
- Monitoring for near-zero denominators and restarting when breakdown threatens.
- BiCGSTAB(l): higher-degree stabilization polynomials for smoother convergence.
Basic BiCGSTAB with two SpMVs per iteration and fixed O(n) memory.
```c
void bicgstab_naive(cusparseHandle_t sparse, cublasHandle_t blas,
                    int n, int* rowPtr, int* colIdx, float* vals,
                    float* b, float* x, int max_iter, float tol) {
    float *r, *r0, *p, *v, *s, *t;
    // Allocate 6 device vectors of length n (cudaMalloc; error handling omitted)
    const float one = 1.0f;
    // r = b - A*x;  r0 = r (shadow residual);  p = r
    spmv(sparse, rowPtr, colIdx, vals, x, r, n);
    subtract(b, r, r, n);
    cudaMemcpy(r0, r, n * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaMemcpy(p,  r, n * sizeof(float), cudaMemcpyDeviceToDevice);
    float rho_old = 1.0f, alpha = 1.0f, omega = 1.0f;
    for (int iter = 0; iter < max_iter; iter++) {
        float rho;
        cublasSdot(blas, n, r0, 1, r, 1, &rho);
        if (iter > 0) {
            // p = r + beta*(p - omega*v); skip on iter 0 (v is not set yet, p = r)
            float beta = (rho / rho_old) * (alpha / omega);
            float neg_omega = -omega;
            cublasSaxpy(blas, n, &neg_omega, v, 1, p, 1);  // p -= omega*v
            cublasSscal(blas, n, &beta, p, 1);             // p *= beta
            cublasSaxpy(blas, n, &one, r, 1, p, 1);        // p += r
        }
        // v = A*p (first SpMV)
        spmv(sparse, rowPtr, colIdx, vals, p, v, n);
        float r0v;
        cublasSdot(blas, n, r0, 1, v, 1, &r0v);
        alpha = rho / r0v;
        // s = r - alpha*v
        cudaMemcpy(s, r, n * sizeof(float), cudaMemcpyDeviceToDevice);
        float neg_alpha = -alpha;
        cublasSaxpy(blas, n, &neg_alpha, v, 1, s, 1);
        // t = A*s (second SpMV)
        spmv(sparse, rowPtr, colIdx, vals, s, t, n);
        float ts, tt;
        cublasSdot(blas, n, t, 1, s, 1, &ts);
        cublasSdot(blas, n, t, 1, t, 1, &tt);
        omega = ts / tt;
        // x = x + alpha*p + omega*s
        cublasSaxpy(blas, n, &alpha, p, 1, x, 1);
        cublasSaxpy(blas, n, &omega, s, 1, x, 1);
        // r = s - omega*t
        cudaMemcpy(r, s, n * sizeof(float), cudaMemcpyDeviceToDevice);
        float neg_omega2 = -omega;
        cublasSaxpy(blas, n, &neg_omega2, t, 1, r, 1);
        float r_norm;
        cublasSnrm2(blas, n, r, 1, &r_norm);
        if (r_norm < tol) break;
        rho_old = rho;
    }
}
```

Preconditioned BiCGSTAB with breakdown detection and early termination.
```c
void pbicgstab(cusparseHandle_t sparse, cublasHandle_t blas,
               int n, int* rowPtr, int* colIdx, float* vals,
               float* b, float* x, void* precond, int max_iter, float tol) {
    float *r, *r0, *p, *v, *s, *t, *phat, *shat;
    // Allocate 8 device vectors of length n (cudaMalloc; error handling omitted)
    // r = b - A*x;  r0 = r (shadow residual);  p = r
    spmv(sparse, rowPtr, colIdx, vals, x, r, n);
    subtract(b, r, r, n);
    cudaMemcpy(r0, r, n * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaMemcpy(p,  r, n * sizeof(float), cudaMemcpyDeviceToDevice);
    float rho_old = 1.0f, alpha = 1.0f, omega = 1.0f;
    for (int iter = 0; iter < max_iter; iter++) {
        float rho;
        cublasSdot(blas, n, r0, 1, r, 1, &rho);
        // Breakdown: r has become (numerically) orthogonal to r0
        if (fabsf(rho) < 1e-30f) {
            // Restart with the current residual as the new shadow residual
            cudaMemcpy(r0, r, n * sizeof(float), cudaMemcpyDeviceToDevice);
            cudaMemcpy(p,  r, n * sizeof(float), cudaMemcpyDeviceToDevice);
            rho_old = 1.0f; alpha = 1.0f; omega = 1.0f;
            continue;
        }
        if (iter > 0) {
            float beta = (rho / rho_old) * (alpha / omega);
            update_p(p, r, v, beta, omega, n);   // p = r + beta*(p - omega*v)
        }
        apply_precond(precond, p, phat, n);      // phat = M^{-1} p
        spmv(sparse, rowPtr, colIdx, vals, phat, v, n);  // v = A*phat
        float r0v;
        cublasSdot(blas, n, r0, 1, v, 1, &r0v);
        alpha = rho / r0v;
        compute_s(s, r, v, alpha, n);            // s = r - alpha*v
        // Early termination: if s is already small, take the half-step and stop
        float s_norm;
        cublasSnrm2(blas, n, s, 1, &s_norm);
        if (s_norm < tol) {
            cublasSaxpy(blas, n, &alpha, phat, 1, x, 1);
            break;
        }
        apply_precond(precond, s, shat, n);      // shat = M^{-1} s
        spmv(sparse, rowPtr, colIdx, vals, shat, t, n);  // t = A*shat
        float ts, tt;
        cublasSdot(blas, n, t, 1, s, 1, &ts);
        cublasSdot(blas, n, t, 1, t, 1, &tt);
        omega = ts / tt;
        // x = x + alpha*phat + omega*shat
        cublasSaxpy(blas, n, &alpha, phat, 1, x, 1);
        cublasSaxpy(blas, n, &omega, shat, 1, x, 1);
        compute_r(r, s, t, omega, n);            // r = s - omega*t
        float r_norm;
        cublasSnrm2(blas, n, r, 1, &r_norm);
        if (r_norm < tol) break;
        rho_old = rho;
    }
}
```

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Non-symmetric n=500k | 450 iters | 85 iters (ILU) | 5.3x fewer |
| Memory vs GMRES(50) | 6n floats (BiCGSTAB) | 51n floats (GMRES) | 8.5x less |
| Cost per iteration | 2 SpMVs (BiCGSTAB) | 1 SpMV (GMRES) | Trade-off |
BiCGSTAB: fixed memory, possibly irregular convergence, and it can break down. GMRES: robust, monotone convergence, memory that grows with the restart length, and no breakdown (though it can stagnate). Use BiCGSTAB for memory-limited cases; use GMRES when robustness is critical.
Breakdown occurs when rho = (r0, r) approaches zero, i.e., when the residual r becomes nearly orthogonal to the shadow residual r0. Remedies: restart with r0 = r, use BiCGSTAB(l) for more stability, or switch to GMRES.
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.