BiCGSTAB (Bi-Conjugate Gradient Stabilized) solves non-symmetric systems with fixed memory cost—unlike GMRES which grows with iterations. It combines BiCG with stabilization to smooth irregular convergence. BiCGSTAB requires two matrix-vector products per iteration but uses only O(n) memory regardless of iteration count.
Three refinements improve robustness in practice:
- Left or split preconditioning to accelerate convergence.
- Monitoring for near-zero denominators and restarting when breakdown threatens.
- BiCGSTAB(l): higher-degree stabilization polynomials for smoother convergence.
Basic BiCGSTAB with two SpMVs per iteration and fixed O(n) memory.
```c
void bicgstab_naive(cusparseHandle_t sparse, cublasHandle_t blas,
                    int n, int* rowPtr, int* colIdx, float* vals,
                    float* b, float* x, int max_iter, float tol) {
    float *r, *r0, *p, *v, *s, *t;
    // Allocate 6 device vectors of length n (cudaMalloc; error handling omitted)
    const float one = 1.0f;
    // r = b - A*x;  r0 = r (shadow residual);  p = r
    spmv(sparse, rowPtr, colIdx, vals, x, r, n);
    subtract(b, r, r, n);
    cudaMemcpy(r0, r, n * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaMemcpy(p,  r, n * sizeof(float), cudaMemcpyDeviceToDevice);
    float rho_old = 1.0f, alpha = 1.0f, omega = 1.0f;
    for (int iter = 0; iter < max_iter; iter++) {
        float rho;
        cublasSdot(blas, n, r0, 1, r, 1, &rho);
        if (iter > 0) {
            // p = r + beta*(p - omega*v); skip on iter 0 (v is not set yet, p = r)
            float beta = (rho / rho_old) * (alpha / omega);
            float neg_omega = -omega;
            cublasSaxpy(blas, n, &neg_omega, v, 1, p, 1);  // p -= omega*v
            cublasSscal(blas, n, &beta, p, 1);             // p *= beta
            cublasSaxpy(blas, n, &one, r, 1, p, 1);        // p += r
        }
        // v = A*p (first SpMV)
        spmv(sparse, rowPtr, colIdx, vals, p, v, n);
        float r0v;
        cublasSdot(blas, n, r0, 1, v, 1, &r0v);
        alpha = rho / r0v;
        // s = r - alpha*v
        cudaMemcpy(s, r, n * sizeof(float), cudaMemcpyDeviceToDevice);
        float neg_alpha = -alpha;
        cublasSaxpy(blas, n, &neg_alpha, v, 1, s, 1);
        // t = A*s (second SpMV)
        spmv(sparse, rowPtr, colIdx, vals, s, t, n);
        float ts, tt;
        cublasSdot(blas, n, t, 1, s, 1, &ts);
        cublasSdot(blas, n, t, 1, t, 1, &tt);
        omega = ts / tt;
        // x = x + alpha*p + omega*s
        cublasSaxpy(blas, n, &alpha, p, 1, x, 1);
        cublasSaxpy(blas, n, &omega, s, 1, x, 1);
        // r = s - omega*t
        cudaMemcpy(r, s, n * sizeof(float), cudaMemcpyDeviceToDevice);
        float neg_omega2 = -omega;
        cublasSaxpy(blas, n, &neg_omega2, t, 1, r, 1);
        float r_norm;
        cublasSnrm2(blas, n, r, 1, &r_norm);
        if (r_norm < tol) break;
        rho_old = rho;
    }
}
```

Preconditioned BiCGSTAB with breakdown detection and early termination.
```c
void pbicgstab(cusparseHandle_t sparse, cublasHandle_t blas,
               int n, int* rowPtr, int* colIdx, float* vals,
               float* b, float* x, void* precond, int max_iter, float tol) {
    float *r, *r0, *p, *v, *s, *t, *phat, *shat;
    // Allocate 8 device vectors of length n (cudaMalloc; error handling omitted)
    // r = b - A*x;  r0 = r (shadow residual);  p = r
    spmv(sparse, rowPtr, colIdx, vals, x, r, n);
    subtract(b, r, r, n);
    cudaMemcpy(r0, r, n * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaMemcpy(p,  r, n * sizeof(float), cudaMemcpyDeviceToDevice);
    float rho_old = 1.0f, alpha = 1.0f, omega = 1.0f;
    for (int iter = 0; iter < max_iter; iter++) {
        float rho;
        cublasSdot(blas, n, r0, 1, r, 1, &rho);
        // Breakdown: r has become (numerically) orthogonal to r0
        if (fabsf(rho) < 1e-30f) {
            // Restart with the current residual as the new shadow residual
            cudaMemcpy(r0, r, n * sizeof(float), cudaMemcpyDeviceToDevice);
            cudaMemcpy(p,  r, n * sizeof(float), cudaMemcpyDeviceToDevice);
            rho_old = 1.0f; alpha = 1.0f; omega = 1.0f;
            continue;
        }
        if (iter > 0) {
            float beta = (rho / rho_old) * (alpha / omega);
            update_p(p, r, v, beta, omega, n);   // p = r + beta*(p - omega*v)
        }
        apply_precond(precond, p, phat, n);      // phat = M^{-1} p
        spmv(sparse, rowPtr, colIdx, vals, phat, v, n);  // v = A*phat
        float r0v;
        cublasSdot(blas, n, r0, 1, v, 1, &r0v);
        alpha = rho / r0v;
        compute_s(s, r, v, alpha, n);            // s = r - alpha*v
        // Early termination: if s is already small, take the half-step and stop
        float s_norm;
        cublasSnrm2(blas, n, s, 1, &s_norm);
        if (s_norm < tol) {
            cublasSaxpy(blas, n, &alpha, phat, 1, x, 1);
            break;
        }
        apply_precond(precond, s, shat, n);      // shat = M^{-1} s
        spmv(sparse, rowPtr, colIdx, vals, shat, t, n);  // t = A*shat
        float ts, tt;
        cublasSdot(blas, n, t, 1, s, 1, &ts);
        cublasSdot(blas, n, t, 1, t, 1, &tt);
        omega = ts / tt;
        // x = x + alpha*phat + omega*shat
        cublasSaxpy(blas, n, &alpha, phat, 1, x, 1);
        cublasSaxpy(blas, n, &omega, shat, 1, x, 1);
        compute_r(r, s, t, omega, n);            // r = s - omega*t
        float r_norm;
        cublasSnrm2(blas, n, r, 1, &r_norm);
        if (r_norm < tol) break;
        rho_old = rho;
    }
}
```

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Non-symmetric n=500k | 450 iters | 85 iters (ILU) | 5.3x fewer |
| Memory vs GMRES(50) | 6n floats (BiCGSTAB) | 51n floats (GMRES) | 8.5x less |
| Cost per iteration | 2 SpMVs (BiCGSTAB) | 1 SpMV (GMRES) | Trade-off |
BiCGSTAB: fixed memory, possibly irregular convergence, and it can break down. GMRES: robust, monotone convergence, memory that grows with the restart length, and no breakdown (though it can stagnate). Use BiCGSTAB for memory-limited cases; use GMRES when robustness is critical.
Breakdown occurs when rho = (r0, r) approaches zero, i.e., when the residual r becomes nearly orthogonal to the shadow residual r0. Remedies: restart with r0 = r, use BiCGSTAB(l) for more stability, or switch to GMRES.
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.