Conjugate Gradient (CG) is the method of choice for symmetric positive definite (SPD) systems. In exact arithmetic it finds the exact solution in at most n iterations, and in practice it converges far sooner when the matrix is well conditioned. Each iteration needs only a sparse matrix-vector product plus a handful of dot products and vector updates, making CG ideal for large sparse matrices where explicit factorization is infeasible.
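How fast CG converges is governed by the condition number κ(A) = λ_max/λ_min; the standard bound on the error in the A-norm is

$$\|x_k - x^*\|_A \;\le\; 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{k} \|x_0 - x^*\|_A,$$

so anything that shrinks κ, preconditioning in particular, cuts the iteration count directly.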
The main optimization levers on the GPU:

- IC(0), ILU(0), or AMG preconditioning accelerates convergence dramatically.
- Combining axpy operations into a fused kernel reduces kernel launches (a sketch follows this list).
- In distributed settings, overlap communication with computation.
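As an example of the axpy-fusion point, the two vector updates inside the CG loop (x += alpha*p and r -= alpha*Ap) can be merged into one custom kernel so a single launch replaces two cublasSaxpy calls. A minimal sketch; the kernel name and launch configuration are illustrative, not part of the original code:

```cpp
// Fused CG update: x += alpha*p and r -= alpha*Ap in a single kernel launch,
// replacing two separate cublasSaxpy calls (and their launch overhead).
__global__ void fused_update(int n, float alpha,
                             const float* __restrict__ p,
                             const float* __restrict__ Ap,
                             float* __restrict__ x,
                             float* __restrict__ r) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] += alpha * p[i];
        r[i] -= alpha * Ap[i];
    }
}

// Usage inside the iteration loop:
//   fused_update<<<(n + 255) / 256, 256>>>(n, alpha, p, Ap, x, r);
```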
Unpreconditioned CG converges slowly for ill-conditioned matrices. The naive implementation below maps each CG step directly onto cuBLAS and cuSPARSE calls:
```cpp
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusparse.h>
#include <math.h>

// spmv() and subtract() are assumed helpers: y = A*x (CSR SpMV) and
// out = a - b, respectively.
void cg_naive(cublasHandle_t blas, cusparseHandle_t sparse,
              int n, int nnz, int* rowPtr, int* colIdx, float* vals,
              float* b, float* x, int max_iter, float tol) {
    float *r, *p, *Ap;
    cudaMalloc(&r,  n * sizeof(float));
    cudaMalloc(&p,  n * sizeof(float));
    cudaMalloc(&Ap, n * sizeof(float));
    const float one = 1.0f;

    // r = b - A*x, p = r
    spmv(sparse, rowPtr, colIdx, vals, x, r, n);
    subtract(b, r, r, n);
    cudaMemcpy(p, r, n * sizeof(float), cudaMemcpyDeviceToDevice);

    float rsold;
    cublasSdot(blas, n, r, 1, r, 1, &rsold);

    for (int i = 0; i < max_iter; i++) {
        spmv(sparse, rowPtr, colIdx, vals, p, Ap, n);
        float pAp;
        cublasSdot(blas, n, p, 1, Ap, 1, &pAp);
        float alpha = rsold / pAp;

        // x = x + alpha*p,  r = r - alpha*Ap
        cublasSaxpy(blas, n, &alpha, p, 1, x, 1);
        float neg_alpha = -alpha;
        cublasSaxpy(blas, n, &neg_alpha, Ap, 1, r, 1);

        float rsnew;
        cublasSdot(blas, n, r, 1, r, 1, &rsnew);
        if (sqrtf(rsnew) < tol) break;

        // p = r + beta*p
        float beta = rsnew / rsold;
        cublasSscal(blas, n, &beta, p, 1);
        cublasSaxpy(blas, n, &one, r, 1, p, 1);
        rsold = rsnew;
    }

    cudaFree(r); cudaFree(p); cudaFree(Ap);
}
```

IC(0) preconditioning provides significant speedup for SPD systems.

```cpp
void pcg_ic0(cusparseHandle_t sparse, cublasHandle_t blas,
             int n, int nnz, int* rowPtr, int* colIdx, float* vals,
             float* b, float* x, int max_iter, float tol) {
    // Copy the matrix values: csric02 factors in place, and the original
    // values are still needed for the SpMV calls inside the iteration.
    float* icVals;
    cudaMalloc(&icVals, nnz * sizeof(float));
    cudaMemcpy(icVals, vals, nnz * sizeof(float), cudaMemcpyDeviceToDevice);

    // Set up the IC(0) preconditioner (legacy csric02 API)
    cusparseMatDescr_t descr;
    cusparseCreateMatDescr(&descr);
    cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);
    csric02Info_t icInfo;
    cusparseCreateCsric02Info(&icInfo);

    // Analyze and factor: icVals is overwritten with the IC(0) factor L
    int bufSize;
    cusparseScsric02_bufferSize(sparse, n, nnz, descr, icVals, rowPtr, colIdx,
                                icInfo, &bufSize);
    void* buffer;
    cudaMalloc(&buffer, bufSize);
    cusparseScsric02_analysis(sparse, n, nnz, descr, icVals, rowPtr, colIdx,
                              icInfo, CUSPARSE_SOLVE_POLICY_NO_LEVEL, buffer);
    cusparseScsric02(sparse, n, nnz, descr, icVals, rowPtr, colIdx,
                     icInfo, CUSPARSE_SOLVE_POLICY_NO_LEVEL, buffer);

    float *r, *z, *p, *Ap;
    cudaMalloc(&r, n * sizeof(float));   cudaMalloc(&z, n * sizeof(float));
    cudaMalloc(&p, n * sizeof(float));   cudaMalloc(&Ap, n * sizeof(float));
    const float one = 1.0f;

    // r = b - A*x
    spmv(sparse, rowPtr, colIdx, vals, x, r, n);
    subtract(b, r, r, n);

    // z = M^{-1} r: triangular solves with L and L^T
    // (apply_ic_precond is an assumed helper wrapping the two sparse solves)
    apply_ic_precond(sparse, icInfo, icVals, rowPtr, colIdx, r, z, n);
    cudaMemcpy(p, z, n * sizeof(float), cudaMemcpyDeviceToDevice);
    float rz_old;
    cublasSdot(blas, n, r, 1, z, 1, &rz_old);

    for (int iter = 0; iter < max_iter; iter++) {
        spmv(sparse, rowPtr, colIdx, vals, p, Ap, n);
        float pAp;
        cublasSdot(blas, n, p, 1, Ap, 1, &pAp);
        float alpha = rz_old / pAp;

        // x = x + alpha*p,  r = r - alpha*Ap
        cublasSaxpy(blas, n, &alpha, p, 1, x, 1);
        float neg_alpha = -alpha;
        cublasSaxpy(blas, n, &neg_alpha, Ap, 1, r, 1);

        float r_norm;
        cublasSnrm2(blas, n, r, 1, &r_norm);
        if (r_norm < tol) break;

        // z = M^{-1} r, then p = z + beta*p
        apply_ic_precond(sparse, icInfo, icVals, rowPtr, colIdx, r, z, n);
        float rz_new;
        cublasSdot(blas, n, r, 1, z, 1, &rz_new);
        float beta = rz_new / rz_old;
        rz_old = rz_new;
        cublasSscal(blas, n, &beta, p, 1);
        cublasSaxpy(blas, n, &one, z, 1, p, 1);
    }

    cudaFree(r); cudaFree(z); cudaFree(p); cudaFree(Ap);
    cudaFree(icVals); cudaFree(buffer);
    cusparseDestroyCsric02Info(icInfo);
    cusparseDestroyMatDescr(descr);
}
```

| Metric | Naive CG | Optimized (PCG + IC(0)) | Improvement |
|---|---|---|---|
| Iterations (3D Poisson, 100³ grid) | 2400 | 45 | 53x fewer |
| Total solve time (n = 10⁶) | 28 s | 0.6 s | 47x faster |
| Time per iteration | 12 ms | 14 ms (incl. preconditioner) | similar |
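For reference, one possible shape for the apply_ic_precond helper used in the preconditioned solver above: it applies M^{-1} = (L L^T)^{-1} via two sparse triangular solves. This is a sketch under stated assumptions, not the article's actual helper: it uses the legacy cusparseScsrsv2 API (available through CUDA 11.x; newer toolkits use cusparseSpSV), and it assumes a descriptor descrL marked lower-triangular with non-unit diagonal, csrsv2 analysis objects infoL/infoLt, and a scratch buffer prepared once during setup, details the compact call in the listing omits.

```cpp
// z = M^{-1} r with M = L*L^T from IC(0):
//   forward solve  L*y = r, then backward solve  L^T*z = y.
void apply_ic_precond_sketch(cusparseHandle_t sparse, int n, int nnz,
                             cusparseMatDescr_t descrL, const float* icVals,
                             const int* rowPtr, const int* colIdx,
                             csrsv2Info_t infoL, csrsv2Info_t infoLt,
                             const float* r, float* y, float* z, void* buf) {
    const float one = 1.0f;
    // y = L^{-1} r  (lower triangular solve)
    cusparseScsrsv2_solve(sparse, CUSPARSE_OPERATION_NON_TRANSPOSE, n, nnz,
                          &one, descrL, icVals, rowPtr, colIdx, infoL,
                          r, y, CUSPARSE_SOLVE_POLICY_NO_LEVEL, buf);
    // z = L^{-T} y  (transposed solve with the same factor)
    cusparseScsrsv2_solve(sparse, CUSPARSE_OPERATION_TRANSPOSE, n, nnz,
                          &one, descrL, icVals, rowPtr, colIdx, infoLt,
                          y, z, CUSPARSE_SOLVE_POLICY_USE_LEVEL, buf);
}
```

In practice the descriptor, analysis objects, and buffer would be created next to the csric02 factorization and reused every iteration.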
SPD means A = A^T and all eigenvalues are positive; equivalently, x^T A x > 0 for all x ≠ 0. Common SPD matrices: discretized Poisson operators, Hessians of strictly convex functions, full-rank covariance matrices, and graph Laplacians once a reference node is grounded.
If CG stalls or diverges, likely causes are: (1) the matrix is not SPD, in which case use GMRES instead; (2) numerical issues, so check for NaN/Inf in the matrix and right-hand side; (3) preconditioner failure, since IC(0) can break down for indefinite matrices. Verify that A is SPD on a small test case, for example by checking its eigenvalues or attempting a Cholesky factorization, as sketched below.
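A minimal host-side sketch of that check for a small dense test matrix: symmetry plus a successful Cholesky factorization is equivalent to SPD. The function name and tolerance are illustrative.

```cpp
#include <cmath>
#include <vector>

// Returns true if the dense n x n matrix A (row-major) is symmetric positive
// definite: checks symmetry, then attempts an in-place Cholesky factorization,
// which succeeds exactly when all pivots are positive. A is taken by value so
// the caller's copy is untouched.
bool is_spd(std::vector<float> A, int n, float tol = 1e-6f) {
    // Symmetry check: A == A^T up to tolerance
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (std::fabs(A[i * n + j] - A[j * n + i]) > tol) return false;

    // Cholesky: A = L*L^T, column by column, stored in the lower triangle
    for (int j = 0; j < n; j++) {
        float d = A[j * n + j];
        for (int k = 0; k < j; k++) d -= A[j * n + k] * A[j * n + k];
        if (d <= 0.0f) return false;          // non-positive pivot: not SPD
        float Ljj = std::sqrt(d);
        A[j * n + j] = Ljj;
        for (int i = j + 1; i < n; i++) {
            float s = A[i * n + j];
            for (int k = 0; k < j; k++) s -= A[i * n + k] * A[j * n + k];
            A[i * n + j] = s / Ljj;
        }
    }
    return true;
}
```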