Iterative solvers find approximate solutions through successive refinement, requiring only matrix-vector products. Key methods include CG (for SPD matrices) and GMRES or BiCGSTAB (for general non-symmetric systems). Convergence depends heavily on preconditioning: transforming the system so it has better spectral properties.
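A standard way to see why the spectral properties matter: for CG applied to an SPD system with condition number $\kappa(A)$, the classical error bound in the $A$-norm is

$$
\|x_k - x_\ast\|_A \;\le\; 2\left(\frac{\sqrt{\kappa(A)}-1}{\sqrt{\kappa(A)}+1}\right)^{k} \|x_0 - x_\ast\|_A,
$$

so a preconditioner that shrinks $\kappa$ (or clusters the eigenvalues) cuts the iteration count roughly in proportion to $\sqrt{\kappa}$.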
Practical levers for iterative solvers on the GPU:

- Use incomplete factorization preconditioners (ILU/IC) for general-purpose acceleration.
- Fuse dot products and vector updates into single kernels to reduce kernel launches (see the sketch after this list).
- Use FP32 for the inner iterations and FP64 for final refinement (mixed precision).
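To illustrate the fusion point, here is a minimal sketch (an assumption for this article, not code from the solvers below) of a kernel that fuses the residual update r = r - alpha*Ap with the partial reduction for (r, r), so one launch replaces an axpy plus a dot product. The kernel name, the fixed block size of 256, and the atomicAdd-based accumulation into a pre-zeroed scalar are choices made for the example.

```cuda
// Sketch: fused update r -= alpha * Ap plus partial dot product (r, r).
// One launch replaces a cublasSaxpy followed by a cublasSdot/Snrm2.
__global__ void fused_axpy_dot(const float* __restrict__ Ap,
                               float* __restrict__ r,
                               float alpha,
                               float* __restrict__ rr_out,  // single float, pre-zeroed on device
                               int n) {
    __shared__ float partial[256];                 // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float ri = 0.0f;
    if (i < n) {
        ri = r[i] - alpha * Ap[i];                 // vector update
        r[i] = ri;
    }
    partial[threadIdx.x] = ri * ri;                // local contribution to (r, r)
    __syncthreads();

    // Block-level tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(rr_out, partial[0]);  // accumulate across blocks
}
```

After the launch, the host reads back the single float to obtain the squared residual norm. For the mixed-precision point, the usual pattern is iterative refinement: run the solver in FP32, compute the residual b - Ax in FP64, solve for a correction in FP32 again, and repeat until the FP64 residual meets the tolerance.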
Jacobi converges slowly (the rate is governed by the spectral radius of the iteration matrix), and the dense implementation below touches every matrix entry on every sweep, so treat it as a baseline.
```cuda
// Jacobi - simple but slow convergence
__global__ void jacobi_step(float* A, float* b, float* x, float* x_new, int n);

void jacobi_solve(float* d_A, float* d_b, float* d_x, int n,
                  int max_iter, float tol) {
    float* d_x_new;
    cudaMalloc(&d_x_new, n * sizeof(float));

    int block = 256;
    int grid = (n + block - 1) / block;

    for (int iter = 0; iter < max_iter; iter++) {
        jacobi_step<<<grid, block>>>(d_A, d_b, d_x, d_x_new, n);
        cudaMemcpy(d_x, d_x_new, n * sizeof(float), cudaMemcpyDeviceToDevice);

        // compute_residual (helper not shown) returns ||b - A x||
        float residual = compute_residual(d_A, d_b, d_x, n);
        if (residual < tol) break;
    }
    cudaFree(d_x_new);
}

__global__ void jacobi_step(float* A, float* b, float* x, float* x_new, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float sum = b[i];
    for (int j = 0; j < n; j++) {
        if (j != i) sum -= A[i * n + j] * x[j];   // off-diagonal terms
    }
    x_new[i] = sum / A[i * n + i];                // divide by diagonal entry
}
```

PCG with incomplete-factorization (ILU/IC) preconditioning is the workhorse for large sparse SPD systems.
```cuda
void pcg_solve(cusparseHandle_t sparse, cublasHandle_t blas,
               int n, int nnz, int* d_rowPtr, int* d_colIdx, float* d_vals,
               float* d_b, float* d_x, void* precond, int max_iter, float tol) {
    float *d_r, *d_p, *d_z, *d_Ap;
    cudaMalloc(&d_r,  n * sizeof(float));
    cudaMalloc(&d_p,  n * sizeof(float));
    cudaMalloc(&d_z,  n * sizeof(float));
    cudaMalloc(&d_Ap, n * sizeof(float));

    const float one = 1.0f, minus_one = -1.0f;

    // r = b - A*x  (spmv / apply_precond are helper wrappers, not shown here)
    spmv(sparse, d_rowPtr, d_colIdx, d_vals, d_x, d_Ap, n);             // Ap = A*x
    cudaMemcpy(d_r, d_b, n * sizeof(float), cudaMemcpyDeviceToDevice);  // r = b
    cublasSaxpy(blas, n, &minus_one, d_Ap, 1, d_r, 1);                  // r = b - A*x

    // z = M^{-1} r (apply preconditioner)
    apply_precond(precond, d_r, d_z, n);
    cudaMemcpy(d_p, d_z, n * sizeof(float), cudaMemcpyDeviceToDevice);  // p = z

    float rz_old;
    cublasSdot(blas, n, d_r, 1, d_z, 1, &rz_old);  // assumes CUBLAS_POINTER_MODE_HOST (the default)

    for (int iter = 0; iter < max_iter; iter++) {
        // Ap = A * p
        spmv(sparse, d_rowPtr, d_colIdx, d_vals, d_p, d_Ap, n);

        // alpha = (r, z) / (p, Ap)
        float pAp;
        cublasSdot(blas, n, d_p, 1, d_Ap, 1, &pAp);
        float alpha = rz_old / pAp;

        // x = x + alpha * p
        cublasSaxpy(blas, n, &alpha, d_p, 1, d_x, 1);

        // r = r - alpha * Ap
        float neg_alpha = -alpha;
        cublasSaxpy(blas, n, &neg_alpha, d_Ap, 1, d_r, 1);

        // Check convergence (absolute tolerance; tol * ||b|| is common in practice)
        float r_norm;
        cublasSnrm2(blas, n, d_r, 1, &r_norm);
        if (r_norm < tol) break;

        // z = M^{-1} r
        apply_precond(precond, d_r, d_z, n);

        // beta = (r_new, z_new) / (r_old, z_old)
        float rz_new;
        cublasSdot(blas, n, d_r, 1, d_z, 1, &rz_new);
        float beta = rz_new / rz_old;
        rz_old = rz_new;

        // p = z + beta * p
        cublasSscal(blas, n, &beta, d_p, 1);
        cublasSaxpy(blas, n, &one, d_z, 1, d_p, 1);
    }

    cudaFree(d_r);
    cudaFree(d_p);
    cudaFree(d_z);
    cudaFree(d_Ap);
}
```

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Iterations (n=1M SPD system) | 5000 (no preconditioner) | 150 (ILU) | 33x fewer iterations |
| Time per iteration | 12ms (Jacobi) | 8ms (CG) | 1.5x faster |
| Total solve time | 60s | 1.2s | 50x faster |
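The `spmv` helper used in the listings above is left abstract. As a rough sketch (not the exact wrapper used here), an FP32 CSR SpMV with the cuSPARSE generic API might look like the following; note that `cusparseCreateCsr` needs `nnz`, which the call sites above do not pass, so a real wrapper would carry it (and the descriptors and workspace) in a small setup struct created once and reused every iteration rather than rebuilt per call as done here.

```cuda
// Sketch of an spmv wrapper: y = A * x for a CSR matrix in FP32.
// Descriptors and the workspace buffer should be created once and reused;
// they are rebuilt per call here only to keep the example self-contained.
void spmv(cusparseHandle_t handle, int* d_rowPtr, int* d_colIdx, float* d_vals,
          float* d_x, float* d_y, int n, int nnz) {
    const float one = 1.0f, zero = 0.0f;

    cusparseSpMatDescr_t matA;
    cusparseDnVecDescr_t vecX, vecY;
    cusparseCreateCsr(&matA, n, n, nnz, d_rowPtr, d_colIdx, d_vals,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnVec(&vecX, n, d_x, CUDA_R_32F);
    cusparseCreateDnVec(&vecY, n, d_y, CUDA_R_32F);

    size_t bufSize = 0;
    void* d_buf = nullptr;
    cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            &one, matA, vecX, &zero, vecY,
                            CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT, &bufSize);
    cudaMalloc(&d_buf, bufSize);

    cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &one, matA, vecX, &zero, vecY,
                 CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT, d_buf);

    cudaFree(d_buf);
    cusparseDestroyDnVec(vecX);
    cusparseDestroyDnVec(vecY);
    cusparseDestroySpMat(matA);
}
```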
SPD matrices: CG (Conjugate Gradient). Symmetric indefinite: MINRES. General non-symmetric: GMRES or BiCGSTAB. GMRES is the most robust, but its memory use grows with the iteration count; use restarted GMRES(m) to bound it.
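For a sense of scale: un-restarted GMRES keeps every Krylov basis vector, so after k iterations it stores roughly (k+1)·n values. For n = 10^6 unknowns in FP32, 50 iterations already hold about 51 × 10^6 × 4 B ≈ 200 MB of basis vectors alone; GMRES(50) caps the storage at that level by discarding the basis and restarting.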
Preconditioner options:

- ILU(0): cheap, general purpose.
- ILU(k): more fill-in, better convergence.
- IC (Incomplete Cholesky): for SPD matrices.
- AMG (Algebraic Multigrid): best for elliptic PDEs.
- Domain-specific: physics-based preconditioners.
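As a sketch of how the `precond` object for the PCG routine above might be built, cuSPARSE's csrilu02 routines compute an ILU(0) factorization in place on a copy of the CSR values. This shows the setup step only; applying M^{-1} additionally needs two sparse triangular solves each iteration (e.g. via cuSPARSE's sparse triangular-solve routines), and error/zero-pivot checking is omitted. The function name `setup_ilu0` and the choice to factor a copy `d_valsILU` are assumptions for the example.

```cuda
// Sketch: ILU(0) factorization of a CSR matrix with cuSPARSE csrilu02.
// d_valsILU starts as a copy of the matrix values and is overwritten in
// place with the L and U factors (unit diagonal of L not stored).
void setup_ilu0(cusparseHandle_t handle, int n, int nnz,
                int* d_rowPtr, int* d_colIdx, float* d_valsILU) {
    cusparseMatDescr_t descrA;
    cusparseCreateMatDescr(&descrA);
    cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO);

    csrilu02Info_t info;
    cusparseCreateCsrilu02Info(&info);

    int bufSize = 0;
    cusparseScsrilu02_bufferSize(handle, n, nnz, descrA, d_valsILU,
                                 d_rowPtr, d_colIdx, info, &bufSize);
    void* d_buf;
    cudaMalloc(&d_buf, bufSize);

    // Analysis pass: builds the level/dependency information
    cusparseScsrilu02_analysis(handle, n, nnz, descrA, d_valsILU,
                               d_rowPtr, d_colIdx, info,
                               CUSPARSE_SOLVE_POLICY_USE_LEVEL, d_buf);

    // Numerical factorization: d_valsILU now holds the ILU(0) factors
    cusparseScsrilu02(handle, n, nnz, descrA, d_valsILU,
                      d_rowPtr, d_colIdx, info,
                      CUSPARSE_SOLVE_POLICY_USE_LEVEL, d_buf);

    cudaFree(d_buf);
    cusparseDestroyCsrilu02Info(info);
    cusparseDestroyMatDescr(descrA);
}
```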
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.