Jacobi iteration solves Ax = b via the update x^(k+1) = D^(-1)(b - (L+U)x^(k)), where A = D + L + U splits A into its diagonal, strictly lower, and strictly upper parts. Each component updates independently of the others, making the method embarrassingly parallel. While slow to converge as a standalone solver, Jacobi is valuable as a smoother in multigrid and as a simple preconditioner.
Three practical notes:
- Damped update: use `x_new = (1 - ω) * x + ω * x_jacobi` for stability and better smoothing.
- Kernel fusion: combine the diagonal scaling and the update in a single kernel to avoid an extra pass over the data.
- Damping factor: ω = 2/3 is optimal for smoothing the Poisson equation.
Dense Jacobi iteration first - inefficient for sparse matrices, but a clear baseline:
```cuda
// Dense Jacobi sweep: one thread per row, O(n) work per thread.
__global__ void jacobi_naive(float* A, float* b, float* x, float* x_new, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sum = b[i];
    float diag = A[i * n + i];
    // Accumulate b_i - sum_{j != i} A_ij * x_j
    for (int j = 0; j < n; j++) {
        if (j != i) {
            sum -= A[i * n + j] * x[j];
        }
    }
    x_new[i] = sum / diag;
}
```
```cuda
// Standalone dense solver loop (for demonstration; prefer CG/BiCGSTAB in practice).
void jacobi_solve(float* d_A, float* d_b, float* d_x, int n, int max_iter, float tol) {
    float* d_x_new;
    cudaMalloc(&d_x_new, n * sizeof(float));
    for (int iter = 0; iter < max_iter; iter++) {
        jacobi_naive<<<(n + 255) / 256, 256>>>(d_A, d_b, d_x, d_x_new, n);
        cudaMemcpy(d_x, d_x_new, n * sizeof(float), cudaMemcpyDeviceToDevice);
        // Check convergence (expensive: requires a full residual reduction per iteration)
        float residual = compute_residual(d_A, d_b, d_x, n);
        if (residual < tol) break;
    }
    cudaFree(d_x_new);
}
```

Sparse damped Jacobi with a precomputed diagonal inverse, suitable for smoothing:
```cuda
// One damped Jacobi sweep on a CSR matrix: x_new = (1-ω)x + ω D^{-1}(b - (L+U)x).
__global__ void jacobi_sparse_damped(int* rowPtr, int* colIdx, float* vals,
                                     float* diag_inv, float* b, float* x, float* x_new,
                                     int n, float omega) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sum = b[i];
    // Sparse row iteration: walk the nonzeros of row i, skipping the diagonal
    for (int k = rowPtr[i]; k < rowPtr[i+1]; k++) {
        int j = colIdx[k];
        if (j != i) {
            sum -= vals[k] * x[j];
        }
    }
    float x_jacobi = sum * diag_inv[i];
    // Damped update: x_new = (1-ω)x + ω*x_jacobi
    x_new[i] = (1.0f - omega) * x[i] + omega * x_jacobi;
}
```
```cuda
// Apply `sweeps` damped Jacobi sweeps in place on d_x (e.g., as a multigrid smoother).
void jacobi_smoother(int* d_rowPtr, int* d_colIdx, float* d_vals,
                     float* d_diag_inv, float* d_b, float* d_x,
                     int n, int sweeps, float omega) {
    float* d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));
    float* d_in = d_x;
    float* d_out = d_buf;
    for (int s = 0; s < sweeps; s++) {
        jacobi_sparse_damped<<<(n + 255) / 256, 256>>>(
            d_rowPtr, d_colIdx, d_vals, d_diag_inv, d_b, d_in, d_out, n, omega);
        // Ping-pong buffers between sweeps
        float* tmp = d_in; d_in = d_out; d_out = tmp;
    }
    // After an odd number of sweeps the result lives in d_buf; copy it back.
    if (d_in != d_x)
        cudaMemcpy(d_x, d_in, n * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaFree(d_buf);
}
```

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Convergence rate (Poisson) | ρ ≈ 0.99 (slow) | ρ ≈ 0.97 (ω = 2/3) | ≈3x fewer iterations |
| As smoother (3 sweeps) | 3.5 ms | 3.5 ms | Same cost (both fully parallel) |
| vs. Gauss-Seidel | Fully parallel | Fully parallel | Gauss-Seidel is sequential within a sweep |
Use Jacobi as: (1) a smoother in multigrid (2-3 sweeps), (2) a simple preconditioner (diagonal scaling), or (3) a parallel baseline for comparison. Do not use it as a standalone solver; Krylov methods such as CG or BiCGSTAB converge far faster.
For the Poisson equation, ω = 2/3 is the optimal smoothing factor. For a general SPD matrix, ω = 2/(λ_min + λ_max), where the λ are eigenvalues of D^(-1)A, is optimal, but estimating them is expensive; starting with ω = 0.8 and adjusting works well in practice. For SPD systems, damped Jacobi converges whenever 0 < ω < 2/λ_max(D^(-1)A).
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.