SOR accelerates Gauss-Seidel by over-relaxation: x_new = ω*x_GS + (1-ω)*x_old. With the optimal ω, the spectral radius improves from 1 - O(1/n²) (Gauss-Seidel) to 1 - O(1/n), cutting the iteration count from O(n²) to O(n). Finding the optimal ω requires an eigenvalue estimate. For the 2D Poisson problem on an n×n grid: ω_opt = 2/(1 + sin(π/n)).
- Use power iteration to estimate the spectral radius, then compute ω_opt.
- Reuse the red-black coloring from Gauss-Seidel, with the SOR update in place of the plain update.
- Apply symmetric SOR (SSOR) as a preconditioner for CG.
The naive version runs the entire sweep in a single GPU thread, with no parallelism:
```cuda
__global__ void sor_sweep(float* A, float* b, float* x, int n, float omega);

void sor_naive(float* d_A, float* d_b, float* d_x, int n, float omega, int max_iter) {
    for (int iter = 0; iter < max_iter; iter++) {
        sor_sweep<<<1, 1>>>(d_A, d_b, d_x, n, omega);  // One thread: sequential!
    }
    cudaDeviceSynchronize();
}

// Single-threaded sweep: each update of x[i] depends on the updates before it.
__global__ void sor_sweep(float* A, float* b, float* x, int n, float omega) {
    for (int i = 0; i < n; i++) {
        float sigma = 0.0f;
        for (int j = 0; j < n; j++) {
            if (j != i) sigma += A[i * n + j] * x[j];
        }
        float x_gs = (b[i] - sigma) / A[i * n + i];
        x[i] = omega * x_gs + (1.0f - omega) * x[i];
    }
}
```

Red-black SOR with the optimal ω parallelizes the sweep for the Poisson equation:
```cuda
// Update the red points (i + j even). h2 is the squared grid spacing h*h.
__global__ void sor_red(float* x, float* b, float omega, float h2, int nx, int ny) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;
    if (i < nx - 1 && j < ny - 1 && (i + j) % 2 == 0) {
        int idx = j * nx + i;
        // 5-point stencil Gauss-Seidel value, blended with omega.
        float x_gs = 0.25f * (x[idx-1] + x[idx+1] + x[idx-nx] + x[idx+nx] - h2 * b[idx]);
        x[idx] = omega * x_gs + (1.0f - omega) * x[idx];
    }
}

// Update the black points (i + j odd) using the freshly updated red values.
__global__ void sor_black(float* x, float* b, float omega, float h2, int nx, int ny) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;
    if (i < nx - 1 && j < ny - 1 && (i + j) % 2 == 1) {
        int idx = j * nx + i;
        float x_gs = 0.25f * (x[idx-1] + x[idx+1] + x[idx-nx] + x[idx+nx] - h2 * b[idx]);
        x[idx] = omega * x_gs + (1.0f - omega) * x[idx];
    }
}
```
```cuda
// Optimal omega for the 2D Poisson model problem on an n x n grid.
float compute_optimal_omega(int n) {
    return 2.0f / (1.0f + sinf(M_PI / n));
}
```

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Convergence (Poisson 128²) | 1650 iters (GS) | 127 iters (SOR ω=1.93) | 13x fewer |
| GPU parallel efficiency | 0% | 50% (red-black) | Usable |
For model problems (Poisson): ω_opt = 2/(1 + sin(πh)), where h = 1/n is the grid spacing. For general matrices: estimate the Jacobi spectral radius ρ(G_J) via power iteration, then ω_opt = 2/(1 + sqrt(1 - ρ²)). If ρ is unknown, ω in the range 1.5-1.8 is a reasonable starting point.
Symmetric SOR (SSOR) applies a forward SOR sweep followed by a backward sweep. The resulting iteration is symmetric, so SSOR can precondition CG. The SSOR preconditioner is M = (D + ωL)D^{-1}(D + ωU)/(ω(2-ω)), a common choice for simple preconditioning.
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.