Triangular systems Lx=b (lower) or Ux=b (upper) are solved by forward or back substitution in O(n²) operations. These solves form the final step of LU-, QR-, and Cholesky-based solvers. GPU parallelization is challenging because each unknown depends on previously computed ones, but multiple right-hand sides, batched operations, and blocked algorithms still achieve good throughput.
- Use the highly optimized triangular solvers from cuBLAS (trsv, trsm).
- Solve many small triangular systems in parallel with the batched API.
- Use trsm with a matrix right-hand side instead of multiple trsv calls.
Naive sequential substitution completely fails to utilize GPU parallelism.
```cuda
// Upper triangular solve - completely sequential: a single thread walks
// backward over the rows, so the GPU runs at a single lane's throughput
__global__ void back_sub_naive(float* U, float* b, float* x, int n) {
    if (threadIdx.x != 0) return;          // only thread 0 does any work
    for (int i = n - 1; i >= 0; i--) {
        float sum = b[i];
        for (int j = i + 1; j < n; j++) {  // row-major access: U[i*n + j]
            sum -= U[i * n + j] * x[j];
        }
        x[i] = sum / U[i * n + i];
    }
}
```

cuBLAS provides optimized triangular solvers for a single right-hand side (trsv), multiple right-hand sides (trsm), and batches of small systems (trsmBatched).
```cuda
// Single system solve: the solution overwrites d_b in place
void triangular_solve(cublasHandle_t handle, float* d_T, float* d_b, int n,
                      bool upper, bool transpose) {
    cublasFillMode_t uplo = upper ? CUBLAS_FILL_MODE_UPPER : CUBLAS_FILL_MODE_LOWER;
    cublasOperation_t trans = transpose ? CUBLAS_OP_T : CUBLAS_OP_N;
    cublasStrsv(handle, uplo, trans, CUBLAS_DIAG_NON_UNIT, n, d_T, n, d_b, 1);
}
```
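One caveat: cuBLAS assumes column-major storage. A row-major upper-triangular matrix (as in the naive kernel above) is the transpose of a column-major lower-triangular one, so a sketch like the following handles it without any explicit transpose (the wrapper name is just for illustration):

```cuda
// Row-major upper-triangular T: cuBLAS sees the buffer as the transpose,
// i.e. a column-major lower-triangular matrix, so request CUBLAS_OP_T.
void triangular_solve_rowmajor_upper(cublasHandle_t handle, float* d_T,
                                     float* d_b, int n) {
    cublasStrsv(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_T,
                CUBLAS_DIAG_NON_UNIT, n, d_T, n, d_b, 1);
}
```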
```cuda
// Multiple RHS (more efficient): solves T * X = alpha * B, X overwrites d_B
void triangular_solve_multi(cublasHandle_t handle, float* d_T, float* d_B,
                            int n, int nrhs, bool upper) {
    float alpha = 1.0f;
    cublasStrsm(handle, CUBLAS_SIDE_LEFT,
                upper ? CUBLAS_FILL_MODE_UPPER : CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT, n, nrhs, &alpha, d_T, n, d_B, n);
}
```
```cuda
// Batched small systems: cuBLAS has no batched trsv, so use trsmBatched
// with a single right-hand side per system
void triangular_solve_batched(cublasHandle_t handle, float** d_Ts, float** d_bs,
                              int n, int batch_size, bool upper) {
    float alpha = 1.0f;
    cublasStrsmBatched(handle, CUBLAS_SIDE_LEFT,
                       upper ? CUBLAS_FILL_MODE_UPPER : CUBLAS_FILL_MODE_LOWER,
                       CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT, n, 1, &alpha,
                       d_Ts, n, d_bs, n, batch_size);
}
```

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Single 4096x4096 solve | 85ms | 2.1ms (cuBLAS) | 40x faster |
| 1000 RHS together | 85s (sequential) | 180ms (trsm) | 472x faster |
| Batch 10000 64x64 | 12s | 15ms (batched) | 800x faster |
Each element x[i] depends on x[i+1..n] (back substitution) or x[1..i-1] (forward substitution), so the unknowns form a sequential dependency chain. GPU parallelism therefore comes from: (1) multiple right-hand sides, (2) batching independent systems, and (3) blocked algorithms whose sub-block updates run in parallel (sketched below).
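To make (3) concrete, here is an illustrative blocked back substitution built from cuBLAS calls; it is not the algorithm cuBLAS uses internally, and the name blocked_back_sub and the assumption that the block size nb divides n are for illustration only. Each diagonal block is solved with a small trsm, and the bulk of the work shifts into a gemm update that parallelizes well.

```cuda
// Blocked back substitution for a column-major upper-triangular d_U (n x n,
// leading dimension n) and nrhs right-hand sides in d_B (n x nrhs, ld n).
// Assumes nb divides n. Processes blocks from the bottom-right upward.
void blocked_back_sub(cublasHandle_t handle, const float* d_U, float* d_B,
                      int n, int nrhs, int nb) {
    float one = 1.0f, minus_one = -1.0f;
    for (int k = n - nb; k >= 0; k -= nb) {
        // Solve the diagonal block: U(k:k+nb, k:k+nb) * X(k:k+nb, :) = B(k:k+nb, :)
        cublasStrsm(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_UPPER,
                    CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT, nb, nrhs, &one,
                    d_U + k * n + k, n, d_B + k, n);
        // Update the rows above: B(0:k, :) -= U(0:k, k:k+nb) * X(k:k+nb, :)
        if (k > 0) {
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, k, nrhs, nb,
                        &minus_one, d_U + k * n, n, d_B + k, n,
                        &one, d_B, n);
        }
    }
}
```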
Use trsv for a single right-hand-side vector. Use trsm for multiple RHS; one trsm call is far more efficient than many trsv calls. If you have one RHS today but expect more later, structure the code around trsm from the start.
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.