Triangular systems Lx=b (lower) or Ux=b (upper) are solved by forward or back substitution in O(n²) operations. These solves form the final step of LU-, QR-, and Cholesky-based solvers. GPU parallelization is challenging because each unknown depends on previously computed ones, but multiple right-hand sides, batched operations, and blocked algorithms still achieve good throughput.
- Use the highly optimized triangular solvers from cuBLAS (trsv, trsm).
- Solve many small triangular systems in parallel with the batched API.
- Use trsm with a matrix right-hand side instead of multiple trsv calls.
Naive sequential substitution completely fails to utilize GPU parallelism.
```cuda
// Upper triangular solve - completely sequential: a single thread walks
// backward over the rows, so the GPU runs at a single lane's throughput
__global__ void back_sub_naive(float* U, float* b, float* x, int n) {
    if (threadIdx.x != 0) return;          // only thread 0 does any work
    for (int i = n - 1; i >= 0; i--) {
        float sum = b[i];
        for (int j = i + 1; j < n; j++) {  // row-major access: U[i*n + j]
            sum -= U[i * n + j] * x[j];
        }
        x[i] = sum / U[i * n + i];
    }
}
```

cuBLAS provides optimized triangular solvers for a single right-hand side (trsv), multiple right-hand sides (trsm), and batches of small systems (trsmBatched).
```cuda
// Single system solve: the solution overwrites d_b in place
void triangular_solve(cublasHandle_t handle, float* d_T, float* d_b, int n,
                      bool upper, bool transpose) {
    cublasFillMode_t uplo = upper ? CUBLAS_FILL_MODE_UPPER : CUBLAS_FILL_MODE_LOWER;
    cublasOperation_t trans = transpose ? CUBLAS_OP_T : CUBLAS_OP_N;
    cublasStrsv(handle, uplo, trans, CUBLAS_DIAG_NON_UNIT, n, d_T, n, d_b, 1);
}
```
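One caveat: cuBLAS assumes column-major storage. A row-major upper-triangular matrix (as in the naive kernel above) is the transpose of a column-major lower-triangular one, so a sketch like the following handles it without any explicit transpose (the wrapper name is just for illustration):

```cuda
// Row-major upper-triangular T: cuBLAS sees the buffer as the transpose,
// i.e. a column-major lower-triangular matrix, so request CUBLAS_OP_T.
void triangular_solve_rowmajor_upper(cublasHandle_t handle, float* d_T,
                                     float* d_b, int n) {
    cublasStrsv(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_T,
                CUBLAS_DIAG_NON_UNIT, n, d_T, n, d_b, 1);
}
```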
```cuda
// Multiple RHS (more efficient): solves T * X = alpha * B, X overwrites d_B
void triangular_solve_multi(cublasHandle_t handle, float* d_T, float* d_B,
                            int n, int nrhs, bool upper) {
    float alpha = 1.0f;
    cublasStrsm(handle, CUBLAS_SIDE_LEFT,
                upper ? CUBLAS_FILL_MODE_UPPER : CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT, n, nrhs, &alpha, d_T, n, d_B, n);
}
```
```cuda
// Batched small systems: cuBLAS has no batched trsv, so use trsmBatched
// with a single right-hand side per system
void triangular_solve_batched(cublasHandle_t handle, float** d_Ts, float** d_bs,
                              int n, int batch_size, bool upper) {
    float alpha = 1.0f;
    cublasStrsmBatched(handle, CUBLAS_SIDE_LEFT,
                       upper ? CUBLAS_FILL_MODE_UPPER : CUBLAS_FILL_MODE_LOWER,
                       CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT, n, 1, &alpha,
                       d_Ts, n, d_bs, n, batch_size);
}
```

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Single 4096x4096 solve | 85ms | 2.1ms (cuBLAS) | 40x faster |
| 1000 RHS together | 85s (sequential) | 180ms (trsm) | 472x faster |
| Batch 10000 64x64 | 12s | 15ms (batched) | 800x faster |
Each element x[i] depends on x[i+1..n] (back substitution) or x[1..i-1] (forward substitution), so the unknowns form a sequential dependency chain. GPU parallelism therefore comes from: (1) multiple right-hand sides, (2) batching independent systems, and (3) blocked algorithms whose sub-block updates run in parallel (sketched below).
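To make (3) concrete, here is an illustrative blocked back substitution built from cuBLAS calls; it is not the algorithm cuBLAS uses internally, and the name blocked_back_sub and the assumption that the block size nb divides n are for illustration only. Each diagonal block is solved with a small trsm, and the bulk of the work shifts into a gemm update that parallelizes well.

```cuda
// Blocked back substitution for a column-major upper-triangular d_U (n x n,
// leading dimension n) and nrhs right-hand sides in d_B (n x nrhs, ld n).
// Assumes nb divides n. Processes blocks from the bottom-right upward.
void blocked_back_sub(cublasHandle_t handle, const float* d_U, float* d_B,
                      int n, int nrhs, int nb) {
    float one = 1.0f, minus_one = -1.0f;
    for (int k = n - nb; k >= 0; k -= nb) {
        // Solve the diagonal block: U(k:k+nb, k:k+nb) * X(k:k+nb, :) = B(k:k+nb, :)
        cublasStrsm(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_UPPER,
                    CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT, nb, nrhs, &one,
                    d_U + k * n + k, n, d_B + k, n);
        // Update the rows above: B(0:k, :) -= U(0:k, k:k+nb) * X(k:k+nb, :)
        if (k > 0) {
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, k, nrhs, nb,
                        &minus_one, d_U + k * n, n, d_B + k, n,
                        &one, d_B, n);
        }
    }
}
```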
Use trsv for a single right-hand-side vector. Use trsm for multiple RHS; one trsm call is far more efficient than many trsv calls. If you have one RHS today but expect more later, structure the code around trsm from the start.
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.