Banded matrices have nonzeros only near the diagonal and arise in finite difference/finite element methods, spline interpolation, and time-series models. Exploiting the band structure reduces a dense O(n³) solve to O(n·k²), where k is the bandwidth. Tridiagonal systems (k=1) admit specialized O(n) algorithms; general banded systems use LU factorization in band storage.
- Specialized tridiagonal solver: O(n).
- Band LU for general bandwidths: O(n·k²).
- Parallel tridiagonal algorithm (cyclic reduction): O(n/p + log p) parallel time on p processors.
A dense solve ignores band structure entirely: it stores O(n²) entries instead of O(n·k) and spends O(n³) time instead of O(n·k²), i.e., a factor of n/k more memory and (n/k)² more work.
```cuda
// Treating a banded matrix as dense - extremely wasteful!
void band_solve_naive(cusolverDnHandle_t solver, float* d_A, float* d_b,
                      int n, int bandwidth) {
    // d_A is a full n x n matrix (wasteful!); d_work, d_pivot, and d_info
    // are the usual cuSOLVER workspace/pivot/status buffers.
    // Dense LU solve: O(n^3) instead of O(n*k^2)
    cusolverDnSgetrf(solver, n, n, d_A, n, d_work, d_pivot, d_info);
    cusolverDnSgetrs(solver, CUBLAS_OP_N, n, 1, d_A, n, d_pivot, d_b, n, d_info);
}
```

A tridiagonal system achieves O(n) via the Thomas algorithm; general bands use O(n·k²) band LU.
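Before moving to the GPU API, a host-side reference of the Thomas algorithm is useful for validating device results. A minimal sketch (`thomas_cpu` is illustrative, not a library routine):

```c
#include <assert.h>
#include <math.h>

/* Solve one tridiagonal system in O(n).
   a = subdiagonal (a[0] unused), b = diagonal, c = superdiagonal
   (c[n-1] unused), d = right-hand side. b and d are overwritten. */
static void thomas_cpu(const float *a, float *b, const float *c,
                       float *d, float *x, int n) {
    /* Forward elimination: zero out the subdiagonal */
    for (int i = 1; i < n; i++) {
        float w = a[i] / b[i - 1];
        b[i] -= w * c[i - 1];
        d[i] -= w * d[i - 1];
    }
    /* Back substitution on the resulting bidiagonal system */
    x[n - 1] = d[n - 1] / b[n - 1];
    for (int i = n - 2; i >= 0; i--)
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
}
```

Note that it mutates b and d in place, so keep copies if the coefficients are reused.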
The gtsv2 tridiagonal routines are provided by cuSPARSE:

```cuda
// Tridiagonal solve (bandwidth = 1) - O(n) complexity
void tridiagonal_solve(cusparseHandle_t handle, int n,
                       float* d_lower, float* d_diag, float* d_upper,
                       float* d_b, void* d_buffer) {
    // size d_buffer first with cusparseSgtsv2_bufferSizeExt
    cusparseSgtsv2(handle, n, 1, d_lower, d_diag, d_upper, d_b, n, d_buffer);
}

// Batched tridiagonal (many independent systems)
void tridiagonal_solve_batched(cusparseHandle_t handle, int n, int batch,
                               float* d_lower, float* d_diag, float* d_upper,
                               float* d_b, void* d_buffer) {
    // systems stored contiguously with stride n
    cusparseSgtsv2StridedBatch(handle, n, d_lower, d_diag, d_upper, d_b,
                               batch, n, d_buffer);
}
```
```cuda
// General band solve via band LU
void band_solve(float* d_AB, float* d_b, int n, int kl, int ku) {
    // d_AB in band storage: (2*kl + ku + 1) x n
    // (the extra kl rows hold fill-in from partial pivoting)
    int ldab = 2 * kl + ku + 1;
    // Band LU factorization - custom band-LU kernels (placeholders below)
    band_lu_factor(d_AB, n, kl, ku);
    band_lu_solve(d_AB, d_b, n, kl, ku);
}
```
The Thomas algorithm is sequential within one system, so a device kernel only makes sense when each thread owns an independent system:

```cuda
// Batched Thomas: one thread per independent tridiagonal system.
// Arrays hold `batch` systems of size n, stored contiguously (stride n).
__global__ void thomas_batched(float* a, float* b, float* c, float* d,
                               float* x, int n, int batch) {
    int sys = blockIdx.x * blockDim.x + threadIdx.x;
    if (sys >= batch) return;
    int o = sys * n;  // start of this thread's system
    // Forward elimination
    for (int i = 1; i < n; i++) {
        float w = a[o + i] / b[o + i - 1];
        b[o + i] -= w * c[o + i - 1];
        d[o + i] -= w * d[o + i - 1];
    }
    // Back substitution
    x[o + n - 1] = d[o + n - 1] / b[o + n - 1];
    for (int i = n - 2; i >= 0; i--) {
        x[o + i] = (d[o + i] - c[o + i] * x[o + i + 1]) / b[o + i];
    }
}
```

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Tridiagonal n=1M | 45s (dense) | 8ms (gtsv) | 5625x faster |
| Pentadiagonal n=100k | 12s (dense) | 25ms (band) | 480x faster |
| Memory n=1M tridiag | 4TB (dense) | 12MB (band) | 333000x less |
Band format stores only the diagonals that can be nonzero. For an n×n matrix with kl subdiagonals and ku superdiagonals, store a (kl+ku+1)×n array: entry A(i,j) goes to row ku+i−j, column j, so row ku holds the main diagonal and row ku+d holds subdiagonal d. (LU with partial pivoting needs kl extra rows for fill-in, which is why band_solve above uses 2·kl+ku+1.) This cuts memory from O(n²) to O(n·k).
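In the common LAPACK-style convention, entry A(i,j) maps to row ku+i−j of the band array. A tiny helper (illustrative only) makes the mapping concrete:

```c
#include <assert.h>

/* Row of the (kl+ku+1) x n band array that stores dense entry A(i,j):
   row ku is the main diagonal, row ku+d is subdiagonal d, and
   row ku-d is superdiagonal d. Valid only when -ku <= i-j <= kl. */
static int band_row(int ku, int i, int j) { return ku + (i - j); }
```

The column index is simply j, so each diagonal occupies one row of the band array.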
The Thomas algorithm is inherently sequential: O(n) work, but also O(n) dependent steps. Cyclic reduction takes only O(log n) parallel steps at the cost of more total work. On a GPU with thousands of cores, cyclic reduction (or a hybrid) is faster for a single large system; for batches of small systems, Thomas with one system per thread is the efficient choice.
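A serial sketch of parallel cyclic reduction (PCR) shows why it maps well to GPUs: every i in the inner loop can be updated by an independent thread, and after ceil(log2 n) passes each equation decouples. It assumes a[0] = c[n-1] = 0; `pcr_solve` and `PCR_MAX_N` are illustrative names, not library ones:

```c
#include <assert.h>
#include <math.h>
#include <string.h>

#define PCR_MAX_N 64  /* fixed bound for the scratch buffers in this sketch */

/* Parallel cyclic reduction for one tridiagonal system (serial simulation).
   Each pass eliminates x[i-s] and x[i+s] from equation i using its
   neighbor equations, doubling the coupling distance; after the last pass
   every equation reads b[i]*x[i] = d[i]. */
static void pcr_solve(const float *a0, const float *b0, const float *c0,
                      const float *d0, float *x, int n) {
    float a[PCR_MAX_N], b[PCR_MAX_N], c[PCR_MAX_N], d[PCR_MAX_N];
    float na[PCR_MAX_N], nb[PCR_MAX_N], nc[PCR_MAX_N], nd[PCR_MAX_N];
    memcpy(a, a0, n * sizeof *a); memcpy(b, b0, n * sizeof *b);
    memcpy(c, c0, n * sizeof *c); memcpy(d, d0, n * sizeof *d);
    for (int s = 1; s < n; s *= 2) {
        for (int i = 0; i < n; i++) {      /* on a GPU: one thread per i */
            float k1 = (i - s >= 0) ? a[i] / b[i - s] : 0.0f;
            float k2 = (i + s <  n) ? c[i] / b[i + s] : 0.0f;
            nb[i] = b[i] - ((i - s >= 0) ? k1 * c[i - s] : 0.0f)
                         - ((i + s <  n) ? k2 * a[i + s] : 0.0f);
            nd[i] = d[i] - ((i - s >= 0) ? k1 * d[i - s] : 0.0f)
                         - ((i + s <  n) ? k2 * d[i + s] : 0.0f);
            na[i] = (i - s >= 0) ? -k1 * a[i - s] : 0.0f;
            nc[i] = (i + s <  n) ? -k2 * c[i + s] : 0.0f;
        }
        memcpy(a, na, n * sizeof *a); memcpy(b, nb, n * sizeof *b);
        memcpy(c, nc, n * sizeof *c); memcpy(d, nd, n * sizeof *d);
    }
    for (int i = 0; i < n; i++) x[i] = d[i] / b[i];
}
```

The double buffering (na/nb/nc/nd) mirrors what a GPU implementation needs, since all i are updated from the same old coefficients.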
Two closing observations: the solve phase of band LU is itself a pair of banded triangular solves, and a banded matrix is just a special case of a sparse matrix; dedicated band routines nevertheless beat general sparse solvers because their memory access pattern is regular and predictable.
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.