QR decomposition factors A = QR where Q is orthogonal and R is upper triangular. Solving Ax = b becomes: (1) compute Q^Tb, (2) solve Rx = Q^Tb via back-substitution. QR is more stable than LU for ill-conditioned matrices since orthogonal transformations preserve condition number.
Apply Householder reflectors without forming Q explicitly.
Optimized application of Q to vectors/matrices.
Apply multiple reflectors as block operations.
Forming explicit Q wastes O(m²) memory and O(m²n) time.
void qr_solve_naive(float* d_A, float* d_b, float* d_x, int m, int n) {
float *d_Q, *d_R;
cudaMalloc(&d_Q, m * m * sizeof(float)); // Full Q is huge!
cudaMalloc(&d_R, m * n * sizeof(float));
// Form explicit Q and R (expensive!)
form_explicit_qr(d_A, d_Q, d_R, m, n);
// y = Q^T b (matrix-vector multiply)
float* d_y;
cudaMalloc(&d_y, m * sizeof(float));
cublasSgemv(blas, CUBLAS_OP_T, m, m, &one, d_Q, m, d_b, 1, &zero, d_y, 1);
// Solve R x = y (only first n components)
cublasSrsv(blas, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT,
n, d_R, m, d_y, 1);
cudaMemcpy(d_x, d_y, n * sizeof(float), D2D);
}Implicit Q application avoids O(m²) memory and extra computation.
void qr_solve_optimized(cusolverDnHandle_t solver, cublasHandle_t blas,
float* d_A, float* d_b, float* d_x, int m, int n) {
// QR factorization (stores R in upper triangle, Householder vectors below)
float* d_tau;
cudaMalloc(&d_tau, n * sizeof(float));
int lwork;
cusolverDnSgeqrf_bufferSize(solver, m, n, d_A, m, &lwork);
float* d_work;
cudaMalloc(&d_work, lwork * sizeof(float));
cusolverDnSgeqrf(solver, m, n, d_A, m, d_tau, d_work, lwork, d_info);
// Apply Q^T to b implicitly (no explicit Q formed)
cusolverDnSormqr_bufferSize(solver, CUBLAS_SIDE_LEFT, CUBLAS_OP_T,
m, 1, n, d_A, m, d_tau, d_b, m, &lwork);
cudaMalloc(&d_work, lwork * sizeof(float));
cusolverDnSormqr(solver, CUBLAS_SIDE_LEFT, CUBLAS_OP_T,
m, 1, n, d_A, m, d_tau, d_b, m, d_work, lwork, d_info);
// Solve R x = Q^T b (R is in upper triangle of d_A)
cublasSrsv(blas, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT,
n, d_A, m, d_b, 1);
cudaMemcpy(d_x, d_b, n * sizeof(float), D2D);
}| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| 4096x512 solve | 320ms (explicit Q) | 85ms (implicit) | 3.8x faster |
| Memory usage | O(m²) | O(mn) | 8x less for m=4096 |
LU is 2x faster but less stable. Use QR when: κ(A) > 1e6, accuracy is critical, or A is nearly singular. LU suffices for well-conditioned systems.
Factor once (geqrf), then apply Q^T and solve R for each RHS. For many RHS, use ormqr with multiple columns and trsm instead of trsv.
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.