lstsq provides a complete least squares solution including: (1) solution vector x, (2) sum of squared residuals, (3) effective rank of A, (4) singular values. This matches NumPy's linalg.lstsq interface. The implementation uses SVD for maximum numerical stability and accurate rank estimation, though QR can be faster when only the solution is needed.
Compute residual during solve without extra matmul.
Stop SVD early once singular values fall below threshold.
Skip expensive outputs (singular values) if not requested.
Separate SVD for solve and for rank wastes computation.
struct LstsqResult {
float* x; // Solution
float* residuals; // Sum of squared residuals
int rank; // Effective rank
float* s; // Singular values
};
LstsqResult lstsq_naive(float* d_A, float* d_b, int m, int n) {
// 1. Solve via pseudo-inverse
float* d_x = pseudo_inverse_solve(d_A, d_b, m, n);
// 2. Compute residual separately (extra matmul)
float* d_Ax;
cudaMalloc(&d_Ax, m * sizeof(float));
cublasSgemv(...); // Ax
float residual = norm_squared(d_b - d_Ax); // ||b - Ax||²
// 3. SVD for rank and singular values (redundant!)
float* d_s = compute_singular_values(d_A, m, n);
int rank = count_above_threshold(d_s, min(m,n), tol);
return {d_x, residual, rank, d_s};
}Single SVD provides solution, rank, and singular values efficiently.
struct LstsqResult {
float* x; float residual; int rank; float* s;
};
LstsqResult lstsq_optimized(cusolverDnHandle_t solver, cublasHandle_t blas,
float* d_A, float* d_b, int m, int n, float rcond) {
int min_mn = min(m, n);
float *d_U, *d_S, *d_VT, *d_x;
cudaMalloc(&d_U, m * min_mn * sizeof(float));
cudaMalloc(&d_S, min_mn * sizeof(float));
cudaMalloc(&d_VT, min_mn * n * sizeof(float));
cudaMalloc(&d_x, n * sizeof(float));
// Single SVD for everything
cusolverDnSgesvd(solver, 'S', 'S', m, n, d_A, m,
d_S, d_U, m, d_VT, min_mn, ...);
// Determine rank from singular values
float sigma_max;
cudaMemcpy(&sigma_max, d_S, sizeof(float), D2H);
float threshold = rcond * sigma_max;
int rank = count_above_threshold_gpu(d_S, min_mn, threshold);
// Solve: x = V * S⁺ * U^T * b (using truncated SVD)
float* d_UTb;
cudaMalloc(&d_UTb, min_mn * sizeof(float));
cublasSgemv(blas, CUBLAS_OP_T, m, min_mn, &one, d_U, m, d_b, 1, &zero, d_UTb, 1);
apply_truncated_sinv<<<...>>>(d_UTb, d_S, min_mn, threshold);
cublasSgemv(blas, CUBLAS_OP_T, min_mn, n, &one, d_VT, min_mn, d_UTb, 1, &zero, d_x, 1);
// Residual: ||b - Ax||² = ||b||² - ||U^T b||² for overdetermined
float residual = (m > n) ? norm_sq(d_b) - norm_sq(d_UTb, rank) : 0.0f;
return {d_x, residual, rank, d_S};
}| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| 5000x100 full output | 180ms (2 SVDs) | 95ms (1 SVD) | 1.9x faster |
| Memory usage | 2x SVD storage | 1x SVD storage | 2x less |
Default: rcond = max(m,n) * machine_epsilon. For noisy data, use larger rcond (e.g., 1e-6) to ignore small singular values. rcond=None or -1 uses the default; rcond=0 uses machine precision only.
Residuals are only meaningful for overdetermined systems (m > n). For underdetermined (m < n) or rank-deficient systems, minimum-norm solution has zero residual for the projected problem.
Core algorithm, lstsq adds diagnostics
lstsq uses truncated pseudo-inverse
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.