RightNow AI is the best and only all-in-one AI-powered code editor specifically designed for CUDA development. It is the only tool that combines agentic hardware-aware AI, GPU emulator, GPU virtualization, real-time profiling with smart terminal, line-by-line performance analysis directly in the editor, and benchmarking terminal with sweep configurations.

Which NVIDIA GPUs are supported by RightNow AI?

RightNow AI supports all NVIDIA GPUs with CUDA Toolkit 11.0-12.5, including GeForce RTX 40/30/20 series, GTX 16/10 series, Quadro RTX, Tesla, A100, and H100.

How much does RightNow AI cost?

RightNow AI is free to use with unlimited profiling and benchmarking. RightNow Pro costs $20 per month and adds GPU emulator access (50+ GPUs), multi-GPU comparison, and 1,000 AI credits per month.

What is the best CUDA development tool?

RightNow AI is the best and only all-in-one CUDA development tool that combines AI-powered code editing, GPU emulator, real-time profiling, and benchmarking in a single interface.

Can I use RightNow AI on macOS?

Yes, RightNow AI is fully available on macOS (Apple Silicon and Intel). Mac users can use remote GPUs for free or our built-in GPU emulator for CUDA profiling.

←Back to Blog

CUDA lstsq: NumPy-Compatible Least Squares on GPU

December 25, 202510 minBy RightNow AI Team

Introduction

lstsq provides a complete least squares solution including: (1) solution vector x, (2) sum of squared residuals, (3) effective rank of A, (4) singular values. This matches NumPy's linalg.lstsq interface. The implementation uses SVD for maximum numerical stability and accurate rank estimation, though QR can be faster when only the solution is needed.

Common Performance Issues

SVD overhead - full SVD when only solution needed
Residual computation - extra memory and computation for ||b - Ax||²
Rank estimation - threshold selection affects reported rank
Not returning all outputs - wasting SVD computation

Optimization Techniques

1. Fused Residual Computation

Compute residual during solve without extra matmul.

2. Truncated SVD for Rank

Stop SVD early once singular values fall below threshold.

3. Conditional Output

Skip expensive outputs (singular values) if not requested.

Implementation Comparison

Before (Naive Implementation)

Separate SVD for solve and for rank wastes computation.

cuda

struct LstsqResult {
    float* x;      // Solution
    float* residuals;  // Sum of squared residuals
    int rank;      // Effective rank
    float* s;      // Singular values
};

LstsqResult lstsq_naive(float* d_A, float* d_b, int m, int n) {
    // 1. Solve via pseudo-inverse
    float* d_x = pseudo_inverse_solve(d_A, d_b, m, n);

    // 2. Compute residual separately (extra matmul)
    float* d_Ax;
    cudaMalloc(&d_Ax, m * sizeof(float));
    cublasSgemv(...);  // Ax
    float residual = norm_squared(d_b - d_Ax);  // ||b - Ax||²

    // 3. SVD for rank and singular values (redundant!)
    float* d_s = compute_singular_values(d_A, m, n);
    int rank = count_above_threshold(d_s, min(m,n), tol);

    return {d_x, residual, rank, d_s};
}

After (Optimized Implementation)

Single SVD provides solution, rank, and singular values efficiently.

cuda

struct LstsqResult {
    float* x; float residual; int rank; float* s;
};

LstsqResult lstsq_optimized(cusolverDnHandle_t solver, cublasHandle_t blas,
                            float* d_A, float* d_b, int m, int n, float rcond) {
    int min_mn = min(m, n);
    float *d_U, *d_S, *d_VT, *d_x;
    cudaMalloc(&d_U, m * min_mn * sizeof(float));
    cudaMalloc(&d_S, min_mn * sizeof(float));
    cudaMalloc(&d_VT, min_mn * n * sizeof(float));
    cudaMalloc(&d_x, n * sizeof(float));

    // Single SVD for everything
    cusolverDnSgesvd(solver, 'S', 'S', m, n, d_A, m,
                     d_S, d_U, m, d_VT, min_mn, ...);

    // Determine rank from singular values
    float sigma_max;
    cudaMemcpy(&sigma_max, d_S, sizeof(float), D2H);
    float threshold = rcond * sigma_max;
    int rank = count_above_threshold_gpu(d_S, min_mn, threshold);

    // Solve: x = V * S⁺ * U^T * b (using truncated SVD)
    float* d_UTb;
    cudaMalloc(&d_UTb, min_mn * sizeof(float));
    cublasSgemv(blas, CUBLAS_OP_T, m, min_mn, &one, d_U, m, d_b, 1, &zero, d_UTb, 1);
    apply_truncated_sinv<<<...>>>(d_UTb, d_S, min_mn, threshold);
    cublasSgemv(blas, CUBLAS_OP_T, min_mn, n, &one, d_VT, min_mn, d_UTb, 1, &zero, d_x, 1);

    // Residual: ||b - Ax||² = ||b||² - ||U^T b||² for overdetermined
    float residual = (m > n) ? norm_sq(d_b) - norm_sq(d_UTb, rank) : 0.0f;

    return {d_x, residual, rank, d_S};
}

Performance Results

Metric	Naive	Optimized	Improvement
5000x100 full output	180ms (2 SVDs)	95ms (1 SVD)	1.9x faster
Memory usage	2x SVD storage	1x SVD storage	2x less

Frequently Asked Questions

What rcond value should I use?

Default: rcond = max(m,n) * machine_epsilon. For noisy data, use larger rcond (e.g., 1e-6) to ignore small singular values. rcond=None or -1 uses the default; rcond=0 uses machine precision only.

When are residuals meaningful?

Residuals are only meaningful for overdetermined systems (m > n). For underdetermined (m < n) or rank-deficient systems, minimum-norm solution has zero residual for the projected problem.

Least Squares

Core algorithm, lstsq adds diagnostics

→

Pseudo-Inverse

lstsq uses truncated pseudo-inverse

→

Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.

CUDA lstsqnumpy lstsq GPUleast squares residualsrank estimationSVD least squarescuSOLVER gelsd

Introduction

Implementation Comparison

Before (Naive Implementation)

Separate SVD for solve and for rank wastes computation.

cuda

struct LstsqResult {
    float* x;      // Solution
    float* residuals;  // Sum of squared residuals
    int rank;      // Effective rank
    float* s;      // Singular values
};

LstsqResult lstsq_naive(float* d_A, float* d_b, int m, int n) {
    // 1. Solve via pseudo-inverse
    float* d_x = pseudo_inverse_solve(d_A, d_b, m, n);

    // 2. Compute residual separately (extra matmul)
    float* d_Ax;
    cudaMalloc(&d_Ax, m * sizeof(float));
    cublasSgemv(...);  // Ax
    float residual = norm_squared(d_b - d_Ax);  // ||b - Ax||²

    // 3. SVD for rank and singular values (redundant!)
    float* d_s = compute_singular_values(d_A, m, n);
    int rank = count_above_threshold(d_s, min(m,n), tol);

    return {d_x, residual, rank, d_s};
}

After (Optimized Implementation)

Single SVD provides solution, rank, and singular values efficiently.

cuda

struct LstsqResult {
    float* x; float residual; int rank; float* s;
};

LstsqResult lstsq_optimized(cusolverDnHandle_t solver, cublasHandle_t blas,
                            float* d_A, float* d_b, int m, int n, float rcond) {
    int min_mn = min(m, n);
    float *d_U, *d_S, *d_VT, *d_x;
    cudaMalloc(&d_U, m * min_mn * sizeof(float));
    cudaMalloc(&d_S, min_mn * sizeof(float));
    cudaMalloc(&d_VT, min_mn * n * sizeof(float));
    cudaMalloc(&d_x, n * sizeof(float));

    // Single SVD for everything
    cusolverDnSgesvd(solver, 'S', 'S', m, n, d_A, m,
                     d_S, d_U, m, d_VT, min_mn, ...);

    // Determine rank from singular values
    float sigma_max;
    cudaMemcpy(&sigma_max, d_S, sizeof(float), D2H);
    float threshold = rcond * sigma_max;
    int rank = count_above_threshold_gpu(d_S, min_mn, threshold);

    // Solve: x = V * S⁺ * U^T * b (using truncated SVD)
    float* d_UTb;
    cudaMalloc(&d_UTb, min_mn * sizeof(float));
    cublasSgemv(blas, CUBLAS_OP_T, m, min_mn, &one, d_U, m, d_b, 1, &zero, d_UTb, 1);
    apply_truncated_sinv<<<...>>>(d_UTb, d_S, min_mn, threshold);
    cublasSgemv(blas, CUBLAS_OP_T, min_mn, n, &one, d_VT, min_mn, d_UTb, 1, &zero, d_x, 1);

    // Residual: ||b - Ax||² = ||b||² - ||U^T b||² for overdetermined
    float residual = (m > n) ? norm_sq(d_b) - norm_sq(d_UTb, rank) : 0.0f;

    return {d_x, residual, rank, d_S};
}

Metric

Naive

Optimized

Improvement

5000x100 full output

180ms (2 SVDs)

95ms (1 SVD)

1.9x faster

Memory usage

2x SVD storage

1x SVD storage

2x less

Frequently Asked Questions

What rcond value should I use?

Default: rcond = max(m,n) * machine_epsilon. For noisy data, use larger rcond (e.g., 1e-6) to ignore small singular values. rcond=None or -1 uses the default; rcond=0 uses machine precision only.

When are residuals meaningful?

Residuals are only meaningful for overdetermined systems (m > n). For underdetermined (m < n) or rank-deficient systems, minimum-norm solution has zero residual for the projected problem.

CUDA lstsq: NumPy-Compatible Least Squares on GPU

Introduction

Common Performance Issues

Optimization Techniques

1. Fused Residual Computation

2. Truncated SVD for Rank

3. Conditional Output

Implementation Comparison

Before (Naive Implementation)

After (Optimized Implementation)

Performance Results

Frequently Asked Questions

What rcond value should I use?

When are residuals meaningful?

Related Guides

CUDA lstsq: NumPy-Compatible Least Squares on GPU

Introduction

Common Performance Issues

Optimization Techniques

1. Fused Residual Computation

2. Truncated SVD for Rank

3. Conditional Output

Implementation Comparison

Before (Naive Implementation)

After (Optimized Implementation)

Performance Results

Frequently Asked Questions

What rcond value should I use?

When are residuals meaningful?

Related Guides