The nuclear norm ||A||_* = sum of singular values is the tightest convex relaxation of matrix rank. It's fundamental to matrix completion (Netflix problem), robust PCA, and low-rank recovery algorithms. Computing nuclear norm requires SVD, making it expensive—O(mn²) for m×n matrix. However, for iterative optimization, we often only need the gradient or approximate values via randomized methods.
Use optimized cuSOLVER gesvd for accurate singular value computation.
For approximate nuclear norm, use randomized SVD which is much faster for low-rank matrices.
Get upper/lower bounds on nuclear norm without full SVD.
Computing full U, V matrices wastes memory and time.
float nuclear_norm_naive(float* d_A, int m, int n) {
float *d_U, *d_S, *d_V;
cudaMalloc(&d_U, m * m * sizeof(float));
cudaMalloc(&d_S, min(m,n) * sizeof(float));
cudaMalloc(&d_V, n * n * sizeof(float));
// Full SVD (wasteful - we don't need U, V)
compute_full_svd(d_A, d_U, d_S, d_V, m, n);
float* h_S = new float[min(m,n)];
cudaMemcpy(h_S, d_S, min(m,n) * sizeof(float), D2H);
float norm = 0;
for (int i = 0; i < min(m,n); i++) norm += h_S[i];
return norm;
}Computes only singular values (not U, V) for efficiency.
#include <cusolverDn.h>
float nuclear_norm(cusolverDnHandle_t handle, cublasHandle_t cublas,
float* d_A, int m, int n) {
int min_mn = min(m, n);
float* d_S;
cudaMalloc(&d_S, min_mn * sizeof(float));
int lwork;
cusolverDnSgesvd_bufferSize(handle, m, n, &lwork);
float* d_work;
cudaMalloc(&d_work, lwork * sizeof(float));
int* d_info;
cudaMalloc(&d_info, sizeof(int));
// Compute singular values only (no U, V)
cusolverDnSgesvd(handle, 'N', 'N', m, n, d_A, m,
d_S, NULL, m, NULL, n, d_work, lwork, NULL, d_info);
// Sum singular values
float norm;
cublasSasum(cublas, min_mn, d_S, 1, &norm);
return norm;
}| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| 4096x4096 matrix (full SVD) | 2.8s | 1.9s (S only) | 1.5x faster |
| 4096x4096 rank-100 (randomized) | 1.9s (exact) | 180ms (k=120) | 10x faster |
Use for matrix completion (filling missing entries), robust PCA (separating low-rank + sparse), and collaborative filtering. It promotes low-rank solutions.
The subgradient is UV^T where A = USV^T is the SVD. You need full SVD at each iteration, which is expensive. Proximal methods can reduce SVD frequency.
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.