Sparse matrices appear in graph neural networks, recommender systems, and pruned neural networks. When a matrix is sufficiently sparse (typically more than 90% zeros), sparse formats save both memory and computation. NVIDIA's cuSPARSE library and Tensor Core sparsity support make sparse operations practical on GPUs. This guide covers sparse storage formats, cuSPARSE operations, and when sparsity helps versus hurts performance.
- **Formats:** CSR for SpMV, COO for incremental construction, BSR for block-sparse structure (a small layout example follows this list).
- **2:4 structured sparsity:** roughly 2x speedup on Ampere and newer Tensor Cores, at a fixed 50% sparsity.
- **Load balancing:** splitting long rows across threads gives better load balancing for power-law graphs (see the warp-per-row kernel below).
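To make the formats concrete, here is a small host-side sketch showing the same 4x4 matrix in COO and CSR; the array names mirror the kernels below:

```cuda
// A 4x4 example with 5 non-zeros:
//     [ 5 0 0 1 ]
//     [ 0 8 0 0 ]
//     [ 0 0 3 0 ]
//     [ 0 6 0 0 ]

// COO: one (row, col, value) triple per non-zero -- easy to build incrementally.
int   coo_row[] = {0, 0, 1, 2, 3};
int   coo_col[] = {0, 3, 1, 2, 1};
float coo_val[] = {5, 1, 8, 3, 6};

// CSR: compress the row array into offsets -- row r's entries live in
// [row_ptr[r], row_ptr[r+1]), which is exactly what SpMV kernels iterate over.
int   row_ptr[] = {0, 2, 3, 4, 5};
int   col_idx[] = {0, 3, 1, 2, 1};
float values[]  = {5, 1, 8, 3, 6};
```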
Basic CSR SpMV assigns one thread per row, which gives poor parallelism when some rows hold many non-zeros:

```cuda
// CSR format: row_ptr (row offsets), col_idx (column indices), values (non-zeros)
__global__ void spmv_csr_naive(const int* row_ptr, const int* col_idx,
                               const float* values, const float* x,
                               float* y, int num_rows) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;

    float sum = 0.0f;
    int row_start = row_ptr[row];
    int row_end   = row_ptr[row + 1];
    for (int j = row_start; j < row_end; j++) {
        sum += values[j] * x[col_idx[j]];  // scattered reads of x
    }
    y[row] = sum;
}
```
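The standard remedy for that load imbalance is the CSR-vector pattern: one warp per row, with lanes striding across the row's non-zeros and a shuffle reduction at the end. A minimal sketch (the launch must provide 32 threads per row):

```cuda
__global__ void spmv_csr_warp(const int* row_ptr, const int* col_idx,
                              const float* values, const float* x,
                              float* y, int num_rows) {
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane    = threadIdx.x % 32;
    if (warp_id >= num_rows) return;

    // Lanes cooperatively cover this row's non-zeros.
    float sum = 0.0f;
    for (int j = row_ptr[warp_id] + lane; j < row_ptr[warp_id + 1]; j += 32)
        sum += values[j] * x[col_idx[j]];

    // Warp-level tree reduction.
    for (int offset = 16; offset > 0; offset /= 2)
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0) y[warp_id] = sum;
}
```

For heavily skewed (power-law) row lengths, merge-based SpMV or cuSPARSE's own SpMV balance work even more aggressively.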
cuSPARSE provides optimized SpMM with multiple algorithms:

```cuda
#include <cusparse.h>
#include <cuda_runtime.h>

void sparse_matmul(cusparseHandle_t handle,
                   cusparseSpMatDescr_t A,   // sparse matrix (e.g., CSR)
                   cusparseDnMatDescr_t B,   // dense input matrix
                   cusparseDnMatDescr_t C) { // dense output matrix
    float alpha = 1.0f, beta = 0.0f;
    size_t bufferSize;

    // Query the workspace size for the chosen algorithm
    cusparseSpMM_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                            CUSPARSE_OPERATION_NON_TRANSPOSE,
                            &alpha, A, B, &beta, C,
                            CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT,
                            &bufferSize);

    void* buffer;
    cudaMalloc(&buffer, bufferSize);

    // Execute C = alpha * A * B + beta * C
    cusparseSpMM(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                 CUSPARSE_OPERATION_NON_TRANSPOSE,
                 &alpha, A, B, &beta, C,
                 CUDA_R_32F, CUSPARSE_SPMM_ALG_DEFAULT, buffer);

    cudaFree(buffer);  // release the workspace
}
```
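For context, here is a hypothetical driver for the wrapper above. It builds cuSPARSE generic-API descriptors from raw device arrays; the `d_*` names and dimensions are assumptions, not from the original post:

```cuda
#include <cusparse.h>

// Assumed device pointers: d_row_ptr/d_col_idx/d_values hold A in CSR;
// d_B (num_cols x n) and d_C (num_rows x n) are column-major dense matrices.
void run_spmm(cusparseHandle_t handle,
              int64_t num_rows, int64_t num_cols, int64_t n, int64_t nnz,
              int* d_row_ptr, int* d_col_idx, float* d_values,
              float* d_B, float* d_C) {
    cusparseSpMatDescr_t A;
    cusparseCreateCsr(&A, num_rows, num_cols, nnz,
                      d_row_ptr, d_col_idx, d_values,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);

    cusparseDnMatDescr_t B, C;
    cusparseCreateDnMat(&B, num_cols, n, /*ld=*/num_cols, d_B,
                        CUDA_R_32F, CUSPARSE_ORDER_COL);
    cusparseCreateDnMat(&C, num_rows, n, /*ld=*/num_rows, d_C,
                        CUDA_R_32F, CUSPARSE_ORDER_COL);

    sparse_matmul(handle, A, B, C);  // the wrapper defined above

    cusparseDestroySpMat(A);
    cusparseDestroyDnMat(B);
    cusparseDestroyDnMat(C);
}
```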
For 2:4 structured sparsity on Ampere and newer Tensor Cores, use the cusparseLt library (`cusparseLtMatmul`): roughly 2x speedup at a fixed 50% sparsity.
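The 2:4 pattern means every group of four consecutive values keeps at most two non-zeros. cusparseLt ships its own pruning and compression steps (e.g., `cusparseLtSpMMAPrune`); the host-side helper below only illustrates the pattern itself and is not part of any library API:

```cuda
#include <cmath>
#include <cstddef>

// Illustration only: prune a row-major float matrix to the 2:4 pattern by
// zeroing the two smallest-magnitude values in every group of four.
void prune_2_4(float* m, size_t count) {
    for (size_t g = 0; g + 4 <= count; g += 4) {
        int lo = 0, lo2 = 1;  // indices of the smallest and second-smallest |value|
        if (std::fabs(m[g + 1]) < std::fabs(m[g + 0])) { lo = 1; lo2 = 0; }
        for (int i = 2; i < 4; ++i) {
            if (std::fabs(m[g + i]) < std::fabs(m[g + lo]))       { lo2 = lo; lo = i; }
            else if (std::fabs(m[g + i]) < std::fabs(m[g + lo2])) { lo2 = i; }
        }
        m[g + lo]  = 0.0f;  // zero the two smallest-magnitude entries,
        m[g + lo2] = 0.0f;  // leaving exactly the 2:4 structured pattern
    }
}
```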
Typical gains from these optimizations:

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| SpMV throughput | 50 GB/s | 300 GB/s | 6x with cuSPARSE |
| 2:4 Sparse TC | 1x (dense) | 2x | With structured sparsity |
When does sparsity actually pay off? Generally when sparsity exceeds ~90% for SpMV and ~80% for SpMM; below those thresholds, dense kernels usually win because sparse formats give up coalesced access and dense Tensor Core throughput. Structured 2:4 sparsity is the exception: it delivers a consistent ~2x at just 50% sparsity on Ampere+ Tensor Cores.
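A rough traffic model shows why the threshold sits so high. A sketch, assuming 4-byte values and indices and ignoring reads of `x` and caching effects:

```cuda
#include <cstdio>

int main() {
    long long rows = 100000, cols = 100000;
    double density = 0.10;  // i.e., 90% sparse
    long long nnz = (long long)(rows * cols * density);

    double dense_bytes = 4.0 * rows * cols;             // every entry, fp32
    double csr_bytes   = 8.0 * nnz + 4.0 * (rows + 1);  // value + col index per nnz

    printf("dense: %.1f GB, CSR: %.1f GB\n", dense_bytes / 1e9, csr_bytes / 1e9);
    // Raw bytes break even at 50% density (8*nnz == 4*rows*cols), but
    // scattered reads of x and lost coalescing eat most of that margin --
    // which is why CSR SpMV typically needs ~90%+ sparsity to win.
    return 0;
}
```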
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.