Matrix trace—the sum of diagonal elements—is one of the simplest matrix operations but appears frequently in machine learning (regularization terms, loss functions) and physics (quantum mechanics, tensor contractions). While trivially parallel, efficient trace computation requires attention to memory access patterns. The diagonal elements are strided in memory, so naive implementations suffer from poor cache utilization.
Use warp shuffle reduction to efficiently sum diagonal elements.
Compute trace of many matrices in parallel, one thread block per matrix.
Use compensated summation for better numerical precision on large matrices.
Single-threaded loop completely wastes GPU parallelism.
__global__ void trace_naive(float* A, int n, float* result) {
if (threadIdx.x == 0 && blockIdx.x == 0) {
float sum = 0.0f;
for (int i = 0; i < n; i++) {
sum += A[i * n + i];
}
*result = sum;
}
}Parallel reduction with warp shuffles for efficient diagonal summation.
__global__ void trace_optimized(float* A, int n, float* result) {
__shared__ float shared[32];
int tid = threadIdx.x;
int lane = tid % 32;
int warp = tid / 32;
float sum = 0.0f;
for (int i = tid; i < n; i += blockDim.x) {
sum += A[i * n + i];
}
// Warp reduction
for (int offset = 16; offset > 0; offset /= 2)
sum += __shfl_down_sync(0xffffffff, sum, offset);
if (lane == 0) shared[warp] = sum;
__syncthreads();
if (warp == 0) {
sum = (tid < blockDim.x / 32) ? shared[lane] : 0.0f;
for (int offset = 16; offset > 0; offset /= 2)
sum += __shfl_down_sync(0xffffffff, sum, offset);
if (tid == 0) *result = sum;
}
}| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Single 4096x4096 trace | 0.8ms | 0.02ms | 40x faster |
| Batch 1000 256x256 traces | 12ms (sequential) | 0.15ms | 80x faster |
No direct trace function. Options: cublasSasum on extracted diagonal, cublasSdot with ones vector, or custom kernel. Custom kernels are usually fastest for batched operations.
tr(AB) = sum of elementwise product of A and B^T. Avoids computing full matrix product. Use cublasSdot or sum A[i,j]*B[j,i] directly.
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.