Mean Squared Error (MSE) is the fundamental regression loss. While the math is simple, an efficient CUDA implementation requires careful attention to reduction and memory access patterns. MSE gradients are trivial (2*(pred-target)/n), enabling fused forward-backward kernels.
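For reference, the loss and per-element gradient that the kernels below compute:

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\mathrm{pred}_i - \mathrm{target}_i\right)^2,
\qquad
\frac{\partial\,\mathrm{MSE}}{\partial\,\mathrm{pred}_i} = \frac{2\,(\mathrm{pred}_i - \mathrm{target}_i)}{n}
$$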
Compute the loss and gradient in the same kernel pass:

```cuda
__global__ void mse_fused(const float* pred, const float* target,
                          float* grad, float* loss, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float local_loss = 0.0f;
    if (idx < n) {
        float diff = pred[idx] - target[idx];
        grad[idx] = 2.0f * diff / n;   // gradient of the mean-reduced loss
        local_loss = diff * diff;      // squared error
    }
    // Reduce per-thread squared errors across the block, then let one
    // thread per block accumulate into *loss (zeroed before launch)
    local_loss = blockReduceSum(local_loss);
    if (threadIdx.x == 0) atomicAdd(loss, local_loss / n);
}
```
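The kernels in this post lean on `warpReduceSum` and `blockReduceSum` without showing them; a common shuffle-based implementation looks roughly like this (a sketch, assuming the block size is a multiple of 32 and at most 1024 threads):

```cuda
__device__ float warpReduceSum(float val) {
    // Reduce within a warp using register shuffles (no shared memory needed)
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__device__ float blockReduceSum(float val) {
    __shared__ float warp_sums[32];        // one partial sum per warp
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;

    val = warpReduceSum(val);              // 1) reduce within each warp
    if (lane == 0) warp_sums[warp] = val;  // 2) lane 0 publishes the warp sum
    __syncthreads();

    // 3) the first warp reduces the per-warp partials; result is valid in thread 0
    val = (threadIdx.x < blockDim.x / 32) ? warp_sums[lane] : 0.0f;
    if (warp == 0) val = warpReduceSum(val);
    return val;
}
```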
By contrast, the naive approach uses multiple kernels with intermediate storage:

```cuda
// Naive pipeline: diff_sq and d_sum are assumed to be preallocated device buffers
void mse_naive(float* pred, float* target, float* loss, int n) {
    // Kernel 1: element-wise squared differences into diff_sq
    squared_diff<<<blocks, threads>>>(pred, target, diff_sq, n);
    // Kernel 2: sum reduction of diff_sq into d_sum
    reduce_sum<<<...>>>(diff_sq, d_sum, n);
    // Step 3: copy the sum back and divide by n on the host
    float sum;
    cudaMemcpy(&sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
    *loss = sum / n;
}
```
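The `squared_diff` and `reduce_sum` kernels aren't shown here; the element-wise one would look something like this sketch (the exact signature is an assumption):

```cuda
// Hypothetical element-wise kernel for the naive pipeline: writes
// (pred[i] - target[i])^2 into a temporary buffer that a later kernel reduces.
__global__ void squared_diff(const float* pred, const float* target,
                             float* diff_sq, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float d = pred[idx] - target[idx];
        diff_sq[idx] = d * d;
    }
}
```

That extra round trip through diff_sq in global memory is exactly what the fused and optimized versions avoid.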
The optimized version uses float4 vectorization with an efficient hierarchical reduction:

```cuda
// Vectorized kernel: n4 = n / 4 float4 elements, consumed with a grid-stride loop
__global__ void mse_optimized(const float4* pred, const float4* target,
                              float* loss, int n4) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float local_sum = 0.0f;
    for (int i = idx; i < n4; i += blockDim.x * gridDim.x) {
        float4 p = pred[i];
        float4 t = target[i];
        float4 d = make_float4(p.x - t.x, p.y - t.y, p.z - t.z, p.w - t.w);
        local_sum += d.x*d.x + d.y*d.y + d.z*d.z + d.w*d.w;
    }
    // Two-level reduction: warp then block
    local_sum = warpReduceSum(local_sum);
    __shared__ float warp_sums[32];
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    if (lane == 0) warp_sums[warp] = local_sum;
    __syncthreads();
    if (warp == 0) {
        local_sum = (lane < blockDim.x / 32) ? warp_sums[lane] : 0.0f;
        local_sum = warpReduceSum(local_sum);
        if (lane == 0) atomicAdd(loss, local_sum);
    }
}
```
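A host-side wrapper might look like the sketch below. The launch configuration, the `mse_loss_host` name, and the assumption that n is a multiple of 4 and that all pointers come from cudaMalloc (so float4 loads are aligned) are illustrative, not from the original:

```cuda
// Hedged sketch: launch mse_optimized and finish the mean on the host.
float mse_loss_host(const float* d_pred, const float* d_target,
                    float* d_loss, int n) {
    cudaMemset(d_loss, 0, sizeof(float));      // atomicAdd target must start at zero
    int n4 = n / 4;                            // assumes n % 4 == 0
    int threads = 256;
    int blocks = (n4 + threads - 1) / threads;
    if (blocks > 1024) blocks = 1024;          // grid-stride loop covers the rest
    mse_optimized<<<blocks, threads>>>(
        reinterpret_cast<const float4*>(d_pred),
        reinterpret_cast<const float4*>(d_target),
        d_loss, n4);
    float sum = 0.0f;
    cudaMemcpy(&sum, d_loss, sizeof(float), cudaMemcpyDeviceToHost);
    return sum / n;                            // final division by n on the host
}
```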
Either way, the kernel accumulates only the raw sum of squared differences; the host divides the final sum by n to obtain the mean.

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Throughput | 120 GB/s | 410 GB/s | 3.4x faster |
| Kernel launches | 3 | 1 | 3x fewer |
MSE penalizes large errors more heavily because the error is squared, while MAE is more robust to outliers. MSE's gradient is smooth and proportional to the error (2*diff), whereas MAE's gradient has constant magnitude and is discontinuous at 0.
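For contrast, a fused MAE kernel in the same style would compute a sign-based gradient; this is a sketch for illustration, not something from the original post:

```cuda
// Hypothetical MAE analog of mse_fused: the gradient is sign(diff)/n,
// constant in magnitude and discontinuous where diff == 0.
__global__ void mae_fused(const float* pred, const float* target,
                          float* grad, float* loss, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float local_loss = 0.0f;
    if (idx < n) {
        float diff = pred[idx] - target[idx];
        grad[idx] = copysignf(1.0f / n, diff);  // subgradient; +1/n chosen at diff == 0
        local_loss = fabsf(diff);               // absolute error
    }
    local_loss = blockReduceSum(local_loss);
    if (threadIdx.x == 0) atomicAdd(loss, local_loss / n);
}
```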
For large n, a float32 accumulator loses precision: use Kahan summation or a double-precision accumulator, or normalize inputs to a reasonable range.
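As an example of the first option, the per-thread accumulation loop in mse_optimized could be rewritten with Kahan compensation (a sketch; note that aggressive fast-math flags can optimize the correction away):

```cuda
// Kahan (compensated) summation for the per-thread partial sum inside
// mse_optimized: c tracks the rounding error lost by each addition.
float local_sum = 0.0f, c = 0.0f;
for (int i = idx; i < n4; i += blockDim.x * gridDim.x) {
    float4 p = pred[i];
    float4 t = target[i];
    float4 d = make_float4(p.x - t.x, p.y - t.y, p.z - t.z, p.w - t.w);
    float term = d.x*d.x + d.y*d.y + d.z*d.z + d.w*d.w;

    float y = term - c;          // subtract the previously lost error
    float s = local_sum + y;     // new (rounded) running sum
    c = (s - local_sum) - y;     // recover what this addition lost
    local_sum = s;
}
```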
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.