ReLU (Rectified Linear Unit) is the most common activation function in deep learning. Despite its simplicity (y = max(0, x)), an efficient CUDA implementation relies on vectorized memory access and on fusing the activation into the preceding operation rather than launching a standalone kernel.
Process four elements per thread with `float4`:

```cuda
// Vectorized ReLU: each thread handles one float4 (four elements at a time).
// n is the number of float4 elements, i.e. the total element count / 4.
__global__ void relu_vec4(float4* x, float4* y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float4 v = x[idx];
        y[idx] = make_float4(fmaxf(0.f, v.x), fmaxf(0.f, v.y),
                             fmaxf(0.f, v.z), fmaxf(0.f, v.w));
    }
}
```

One element per thread as the baseline:
```cuda
// Baseline ReLU: one element per thread.
__global__ void relu_naive(float* x, float* y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) y[idx] = fmaxf(0.f, x[idx]);
}
```
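As a usage sketch (not part of the original post), one way to launch the two kernels above is to let the `float4` kernel cover the bulk of the tensor and let the scalar kernel handle any leftover tail. The wrapper name `launch_relu`, the buffer names `d_in`/`d_out`, and the block size are illustrative assumptions; `cudaMalloc` returns pointers aligned well beyond the 16 bytes that `float4` loads require.

```cuda
// Hypothetical launcher: vectorized kernel for the bulk, scalar kernel for the tail.
// d_in and d_out are device buffers holding n floats (illustrative names).
void launch_relu(float* d_in, float* d_out, int n) {
    const int block = 256;
    int n_vec4 = n / 4;               // float4 chunks handled by relu_vec4
    int tail   = n - n_vec4 * 4;      // leftover elements (0-3)

    if (n_vec4 > 0) {
        int grid = (n_vec4 + block - 1) / block;
        relu_vec4<<<grid, block>>>(reinterpret_cast<float4*>(d_in),
                                   reinterpret_cast<float4*>(d_out), n_vec4);
    }
    if (tail > 0) {
        relu_naive<<<1, tail>>>(d_in + n_vec4 * 4, d_out + n_vec4 * 4, tail);
    }
}
```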
Fuse ReLU into the GEMM epilogue for zero-overhead activation:

```cuda
// In a cuBLASLt/CUTLASS GEMM epilogue: C = max(0, alpha*A*B + beta*C)
cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_RELU;
cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                               &epilogue, sizeof(epilogue));
```
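For orientation, here is a hedged sketch of where that attribute call sits in a larger cuBLASLt setup. The helper name `make_relu_gemm_desc` and the FP32 compute/scale types are assumptions for this example; handle creation, matrix layouts, and error checking are omitted.

```cuda
#include <cublasLt.h>

// Sketch: build a matmul descriptor that asks cuBLASLt to apply ReLU in the
// GEMM epilogue, i.e. C = max(0, alpha*A*B + beta*C), with no extra kernel.
cublasLtMatmulDesc_t make_relu_gemm_desc() {
    cublasLtMatmulDesc_t desc;
    cublasLtMatmulDescCreate(&desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);  // FP32 GEMM assumed

    cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_RELU;
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                   &epilogue, sizeof(epilogue));

    // Pass this descriptor (plus matrix layouts and a cublasLtHandle_t) to
    // cublasLtMatmul(); the activation is applied as each output tile is written.
    return desc;
}
```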
| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Throughput | 800 GB/s | 950 GB/s (fused) | ~19%, and fusion removes the standalone kernel entirely |
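To put numbers like those above in context: a standalone ReLU is memory-bound, reading and writing each element once, so effective bandwidth is 2 * n * sizeof(float) divided by kernel time. Below is a rough timing sketch with CUDA events; a real benchmark would warm up and average over many launches, and the function name and parameters are again illustrative.

```cuda
// Rough effective-bandwidth estimate (GB/s) for the standalone vectorized kernel.
float measure_relu_gbps(float* d_in, float* d_out, int n) {
    const int block = 256;
    int n_vec4 = n / 4;
    int grid = (n_vec4 + block - 1) / block;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    relu_vec4<<<grid, block>>>(reinterpret_cast<float4*>(d_in),
                               reinterpret_cast<float4*>(d_out), n_vec4);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    double bytes = 2.0 * n_vec4 * 4 * sizeof(float);     // one read + one write per element
    return static_cast<float>(bytes / (ms * 1e-3) / 1e9); // GB/s
}
```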
When is a standalone ReLU kernel worth writing? Rarely. Fuse it into the preceding GEMM or convolution whenever possible; keep a separate kernel only for debugging or for unusual architectures.
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.