Leaky ReLU addresses the "dying ReLU" problem by allowing a small, non-zero output for negative inputs. The implementation scales negative values by a configurable slope alpha (typically 0.01).
Avoid divergence with arithmetic selection:

```cuda
__device__ float leaky_relu(float x, float alpha) {
    return fmaxf(x, 0.0f) + alpha * fminf(x, 0.0f);
    // Or: x * (x > 0 ? 1.0f : alpha)
}
```

Conditional with potential branch divergence:
```cuda
__global__ void leaky_relu_naive(float* x, float* y, float alpha, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) y[idx] = x[idx] > 0.0f ? x[idx] : alpha * x[idx];
}
```

Branchless with float4 vectorization:
```cuda
__global__ void leaky_relu_opt(float4* x, float4* y, float alpha, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float4 v = x[idx];
        y[idx] = make_float4(
            fmaxf(v.x, 0.0f) + alpha * fminf(v.x, 0.0f),
            fmaxf(v.y, 0.0f) + alpha * fminf(v.y, 0.0f),
            fmaxf(v.z, 0.0f) + alpha * fminf(v.z, 0.0f),
            fmaxf(v.w, 0.0f) + alpha * fminf(v.w, 0.0f));
    }
}
```

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Throughput | 650 GB/s | 890 GB/s | 37% faster |
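For context, here is a minimal host-side launch sketch for the vectorized kernel above. The buffer names (`d_x`, `d_y`), the block size of 256, and the assumption that `n` is a multiple of 4 are illustrative choices, not part of the benchmark setup.

```cuda
// Hypothetical host-side launcher for leaky_relu_opt (defined above).
// Assumes n is a multiple of 4 so every element maps to one float4 lane,
// and that d_x/d_y come from cudaMalloc (16-byte aligned for float4 access).
void launch_leaky_relu_opt(float* d_x, float* d_y, float alpha, int n) {
    int n_vec = n / 4;                        // number of float4 elements
    int block = 256;                          // typical block size; tune per GPU
    int grid  = (n_vec + block - 1) / block;  // round up to cover all vectors
    leaky_relu_opt<<<grid, block>>>(
        reinterpret_cast<float4*>(d_x),
        reinterpret_cast<float4*>(d_y),
        alpha, n_vec);
}
```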
The default slope is 0.01; some networks use values in the 0.1-0.3 range. PReLU goes further and learns the optimal alpha per channel during training.
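To make the per-channel idea concrete, here is a hedged sketch of a PReLU forward kernel. The NCHW layout, the kernel name, and the `alpha` array are assumptions for illustration, not taken from the kernels above.

```cuda
// Illustrative PReLU forward pass with one learned slope per channel.
// Assumes a flattened NCHW tensor: idx = ((n_i * C + c_i) * HW + hw_i).
__global__ void prelu_forward(const float* x, float* y, const float* alpha,
                              int N, int C, int HW) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = N * C * HW;
    if (idx < total) {
        int c = (idx / HW) % C;  // recover channel from flat index
        float v = x[idx];
        // Same branchless max/min form as above, but with a per-channel slope.
        y[idx] = fmaxf(v, 0.0f) + alpha[c] * fminf(v, 0.0f);
    }
}
```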
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.