SELU is λ·x for x > 0 and λ·α·(exp(x) − 1) for x ≤ 0, with specific constants (λ ≈ 1.0507, α ≈ 1.6733) chosen so that activations self-normalize: their mean and variance are pushed toward 0 and 1 from layer to layer. Networks built entirely from SELU activations can train without batch normalization.
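Written out, the piecewise form and the saturation limit follow directly from those constants:

```latex
\mathrm{SELU}(x) = \lambda \begin{cases} x & x > 0 \\ \alpha\,(e^{x} - 1) & x \le 0 \end{cases},
\qquad \lambda \approx 1.0507,\ \alpha \approx 1.6733,
\qquad \lim_{x \to -\infty} \mathrm{SELU}(x) = -\lambda\alpha \approx -1.7581
```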
Use exact SELU constants for self-normalization.

```cuda
// Exact constants from the self-normalizing networks paper.
#define SELU_LAMBDA 1.0507009873554804934193349852946f
#define SELU_ALPHA  1.6732632423543772848170429916717f

// Scalar SELU: lambda * x for x > 0, lambda * alpha * (exp(x) - 1) for x <= 0.
__device__ float selu(float x) {
    return SELU_LAMBDA * ((x > 0.0f) ? x : SELU_ALPHA * (expf(x) - 1.0f));
}
```
Simple SELU with approximate constants.

```cuda
// One thread per element; rounded constants (1.0507, 1.6733) baked in.
__global__ void selu_basic(float* x, float* y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float v = x[idx];
        y[idx] = 1.0507f * ((v > 0.0f) ? v : 1.6733f * (expf(v) - 1.0f));
    }
}
```
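A minimal host-side launch sketch for the scalar kernel, assuming device buffers `d_x`/`d_y` (the helper name and the 256-thread block size are illustrative, not taken from the benchmark below):

```cuda
// One thread per element; round the grid size up so every element is covered.
void launch_selu_basic(float* d_x, float* d_y, int n) {
    const int block = 256;
    const int grid  = (n + block - 1) / block;
    selu_basic<<<grid, block>>>(d_x, d_y, n);
}
```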
Vectorized with fast intrinsics.

```cuda
// Four elements per thread via float4 loads/stores, using the fast (lower-precision)
// __expf intrinsic. Here n is the number of float4 vectors, not individual floats.
__global__ void selu_opt(float4* x, float4* y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float4 v = x[idx];
        y[idx] = make_float4(
            SELU_LAMBDA * (v.x > 0.0f ? v.x : SELU_ALPHA * (__expf(v.x) - 1.0f)),
            SELU_LAMBDA * (v.y > 0.0f ? v.y : SELU_ALPHA * (__expf(v.y) - 1.0f)),
            SELU_LAMBDA * (v.z > 0.0f ? v.z : SELU_ALPHA * (__expf(v.z) - 1.0f)),
            SELU_LAMBDA * (v.w > 0.0f ? v.w : SELU_ALPHA * (__expf(v.w) - 1.0f))
        );
    }
}
```

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Throughput | 360 GB/s | 600 GB/s | 67% faster |
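One caveat worth making explicit: the float4 kernel only covers lengths that are a multiple of 4, so a real launcher needs a scalar pass for the remainder. A minimal sketch reusing the two kernels above, assuming device buffers `d_x`/`d_y` (the helper name and block size are illustrative):

```cuda
// Hypothetical launcher: vectorized kernel for the bulk, scalar kernel for the 0-3
// leftover elements. cudaMalloc'd pointers are sufficiently aligned for the float4 cast.
void launch_selu(float* d_x, float* d_y, int n) {
    const int block = 256;
    int n4   = n / 4;        // complete float4 groups
    int tail = n - n4 * 4;   // leftover scalar elements

    if (n4 > 0) {
        int grid = (n4 + block - 1) / block;
        selu_opt<<<grid, block>>>(reinterpret_cast<float4*>(d_x),
                                  reinterpret_cast<float4*>(d_y), n4);
    }
    if (tail > 0) {
        selu_basic<<<1, tail>>>(d_x + n4 * 4, d_y + n4 * 4, tail);
    }
}
```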
SELU fits fully connected networks where you want to avoid batch or layer normalization. The self-normalizing behavior only holds with lecun_normal weight initialization and AlphaDropout in place of standard dropout, and it is not suitable for CNNs with skip connections.
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.