Sigmoid squashes values to (0, 1), which makes it the standard output for binary classification and attention gates. The naive formulation 1 / (1 + exp(-x)) overflows in the intermediate exp(-x) for large-magnitude negative x (roughly x < -88 in float32); a numerically stable implementation covers the full float32 range.
Stable implementation: handle positive and negative x with different but algebraically equivalent branches, so the argument to expf is never positive and the exponential stays in (0, 1].
```cuda
// Numerically stable sigmoid: the argument to expf is always <= 0,
// so the exponential lies in (0, 1] and cannot overflow.
__device__ float sigmoid_stable(float x) {
    if (x >= 0) {
        float z = expf(-x);       // -x <= 0, so z <= 1
        return 1.0f / (1.0f + z);
    } else {
        float z = expf(x);        // x < 0, so z < 1
        return z / (1.0f + z);    // algebraically equal to 1 / (1 + exp(-x))
    }
}
```
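As a quick sanity check, a minimal test kernel (the name `check_extremes` and its launch configuration are illustrative, not part of the original post) can evaluate `sigmoid_stable` at the edges of the float32 range and confirm every output is finite and inside (0, 1):

```cuda
#include <cstdio>

// Illustrative sanity-check kernel: assumes sigmoid_stable (above) is
// defined in the same translation unit. Prints sigmoid at extreme inputs;
// all results should be finite and lie in (0, 1).
__global__ void check_extremes() {
    const float inputs[4] = {-1e38f, -100.0f, 100.0f, 1e38f};
    for (int i = 0; i < 4; ++i) {
        printf("sigmoid_stable(%g) = %g\n", inputs[i], sigmoid_stable(inputs[i]));
    }
}

// Example launch:
//   check_extremes<<<1, 1>>>();
//   cudaDeviceSynchronize();
```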
Naive version: the intermediate expf(-x) overflows for large-magnitude negative inputs (roughly x < -88 in float32).

```cuda
// Naive sigmoid: fine for moderate x, but expf(-x) overflows to infinity
// once x drops below roughly -88 in float32.
__global__ void sigmoid_naive(float* x, float* y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) y[idx] = 1.0f / (1.0f + expf(-x[idx]));  // intermediate overflow for x < -88
}
```
Optimized version: uses fast intrinsics together with the stable formulation.

```cuda
// Option 1: route through tanhf. sigmoid(x) = 0.5 + 0.5 * tanh(x / 2) is an
// exact identity; tanhf is well optimized, and fast-math variants keep the
// error below ~0.3%.
__device__ float sigmoid_fast(float x) {
    return 0.5f + 0.5f * tanhf(0.5f * x);
}

// Option 2: use the __expf intrinsic with the stable two-branch formulation.
__global__ void sigmoid_opt(float* x, float* y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float v = x[idx];
        y[idx] = (v >= 0) ? 1.0f / (1.0f + __expf(-v))
                          : __expf(v) / (1.0f + __expf(v));
    }
}
```
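For context, a minimal host-side launch might look like the sketch below (the helper name `launch_sigmoid`, the 256-thread block size, and the omission of error checking are assumptions for illustration, not details from the original benchmark):

```cuda
#include <cuda_runtime.h>

// Illustrative launch helper: assumes sigmoid_opt (above) is in scope and
// that d_x and d_y are device buffers holding n floats.
void launch_sigmoid(float* d_x, float* d_y, int n) {
    int threads = 256;                           // assumed block size; tune per GPU
    int blocks  = (n + threads - 1) / threads;   // round up so every element is covered
    sigmoid_opt<<<blocks, threads>>>(d_x, d_y, n);
}
```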
| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Throughput | 420 GB/s | 680 GB/s | 62% faster |
| Numerical range | Intermediate expf overflows for x < -88 | All of float32 | Robust |
Sigmoid vs. tanh: sigmoid outputs (0, 1), which suits probabilities; tanh outputs (-1, 1) and is zero-centered. They are related by tanh(x) = 2 * sigmoid(2x) - 1.
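To make the identity concrete, here is a small sketch (the name `tanh_via_sigmoid` is illustrative only) that builds tanh from the stable sigmoid defined earlier:

```cuda
// Illustration of tanh(x) = 2 * sigmoid(2x) - 1, reusing sigmoid_stable
// from above. In real kernels, call tanhf directly instead.
__device__ float tanh_via_sigmoid(float x) {
    return 2.0f * sigmoid_stable(2.0f * x) - 1.0f;
}
```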
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.