GELU is x·Φ(x), where Φ is the standard Gaussian CDF; equivalently, GELU(x) = 0.5·x·(1 + erf(x/√2)). It is the standard activation in transformer models (BERT, GPT, and most of their successors). There are two common implementations: the exact form using erf(), and a faster tanh approximation.
Approximate GELU with tanh for speed.
```cuda
// GELU_tanh(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x³)))
__device__ float gelu_tanh(float x) {
    const float c = 0.7978845608028654f;  // sqrt(2/π)
    const float k = 0.044715f;
    return 0.5f * x * (1.0f + tanhf(c * (x + k * x * x * x)));
}
```
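For reference, a minimal scalar kernel that applies this device function elementwise could look like the sketch below (the kernel name `gelu_tanh_kernel` is illustrative, not from the original code):

```cuda
// Illustrative scalar wrapper: one thread per element, calling gelu_tanh.
__global__ void gelu_tanh_kernel(const float* x, float* y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        y[idx] = gelu_tanh(x[idx]);
    }
}
```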
Exact but slower due to erf computation.

```cuda
__global__ void gelu_exact(float* x, float* y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float v = x[idx];
        y[idx] = 0.5f * v * (1.0f + erff(v / sqrtf(2.0f)));
    }
}
```
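To make the exact kernel runnable end to end, here is a hedged host-side driver sketch; the buffer size, launch configuration, and (omitted) error checking are illustrative assumptions, not from the original post:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;                       // 1M elements (arbitrary size)
    std::vector<float> h_x(n, 1.0f), h_y(n);

    float *d_x = nullptr, *d_y = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int block = 256;
    const int grid  = (n + block - 1) / block;   // one thread per element
    gelu_exact<<<grid, block>>>(d_x, d_y, n);

    cudaMemcpy(h_y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("GELU(1.0) = %f\n", h_y[0]);          // expect ~0.8413

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```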
Vectorized tanh approximation or cuDNN.

```cuda
// Here n is the number of float4 elements (total floats / 4); buffers must be
// 16-byte aligned for the float4 loads/stores.
__global__ void gelu_fast(float4* x, float4* y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float4 v = x[idx];
        y[idx] = make_float4(
            gelu_tanh(v.x), gelu_tanh(v.y),
            gelu_tanh(v.z), gelu_tanh(v.w)
        );
    }
}
```
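The float4 kernel assumes 16-byte-aligned buffers and a float count that is a multiple of 4. A hedged sketch of handling the leftover tail with a scalar pass (kernel and variable names are illustrative):

```cuda
// Illustrative tail kernel: finishes the 0-3 floats that don't fill a float4.
__global__ void gelu_tail(const float* x, float* y, int start, int total) {
    int idx = start + blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < total) {
        y[idx] = gelu_tanh(x[idx]);
    }
}

// Host side (sketch):
//   int n4   = total / 4;              // float4 elements processed by gelu_fast
//   int tail = total - n4 * 4;         // 0..3 leftover floats
//   gelu_fast<<<grid4, block>>>(reinterpret_cast<float4*>(d_x),
//                               reinterpret_cast<float4*>(d_y), n4);
//   if (tail > 0)
//       gelu_tail<<<1, tail>>>(d_x, d_y, n4 * 4, total);
```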
Even faster: hand the elementwise op off to a library. Recent cuDNN versions expose GELU through the graph/backend pointwise API (CUDNN_POINTWISE_GELU_FWD and the tanh approximation CUDNN_POINTWISE_GELU_APPROX_TANH_FWD); the legacy cudnnSetActivationDescriptor() activation modes do not include GELU.

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Throughput (exact) | 280 GB/s | 280 GB/s | N/A |
| Throughput (tanh) | 450 GB/s | 620 GB/s | 2.2x faster than exact |
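The throughput figures are effective memory bandwidth (bytes read plus bytes written, divided by kernel time). Below is a sketch of how such a number can be measured with CUDA events; the helper name, launch configuration, and buffer handling are illustrative assumptions:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Times one launch of gelu_fast and prints effective bandwidth.
// d_x4 / d_y4 are device float4 buffers holding n4 elements (allocated by the caller).
void time_gelu_fast(float4* d_x4, float4* d_y4, int n4) {
    const int block = 256;
    const int grid  = (n4 + block - 1) / block;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    gelu_fast<<<grid, block>>>(d_x4, d_y4, n4);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    double bytes = 2.0 * n4 * sizeof(float4);   // one read + one write per float4
    double gbps  = bytes / (ms * 1.0e6);        // bytes / (ms * 1e-3 s) / 1e9 = GB/s
    printf("gelu_fast: %.3f ms, %.1f GB/s effective bandwidth\n", ms, gbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```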
The tanh approximation is faster and is what the original BERT and GPT-2 implementations used; note that PyTorch's nn.GELU defaults to the exact erf form, with approximate='tanh' as an opt-in. The exact version matches the mathematical definition, and the two differ by less than 0.01 for typical activation values.
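To sanity-check that gap yourself, here is a small host-side comparison that scans a range of inputs and reports the maximum difference between the erf and tanh forms (the range and step are arbitrary choices):

```cuda
#include <cmath>
#include <cstdio>

// Host-side comparison of exact GELU (erf) vs. the tanh approximation.
int main() {
    const float c = 0.7978845608028654f;  // sqrt(2/pi)
    const float k = 0.044715f;
    float max_diff = 0.0f, worst_x = 0.0f;

    for (float x = -6.0f; x <= 6.0f; x += 1e-3f) {
        float exact  = 0.5f * x * (1.0f + erff(x / sqrtf(2.0f)));
        float approx = 0.5f * x * (1.0f + tanhf(c * (x + k * x * x * x)));
        float d = fabsf(exact - approx);
        if (d > max_diff) { max_diff = d; worst_x = x; }
    }
    printf("max |exact - tanh| = %g at x = %g\n", max_diff, worst_x);
    return 0;
}
```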
GELU is standard for transformers. Its smooth probabilistic gating helps with gradient flow in deep attention networks.
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.