Activation functions are element-wise operations applied after linear layers. While individually simple, they are applied billions of times in large models. The key optimization is fusion: combining activations with adjacent operations to cut memory traffic. This guide covers common activations, their GPU implementations, and fusion strategies.
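For example, the fp32 output of a 4096 × 8192 GEMM is 128 MB; running the activation as a separate kernel reads and rewrites that tensor, adding roughly 256 MB of DRAM traffic per layer that fusion avoids entirely.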
- Fuse the matmul and activation into a single kernel so the output is written only once.
- Use `float4` vectorized loads and stores so each thread moves four elements per memory transaction (see the sketch after this list).
- Use the tanh approximation for a faster GELU.
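As a sketch of the `float4` point (assuming `n` is a multiple of 4 and the buffers are 16-byte aligned; a production kernel also needs a scalar tail loop), each thread handles four elements per load and store:

```cuda
// Vectorized ReLU: one 16-byte load and one 16-byte store per thread.
__global__ void relu_vec4(const float4* x, float4* y, int n4) {  // n4 = n / 4
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = x[i];
        v.x = fmaxf(0.0f, v.x);
        v.y = fmaxf(0.0f, v.y);
        v.z = fmaxf(0.0f, v.z);
        v.w = fmaxf(0.0f, v.w);
        y[i] = v;
    }
}
```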
Running the activations as separate kernels costs an extra read and an extra write of the entire tensor:
```cuda
// Separate kernels: each one costs an extra global-memory round-trip.
__global__ void relu(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(0.0f, x[i]);
}

__global__ void gelu_exact(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float val = x[i];
        // Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
        y[i] = 0.5f * val * (1.0f + erff(val * 0.7071067811865476f));
    }
}

__global__ void silu(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float val = x[i];
        y[i] = val / (1.0f + expf(-val)); // x * sigmoid(x)
    }
}
```
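For reference, a minimal host-side launch of one of these standalone kernels might look like the following sketch (`d_x` and `d_y` are assumed to be device buffers of `n` floats; the helper name is illustrative):

```cuda
// Hypothetical helper: launches the standalone GELU kernel, which must read
// d_x and write d_y through global memory one extra time.
void apply_gelu(const float* d_x, float* d_y, int n, cudaStream_t stream) {
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    gelu_exact<<<blocks, threads, 0, stream>>>(d_x, d_y, n);
}
```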
Fusing the activation into the GEMM epilogue eliminates that extra memory round-trip: the activation is applied while each result is still in registers, just before the single write to C.

```cuda
#include <type_traits>

// Fused in the epilogue of a GEMM kernel: apply the activation during the
// output write instead of in a second pass.
__device__ float gelu_approx(float x) {
    // Fast approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    const float c = 0.7978845608f; // sqrt(2/pi)
    const float k = 0.044715f;
    float x3 = x * x * x;
    return 0.5f * x * (1.0f + tanhf(c * (x + k * x3)));
}

__device__ float silu(float x) {
    return x / (1.0f + expf(-x)); // x * sigmoid(x)
}

// Tag types used to select the activation at compile time (CUTLASS provides
// its own epilogue functors; these are stand-ins for a custom GEMM).
struct GELU {};
struct SiLU {};

// In CUTLASS or a custom GEMM epilogue. Here: one thread per output element,
// with the activation fused into the final store.
template <typename Activation>
__global__ void gemm_with_activation(const float* A, const float* B, float* C,
                                     int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;

    // Naive GEMM computation (a real kernel would tile through shared memory).
    float result = 0.0f;
    for (int k = 0; k < K; ++k) {
        result += A[row * K + k] * B[k * N + col];
    }

    // Fused activation in the output write - no extra pass over C.
    if constexpr (std::is_same_v<Activation, GELU>) {
        result = gelu_approx(result);
    } else if constexpr (std::is_same_v<Activation, SiLU>) {
        result = silu(result);
    }
    C[row * N + col] = result;
}
```
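A sketch of a host-side launch for the fused kernel (the helper name and block shape are illustrative):

```cuda
// Hypothetical helper: one thread per element of C (M x N), GELU fused into
// the store so C is written exactly once.
void launch_gemm_gelu(const float* d_A, const float* d_B, float* d_C,
                      int M, int N, int K) {
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    gemm_with_activation<GELU><<<grid, block>>>(d_A, d_B, d_C, M, N, K);
}
```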
| Comparison | Separate kernels | Fused epilogue | Improvement |
|---|---|---|---|
| Relative throughput | 1x | 1.3-1.5x | Reduced memory traffic |

ReLU is the cheapest activation (a single compare). GELU is slower because of the erf (exact form) or tanh (approximate form) evaluation, and SiLU pays for an exp. The approximate GELU is roughly 2x faster than the exact version while remaining numerically close enough for training.
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.