Tanh is the zero-centered activation used in RNNs, LSTMs, and some attention mechanisms. CUDA provides an optimized tanhf() intrinsic that handles the numerical edge cases for you.

tanhf() is highly optimized in the CUDA math library:
```cuda
__global__ void tanh_vec4(float4* x, float4* y, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float4 v = x[idx];
        y[idx] = make_float4(tanhf(v.x), tanhf(v.y), tanhf(v.z), tanhf(v.w));
    }
}
```
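A minimal launch sketch for the vectorized kernel above. The names (`d_x`, `d_y`, `total_elems`) are illustrative, and it assumes the element count is a multiple of 4 and the buffers come from cudaMalloc (so they are 16-byte aligned):

```cuda
// Hypothetical host-side launcher; handle the tail with a scalar kernel
// or pad the allocation if total_elems is not a multiple of 4.
void launch_tanh_vec4(float* d_x, float* d_y, int total_elems) {
    int n4 = total_elems / 4;            // number of float4 elements
    int block = 256;
    int grid = (n4 + block - 1) / block; // ceil-divide
    tanh_vec4<<<grid, block>>>(
        reinterpret_cast<float4*>(d_x),
        reinterpret_cast<float4*>(d_y),
        n4);
}
```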
A manual implementation needs two expf() calls:

```cuda
__device__ float tanh_manual(float x) {
    float ep = expf(x), em = expf(-x);  // two expf() calls per element
    return (ep - em) / (ep + em);       // overflows to NaN for large |x|
}
```
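To make the edge-case difference concrete, here is a standalone sanity check (a hypothetical test, not from the benchmark below): for large |x|, expf(x) overflows to infinity, so the two-expf form evaluates inf/inf = NaN, while tanhf() saturates cleanly to ±1.

```cuda
#include <cstdio>

// Hypothetical comparison kernel: reproduces the two-expf math inline.
__global__ void compare_tanh(float x) {
    float ep = expf(x), em = expf(-x);     // expf(100.f) overflows to +inf
    float manual = (ep - em) / (ep + em);  // inf / inf -> NaN
    printf("manual = %f, tanhf = %f\n", manual, tanhf(x));
}

int main() {
    compare_tanh<<<1, 1>>>(100.0f);  // expect: manual = nan, tanhf = 1.000000
    cudaDeviceSynchronize();
    return 0;
}
```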
Fuse sigmoid and tanh for LSTM gates:

```cuda
// In an LSTM, the i, f, o gates use sigmoid and the g (cell candidate) gate
// uses tanh. Fusing all four activations into one kernel avoids extra memory
// round-trips. sigmoid_stable is assumed to be a numerically stable
// __device__ sigmoid helper.
__global__ void lstm_gates_fused(float4* gates, float4* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float4 g = gates[idx];
        output[idx] = make_float4(
            sigmoid_stable(g.x),  // input gate (i)
            sigmoid_stable(g.y),  // forget gate (f)
            sigmoid_stable(g.z),  // output gate (o)
            tanhf(g.w)            // cell/candidate gate (g)
        );
    }
}
```

| Metric | Manual (two expf) | tanhf() intrinsic | Improvement |
|---|---|---|---|
| Throughput | 380 GB/s | 720 GB/s | 89% faster |
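For context on why the gates are packed as i, f, o, g in one float4: once the fused kernel has applied the activations, the standard LSTM update consumes them elementwise. A minimal sketch under that assumption (the kernel name and buffer layout are illustrative, not part of the benchmark):

```cuda
// Hypothetical follow-up kernel: consumes the activated gates produced by
// lstm_gates_fused. Per element: c_t = f * c_prev + i * g, h_t = o * tanh(c_t).
__global__ void lstm_cell_update(const float4* act,   // (i, f, o, g) per element
                                 const float* c_prev,
                                 float* c_out,
                                 float* h_out,
                                 int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float4 a = act[idx];
        float c = a.y * c_prev[idx] + a.x * a.w;  // f * c_{t-1} + i * g
        c_out[idx] = c;
        h_out[idx] = a.z * tanhf(c);              // o * tanh(c_t)
    }
}
```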
Tanh is still widely used in LSTMs, GRUs, and some attention layers. For feedforward networks, ReLU and GELU are more common because of their gradient properties.
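The gradient point can be made concrete: since d/dx tanh(x) = 1 - tanh(x)^2, the gradient vanishes as |x| grows (saturation), while ReLU keeps a gradient of 1 for positive inputs. A hedged sketch of a backward-pass kernel that reuses the forward output y = tanh(x), so the gradient is simply 1 - y^2 (names are illustrative):

```cuda
// Illustrative backward kernel (not from the original post): given the forward
// output y = tanh(x) and the upstream gradient dy, compute dx = dy * (1 - y*y).
__global__ void tanh_backward(const float* y, const float* dy, float* dx, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float yi = y[idx];
        dx[idx] = dy[idx] * (1.0f - yi * yi);  // ~0 once |tanh(x)| is near 1
    }
}
```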
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.