NVIDIA Apex provides PyTorch extensions for mixed precision training, distributed training utilities, and fused optimizers. While PyTorch now has native AMP, Apex still offers fused optimizers and fused normalization kernels that remain useful alongside it.
CUDA Integration: Apex provides fused CUDA kernels that combine multiple operations into single kernels, reducing memory bandwidth and kernel launch overhead. These are particularly effective for normalization and optimizer steps.
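To make the idea of fusion concrete, the sketch below is a CPU-only, pure-Python analogy, not Apex code. It applies one Adam step first as a chain of separate elementwise passes, each producing an intermediate list much as separate kernels write intermediate tensors to memory, and then as a single pass that does all the arithmetic per element. The function names and hyperparameter defaults are illustrative assumptions.

```python
import math

def adam_step_unfused(p, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, t=1):
    """One Adam step as separate elementwise passes (one 'kernel' each)."""
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]        # pass 1
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]   # pass 2
    m_hat = [mi / (1 - b1 ** t) for mi in m]                     # pass 3
    v_hat = [vi / (1 - b2 ** t) for vi in v]                     # pass 4
    p = [pi - lr * mh / (math.sqrt(vh) + eps)
         for pi, mh, vh in zip(p, m_hat, v_hat)]                 # pass 5
    return p, m, v

def adam_step_fused(p, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, t=1):
    """Same update in a single pass: no intermediate lists are built."""
    new_p, new_m, new_v = [], [], []
    for pi, gi, mi, vi in zip(p, g, m, v):
        mi = b1 * mi + (1 - b1) * gi
        vi = b2 * vi + (1 - b2) * gi * gi
        m_hat = mi / (1 - b1 ** t)
        v_hat = vi / (1 - b2 ** t)
        new_p.append(pi - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_p, new_m, new_v

params = [0.5, -1.0, 2.0]
grads = [0.1, -0.2, 0.3]
zeros = [0.0, 0.0, 0.0]
unfused = adam_step_unfused(params, grads, zeros, zeros)
fused = adam_step_fused(params, grads, zeros, zeros)
```

Both versions produce the same parameters and optimizer state; the fused form simply touches each element once instead of five times, which is the memory-traffic saving a fused CUDA kernel aims for.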
Build from source for all features.
git clone https://github.com/NVIDIA/apex.git
cd apex
# Install with CUDA extensions
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
--config-settings "--build-option=--cpp_ext" \
--config-settings "--build-option=--cuda_ext" ./
Using fused Adam for faster training.
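import torch.nn as nn
from apex.optimizers import FusedAdam, FusedLAMB
from apex.normalization import FusedLayerNorm
model = MyModel()  # MyModel stands in for your own nn.Module
# Replace every nn.LayerNorm with FusedLayerNorm, keeping the learned weights
for parent in list(model.modules()):
    for name, child in list(parent.named_children()):
        if isinstance(child, nn.LayerNorm):
            fused = FusedLayerNorm(child.normalized_shape, eps=child.eps,
                                   elementwise_affine=child.elementwise_affine)
            fused.load_state_dict(child.state_dict())
            setattr(parent, name, fused)
model = model.cuda()  # the fused kernels operate on CUDA tensors
# Use fused optimizer
optimizer = FusedAdam(model.parameters(), lr=1e-4)
Efficient parameter updates.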
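import torch  # import torch before amp_C so the extension can find its libraries
import amp_C
from apex.multi_tensor_apply import multi_tensor_applier
# Gradients to clip (assumes they are CUDA tensors that share one dtype)
grads = [p.grad for p in model.parameters() if p.grad is not None]
# Integer flag buffer that the multi-tensor kernels require
overflow_buf = torch.zeros(1, dtype=torch.int, device="cuda")
# Multi-tensor L2 norm over all gradients (for gradient clipping)
max_grad_norm = 1.0
total_norm, _ = multi_tensor_applier(
    amp_C.multi_tensor_l2norm,
    overflow_buf,
    [grads],
    False  # do not compute per-tensor norms
)
# Scale all gradients at once, in place (input and output lists are the same)
clip_coef = max_grad_norm / (total_norm.item() + 1e-6)
if clip_coef < 1:
    multi_tensor_applier(
        amp_C.multi_tensor_scale,
        overflow_buf,
        [grads, grads],
        clip_coef
    )
5-10% training speedup.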
Faster for transformer models.
Batch parameter updates.
Proper normalization across GPUs.
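As a plain-Python cross-check of what the multi-tensor clipping example above computes, the sketch below takes the L2 norm over all gradients together and rescales them only when that norm exceeds the limit. It uses nested lists in place of tensors, and `clip_by_global_norm` is a hypothetical helper name, not an Apex function.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale every gradient by one shared coefficient if the global norm is too large.

    grads: list of lists of floats, one inner list per parameter.
    Returns (possibly rescaled grads, total norm before clipping).
    """
    total_norm = math.sqrt(sum(x * x for g in grads for x in g))
    clip_coef = max_norm / (total_norm + eps)
    if clip_coef < 1:
        grads = [[x * clip_coef for x in g] for g in grads]
    return grads, total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]  # global L2 norm is 5.0
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
```

The multi-tensor version does the same arithmetic, but computes the norm and applies the scale across all tensors in a small number of kernel launches rather than one launch per parameter.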
| Component | Speedup | Notes |
|---|---|---|
| FusedAdam | 5-15% faster | vs torch.optim.Adam |
| FusedLayerNorm | 10-20% faster | vs nn.LayerNorm |
| Multi-tensor ops | 20-40% faster | For many parameters |
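Speedups like those in the table depend on model size, GPU, and software versions, so it is worth measuring on your own workload. The helper below is a minimal timing sketch that uses only the standard library; `time_steps`, `speedup`, and the `sync` parameter are hypothetical names. For GPU code, pass `torch.cuda.synchronize` as `sync` so that asynchronous kernels finish before the clock is read.

```python
import time

def time_steps(step_fn, warmup=3, iters=10, sync=None):
    """Return the mean wall-clock seconds per call of step_fn.

    sync: optional zero-argument callable run before each clock read
          (for example torch.cuda.synchronize for CUDA workloads).
    """
    for _ in range(warmup):  # warm-up calls are not timed
        step_fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / iters

def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s

# CPU-only stand-ins for two implementations of a training step
baseline = time_steps(lambda: sum(i * i for i in range(20000)))
candidate = time_steps(lambda: sum(i * i for i in range(10000)))
print(f"speedup: {speedup(baseline, candidate):.2f}x")
```

In practice `step_fn` would wrap one full forward, backward, and optimizer step, once with `torch.optim.Adam` and once with `FusedAdam`, on the same model and batch.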
Use PyTorch's native AMP for basic mixed precision, and Apex for its fused kernels.
Some Apex features are moving into PyTorch core.
For transformers, the fused ops give a measurable speedup.
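One way to act on this split is to prefer Apex's fused optimizer when the package is installed and fall back to PyTorch otherwise. The sketch below only decides which class to use and relies on the standard library, so it runs without either package. `has_module`, `pick_adam`, and `load_class` are hypothetical helpers, and treating an importable `amp_C` module as evidence that the CUDA extensions were built is an assumption.

```python
import importlib
import importlib.util

def has_module(name):
    """True if a top-level module can be located, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False

def pick_adam(apex_ready=None):
    """Return the dotted path of the Adam implementation to use."""
    if apex_ready is None:
        # Assumption: amp_C is importable only when Apex was built with --cuda_ext
        apex_ready = has_module("apex") and has_module("amp_C")
    return "apex.optimizers.FusedAdam" if apex_ready else "torch.optim.Adam"

def load_class(dotted_path):
    """Import and return the class named by a dotted path."""
    module_name, _, class_name = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)

choice = pick_adam()
print("Selected optimizer:", choice)
```

With PyTorch installed, `load_class(pick_adam())(model.parameters(), lr=1e-4)` would then build whichever optimizer is available, since both classes accept the parameters and a learning rate.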