Horovod is a distributed training framework originally developed at Uber that makes multi-GPU and multi-node training straightforward. It uses ring-allreduce for efficient gradient synchronization and supports PyTorch, TensorFlow, and MXNet.
CUDA integration: Horovod uses NCCL for GPU-to-GPU communication, which provides optimized allreduce operations, and it supports NVLink and InfiniBand for high-bandwidth multi-node training.
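To make the allreduce step concrete, here is a minimal sketch that reduces a GPU tensor directly; `hvd.allreduce` averages across workers by default, and on an NCCL build the reduction runs over the GPU interconnect. The tensor contents and the `name` argument are illustrative.

```python
import torch
import horovod.torch as hvd

# Start Horovod and pin this process to one GPU based on its local rank
hvd.init()
torch.cuda.set_device(hvd.local_rank())

# Each worker contributes its own rank; the allreduce (NCCL on GPU builds)
# returns the average across workers, e.g. 1.5 with 4 workers (ranks 0-3).
x = torch.full((3,), float(hvd.rank()), device="cuda")
avg = hvd.allreduce(x, name="example_allreduce")
print(f"rank {hvd.rank()}: {avg}")
```

Run with, for example, `horovodrun -np 4 python script.py` (the script name is arbitrary); every worker prints the same averaged values.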
Install with GPU support.

```bash
# Install NCCL first
conda install -c conda-forge nccl

# Install Horovod with PyTorch
HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[pytorch]

# Verify
horovodrun --check-build
```
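The same check can be done from Python; a small sketch using Horovod's build-introspection helpers (the print formatting is just illustrative):

```python
import horovod.torch as hvd

# Each helper reports whether Horovod was compiled with that backend;
# NCCL support is what enables GPU-to-GPU allreduce.
print("NCCL built:", hvd.nccl_built())
print("MPI built:", hvd.mpi_built())
print("Gloo built:", hvd.gloo_built())
```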
Basic distributed training.

```python
import torch
import torch.nn.functional as F
import horovod.torch as hvd

# Initialize Horovod and pin this process to its GPU
hvd.init()
torch.cuda.set_device(hvd.local_rank())

# Model and optimizer (learning rate scaled by the number of workers)
model = MyModel().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers via allreduce
optimizer = hvd.DistributedOptimizer(optimizer)

# Broadcast initial parameters from rank 0 so all workers start identically
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

# Training loop
for data, target in train_loader:
    data, target = data.cuda(), target.cuda()
    optimizer.zero_grad()
    output = model(data)
    loss = F.cross_entropy(output, target)
    loss.backward()
    optimizer.step()
```
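Two common additions to this loop, shown as a hedged sketch: averaging a validation metric across workers with `hvd.allreduce`, and checkpointing from rank 0 only so workers do not write over each other. `local_val_loss` and the checkpoint path are illustrative; `model` and `optimizer` are the objects from the example above.

```python
import torch
import horovod.torch as hvd

# Average a scalar metric across all workers;
# local_val_loss is an illustrative float computed by each worker.
val_loss = hvd.allreduce(torch.tensor(local_val_loss).cuda(), name="val_loss")

# Only rank 0 touches the filesystem, avoiding concurrent writes
if hvd.rank() == 0:
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
        "checkpoint.pt",  # example path
    )
```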
Launch across multiple machines.

```python
# train.py
import torch
import horovod.torch as hvd
from torch.utils.data import DataLoader

hvd.init()
print(f"Rank {hvd.rank()}/{hvd.size()}, Local {hvd.local_rank()}")

# Partition the dataset so each worker trains on a distinct shard
train_sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank()
)
train_loader = DataLoader(dataset, sampler=train_sampler, batch_size=32)

# Launch with horovodrun (4 processes: 2 per server across 2 servers):
# horovodrun -np 4 -H server1:2,server2:2 python train.py
```

Performance tips:

- NCCL is the fastest backend for NVIDIA GPUs.
- Scale the learning rate: multiply the base LR by the number of workers.
- Use `hvd.Compression.fp16` to compress gradients and save communication bandwidth.
- Tune the tensor fusion threshold (the `HOROVOD_FUSION_THRESHOLD` environment variable) so small gradients are batched into larger allreduce operations.
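A minimal sketch combining these tips, reusing the model/optimizer setup from the basic example; `MyModel` is assumed from above, and the 64 MB fusion threshold is an arbitrary illustration, not a recommendation (the variable is usually exported in the shell before launching):

```python
import os
import torch
import horovod.torch as hvd

# Tensor fusion: batch small gradients into ~64 MB buffers (illustrative value);
# set before hvd.init(), or export it in the environment used by horovodrun.
os.environ.setdefault("HOROVOD_FUSION_THRESHOLD", str(64 * 1024 * 1024))

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = MyModel().cuda()  # MyModel assumed from the earlier example
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())  # LR scaling

# fp16 gradient compression roughly halves the bytes sent during allreduce
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters(),
    compression=hvd.Compression.fp16,
)

hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```

fp16 compression trades a small amount of gradient precision for bandwidth; whether that affects convergence depends on the model.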
| Task | Performance | Notes |
|---|---|---|
| ResNet-50 8 GPU | 7.5x scaling | Single node |
| BERT 32 GPU | 28x scaling | 4 nodes |
| Communication overhead | <10% | With NVLink |
PyTorch DDP is simpler for single-node training; Horovod is generally the better fit for multi-node clusters.
To map processes to GPUs within a node, use `hvd.local_rank()` (this process's index on its node) together with `hvd.local_size()` (the number of processes on that node).
Mixing different GPU models works, but overall throughput is limited by the slowest GPU.
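A small sketch of how the local_* values are typically used: `hvd.local_rank()` selects the GPU within a node, while `hvd.local_size()` can divide per-node resources such as DataLoader worker processes (the division policy below is an illustrative assumption):

```python
import os
import torch
import horovod.torch as hvd

hvd.init()

# local_rank: this process's index among processes on the same node -> pick its GPU
torch.cuda.set_device(hvd.local_rank())

# local_size: number of processes on this node -> share the node's CPU cores
dataloader_workers = max(1, (os.cpu_count() or 1) // hvd.local_size())

print(f"global {hvd.rank()}/{hvd.size()}, "
      f"local {hvd.local_rank()}/{hvd.local_size()}, "
      f"DataLoader workers: {dataloader_workers}")
```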