DeepSpeed is Microsoft's library for training and inference of large deep learning models. It introduces ZeRO (Zero Redundancy Optimizer) for memory-efficient training of models with billions of parameters.
CUDA Integration: DeepSpeed uses CUDA for all GPU operations, NCCL for inter-GPU communication, and custom fused CUDA kernels for performance-critical ops. ZeRO-Offload uses pinned (page-locked) host memory for efficient, asynchronous CPU-GPU transfers.
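To illustrate why pinned memory matters for offload, here is a minimal plain-PyTorch sketch of the transfer pattern (this is the general technique, not DeepSpeed's internal implementation; tensor and stream names are illustrative):

```python
import torch

# Pinned (page-locked) host memory lets the CUDA driver use async DMA copies
# that can overlap with GPU compute; pageable memory forces a staging copy.
cpu_buf = torch.empty(1024, 1024, pin_memory=True)   # page-locked host tensor
gpu_buf = torch.empty(1024, 1024, device="cuda")

copy_stream = torch.cuda.Stream()
with torch.cuda.stream(copy_stream):
    # non_blocking=True only actually overlaps when the host tensor is pinned
    gpu_buf.copy_(cpu_buf, non_blocking=True)

# Make the default stream wait for the transfer before using gpu_buf
torch.cuda.current_stream().wait_stream(copy_stream)
```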
Install DeepSpeed.

```bash
pip install deepspeed

# Pre-build compatible ops at install time instead of JIT-compiling them at first use
DS_BUILD_OPS=1 pip install deepspeed

# Verify the installation and report which ops are available
ds_report
```
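A quick import check from Python, in addition to `ds_report`:

```python
import deepspeed

print(deepspeed.__version__)  # fails at import if the install is broken
```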
Basic DeepSpeed training setup.

```python
import deepspeed
import torch

model = MyLargeModel()                              # your nn.Module
optimizer = torch.optim.AdamW(model.parameters())

# DeepSpeed config: ZeRO stage 2 with optimizer states offloaded to CPU
ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"}
    }
}

# deepspeed.initialize wraps model and optimizer in an engine that handles
# partitioning, mixed precision, and gradient accumulation
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config
)

for batch in train_loader:
    loss = model_engine(batch)    # forward pass (model is assumed to return the loss)
    model_engine.backward(loss)   # engine-managed backward; scales the fp16 loss
    model_engine.step()           # optimizer step + zero_grad, respects accumulation
```
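Multi-GPU runs go through the `deepspeed` launcher, which spawns one process per GPU (the script name here is illustrative):

```bash
# Launch the training script above on 2 local GPUs
deepspeed --num_gpus=2 train.py
```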
Train models larger than GPU memory.

```python
# ZeRO stage 3 with both optimizer states and parameters offloaded to NVMe
ds_config = {
    "train_batch_size": 64,
    "gradient_accumulation_steps": 16,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme"
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme"
        },
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 5e8,
        "contiguous_gradients": True,
        "reduce_bucket_size": 5e8
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "cpu_checkpointing": True
    }
}
```

Notes on these settings:

- ZeRO stage 3 has more overhead than stage 2; use it only if the model doesn't fit otherwise.
- Activation checkpointing trades compute for memory by recomputing activations during backward (see the sketch below).
- The bucket sizes (`reduce_bucket_size`, `stage3_prefetch_bucket_size`) affect communication efficiency.
- NVMe offload is for models larger than combined GPU + CPU memory.
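Activation checkpointing is the same idea as PyTorch's `torch.utils.checkpoint`: drop intermediate activations in the forward pass and recompute them during backward. A minimal plain-PyTorch sketch of the concept (not DeepSpeed's own checkpointing API; `Block` is illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        # Intermediate activations inside self.ff are not stored; they are
        # recomputed during backward, trading extra compute for memory.
        return checkpoint(self.ff, x, use_reentrant=False)
```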
Performance highlights:

| Task | Reported result | Notes |
|---|---|---|
| GPT-3 175B | 50% memory reduction | ZeRO-3 |
| BERT training | 2x throughput | With optimizations |
| Inference | 7x speedup | DeepSpeed Inference |
Which ZeRO stage should I use? Start with stage 2. Move to stage 3 if the model doesn't fit. Add offload for very large models.

DeepSpeed or FSDP? DeepSpeed has more features (e.g., offload and DeepSpeed Inference); FSDP is PyTorch-native.

Does it work with Hugging Face Transformers? Yes, Transformers has built-in DeepSpeed integration.
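For example, with the Hugging Face Trainer you point `TrainingArguments` at a DeepSpeed config file (the config path, model, and dataset here are placeholders):

```python
from transformers import Trainer, TrainingArguments

# "ds_config.json" is a DeepSpeed config like the ones shown above
args = TrainingArguments(
    output_dir="out",
    deepspeed="ds_config.json",
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()  # run via the deepspeed launcher for multi-GPU training
```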