Megatron-LM is NVIDIA's framework for training large transformer models. It pioneered efficient tensor and pipeline parallelism, enabling training of models with hundreds of billions of parameters.
CUDA Integration: Megatron uses custom CUDA kernels for fused operations, NCCL for tensor parallelism communication, and optimized attention implementations. It's designed specifically for NVIDIA GPUs with NVLink.
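Before training, it is worth confirming that the GPUs are actually connected over NVLink and that PyTorch's bundled NCCL is visible. A quick check, assuming a standard PyTorch CUDA build, could look like this:

```bash
# Show the GPU interconnect topology; NV# entries mean that GPU pair is linked via NVLink
nvidia-smi topo -m

# Report the CUDA and NCCL versions PyTorch was built against
python -c "import torch; print(torch.version.cuda, torch.cuda.nccl.version())"
```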
Clone and set up Megatron-LM:

```bash
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
pip install -r requirements.txt

# Also install apex for the fused kernels; build its C++/CUDA extensions
# (see the apex README for the exact flags matching your pip version)
git clone https://github.com/NVIDIA/apex.git
cd apex && pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
    --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
```
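To sanity-check the build, you can try importing the apex extension modules and Megatron's core package. The module names below are the ones apex and recent Megatron-LM versions have used, so treat this as a rough sketch rather than an exact recipe:

```bash
cd ..   # back to the Megatron-LM repository root

# amp_C and fused_layer_norm_cuda only exist if apex built its CUDA extensions
python -c "import amp_C, fused_layer_norm_cuda; print('apex CUDA extensions OK')"

# Megatron's core package should import when run from the repository root
python -c "from megatron.core import parallel_state; print('megatron.core OK')"
```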
Launch GPT training with tensor parallelism:

```bash
# pretrain_gpt.sh
GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT"

# Dataset and tokenizer paths below are placeholders; point them at your preprocessed data
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    --tensor-model-parallel-size 2 \
    --pipeline-model-parallel-size 2 \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --micro-batch-size 4 \
    --global-batch-size 32 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --train-iters 500000 \
    --lr 0.00015 \
    --data-path my-gpt2_text_document \
    --vocab-file gpt2-vocab.json \
    --merge-file gpt2-merges.txt \
    --fp16
```
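As a quick consistency check on these flags: tensor-parallel size times pipeline-parallel size must divide the GPU count, and the global batch size must be a multiple of the micro-batch size times the resulting data-parallel size. Worked through for the launch above:

```bash
# world_size         = GPUS_PER_NODE * NNODES   = 8 * 1 = 8
# data_parallel_size = world_size / (TP * PP)   = 8 / (2 * 2) = 2
# grad_accum_steps   = global_batch / (micro_batch * data_parallel_size)
echo $(( 32 / (4 * 2) ))   # prints 4: gradient-accumulation steps per iteration
```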
Training across multiple machines:

```bash
# On each node, set MASTER_ADDR to node 0's IP
export MASTER_ADDR=10.0.0.1
export MASTER_PORT=6000
# Node 0 (rank 0-7)
export NODE_RANK=0
# Node 1 (rank 8-15)
export NODE_RANK=1
torchrun \
    --nproc_per_node 8 \
    --nnodes 2 \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    pretrain_gpt.py \
    --tensor-model-parallel-size 4 \
    --pipeline-model-parallel-size 4 \
    --num-layers 96 \
    --hidden-size 12288 \
    --num-attention-heads 96
    # ...plus batch size, sequence length, data, and optimizer arguments as in the single-node example
```
Parallelism and memory options at a glance:

| Strategy | Notes |
|---|---|
| Tensor parallelism | High-bandwidth NVLink required |
| Pipeline parallelism | Lower bandwidth requirement |
| Sequence parallelism | Further memory reduction |
| Activation recomputation | Balances memory and throughput |
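When memory is the limiting factor, sequence parallelism and activation recomputation can be layered on top of tensor and pipeline parallelism. A sketch of the extra flags, using option names from recent Megatron-LM releases (older versions expose `--checkpoint-activations` instead of `--recompute-activations`):

```bash
# Memory-saving options added to the pretrain_gpt.py invocation
# (reusing DISTRIBUTED_ARGS from the single-node script)
torchrun $DISTRIBUTED_ARGS pretrain_gpt.py \
    --tensor-model-parallel-size 4 \
    --pipeline-model-parallel-size 4 \
    --sequence-parallel \
    --recompute-activations \
    --fp16
    # ...plus the model, data, and training arguments shown earlier
```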
Reported performance:

| Benchmark | Result | Notes |
|---|---|---|
| GPT-3 175B training throughput | 502 petaFLOP/s | 3072 A100 GPUs |
| Scaling efficiency | >90% | Up to thousands of GPUs |
| Memory efficiency | ~3x reduction | With memory optimizations enabled |
Megatron-LM with DeepSpeed: The two can be combined; Megatron supplies tensor and pipeline parallelism while DeepSpeed adds ZeRO memory sharding (see the sketch below).
Minimum GPU count: At least 2 GPUs for tensor parallelism; 4 or more are recommended.
Consumer GPUs: Possible but inefficient; the framework is designed and tuned for datacenter GPUs such as the A100 and H100.
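For the Megatron-plus-DeepSpeed combination mentioned above, the usual pattern (for example in the separate Megatron-DeepSpeed fork) is to keep Megatron's parallelism flags and add a DeepSpeed JSON config that enables ZeRO. A minimal hypothetical sketch:

```bash
# Hypothetical ds_config.json; key names follow DeepSpeed's standard config schema
cat > ds_config.json <<'EOF'
{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 4,
  "zero_optimization": { "stage": 1 },
  "fp16": { "enabled": true }
}
EOF
# The training script is then launched with DeepSpeed's standard arguments,
# typically --deepspeed and --deepspeed_config ds_config.json
```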
Optimize your Megatron CUDA code with RightNow AI - get real-time performance suggestions and memory analysis.