DeepSpeed is Microsoft's library for training and inference of large deep learning models. It introduces ZeRO (Zero Redundancy Optimizer) for memory-efficient training of models with billions of parameters.
CUDA Integration: DeepSpeed uses CUDA for all GPU operations, NCCL for inter-GPU communication, and custom fused CUDA kernels for performance-critical ops. ZeRO-Offload uses pinned (page-locked) host memory for efficient, asynchronous CPU-GPU transfers.
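To illustrate why pinned memory matters for offload, here is a minimal plain-PyTorch sketch of the transfer pattern (this is the general technique, not DeepSpeed's internal implementation; tensor and stream names are illustrative):

```python
import torch

# Pinned (page-locked) host memory lets the CUDA driver use async DMA copies
# that can overlap with GPU compute; pageable memory forces a staging copy.
cpu_buf = torch.empty(1024, 1024, pin_memory=True)   # page-locked host tensor
gpu_buf = torch.empty(1024, 1024, device="cuda")

copy_stream = torch.cuda.Stream()
with torch.cuda.stream(copy_stream):
    # non_blocking=True only actually overlaps when the host tensor is pinned
    gpu_buf.copy_(cpu_buf, non_blocking=True)

# Make the default stream wait for the transfer before using gpu_buf
torch.cuda.current_stream().wait_stream(copy_stream)
```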
Install DeepSpeed.

```bash
pip install deepspeed

# Pre-build compatible ops at install time instead of JIT-compiling them at first use
DS_BUILD_OPS=1 pip install deepspeed

# Verify the installation and report which ops are available
ds_report
```
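A quick import check from Python, in addition to `ds_report`:

```python
import deepspeed

print(deepspeed.__version__)  # fails at import if the install is broken
```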
Basic DeepSpeed training setup.

```python
import deepspeed
import torch

model = MyLargeModel()                              # your nn.Module
optimizer = torch.optim.AdamW(model.parameters())

# DeepSpeed config: ZeRO stage 2 with optimizer states offloaded to CPU
ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"}
    }
}

# deepspeed.initialize wraps model and optimizer in an engine that handles
# partitioning, mixed precision, and gradient accumulation
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config
)

for batch in train_loader:
    loss = model_engine(batch)    # forward pass (model is assumed to return the loss)
    model_engine.backward(loss)   # engine-managed backward; scales the fp16 loss
    model_engine.step()           # optimizer step + zero_grad, respects accumulation
```
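Multi-GPU runs go through the `deepspeed` launcher, which spawns one process per GPU (the script name here is illustrative):

```bash
# Launch the training script above on 2 local GPUs
deepspeed --num_gpus=2 train.py
```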
Train models larger than GPU memory.

```python
# ZeRO stage 3 with both optimizer states and parameters offloaded to NVMe
ds_config = {
    "train_batch_size": 64,
    "gradient_accumulation_steps": 16,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme"
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme"
        },
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 5e8,
        "contiguous_gradients": True,
        "reduce_bucket_size": 5e8
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "cpu_checkpointing": True
    }
}
```

Notes on these settings:

- ZeRO stage 3 has more overhead than stage 2; use it only if the model doesn't fit otherwise.
- Activation checkpointing trades compute for memory by recomputing activations during backward (see the sketch below).
- The bucket sizes (`reduce_bucket_size`, `stage3_prefetch_bucket_size`) affect communication efficiency.
- NVMe offload is for models larger than combined GPU + CPU memory.
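Activation checkpointing is the same idea as PyTorch's `torch.utils.checkpoint`: drop intermediate activations in the forward pass and recompute them during backward. A minimal plain-PyTorch sketch of the concept (not DeepSpeed's own checkpointing API; `Block` is illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        # Intermediate activations inside self.ff are not stored; they are
        # recomputed during backward, trading extra compute for memory.
        return checkpoint(self.ff, x, use_reentrant=False)
```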
Performance highlights:

| Task | Reported result | Notes |
|---|---|---|
| GPT-3 175B | 50% memory reduction | ZeRO-3 |
| BERT training | 2x throughput | With optimizations |
| Inference | 7x speedup | DeepSpeed Inference |
Which ZeRO stage should I use? Start with stage 2. Move to stage 3 if the model doesn't fit. Add offload for very large models.

DeepSpeed or FSDP? DeepSpeed has more features (e.g., offload and DeepSpeed Inference); FSDP is PyTorch-native.

Does it work with Hugging Face Transformers? Yes, Transformers has built-in DeepSpeed integration.
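For example, with the Hugging Face Trainer you point `TrainingArguments` at a DeepSpeed config file (the config path, model, and dataset here are placeholders):

```python
from transformers import Trainer, TrainingArguments

# "ds_config.json" is a DeepSpeed config like the ones shown above
args = TrainingArguments(
    output_dir="out",
    deepspeed="ds_config.json",
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()  # run via the deepspeed launcher for multi-GPU training
```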