╭────────────╮
│ MULTI-GPU  │
├────────────┤
│ [0]··[1]   │
│  │····│    │
│ [2]··[3]   │
│ ▸ scaling  │
╰────────────╯
Profile and optimize across multiple GPUs. Compare kernel performance side-by-side, analyze NVLink communication, and identify load balancing issues before they become production problems.
Run the same kernel on A100 and H100 simultaneously. See exactly which hardware is better for your workload.
Multi-GPU code often has load imbalance. See per-GPU utilization and identify which device is the bottleneck.
NVLink topology matters. Understand inter-GPU bandwidth and optimize data placement for your specific system.
Run benchmarks across multiple GPUs and see results side-by-side. Make informed hardware decisions based on your actual workload.
matmul_kernel [2048x2048]
GPU 0: RTX 4090
├─ Time: 4.2ms
├─ Bandwidth: 892 GB/s
└─ SM Util: 94%
GPU 1: H100
├─ Time: 1.8ms (2.3x faster)
├─ Bandwidth: 2.1 TB/s
└─ SM Util: 89%
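The "2.3x faster" figure in the comparison above is the ratio of the two kernel times. A minimal sketch of that arithmetic, assuming you have exported the per-GPU timings yourself (the `speedup` helper and the `timings_ms` dictionary are illustrative names, not part of the product):

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if baseline_ms <= 0 or candidate_ms <= 0:
        raise ValueError("kernel times must be positive")
    return baseline_ms / candidate_ms

# Kernel times taken from the sample comparison output (milliseconds).
timings_ms = {"RTX 4090": 4.2, "H100": 1.8}

ratio = speedup(timings_ms["RTX 4090"], timings_ms["H100"])
print(f"H100 is {ratio:.1f}x faster than RTX 4090")
```

A ratio above 1 means the second device finished sooner. Comparing times for the same kernel and the same problem size keeps the ratio meaningful.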
Visualize your NVLink topology and measure actual peer-to-peer bandwidth. Understand where data placement matters for your multi-GPU workloads.
NVLink Topology
┌─────────┐  NVLink 4  ┌─────────┐
│  GPU 0  │◀──────────▶│  GPU 1  │
│  H100   │  900 GB/s  │  H100   │
└────┬────┘            └────┬────┘
     │                      │
     │       NVLink 4       │
     └──────────────────────┘
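To relate a measured peer-to-peer copy to the link figure in the diagram, divide the bytes moved by the elapsed time and compare the result with the nominal bandwidth. The sketch below assumes a hypothetical measurement (2 GB copied in 5 ms); the helper name and the sample numbers are illustrative, not product output:

```python
def achieved_bandwidth_gbs(bytes_copied: int, seconds: float) -> float:
    """Convert a timed copy into GB/s (1 GB = 1e9 bytes)."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return bytes_copied / seconds / 1e9

NOMINAL_GBS = 900.0  # link figure shown in the topology diagram

# Hypothetical measurement: 2 GB copied between GPU 0 and GPU 1 in 5 ms.
measured_gbs = achieved_bandwidth_gbs(2 * 10**9, 0.005)
efficiency = measured_gbs / NOMINAL_GBS

print(f"measured {measured_gbs:.0f} GB/s, {efficiency:.0%} of nominal")
```

A low fraction of nominal bandwidth can suggest that transfers are too small to saturate the link or that data is taking a slower path than expected.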
Monitor per-GPU utilization and identify imbalances. See which GPU is waiting and get suggestions for better work distribution.
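One simple way to quantify imbalance is to compare the busiest GPU with the average across all devices. The sketch below is one possible metric under that assumption, not necessarily the one the profiler reports; the per-GPU busy times are hypothetical sample values:

```python
from statistics import mean

def load_imbalance(busy_ms: dict[int, float]) -> tuple[int, float]:
    """Return (bottleneck device id, imbalance), where imbalance is
    how far the busiest GPU sits above the mean busy time (0.0 = balanced)."""
    if not busy_ms:
        raise ValueError("need at least one GPU")
    average = mean(busy_ms.values())
    bottleneck = max(busy_ms, key=busy_ms.get)
    return bottleneck, busy_ms[bottleneck] / average - 1.0

# Hypothetical busy time per GPU for one training step (milliseconds).
busy_ms = {0: 10.0, 1: 10.0, 2: 16.0, 3: 12.0}

device, imbalance = load_imbalance(busy_ms)
print(f"GPU {device} is the bottleneck, {imbalance:.0%} above the mean")
```

Because every other device waits for the slowest one at each synchronization point, shifting work away from the bottleneck GPU until the imbalance approaches zero is usually the first thing to try.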
Multi-GPU profiling for up to 6 GPUs in Pro. Enterprise supports 100+.