Profile and optimize CUDA applications across multiple GPUs with cross-device performance analysis and load balancing insights.
Pro Feature: Multi-GPU profiling requires RightNow Pro or Mega tier. Free tier supports single GPU profiling only.
Multi-GPU Profiling Capabilities
Profile CUDA kernels across multiple NVIDIA GPUs simultaneously, compare performance characteristics, and identify load balancing opportunities.
Cross-GPU Comparison
Compare kernel performance across different GPU models (RTX 3090 vs RTX 4090, A100 vs H100)
Load Balancing Analysis
Identify workload imbalances and optimize GPU utilization across devices
Scaling Efficiency
Measure multi-GPU speedup and identify scaling bottlenecks (PCIe, NVLink, memory)
GPU Selection and Filtering
RightNow AI provides an intuitive GPU selector dropdown that detects all available NVIDIA GPUs and allows per-device profiling control.
Automatic GPU Detection
- Detects all CUDA-capable GPUs via nvidia-smi and the CUDA runtime API
- Displays GPU name, architecture (Ampere, Ada Lovelace, Hopper), and compute capability
- Shows available memory, SM count, and CUDA driver version
- Auto-initializes all detected GPUs for profiling readiness
Profiling Modes
Single GPU Mode
Profile kernels on a specific GPU device
- Select target GPU from dropdown
- All profiling operations target selected device
- Ideal for testing architecture-specific optimizations
Multi-GPU Mode
Profile the same kernel across all available GPUs
- Enable "Profile All GPUs" option
- Parallel profiling on multiple devices
- Side-by-side performance comparison
Cross-GPU Performance Metrics
Compare comprehensive performance metrics across different GPU architectures:
Per-Device Metrics
- Execution Time: Kernel runtime on each GPU
- SM Efficiency: Per-device streaming multiprocessor utilization
- Memory Bandwidth: Achieved vs theoretical bandwidth by GPU
- Occupancy: Active warps percentage per SM architecture
- Power Draw: Real-time power consumption per GPU
- Temperature: Thermal monitoring for each device
Comparative Metrics
- Relative Speedup: Performance ratio between GPUs
- Architecture Efficiency Gap: Ampere vs Ada vs Hopper comparison
- Memory Bottleneck Analysis: Identify memory-bound GPUs
- Scaling Efficiency: Multi-GPU speedup vs linear scaling
- Load Imbalance Factor: Workload distribution variance
- PCIe/NVLink Overhead: Inter-GPU communication costs
Visual Comparison: Multi-GPU profiling results display side-by-side charts showing performance deltas and bottleneck identification.
Multi-GPU Profiling Workflow
Step 1: Configure Target GPUs
- Open GPU selector dropdown in profiling panel
- Select "Profile All GPUs" or choose specific devices
- Verify all target GPUs are initialized (green status indicator)
Step 2: Execute Multi-GPU Profiling
- Click the gutter play button or use the Profile Kernel command
- RightNow AI orchestrates parallel profiling across selected GPUs
- Each GPU executes independent Nsight Compute profiling session
- Results aggregated and stored per-device with isolated state
Step 3: Analyze Comparative Results
Multi-GPU profiling panel displays:
- Performance Delta Table: Side-by-side metric comparison
- Speedup Chart: Relative performance visualization
- Bottleneck Heatmap: Color-coded efficiency indicators per GPU
- Architecture Recommendations: AI-generated optimization suggestions per device
Example: Multi-GPU Performance Comparison
Profiling the same matrix multiplication kernel on RTX 3090 (Ampere) vs RTX 4090 (Ada Lovelace):
__global__ void matmul_kernel(float* C, const float* A, const float* B, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++) {
            sum += A[row * N + k] * B[k * N + col];
        }
        C[row * N + col] = sum;
    }
}
// Multi-GPU profiling automatically compares performance:
//
// GPU 0: RTX 3090 (Ampere, sm_86)
// - Execution Time: 3.42 ms
// - SM Efficiency: 68%
// - Memory Bandwidth: 712 GB/s (81% of peak 880 GB/s)
// - Occupancy: 75%
//
// GPU 1: RTX 4090 (Ada Lovelace, sm_89)
// - Execution Time: 2.18 ms (1.57x faster)
// - SM Efficiency: 79%
// - Memory Bandwidth: 846 GB/s (84% of peak 1008 GB/s)
// - Occupancy: 82%
//
// AI Analysis: RTX 4090 achieves 57% speedup due to:
// - 56% more CUDA cores (16384 vs 10496)
// - 15% higher memory bandwidth (1008 GB/s vs 880 GB/s)
// - Improved L2 cache architecture (72MB vs 6MB)
//
// Recommendation: Kernel is memory-bound. Consider using shared memory
// tiling to reduce global memory accesses and improve cache utilization.
Load Balancing Analysis
Multi-GPU profiling identifies workload distribution inefficiencies and suggests load balancing strategies.
Load Imbalance Detection
- Measures execution time variance across GPUs
- Calculates the load imbalance factor: max(time) / avg(time) - 1
- Identifies underutilized GPUs and workload bottlenecks
- Suggests dynamic work distribution strategies
Multi-GPU Scaling Efficiency
- Measures speedup: T(1 GPU) / T(N GPUs)
- Compares against ideal linear scaling (Nx speedup)
- Identifies PCIe/NVLink communication overhead
- Suggests peer-to-peer memory access optimizations
Multi-GPU Data Storage
Profiling data is stored separately per GPU to maintain state isolation and enable historical comparisons.
- Per-GPU Storage: Each GPU has isolated profiling data storage in .rightnow/profiling/gpu-{deviceId}/
- Configuration-Aware: Results stored separately per build configuration (debug/release)
- Content-Based Keying: Profiling data keyed by kernel content hash + GPU ID
- Session Persistence: Multi-GPU profiling sessions persist across editor restarts
- Cross-Session Comparison: Compare profiling results from different sessions
Data Integrity: Multi-GPU storage manager prevents cross-contamination between GPU profiling sessions and validates data integrity on read.
Best Practices
System Requirements
- Requires RightNow Pro or Mega subscription tier
- All GPUs must have CUDA compute capability 3.5 or higher
- NVIDIA drivers must be identical across all GPUs
- Sufficient PCIe lanes for parallel profiling (x8 minimum per GPU)
Performance Considerations
- Parallel profiling may temporarily increase system load
- NCU profiling with hardware counters requires elevated permissions on all GPUs
- Profiling time scales linearly with the number of selected GPUs
- Consider profiling a subset of GPUs for faster iteration