╭────────────╮
│  PROFILER  │
├────────────┤
│ SM  ▅ 94%  │
│ MEM ▇ 87%  │
│ BW  █ 72%  │
│ ▸ profiling│
╰────────────╯
Profile kernels with NVIDIA Nsight Compute and benchmark across different configurations. See exactly how your kernels perform with live metrics—memory bandwidth, SM occupancy, warp efficiency—and compare results across grid sizes, block dimensions, and more.
Is your kernel memory-bound or compute-bound? The profiler tells you exactly where time is spent so you know what to optimize.
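One common way to reason about the memory-bound versus compute-bound question is the roofline model. The sketch below is an illustration of that model, not a description of how the profiler itself decides. The `DevicePeaks` struct, the function names, and any peak numbers you pass in are assumptions made for the example.

```cpp
#include <cassert>
#include <string>

// Hypothetical device description. The peak values are supplied by the
// caller and are placeholders, not measurements of any particular GPU.
struct DevicePeaks {
    double peak_gflops;    // peak compute throughput in GFLOP/s
    double peak_gbytes_s;  // peak memory bandwidth in GB/s
};

// Arithmetic intensity: floating-point operations per byte of memory traffic.
double arithmetic_intensity(double flops, double bytes) {
    assert(bytes > 0.0);
    return flops / bytes;
}

// Ridge point of the roofline: the intensity (FLOP/byte) at which the
// bandwidth ceiling meets the compute ceiling.
double ridge_point(const DevicePeaks& d) {
    assert(d.peak_gbytes_s > 0.0);
    return d.peak_gflops / d.peak_gbytes_s;
}

// Below the ridge point the kernel is limited by memory bandwidth;
// at or above it, the kernel is limited by compute throughput.
std::string classify(double flops, double bytes, const DevicePeaks& d) {
    return arithmetic_intensity(flops, bytes) < ridge_point(d)
               ? "memory-bound"
               : "compute-bound";
}
```

For example, a float vector add performs 1 FLOP per 12 bytes moved (two 4-byte loads and one 4-byte store), so on a hypothetical device with a ridge point of 20 FLOP/byte, `classify(1.0, 12.0, d)` returns `"memory-bound"`.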
See performance data right next to your code. No context switching between profiler output and your editor.
Compare performance across runs. See if your optimizations actually helped or if you need to try a different approach.
The profiler collects over 50 hardware counters from your GPU, organized into categories that help you understand different aspects of kernel performance.
Compute
├─ SM Occupancy      achieved vs theoretical
├─ Warp Efficiency   active threads per warp
└─ IPC               instructions per cycle

Memory
├─ Bandwidth         GB/s achieved
├─ L1 Hit Rate       cache efficiency
├─ L2 Hit Rate       cache efficiency
└─ Coalescing        memory access pattern

Execution
├─ Divergence        branch efficiency
└─ Stalls            pipeline stall reasons
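To make a few of these metrics concrete, here is a minimal sketch of how such ratios are commonly derived from raw counts. The function names are hypothetical and the formulas are the textbook definitions. The exact counters and formulas the profiler uses are not stated here, so treat this as an approximation.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;  // threads per warp on NVIDIA GPUs

// Achieved bandwidth in GB/s, from bytes moved and elapsed time in milliseconds.
double achieved_bandwidth_gbs(double bytes, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return (bytes / 1e9) / (elapsed_ms / 1e3);
}

// Warp efficiency: average number of active threads per warp,
// expressed as a fraction of the warp size.
double warp_efficiency(double avg_active_threads_per_warp) {
    assert(avg_active_threads_per_warp >= 0.0 &&
           avg_active_threads_per_warp <= kWarpSize);
    return avg_active_threads_per_warp / kWarpSize;
}

// Cache hit rate: hits divided by total accesses (hits plus misses).
double hit_rate(long long hits, long long misses) {
    assert(hits >= 0 && misses >= 0 && hits + misses > 0);
    return static_cast<double>(hits) / static_cast<double>(hits + misses);
}

// IPC: instructions executed divided by elapsed clock cycles.
double ipc(long long instructions, long long cycles) {
    assert(instructions >= 0 && cycles > 0);
    return static_cast<double>(instructions) / static_cast<double>(cycles);
}
```

A warp that averages 16 active threads gives `warp_efficiency(16.0) == 0.5`, which usually points at branch divergence or a partially filled last warp.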
Seamless integration with NVIDIA Nsight Compute. Profile directly from the editor with one click—results appear inline with your code.
__global__ void matmul(float* A, float* B, float* C) {
    // ▸ MEM: 89% | SM: 45% | L2: 0.3ms
    // ⚠ Low SM occupancy - increase block size
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    ...
}
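The "increase block size" hint can be understood with a rough occupancy estimate. The sketch below computes an upper bound on theoretical occupancy from block dimensions alone. The `SmLimits` struct and its values are hypothetical, and the estimate deliberately ignores register and shared-memory limits, so it will not reproduce the occupancy numbers a real profiler reports.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-SM limits. Real values depend on the GPU architecture.
struct SmLimits {
    int max_threads_per_sm;  // maximum resident threads per SM
    int max_blocks_per_sm;   // maximum resident blocks per SM
    int warp_size;           // threads per warp
};

// Upper bound on occupancy (resident warps / maximum warps per SM),
// considering only the thread-count and block-count limits.
double theoretical_occupancy(int block_x, int block_y, const SmLimits& sm) {
    const int threads_per_block = block_x * block_y;
    assert(threads_per_block > 0 &&
           threads_per_block <= sm.max_threads_per_sm);

    // Partial warps still occupy a full warp slot, so round up.
    const int warps_per_block =
        (threads_per_block + sm.warp_size - 1) / sm.warp_size;
    const int max_warps_per_sm = sm.max_threads_per_sm / sm.warp_size;

    // Resident blocks are capped by both the warp budget and the block limit.
    const int blocks_by_warps = max_warps_per_sm / warps_per_block;
    const int resident_blocks =
        std::min(blocks_by_warps, sm.max_blocks_per_sm);

    return static_cast<double>(resident_blocks * warps_per_block) /
           max_warps_per_sm;
}
```

With assumed limits of 2048 threads and 16 blocks per SM, an 8x8 block tops out at 0.5 because the block limit is reached before the SM is full, while a 16x16 block can reach 1.0. In real CUDA code, the runtime function `cudaOccupancyMaxActiveBlocksPerMultiprocessor` gives the resident block count for a specific kernel, including register and shared-memory usage.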
Benchmark your kernels across different configurations to find the optimal setup. Compare grid sizes, block dimensions, and launch parameters—see which combination delivers the best performance for your specific workload.
matmul_kernel [2048x2048] - Configuration Sweep

Config A: Grid(64,64) Block(16,16)
├─ Time:      4.2ms
├─ SM Util:   67%
└─ Occupancy: 50%

Config B: Grid(32,32) Block(32,32)  ← Best
├─ Time:      2.8ms (1.5x faster)
├─ SM Util:   94%
└─ Occupancy: 75%

Config C: Grid(128,128) Block(8,8)
├─ Time:      5.1ms
├─ SM Util:   45%
└─ Occupancy: 25%
Test multiple grid/block combinations automatically
Visual comparison of metrics across configs
Automatic recommendation of optimal settings
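The selection step of a sweep can be sketched in a few lines. This example uses the three configurations and timings from the sample sweep output. The `SweepResult` struct and the helper functions are hypothetical, and picking the lowest time is only one possible criterion, not necessarily the one the tool applies.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One row of a configuration sweep (hypothetical layout).
struct SweepResult {
    std::string name;
    int grid_x, grid_y;
    int block_x, block_y;
    double time_ms;
};

// Index of the configuration with the lowest measured time.
std::size_t fastest(const std::vector<SweepResult>& results) {
    assert(!results.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < results.size(); ++i) {
        if (results[i].time_ms < results[best].time_ms) best = i;
    }
    return best;
}

// Speedup of a candidate relative to a baseline (greater than 1 means faster).
double speedup(const SweepResult& baseline, const SweepResult& candidate) {
    assert(candidate.time_ms > 0.0);
    return baseline.time_ms / candidate.time_ms;
}

// Total number of threads launched by a configuration.
long long total_threads(const SweepResult& r) {
    return static_cast<long long>(r.grid_x) * r.grid_y * r.block_x * r.block_y;
}

// The three configurations and timings shown in the sample sweep output.
std::vector<SweepResult> sample_sweep() {
    return {
        {"Config A", 64, 64, 16, 16, 4.2},
        {"Config B", 32, 32, 32, 32, 2.8},
        {"Config C", 128, 128, 8, 8, 5.1},
    };
}
```

Calling `fastest(sample_sweep())` selects Config B, and `speedup` of Config A over Config B is 4.2 / 2.8, which is the 1.5x in the sample output. All three configurations launch the same 1,048,576 threads, so the sweep varies only how those threads are grouped into blocks.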
Profiler and Benchmarking are free with RightNow. Download and start optimizing today.