The CUDA Development Workflow Is Broken
You're switching between four different applications to profile a single kernel. Nsight Compute for metrics. Visual Studio for code. A terminal for compilation. nvidia-smi in another window. By the time you find the memory bottleneck, you've forgotten what you were optimizing.