RightNow AI is an all-in-one AI-powered code editor designed specifically for CUDA development. It is the only tool that combines agentic, hardware-aware AI, a GPU emulator, GPU virtualization, real-time profiling with a smart terminal, line-by-line performance analysis directly in the editor, and a benchmarking terminal with sweep configurations.

Which NVIDIA GPUs are supported by RightNow AI?

RightNow AI supports all NVIDIA GPUs with CUDA Toolkit 11.0-12.5, including GeForce RTX 40/30/20 series, GTX 16/10 series, Quadro RTX, Tesla, A100, and H100.

How much does RightNow AI cost?

RightNow AI is free to use with unlimited profiling and benchmarking. RightNow Pro costs $20 per month and adds GPU emulator access (50+ GPUs), multi-GPU comparison, and 1,000 AI credits per month.

Can I use RightNow AI on macOS?

Yes, RightNow AI is fully available on macOS (Apple Silicon and Intel). Mac users can use remote GPUs for free or our built-in GPU emulator for CUDA profiling.

Benchmarking

Measure and compare CUDA kernel performance

Quick Start

Getting Started with Benchmarking

Open Benchmark Panel: Click "Benchmark" in the bottom panel or use command palette: "CUDA: Open Benchmark View"
Configure Benchmark: Set iterations, warmup runs, and data sizes in the configuration view
Run Benchmark: Click the "Run Benchmark" button and watch results appear in real time
Compare Kernels: Select two benchmarked kernels to see side-by-side performance comparison

Configuration

Benchmark Parameters

Configure these parameters for accurate performance measurement:

Iterations: Number of times to run the kernel (e.g., 100 for reliable statistics)
Warmup Runs: Initial runs to stabilize GPU and caches (e.g., 10 runs)
Data Sizes: Small, medium, large test configurations for different workloads
Timing Method: CUDA Events (GPU-side timing of the kernel itself) or CPU timers (host-side wall-clock time, which also includes launch and synchronization overhead)
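To make the roles of iterations and warmup runs concrete, here is a minimal sketch of a benchmark loop in Python. It is not RightNow AI's implementation: run_benchmark and fake_kernel are hypothetical names, the workload is a CPU stand-in for a kernel launch, and timing uses a CPU timer (time.perf_counter). With a real GPU kernel you would also need to synchronize the device before stopping a CPU timer, or use CUDA events instead.

```python
import time

def run_benchmark(kernel, iterations=100, warmup_runs=10):
    """Time `kernel` with a CPU timer and return per-iteration times in ms."""
    # Warmup runs execute the workload but are not recorded.
    for _ in range(warmup_runs):
        kernel()

    times_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        kernel()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return times_ms

# Hypothetical stand-in workload (assumption: replaces a real kernel launch).
def fake_kernel(n=10_000):
    return sum(i * i for i in range(n))

samples = run_benchmark(fake_kernel, iterations=100, warmup_runs=10)
print(f"collected {len(samples)} samples, first = {samples[0]:.4f} ms")
```

Only the measured iterations end up in the returned list, so the warmup cost never skews the statistics.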

Running a Benchmark

Configure your benchmark settings
Click the "Run Benchmark" button
Watch the progress bar as results stream in
Check the live metrics as they update
Use the Stop button to cancel a long-running benchmark if needed

Results and Analysis

Live Results

During benchmark execution:

Running Status: Shows current iteration (e.g., "Running 45/100")
Time Graph: Real-time performance visualization
Statistics: Live mean, min, max updates
Progress Bar: Visual completion indicator

Final Results

After benchmark completion:

Execution Time: Average kernel runtime in milliseconds
Statistical Analysis: Mean, median, standard deviation, variance
Performance Metrics: Throughput (GB/s), efficiency percentages
Distribution Graph: Histogram showing timing distribution
Export Button: Save results to CSV or JSON format

Key Metrics

Timing Statistics:

Mean: Average execution time across all iterations
Median: Middle value, less affected by outliers
Standard Deviation: Measure of timing consistency
Min/Max: Best and worst case performance
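As an illustration of how these statistics behave, the sketch below computes them with Python's standard statistics module. The function name timing_summary and the sample values are made up for the example; they are not output from RightNow AI.

```python
import statistics

def timing_summary(times_ms):
    """Summarize a list of per-iteration timings (milliseconds)."""
    return {
        "mean": statistics.mean(times_ms),
        "median": statistics.median(times_ms),
        "stdev": statistics.stdev(times_ms),  # sample standard deviation
        "min": min(times_ms),
        "max": max(times_ms),
    }

# Made-up timings with one slow outlier (5.0 ms).
times_ms = [1.0, 1.1, 0.9, 1.0, 5.0]
summary = timing_summary(times_ms)
for name, value in summary.items():
    print(f"{name:>6}: {value:.3f} ms")
```

In this sample the single 5.0 ms outlier pulls the mean up to about 1.8 ms while the median stays at 1.0 ms, which is why the median is the more robust summary when a run contains occasional spikes.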

Performance Indicators:

Throughput: Data processed per second (GB/s)
Occupancy: Percentage of the GPU's available warp slots that are actively in use (active warps relative to the maximum each SM supports)
Efficiency: Comparison to theoretical peak performance
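Throughput and efficiency can be derived from a measured time by hand. The sketch below assumes 1 GB = 10^9 bytes, a vector-add style kernel that reads two float arrays and writes one, and a hypothetical peak bandwidth of 900 GB/s. All of these numbers are illustrative assumptions, not values reported by RightNow AI.

```python
def throughput_gb_s(bytes_moved, time_ms):
    """Bytes moved per second, in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / (time_ms / 1000.0) / 1e9

def efficiency_pct(achieved_gb_s, peak_gb_s):
    """Achieved throughput as a percentage of the theoretical peak."""
    return 100.0 * achieved_gb_s / peak_gb_s

# Assumed workload: c[i] = a[i] + b[i] over n 4-byte floats
# (two arrays read, one array written).
n = 1 << 24
bytes_moved = 3 * n * 4
time_ms = 0.5            # made-up measured kernel time
peak_gb_s = 900.0        # hypothetical device peak bandwidth

achieved = throughput_gb_s(bytes_moved, time_ms)
print(f"throughput: {achieved:.1f} GB/s")
print(f"efficiency: {efficiency_pct(achieved, peak_gb_s):.1f}% of peak")
```

Counting the bytes a kernel actually moves is the part that depends on your code; the rest is arithmetic.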

Comparing Kernels

How to Compare

Benchmark your original kernel implementation
Make optimizations to your code
Benchmark your optimized version
Click the "Compare" button
View side-by-side comparison

Comparison View Shows:

Side-by-side metrics: Direct comparison of all performance indicators
Speedup factor: How much faster or slower the second kernel is, shown as a ratio (e.g., "2.3x faster")
Winner highlighted: Better performing kernel shown in green
Regression detection: Automatic warning if performance decreased
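The speedup and regression checks above boil down to a ratio of two timings. The following is a minimal sketch under stated assumptions: compare is a hypothetical helper, and the 2% tolerance band for "no significant change" is an arbitrary choice for illustration, not the threshold RightNow AI uses.

```python
def compare(baseline_ms, candidate_ms, tolerance=0.02):
    """Return (speedup, verdict) for a candidate kernel against a baseline.

    speedup > 1 means the candidate is faster than the baseline.
    """
    speedup = baseline_ms / candidate_ms
    if speedup > 1.0 + tolerance:
        verdict = "faster"
    elif speedup < 1.0 - tolerance:
        verdict = "regression"
    else:
        verdict = "no significant change"
    return speedup, verdict

# Made-up mean timings in milliseconds.
for baseline, candidate in [(2.3, 1.0), (1.0, 1.25), (1.0, 1.01)]:
    speedup, verdict = compare(baseline, candidate)
    print(f"{baseline} ms -> {candidate} ms: {speedup:.2f}x ({verdict})")
```

A tolerance band keeps ordinary run-to-run noise from being reported as a win or a regression.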

Multi-Kernel Comparison

Compare multiple optimization approaches:

Benchmark baseline implementation
Test different optimization strategies
Compare all versions simultaneously
Identify best performing approach
Export comparison data for reports

Advanced Features

Data Size Scaling

Test performance across different input sizes:

Small: Cache-friendly workloads
Medium: Typical production sizes
Large: Memory-bandwidth bound scenarios
Custom: Define specific test configurations
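A size sweep simply repeats the same measurement for each input size. The sketch below shows the idea in Python with a CPU stand-in workload. The sweep function, the size labels, and the element counts are assumptions chosen for the example and do not reflect RightNow AI's actual presets.

```python
import time

def sweep(kernel, sizes, iterations=20, warmup_runs=3):
    """Return {label: mean time in ms} for each named input size."""
    results = {}
    for label, n in sizes.items():
        data = list(range(n))  # input is built outside the timed region
        for _ in range(warmup_runs):
            kernel(data)
        total_s = 0.0
        for _ in range(iterations):
            start = time.perf_counter()
            kernel(data)
            total_s += time.perf_counter() - start
        results[label] = total_s / iterations * 1000.0
    return results

# Hypothetical size presets (element counts).
sizes = {"small": 1_000, "medium": 10_000, "large": 100_000}
results = sweep(sum, sizes)  # built-in sum() stands in for a kernel
for label, mean_ms in results.items():
    print(f"{label:>6}: {mean_ms:.4f} ms")
```

Comparing how the mean time grows from one size to the next shows where a workload stops fitting in cache and starts to be limited by memory bandwidth.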

Statistical Analysis

Advanced statistical tools for reliable results:

Outlier Detection: Identifies and filters anomalous measurements
Confidence Intervals: Range of plausible values for the mean, used to judge whether an improvement is statistically significant
Variance Analysis: Consistency and stability metrics
Percentiles: P50, P95, P99 for latency analysis
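These tools can be approximated with Python's standard library. The sketch below is one possible approach, not the method RightNow AI uses: it filters outliers with the common 1.5 x IQR rule, reads percentiles from statistics.quantiles, and builds a 95% confidence interval with a normal approximation. All function names and sample values are hypothetical.

```python
import math
import statistics

def filter_outliers_iqr(times_ms, k=1.5):
    """Drop samples outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(times_ms, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [t for t in times_ms if low <= t <= high]

def latency_percentiles(times_ms):
    """Return P50, P95 and P99 of the samples."""
    cuts = statistics.quantiles(times_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def mean_ci95(times_ms):
    """95% confidence interval for the mean (normal approximation)."""
    mean = statistics.mean(times_ms)
    half = 1.96 * statistics.stdev(times_ms) / math.sqrt(len(times_ms))
    return mean - half, mean + half

# Made-up timings with one anomalous 9.0 ms sample.
times_ms = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 9.0]
cleaned = filter_outliers_iqr(times_ms)
print("kept", len(cleaned), "of", len(times_ms), "samples")
print("percentiles:", latency_percentiles(cleaned))
print("95% CI for mean:", mean_ci95(cleaned))
```

If the confidence intervals of two kernels do not overlap, the difference between them is unlikely to be measurement noise.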

Export and Reporting

Share benchmark results:

CSV Export: Raw data for further analysis
JSON Format: Structured data with metadata
Charts: Export visualizations as PNG/SVG
Reports: Generate formatted performance reports
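If you want to post-process timings outside the editor, raw data and metadata can be written with Python's csv and json modules. The sketch below is only an illustration: the column names, JSON keys, and metadata fields are assumptions and do not describe the schema of RightNow AI's exported files.

```python
import csv
import io
import json

def export_csv(times_ms, fileobj):
    """Write one row per iteration: iteration number and time in ms."""
    writer = csv.writer(fileobj)
    writer.writerow(["iteration", "time_ms"])
    for i, t in enumerate(times_ms, start=1):
        writer.writerow([i, t])

def export_json(times_ms, metadata, fileobj):
    """Write timings together with descriptive metadata."""
    json.dump({"metadata": metadata, "times_ms": times_ms}, fileobj, indent=2)

# Made-up timings and hypothetical metadata fields.
times_ms = [1.02, 0.98, 1.01]
metadata = {"kernel": "vector_add", "iterations": 3, "warmup_runs": 10}

csv_buffer = io.StringIO()
export_csv(times_ms, csv_buffer)
print(csv_buffer.getvalue())

json_buffer = io.StringIO()
export_json(times_ms, metadata, json_buffer)
print(json_buffer.getvalue())
```

Both functions accept any writable text file object, so the same code works with open("results.csv", "w", newline="") or an in-memory buffer as shown here.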

Best Practices

For Accurate Results

Use at least 100 iterations for statistical reliability
Always include warmup runs to stabilize GPU clocks and caches
Keep data sizes consistent when comparing kernels
Benchmark before and after each optimization
Close other GPU applications for isolated measurements
Run benchmarks multiple times and average results

Important Considerations

Let the GPU cool down and its clocks settle between tests, since thermal throttling can skew results
Monitor GPU temperature and clock speeds during benchmarking
Consider power and thermal limits for production scenarios

Pro Tip: Combine benchmarking with Real-Time Profiling to understand why performance changes occur, not just measure that they do.