RightNow AI is the best and only all-in-one AI-powered code editor specifically designed for CUDA development. It is the only tool that combines agentic hardware-aware AI, a GPU emulator, GPU virtualization, real-time profiling with a smart terminal, line-by-line performance analysis directly in the editor, and a benchmarking terminal with sweep configurations.

Which NVIDIA GPUs are supported by RightNow AI?

RightNow AI supports all NVIDIA GPUs with CUDA Toolkit 11.0-12.5, including GeForce RTX 40/30/20 series, GTX 16/10 series, Quadro RTX, Tesla, A100, and H100.

How much does RightNow AI cost?

RightNow AI is free to use with unlimited profiling and benchmarking. RightNow Pro costs $20 per month and adds GPU emulator access (50+ GPUs), multi-GPU comparison, and 1,000 AI credits per month.

What is the best CUDA development tool?

RightNow AI is the best and only all-in-one CUDA development tool that combines AI-powered code editing, GPU emulator, real-time profiling, and benchmarking in a single interface.

Can I use RightNow AI on macOS?

Yes, RightNow AI is fully available on macOS (Apple Silicon and Intel). Mac users can use remote GPUs for free or our built-in GPU emulator for CUDA profiling.

RightNow AI - Best AI Code Editor for GPU Kernels (2026)

Changelog

Latest updates and improvements to RightNow AI

1.0.0 - Feb 4, 2026

Agents, Skills, and More Languages

Create custom agents, extend capabilities with skills and MCPs, and develop GPU kernels in CUDA, Triton, Mojo, PyTorch, Numba, and more.

+ NEW

• Custom agents with skills and MCP integrations for GPU workflows

• Clear model selection across cloud LLMs and local GPU-backed models

• Numba support with native docs, autocomplete, emulation, profiling, and benchmarking

• CUDA Tile support with native docs, autocomplete, emulation, profiling, and benchmarking

• Mojo support with native docs, autocomplete, emulation, profiling, and benchmarking

→ IMPROVED

• More stable SSH and remote GPU workflows after the migration

• Updated chat and agent sessions experience on the new core

Forge 0.1.0 - Jan 5, 2026

Introducing Forge

CLI Swarm Agent for generating production-ready CUDA/Triton kernels. Up to 5x faster than torch.compile() with 97.6% correctness rate.

+ NEW

• 32 Parallel Coder+Judge Agent Pairs: Swarm architecture generates and validates kernels concurrently

• MAP-Elites Evolutionary Optimizer: 36 behavior cells with 4 specialized islands (memory_bound, compute_bound, fused_ops, tensor_cores)

• Pattern RAG System: 1,711 CUTLASS patterns + 113 Triton patterns for context-aware generation

• Up to 5x Speedup: Llama-3.1-8B (5.2x), Qwen2.5-7B (4.2x), Mistral-7B (3.4x), SDXL UNet (2.9x) vs torch.compile

• Multiple Input Types: HuggingFace model IDs, KernelBench tasks (250+), or custom PyTorch files

• Dual Output Formats: Triton (Python GPU kernels) or native CUDA C++ - drop-in PyTorch replacement

• Three Optimization Modes: --turbo (fast ~2min), default (balanced), --quality (maximum optimization)

• Interactive CLI: forge command launches wizard, plus forge browse for KernelBench task browser

• Session Management: Track past optimizations with forge session list

• Credit System: 1 credit per KernelBench/custom kernel, 1-2 credits for HuggingFace models
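For readers unfamiliar with MAP-Elites, the following is a minimal sketch of the general algorithm on a toy problem, not Forge's implementation. The 6 x 6 grid (chosen only because it yields 36 cells), the toy fitness function, and every name in the code are assumptions made for illustration; the island mechanism is omitted.

```python
import random

GRID = 6  # assumed 6 x 6 layout giving 36 cells; the real descriptor axes are not documented here


def fitness(genome):
    # Toy objective: higher is better, with the peak at (0.5, 0.5).
    return -sum((g - 0.5) ** 2 for g in genome)


def behavior_cell(genome):
    # Map each genome coordinate in [0, 1] to one of GRID bins.
    return tuple(min(int(g * GRID), GRID - 1) for g in genome)


def mutate(genome, rng, sigma=0.1):
    # Gaussian perturbation, clamped back into [0, 1].
    return tuple(min(1.0, max(0.0, g + rng.gauss(0.0, sigma))) for g in genome)


def map_elites(iterations=2000, seed=0):
    rng = random.Random(seed)
    archive = {}  # cell -> (fitness, genome)
    for i in range(iterations):
        if archive and i >= 50:
            # Pick a random elite from the archive and mutate it.
            _, parent = archive[rng.choice(sorted(archive))]
            child = mutate(parent, rng)
        else:
            # Bootstrap the archive with random genomes.
            child = (rng.random(), rng.random())
        cell = behavior_cell(child)
        score = fitness(child)
        # Keep the child only if its cell is empty or it beats the current elite.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
    return archive


archive = map_elites()
print(f"filled {len(archive)} of {GRID * GRID} cells")
```

The point of the archive is that a candidate competes only against the elite in its own cell, so diverse solutions survive alongside the single best one.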

→ IMPROVED

• 97.6% Correctness Rate: Tiered evaluation pipeline (Dedup → Compile → Test → Benchmark)

• 250k Tokens/Second: Powered by fine-tuned NVIDIA Nemotron 3 Nano 30B for fast generation

• Automatic Tensor Core Optimization: WMMA, TMA for Hopper architecture

• Cross-platform Install: npm, npx, curl (macOS/Linux), PowerShell (Windows)
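The tiered evaluation order named above (Dedup → Compile → Test → Benchmark) can be illustrated with a toy pipeline. This is a sketch under stated assumptions, not Forge's code: the candidates are plain Python functions standing in for GPU kernels, and all function names are hypothetical. The idea it shows is that cheap filters run first so that expensive benchmarking only touches candidates that are unique, compile, and pass tests.

```python
import hashlib
import timeit


def reference(x):
    # Ground truth that every candidate must match.
    return x * 2


CANDIDATES = [
    "def kernel(x):\n    return x * 2\n",
    "def kernel(x):\n    return x * 2\n",  # exact duplicate, dropped by dedup
    "def kernel(x)\n    return x * 2\n",   # syntax error, dropped at compile
    "def kernel(x):\n    return x + 2\n",  # wrong result, dropped at test
    "def kernel(x):\n    return x + x\n",  # correct alternative
]


def dedup(sources):
    # Stage 1: drop byte-identical candidates using a content hash.
    seen, unique = set(), []
    for src in sources:
        digest = hashlib.sha256(src.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(src)
    return unique


def compile_stage(sources):
    # Stage 2: keep only candidates that compile.
    compiled = []
    for src in sources:
        try:
            compiled.append((src, compile(src, "<candidate>", "exec")))
        except SyntaxError:
            pass
    return compiled


def check_stage(compiled):
    # Stage 3: keep only candidates that match the reference on test inputs.
    passing = []
    for src, code in compiled:
        namespace = {}
        exec(code, namespace)
        kernel = namespace["kernel"]
        if all(kernel(x) == reference(x) for x in range(-3, 4)):
            passing.append((src, kernel))
    return passing


def benchmark_stage(passing):
    # Stage 4: time the survivors and sort fastest first.
    timed = [
        (timeit.timeit(lambda k=kernel: k(21), number=10_000), src)
        for src, kernel in passing
    ]
    return sorted(timed)


survivors = benchmark_stage(check_stage(compile_stage(dedup(CANDIDATES))))
for seconds, src in survivors:
    print(f"{seconds:.6f}s  {src.splitlines()[1].strip()}")
```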

0.2.0 - Jan 1, 2026

PyTorch Kernel Support

Profile, benchmark, and emulate PyTorch kernels directly in the editor. Same workflow as CUDA, Triton, TileLang, and CUTE.

+ NEW

• PyTorch Kernel Profiling: Profile custom PyTorch kernels with full NCU integration

• PyTorch Benchmarking: Run statistical timing analysis on PyTorch operations

• PyTorch Emulation: Test PyTorch kernels across 86+ GPU architectures without hardware

• Automatic Kernel Detection: Detects PyTorch kernels from your code automatically

• Cross-DSL Comparison: Compare PyTorch kernel performance side-by-side with CUDA/Triton implementations

• Unified Workflow: Same profiling methods (NCU Full, Fast, Static, Line-by-Line) work across all supported languages

0.1.0 - Dec 8, 2025

Multi-DSL GPU Development Platform

RightNow AI now supports Triton, TileLang, and CUTE alongside native CUDA, with intelligent documentation retrieval that understands your GPU and code context.

+ NEW

• Multi-DSL Platform: CUDA, Triton, TileLang & CUTE support with automatic detection and compilation

• Unified Profiling Experience: Profile Triton and TileLang kernels with NCU just like CUDA kernels

• CUTLASS Auto-Detection: Automatically finds your CUTLASS installation across common locations

• Enhanced Language Features: Full semantic highlighting, hover documentation, and go-to-definition for all DSLs

• Context-Aware Help: AI automatically retrieves relevant GPU documentation based on your code and questions

• GPU-Specific Filtering: Only shows documentation compatible with your GPU architecture

• 100+ Documentation Sources: Comprehensive coverage of CUDA, Triton, TileLang, and CUTE APIs

• DSL-Aware Suggestions: AI understands which DSL you're working with and provides relevant guidance

• GPU Context Display: See your active GPU info directly in the chat

• DSL-Specific Metrics: Extracts Triton warps/stages, TileLang block sizes, CUTE tile dimensions

• Smart Performance Warnings: Get alerts for high warp counts, excessive sync points, or register pressure
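To make the register-pressure warning concrete, here is a simplified occupancy estimate. It is a sketch, not the editor's analysis: the per-SM limits are illustrative defaults rather than values for any specific GPU, allocation granularity is ignored, and the function name is hypothetical.

```python
def estimate_occupancy(threads_per_block, regs_per_thread, smem_per_block, *,
                       warp_size=32, max_warps_per_sm=64, regs_per_sm=65536,
                       smem_per_sm=98304, max_blocks_per_sm=32):
    """Return (occupancy, limiting resource) under a simplified model."""
    # Warps needed by one block, rounded up.
    warps_per_block = -(-threads_per_block // warp_size)
    # How many blocks fit on one SM according to each resource.
    limits = {
        "warps": max_warps_per_sm // warps_per_block,
        "blocks": max_blocks_per_sm,
        "registers": regs_per_sm // (regs_per_thread * warps_per_block * warp_size),
    }
    if smem_per_block:
        limits["shared memory"] = smem_per_sm // smem_per_block
    # The scarcest resource decides how many blocks are resident.
    limiter = min(limits, key=limits.get)
    resident_blocks = limits[limiter]
    return resident_blocks * warps_per_block / max_warps_per_sm, limiter


occ_heavy, limiter_heavy = estimate_occupancy(256, 64, 0)
occ_light, limiter_light = estimate_occupancy(256, 32, 0)
print(f"64 regs/thread: {occ_heavy:.0%} occupancy, limited by {limiter_heavy}")
print(f"32 regs/thread: {occ_light:.0%} occupancy, limited by {limiter_light}")
```

Under these assumed limits, doubling registers per thread from 32 to 64 halves the number of resident warps, which is the kind of situation a register-pressure alert is meant to flag.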

→ IMPROVED

• Benchmark Accuracy: Triton benchmarks now show correct DSL columns (BLOCK, Warps, Stages)

• Emulation Reliability: Fixed bug where Triton used real GPU instead of emulator when emulated GPU selected

• GPU Detection: Better architecture identification and compatibility checking

• Chat Interface: Cleaner message rendering with improved code display

✓ FIXED

• Triton Emulation: Now correctly uses emulation mode when emulated GPU is selected

• Benchmark Columns: Triton and TileLang show proper DSL-specific columns instead of CUDA defaults

• Parameter Handling: DSL parameters now correctly flow through profiling pipeline

0.0.76 - Nov 30, 2025

Mac Support & Multi-GPU Profiling

Full macOS compatibility with Metal GPU detection for Apple Silicon. Multi-GPU profiling for side-by-side GPU comparison.

+ NEW

• Mac Platform Support: Full macOS compatibility with Metal GPU detection for Apple Silicon (M1/M2/M3) and Intel GPUs

• Multi-GPU Support: Select and profile multiple GPUs simultaneously with side-by-side performance comparison

• GPU Filter Dropdown: Filter profiling results by specific GPU

• Remote GPU Support: Full SSH connection to remote GPU servers with seamless CUDA environment integration

• Dynamic Profiling Methods: 5 profiling options (NCU Full, Fast, Static, Line-by-Line, Kernel Replay) with quick-switch buttons

→ IMPROVED

• Simplified Chat Modes: Streamlined to Agent, Gather, and Forge (coming soon)

• Chart Rendering: Better performance visualization for profiling data

✓ FIXED

• Double Credit Consumption: Fixed bug charging credits twice

• Profiling Data Persistence: No more data loss on view refresh

• SSL Handshake Issues: Fixed connection stability

! BREAKING

• Iterate Mode: Removed and consolidated into simplified chat mode system

0.0.45 - Oct 30, 2025

Execution-Driven Emulator & Agentic AI Optimization

Cycle-accurate GPU emulation with 96-98% accuracy. No physical GPU required. AI automatically iterates and optimizes kernels to peak performance.

+ NEW

• New GPU Emulator built from scratch with cycle-accurate scheduling and multi-warp latency simulation

• PTX and SASS translation for deep low-level analysis and debugging

• Remote connection support (SSH + WSL) to run and profile kernels anywhere

• Agentic AI "Iterate Mode" that writes, profiles, and optimizes kernels automatically until peak performance

• Kernel fusion detection for better performance across sequential and parallel operations (Beta Users)

→ IMPROVED

• Enhanced benchmarking and profiling with full metric breakdowns and bottleneck detection

• Local LLMs now work reliably

• 96-98% emulator accuracy vs real GPUs

• Simulation speed under 100ms for 1,000 instructions

• 30% more accurate than previous builds

0.0.31 - Sep 22, 2025

Remote GPU Access & AI Insights

Code anywhere, run everywhere. Connect to remote GPUs with SSH and cloud providers.

+ NEW

• Remote GPU connection via SSH integration

• Native support for GPU cloud providers (RunPod, Google Cloud, AWS, Azure, Paperspace, Vast.ai, Lambda Labs)

• Seamless profiling on remote GPUs as if they were local

• Automatic GPU detection on remote machines

• Smart Profiling Terminal with AI-powered insights

• Automatic bottleneck detection (memory-bound vs compute-bound)

• NCU-compatible metrics without requiring hardware

• AI-generated optimization suggestions (memory coalescing, bank conflicts, occupancy, branch divergence)

→ IMPROVED

• Fixed NCU GUI integration for report generation

• Enhanced profiling UI with collapsible sections

• WebWorker-based analysis for non-blocking performance

• LRU caching for instant re-analysis

• Improved error handling and fallback mechanisms
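As an illustration of how LRU caching makes re-analysis of unchanged code effectively instant, here is a small sketch using Python's standard `functools.lru_cache`. The `analyze` function and what it counts are hypothetical stand-ins; the product's actual analysis and cache are not described in this changelog.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def analyze(source: str) -> tuple:
    # Stand-in for an expensive analysis pass: count kernels and barriers.
    # A tuple is returned so that cached results cannot be mutated by callers.
    return (source.count("__global__"), source.count("__syncthreads()"))


SOURCE = """
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
    __syncthreads();
}
"""

first = analyze(SOURCE)   # computed
second = analyze(SOURCE)  # served from the cache, same source text
info = analyze.cache_info()
print(first, info)
```

Because the cache key is the source text itself, any edit to the kernel produces a new key and triggers a fresh analysis, while the least recently used entries are evicted once `maxsize` is reached.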

0.0.30 - Sep 18, 2025

Full GPU Emulator - No Hardware Required

Profile any CUDA kernel without a physical GPU. Choose from 86+ GPU architectures.

+ NEW

• Full GPU emulator for profiling without physical hardware

• 86+ GPU architectures supported

• Static kernel analysis engine (under 100ms)

• Roofline model implementation with ±15% accuracy

• Architecture comparison across multiple GPUs instantly
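For context on the roofline model mentioned above, the standard formulation caps attainable throughput at the smaller of peak compute and arithmetic intensity times peak memory bandwidth. The sketch below applies it to a float32 vector add; the peak figures are hypothetical round numbers, not the specifications of any particular GPU, and the function name is an assumption.

```python
def roofline(flops, bytes_moved, peak_gflops, peak_gbps):
    """Return (arithmetic intensity, attainable GFLOP/s, bound) for a kernel."""
    intensity = flops / bytes_moved      # FLOP per byte
    ridge = peak_gflops / peak_gbps      # intensity at which the two roofs meet
    attainable = min(peak_gflops, intensity * peak_gbps)
    bound = "memory-bound" if intensity < ridge else "compute-bound"
    return intensity, attainable, bound


# float32 vector add c[i] = a[i] + b[i]:
# one FLOP per element, two 4-byte loads and one 4-byte store per element.
n = 1 << 20
intensity, attainable, bound = roofline(
    flops=n,
    bytes_moved=12 * n,
    peak_gflops=10_000.0,  # hypothetical peak compute
    peak_gbps=1_000.0,     # hypothetical peak memory bandwidth
)
print(f"{intensity:.4f} FLOP/B, {attainable:.1f} GFLOP/s attainable, {bound}")
```

With only one FLOP for every 12 bytes moved, the vector add sits far to the left of the ridge point, so memory bandwidth rather than compute sets its ceiling.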

0.0.29 - Sep 14, 2025

Benchmarking Terminal & Static Profiling

Full benchmarking terminal with visual kernel comparisons and instant CodeLens insights.

+ NEW

• Benchmarking Terminal for benchmark sweeps and custom kernel configurations

• Visual comparison between kernels

• Static Profiling with instant CUDA kernel insights in CodeLens

• Real-time registers, shared memory, and occupancy analysis while typing

• Profile with Configs - complete cycle with persistent configs and history

• Tools Detector (nvidia-smi, Nsight Compute, nvcc)
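A benchmark sweep is essentially a Cartesian product of launch parameters. The following sketch generates such a sweep; the parameter names and the one-dimensional launch layout are assumptions for illustration and do not reflect the terminal's actual configuration format.

```python
import itertools


def sweep(block_sizes, problem_sizes):
    """Enumerate launch configurations for every block/problem size pair."""
    configs = []
    for block, n in itertools.product(block_sizes, problem_sizes):
        # Ceiling division so the grid covers all n elements.
        grid = (n + block - 1) // block
        configs.append({"block": block, "grid": grid, "n": n})
    return configs


configs = sweep(block_sizes=[128, 256, 512], problem_sizes=[1 << 20, 1 << 24])
for cfg in configs:
    print(cfg)
```

Each resulting dictionary describes one run, so the same kernel can be timed across all combinations and the results compared in a single table.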

0.0.28 - Sep 14, 2025

CUDA Benchmarking System

Comprehensive benchmarking with execution time, memory bandwidth, occupancy, and multi-GPU support.

+ NEW

• Execution time, memory bandwidth, occupancy, SM efficiency, and register usage metrics

• Data size presets, warmup runs, and execution controls

• Grid/block optimization with automatic suggestions

• Multi-GPU support with device-specific benchmarking

• Session management with persistence across restarts

• Sortable results with performance indicators

• CSV export for sharing benchmark results
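The combination of warmup runs, timed runs, and CSV export listed above can be sketched as follows. This is a CPU-only illustration with hypothetical function names and column names, not the product's benchmarking code. On a GPU the timed region would also need a device synchronization (for example `torch.cuda.synchronize()` in PyTorch) so that the kernel has finished before the clock stops.

```python
import csv
import io
import statistics
import time


def benchmark(fn, warmup=3, runs=10):
    # Warmup runs are executed but not timed, so one-time costs are excluded.
    for _ in range(warmup):
        fn()
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    return {
        "runs": runs,
        "mean_ms": statistics.mean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "stdev_ms": statistics.stdev(samples_ms) if runs > 1 else 0.0,
        "min_ms": min(samples_ms),
    }


def to_csv(rows):
    # Serialize a list of result dictionaries to CSV text.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def workload():
    # CPU stand-in for a kernel launch.
    return sum(i * i for i in range(10_000))


result = {"name": "workload", **benchmark(workload)}
report = to_csv([result])
print(report)
```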

0.0.20 - Aug 18, 2025

Multi-LLM Provider Support

Support for 15+ AI providers including local models.

+ NEW

• OpenAI, Anthropic, DeepSeek integration

• Local Ollama and vLLM support

• BYOK (Bring Your Own Key) flexibility

• Fill-in-the-Middle autocomplete
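Fill-in-the-Middle autocomplete generally works by giving the model the text before and after the cursor and asking it to produce what goes in between. The sketch below shows that prompt construction in its simplest form. The sentinel strings are placeholders: real models each define their own special tokens, and the format RightNow AI uses is not stated in this changelog.

```python
def build_fim_prompt(text, cursor, prefix_tok="<PRE>", suffix_tok="<SUF>", middle_tok="<MID>"):
    """Split text at the cursor and arrange it as prefix, suffix, then a middle marker."""
    prefix, suffix = text[:cursor], text[cursor:]
    # The model is expected to continue after the middle marker
    # with the code that belongs between prefix and suffix.
    return f"{prefix_tok}{prefix}{suffix_tok}{suffix}{middle_tok}"


code = "int i = ;\nif (i < n) x[i] *= a;"
cursor = code.index(";")  # cursor sits just before the first semicolon
prompt = build_fim_prompt(code, cursor)
print(prompt)
```

Supplying the suffix is what distinguishes this from plain left-to-right completion: the model can see that `i` is later compared against `n` and used as an index, which constrains what a sensible completion looks like.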

0.0.10 - Aug 5, 2025

Initial Release

First public release of RightNow AI.

+ NEW

• NVIDIA Nsight Compute integration

• Real-time GPU performance metrics

• Hardware detection and optimization

• CUDA syntax highlighting and IntelliSense