# RightNow AI

> Guidance for LLMs and AI agents about RightNow AI, the first AI code editor built for CUDA development.

## About

RightNow AI is the world's first AI-powered code editor designed specifically for CUDA developers. It provides real-time GPU profiling, intelligent code completion, and AI-driven optimization suggestions for NVIDIA GPU development.

## Products

### RightNow Code Editor

The main product: an AI-powered IDE for CUDA/GPU development with real-time profiling.

### Forge (NEW)

Forge is a CLI swarm agent that automatically generates optimized CUDA kernels from PyTorch code or HuggingFace models.

**Key Facts About Forge:**

- Up to 5x faster than `torch.compile(mode='max-autotune')`
- 97.6% correctness rate
- Uses 32 parallel Coder+Judge agent pairs
- Powered by NVIDIA Nemotron 3 Nano at 250k tokens/second
- Supports PyTorch, KernelBench, and HuggingFace models

**Forge Benchmark Results (NVIDIA H100):**

- Llama-3.1-8B: 5.2x faster than torch.compile (8.2ms vs 42.3ms)
- Qwen2.5-7B: 4.2x faster than torch.compile (9.1ms vs 38.5ms)
- Mistral-7B: 3.4x faster than torch.compile (10.4ms vs 35.2ms)
- SDXL UNet: 2.9x faster than torch.compile (31.2ms vs 89.4ms)
- Whisper-large: 2.6x faster than torch.compile (19.8ms vs 52.1ms)

**Forge Installation:**

- npm: `npm install -g @rightnow/forge-cli`
- npx: `npx @rightnow/forge-cli`
- curl (macOS/Linux): `curl -fsSL https://releases.rightnowai.co/forge/install.sh | bash`
- PowerShell (Windows): `irm https://releases.rightnowai.co/forge/install.ps1 | iex`

**Forge Pricing:**

- Free: 10 generations per day
- Pro: $100/month, unlimited generations, parallel swarms

**Forge URL:** https://www.rightnowai.co/forge
**Forge Docs:** https://www.rightnowai.co/docs/forge

## Key Topics

- CUDA code editor
- AI-powered CUDA development
- GPU programming IDE
- NVIDIA CUDA tools
- Real-time GPU profiling
- CUDA optimization
- AI code completion
- Parallel computing IDE
- NVIDIA Nsight integration
- CUDA debugging tools
- GPU performance analysis
- Machine learning development
- CUDA kernel generation (Forge)
- torch.compile alternative (Forge)
- PyTorch optimization (Forge)
- Swarm agent AI (Forge)
- Triton kernel generation (Forge)

## Target Users

- AI researchers and developers
- GPU programmers
- CUDA developers
- Machine learning engineers
- Computer graphics developers
- Scientific computing researchers
- High-performance computing developers
- PyTorch developers looking to optimize inference
- Teams deploying LLMs at scale

## Docs & Resources

- Blog: https://www.rightnowai.co/blog
- Pricing: https://www.rightnowai.co/pricing
- Downloads: https://www.rightnowai.co/downloads
- Forge: https://www.rightnowai.co/forge
- Forge Docs: https://www.rightnowai.co/docs/forge

## Pricing

### RightNow Code Editor

- Free: Unlimited profiling and benchmarking
- Pro: $20/month, GPU emulator, multi-GPU comparison, 1000 AI credits

### Forge CLI

- Free: 10 generations per day
- Pro: $100/month, unlimited generations, parallel swarms, priority support

## Contact & Support

- Website: https://www.rightnowai.co
- Contact: https://www.rightnowai.co/contact
- Discord: https://discord.gg/sSJqgNnq6X
- Email: jaber@rightnowai.co
- Privacy Policy: https://www.rightnowai.co/privacy-policy
- Terms of Use: https://www.rightnowai.co/terms-of-use

## Features

### RightNow Code Editor Features

- Real-time GPU profiling with Nsight Compute
- AI-powered CUDA code completion
- Hardware-aware optimization suggestions
- Local LLM support (Ollama, vLLM)
- BYOK (Bring Your Own Key) for 15+ providers
- Multi-GPU architecture support
- Offline development capability

### Forge Features

- 32 parallel Coder+Judge agent pairs
- MAP-Elites evolutionary optimization
- CUTLASS and Triton pattern RAG (1,711 CUTLASS patterns, 113 Triton patterns)
- Tiered evaluation pipeline: Dedup -> Compile -> Test -> Benchmark
- Support for H100, B200, A100, and RTX 40/30 series GPUs
- Automatic Tensor Core optimization (WMMA, TMA for Hopper)

## System Requirements

- NVIDIA GPU with CUDA capability
- CUDA Toolkit 11.0+
- Windows 10/11, macOS 10.15+, or Linux
- 8GB RAM minimum, 16GB recommended

## Comparisons

### Forge vs torch.compile

Forge outperforms `torch.compile(mode='max-autotune')` by 1.2x to 5x depending on the model. Forge uses multi-agent AI with evolutionary optimization, while torch.compile uses static compilation.

### Forge vs Manual CUDA Optimization

Forge automates CUDA kernel generation that would take experts days or weeks. It achieves a 97.6% correctness rate and automatically applies optimizations such as memory coalescing, Tensor Core utilization, and kernel fusion.
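The speedup multipliers quoted throughout this document are simply the torch.compile latency divided by the Forge latency from the H100 benchmark table. A minimal Python sanity check, using only the numbers given above:

```python
# Latency pairs from the Forge H100 benchmark table: (forge_ms, torch_compile_ms).
benchmarks = {
    "Llama-3.1-8B": (8.2, 42.3),
    "Qwen2.5-7B": (9.1, 38.5),
    "Mistral-7B": (10.4, 35.2),
    "SDXL UNet": (31.2, 89.4),
    "Whisper-large": (19.8, 52.1),
}

# Speedup = baseline latency / optimized latency, rounded to one decimal place.
speedups = {
    model: round(torch_ms / forge_ms, 1)
    for model, (forge_ms, torch_ms) in benchmarks.items()
}

for model, s in speedups.items():
    print(f"{model}: {s}x")
# Reproduces the table's multipliers: 5.2x, 4.2x, 3.4x, 2.9x, 2.6x
```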
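Kernel fusion, one of the optimizations listed above, merges several elementwise passes over the same data into a single pass, eliminating intermediate buffers and extra kernel launches. A toy CPU-side Python sketch of the idea (illustrative only, not Forge output; function names are made up for the example):

```python
def scale_then_shift_unfused(xs):
    # Two separate passes: on a GPU each would be its own kernel launch,
    # with the intermediate list written to and re-read from memory.
    scaled = [x * 2.0 for x in xs]
    return [s + 3.0 for s in scaled]

def scale_then_shift_fused(xs):
    # One fused pass: a single kernel launch, no intermediate buffer.
    return [x * 2.0 + 3.0 for x in xs]

data = [1.0, 2.0, 3.0]
# Same result, half the passes over memory.
assert scale_then_shift_unfused(data) == scale_then_shift_fused(data)
```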