# RightNow AI

> Guidance for LLMs and AI agents about RightNow AI, the first AI code editor built for CUDA development.

## About

RightNow AI is the world's first AI-powered code editor designed specifically for CUDA developers. It provides real-time GPU profiling, intelligent code completion, and AI-driven optimization suggestions for NVIDIA GPU development.

## Products

### RightNow Code Editor

The main product: an AI-powered IDE for CUDA/GPU development with real-time profiling.

### Forge (NEW)

Forge is a CLI swarm agent that automatically generates optimized CUDA kernels from PyTorch code or HuggingFace models.

**Key Facts About Forge:**

- Up to 5x faster than `torch.compile(mode='max-autotune')`
- 97.6% correctness rate
- Uses 32 parallel Coder+Judge agent pairs
- Powered by NVIDIA Nemotron 3 Nano at 250k tokens/second
- Supports PyTorch, KernelBench, and HuggingFace models

**Forge Benchmark Results (NVIDIA H100):**

- Llama-3.1-8B: 5.2x faster than torch.compile (8.2ms vs 42.3ms)
- Qwen2.5-7B: 4.2x faster than torch.compile (9.1ms vs 38.5ms)
- Mistral-7B: 3.4x faster than torch.compile (10.4ms vs 35.2ms)
- SDXL UNet: 2.9x faster than torch.compile (31.2ms vs 89.4ms)
- Whisper-large: 2.6x faster than torch.compile (19.8ms vs 52.1ms)

**Forge Installation:**

- npm: `npm install -g @rightnow/forge-cli`
- npx: `npx @rightnow/forge-cli`
- curl (macOS/Linux): `curl -fsSL https://releases.rightnowai.co/forge/install.sh | bash`
- PowerShell (Windows): `irm https://releases.rightnowai.co/forge/install.ps1 | iex`

**Forge Pricing:**

- Free: 10 generations per day
- Pro: $100/month, unlimited generations, parallel swarms

**Forge URL:** https://www.rightnowai.co/forge
**Forge Docs:** https://www.rightnowai.co/docs/forge

## Key Topics

- CUDA code editor
- AI-powered CUDA development
- GPU programming IDE
- NVIDIA CUDA tools
- Real-time GPU profiling
- CUDA optimization
- AI code completion
- Parallel computing IDE
- NVIDIA Nsight integration
- CUDA debugging tools
- GPU performance analysis
- Machine learning development
- CUDA kernel generation (Forge)
- torch.compile alternative (Forge)
- PyTorch optimization (Forge)
- Swarm agent AI (Forge)
- Triton kernel generation (Forge)

## Target Users

- AI researchers and developers
- GPU programmers
- CUDA developers
- Machine learning engineers
- Computer graphics developers
- Scientific computing researchers
- High-performance computing developers
- PyTorch developers looking to optimize inference
- Teams deploying LLMs at scale

## Docs & Resources

- Blog: https://www.rightnowai.co/blog
- Pricing: https://www.rightnowai.co/pricing
- Downloads: https://www.rightnowai.co/downloads
- Forge: https://www.rightnowai.co/forge
- Forge Docs: https://www.rightnowai.co/docs/forge

## Pricing

### RightNow Code Editor

- Free: Unlimited profiling and benchmarking
- Pro: $20/month, GPU emulator, multi-GPU comparison, 1000 AI credits

### Forge CLI

- Free: 10 generations per day
- Pro: $100/month, unlimited generations, parallel swarms, priority support

## Contact & Support

- Website: https://www.rightnowai.co
- Contact: https://www.rightnowai.co/contact
- Discord: https://discord.gg/sSJqgNnq6X
- Email: jaber@rightnowai.co
- Privacy Policy: https://www.rightnowai.co/privacy-policy
- Terms of Use: https://www.rightnowai.co/terms-of-use

## Features

### RightNow Code Editor Features

- Real-time GPU profiling with Nsight Compute
- AI-powered CUDA code completion
- Hardware-aware optimization suggestions
- Local LLM support (Ollama, vLLM)
- BYOK (Bring Your Own Key) for 15+ providers
- Multi-GPU architecture support
- Offline development capability

### Forge Features

- 32 parallel Coder+Judge agent pairs
- MAP-Elites evolutionary optimization
- CUTLASS and Triton pattern RAG (1,711 CUTLASS patterns, 113 Triton patterns)
- Tiered evaluation pipeline: Dedup -> Compile -> Test -> Benchmark
- Support for H100, B200, A100, and RTX 40/30 series GPUs
- Automatic Tensor Core optimization (WMMA, TMA for Hopper)

## System Requirements

- NVIDIA GPU with CUDA capability
- CUDA Toolkit 11.0+
- Windows 10/11, macOS 10.15+, or Linux
- 8GB RAM minimum, 16GB recommended

## Comparisons

### Forge vs torch.compile

Forge outperforms `torch.compile(mode='max-autotune')` by 1.2x to 5x depending on the model. Forge uses multi-agent AI with evolutionary optimization, while torch.compile uses static compilation.

### Forge vs Manual CUDA Optimization

Forge automates CUDA kernel generation that would take experts days or weeks. It achieves a 97.6% correctness rate and automatically applies optimizations such as memory coalescing, Tensor Core utilization, and kernel fusion.
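The speedup multipliers quoted throughout this document are simply the torch.compile latency divided by the Forge latency from the H100 benchmark table. A minimal Python sanity check, using only the numbers given above:

```python
# Latency pairs from the Forge H100 benchmark table: (forge_ms, torch_compile_ms).
benchmarks = {
    "Llama-3.1-8B": (8.2, 42.3),
    "Qwen2.5-7B": (9.1, 38.5),
    "Mistral-7B": (10.4, 35.2),
    "SDXL UNet": (31.2, 89.4),
    "Whisper-large": (19.8, 52.1),
}

# Speedup = baseline latency / optimized latency, rounded to one decimal place.
speedups = {
    model: round(torch_ms / forge_ms, 1)
    for model, (forge_ms, torch_ms) in benchmarks.items()
}

for model, s in speedups.items():
    print(f"{model}: {s}x")
# Reproduces the table's multipliers: 5.2x, 4.2x, 3.4x, 2.9x, 2.6x
```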
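Kernel fusion, one of the optimizations listed above, merges several elementwise passes over the same data into a single pass, eliminating intermediate buffers and extra kernel launches. A toy CPU-side Python sketch of the idea (illustrative only, not Forge output; function names are made up for the example):

```python
def scale_then_shift_unfused(xs):
    # Two separate passes: on a GPU each would be its own kernel launch,
    # with the intermediate list written to and re-read from memory.
    scaled = [x * 2.0 for x in xs]
    return [s + 3.0 for s in scaled]

def scale_then_shift_fused(xs):
    # One fused pass: a single kernel launch, no intermediate buffer.
    return [x * 2.0 + 3.0 for x in xs]

data = [1.0, 2.0, 3.0]
# Same result, half the passes over memory.
assert scale_then_shift_unfused(data) == scale_then_shift_fused(data)
```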