Forge: Swarm Agents That Turn Slow PyTorch Into Fast CUDA/Triton Kernels
Abstract
Forge transforms PyTorch models into production-grade CUDA/Triton kernels through automated multi-agent optimization. Using 32 parallel AI agents with inference-time scaling, it achieves up to 14× faster inference than torch.compile(mode='max-autotune-no-cudagraphs') while maintaining 100% numerical correctness.
1. Introduction
PyTorch models leave GPU performance on the table. Manual CUDA optimization takes weeks per layer and requires rare expertise. Forge automates this entirely.
32 parallel Coder+Judge agents compete to discover optimal kernels — exploring tensor core utilization, memory coalescing, shared memory tiling, and kernel fusion simultaneously.
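The Coder+Judge competition described above can be sketched as a parallel propose-and-score loop. The sketch below is illustrative only, not Forge's actual implementation: the names (`coder`, `judge`, `search`) and the toy scoring rule are assumptions, with each "coder" proposing a set of optimization techniques and a "judge" ranking the candidates.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of a 32-agent Coder+Judge search; names and the
# scoring rule are illustrative, not Forge's real API.
N_AGENTS = 32
SEARCH_SPACE = ["tensor_cores", "coalesced_loads", "smem_tiling", "fusion"]

def coder(agent_id: int, rng: random.Random) -> dict:
    """Propose a candidate kernel: a subset of optimization techniques."""
    techniques = [t for t in SEARCH_SPACE if rng.random() < 0.5]
    return {"agent": agent_id, "techniques": techniques}

def judge(candidate: dict) -> float:
    """Score a candidate. In the real system this would compile,
    benchmark, and correctness-check the kernel; here, more
    techniques simply means lower simulated latency."""
    base_latency_ms = 10.0
    return base_latency_ms / (1 + len(candidate["techniques"]))

def search(seed: int = 0) -> dict:
    rng = random.Random(seed)
    proposals = [coder(i, random.Random(rng.random())) for i in range(N_AGENTS)]
    with ThreadPoolExecutor(max_workers=8) as pool:  # judges run in parallel
        latencies = list(pool.map(judge, proposals))
    best_idx = min(range(N_AGENTS), key=latencies.__getitem__)
    return {**proposals[best_idx], "latency_ms": latencies[best_idx]}

best = search()
```

Because the agents explore independently and only the judge's score decides the winner, adding agents widens the search without changing the selection logic.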
Powered by NVIDIA Nemotron 3 Nano 30B at 250k tokens/sec, the search completes in minutes.
Drop in any PyTorch model. Every layer gets optimized. The output is a drop-in replacement — same API, faster inference.
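A drop-in replacement is only safe if each optimized kernel reproduces the reference layer's output within tolerance. A minimal sketch of such a check, in pure Python for clarity (the `allclose` helper is hypothetical but follows the same contract as `torch.allclose`: |ref − opt| ≤ atol + rtol·|opt|):

```python
def allclose(ref, opt, rtol=1e-3, atol=1e-5):
    """Element-wise closeness check with the torch.allclose contract:
    accept when |ref - opt| <= atol + rtol * |opt| for every element."""
    if len(ref) != len(opt):
        return False
    return all(abs(r - o) <= atol + rtol * abs(o) for r, o in zip(ref, opt))

# Reference layer output vs. two candidate kernel outputs (toy numbers).
reference = [1.0000, -2.5000, 0.3333]
candidate_ok = [1.0001, -2.5001, 0.3333]   # within tolerance -> accept
candidate_bad = [1.1000, -2.5000, 0.3333]  # 10% off -> reject

print(allclose(reference, candidate_ok))   # True
print(allclose(reference, candidate_bad))  # False
```

In practice the comparison would run on real tensors over representative inputs, but the acceptance rule is the same: a candidate kernel that fails the tolerance check is discarded regardless of its speed.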
4. Pricing
Agent Credits
100% refund guarantee if we don't beat torch.compile
- ✓ Runs on B200, H100, and H200 datacenter GPUs
- ✓ 250k tokens/sec inference — results in minutes, not hours
- ✓ 32 agents search in parallel — finds optimizations humans miss
- ✓ Retrieves from a kernel database — starts from proven patterns
- ✓ Any PyTorch model in, optimized kernels out — drop-in replacement