Forge
New State of Agentic AI
A Swarm Agent System for Automated GPU Kernel Optimization
Abstract
Forge is a CLI-based swarm agent system (think Claude Code, but for kernels). It automatically generates optimized GPU kernels from any PyTorch model or HuggingFace model ID, achieving up to 5× faster inference than torch.compile(mode='max-autotune') with 97.6% correctness.
1. What is Forge?
Forge is a swarm-based kernel optimizer that accelerates GPU inference for any model. Enter a HuggingFace model ID and Forge automatically generates optimized CUDA/Triton kernels for every layer.
The system runs 32 parallel Coder+Judge agent pairs that compete to find the fastest implementation of each kernel. Each pair explores optimization strategies including tensor core utilization, memory coalescing, and kernel fusion. This search achieves up to 5× speedup over torch.compile(mode='max-autotune') with 97.6% correctness.
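The Coder+Judge competition described above can be sketched as a simple parallel search loop. This is an illustrative toy, not Forge's actual implementation: the function names (`propose_kernel`, `judge`, `swarm_search`) are hypothetical, and real candidate kernels would be compiled and benchmarked on a GPU rather than simulated with random latencies.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def propose_kernel(agent_id, rng):
    """Coder agent: propose a candidate kernel (latency simulated here)."""
    strategy = rng.choice(["tensor-core", "coalescing", "fusion"])
    return {"agent": agent_id, "strategy": strategy,
            "latency_ms": rng.uniform(0.5, 2.0)}

def judge(candidate, rng):
    """Judge agent: verify numerical correctness before a candidate
    may compete. The 0.976 pass rate mirrors the quoted 97.6% figure."""
    candidate["correct"] = rng.random() < 0.976
    return candidate

def swarm_search(n_agents=32, seed=0):
    """Run n_agents Coder+Judge pairs in parallel; keep the fastest
    candidate that passed the correctness check."""
    def run_pair(i):
        pair_rng = random.Random(seed + i)  # independent stream per pair
        return judge(propose_kernel(i, pair_rng), pair_rng)

    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        candidates = list(pool.map(run_pair, range(n_agents)))

    valid = [c for c in candidates if c["correct"]]
    return min(valid, key=lambda c: c["latency_ms"]) if valid else None

best = swarm_search(n_agents=32, seed=0)
```

The key design point is that correctness checking gates the speed competition: a fast-but-wrong kernel never wins, which is how a large swarm can search aggressively without sacrificing accuracy.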
Forge uses inference-time scaling powered by a fine-tuned and optimized NVIDIA Nemotron 3 Nano 30B model generating 250k tokens/second. This enables deep exploration of the optimization space in minutes instead of hours.
2. How It Works
3. Results
Benchmark results comparing Forge against torch.compile on NVIDIA H100.
4. Demo
Interactive mock demo of the Forge CLI.
5. Pricing
Agent Credits
Credit Refund if We Don't Beat torch.compile(mode='max-autotune')
- ✓ Inference-Time Scaling with NVIDIA Nemotron 3 Nano 30B (250k tokens/sec)
- ✓ 32 Parallel Swarm Agents with Coder+Judge Pattern
- ✓ Advanced Kernel Database Retrieval
- ✓ Outperforms torch.compile(mode='max-autotune')
- ✓ Any HuggingFace Model ID → All Layers Optimized
- ✓ Datacenter GPU Access (B200, H100, H200)