Learn how to get the best results from Forge CLI.

Forge uses a multi-stage optimization process:

- Best for: Quick experiments, initial testing, most use cases.
- Best for: Production optimization, good balance.
- Best for: Final production kernels, maximum speedup.

Always start with `--turbo` to check quickly whether optimization is possible. If it already gives a good speedup (more than 2x), you're done.
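
The turbo-first rule can be checked with a small timing harness. The sketch below is illustrative only and is not part of Forge: `median_runtime`, `compute_speedup`, and `turbo_is_enough` are hypothetical helper names, and the 2x threshold is taken from the rule of thumb above. For GPU kernels, the callable you pass in should synchronize the device before returning, otherwise the measurement only covers the asynchronous launch.

```python
import statistics
import time
from typing import Callable


def median_runtime(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock time of fn() in seconds over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compute_speedup(baseline_s: float, optimized_s: float) -> float:
    """Speedup is the baseline time divided by the optimized time."""
    if baseline_s <= 0 or optimized_s <= 0:
        raise ValueError("timings must be positive")
    return baseline_s / optimized_s


def turbo_is_enough(speedup: float, threshold: float = 2.0) -> bool:
    """Apply the '>2x' rule of thumb (strictly greater than the threshold)."""
    return speedup > threshold
```

A typical use would be to time the original and the optimized kernel with `median_runtime`, pass both results to `compute_speedup`, and only move on to a slower mode when `turbo_is_enough` returns `False`.
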

Match the target GPU to your deployment environment:

Triton kernels are easier to integrate and maintain. Use CUDA only when you need maximum performance.

Typical achievable speedups:

| Operation Type | Typical Speedup |
|---|---|
| Simple elementwise | 1.2x - 2x |
| Matrix operations | 1.5x - 3x |
| Convolutions | 1.5x - 4x |
| Fused operations | 2x - 5x |
| Attention layers | 1.5x - 3x |
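
To sanity-check a measured result against the table above, the ranges can be encoded as a lookup. This is a minimal sketch: the dictionary simply mirrors the table, while `compare_to_typical` and its return strings are hypothetical names chosen for illustration, not anything Forge provides.

```python
# Typical speedup ranges (low, high), copied from the table above.
TYPICAL_SPEEDUP = {
    "Simple elementwise": (1.2, 2.0),
    "Matrix operations": (1.5, 3.0),
    "Convolutions": (1.5, 4.0),
    "Fused operations": (2.0, 5.0),
    "Attention layers": (1.5, 3.0),
}


def compare_to_typical(operation_type: str, speedup: float) -> str:
    """Say whether a measured speedup is below, within, or above the typical range."""
    low, high = TYPICAL_SPEEDUP[operation_type]
    if speedup < low:
        return "below typical"
    if speedup > high:
        return "above typical"
    return "within typical"
```

A result that lands below the typical range for its operation type may be a sign that a slower, more thorough optimization run is worth trying.
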

When you want the absolute best kernel, disable early stopping:

| Layer | Description | Impact |
|---|---|---|
| attention | Self-attention mechanism | High (most compute) |
| mlp | Feed-forward layers | Medium |

Best Practices:

- Start with attention layers (most compute).
- Use turbo mode first.
- Target popular models for better RAG pattern matching.

| Level | Complexity | Examples |
|---|---|---|
| 1 | Simple | Elementwise, reductions |
| 2 | Medium | Matrix operations, convolutions |
| 3 | Hard | Fused operations, custom patterns |
| 4 | Expert | Complex multi-stage kernels |

| Speedup | Rating |
|---|---|
| 1.0x - 1.2x | Minimal |
| 1.2x - 1.5x | Moderate |
| 1.5x - 2.0x | Good |
| 2.0x - 3.0x | Very good |
| 3.0x+ | Excellent |
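
The rating table above can be applied programmatically. The sketch below makes two assumptions that the table does not state: a value that sits exactly on a boundary (for example 1.2x) is given the higher rating, and anything under 1.0x is labeled "No speedup". The function name `rate_speedup` is hypothetical.

```python
from bisect import bisect_right

# Lower bounds of each rating band, taken from the table above.
_BOUNDS = [1.0, 1.2, 1.5, 2.0, 3.0]
# One more label than bounds: the first covers values below 1.0x (an assumption).
_LABELS = ["No speedup", "Minimal", "Moderate", "Good", "Very good", "Excellent"]


def rate_speedup(speedup: float) -> str:
    """Map a speedup factor to its rating; boundary values take the higher rating."""
    return _LABELS[bisect_right(_BOUNDS, speedup)]
```

For example, `rate_speedup(1.75)` falls in the 1.5x - 2.0x band and is rated "Good".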