RightNow AI is a research lab and software company working on GPU programming tools, CUDA development workflows, model-hardware co-design, and inference infrastructure.

Which NVIDIA GPUs are supported by RightNow AI?

RightNow AI supports all NVIDIA GPUs with CUDA Toolkit 11.0-12.5, including GeForce RTX 40/30/20 series, GTX 16/10 series, Quadro RTX, Tesla, A100, and H100.

How much does RightNow AI cost?

RightNow AI is free to use with unlimited profiling and benchmarking. RightNow Pro costs $29 per month and adds GPU emulator access (50+ GPUs), multi-GPU comparison, and 1,000 AI credits per month.

What CUDA development workflow does RightNow AI support?

RightNow AI supports CUDA development workflows that combine editing, profiling, emulation, remote GPU execution, and benchmarked performance analysis.

Can I use RightNow AI on macOS?

Yes, RightNow AI is fully available on macOS (Apple Silicon and Intel). Mac users can use remote GPUs for free or our built-in GPU emulator for CUDA profiling.

Optimization Guide

Learn how to get the best results from Forge CLI.

How Forge Works

Forge uses a multi-stage optimization process:

1. Analysis: parse your PyTorch code and identify patterns.
2. Baseline: measure the original performance.
3. Generation: AI agents propose optimizations.
4. Evolution: test the candidates and keep the best performers.
5. Result: return the fastest kernel along with its speedup.
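To make stages 3 to 5 concrete, here is a toy evolutionary search loop. It is a sketch under stated assumptions, not Forge's implementation: the `evolve` function, the `keep` parameter, and the toy cost and mutation functions are all hypothetical. The cost function is a made-up stand-in for a measured kernel runtime.

```python
import random


def evolve(evaluate, mutate, seed_candidate, population=16, generations=10,
           keep=4, rng=None):
    """Toy evolutionary search; lower evaluate() is better (e.g. runtime in ms)."""
    rng = rng or random.Random(0)
    # Generation 0: the seed plus mutated variants of it.
    pool = [seed_candidate] + [mutate(seed_candidate, rng)
                               for _ in range(population - 1)]
    history = []
    for _ in range(generations):
        # Test candidates and keep the best performers.
        pool.sort(key=evaluate)
        survivors = pool[:keep]
        history.append(evaluate(survivors[0]))
        # Refill the population by mutating randomly chosen survivors.
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(population - keep)]
        pool = survivors + children
    best = min(pool, key=evaluate)
    return best, history


def toy_cost(tile):
    # Hypothetical "runtime in ms" with its minimum at tile size 96.
    return 1.0 + abs(tile - 96) / 32


def toy_mutate(tile, rng):
    # Nudge the tile size up or down, never below 8.
    return max(8, tile + rng.choice([-16, -8, 8, 16]))


best, history = evolve(toy_cost, toy_mutate, seed_candidate=32)
print("best tile:", best, "best cost per generation:", history)
```

Because the survivors are carried into the next generation unchanged, the best cost per generation can only stay flat or improve.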

Choosing the Right Mode

Turbo Mode (recommended)

forge optimize --turbo

Population: 16

Generations: 10

Time: ~2-5 min

Cost: 1 credit

Best for: Quick experiments, initial testing, most use cases.

Balanced Mode

forge optimize

Population: 32

Generations: 20

Time: ~10-20 min

Cost: 1 credit

Best for: Production optimization, with a good balance between run time and speedup.

Quality Mode

forge optimize --quality

Population: 64

Generations: 50

Time: ~30-60 min

Cost: 2 credits

Best for: Final production kernels, maximum speedup.
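One way to compare the three modes is by search budget. The sketch below assumes that every generation evaluates a full population, so population × generations is an upper bound on candidates tested. Forge may evaluate fewer, for example when early stopping triggers. The `Mode` class and `max_evaluations` property are illustrative names, not part of Forge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    flag: str         # flag passed to `forge optimize` ("" means the default mode)
    population: int
    generations: int
    credits: int

    @property
    def max_evaluations(self):
        # Upper bound, assuming a full population is evaluated every generation.
        return self.population * self.generations


MODES = {
    "turbo": Mode("--turbo", 16, 10, 1),
    "balanced": Mode("", 32, 20, 1),
    "quality": Mode("--quality", 64, 50, 2),
}

for name, mode in MODES.items():
    print(f"{name:9s} up to {mode.max_evaluations:4d} evaluations, "
          f"{mode.credits} credit(s)")
```

Under this assumption, quality mode has up to 20 times the search budget of turbo mode (3,200 versus 160 evaluations), which fits its longer run time.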

Optimization Tips

Start with Turbo Mode

Always start with --turbo to quickly see if optimization is possible. If you get good results (>2x), you're done.

forge optimize --task 25 --turbo

Choose the Right GPU Target

Match the target GPU to your deployment environment:

# High-end (recommended)

forge optimize --gpu H100

forge optimize --gpu B200

# Mid-tier

forge optimize --gpu L40S

forge optimize --gpu A10

# Budget

forge optimize --gpu T4

Use Triton for Easy Integration

Triton kernels are easier to integrate and maintain. Use CUDA only when you need maximum performance.

forge optimize --format triton

Set Realistic Targets

Typical achievable speedups:

Operation Type	Typical Speedup
Simple elementwise	1.2x - 2x
Matrix operations	1.5x - 3x
Convolutions	1.5x - 4x
Fused operations	2x - 5x
Attention layers	1.5x - 3x

Disable Early Stop for Best Results

When you want the absolute best kernel, disable early stopping:

forge optimize --task 25 --no-early-stop
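The sketch below shows why early stopping can miss a late improvement. It uses a generic patience-based rule. Forge's actual stopping criterion is not documented here, so the `patience` and `min_gain` parameters and the sample runtimes are assumptions for illustration only.

```python
def generations_until_stop(best_per_generation, patience=3, min_gain=0.01):
    """Return how many generations run before a patience-based early stop.

    best_per_generation: best runtime (ms) found in each generation; lower is better.
    The search stops after `patience` generations in a row without at least a
    `min_gain` relative improvement.
    """
    best = float("inf")
    stale = 0
    for gen, runtime in enumerate(best_per_generation, start=1):
        if runtime < best * (1 - min_gain):
            best = runtime
            stale = 0
        else:
            stale += 1
        if stale >= patience:
            return gen
    return len(best_per_generation)


# Hypothetical run: a plateau at 5.0 ms, then a late improvement to 4.0 ms.
runtimes = [8.0, 6.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0]

stopped_at = generations_until_stop(runtimes)               # early stop on
ran_all = generations_until_stop(runtimes, patience=10**9)  # effectively off
print(stopped_at, ran_all)
```

With early stopping on, this made-up run ends during the plateau and never reaches the 4.0 ms candidate. That is the trade-off `--no-early-stop` accepts in exchange for a longer run.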

HuggingFace Optimization

# Optimize all supported layers

forge optimize --huggingface meta-llama/Llama-3-8B

# Optimize specific layer types

forge optimize --huggingface Qwen/Qwen2-0.5B --layers attention

forge optimize --huggingface Qwen/Qwen2-0.5B --layers mlp

forge optimize --huggingface Qwen/Qwen2-0.5B --layers attention,mlp

Layer Types

Layer	Description	Impact
attention	Self-attention mechanism	High (most compute)
mlp	Feed-forward layers	Medium

Best Practices: Start with attention layers (most compute), use turbo mode first, target popular models for better RAG pattern matching.
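The advice to start with the layers that take the most compute follows from Amdahl's law: a kernel speedup only helps in proportion to that layer's share of total runtime. The time shares below are made-up numbers for illustration, so measure your own model with a profiler. `end_to_end_speedup` is a hypothetical helper, not a Forge function.

```python
def end_to_end_speedup(fraction, kernel_speedup):
    """Amdahl's law: overall speedup when `fraction` of the runtime
    becomes `kernel_speedup` times faster and the rest is unchanged."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


# Hypothetical shares: attention takes 60% of runtime, MLP takes 30%.
attention_only = end_to_end_speedup(0.6, 2.0)
mlp_only = end_to_end_speedup(0.3, 2.0)

print(f"2x faster attention: {attention_only:.2f}x end to end")
print(f"2x faster MLP:       {mlp_only:.2f}x end to end")
```

With these assumed shares, the same 2x kernel speedup yields a larger end-to-end gain on the layer that dominates the runtime.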

KernelBench Tasks

Task Levels

Level	Complexity	Examples
1	Simple	Elementwise, reductions
2	Medium	Matrix operations, convolutions
3	Hard	Fused operations, custom patterns
4	Expert	Complex multi-stage kernels

# Browse interactively

forge browse

# Search by name

forge browse --search "matmul"

# Filter by level

forge browse --level 2

Understanding Results

✓ Optimization complete: 2.45x speedup

Baseline: 8.55ms (torch.compile)

Optimized: 3.49ms

Improvement: 5.06ms faster
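The numbers in this sample output relate as follows: speedup is the baseline time divided by the optimized time, and improvement is their difference. A short check using the values above (the variable names are illustrative):

```python
baseline_ms = 8.55   # torch.compile baseline from the sample output
optimized_ms = 3.49  # optimized kernel from the sample output

speedup = baseline_ms / optimized_ms            # ratio of the two runtimes
saved_ms = baseline_ms - optimized_ms           # absolute time saved per call
reduction_pct = 100 * saved_ms / baseline_ms    # share of runtime removed

print(f"{speedup:.2f}x speedup, {saved_ms:.2f}ms faster, "
      f"{reduction_pct:.0f}% less time")
```

The speedup is a ratio of runtimes, so the reported 2.45x means the optimized kernel takes about 41% of the baseline time.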

What's a Good Speedup?

Speedup	Rating
1.0x - 1.2x	Minimal
1.2x - 1.5x	Moderate
1.5x - 2.0x	Good
2.0x - 3.0x	Very good
3.0x+	Excellent
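If you post-process results in a script, the table above can be encoded as a small lookup. This is a sketch with two assumptions the table does not state: each range includes its lower bound, and anything below 1.0x is labeled "Slower than baseline". `rate_speedup` is a hypothetical helper, not a Forge function.

```python
from bisect import bisect_right

# Upper edges of the first four ranges in the table; 3.0x and above is the last bucket.
THRESHOLDS = [1.2, 1.5, 2.0, 3.0]
RATINGS = ["Minimal", "Moderate", "Good", "Very good", "Excellent"]


def rate_speedup(speedup):
    """Map a speedup factor to the rating from the table."""
    if speedup < 1.0:
        # Not covered by the table; this label is an assumption.
        return "Slower than baseline"
    return RATINGS[bisect_right(THRESHOLDS, speedup)]


for value in (1.1, 1.3, 1.8, 2.45, 3.5):
    print(f"{value}x -> {rate_speedup(value)}")
```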

Complete Workflow Example

# 1. Check your credits

forge credits

# 2. Browse available tasks

forge browse --level 2

# 3. Quick test with turbo mode

forge optimize --task 25 --turbo

# 4. If the results are good, save the kernel

forge session export latest -o kernel.py

# 5. If you need more optimization, use quality mode

forge optimize --task 25 --quality --no-early-stop

# 6. Export final result

forge session export latest -o final_kernel.py
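To drive this workflow from a script, you could assemble the command lines in Python. The sketch below only builds argument lists from flags that appear in this guide and does not run anything. `optimize_cmd` is a hypothetical helper. Apart from the combinations shown above, whether these flags can be freely combined (for example `--gpu` with `--task`) is an assumption to check against the Commands Reference.

```python
import shlex


def optimize_cmd(task=None, mode="balanced", gpu=None, fmt=None, early_stop=True):
    """Build a `forge optimize` argument list from flags shown in this guide."""
    cmd = ["forge", "optimize"]
    if task is not None:
        cmd += ["--task", str(task)]
    if mode == "turbo":
        cmd.append("--turbo")
    elif mode == "quality":
        cmd.append("--quality")
    elif mode != "balanced":  # balanced is the default and needs no flag
        raise ValueError(f"unknown mode: {mode!r}")
    if gpu:
        cmd += ["--gpu", gpu]
    if fmt:
        cmd += ["--format", fmt]
    if not early_stop:
        cmd.append("--no-early-stop")
    return cmd


quick = optimize_cmd(task=25, mode="turbo")
final = optimize_cmd(task=25, mode="quality", early_stop=False)
print(shlex.join(quick))
print(shlex.join(final))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the Forge CLI is installed and authenticated.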

Next Steps

Commands Reference

All available commands.

Troubleshooting

Common issues and solutions.
