View and analyze GPU assembly code (PTX and SASS) with source mapping, instruction-level profiling, and optimization insights.
PTX (Parallel Thread Execution) is the intermediate virtual assembly language generated by NVCC from CUDA C++. PTX is architecture-independent and is compiled to native code for the target GPU by the NVIDIA driver at runtime or ahead of time by ptxas.
SASS is the native GPU machine code that actually executes on NVIDIA hardware. SASS is architecture-specific and represents the final, fully scheduled instructions.
RightNow AI generates and displays PTX assembly for any CUDA kernel, with source code mapping and instruction analysis.
Generate PTX with the nvcc --ptx flag. RightNow AI provides insight into PTX instruction characteristics:
// Example PTX output for vector add kernel
.visible .entry vectorAdd(
.param .u64 vectorAdd_param_0, // Output array C
.param .u64 vectorAdd_param_1, // Input array A
.param .u64 vectorAdd_param_2, // Input array B
.param .u32 vectorAdd_param_3 // Array size N
)
{
.reg .pred %p<2>;
.reg .f32 %f<4>;
.reg .b32 %r<8>;
.reg .b64 %rd<11>;
ld.param.u64 %rd1, [vectorAdd_param_0];
ld.param.u64 %rd2, [vectorAdd_param_1];
ld.param.u64 %rd3, [vectorAdd_param_2];
ld.param.u32 %r2, [vectorAdd_param_3];
mov.u32 %r3, %ctaid.x; // blockIdx.x
mov.u32 %r4, %ntid.x; // blockDim.x
mov.u32 %r5, %tid.x; // threadIdx.x
mad.lo.s32 %r1, %r3, %r4, %r5; // idx = blockIdx.x * blockDim.x + threadIdx.x
setp.ge.s32 %p1, %r1, %r2; // if (idx >= N) return;
@%p1 bra LBB0_2;
mul.wide.s32 %rd4, %r1, 4; // Offset calculation
add.s64 %rd5, %rd2, %rd4; // A + offset
ld.global.f32 %f1, [%rd5]; // Load A[idx]
add.s64 %rd7, %rd3, %rd4; // B + offset
ld.global.f32 %f2, [%rd7]; // Load B[idx]
add.f32 %f3, %f1, %f2; // result = A[idx] + B[idx]
add.s64 %rd10, %rd1, %rd4; // C + offset
st.global.f32 [%rd10], %f3; // Store C[idx]
LBB0_2:
ret;
}
View native GPU assembly code to understand actual instruction execution and identify micro-architectural optimization opportunities.
Extract SASS with cuobjdump --dump-sass. Common SASS instructions include:
- FFMA: Fused multiply-add (float)
- DFMA: Fused multiply-add (double)
- IMAD: Integer multiply-add
- MUFU: Multi-function unit (sin, cos, sqrt)
- LDG: Load from global memory
- STG: Store to global memory
- LDS: Load from shared memory
- STS: Store to shared memory
- BRA: Branch instruction
- SSY: Set synchronization point
- SYNC: Synchronization barrier
- EXIT: Thread exit
- HMMA: Tensor core matrix operations (Volta+)
- LDGSTS: Async global-to-shared copy (Ampere+)
- LDSM: Shared memory matrix load
RightNow AI maps PTX and SASS instructions back to the original CUDA source lines using debug information and line directives.
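A quick way to see the instruction mix of a kernel is to tally opcodes in a cuobjdump listing. The helper below is a hypothetical sketch (not part of RightNow AI), and the sample SASS text is illustrative:

```python
import re
from collections import Counter

def sass_instruction_mix(sass_text):
    """Count opcodes in a cuobjdump-style SASS listing.

    Assumes each instruction line looks like:
        /*0028*/  FFMA R4, R2, R3, R4 ;
    optionally guarded by a predicate such as '@P0'. Modifiers like
    '.E' are stripped so instructions group by base opcode.
    """
    counts = Counter()
    for line in sass_text.splitlines():
        m = re.search(r"/\*[0-9a-f]+\*/\s+(?:@!?\w+\s+)?([A-Z][A-Z0-9.]*)", line)
        if m:
            counts[m.group(1).split(".")[0]] += 1
    return counts

sample = """
/*0010*/  LDG.E R2, [R4] ;
/*0018*/  LDG.E R3, [R6] ;
/*0020*/  FFMA R5, R2, R3, R5 ;
/*0028*/  STG.E [R8], R5 ;
/*0030*/  EXIT ;
"""
print(sass_instruction_mix(sample))
```

For a memory-bound kernel like vector add, a mix dominated by LDG/STG with few arithmetic instructions confirms that optimizing memory access, not math, is what matters.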
Compile with the -lineinfo flag to embed source line information in PTX/SASS. Click between the source and assembly views to understand compiler transformations:
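With -lineinfo, PTX carries .file and .loc directives that tie instructions to source positions. The sketch below shows one way such a mapping could be recovered; it is a simplified, hypothetical parser (it assumes .file directives appear before the .loc directives that reference them):

```python
import re

def ptx_line_map(ptx_text):
    """Map PTX instructions to source lines using .loc directives.

    Assumes -lineinfo output, where '.file N "path"' names source files
    and '.loc N line col' marks the source position of what follows.
    """
    files, mapping, current = {}, [], None
    for line in ptx_text.splitlines():
        line = line.strip()
        f = re.match(r'\.file\s+(\d+)\s+"([^"]+)"', line)
        if f:
            files[f.group(1)] = f.group(2)
            continue
        loc = re.match(r"\.loc\s+(\d+)\s+(\d+)\s+\d+", line)
        if loc:
            current = (files.get(loc.group(1), "?"), int(loc.group(2)))
            continue
        # Anything that is not a directive, comment, or brace is an instruction.
        if line and not line.startswith((".", "//", "{", "}")) and current:
            mapping.append((line, current))
    return mapping

sample = '''
.file 1 "vectorAdd.cu"
.loc 1 4 5
mov.u32 %r3, %ctaid.x;
.loc 1 6 9
ld.global.f32 %f1, [%rd5];
'''
for instr, (fname, lineno) in ptx_line_map(sample):
    print(f"{fname}:{lineno}  {instr}")
```

This is the same kind of bookkeeping a source-to-assembly view performs when you click between panes.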
Analyzing PTX and SASS reveals micro-architectural optimization opportunities not visible at the source level.
PTX shows declared virtual register usage: inspect the .reg declarations to gauge register pressure; register spills to local memory appear as ld.local and st.local instructions.
SASS reveals instruction scheduling and latency:
SASS memory instructions show access patterns:
SASS control flow shows divergence patterns:
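The PTX register and spill checks described above can be sketched as a short script over a PTX dump. This is a hypothetical helper, not a RightNow AI API; note that a declaration like '.reg .f32 %f<4>;' declares four virtual f32 registers:

```python
import re

def ptx_register_report(ptx_text):
    """Summarize declared virtual registers and spill traffic in PTX.

    Counts .reg declarations by type and ld.local/st.local instructions,
    which indicate register spills to local memory.
    """
    regs = {}
    for m in re.finditer(r"\.reg\s+\.(\w+)\s+%\w+<(\d+)>", ptx_text):
        regs[m.group(1)] = regs.get(m.group(1), 0) + int(m.group(2))
    spills = len(re.findall(r"\b(?:ld|st)\.local\b", ptx_text))
    return regs, spills

ptx = """
.reg .pred %p<2>;
.reg .f32 %f<4>;
.reg .b32 %r<8>;
.reg .b64 %rd<11>;
ld.local.f32 %f5, [%rd12];
st.local.f32 [%rd12], %f5;
"""
regs, spills = ptx_register_report(ptx)
print(regs)    # {'pred': 2, 'f32': 4, 'b32': 8, 'b64': 11}
print(spills)  # 2
```

Keep in mind that PTX registers are virtual; the final physical register count comes from ptxas (visible via nvcc --ptxas-options=-v) and in the SASS.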
RightNow AI understands PTX and SASS, providing intelligent optimization suggestions based on assembly-level analysis.
Ask the AI to analyze assembly code and suggest optimizations:
You: "Analyze the PTX for matmul_kernel and suggest register optimizations"
RightNow AI: "I've analyzed the PTX assembly. Current register usage is 64 registers per thread.
Key findings:
- Register allocation: 64 per thread (limiting occupancy to 50%)
- Register spills detected: 8 ld.local + 6 st.local instructions
- Shared memory usage: 49152 bytes (100% of the default 48KB static limit)
Recommendations:
1. Reduce register usage by recomputing some values instead of storing
2. Use smaller data types where precision allows (float → half precision)
3. Increase thread block size to amortize shared memory overhead
4. Consider loop unrolling factor of 4 instead of 8 to reduce register pressure
Would you like me to generate optimized code implementing these changes?"
Learn more: See Real-Time Profiling to measure assembly-level performance impact and Agentic AI Optimization for iterative assembly-aware optimization.
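The 50% occupancy figure in the example transcript can be reproduced with a back-of-the-envelope calculation. The limits below (64K 32-bit registers and 2048 resident threads per SM) are assumptions matching an A100-class SM; real hardware allocates registers per warp with rounding rules that this simplified model ignores:

```python
def register_limited_occupancy(regs_per_thread,
                               regs_per_sm=65536,
                               max_threads_per_sm=2048,
                               block_size=256):
    """Estimate occupancy limited only by register usage.

    Simplified model: ignores shared memory, block-slot limits, and
    per-warp register allocation granularity.
    """
    threads_by_regs = regs_per_sm // regs_per_thread
    # Only whole thread blocks can be resident.
    blocks = threads_by_regs // block_size
    resident_threads = min(blocks * block_size, max_threads_per_sm)
    return resident_threads / max_threads_per_sm

print(f"{register_limited_occupancy(64):.0%}")   # 50%
print(f"{register_limited_occupancy(32):.0%}")   # 100%
```

Halving register usage from 64 to 32 per thread doubles the resident thread count, which is why register-pressure reductions are the first recommendation above.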