View and analyze GPU assembly code (PTX and SASS) with source mapping, instruction-level profiling, and optimization insights.
PTX (Parallel Thread Execution) is the intermediate virtual assembly language generated by NVCC from CUDA C++. PTX is architecture-independent and is compiled to native code for the target GPU by the NVIDIA driver at runtime or ahead of time by ptxas.
SASS is the native GPU machine code that actually executes on NVIDIA hardware. SASS is architecture-specific and represents the final, fully scheduled instructions.
RightNow AI generates and displays PTX assembly for any CUDA kernel, with source code mapping and instruction analysis.
Generate PTX with the nvcc --ptx flag. RightNow AI provides insight into PTX instruction characteristics:
// Example PTX output for vector add kernel
.visible .entry vectorAdd(
.param .u64 vectorAdd_param_0, // Output array C
.param .u64 vectorAdd_param_1, // Input array A
.param .u64 vectorAdd_param_2, // Input array B
.param .u32 vectorAdd_param_3 // Array size N
)
{
.reg .pred %p<2>;
.reg .f32 %f<4>;
.reg .b32 %r<8>;
.reg .b64 %rd<11>;
ld.param.u64 %rd1, [vectorAdd_param_0];
ld.param.u64 %rd2, [vectorAdd_param_1];
ld.param.u64 %rd3, [vectorAdd_param_2];
ld.param.u32 %r2, [vectorAdd_param_3];
mov.u32 %r3, %ctaid.x; // blockIdx.x
mov.u32 %r4, %ntid.x; // blockDim.x
mov.u32 %r5, %tid.x; // threadIdx.x
mad.lo.s32 %r1, %r3, %r4, %r5; // idx = blockIdx.x * blockDim.x + threadIdx.x
setp.ge.s32 %p1, %r1, %r2; // if (idx >= N) return;
@%p1 bra LBB0_2;
mul.wide.s32 %rd4, %r1, 4; // Offset calculation
add.s64 %rd5, %rd2, %rd4; // A + offset
ld.global.f32 %f1, [%rd5]; // Load A[idx]
add.s64 %rd7, %rd3, %rd4; // B + offset
ld.global.f32 %f2, [%rd7]; // Load B[idx]
add.f32 %f3, %f1, %f2; // result = A[idx] + B[idx]
add.s64 %rd10, %rd1, %rd4; // C + offset
st.global.f32 [%rd10], %f3; // Store C[idx]
LBB0_2:
ret;
}
View native GPU assembly code to understand actual instruction execution and identify micro-architectural optimization opportunities.
Extract SASS with cuobjdump --dump-sass. Common SASS instructions include:
- FFMA: Fused multiply-add (float)
- DFMA: Fused multiply-add (double)
- IMAD: Integer multiply-add
- MUFU: Multi-function unit (sin, cos, sqrt)
- LDG: Load from global memory
- STG: Store to global memory
- LDS: Load from shared memory
- STS: Store to shared memory
- BRA: Branch instruction
- SSY: Set synchronization point
- SYNC: Synchronization barrier
- EXIT: Thread exit
- HMMA: Tensor core matrix operations (Volta+)
- LDGSTS: Async global-to-shared copy (Ampere+)
- LDSM: Shared memory matrix load
RightNow AI maps PTX and SASS instructions back to the original CUDA source lines using debug information and line directives.
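A quick way to see the instruction mix of a kernel is to tally opcodes in a cuobjdump listing. The helper below is a hypothetical sketch (not part of RightNow AI), and the sample SASS text is illustrative:

```python
import re
from collections import Counter

def sass_instruction_mix(sass_text):
    """Count opcodes in a cuobjdump-style SASS listing.

    Assumes each instruction line looks like:
        /*0028*/  FFMA R4, R2, R3, R4 ;
    optionally guarded by a predicate such as '@P0'. Modifiers like
    '.E' are stripped so instructions group by base opcode.
    """
    counts = Counter()
    for line in sass_text.splitlines():
        m = re.search(r"/\*[0-9a-f]+\*/\s+(?:@!?\w+\s+)?([A-Z][A-Z0-9.]*)", line)
        if m:
            counts[m.group(1).split(".")[0]] += 1
    return counts

sample = """
/*0010*/  LDG.E R2, [R4] ;
/*0018*/  LDG.E R3, [R6] ;
/*0020*/  FFMA R5, R2, R3, R5 ;
/*0028*/  STG.E [R8], R5 ;
/*0030*/  EXIT ;
"""
print(sass_instruction_mix(sample))
```

For a memory-bound kernel like vector add, a mix dominated by LDG/STG with few arithmetic instructions confirms that optimizing memory access, not math, is what matters.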
Compile with the -lineinfo flag to embed source line information in PTX/SASS. Click between the source and assembly views to understand compiler transformations:
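With -lineinfo, PTX carries .file and .loc directives that tie instructions to source positions. The sketch below shows one way such a mapping could be recovered; it is a simplified, hypothetical parser (it assumes .file directives appear before the .loc directives that reference them):

```python
import re

def ptx_line_map(ptx_text):
    """Map PTX instructions to source lines using .loc directives.

    Assumes -lineinfo output, where '.file N "path"' names source files
    and '.loc N line col' marks the source position of what follows.
    """
    files, mapping, current = {}, [], None
    for line in ptx_text.splitlines():
        line = line.strip()
        f = re.match(r'\.file\s+(\d+)\s+"([^"]+)"', line)
        if f:
            files[f.group(1)] = f.group(2)
            continue
        loc = re.match(r"\.loc\s+(\d+)\s+(\d+)\s+\d+", line)
        if loc:
            current = (files.get(loc.group(1), "?"), int(loc.group(2)))
            continue
        # Anything that is not a directive, comment, or brace is an instruction.
        if line and not line.startswith((".", "//", "{", "}")) and current:
            mapping.append((line, current))
    return mapping

sample = '''
.file 1 "vectorAdd.cu"
.loc 1 4 5
mov.u32 %r3, %ctaid.x;
.loc 1 6 9
ld.global.f32 %f1, [%rd5];
'''
for instr, (fname, lineno) in ptx_line_map(sample):
    print(f"{fname}:{lineno}  {instr}")
```

This is the same kind of bookkeeping a source-to-assembly view performs when you click between panes.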
Analyzing PTX and SASS reveals micro-architectural optimization opportunities not visible at the source level.
PTX shows declared virtual register usage: inspect the .reg declarations to gauge register pressure; register spills to local memory appear as ld.local and st.local instructions.
SASS reveals instruction scheduling and latency:
SASS memory instructions show access patterns:
SASS control flow shows divergence patterns:
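The PTX register and spill checks described above can be sketched as a short script over a PTX dump. This is a hypothetical helper, not a RightNow AI API; note that a declaration like '.reg .f32 %f<4>;' declares four virtual f32 registers:

```python
import re

def ptx_register_report(ptx_text):
    """Summarize declared virtual registers and spill traffic in PTX.

    Counts .reg declarations by type and ld.local/st.local instructions,
    which indicate register spills to local memory.
    """
    regs = {}
    for m in re.finditer(r"\.reg\s+\.(\w+)\s+%\w+<(\d+)>", ptx_text):
        regs[m.group(1)] = regs.get(m.group(1), 0) + int(m.group(2))
    spills = len(re.findall(r"\b(?:ld|st)\.local\b", ptx_text))
    return regs, spills

ptx = """
.reg .pred %p<2>;
.reg .f32 %f<4>;
.reg .b32 %r<8>;
.reg .b64 %rd<11>;
ld.local.f32 %f5, [%rd12];
st.local.f32 [%rd12], %f5;
"""
regs, spills = ptx_register_report(ptx)
print(regs)    # {'pred': 2, 'f32': 4, 'b32': 8, 'b64': 11}
print(spills)  # 2
```

Keep in mind that PTX registers are virtual; the final physical register count comes from ptxas (visible via nvcc --ptxas-options=-v) and in the SASS.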
RightNow AI understands PTX and SASS, providing intelligent optimization suggestions based on assembly-level analysis.
Ask the AI to analyze assembly code and suggest optimizations:
You: "Analyze the PTX for matmul_kernel and suggest register optimizations"
RightNow AI: "I've analyzed the PTX assembly. Current register usage is 64 registers per thread.
Key findings:
- Register allocation: 64 per thread (limiting occupancy to 50%)
- Register spills detected: 8 ld.local + 6 st.local instructions
- Shared memory usage: 49152 bytes (100% of the default 48KB static limit)
Recommendations:
1. Reduce register usage by recomputing some values instead of storing
2. Use smaller data types where precision allows (float → half precision)
3. Increase thread block size to amortize shared memory overhead
4. Consider loop unrolling factor of 4 instead of 8 to reduce register pressure
Would you like me to generate optimized code implementing these changes?"
Learn more: See Real-Time Profiling to measure assembly-level performance impact and Agentic AI Optimization for iterative assembly-aware optimization.
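The 50% occupancy figure in the example transcript can be reproduced with a back-of-the-envelope calculation. The limits below (64K 32-bit registers and 2048 resident threads per SM) are assumptions matching an A100-class SM; real hardware allocates registers per warp with rounding rules that this simplified model ignores:

```python
def register_limited_occupancy(regs_per_thread,
                               regs_per_sm=65536,
                               max_threads_per_sm=2048,
                               block_size=256):
    """Estimate occupancy limited only by register usage.

    Simplified model: ignores shared memory, block-slot limits, and
    per-warp register allocation granularity.
    """
    threads_by_regs = regs_per_sm // regs_per_thread
    # Only whole thread blocks can be resident.
    blocks = threads_by_regs // block_size
    resident_threads = min(blocks * block_size, max_threads_per_sm)
    return resident_threads / max_threads_per_sm

print(f"{register_limited_occupancy(64):.0%}")   # 50%
print(f"{register_limited_occupancy(32):.0%}")   # 100%
```

Halving register usage from 64 to 32 per thread doubles the resident thread count, which is why register-pressure reductions are the first recommendation above.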