Matrix-vector multiplication (GEMV) is ubiquitous in linear algebra, neural networks, and scientific computing. Unlike GEMM, which is compute-bound, GEMV is typically memory-bound: every element of the matrix is read once and used in a single multiply-add. Optimization therefore focuses on maximizing memory bandwidth utilization through coalesced access and parallel reduction.
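A quick back-of-the-envelope estimate makes the memory-bound claim concrete: a single-precision M×N GEMV performs 2MN floating-point operations but streams roughly 4MN bytes of A (plus the much smaller x and y vectors), so its arithmetic intensity is about

$$
\frac{2MN}{4MN + 4N + 4M} \approx 0.5\ \text{FLOP/byte},
$$

far below the FLOP-per-byte ratio at which modern GPUs stop being limited by DRAM bandwidth. The kernels below therefore compete on how close they get to peak bandwidth, not peak FLOPS.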
Assign one warp per row and use warp shuffles for a fast reduction.
```cuda
__global__ void sgemv_warp(float* A, float* x, float* y, int M, int N) {
    int row  = blockIdx.x;   // one 32-thread block (a single warp) per row
    int lane = threadIdx.x;  // lane index within the warp, 0..31
    float sum = 0.0f;
    // Lanes stride across the row, so consecutive lanes read consecutive
    // elements of A and the loads coalesce.
    for (int j = lane; j < N; j += 32) {
        sum += A[row * N + j] * x[j];
    }
    // Warp reduction via shuffles: no shared memory or __syncthreads() needed
    for (int offset = 16; offset > 0; offset /= 2) {
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    }
    if (lane == 0) y[row] = sum;
}
```
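A minimal host-side driver for the warp-per-row kernel might look like the sketch below; the wrapper name and the use of std::vector are illustrative, not part of the original post.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical host wrapper: allocates device buffers, launches sgemv_warp
// with one 32-thread block (one warp) per row, and copies the result back.
void run_sgemv_warp(const std::vector<float>& A, const std::vector<float>& x,
                    std::vector<float>& y, int M, int N) {
    float *d_A, *d_x, *d_y;
    cudaMalloc(&d_A, sizeof(float) * M * N);
    cudaMalloc(&d_x, sizeof(float) * N);
    cudaMalloc(&d_y, sizeof(float) * M);
    cudaMemcpy(d_A, A.data(), sizeof(float) * M * N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, x.data(), sizeof(float) * N, cudaMemcpyHostToDevice);

    sgemv_warp<<<M, 32>>>(d_A, d_x, d_y, M, N);   // grid = M rows, block = 1 warp

    cudaMemcpy(y.data(), d_y, sizeof(float) * M, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_x); cudaFree(d_y);
}
```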
One thread per row: the inner loop over the columns is serialized, and at each step the threads of a warp read A at addresses N elements apart, so the loads are poorly coalesced.

```cuda
__global__ void sgemv_naive(float* A, float* x, float* y, int M, int N) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per output row
    if (row < M) {
        float sum = 0.0f;
        // Serial dot product; neighboring threads touch A with a stride of N,
        // which wastes most of each memory transaction.
        for (int j = 0; j < N; j++) {
            sum += A[row * N + j] * x[j];
        }
        y[row] = sum;
    }
}
```
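For comparison, this variant maps threads to rows rather than columns, so the grid has to cover all M rows. A sketch, reusing the device pointers from the wrapper above and a hypothetical block size of 256:

```cuda
// Hypothetical launch for sgemv_naive: threads map to rows.
int threads = 256;
int blocks  = (M + threads - 1) / threads;   // enough blocks to cover all M rows
sgemv_naive<<<blocks, threads>>>(d_A, d_x, d_y, M, N);
```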
Caches tiles of x in shared memory and finishes with a shared-memory block reduction.

```cuda
__global__ void sgemv_opt(float* A, float* x, float* y, int M, int N) {
    // Assumes blockDim.x == 256 (the size of the shared arrays) and a power of two.
    __shared__ float xs[256];
    int row = blockIdx.x;          // one block per row
    int tid = threadIdx.x;
    int blockSize = blockDim.x;
    float sum = 0.0f;
    for (int tile = 0; tile < N; tile += blockSize) {
        // Collaborative, coalesced load of one tile of x into shared memory
        if (tile + tid < N) xs[tid] = x[tile + tid];
        __syncthreads();
        // Each thread handles one column of this tile, so consecutive threads
        // read consecutive elements of A (coalesced).
        if (tile + tid < N) {
            sum += A[row * N + tile + tid] * xs[tid];
        }
        __syncthreads();   // keep xs intact until every thread has used it
    }
    // Block-wide tree reduction of the per-thread partial sums
    __shared__ float sdata[256];
    sdata[tid] = sum;
    __syncthreads();
    for (int s = blockSize / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) y[row] = sdata[0];
}
```
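Because the shared arrays are hard-coded to 256 floats and the tree reduction halves the thread count each step, this kernel has to be launched with exactly 256 threads per block; a sketch:

```cuda
// Launch for sgemv_opt: one block per row, block size fixed at 256
// (must match the shared-memory arrays and be a power of two).
sgemv_opt<<<M, 256>>>(d_A, d_x, d_y, M, N);
```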
| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Effective throughput (4096x4096) | 45 GB/s | 380 GB/s | 8.4x |
| Fraction of cuBLAS sgemv throughput | 12% | 89% | 7.4x |
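Throughput figures like these are typically obtained by timing the kernel with CUDA events and dividing the bytes moved (dominated by the 4·M·N bytes of A) by the elapsed time. A minimal sketch, reusing the device pointers from the earlier host wrapper:

```cuda
#include <cstdio>

// Hypothetical effective-bandwidth measurement for a single launch.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
sgemv_opt<<<M, 256>>>(d_A, d_x, d_y, M, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
double bytes = 4.0 * M * N + 4.0 * N + 4.0 * M;   // A + x + y in single precision
printf("effective bandwidth: %.1f GB/s\n", bytes / (ms * 1e-3) / 1e9);
```

In practice you would warm up first and average over many launches rather than trust a single timing.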
For production, just use cuBLAS: its sgemv is highly optimized. Custom kernels only make sense for fused operations or unusual matrix shapes.
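For reference, calling cuBLAS directly looks roughly like the sketch below. cuBLAS assumes column-major storage, so the row-major A used by the kernels above is passed as its transpose; error checking is omitted.

```cuda
#include <cublas_v2.h>

// y = A * x for a row-major M x N matrix A stored in d_A.
cublasHandle_t handle;
cublasCreate(&handle);

const float alpha = 1.0f, beta = 0.0f;
// The row-major M x N buffer is an N x M matrix in column-major terms,
// so CUBLAS_OP_T recovers the original A.
cublasSgemv(handle, CUBLAS_OP_T,
            N, M,          // dimensions of the column-major view
            &alpha,
            d_A, N,        // leading dimension = row length
            d_x, 1,
            &beta,
            d_y, 1);

cublasDestroy(handle);
```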
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.