The NVIDIA L40S brings Ada Lovelace architecture to datacenters with 48GB of GDDR6 memory. Positioned between consumer RTX GPUs and HBM-based datacenter cards, it offers FP8 Tensor Cores and modern features at a more accessible price point than H100. For CUDA developers deploying inference or training workloads in cloud/datacenter environments, the L40S provides excellent performance per dollar. The 48GB VRAM handles large models, while FP8 support enables efficient inference. This guide covers L40S optimization strategies and when to choose it over alternatives.
| Specification | Value |
|---|---|
| Architecture | Ada Lovelace (AD102) |
| CUDA Cores | 18,176 |
| Tensor Cores | 568 |
| Memory | 48GB GDDR6 |
| Memory Bandwidth | 864 GB/s |
| Base / Boost Clock | 1110 / 2520 MHz |
| FP32 Performance | 91.6 TFLOPS |
| FP16 Performance | 183.2 TFLOPS |
| L2 Cache | 96MB |
| TDP | 350W |
| NVLink | No |
| MSRP | $10,000+ |
| Release | August 2023 |
This code snippet shows how to detect your L40S, check available memory, and configure optimal settings for the Ada Lovelace (AD102) architecture.
```python
import torch
import pynvml

# Detect the GPU (L40S: Ada Lovelace AD102, 18,176 CUDA cores, 48GB GDDR6)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if device.type == 'cuda':
    print(f"Using device: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA not available; running on CPU")

# Enable TF32 matmuls on Ada Lovelace (faster FP32 workloads at negligible accuracy cost)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Check available memory via NVML
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"Free memory: {info.free / 1024**3:.1f} GB / {info.total / 1024**3:.1f} GB total")
pynvml.nvmlShutdown()

# Rough batch-size heuristic for the 48GB L40S -- profile your own workload
model_memory_gb = 2.0                          # adjust to your model's footprint
batch_multiplier = (48 - model_memory_gb) / 4  # assume ~4GB per batch unit
recommended_batch = int(batch_multiplier * 32)
print(f"Recommended batch size for L40S: {recommended_batch}")
```

| Task | Performance | Comparison |
|---|---|---|
| LLaMA-70B Inference (tokens/sec) | 55 | FP8 quantized |
| Stable Diffusion XL (images/sec) | 8.2 | Strong for generation |
| BERT-Large Inference (sentences/sec) | 3,800 | FP8 optimized |
| ResNet-50 Training (imgs/sec) | 2,100 | 74% of H100 |
| Memory Bandwidth (GB/s measured) | 810 | 94% of theoretical peak |
| cuBLAS GEMM FP8 (TFLOPS) | 680 | Strong FP8 performance |
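
The FP8 figures above come from the fourth-generation Tensor Cores in Ada Lovelace. As a rough illustration of how FP8 compute is typically reached from PyTorch, here is a minimal sketch using NVIDIA Transformer Engine; the layer size, batch shape, and scaling recipe are illustrative assumptions, not tuned L40S settings.

```python
# Illustrative sketch: one FP8 linear layer via NVIDIA Transformer Engine.
# Sizes and recipe settings are assumptions; requires a Transformer Engine
# build with FP8 support for Ada (sm_89).
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(margin=0, fp8_format=Format.E4M3)

layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16)
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

# Inside fp8_autocast, supported GEMMs run on the FP8 Tensor Cores while
# Transformer Engine manages the per-tensor scaling factors.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

print(y.shape, y.dtype)
```

Transformer Engine handles the per-tensor scaling that FP8 requires, so the surrounding model code can stay in BF16.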
| Use Case | Rating | Notes |
|---|---|---|
| LLM Inference | Excellent | FP8 + 48GB excellent for serving |
| Generative AI Inference | Excellent | Cost-effective for SD/image gen |
| Multi-Tenant Inference | Excellent | vGPU and time-slicing for isolation (no MIG support) |
| ML Training | Good | Capable, but the H100 is faster for large-scale training |
| Budget Datacenter | Excellent | Better $/perf than H100 |
| Video AI | Excellent | Ada architecture video features |
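
For the LLM-serving and multi-tenant rows above, a common deployment path is vLLM with FP8 weight quantization and a capped memory fraction. The sketch below is illustrative only: the model name, memory utilization, and context length are assumptions to adapt, not recommended settings.

```python
# Illustrative sketch: LLM serving on a single L40S with vLLM.
# Model choice, gpu_memory_utilization, and max_model_len are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model that fits in 48GB
    quantization="fp8",           # online FP8 weight quantization (Ada supports FP8)
    gpu_memory_utilization=0.90,  # cap VRAM use to leave headroom on the card
    max_model_len=8192,
)

outputs = llm.generate(
    ["Summarize the trade-offs between GDDR6 and HBM memory."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```

Lowering gpu_memory_utilization is also the simplest way to co-locate several serving processes on one card, since the L40S has no MIG partitioning.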
Compared with the H100: the H100 is 2-3x faster for training and has NVLink, while the L40S is the better value for inference. Choose the H100 for training clusters and the L40S for inference deployment and cost-sensitive workloads.
Compared with the A100: the L40S is newer, adds FP8 support and a much larger L2 cache, and is roughly 20% faster for inference. The A100's HBM2e gives it higher memory bandwidth, which favors training. Pick the L40S for inference and the A100 for mixed workloads.
For LLM inference, the L40S is excellent: the 48GB of VRAM and FP8 Tensor Cores make it well suited to serving, and it is markedly more cost-effective than the H100 for inference-focused deployments.
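A quick back-of-envelope check shows which model sizes actually fit: roughly parameters times bytes-per-parameter for the weights, plus headroom for activations and KV cache. The overhead allowance below is an assumption for illustration, not a measured figure.

```python
# Back-of-envelope VRAM check for single-GPU inference on a 48GB L40S.
# The overhead allowance (activations, KV cache, runtime) is an assumption.
def fits_on_l40s(params_billion: float, bytes_per_param: float,
                 overhead_gb: float = 6.0, vram_gb: float = 48.0) -> bool:
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes ~= GB
    return weights_gb + overhead_gb <= vram_gb

for name, params, bpp in [
    ("13B @ FP16", 13, 2.0),
    ("34B @ FP8", 34, 1.0),
    ("70B @ FP8", 70, 1.0),    # ~70 GB of weights: needs 2 GPUs or 4-bit weights
    ("70B @ 4-bit", 70, 0.5),
]:
    print(f"{name:>12}: {'fits' if fits_on_l40s(params, bpp) else 'does not fit'}")
```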
Training is possible, but the H100 is more efficient and the L40S lacks NVLink for multi-GPU scaling. For single-GPU fine-tuning or smaller training jobs the L40S works well; for large-scale training, use the H100.
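For single-GPU fine-tuning, the main levers for fitting into 48GB are BF16 autocast and activation (gradient) checkpointing. A minimal sketch, with a toy model and placeholder hyperparameters standing in for a real network:

```python
# Minimal sketch: single-GPU fine-tuning loop sized for 48GB.
# The toy Sequential model and hyperparameters are placeholders; the memory
# levers shown are BF16 autocast and activation checkpointing.
import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(
    *[torch.nn.Linear(4096, 4096) for _ in range(8)]
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
amp_dtype = torch.bfloat16  # BF16 needs no GradScaler and runs fast on Ada

x = torch.randn(32, 4096, device="cuda")
target = torch.randn(32, 4096, device="cuda")

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        # Recompute activations in 4 segments instead of storing all of them
        out = checkpoint_sequential(model, 4, x, use_reentrant=False)
        loss = torch.nn.functional.mse_loss(out, target)
    loss.backward()
    optimizer.step()
```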
| Alternative | Key Differences vs. L40S |
|---|---|
| H100 | 2-3x faster, HBM3, higher price |
| A100 | HBM2e, higher memory bandwidth |
| RTX 4090 | Consumer card, 24GB (half the VRAM) |
| RTX A6000 | Workstation card with 48GB, older (Ampere) architecture |
Ready to optimize your CUDA kernels for L40S? Download RightNow AI for real-time performance analysis.