The NVIDIA L40S brings Ada Lovelace architecture to datacenters with 48GB of GDDR6 memory. Positioned between consumer RTX GPUs and HBM-based datacenter cards, it offers FP8 Tensor Cores and modern features at a more accessible price point than H100. For CUDA developers deploying inference or training workloads in cloud/datacenter environments, the L40S provides excellent performance per dollar. The 48GB VRAM handles large models, while FP8 support enables efficient inference. This guide covers L40S optimization strategies and when to choose it over alternatives.
| Specification | Value |
|---|---|
| Architecture | Ada Lovelace (AD102) |
| CUDA Cores | 18,176 |
| Tensor Cores | 568 |
| Memory | 48GB GDDR6 |
| Memory Bandwidth | 864 GB/s |
| Base / Boost Clock | 1110 / 2520 MHz |
| FP32 Performance | 91.6 TFLOPS |
| FP16 Performance | 183.2 TFLOPS |
| L2 Cache | 96MB |
| TDP | 350W |
| NVLink | No |
| MSRP | $10,000+ |
| Release | August 2023 |
This code snippet shows how to detect your L40S, check available memory, and configure optimal settings for the Ada Lovelace (AD102) architecture.
```python
import torch
import pynvml

# Detect the GPU (L40S: Ada Lovelace AD102, 18,176 CUDA cores, 48GB GDDR6)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if device.type == 'cuda':
    print(f"Using device: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA not available; running on CPU")

# Enable TF32 matmuls on Ada Lovelace (faster FP32 workloads at negligible accuracy cost)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Check available memory via NVML
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"Free memory: {info.free / 1024**3:.1f} GB / {info.total / 1024**3:.1f} GB total")
pynvml.nvmlShutdown()

# Rough batch-size heuristic for the 48GB L40S -- profile your own workload
model_memory_gb = 2.0                          # adjust to your model's footprint
batch_multiplier = (48 - model_memory_gb) / 4  # assume ~4GB per batch unit
recommended_batch = int(batch_multiplier * 32)
print(f"Recommended batch size for L40S: {recommended_batch}")
```

| Task | Performance | Comparison |
|---|---|---|
| LLaMA-70B Inference (tokens/sec) | 55 | FP8 quantized |
| Stable Diffusion XL (images/sec) | 8.2 | Strong for generation |
| BERT-Large Inference (sentences/sec) | 3,800 | FP8 optimized |
| ResNet-50 Training (imgs/sec) | 2,100 | 74% of H100 |
| Memory Bandwidth (GB/s measured) | 810 | 94% of theoretical peak |
| cuBLAS GEMM FP8 (TFLOPS) | 680 | Strong FP8 performance |
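
The FP8 figures above come from the fourth-generation Tensor Cores in Ada Lovelace. As a rough illustration of how FP8 compute is typically reached from PyTorch, here is a minimal sketch using NVIDIA Transformer Engine; the layer size, batch shape, and scaling recipe are illustrative assumptions, not tuned L40S settings.

```python
# Illustrative sketch: one FP8 linear layer via NVIDIA Transformer Engine.
# Sizes and recipe settings are assumptions; requires a Transformer Engine
# build with FP8 support for Ada (sm_89).
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(margin=0, fp8_format=Format.E4M3)

layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16)
x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)

# Inside fp8_autocast, supported GEMMs run on the FP8 Tensor Cores while
# Transformer Engine manages the per-tensor scaling factors.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

print(y.shape, y.dtype)
```

Transformer Engine handles the per-tensor scaling that FP8 requires, so the surrounding model code can stay in BF16.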
| Use Case | Rating | Notes |
|---|---|---|
| LLM Inference | Excellent | FP8 + 48GB excellent for serving |
| Generative AI Inference | Excellent | Cost-effective for SD/image gen |
| Multi-Tenant Inference | Excellent | vGPU and time-slicing for isolation (no MIG support) |
| ML Training | Good | Capable, but the H100 is faster for large-scale training |
| Budget Datacenter | Excellent | Better $/perf than H100 |
| Video AI | Excellent | Ada architecture video features |
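
For the LLM-serving and multi-tenant rows above, a common deployment path is vLLM with FP8 weight quantization and a capped memory fraction. The sketch below is illustrative only: the model name, memory utilization, and context length are assumptions to adapt, not recommended settings.

```python
# Illustrative sketch: LLM serving on a single L40S with vLLM.
# Model choice, gpu_memory_utilization, and max_model_len are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model that fits in 48GB
    quantization="fp8",           # online FP8 weight quantization (Ada supports FP8)
    gpu_memory_utilization=0.90,  # cap VRAM use to leave headroom on the card
    max_model_len=8192,
)

outputs = llm.generate(
    ["Summarize the trade-offs between GDDR6 and HBM memory."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```

Lowering gpu_memory_utilization is also the simplest way to co-locate several serving processes on one card, since the L40S has no MIG partitioning.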
Compared with the H100: the H100 is 2-3x faster for training and has NVLink, while the L40S is the better value for inference. Choose the H100 for training clusters and the L40S for inference deployment and cost-sensitive workloads.
Compared with the A100: the L40S is newer, adds FP8 support and a much larger L2 cache, and is roughly 20% faster for inference. The A100's HBM2e gives it higher memory bandwidth, which favors training. Pick the L40S for inference and the A100 for mixed workloads.
For LLM inference, the L40S is excellent: the 48GB of VRAM and FP8 Tensor Cores make it well suited to serving, and it is markedly more cost-effective than the H100 for inference-focused deployments.
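A quick back-of-envelope check shows which model sizes actually fit: roughly parameters times bytes-per-parameter for the weights, plus headroom for activations and KV cache. The overhead allowance below is an assumption for illustration, not a measured figure.

```python
# Back-of-envelope VRAM check for single-GPU inference on a 48GB L40S.
# The overhead allowance (activations, KV cache, runtime) is an assumption.
def fits_on_l40s(params_billion: float, bytes_per_param: float,
                 overhead_gb: float = 6.0, vram_gb: float = 48.0) -> bool:
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes ~= GB
    return weights_gb + overhead_gb <= vram_gb

for name, params, bpp in [
    ("13B @ FP16", 13, 2.0),
    ("34B @ FP8", 34, 1.0),
    ("70B @ FP8", 70, 1.0),    # ~70 GB of weights: needs 2 GPUs or 4-bit weights
    ("70B @ 4-bit", 70, 0.5),
]:
    print(f"{name:>12}: {'fits' if fits_on_l40s(params, bpp) else 'does not fit'}")
```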
Training is possible, but the H100 is more efficient and the L40S lacks NVLink for multi-GPU scaling. For single-GPU fine-tuning or smaller training jobs the L40S works well; for large-scale training, use the H100.
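For single-GPU fine-tuning, the main levers for fitting into 48GB are BF16 autocast and activation (gradient) checkpointing. A minimal sketch, with a toy model and placeholder hyperparameters standing in for a real network:

```python
# Minimal sketch: single-GPU fine-tuning loop sized for 48GB.
# The toy Sequential model and hyperparameters are placeholders; the memory
# levers shown are BF16 autocast and activation checkpointing.
import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(
    *[torch.nn.Linear(4096, 4096) for _ in range(8)]
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
amp_dtype = torch.bfloat16  # BF16 needs no GradScaler and runs fast on Ada

x = torch.randn(32, 4096, device="cuda")
target = torch.randn(32, 4096, device="cuda")

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        # Recompute activations in 4 segments instead of storing all of them
        out = checkpoint_sequential(model, 4, x, use_reentrant=False)
        loss = torch.nn.functional.mse_loss(out, target)
    loss.backward()
    optimizer.step()
```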
| Alternative | Key Differences vs. L40S |
|---|---|
| H100 | 2-3x faster, HBM3, higher price |
| A100 | HBM2e, higher memory bandwidth |
| RTX 4090 | Consumer card, 24GB (half the VRAM) |
| RTX A6000 | Workstation card with 48GB, older (Ampere) architecture |
Ready to optimize your CUDA kernels for L40S? Download RightNow AI for real-time performance analysis.