The NVIDIA L40 brings the Ada Lovelace architecture to professional visualization and AI workloads with 48GB of GDDR6 memory. Positioned between the consumer RTX 4090 and the datacenter L40S, it combines RT cores for ray tracing with Tensor Cores for AI acceleration. For CUDA developers, the L40 is a versatile platform that handles both graphics and compute workloads: the 48GB of memory is enough for large-model inference and complex rendering scenes, and the card stays within a 300W power budget. This guide covers the L40's specifications, CUDA optimization strategies, benchmark results, and practical tips for maximizing performance.

| Specification | Value |
|---|---|
| Architecture | Ada Lovelace (AD102) |
| CUDA Cores | 18,176 |
| Tensor Cores | 568 |
| Memory | 48GB GDDR6 |
| Memory Bandwidth | 864 GB/s |
| Base / Boost Clock | 735 / 2490 MHz |
| FP32 Performance | 90.5 TFLOPS |
| FP16 Performance | 181 TFLOPS |
| L2 Cache | 96MB |
| TDP | 300W |
| NVLink | No |
| MSRP | $7,000 |
| Release | October 2022 |
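
The FP32 figure in the table can be cross-checked from the core count and boost clock. The sketch below assumes the usual convention that each CUDA core completes one fused multiply-add (counted as two floating-point operations) per clock; the function name `peak_fp32_tflops` is illustrative and not part of any library.

```python
def peak_fp32_tflops(cuda_cores: int, boost_clock_mhz: float,
                     flops_per_core_per_clock: int = 2) -> float:
    """Theoretical peak FP32 throughput in TFLOPS.

    Assumes one FMA (2 FLOPs) per CUDA core per clock at the boost clock.
    """
    flops_per_second = cuda_cores * boost_clock_mhz * 1e6 * flops_per_core_per_clock
    return flops_per_second / 1e12


# Values taken from the specification table
l40_fp32 = peak_fp32_tflops(cuda_cores=18_176, boost_clock_mhz=2490)
print(f"L40 peak FP32: {l40_fp32:.1f} TFLOPS")  # prints "L40 peak FP32: 90.5 TFLOPS"
```

This is a theoretical ceiling at the boost clock; sustained throughput depends on clocks under load and on how well a kernel keeps the cores fed.
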

The following snippet detects the GPU, enables TF32 math, reports free memory through NVML, and derives a rough starting batch size for the L40.

```python
import torch
import pynvml

# Select the GPU if one is visible, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if device.type == 'cuda':
    print(f"Using device: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device found; using CPU")

# L40 reference specs: 48GB GDDR6, Ada Lovelace (AD102), 18,176 CUDA cores

# Allow TF32 in matmuls and cuDNN convolutions (speeds up FP32 math on Ada Lovelace)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Query free and total memory through NVML (requires an NVIDIA driver)
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"Free memory: {info.free / 1024**3:.1f} GB / {info.total / 1024**3:.1f} GB total")
pynvml.nvmlShutdown()

# Rough batch-size heuristic for the L40 (a starting point to tune, not a guarantee)
model_memory_gb = 2.0  # Adjust based on your model
batch_multiplier = (48 - model_memory_gb) / 4  # 4GB per batch unit
recommended_batch = int(batch_multiplier * 32)
print(f"Recommended batch size for L40: {recommended_batch}")
```

| Task | Performance | Notes |
|---|---|---|
| ResNet-50 Inference (imgs/sec) | 8,500 | FP16 Tensor Cores |
| Stable Diffusion (sec/img) | 2.5 | FP16 mode |
| LLaMA-7B (tokens/sec) | 75 | INT8 quantized |
| SPECviewperf 3dsmax | 180 | Professional rendering |
| Blender Rendering | 2x A40 | Cycles RT |
| Memory Bandwidth (GB/s) | 820 | 95% efficiency |
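
Two rows of the benchmark table can be sanity-checked with simple arithmetic. The sketch below computes the measured-to-peak bandwidth ratio, then a rough upper bound on single-stream LLaMA-7B decode speed. It assumes decoding is memory-bandwidth bound and that every generated token reads all weights once (about 7 GB at one byte per parameter for INT8). Both the assumption and the helper names are illustrative; KV-cache traffic and kernel overheads are ignored.

```python
PEAK_BANDWIDTH_GBS = 864.0      # spec-sheet memory bandwidth
MEASURED_BANDWIDTH_GBS = 820.0  # measured value from the benchmark table


def bandwidth_efficiency(measured_gbs: float, peak_gbs: float) -> float:
    """Fraction of theoretical bandwidth actually achieved."""
    return measured_gbs / peak_gbs


def decode_tokens_per_sec_ceiling(params_billions: float, bytes_per_param: float,
                                  bandwidth_gbs: float) -> float:
    """Upper bound on tokens/sec if each token streams all weights once (assumption)."""
    weight_gb = params_billions * bytes_per_param
    return bandwidth_gbs / weight_gb


efficiency = bandwidth_efficiency(MEASURED_BANDWIDTH_GBS, PEAK_BANDWIDTH_GBS)
ceiling = decode_tokens_per_sec_ceiling(7.0, 1.0, MEASURED_BANDWIDTH_GBS)

print(f"Bandwidth efficiency: {efficiency:.1%}")
print(f"LLaMA-7B INT8 decode ceiling: {ceiling:.0f} tokens/sec")
```

The reported 75 tokens/sec sits below this ceiling, which is what an upper bound of this kind should show.
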

| Use Case | Rating | Notes |
|---|---|---|
| Visual Computing | Excellent | RT cores + 48GB for complex scenes |
| AI Inference | Excellent | FP8 Tensor Cores, large memory |
| Virtual Workstations | Excellent | vGPU support for VDI |
| Content Creation | Excellent | Rendering + video processing |
| ML Training | Good | 48GB helps but prefer L40S |
| Pure Compute | Good | L40S better for pure AI |

Both cards pair the AD102 GPU with 48GB of GDDR6 and include 142 RT cores for ray tracing. The L40S runs at a higher power limit (350W vs. 300W for the L40) and delivers faster AI performance. Choose the L40 for mixed graphics and AI work at lower power, and the L40S for pure AI inference.
Yes, the L40 is excellent for ML inference with 48GB memory and FP8 Tensor Cores. For pure training workloads without graphics needs, L40S or A100 are better optimized.
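
To make the 48GB figure concrete, the sketch below estimates a model's weight footprint at several precisions and checks whether it fits on the card. The 20% headroom reserved for activations and KV cache is an assumed rule of thumb, and `weight_footprint_gb` and `fits_in_memory` are hypothetical helpers, not library functions.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_footprint_gb(params_billions: float, precision: str) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits_in_memory(params_billions: float, precision: str,
                   memory_gb: float = 48.0, headroom: float = 0.20) -> bool:
    """True if the weights fit after reserving `headroom` for activations/KV cache."""
    usable_gb = memory_gb * (1.0 - headroom)
    return weight_footprint_gb(params_billions, precision) <= usable_gb


for size in (7, 13, 70):
    for precision in ("fp16", "int8", "int4"):
        gb = weight_footprint_gb(size, precision)
        verdict = "fits" if fits_in_memory(size, precision) else "does not fit"
        print(f"{size}B @ {precision}: {gb:.1f} GB of weights, {verdict} in 48 GB")
```

Under these assumptions a 13B model fits in FP16, while a 70B model needs roughly 4-bit quantization to fit on a single L40.
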

Compared with the RTX 4090, the L40 offers 48GB of memory instead of 24GB, professional drivers, vGPU support, and better reliability. For pure compute, the RTX 4090 is faster per dollar. The L40 shines in mixed graphics+AI and enterprise deployments.
Yes, L40 has 142 3rd Gen RT cores for hardware-accelerated ray tracing. This makes it suitable for rendering, visualization, and graphics workloads alongside AI inference.
No RT cores, optimized for AI
Consumer, 24GB, faster compute
Previous gen, similar positioning
Workstation, 48GB, NVLink
Ready to optimize your CUDA kernels for L40? Download RightNow AI for real-time performance analysis.