TensorFlow is Google's production-grade deep learning framework with comprehensive CUDA support. Its XLA compiler, mixed precision API, and distribution strategies make it excellent for large-scale training and deployment across devices. For CUDA developers, TensorFlow offers automatic GPU utilization with @tf.function JIT compilation, XLA for cross-platform optimization, and tight integration with TensorRT for inference. Understanding memory management and graph optimization is key to peak performance. This guide covers TensorFlow's GPU configuration, XLA compilation, mixed precision training, memory optimization, and deployment strategies for CUDA-accelerated workflows.
CUDA Integration: TensorFlow uses cuDNN and cuBLAS under the hood, automatically selecting optimal algorithms. XLA compiles TensorFlow graphs into optimized GPU code, fusing operations and eliminating memory transfers. For inference, TensorFlow-TensorRT (TF-TRT) converts models to TensorRT for maximum GPU utilization.
Install TensorFlow with GPU support using pip.
```bash
# TensorFlow 2.x includes GPU support by default
pip install tensorflow

# Verify installation
python -c "import tensorflow as tf; print(f'TensorFlow {tf.__version__}'); print('GPUs:', tf.config.list_physical_devices('GPU'))"

# Check CUDA/cuDNN versions
python -c "import tensorflow as tf; print(tf.sysconfig.get_build_info())"

# For specific CUDA versions, use NVIDIA's containers
docker pull tensorflow/tensorflow:latest-gpu
```

Configure GPU memory and train a simple model with TensorFlow.
```python
import tensorflow as tf

# Configure GPU memory growth (prevents TF from grabbing all VRAM)
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        print(f"GPUs available: {len(gpus)}")
    except RuntimeError as e:
        print(e)

# Simple model with Keras
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Training - automatically uses GPU
history = model.fit(
    x_train, y_train,
    batch_size=128,
    epochs=10,
    validation_split=0.1
)

# Inference
predictions = model.predict(x_test)
```

Enable XLA compilation and mixed precision for maximum GPU performance.
```python
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Enable mixed precision globally
mixed_precision.set_global_policy('mixed_float16')

# Enable XLA compilation
tf.config.optimizer.set_jit(True)

# Custom training loop with tf.function for graph compilation
class Trainer:
    def __init__(self, model, optimizer):
        self.model = model
        # Loss scaling prevents float16 gradient underflow
        self.optimizer = mixed_precision.LossScaleOptimizer(optimizer)
        self.loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

    @tf.function(jit_compile=True)  # XLA compilation
    def train_step(self, x, y):
        with tf.GradientTape() as tape:
            logits = self.model(x, training=True)
            # Cast to float32 for loss computation (mixed precision)
            loss = self.loss_fn(y, tf.cast(logits, tf.float32))
            scaled_loss = self.optimizer.get_scaled_loss(loss)
        scaled_gradients = tape.gradient(scaled_loss, self.model.trainable_variables)
        gradients = self.optimizer.get_unscaled_gradients(scaled_gradients)
        self.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))
        return loss

    @tf.function
    def eval_step(self, x, y):
        logits = self.model(x, training=False)
        correct = tf.equal(tf.argmax(logits, axis=1), tf.cast(y, tf.int64))
        return tf.reduce_mean(tf.cast(correct, tf.float32))

# Create model with float16 compute, float32 output
def create_model():
    inputs = tf.keras.Input(shape=(784,))
    x = tf.keras.layers.Dense(512, activation='relu')(inputs)
    x = tf.keras.layers.Dense(256, activation='relu')(x)
    # Output layer in float32 for numerical stability
    outputs = tf.keras.layers.Dense(10, activation='softmax', dtype='float32')(x)
    return tf.keras.Model(inputs, outputs)

model = create_model()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
trainer = Trainer(model, optimizer)

# Training loop
for epoch in range(10):
    for x_batch, y_batch in train_dataset:
        loss = trainer.train_step(x_batch, y_batch)
```

Add @tf.function(jit_compile=True) to your training functions for automatic kernel fusion and optimization.
mixed_precision.set_global_policy("mixed_float16") roughly halves activation memory and speeds up training on Tensor Core GPUs.
tf.config.experimental.set_memory_growth(gpu, True) prevents TensorFlow from allocating all GPU memory upfront.
Prefetch and parallelize data loading with tf.data.Dataset.prefetch(tf.data.AUTOTUNE) to hide I/O latency (see the input-pipeline sketch after these tips).
Wrap your forward pass in @tf.function to compile it as a single GPU kernel instead of many small operations.
Convert your SavedModel to TensorRT format with tf.experimental.tensorrt.Converter for 2-6x faster inference.
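As a rough sketch of the prefetching tip above, assuming NumPy arrays x_train and y_train shaped like the earlier Keras example, an input pipeline feeding either model.fit() or the custom training loop might look like this:

```python
import tensorflow as tf

# Assumes x_train / y_train NumPy arrays as in the earlier examples
train_dataset = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(buffer_size=10_000)
    .map(lambda x, y: (tf.cast(x, tf.float32) / 255.0, y),
         num_parallel_calls=tf.data.AUTOTUNE)   # parallel preprocessing on the CPU
    .batch(128)
    .prefetch(tf.data.AUTOTUNE)                 # overlap input prep with GPU compute
)

# Works with model.fit() or the custom training loop above
model.fit(train_dataset, epochs=10)
```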
| Task | Performance | Notes |
|---|---|---|
| ResNet-50 Training (imgs/sec) | 1,720 | RTX 4090, batch=64, XLA+AMP |
| BERT Inference (sentences/sec) | 285 | RTX 4090, TensorRT FP16 |
| XLA speedup | 1.2-1.5x | Varies by model complexity |
| TensorRT speedup | 2-6x | Compared to TF native inference |
TensorFlow pre-allocates GPU memory by default for performance. Use tf.config.experimental.set_memory_growth(gpu, True) to allocate memory as needed instead.
Use tf.distribute.MirroredStrategy() for single-node multi-GPU training. Create and compile your model inside the strategy.scope() context; model.fit() then distributes each batch across the replicas.
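A minimal sketch of that pattern, assuming the same flattened 784-feature inputs and x_train/y_train arrays as the earlier examples:

```python
import tensorflow as tf

# Replicate variables across all visible GPUs on this machine
strategy = tf.distribute.MirroredStrategy()
print(f"Replicas in sync: {strategy.num_replicas_in_sync}")

# Model variables must be created (and the model compiled) inside the scope
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'],
    )

# fit() splits each global batch across the replicas automatically;
# x_train / y_train are assumed to exist as in the earlier examples
model.fit(x_train, y_train, batch_size=256, epochs=10)
```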
Eager mode executes operations immediately (good for debugging). Graph mode with @tf.function compiles operations into an optimized graph (much faster for training/inference).
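A small illustration of the difference; the function body here is an arbitrary placeholder, not from the guide:

```python
import tensorflow as tf

def eager_sum_of_squares(x):
    # Eager mode: each op runs immediately, easy to inspect and debug
    return tf.reduce_sum(tf.square(x))

@tf.function
def graph_sum_of_squares(x):
    # Graph mode: traced once into an optimized graph, then reused on later calls
    return tf.reduce_sum(tf.square(x))

x = tf.random.normal([4096, 4096])
print(eager_sum_of_squares(x).numpy(), graph_sum_of_squares(x).numpy())
```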
Save as SavedModel, then use tf.experimental.tensorrt.Converter to convert. Specify precision_mode="FP16" for best performance on Tensor Core GPUs.
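A sketch of that conversion flow, assuming a SavedModel exported to a hypothetical resnet_savedmodel directory; the exact Converter keyword arguments vary a little between TensorFlow releases:

```python
import tensorflow as tf

# Hypothetical paths; replace with your own SavedModel locations
saved_model_dir = "resnet_savedmodel"
trt_saved_model_dir = "resnet_savedmodel_trt"

# Convert the SavedModel with TF-TRT, targeting Tensor Cores via FP16
converter = tf.experimental.tensorrt.Converter(
    input_saved_model_dir=saved_model_dir,
    precision_mode="FP16",
)
converter.convert()
converter.save(trt_saved_model_dir)

# The converted model loads and runs like any other SavedModel
loaded = tf.saved_model.load(trt_saved_model_dir)
infer = loaded.signatures["serving_default"]
```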
PyTorch: More flexible, better for research, dynamic graphs
JAX: Google's newer framework, pure functional, better XLA
CUDA: For custom GPU kernels, lower level
Optimize your TensorFlow CUDA code with RightNow AI - get real-time performance suggestions and memory analysis.