RAPIDS is NVIDIA's suite of GPU-accelerated data science libraries: pandas-like DataFrames (cuDF), scikit-learn-like machine learning (cuML), and NetworkX-like graph analytics (cuGraph). Built on Apache Arrow and CUDA, RAPIDS enables end-to-end data science pipelines on the GPU. For CUDA developers working with data, RAPIDS provides familiar Python APIs with large speedups: a single GPU can process datasets that would otherwise require a distributed CPU cluster, with operations often running 10-100x faster than pandas/scikit-learn while using nearly the same syntax. This guide covers cuDF for data manipulation, cuML for machine learning, cuGraph for network analysis, and best practices for building GPU-accelerated data science workflows.
CUDA Integration: RAPIDS runs all operations through CUDA, building on NVIDIA math libraries such as cuBLAS for linear algebra and on NCCL for multi-GPU communication. All RAPIDS libraries share the Apache Arrow columnar format, enabling zero-copy data sharing between components and with other frameworks.
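To illustrate that zero-copy sharing: a numeric, null-free cuDF column can be viewed as a CuPy array, and a CuPy result handed back, without leaving device memory. A minimal sketch, assuming cudf and cupy are installed:
import cudf
import cupy as cp
s = cudf.Series([1.0, 2.0, 3.0])
arr = s.values                  # CuPy view of the column's device buffer (no copy for numeric, null-free data)
s2 = cudf.Series(cp.sqrt(arr))  # CuPy result back into cuDF, no host round-trip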
Install RAPIDS using conda for best compatibility.
# Install via conda (recommended)
conda create -n rapids-env -c rapidsai -c conda-forge -c nvidia \
    rapids=24.10 python=3.11 cuda-version=12.0
conda activate rapids-env
# Or install specific packages
conda install -c rapidsai -c conda-forge -c nvidia \
    cudf=24.10 cuml=24.10 cugraph=24.10
# Verify installation
python -c "import cudf; print(f'cuDF {cudf.__version__}')"
python -c "import cuml; print(f'cuML {cuml.__version__}')"
# Check GPU
python -c "import cudf; df = cudf.DataFrame({'a': [1,2,3]}); print(df)"Use cuDF for GPU-accelerated pandas operations.
import cudf
import pandas as pd
import numpy as np
# Create GPU DataFrame - just like pandas
gdf = cudf.DataFrame({
    'a': np.arange(1000000),
    'b': np.random.randn(1000000),
    'c': np.random.choice(['X', 'Y', 'Z'], 1000000)
})
# All pandas operations work on GPU
gdf['d'] = gdf['a'] * gdf['b']
grouped = gdf.groupby('c').agg({'a': 'mean', 'b': 'sum', 'd': 'max'})
filtered = gdf[gdf['b'] > 0]
# Convert between pandas and cuDF
pdf = gdf.to_pandas() # GPU -> CPU
gdf2 = cudf.from_pandas(pdf) # CPU -> GPU
# Read CSV directly to GPU
gdf = cudf.read_csv('data.csv')
# SQL-style queries: cuDF has no built-in SQL engine, but .query() gives
# expression-based filtering, combined here with a groupby aggregation
result = gdf.query('a > 1000').groupby('c').agg({'b': 'mean'})
# Join operations - much faster than pandas
left = cudf.DataFrame({'key': range(1000000), 'val1': range(1000000)})
right = cudf.DataFrame({'key': range(500000), 'val2': range(500000)})
merged = left.merge(right, on='key', how='inner')
# String operations on GPU
gdf = cudf.DataFrame({'text': ['hello', 'world', 'GPU', 'acceleration']})
gdf['upper'] = gdf['text'].str.upper()
gdf['contains'] = gdf['text'].str.contains('GPU')
# DateTime operations
gdf = cudf.DataFrame({
    'date': cudf.date_range('2024-01-01', periods=1000000, freq='1min')
})
gdf['year'] = gdf['date'].dt.year
gdf['month'] = gdf['date'].dt.month
Build a complete GPU-accelerated machine learning pipeline.
import cudf
import cuml
import numpy as np
from cuml.ensemble import RandomForestClassifier
from cuml.model_selection import train_test_split
from cuml.preprocessing import StandardScaler
from cuml.metrics import accuracy_score
# Load data to GPU
df = cudf.read_csv('large_dataset.csv')
# Feature engineering on GPU
df['feature_interaction'] = df['feature1'] * df['feature2']
df['log_feature'] = np.log1p(df['skewed_feature'])  # NumPy ufuncs on cuDF columns dispatch to the GPU
# Encode categorical variables
from cuml.preprocessing import LabelEncoder
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])
# Split data (stays on GPU)
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Scale features on GPU
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train Random Forest on GPU
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    n_bins=128,  # GPU-specific: histogram bins used when searching splits
    max_features=1.0
)
rf.fit(X_train_scaled, y_train)
# Predict on GPU
predictions = rf.predict(X_test_scaled)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.4f}")
# K-Means clustering on GPU
from cuml.cluster import KMeans
kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit_predict(X_train_scaled)
# PCA dimensionality reduction on GPU
from cuml.decomposition import PCA
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_train_scaled)
# Multi-GPU with Dask
import dask_cudf
# Create Dask GPU DataFrame
ddf = dask_cudf.read_csv('huge_dataset_*.csv')
# Operations distributed across GPUs
result = ddf.groupby('category').agg({'value': 'mean'}).compute()
# cuGraph for network analysis
import cugraph
# Create graph on GPU
G = cugraph.Graph()
edges = cudf.DataFrame({
    'src': [0, 1, 2, 3],
    'dst': [1, 2, 3, 0],
    'weight': [1.0, 2.0, 1.5, 0.5]
})
G.from_cudf_edgelist(edges, source='src', destination='dst', edge_attr='weight')
# PageRank on GPU
pagerank = cugraph.pagerank(G)
# Shortest paths on GPU
paths = cugraph.sssp(G, source=0)
Avoid .to_pandas() until you need final results: chain cuDF -> cuML -> cuGraph operations without CPU transfers for maximum speed.
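For instance, a filter, feature selection, and clustering fit can all stay on the GPU, with a single transfer at the end. A minimal sketch (the file and column names are placeholders):
import cudf
from cuml.cluster import KMeans
gdf = cudf.read_csv('events.csv')                 # decoded straight into GPU memory
feats = gdf[gdf['value'] > 0][['x', 'y']]         # filter + select, still on GPU
labels = KMeans(n_clusters=8).fit_predict(feats)  # cuML consumes cuDF directly
result = labels.to_pandas()                       # one CPU transfer, at the very end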
Convert between RAPIDS, PyTorch, and TensorFlow without data copies: the Arrow columnar data already lives on the GPU, and DLPack or the CUDA array interface hands the device buffers across frameworks.
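A minimal sketch of the PyTorch direction, assuming a recent PyTorch and CuPy (both speak the DLPack protocol):
import cudf
import cupy as cp
import torch
s = cudf.Series([1.0, 2.0, 3.0])
t = torch.from_dlpack(s.values)  # .values is a CuPy view; DLPack hands the device buffer to PyTorch
arr = cp.from_dlpack(t * 2)      # and back again, still without copying through the host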
Single large operations are faster than many small ones. Combine filters and aggregations.
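A sketch of what that fusion looks like, on a small stand-in for the million-row DataFrame from the cuDF section:
import cudf
import numpy as np
gdf = cudf.DataFrame({'a': np.arange(1000), 'b': np.random.randn(1000),
                      'c': np.random.choice(['X', 'Y', 'Z'], 1000), 'd': np.random.randn(1000)})
# Slower: several intermediate DataFrames, one pass over the data each
tmp = gdf[gdf['b'] > 0]
tmp = tmp[tmp['a'] < 500]
out = tmp.groupby('c')['d'].mean()
# Faster: one combined filter, then a single aggregation pass
out = gdf[(gdf['b'] > 0) & (gdf['a'] < 500)].groupby('c')['d'].mean()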
dask-cudf automatically manages memory and scales across multiple GPUs. Essential for 100GB+ datasets.
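A minimal multi-GPU setup, assuming the dask-cuda package is installed alongside dask-cudf (file and column names are placeholders):
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf
cluster = LocalCUDACluster()  # one Dask worker per visible GPU
client = Client(cluster)
ddf = dask_cudf.read_csv('huge_dataset_*.csv')  # partitions spread across the workers
result = ddf.groupby('category').agg({'value': 'mean'}).compute()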
cuDF string operations are highly optimized. Use them instead of applying Python functions with .apply().
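For example (illustrative data), the vectorized .str accessor runs as GPU kernels, while row-wise Python does not:
import cudf
gdf = cudf.DataFrame({'text': ['foo bar', 'baz qux']})
gdf['prefix'] = gdf['text'].str.slice(0, 3).str.upper()  # GPU string kernels, one vectorized call
# Avoid: gdf['text'].apply(lambda s: s[:3].upper()) -- row-wise Python string
# UDFs are at best far slower and often unsupported on the GPU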
Use the RMM (RAPIDS Memory Manager) pool allocator to reduce allocation overhead, and keep an eye on GPU memory usage.
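A minimal sketch (the pool size is illustrative; reinitialize before any GPU allocations are made):
import rmm
rmm.reinitialize(pool_allocator=True, initial_pool_size=2**30)  # pre-allocate a 1 GiB pool
import cudf  # subsequent cuDF allocations come out of the pool
df = cudf.DataFrame({'a': range(1000)})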
| Task | Speedup | Notes |
|---|---|---|
| GroupBy aggregation (100M rows) | 50x | vs pandas on CPU |
| Random Forest training (1M samples) | 25x | vs scikit-learn |
| PageRank (10M edges) | 100x | vs NetworkX |
| CSV reading (10GB file) | 20x | vs pandas |
**Is cuDF compatible with pandas?** Mostly yes: cuDF implements most pandas operations with the same API, though some advanced features differ. Check the API docs for compatibility; in practice the large majority of pandas code works as-is.
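RAPIDS also ships the cudf.pandas accelerator mode, which keeps plain pandas imports and transparently runs supported operations on the GPU, falling back to CPU pandas for the rest (the script name below is a placeholder):
python -m cudf.pandas my_script.py  # run an unmodified pandas script on the GPU
# or in Jupyter: %load_ext cudf.pandas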
**How do I share data with PyTorch without copying?** Take a CuPy view of the column with .values (zero-copy for numeric, null-free data), then hand it to torch.from_dlpack or torch.as_tensor, both of which understand CUDA device arrays; see the interop sketch above.
**Can RAPIDS scale across multiple GPUs?** Yes: use dask-cudf for DataFrames and cuML's Dask estimators (cuml.dask) for ML. Each GPU processes a partition of the data, and Dask handles distribution and synchronization.
**How large a dataset can RAPIDS handle?** A single GPU is bounded by its memory (e.g. 80 GB on an A100); multi-GPU with Dask reaches terabytes. For datasets larger than GPU memory, spilling can move data to CPU RAM or disk.
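A minimal sketch of cuDF's built-in spilling, assuming a recent cuDF (it can also be enabled with the CUDF_SPILL=on environment variable):
import cudf
cudf.set_option("spill", True)  # move cold device buffers to host RAM when GPU memory fills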
**How does RAPIDS compare with other GPU libraries?**
- CuPy: lower-level NumPy-on-GPU; more control, but less data science focus.
- PyTorch: deep learning focused, with less support for general data manipulation.
- TensorFlow: an ML framework, not built for general data processing.