RAPIDS is NVIDIA's suite of GPU-accelerated data science libraries: pandas-like DataFrames (cuDF), scikit-learn-like machine learning (cuML), and NetworkX-like graph analytics (cuGraph). Built on Apache Arrow and CUDA, RAPIDS enables end-to-end data science pipelines on the GPU. For CUDA developers working with data, RAPIDS provides familiar Python APIs with large speedups: a single GPU can process datasets that would otherwise require a distributed CPU cluster, with operations often running 10-100x faster than pandas/scikit-learn while using nearly the same syntax. This guide covers cuDF for data manipulation, cuML for machine learning, cuGraph for network analysis, and best practices for building GPU-accelerated data science workflows.
CUDA Integration: RAPIDS runs all operations through CUDA, building on NVIDIA math libraries such as cuBLAS for linear algebra and on NCCL for multi-GPU communication. All RAPIDS libraries share the Apache Arrow columnar format, enabling zero-copy data sharing between components and with other frameworks.
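To illustrate that zero-copy sharing: a numeric, null-free cuDF column can be viewed as a CuPy array, and a CuPy result handed back, without leaving device memory. A minimal sketch, assuming cudf and cupy are installed:
import cudf
import cupy as cp
s = cudf.Series([1.0, 2.0, 3.0])
arr = s.values                  # CuPy view of the column's device buffer (no copy for numeric, null-free data)
s2 = cudf.Series(cp.sqrt(arr))  # CuPy result back into cuDF, no host round-trip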
Install RAPIDS using conda for best compatibility.
# Install via conda (recommended)
conda create -n rapids-env -c rapidsai -c conda-forge -c nvidia \
    rapids=24.10 python=3.11 cuda-version=12.0
conda activate rapids-env
# Or install specific packages
conda install -c rapidsai -c conda-forge -c nvidia \
    cudf=24.10 cuml=24.10 cugraph=24.10
# Verify installation
python -c "import cudf; print(f'cuDF {cudf.__version__}')"
python -c "import cuml; print(f'cuML {cuml.__version__}')"
# Check GPU
python -c "import cudf; df = cudf.DataFrame({'a': [1,2,3]}); print(df)"Use cuDF for GPU-accelerated pandas operations.
import cudf
import pandas as pd
import numpy as np
# Create GPU DataFrame - just like pandas
gdf = cudf.DataFrame({
    'a': np.arange(1000000),
    'b': np.random.randn(1000000),
    'c': np.random.choice(['X', 'Y', 'Z'], 1000000)
})
# All pandas operations work on GPU
gdf['d'] = gdf['a'] * gdf['b']
grouped = gdf.groupby('c').agg({'a': 'mean', 'b': 'sum', 'd': 'max'})
filtered = gdf[gdf['b'] > 0]
# Convert between pandas and cuDF
pdf = gdf.to_pandas() # GPU -> CPU
gdf2 = cudf.from_pandas(pdf) # CPU -> GPU
# Read CSV directly to GPU
gdf = cudf.read_csv('data.csv')
# SQL-style queries: cuDF has no built-in SQL engine, but .query() gives
# expression-based filtering, combined here with a groupby aggregation
result = gdf.query('a > 1000').groupby('c').agg({'b': 'mean'})
# Join operations - much faster than pandas
left = cudf.DataFrame({'key': range(1000000), 'val1': range(1000000)})
right = cudf.DataFrame({'key': range(500000), 'val2': range(500000)})
merged = left.merge(right, on='key', how='inner')
# String operations on GPU
gdf = cudf.DataFrame({'text': ['hello', 'world', 'GPU', 'acceleration']})
gdf['upper'] = gdf['text'].str.upper()
gdf['contains'] = gdf['text'].str.contains('GPU')
# DateTime operations
gdf = cudf.DataFrame({
    'date': cudf.date_range('2024-01-01', periods=1000000, freq='1min')
})
gdf['year'] = gdf['date'].dt.year
gdf['month'] = gdf['date'].dt.month
Build a complete GPU-accelerated machine learning pipeline.
import cudf
import cuml
import numpy as np
from cuml.ensemble import RandomForestClassifier
from cuml.model_selection import train_test_split
from cuml.preprocessing import StandardScaler
from cuml.metrics import accuracy_score
# Load data to GPU
df = cudf.read_csv('large_dataset.csv')
# Feature engineering on GPU
df['feature_interaction'] = df['feature1'] * df['feature2']
df['log_feature'] = np.log1p(df['skewed_feature'])  # NumPy ufuncs on cuDF columns dispatch to the GPU
# Encode categorical variables
from cuml.preprocessing import LabelEncoder
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])
# Split data (stays on GPU)
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Scale features on GPU
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train Random Forest on GPU
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    n_bins=128,  # GPU-specific: histogram bins used when searching splits
    max_features=1.0
)
rf.fit(X_train_scaled, y_train)
# Predict on GPU
predictions = rf.predict(X_test_scaled)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.4f}")
# K-Means clustering on GPU
from cuml.cluster import KMeans
kmeans = KMeans(n_clusters=5)
clusters = kmeans.fit_predict(X_train_scaled)
# PCA dimensionality reduction on GPU
from cuml.decomposition import PCA
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_train_scaled)
# Multi-GPU with Dask
import dask_cudf
# Create Dask GPU DataFrame
ddf = dask_cudf.read_csv('huge_dataset_*.csv')
# Operations distributed across GPUs
result = ddf.groupby('category').agg({'value': 'mean'}).compute()
# cuGraph for network analysis
import cugraph
# Create graph on GPU
G = cugraph.Graph()
edges = cudf.DataFrame({
    'src': [0, 1, 2, 3],
    'dst': [1, 2, 3, 0],
    'weight': [1.0, 2.0, 1.5, 0.5]
})
G.from_cudf_edgelist(edges, source='src', destination='dst', edge_attr='weight')
# PageRank on GPU
pagerank = cugraph.pagerank(G)
# Shortest paths on GPU
paths = cugraph.sssp(G, source=0)
Avoid .to_pandas() until you need final results: chain cuDF -> cuML -> cuGraph operations without CPU transfers for maximum speed.
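For instance, a filter, feature selection, and clustering fit can all stay on the GPU, with a single transfer at the end. A minimal sketch (the file and column names are placeholders):
import cudf
from cuml.cluster import KMeans
gdf = cudf.read_csv('events.csv')                 # decoded straight into GPU memory
feats = gdf[gdf['value'] > 0][['x', 'y']]         # filter + select, still on GPU
labels = KMeans(n_clusters=8).fit_predict(feats)  # cuML consumes cuDF directly
result = labels.to_pandas()                       # one CPU transfer, at the very end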
Convert between RAPIDS, PyTorch, and TensorFlow without data copies: the Arrow columnar data already lives on the GPU, and DLPack or the CUDA array interface hands the device buffers across frameworks.
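A minimal sketch of the PyTorch direction, assuming a recent PyTorch and CuPy (both speak the DLPack protocol):
import cudf
import cupy as cp
import torch
s = cudf.Series([1.0, 2.0, 3.0])
t = torch.from_dlpack(s.values)  # .values is a CuPy view; DLPack hands the device buffer to PyTorch
arr = cp.from_dlpack(t * 2)      # and back again, still without copying through the host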
Single large operations are faster than many small ones. Combine filters and aggregations.
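A sketch of what that fusion looks like, on a small stand-in for the million-row DataFrame from the cuDF section:
import cudf
import numpy as np
gdf = cudf.DataFrame({'a': np.arange(1000), 'b': np.random.randn(1000),
                      'c': np.random.choice(['X', 'Y', 'Z'], 1000), 'd': np.random.randn(1000)})
# Slower: several intermediate DataFrames, one pass over the data each
tmp = gdf[gdf['b'] > 0]
tmp = tmp[tmp['a'] < 500]
out = tmp.groupby('c')['d'].mean()
# Faster: one combined filter, then a single aggregation pass
out = gdf[(gdf['b'] > 0) & (gdf['a'] < 500)].groupby('c')['d'].mean()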
dask-cudf automatically manages memory and scales across multiple GPUs. Essential for 100GB+ datasets.
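A minimal multi-GPU setup, assuming the dask-cuda package is installed alongside dask-cudf (file and column names are placeholders):
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf
cluster = LocalCUDACluster()  # one Dask worker per visible GPU
client = Client(cluster)
ddf = dask_cudf.read_csv('huge_dataset_*.csv')  # partitions spread across the workers
result = ddf.groupby('category').agg({'value': 'mean'}).compute()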
cuDF string operations are highly optimized. Use them instead of applying Python functions with .apply().
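For example (illustrative data), the vectorized .str accessor runs as GPU kernels, while row-wise Python does not:
import cudf
gdf = cudf.DataFrame({'text': ['foo bar', 'baz qux']})
gdf['prefix'] = gdf['text'].str.slice(0, 3).str.upper()  # GPU string kernels, one vectorized call
# Avoid: gdf['text'].apply(lambda s: s[:3].upper()) -- row-wise Python string
# UDFs are at best far slower and often unsupported on the GPU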
Use the RMM (RAPIDS Memory Manager) pool allocator to reduce allocation overhead, and keep an eye on GPU memory usage.
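A minimal sketch (the pool size is illustrative; reinitialize before any GPU allocations are made):
import rmm
rmm.reinitialize(pool_allocator=True, initial_pool_size=2**30)  # pre-allocate a 1 GiB pool
import cudf  # subsequent cuDF allocations come out of the pool
df = cudf.DataFrame({'a': range(1000)})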
| Task | Speedup | Notes |
|---|---|---|
| GroupBy aggregation (100M rows) | 50x | vs pandas on CPU |
| Random Forest training (1M samples) | 25x | vs scikit-learn |
| PageRank (10M edges) | 100x | vs NetworkX |
| CSV reading (10GB file) | 20x | vs pandas |
**Is cuDF compatible with pandas?** Mostly yes: cuDF implements most pandas operations with the same API, though some advanced features differ. Check the API docs for compatibility; in practice the large majority of pandas code works as-is.
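RAPIDS also ships the cudf.pandas accelerator mode, which keeps plain pandas imports and transparently runs supported operations on the GPU, falling back to CPU pandas for the rest (the script name below is a placeholder):
python -m cudf.pandas my_script.py  # run an unmodified pandas script on the GPU
# or in Jupyter: %load_ext cudf.pandas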
**How do I share data with PyTorch without copying?** Take a CuPy view of the column with .values (zero-copy for numeric, null-free data), then hand it to torch.from_dlpack or torch.as_tensor, both of which understand CUDA device arrays; see the interop sketch above.
**Can RAPIDS scale across multiple GPUs?** Yes: use dask-cudf for DataFrames and cuML's Dask estimators (cuml.dask) for ML. Each GPU processes a partition of the data, and Dask handles distribution and synchronization.
**How large a dataset can RAPIDS handle?** A single GPU is bounded by its memory (e.g. 80 GB on an A100); multi-GPU with Dask reaches terabytes. For datasets larger than GPU memory, spilling can move data to CPU RAM or disk.
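A minimal sketch of cuDF's built-in spilling, assuming a recent cuDF (it can also be enabled with the CUDF_SPILL=on environment variable):
import cudf
cudf.set_option("spill", True)  # move cold device buffers to host RAM when GPU memory fills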
**How does RAPIDS compare with other GPU libraries?**
- CuPy: lower-level NumPy-on-GPU; more control, but less data science focus.
- PyTorch: deep learning focused, with less support for general data manipulation.
- TensorFlow: an ML framework, not built for general data processing.