Scatter and gather operations move data between positions based on index tensors. Gather reads from indexed positions (like an embedding lookup), while scatter writes to indexed positions (like sparse gradient updates). Graph neural networks lean heavily on both for message passing. This guide covers efficient implementations of both operations, with a focus on avoiding race conditions in scatter.
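Gather is the simpler of the two: each thread reads an indexed row and writes to its own output row, so there are no write conflicts. A minimal sketch, assuming a row-major [n, dim] layout (the kernel name and signature are illustrative, not from the original):

```cuda
// Gather: dst[i] = src[indices[i]] for each of the n output rows.
__global__ void gather_rows(const float* src, const int* indices, float* dst,
                            int n, int dim) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    int source = indices[idx];                        // row to read, e.g. an embedding id
    for (int d = 0; d < dim; d++) {
        dst[idx * dim + d] = src[source * dim + d];   // each thread owns dst row idx: no races
    }
}
```

Because every thread writes a distinct output row, no atomics are needed.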
The scatter side is where the optimization effort goes:

- Sort indices, then process segments without atomics.
- Use sorting when you need reproducible results, since the order of atomic float additions varies from run to run.
- Use float4 for consecutive accesses along the feature dimension (see the vectorized sketch after this list).
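On the float4 point: when dim is a multiple of 4 and rows are 16-byte aligned, the inner loop can move four floats per memory transaction. A sketch of a vectorized gather under those assumptions (the kernel name and the dim4 parameter are mine):

```cuda
// Vectorized gather: dim4 = dim / 4, with src and dst reinterpreted as float4*.
__global__ void gather_rows_vec4(const float4* src, const int* indices, float4* dst,
                                 int n, int dim4) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    int source = indices[idx];
    for (int d = 0; d < dim4; d++) {
        dst[idx * dim4 + d] = src[source * dim4 + d];  // 128-bit loads and stores
    }
}
```

The same trick vectorizes the src reads in the scatter kernels below; atomicAdd itself still operates on individual floats.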
Atomics are correct but serialize when indices collide.
```cuda
// Scatter with atomics - correct but slow
__global__ void scatter_add_atomic(float* src, int* indices, float* dst,
                                   int n, int dim) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    int target = indices[idx];
    for (int d = 0; d < dim; d++) {
        atomicAdd(&dst[target * dim + d], src[idx * dim + d]);
    }
}
```
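A typical launch assigns one thread per source row; d_src, d_indices, and d_dst below stand in for device buffers you have already allocated (the names are illustrative):

```cuda
// One thread per source row; dst should be zero-initialized (or hold the running sum).
int block = 256;
int grid  = (n + block - 1) / block;
scatter_add_atomic<<<grid, block>>>(d_src, d_indices, d_dst, n, dim);
```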
Sorting enables a segment-based reduction without atomics: once the indices are sorted, every write to a given target row sits in one contiguous segment.

```cuda
// Sort indices, then process segments:
//   Step 1: Sort (index, position) pairs by index
//   Step 2: Find segment boundaries
//   Step 3: Reduce each segment
#include <cub/cub.cuh>
#include <thrust/device_ptr.h>
#include <thrust/sequence.h>

__global__ void scatter_segment_reduce(float* src, int* sorted_idx,
                                       int* perm, float* dst, int n, int dim);

void scatter_add_sorted(float* src, int* indices, float* dst, int n, int dim) {
    // Sort outputs plus an identity permutation 0..n-1 used as the sort values
    int *sorted_indices, *permutation, *iota;
    cudaMalloc(&sorted_indices, n * sizeof(int));
    cudaMalloc(&permutation, n * sizeof(int));
    cudaMalloc(&iota, n * sizeof(int));
    thrust::sequence(thrust::device_ptr<int>(iota), thrust::device_ptr<int>(iota) + n);

    // CUB two-phase pattern: the first call sizes the temp buffer, the second sorts
    void* temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortPairs(temp, temp_bytes, indices, sorted_indices,
                                    iota, permutation, n);
    cudaMalloc(&temp, temp_bytes);
    cub::DeviceRadixSort::SortPairs(temp, temp_bytes, indices, sorted_indices,
                                    iota, permutation, n);

    // Equal targets are now adjacent and consecutive threads touch consecutive memory,
    // so each segment can be reduced without atomics
    int block = 256, grid = (n + block - 1) / block;
    scatter_segment_reduce<<<grid, block>>>(src, sorted_indices, permutation, dst, n, dim);

    cudaFree(temp); cudaFree(iota); cudaFree(sorted_indices); cudaFree(permutation);
}
```
```cuda
__global__ void scatter_segment_reduce(float* src, int* sorted_idx,
                                       int* perm, float* dst, int n, int dim) {
    // Identify segment boundaries
    // Reduce within segment using warp shuffle
    // Single thread per segment writes result
}
```
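The outline above leaves the body unimplemented. A minimal working version, shown purely as a sketch, lets the first thread of each run of equal sorted indices walk its segment serially; a tuned kernel would use warp shuffles and assign whole warps or blocks to large segments, as the comments suggest.

```cuda
// Simplified segment reduce: one thread per segment, serial walk over the run.
__global__ void scatter_segment_reduce(float* src, int* sorted_idx,
                                       int* perm, float* dst, int n, int dim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // The owner of a segment is the first position in a run of equal indices
    if (i >= n || (i > 0 && sorted_idx[i] == sorted_idx[i - 1])) return;
    int target = sorted_idx[i];
    for (int d = 0; d < dim; d++) {
        float sum = 0.0f;
        for (int j = i; j < n && sorted_idx[j] == target; j++)
            sum += src[perm[j] * dim + d];   // perm maps sorted position back to source row
        dst[target * dim + d] += sum;        // one uncontended write per segment
    }
}
```

Because exactly one thread writes each target row, no atomics are needed, and the additions happen in a deterministic order, which is what makes the sorted path reproducible.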
| Metric | Atomic (naive) | Sorted (optimized) | Improvement |
|---|---|---|---|
| Scatter throughput | 15 GB/s | 120 GB/s | 8x at high collision rates |
The atomic path is simpler and works well at low collision rates (under roughly 5% duplicate indices). The sorted path wins under heavy collisions (embedding gradients, GNN message passing) but carries the O(n log n) sort overhead.
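If one entry point has to cover both regimes, a small host-side dispatcher can pick the path from a collision estimate. The helper below is hypothetical: num_targets is the number of distinct destination rows, and 1 - num_targets/n is a lower bound on the duplicate fraction whenever n exceeds num_targets.

```cuda
// Hypothetical dispatcher: choose atomic vs. sorted scatter-add by estimated collisions.
void scatter_add(float* src, int* indices, float* dst,
                 int n, int dim, int num_targets) {
    // If n source rows map into num_targets rows, at least n - num_targets are duplicates.
    float duplicate_lower_bound = (n > num_targets)
        ? 1.0f - (float)num_targets / (float)n : 0.0f;
    if (duplicate_lower_bound < 0.05f) {
        int block = 256, grid = (n + block - 1) / block;
        scatter_add_atomic<<<grid, block>>>(src, indices, dst, n, dim);
    } else {
        scatter_add_sorted(src, indices, dst, n, dim);  // sort cost repaid by avoided contention
    }
}
```

The 5% threshold mirrors the guideline above; in practice you would tune it per GPU and per workload.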
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.