Cooperative Groups provide flexible, composable thread synchronization beyond the traditional `__syncthreads()`. They enable grid-wide synchronization, dynamic group partitioning, and cleaner modular code, and they are essential for advanced algorithms that require global barriers. This guide covers the cooperative groups API, common patterns, and performance considerations.
- Create fixed-size tile groups for modular synchronization.
- Group only the active threads after a branch.
- Synchronize the entire grid for multi-phase algorithms.

For comparison, here is traditional block-level synchronization with `__syncthreads()`:
```cuda
// Traditional block synchronization
__global__ void kernel_traditional(float* data, int n) {
    __shared__ float sdata[256];
    int tid = threadIdx.x;

    // Phase 1: load into shared memory
    sdata[tid] = data[blockIdx.x * blockDim.x + tid];
    __syncthreads(); // Traditional sync

    // Phase 2: first reduction step
    if (tid < 128) sdata[tid] += sdata[tid + 128];
    __syncthreads();

    // More phases...
}
```

Cooperative groups enable flexible synchronization and grid-wide barriers:
```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h> // required for cg::reduce (CUDA 11+)
namespace cg = cooperative_groups;

__global__ void kernel_coop_groups(float* data, int n) {
    // Get the thread block group
    cg::thread_block block = cg::this_thread_block();

    // Create a warp-sized tile
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

    // Warp shuffle using the cooperative groups API
    int lane = warp.thread_rank();
    float val = data[blockIdx.x * blockDim.x + threadIdx.x];

    // Warp-level reduction
    for (int offset = 16; offset > 0; offset /= 2) {
        val += warp.shfl_down(val, offset);
    }

    // Block sync
    block.sync();

    // Coalesced group for active threads only
    if (threadIdx.x < n) {
        cg::coalesced_group active = cg::coalesced_threads();
        // Only the active threads participate in the reduction
        float sum = cg::reduce(active, val, cg::plus<float>());
    }
}
```
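Tiles also compose well with device functions: a helper written against a generic tile type can be reused at warp or sub-warp width. The sketch below is illustrative and not from the original (`tile_sum`, `kernel_tiles`, and the `partial` output buffer are assumed names); it partitions a block into eight-thread tiles and reduces within each.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Illustrative helper: a shuffle reduction that works for any tile type
// whose size is a power of two, so it can be reused at warp or sub-warp width.
template <typename Tile>
__device__ float tile_sum(Tile g, float v) {
    for (int offset = g.size() / 2; offset > 0; offset /= 2) {
        v += g.shfl_down(v, offset);
    }
    return v; // full sum is valid in lane 0 of each tile
}

__global__ void kernel_tiles(const float* data, float* partial) {
    cg::thread_block block = cg::this_thread_block();
    auto tile8 = cg::tiled_partition<8>(block); // eight-thread tiles

    float v = data[blockIdx.x * blockDim.x + threadIdx.x];
    float s = tile_sum(tile8, v);

    if (tile8.thread_rank() == 0) {
        // meta_group_rank() numbers the tiles within the block
        int tiles_per_block = blockDim.x / 8;
        partial[blockIdx.x * tiles_per_block + tile8.meta_group_rank()] = s;
    }
}
```

Because `tile_sum` is templated on the group type, the same code services `tiled_partition<32>` or `tiled_partition<16>` without modification.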
```cuda
// Grid-wide synchronization
__global__ void grid_sync_kernel(float* data, int n) {
    cg::grid_group grid = cg::this_grid();

    // Phase 1: all blocks process their portion
    int idx = grid.thread_rank();
    if (idx < n) data[idx] *= 2.0f;

    // Grid-wide barrier - ALL blocks wait here
    grid.sync();

    // Phase 2: all data is now updated, so threads can
    // safely read neighbors written by other blocks
}
```
```cuda
// Launch with the cooperative kernel API
// (assumes d_data, n, and num_sms are defined elsewhere)
void launch_grid_sync() {
    int numBlocksPerSm = 0;
    // Maximum resident blocks per SM for this kernel at 256 threads/block
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm,
                                                  grid_sync_kernel, 256, 0);
    void* args[] = { &d_data, &n };
    // The grid must fit entirely on the device for grid.sync() to work
    cudaLaunchCooperativeKernel((void*)grid_sync_kernel,
                                numBlocksPerSm * num_sms, 256, args, 0, 0);
}
```

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Grid sync overhead | N/A | ~10 μs | Enables new algorithms |
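Cooperative launch only works on devices that advertise support for it, so it is worth checking before calling `cudaLaunchCooperativeKernel`. A minimal host-side sketch (the helper name and error handling are illustrative, not from the original):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns the SM count if cooperative launch is supported, 0 otherwise
int cooperative_sm_count(int device) {
    int supported = 0;
    cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, device);
    if (!supported) {
        fprintf(stderr, "Device %d does not support cooperative launch\n", device);
        return 0;
    }
    int num_sms = 0;
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);
    return num_sms;
}
```

This also shows where a `num_sms` value like the one used in the launch code above would typically come from.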
Grid sync is useful for iterative algorithms (Jacobi, conjugate gradient) where each phase must complete everywhere before the next begins. By keeping the whole iteration loop inside a single kernel, it avoids the overhead of launching a new kernel per phase.
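A single-kernel Jacobi sweep using `grid.sync()` can be sketched as follows (`jacobi_1d` and its parameters are illustrative, not from the original, and the kernel must be launched with `cudaLaunchCooperativeKernel`); without the grid barrier, every sweep would need its own kernel launch:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical 1D Jacobi smoother: each sweep averages the two neighbors,
// with a grid-wide barrier between sweeps so all writes complete
// before the buffers are reused.
__global__ void jacobi_1d(float* in, float* out, int n, int iters) {
    cg::grid_group grid = cg::this_grid();
    int i = grid.thread_rank();

    for (int it = 0; it < iters; ++it) {
        if (i > 0 && i < n - 1) {
            out[i] = 0.5f * (in[i - 1] + in[i + 1]);
        } else if (i < n) {
            out[i] = in[i]; // carry boundary values forward
        }
        grid.sync(); // every block finishes this sweep before the next reads

        // Swap the per-thread pointer copies; identical in all threads,
        // so no extra synchronization is needed for the swap itself
        float* tmp = in; in = out; out = tmp;
    }
}
```

One `grid.sync()` per sweep is sufficient: the barrier guarantees that all writes from the current sweep are visible before any thread begins reading in the next one.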
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.