Memory coalescing is one of the most critical optimizations in CUDA programming. When threads in a warp access consecutive memory addresses, the GPU hardware combines these requests into a minimal number of memory transactions. A single coalesced access can achieve 10-20x higher bandwidth than scattered accesses. Understanding coalescing is essential because global memory bandwidth is often the primary bottleneck in CUDA applications. Modern GPUs can deliver 500+ GB/s of memory bandwidth, but only with properly coalesced access patterns. Non-coalesced patterns may achieve less than 50 GB/s, leaving 90% of available bandwidth unused.
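The difference is easiest to see in two copy kernels that differ only in their index arithmetic. This is a minimal sketch; the kernel names and the stride parameter are illustrative, not taken from any library.

// Each lane in a warp reads the element next to its neighbour's, so one
// 128-byte transaction can serve the whole warp.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}
// Each lane reads an element `stride` positions away from its neighbour's,
// so the warp's request scatters across many segments and many transactions.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n]; // illustrative scatter pattern
}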
Convert interleaved data (AoS) to separate arrays for each field. This ensures threads accessing the same field hit consecutive memory addresses.
// BAD: Array of Structures (AoS)
struct Particle { float x, y, z, w; };
Particle particles[N];
// Thread i accesses: particles[i].x (stride = 16 bytes)
// GOOD: Structure of Arrays (SoA)
struct ParticlesSoA {
float *x, *y, *z, *w;
};
ParticlesSoA particles;
// Thread i accesses: particles.x[i] (stride = 4 bytes, coalesced!)
// Kernel access pattern
__global__ void update_aos(Particle* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        p[i].x += 1.0f; // consecutive threads are 16 bytes apart: poor coalescing
}
__global__ void update_soa(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] += 1.0f; // consecutive threads are 4 bytes apart: perfect coalescing
}
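Setting up the SoA layout on the device just means one allocation per field; a minimal host-side sketch (error checking omitted, N is an assumed element count):

ParticlesSoA d_particles;
cudaMalloc(&d_particles.x, N * sizeof(float));
cudaMalloc(&d_particles.y, N * sizeof(float));
cudaMalloc(&d_particles.z, N * sizeof(float));
cudaMalloc(&d_particles.w, N * sizeof(float));
// Each field is now a plain float* that can be passed to a kernel
update_soa<<<(N + 255) / 256, 256>>>(d_particles.x, N);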
Ensure base addresses are aligned to 32, 64, or 128 bytes. Misaligned accesses may require multiple transactions even with consecutive thread access.

// Using cudaMalloc - automatically aligned to 256 bytes
float *d_array;
cudaMalloc(&d_array, N * sizeof(float));
// For manual alignment (host pinned memory)
void *ptr;
cudaMallocHost(&ptr, size + 256);
float *aligned_ptr = (float*)(((uintptr_t)ptr + 255) & ~(uintptr_t)255); // round up to a 256-byte boundary (uintptr_t is in <cstdint>)
// Access aligned segments
__global__ void aligned_access(float* data, int offset) {
    // Round the element offset up to a multiple of 32 floats (128 bytes), so
    // each warp starts on a transaction boundary (data itself is 256-byte
    // aligned when it comes from cudaMalloc)
    int aligned_offset = (offset + 31) & ~31;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = data[aligned_offset + idx]; // coalesced, aligned segment
}
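When several arrays are carved out of one large allocation, the same rule applies to each sub-array's starting offset; a minimal host-side sketch (the pool layout and the names are assumptions):

// Round a byte count up to the next multiple of `alignment` (a power of two)
size_t align_up(size_t bytes, size_t alignment) {
    return (bytes + alignment - 1) & ~(alignment - 1);
}

char *d_pool;
size_t bytes_a = N * sizeof(float);
size_t bytes_b = N * sizeof(float);
cudaMalloc(&d_pool, align_up(bytes_a, 128) + bytes_b);
float *d_a = (float*)d_pool;                            // 256-byte aligned by cudaMalloc
float *d_b = (float*)(d_pool + align_up(bytes_a, 128)); // still starts on a 128-byte boundary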
When data layout cannot be changed, use shared memory to transpose access patterns. Load with one pattern, store to shared memory, then access with a coalesced pattern.

// Tiled transpose: launch with a 32x32 thread block
__global__ void transpose_access(float* input, float* output,
                                 int width, int height) {
    __shared__ float tile[32][33]; // +1 column of padding avoids bank conflicts
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    // Coalesced read from global memory (row-wise)
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = input[y * width + x];
    __syncthreads();
    // Coalesced write to global memory (transposed)
    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;
    if (x < height && y < width)
        output[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
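A matching launch configuration for the kernel above might look like this (d_input, d_output, and the matrix dimensions are assumed to exist):

dim3 block(32, 32);              // one thread per tile element
dim3 grid((width + 31) / 32,     // tiles across the input's width
          (height + 31) / 32);   // tiles down the input's height
transpose_access<<<grid, block>>>(d_input, d_output, width, height);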
Use vector types (float2, float4, int4) to load multiple elements per thread in a single instruction. This increases memory throughput and instruction efficiency.

// Scalar access: 4 load instructions per thread
__global__ void scalar_load(float* in, float* out, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (i + 3 < n) {
        out[i]     = in[i]     * 2.0f;
        out[i + 1] = in[i + 1] * 2.0f;
        out[i + 2] = in[i + 2] * 2.0f;
        out[i + 3] = in[i + 3] * 2.0f;
    }
}
// Vector access: 1 load instruction per thread
__global__ void vector_load(float4* in, float4* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                // n counts float4 elements here
        float4 val = in[i];     // single 128-bit load
        val.x *= 2.0f; val.y *= 2.0f;
        val.z *= 2.0f; val.w *= 2.0f;
        out[i] = val;           // single 128-bit store
    }
}
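Existing float buffers can be reused for the vectorized kernel as long as the base pointer is 16-byte aligned (cudaMalloc guarantees 256-byte alignment) and any leftover elements are handled separately. A sketch, with d_in and d_out as assumed buffer names:

int n4 = n / 4;                               // number of complete float4 elements
vector_load<<<(n4 + 255) / 256, 256>>>(
    reinterpret_cast<float4*>(d_in),
    reinterpret_cast<float4*>(d_out), n4);
// Handle the remaining n % 4 floats with a scalar kernel or on the host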
Add padding to data structures to ensure natural alignment boundaries. This is especially important for 2D arrays where row width matters.

// Automatic padding with cudaMallocPitch
size_t pitch;
float* d_matrix;
cudaMallocPitch((void**)&d_matrix, &pitch, width * sizeof(float), height);
// Access with pitch
__global__ void access_pitched(float* data, size_t pitch,
                               int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // pitch is in bytes, so step through rows with a char* cast
        float* row = (float*)((char*)data + y * pitch);
        float val = row[x]; // coalesced access within the row
    }
}
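Filling the pitched allocation from a densely packed host array goes through cudaMemcpy2D, which applies the pitch row by row (h_matrix is an assumed host pointer):

cudaMemcpy2D(d_matrix, pitch,                  // destination and its pitch
             h_matrix, width * sizeof(float),  // source and its (packed) pitch
             width * sizeof(float), height,    // bytes per row, number of rows
             cudaMemcpyHostToDevice);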
Column sums of a row-major matrix are a classic coalescing trap. The natural-looking formulation, one thread block per column with its threads striding down the rows, makes every lane in a warp read the same column at a different row, so consecutive lanes touch addresses separated by the matrix width.

// BAD: one block per column, threads stride down the rows
#define BLOCK_SIZE 256
__global__ void sum_columns_bad(float* matrix, float* result,
                                int width, int height) {
    __shared__ float partial[BLOCK_SIZE];
    int col = blockIdx.x;   // one column per block
    int tid = threadIdx.x;
    float sum = 0.0f;
    for (int row = tid; row < height; row += BLOCK_SIZE) {
        // In any one iteration, neighbouring lanes hold row values that differ
        // by 1, so their addresses are width floats (width * 4 bytes) apart:
        // the warp's request scatters across up to 32 separate segments
        sum += matrix[row * width + col]; // strided across the warp
    }
    partial[tid] = sum;
    __syncthreads();
    // Parallel reduction of the per-thread partial sums
    for (int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) result[col] = partial[0];
}
// Memory access pattern for one warp in one loop iteration:
// Lane 0: matrix[row*width + col], Lane 1: matrix[(row+1)*width + col], ...
// Up to 32 transactions instead of a single 128-byte transaction
Mapping one thread to each column fixes this while keeping the column-sum semantics: in every loop iteration a warp reads 32 consecutive floats of the same row, which coalesces into a single transaction. A shared-memory tile achieves the same coalesced loads when a block-level formulation is preferred.

// GOOD: one thread per column, loop over rows
__global__ void sum_columns_good(float* matrix, float* result,
                                 int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < width) {
        float sum = 0.0f;
        for (int row = 0; row < height; row++) {
            // Lane 0 reads matrix[row*width + col], lane 1 reads the float
            // right next to it: 32 consecutive floats -> one 128-byte
            // transaction per warp per iteration
            sum += matrix[row * width + col];
        }
        result[col] = sum;
    }
}
// Also coalesced: tile-based approach with a shared-memory reduction,
// useful when the tile is reused for other per-block work
__global__ void sum_columns_tiled(float* matrix, float* result,
                                  int width, int height) {
    __shared__ float tile[32][33]; // 32x32 tile, +1 column to avoid bank conflicts
    int col = blockIdx.x * 32 + threadIdx.x;
    int row_base = blockIdx.y * 32;
    // Coalesced load: threads in a warp read consecutive columns of one row
    if (col < width && (row_base + threadIdx.y) < height)
        tile[threadIdx.y][threadIdx.x] =
            matrix[(row_base + threadIdx.y) * width + col];
    else
        tile[threadIdx.y][threadIdx.x] = 0.0f;
    __syncthreads();
    // Reduce each column of the tile in shared memory, then combine the
    // per-tile partial sums across blockIdx.y (e.g. atomicAdd into result[col])
    // ... reduction code
}

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Effective Bandwidth (GB/s) | 45 GB/s | 380 GB/s | 8.4x higher |
| Memory Transactions per Request | 32 transactions | 1 transaction | 32x fewer |
| L2 Cache Hit Rate | 12% | 85% | 7x better |
| Kernel Execution Time | 4.2 ms | 0.5 ms | 8.4x faster |
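Numbers like the effective-bandwidth row can be reproduced with CUDA events. A minimal timing sketch, using the copy kernel sketched at the top of the article; the buffer names, sizes, and the factor of 2 (one read plus one write per element) are assumptions for a copy-style kernel:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
copy_coalesced<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
double gb_per_s = 2.0 * n * sizeof(float) / (ms * 1.0e6); // bytes moved / seconds, in GB/s
printf("Effective bandwidth: %.1f GB/s\n", gb_per_s);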
A memory transaction is a single memory request from the GPU to global memory. Modern GPUs use 32-byte, 64-byte, or 128-byte transactions. When a warp (32 threads) accesses memory, the hardware tries to combine requests into the minimum number of transactions. Ideally, 32 threads accessing consecutive 4-byte floats results in a single 128-byte transaction.
Use NVIDIA Nsight Compute to profile your kernel. Look at the "Memory Workload Analysis" section. Key metrics include: Global Load/Store Efficiency (should be close to 100%), Transactions Per Request (should be close to 1), and L2 Cache Throughput. If efficiency is below 25%, you likely have coalescing issues.
Shared memory does not have coalescing requirements, but it has bank conflict issues instead. Shared memory is divided into 32 banks, and simultaneous access to the same bank by different threads causes serialization. The concepts are related but require different optimization strategies.
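The padding trick used in the transpose tile above is the standard fix. A minimal sketch of the difference; the kernel is purely illustrative and meant to be launched with a single 32-thread block:

__global__ void bank_conflict_demo(float* out) {
    __shared__ float conflicted[32][32];
    __shared__ float padded[32][33];
    int t = threadIdx.x;
    // Element [t][0] of a 32-wide array sits at offset t*32 floats, so all
    // 32 lanes map to bank 0: a 32-way conflict, serialized into 32 passes
    conflicted[t][0] = (float)t;
    // With 33 floats per row the offset is t*33, which cycles through all
    // 32 banks: conflict-free
    padded[t][0] = (float)t;
    __syncthreads();
    out[t] = conflicted[t][0] + padded[t][0];
}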
Modern GPUs (Volta and later) have unified L1/shared memory and L2 cache. While caching can help with repeated access to the same data, it does not fix fundamental coalescing problems. Non-coalesced accesses still generate multiple transactions and consume more cache bandwidth. Always optimize for coalescing first.
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.