Loading...
Concatenation joins tensors along a specified dimension. While conceptually simple, efficient implementation minimizes memory copies and maximizes bandwidth utilization through coalesced access.
Ensure threads write to consecutive memory.
// Concat along batch dimension - simple memcpy
__global__ void concat_dim0(float** inputs, int* sizes, float* output,
int n_tensors, int inner_size) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
int total = 0;
for (int t = 0; t < n_tensors; t++) total += sizes[t];
if (tid < total * inner_size) {
// Find which tensor and offset
int remaining = tid;
for (int t = 0; t < n_tensors; t++) {
int tensor_size = sizes[t] * inner_size;
if (remaining < tensor_size) {
output[tid] = inputs[t][remaining];
break;
}
remaining -= tensor_size;
}
}
}Sequential memcpy for each tensor.
void concat_naive(float* a, float* b, float* out, int na, int nb) {
cudaMemcpy(out, a, na * sizeof(float), cudaMemcpyDeviceToDevice);
cudaMemcpy(out + na, b, nb * sizeof(float), cudaMemcpyDeviceToDevice);
}Single kernel for many tensors avoids launch overhead.
// For many small tensors, single kernel is faster
__global__ void concat_batched(float** inputs, int* offsets,
float* output, int n_tensors, int total) {
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid >= total) return;
// Binary search to find source tensor
int t = 0;
while (t < n_tensors - 1 && tid >= offsets[t + 1]) t++;
output[tid] = inputs[t][tid - offsets[t]];
}
// Alternative: use thrust::gather for flexibility
thrust::gather(indices.begin(), indices.end(),
all_data.begin(), output.begin());| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Throughput (10x1M tensors) | 12 GB/s | 380 GB/s | 32x faster |
Concat joins along existing dim (sizes add up). Stack creates new dim (all inputs same size). Stack = unsqueeze then concat.
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.