Embedding lookups are memory-bound operations that index into large tables. For language models with vocabularies of 50K+ tokens, efficient embedding access is critical. The challenge is the random access pattern: token ids act as arbitrary row indices, which defeats both caching and coalescing. This guide covers memory layout optimization, batched lookups, and sparse gradient techniques for training.
- Parallelize across the embedding dimension for coalesced reads.
- Sort indices to improve cache locality for repeated tokens (a Thrust sketch follows this list).
- Only update embeddings that were accessed in the forward pass (see the sparse-gradient sketch after the benchmark table).
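Index sorting is a preprocessing step rather than part of the lookup kernel itself. Below is a minimal sketch using Thrust; the function name, the `d_positions` permutation buffer, and the choice to sort in place are assumptions for illustration, not a fixed API.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// Sort token indices so that duplicate tokens become adjacent. A permutation
// array records each index's original position, so gathered rows can still be
// scattered back to the correct slots of the output tensor.
void sort_indices_for_locality(thrust::device_vector<int>& d_indices,
                               thrust::device_vector<int>& d_positions) {
    d_positions.resize(d_indices.size());
    thrust::sequence(d_positions.begin(), d_positions.end());  // 0, 1, 2, ...
    // After sorting, repeated token ids are contiguous, so the second and
    // later reads of the same table row tend to hit L2 instead of DRAM.
    thrust::sort_by_key(d_indices.begin(), d_indices.end(), d_positions.begin());
}
```

A lookup kernel consuming the sorted indices would write row `i` to `output + d_positions[i] * embed_dim`, so downstream layers still see tokens in their original order.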
A naive kernel assigns one thread per token and walks the embedding dimension sequentially; neighboring threads then read addresses an entire row apart, so the loads are uncoalesced and memory bandwidth is wasted.
```cuda
__global__ void embedding_naive(float* table, int* indices, float* output,
                                int num_indices, int embed_dim) {
    // One thread per token: each thread copies its entire embedding row.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= num_indices) return;
    int token_id = indices[idx];
    // Threads in a warp read rows that are embed_dim floats apart,
    // so these loads cannot be coalesced.
    for (int d = 0; d < embed_dim; d++) {
        output[idx * embed_dim + d] = table[token_id * embed_dim + d];
    }
}
```

Parallelizing across the embedding dimension instead enables coalesced memory access: consecutive threads read consecutive floats of the same row.
```cuda
__global__ void embedding_coalesced(float* table, int* indices, float* output,
                                    int num_indices, int embed_dim) {
    int token_idx = blockIdx.x;
    int dim_idx = threadIdx.x;
    if (token_idx >= num_indices) return;
    int token_id = indices[token_idx];
    float* src = table + token_id * embed_dim;
    float* dst = output + token_idx * embed_dim;
    // Coalesced read: threads read consecutive dimensions
    for (int d = dim_idx; d < embed_dim; d += blockDim.x) {
        dst[d] = src[d];
    }
}
// Launch: embedding_coalesced<<<num_indices, 256>>>(...)
```

| Metric | Naive | Optimized | Improvement |
|---|---|---|---|
| Throughput (GB/s) | 120 | 480 | 4x |
| Latency per batch | 85μs | 22μs | 3.9x |
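
For training, the backward pass only needs to touch rows that actually appeared in the batch, as noted in the list above. Below is a minimal sketch of a scatter-add gradient kernel that mirrors the coalesced forward layout; the name `embedding_backward_sparse` and the assumption that `grad_table` is zero-initialized before the call are illustrative, not taken from any particular framework.

```cuda
// One block per token, threads striding over the embedding dimension, the same
// mapping as embedding_coalesced. atomicAdd resolves repeated tokens in the
// batch that accumulate into the same table row.
__global__ void embedding_backward_sparse(float* grad_table, const int* indices,
                                          const float* grad_output,
                                          int num_indices, int embed_dim) {
    int token_idx = blockIdx.x;
    int dim_idx = threadIdx.x;
    if (token_idx >= num_indices) return;
    int token_id = indices[token_idx];
    const float* src = grad_output + token_idx * embed_dim;
    float* dst = grad_table + token_id * embed_dim;
    for (int d = dim_idx; d < embed_dim; d += blockDim.x) {
        atomicAdd(&dst[d], src[d]);  // only rows seen in the forward pass are written
    }
}
// Launch mirrors the forward kernel:
// embedding_backward_sparse<<<num_indices, 256>>>(grad_table, indices, grad_output,
//                                                 num_indices, embed_dim);
```

Sorting the indices first (as sketched earlier) also helps here: atomics targeting the same row arrive close together in time and are more likely to be serviced from L2.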
For embedding tables too large to fit comfortably in a single GPU's memory, use NVIDIA Merlin or a custom implementation built around: (1) CPU offloading with prefetching, (2) sharding the embedding table across multiple GPUs, and (3) hash embeddings for compression.
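Of these, hash embeddings are the easiest to illustrate: the token id is hashed into a table with far fewer rows than the vocabulary, trading occasional collisions for a much smaller memory footprint. A minimal sketch follows; the multiplicative hash, the `table_rows` parameter, and the kernel name are assumptions for illustration, not Merlin's API.

```cuda
// Map a token id to a row of a compressed table; colliding tokens share a row.
__device__ inline int hash_token(int token_id, int table_rows) {
    unsigned int h = (unsigned int)token_id * 2654435761u;  // Knuth multiplicative hash
    return (int)(h % (unsigned int)table_rows);
}

__global__ void embedding_hashed(const float* table, const int* indices, float* output,
                                 int num_indices, int embed_dim, int table_rows) {
    int token_idx = blockIdx.x;
    int dim_idx = threadIdx.x;
    if (token_idx >= num_indices) return;
    int row = hash_token(indices[token_idx], table_rows);
    const float* src = table + row * embed_dim;
    float* dst = output + token_idx * embed_dim;
    // Same coalesced access pattern as the full-table kernel above.
    for (int d = dim_idx; d < embed_dim; d += blockDim.x) {
        dst[d] = src[d];
    }
}
```

Production systems often combine two or more independent hashes (summing or concatenating the looked-up rows) to soften the effect of collisions.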
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.