Quantization shrinks model size and speeds up inference by using lower-precision arithmetic. INT8 cuts memory 4x relative to FP32 with roughly 1% accuracy loss; INT4 pushes that to 8x, which is especially valuable for LLM weights. Efficient CUDA implementation requires understanding packed data formats and fused dequantization. This guide covers quantization schemes, packed kernels, and performance optimization.
- Pack multiple values per register to exploit SIMD instructions.
- Dequantize in the GEMM epilogue, not in a separate kernel.
- Use different scales per output channel to preserve accuracy (see the sketch after this list).
- A separate dequantization pass doubles memory traffic.
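As a concrete illustration of per-channel scales, here is a minimal sketch of symmetric INT8 weight quantization. The kernel name and the row-major [out_channels x in_features] layout are assumptions for illustration; the scales would typically come from a calibration pass (scale[oc] = absmax of row oc / 127).

```cuda
// Minimal sketch: symmetric per-channel INT8 weight quantization.
// Assumed layout: row-major weights [out_channels x in_features],
// one scale per output channel (scale[oc] = absmax(row oc) / 127).
__global__ void quantize_per_channel(const float* __restrict__ w,
                                     int8_t* __restrict__ q,
                                     const float* __restrict__ scale,
                                     int out_channels, int in_features) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = out_channels * in_features;
    if (idx < total) {
        int oc = idx / in_features;                 // output channel of this weight
        float v = w[idx] / scale[oc];               // map into the INT8 range
        v = fminf(fmaxf(v, -127.0f), 127.0f);       // clamp to the symmetric range
        q[idx] = static_cast<int8_t>(__float2int_rn(v));  // round to nearest
    }
}
```

Symmetric (zero-point-free) quantization keeps the dequantization epilogue to a single multiply per output, which is why it is the common choice for weights.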
// Anti-pattern: separate dequantize kernel
__global__ void dequantize(const int8_t* input, float* output, float scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        output[i] = input[i] * scale;
    }
}

void quantized_inference() {
    // Dequantize weights: an extra full pass over the weight tensor
    dequantize<<<grid, block>>>(q_weights, f_weights, scale, n);
    // Regular FP32 GEMM on the dequantized weights
    cublasSgemm(..., f_weights, ...);
}

Fused quantized GEMM eliminates this dequantization overhead.
// INT8 GEMM with fused dequantization
// Using INT8 Tensor Cores (Turing+)
#include <cublas_v2.h>
#include <cublasLt.h>

void int8_gemm_fused(cublasLtHandle_t handle,
                     int8_t* A, int8_t* B, int8_t* C,
                     float* scale_A, float* scale_B, float* scale_C,
                     int M, int N, int K) {
    cublasLtMatmulDesc_t matmulDesc;
    cublasLtMatmulDescCreate(&matmulDesc, CUBLAS_COMPUTE_32I, CUDA_R_32I);

    // Set scaling factors (applied in the fused epilogue)
    cublasLtMatmulDescSetAttribute(matmulDesc,
        CUBLASLT_MATMUL_DESC_A_SCALE_POINTER, &scale_A, sizeof(scale_A));
    cublasLtMatmulDescSetAttribute(matmulDesc,
        CUBLASLT_MATMUL_DESC_B_SCALE_POINTER, &scale_B, sizeof(scale_B));
    cublasLtMatmulDescSetAttribute(matmulDesc,
        CUBLASLT_MATMUL_DESC_D_SCALE_POINTER, &scale_C, sizeof(scale_C));

    // Execute with fused scaling (remaining arguments omitted)
    cublasLtMatmul(handle, matmulDesc, ...);
}
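For custom kernels that stay outside cuBLASLt, the same packed-register idea applies at the instruction level: the `__dp4a` intrinsic (sm_61+) computes a dot product over four INT8 values packed into one 32-bit register. Below is a minimal sketch, not a drop-in replacement for the library path; the kernel name, the combined per-column scale, and the layout assumptions (row-major A, column-major B, K a multiple of 4) are illustrative.

```cuda
// Minimal custom INT8 GEMM sketch built on __dp4a (requires sm_61+).
// Layout assumptions (illustrative): A row-major [M x K], B column-major
// [K x N], K a multiple of 4, pointers 4-byte aligned so each 32-bit
// load packs four consecutive INT8 values along K.
__global__ void gemm_int8_dp4a(const int8_t* __restrict__ A,
                               const int8_t* __restrict__ B,
                               float* __restrict__ C,
                               const float* __restrict__ scale,  // combined scale per output column
                               int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        int acc = 0;
        const int* a4 = reinterpret_cast<const int*>(A + row * K);  // 4 INT8 per 32-bit word
        const int* b4 = reinterpret_cast<const int*>(B + col * K);
        for (int k = 0; k < K / 4; ++k) {
            acc = __dp4a(a4[k], b4[k], acc);  // 4-way INT8 multiply-accumulate
        }
        C[row * N + col] = acc * scale[col];  // dequantize in the epilogue
    }
}
```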
// INT4 packed format (2 values per byte)
// Each nibble stores an unsigned value in [0, 15] with a zero point of 8,
// so subtracting 8 recovers the signed range [-8, 7].
__device__ int8_t unpack_int4(uint8_t packed, int idx) {
    if (idx == 0) return (packed & 0x0F) - 8;  // lower 4 bits
    else          return (packed >> 4) - 8;    // upper 4 bits
}
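// A matching packing helper makes the encoding explicit and is handy for
// host-side sanity checks. This is a sketch under the same assumption
// (signed values in [-8, 7] stored as unsigned nibbles with an offset of 8);
// the name pack_int4 is illustrative.
__host__ __device__ uint8_t pack_int4(int8_t lo, int8_t hi) {
    uint8_t lo_n = (uint8_t)(lo + 8) & 0x0F;   // first value -> lower nibble
    uint8_t hi_n = (uint8_t)(hi + 8) & 0x0F;   // second value -> upper nibble
    return (uint8_t)(lo_n | (hi_n << 4));
}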
// INT4 GEMM kernel
// Assumed packing: A is [M x K/2] with two consecutive k values per byte along
// each row; B is [K/2 x N] with two consecutive k values per byte per column.
__global__ void gemm_int4(const uint8_t* A_packed, const uint8_t* B_packed,
                          float* C, const float* scale, int M, int N, int K) {
    // Each thread unpacks and processes multiple INT4 values
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        int32_t sum = 0;
        for (int k = 0; k < K / 2; k++) {
            uint8_t a_packed = A_packed[row * (K / 2) + k];
            uint8_t b_packed = B_packed[k * N + col];
            int8_t a0 = unpack_int4(a_packed, 0);  // A[row][2k]
            int8_t a1 = unpack_int4(a_packed, 1);  // A[row][2k+1]
            int8_t b0 = unpack_int4(b_packed, 0);  // B[2k][col]
            int8_t b1 = unpack_int4(b_packed, 1);  // B[2k+1][col]
            sum += a0 * b0 + a1 * b1;
        }
        C[row * N + col] = sum * scale[col];  // per-output-channel dequantization
    }
}

| Comparison | FP16 baseline | Quantized throughput | Primary source of gain |
|---|---|---|---|
| INT8 vs FP16 | 1x | 2x | INT8 Tensor Cores |
| INT4 vs FP16 | 1x | 3-4x | Reduced memory bandwidth |
Accuracy loss is typically under 1% for vision models and 1-2% for language models with proper calibration. Per-channel quantization preserves more accuracy than per-tensor quantization.
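As a rough illustration of what calibration involves, here is a minimal host-side sketch of absmax calibration for a per-tensor activation scale. The container name calibration_batches and the epsilon floor are assumptions; production frameworks often use percentile or entropy-based calibration instead of a raw maximum.

```cpp
// Minimal absmax calibration sketch (host side, illustrative only).
// Assumes each calibration batch's activations have been copied to host memory.
#include <cmath>
#include <vector>

float calibrate_per_tensor_scale(const std::vector<std::vector<float>>& calibration_batches) {
    float absmax = 0.0f;
    for (const auto& batch : calibration_batches)
        for (float v : batch)
            absmax = std::fmax(absmax, std::fabs(v));
    return std::fmax(absmax, 1e-8f) / 127.0f;  // symmetric INT8 scale
}
```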
Ready to optimize your CUDA code? Download RightNow AI and get real-time performance analysis for your kernels.