Single source of truth for GEMM and Grouped Convolution kernel generation.
See also: Main Dispatcher README for installation and core concepts.
Both GEMM and Grouped Conv generators share common code via codegen_common.py:
- TileConfig - Dataclass for tile dimensions
- TraitConfigBase - Base for kernel trait configurations with arch-aware validation
- CommonTypeMappings - Dtype-to-C++ type mappings
- parallel_generate() - Parallel kernel generation with per-kernel progress logging
- Arch-aware expansion helpers (valid_wave_configs, valid_warp_configs, etc.)
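To illustrate what parallel generation with per-kernel progress logging can look like, here is a minimal sketch using only the Python standard library. It is not the code in codegen_common.py: `TileSketch`, `render_kernel`, and `generate_in_parallel` are hypothetical names chosen for this example, and the "generation" step only builds a name string.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass


@dataclass(frozen=True)
class TileSketch:
    """Hypothetical stand-in for a tile-dimension dataclass."""
    tile_m: int
    tile_n: int
    tile_k: int


def render_kernel(tile: TileSketch) -> str:
    """Placeholder for real code generation: only builds a name."""
    return f"gemm_{tile.tile_m}x{tile.tile_n}x{tile.tile_k}"


def generate_in_parallel(tiles, max_workers=4, log=print):
    """Render every tile concurrently and log one line per finished kernel."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(render_kernel, t): t for t in tiles}
        for done, future in enumerate(as_completed(futures), start=1):
            tile = futures[future]
            results[tile] = future.result()
            log(f"[{done}/{len(tiles)}] generated {results[tile]}")
    return results


tiles = [TileSketch(128, 128, 32), TileSketch(256, 256, 64)]
names = generate_in_parallel(tiles)
```

Because the tile dataclass is frozen, it is hashable and can be used as the dictionary key that ties each result back to its configuration.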
cd dispatcher/codegen
# Generate standard FP16 kernels
python3 unified_gemm_codegen.py \
--output-dir ../build/generated_kernels \
--datatype fp16 \
--layout rcr \
--variants standard
# Generate all variants
python3 unified_gemm_codegen.py \
--output-dir ../build/generated_kernels \
--variants standard preshuffle multi_d

cd dispatcher/codegen
# Generate forward FP16 grouped conv kernels
python3 unified_grouped_conv_codegen.py \
--output-dir ../build/generated_kernels \
--datatype fp16 \
--variant forward \
--ndim-spatial 2
# Generate backward data kernels
python3 unified_grouped_conv_codegen.py \
--output-dir ../build/generated_kernels \
--variant backward_data \
--ndim-spatial 2

from ctypes_utils import CodegenRunner, KernelConfig
# Generate from specific config
config = KernelConfig(tile_m=256, tile_n=256, tile_k=64)
codegen = CodegenRunner()
result = codegen.generate_from_config(config)
# Generate variant
result = codegen.generate("preshuffle")
# Generate all
results = codegen.generate_all()

| Option | Values | Description |
|---|---|---|
| --output-dir | path | Output directory |
| --datatype | fp16, bf16, fp32, int8 | Data type |
| --layout | rcr, rrr, crr, ccr | Matrix layouts |
| --gpu-target | gfx942, gfx90a, gfx950 | Target GPU |
| --variants | standard, preshuffle, multi_d | Kernel variants |
| --preselected | fp16_rcr_essential, etc. | Predefined kernel set |
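When the GEMM generator is driven from a script, it can help to validate option values before launching it. The sketch below assembles an argument list from the flags and values in the table above. `build_gemm_codegen_command` is a hypothetical helper, not part of the dispatcher, and the allowed-value sets are copied from the table rather than read from the generator.

```python
import shlex

# Allowed values as listed in the options table (assumed to be exhaustive).
DATATYPES = {"fp16", "bf16", "fp32", "int8"}
LAYOUTS = {"rcr", "rrr", "crr", "ccr"}
GPU_TARGETS = {"gfx942", "gfx90a", "gfx950"}
VARIANTS = {"standard", "preshuffle", "multi_d"}


def _check(flag, value, allowed):
    """Reject a value that the options table does not list for this flag."""
    if value not in allowed:
        raise ValueError(f"{flag}: unsupported value {value!r}")


def build_gemm_codegen_command(output_dir, datatype=None, layout=None,
                               gpu_target=None, variants=()):
    """Return the argv list for unified_gemm_codegen.py without running it."""
    cmd = ["python3", "unified_gemm_codegen.py", "--output-dir", output_dir]
    if datatype is not None:
        _check("--datatype", datatype, DATATYPES)
        cmd += ["--datatype", datatype]
    if layout is not None:
        _check("--layout", layout, LAYOUTS)
        cmd += ["--layout", layout]
    if gpu_target is not None:
        _check("--gpu-target", gpu_target, GPU_TARGETS)
        cmd += ["--gpu-target", gpu_target]
    if variants:
        for variant in variants:
            _check("--variants", variant, VARIANTS)
        cmd += ["--variants", *variants]
    return cmd


cmd = build_gemm_codegen_command(
    "../build/generated_kernels",
    datatype="fp16", layout="rcr", variants=["standard"],
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the dispatcher/codegen directory.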
- R = Row-major, C = Column-major
- Order: A, B, C (e.g., rcr = A row, B col, C row)
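The layout notation can be decoded mechanically. The following sketch maps a three-letter layout string to the storage order of A, B, and C. `decode_layout` is an illustrative helper, not a function provided by the code generators.

```python
MAJOR = {"r": "row-major", "c": "column-major"}


def decode_layout(layout: str) -> dict:
    """Map a layout string such as 'rcr' to the storage order of A, B, C."""
    letters = layout.lower()
    if len(letters) != 3 or any(ch not in MAJOR for ch in letters):
        raise ValueError(f"expected three letters from 'r'/'c', got {layout!r}")
    return dict(zip(("A", "B", "C"), (MAJOR[ch] for ch in letters)))


decoded = decode_layout("rcr")
print(decoded)
```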
- standard: Basic GEMM, C = A x B
- preshuffle: Optimized weight access with LDS pre-shuffling. Best for large matrices.
- multi_d: Element-wise fusion, C = op(A x B + D0 + D1 + ...). Supported ops: PassThrough, MultiDAdd, Relu, Gelu, Sigmoid, Tanh
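For checking generated multi-D kernels, a slow host-side reference of C = op(A x B + D0 + D1 + ...) can be useful. This is a minimal pure-Python sketch that uses Relu as the op. `multi_d_reference` and its helpers are hypothetical names, and the code makes no claim about how the GPU kernels order or fuse these operations.

```python
def matmul(a, b):
    """Naive matrix product of two row-major nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]


def relu(x):
    """Element-wise Relu on a scalar."""
    return max(x, 0.0)


def multi_d_reference(a, b, ds, op=relu):
    """Compute op(A x B + D0 + D1 + ...) element by element."""
    acc = matmul(a, b)
    return [[op(acc[i][j] + sum(d[i][j] for d in ds))
             for j in range(len(acc[0]))]
            for i in range(len(acc))]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]      # identity, so A x B == A
d0 = [[-5.0, 0.0], [0.0, 1.0]]
d1 = [[0.0, 0.5], [-10.0, 0.0]]
c = multi_d_reference(a, b, [d0, d1])
print(c)  # [[0.0, 2.5], [0.0, 5.0]]
```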
generated_kernels/
|---- gemm_fp16_rcr_compv4_..._128x128x32_....hpp # GEMM kernels
|---- gemm_fp16_rcr_compv4_..._preshuffle.hpp
|---- gemm_fp16_rcr_compv4_..._multid_Relu_d1.hpp
|---- grouped_conv_fwd_fp16_nhwgc_..._128x128x32_....hpp # Grouped conv kernels
+---- ...
GPU architecture specifications (single source of truth):
{
"architectures": {
"gfx942": {
"family": "cdna3",
"warp_size": 64,
"warp_configs": [[2, 2, 1], [4, 4, 1]],
...
}
}
}

Curated kernel sets for common use cases.
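As an illustration of how the GPU architecture specification shown earlier might be consumed, the sketch below parses an inline copy of that excerpt and looks up the warp configurations for one target. The fields elided in the excerpt are left out here, and `warp_configs_for` is a hypothetical helper rather than a function from the code generators.

```python
import json

# Inline copy of the excerpt; the real file contains further fields.
ARCH_SPECS_EXCERPT = """
{
  "architectures": {
    "gfx942": {
      "family": "cdna3",
      "warp_size": 64,
      "warp_configs": [[2, 2, 1], [4, 4, 1]]
    }
  }
}
"""


def warp_configs_for(specs: dict, gpu_target: str):
    """Return the warp configurations listed for a GPU target as tuples."""
    try:
        arch = specs["architectures"][gpu_target]
    except KeyError:
        raise ValueError(f"unknown GPU target: {gpu_target}") from None
    return [tuple(cfg) for cfg in arch["warp_configs"]]


specs = json.loads(ARCH_SPECS_EXCERPT)
configs = warp_configs_for(specs, "gfx942")
print(configs)  # [(2, 2, 1), (4, 4, 1)]
```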
See ADDING_NEW_GPU.md for complete guide.
Quick steps:
- Edit arch_specs.json
- Run python generate_arch_specs.py
- Rebuild
| Issue | Solution |
|---|---|
| "Arguments not supported" | Check tile config validity |
| Missing element-wise op | Check elementwise_ops.hpp |
| Compilation errors | Verify C++17, include paths |
More info: See ../README.md for full documentation.