Summary
Add warp-level intrinsics for shuffle operations and warp reductions.
Words to implement
| Word | Stack effect | Description |
| --- | --- | --- |
| SHFL-DOWN | ( val offset -- result ) | Warp shuffle down |
| SHFL-UP | ( val offset -- result ) | Warp shuffle up |
| SHFL-XOR | ( val mask -- result ) | Warp shuffle XOR (butterfly) |
| SHFL-IDX | ( val idx -- result ) | Warp shuffle to specific lane |
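To make the intended lane semantics concrete, here is a minimal host-side sketch in Python that models one warp as a list of 32 values. It is an illustration only, not the proposed implementation: the function names are hypothetical stand-ins for the four words, and it assumes every lane passes the same offset, mask, or index and that all 32 lanes participate. The out-of-range behaviour (a lane keeps its own value) follows CUDA's `__shfl_down_sync` and `__shfl_up_sync`.

```python
WARP_SIZE = 32  # mirrors the WARP-SIZE constant suggested in this issue


def shfl_down(vals, offset):
    # Lane i reads from lane i + offset; lanes that would read past the
    # end of the warp keep their own value.
    return [vals[i + offset] if i + offset < WARP_SIZE else vals[i]
            for i in range(WARP_SIZE)]


def shfl_up(vals, offset):
    # Lane i reads from lane i - offset; lanes that would read before
    # lane 0 keep their own value.
    return [vals[i - offset] if i - offset >= 0 else vals[i]
            for i in range(WARP_SIZE)]


def shfl_xor(vals, mask):
    # Lane i reads from lane i XOR mask (butterfly exchange).
    # Assumes 0 <= mask < WARP_SIZE so the source lane stays in range.
    return [vals[i ^ mask] for i in range(WARP_SIZE)]


def shfl_idx(vals, idx):
    # Every lane reads the value held by lane idx (a broadcast).
    return [vals[idx % WARP_SIZE] for _ in range(WARP_SIZE)]


# Each lane starts out holding its own lane id.
lanes = list(range(WARP_SIZE))
down = shfl_down(lanes, 1)   # lane 0 now holds 1, lane 31 keeps 31
pairs = shfl_xor(lanes, 1)   # neighbouring lanes swap: 1, 0, 3, 2, ...
bcast = shfl_idx(lanes, 5)   # all lanes hold 5
```

In each case the `val` on the stack corresponds to one element of `vals`, and `result` to the same position in the returned list.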
Motivation
- Needed for high-performance reductions (e.g., sum across a warp without shared memory)
- Used in split-K matmul variants
- Warp-level operations avoid shared memory round-trips
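The reduction use case can be sketched as follows. This is a Python simulation under stated assumptions rather than GPU code: one warp is modelled as a list of 32 values, all lanes are active, and the helper names are hypothetical. It shows the two usual patterns: a tree reduction built on shuffle-down, which leaves the total in lane 0, and a butterfly built on shuffle-XOR, which leaves the total in every lane. Both take five shuffle steps and need no shared memory.

```python
WARP_SIZE = 32  # assumed warp width, as in the proposed WARP-SIZE word


def shfl_down(vals, offset):
    # Lane i reads from lane i + offset; out-of-range lanes keep their value.
    return [vals[i + offset] if i + offset < WARP_SIZE else vals[i]
            for i in range(WARP_SIZE)]


def shfl_xor(vals, mask):
    # Lane i reads from lane i XOR mask.
    return [vals[i ^ mask] for i in range(WARP_SIZE)]


def warp_reduce_sum(vals):
    # Tree reduction with offsets 16, 8, 4, 2, 1.
    # Only lane 0 is guaranteed to hold the full sum afterwards.
    vals = list(vals)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        vals = [v + s for v, s in zip(vals, shifted)]
        offset //= 2
    return vals


def warp_allreduce_sum(vals):
    # Butterfly reduction with masks 16, 8, 4, 2, 1.
    # Every lane holds the full sum afterwards.
    vals = list(vals)
    mask = WARP_SIZE // 2
    while mask > 0:
        partner = shfl_xor(vals, mask)
        vals = [v + p for v, p in zip(vals, partner)]
        mask //= 2
    return vals


data = list(range(WARP_SIZE))          # lane i contributes the value i
lane0_total = warp_reduce_sum(data)[0]  # sum of 0..31
all_totals = warp_allreduce_sum(data)   # same sum replicated in all lanes
```

The shuffle-down form is enough when a single lane writes the result; the XOR form avoids a separate broadcast when every lane needs it.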
Implementation notes
- Maps to nvvm.shfl.sync intrinsics in NVVM
- Full warp mask (0xFFFFFFFF) can be the default
- May also want WARP-SIZE (constant 32) and LANE-ID words
Priority
Nice to have — needed for advanced GPU optimization patterns.