Warp-level primitives: shuffle and reductions #10

## Summary

Add warp-level intrinsics for shuffle operations and warp reductions.

## Words to implement

| Word | Stack effect | Description |
|------|-------------|-------------|
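| `SHFL-DOWN` | `( val offset -- result )` | Warp shuffle down: read `val` from lane `lane + offset` |
| `SHFL-UP` | `( val offset -- result )` | Warp shuffle up: read `val` from lane `lane - offset` |
| `SHFL-XOR` | `( val mask -- result )` | Warp shuffle XOR (butterfly): read `val` from lane `lane xor mask` |
| `SHFL-IDX` | `( val idx -- result )` | Warp shuffle from a specific lane: read `val` from lane `idx` |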
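
To make the intended lane semantics concrete, here is a minimal host-side model in Python. It is a sketch, not the proposed implementation: it assumes a 32-lane warp, a non-negative offset that is uniform across lanes, and CUDA-style behaviour where a lane whose source falls outside the warp keeps its own value. The function names are hypothetical stand-ins for the words in the table.

```python
WARP_SIZE = 32  # assumption: 32 lanes per warp

def shfl_down(vals, offset):
    """Lane i reads from lane i + offset; out-of-range lanes keep their own value."""
    return [vals[i + offset] if i + offset < WARP_SIZE else vals[i]
            for i in range(WARP_SIZE)]

def shfl_up(vals, offset):
    """Lane i reads from lane i - offset; out-of-range lanes keep their own value."""
    return [vals[i - offset] if i - offset >= 0 else vals[i]
            for i in range(WARP_SIZE)]

def shfl_xor(vals, mask):
    """Lane i reads from lane i XOR mask (butterfly exchange)."""
    return [vals[i ^ mask] if (i ^ mask) < WARP_SIZE else vals[i]
            for i in range(WARP_SIZE)]

def shfl_idx(vals, idx):
    """Every lane reads from lane idx (taken modulo the warp size)."""
    return [vals[idx % WARP_SIZE]] * WARP_SIZE

# One value per lane: lane i holds 10 * i.
lanes = [10 * i for i in range(WARP_SIZE)]
down = shfl_down(lanes, 1)   # lane 0 sees lane 1's value, lane 31 keeps its own
up = shfl_up(lanes, 1)       # lane 1 sees lane 0's value, lane 0 keeps its own
bfly = shfl_xor(lanes, 1)    # neighbouring lanes swap values
bcast = shfl_idx(lanes, 5)   # every lane sees lane 5's value
```

On real hardware each lane supplies its own `offset`, `mask`, or `idx` operand; the uniform operand here only keeps the model short.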

## Motivation

- Needed for high-performance reductions (e.g., sum across a warp without shared memory)
- Used in split-K matmul variants
- Warp-level operations avoid shared memory round-trips
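
The reduction pattern mentioned above can be sketched as a shuffle-down tree. This is a host-side Python simulation under the same assumptions as the usual CUDA idiom (32 lanes, full participation, out-of-range lanes keep their own value); `shfl_down` and `warp_reduce_sum` are illustrative names, not existing words.

```python
WARP_SIZE = 32  # assumption: 32 lanes per warp

def shfl_down(vals, offset):
    """Lane i reads from lane i + offset; out-of-range lanes keep their own value."""
    return [vals[i + offset] if i + offset < WARP_SIZE else vals[i]
            for i in range(WARP_SIZE)]

def warp_reduce_sum(vals):
    """Halve the offset each step (16, 8, 4, 2, 1); lane 0 ends with the total."""
    vals = list(vals)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        # Each lane adds the value it received to its own running value.
        vals = [own + other for own, other in zip(vals, shifted)]
        offset //= 2
    return vals

result = warp_reduce_sum(range(WARP_SIZE))
print(result[0])  # lane 0 holds 0 + 1 + ... + 31 = 496
```

Five shuffle steps replace the shared-memory stores, barriers, and loads a block-level reduction would need for the same 32 values. Only lane 0 is guaranteed to hold the full sum; swapping in an XOR butterfly would leave the total in every lane.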

## Implementation notes

- Maps to `nvvm.shfl.sync` intrinsics in NVVM
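- The full-warp member mask (`0xFFFFFFFF`, all 32 lanes participating) can be the default, so the words need no member-mask operand; this is separate from the lane `mask` operand of `SHFL-XOR`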
- May also want `WARP-SIZE` (constant 32) and `LANE-ID` words
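
As a rough illustration of what `WARP-SIZE`, `LANE-ID`, and the member mask represent, the sketch below models them in Python. It assumes a one-dimensional thread block, where the lane id is the thread index modulo the warp size; the helper names are hypothetical.

```python
WARP_SIZE = 32          # assumption: value the WARP-SIZE word would push
FULL_MASK = 0xFFFFFFFF  # one bit per lane, all 32 lanes participating

def lane_id(tid):
    """Lane index of a thread within its warp (1-D block assumed)."""
    return tid % WARP_SIZE

def warp_id(tid):
    """Index of the warp a thread belongs to within its block."""
    return tid // WARP_SIZE

def participates(mask, lane):
    """True if bit `lane` of the member mask is set."""
    return (mask >> lane) & 1 == 1

# Thread 37 of a block is lane 5 of warp 1.
print(lane_id(37), warp_id(37))

# With a lower-half mask, only lanes 0..15 take part in the shuffle.
half_mask = 0x0000FFFF
active = [lane for lane in range(WARP_SIZE) if participates(half_mask, lane)]
```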

## Priority

Nice to have: only needed for advanced GPU optimization patterns.
