Add FP8 Support For CK Tile Group GEMM #475

Draft
aris134 wants to merge 62 commits into dev from amartin/ck-grouped-gemm-fp8

Conversation


@aris134 aris134 commented Mar 6, 2026

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

TODO:

  • Add support for other architectures (e.g., MI350X)
  • Add support for other quantization modes
  • Performance analysis and tuning

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Enables mixed precision (fp8/bf8 FNUZ variants) support for CK tile grouped GEMM with tensor quantization

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

matthiasdiener and others added 28 commits February 5, 2026 17:22
Align GemmRowColTensorQuantPipelineProblem with ck_tile V3 requirements by using AccType for intermediate C results. Specific to TensorQuant (per-tensor scaling); limited to e4m3/e5m2 FNUZ formats. Updates test_numerics.py to exercise FP8 inputs in the grouped linear accuracy suite.
Enable mixed FP8/BF8 grouped GEMM for the CK backend used by GroupedLinear backward.

Certain mixed-type combinations normalize to (AType=bf8_t, BType=fp8_t), but CK currently lacks a corresponding warp GEMM specialization for WarpGemmMfma_f32_32x32x32_bf8_fp8. This prevents the default FP8 tile configuration (K_Warp_Tile=32) from compiling or dispatching correctly.

To address this, a fallback tile policy is introduced that routes the (bf8_t, fp8_t) case to a supported kernel configuration using K_Warp_Tile=16. This preserves correct GEMM operand ordering and avoids unsafe operand-swapping workarounds.

Notes:
- Only tensor quantization mode is currently supported.
- Implementation targets MI300X (CDNA3) FP8/BF8 kernels.
- Additional kernel coverage may be required for MI350X (CDNA4).

With this change, mixed FP8/BF8 backprop paths are supported and all parametrized unit tests in test_grouped_linear_accuracy_cutlass() pass successfully.
@aris134 aris134 self-assigned this Mar 6, 2026