MXFP4 Cast Transpose Triton [WIP]#422

Open
sarthak-amd wants to merge 18 commits into dev from feature/cast-transpose-mxfp4

Conversation

@sarthak-amd
Collaborator

@sarthak-amd sarthak-amd commented Jan 20, 2026

Description

Implements the rowwise and columnwise FP32/BF16 -> MXFP4 fused quantization + cast kernel.

  • Verify tolerances and functional unit tests

  • The Triton te_cast_transpose_mxfp4_triton kernel currently outputs FP4 data in a linear [M, N/2] layout with contiguous byte packing. AITER's gemm_a4w4 requires the B matrix in the MFMA shuffle layout for tensor cores; this layout shuffle can be fused into the Triton kernel in the future.
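The linear byte packing can be sketched as follows (a hypothetical illustration, assuming two adjacent E2M1 codes per byte with the even column in the low nibble; the kernel's actual nibble order may differ):

```python
import numpy as np

def pack_fp4_rowwise(codes: np.ndarray) -> np.ndarray:
    """Pack an [M, N] array of 4-bit codes (values 0..15) into [M, N//2] bytes."""
    assert codes.shape[1] % 2 == 0
    lo = codes[:, 0::2] & 0xF  # even columns -> low nibble (an assumption)
    hi = codes[:, 1::2] & 0xF  # odd columns  -> high nibble
    return ((hi << 4) | lo).astype(np.uint8)
```

Two FP4 elements per byte is what gives the [M, N/2] shape; the MFMA shuffle for gemm_a4w4 would reorder these bytes, not change the packing itself.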

Collaborator

@wangye805 wangye805 left a comment


@wangye805 wangye805 requested a review from sudhu2k February 12, 2026 22:40
@sudhu2k
Contributor

sudhu2k commented Feb 16, 2026

Hi @sarthak-amd
Can't we merge the MXFP8 Triton kernel with the MXFP4 Triton kernel?

Kernel-wise they should be nearly identical except for how the block is cast.
In MXFP8 we use a separate function to cast the values:

def float_to_e8m0_triton(val: tl.float32) -> tl.uint8:

So we can just put this part in a separate function:
# Nearest-neighbor quantization to E2M1 values
# E2M1 representable values: {0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0}
idx_row = tl.zeros([MXFP4_BLOCK_SIZE, MXFP4_BLOCK_SIZE], dtype=tl.uint8)
idx_row = tl.where(abs_qx_row >= 0.25, 1, idx_row) # → 0.5
idx_row = tl.where(abs_qx_row >= 0.75, 2, idx_row) # → 1.0
idx_row = tl.where(abs_qx_row >= 1.25, 3, idx_row) # → 1.5
idx_row = tl.where(abs_qx_row >= 1.75, 4, idx_row) # → 2.0
idx_row = tl.where(abs_qx_row >= 2.5, 5, idx_row) # → 3.0
idx_row = tl.where(abs_qx_row >= 3.5, 6, idx_row) # → 4.0
idx_row = tl.where(abs_qx_row >= 5.0, 7, idx_row) # → 6.0

and dispatch between the two based on whether it's an MXFP8 or an MXFP4 kernel.
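The thresholds in the snippet above are the midpoints between adjacent E2M1 magnitudes, i.e. nearest-neighbor rounding with ties going up. A NumPy sketch of the same index selection (sign bit handled separately, as in the kernel):

```python
import numpy as np

# E2M1 representable magnitudes, as in the kernel comment
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def e2m1_index(abs_x: np.ndarray) -> np.ndarray:
    """Index of the nearest E2M1 magnitude; ties round up, matching the
    chain of tl.where(abs_qx_row >= threshold, ...) comparisons."""
    # Midpoints between neighbors: 0.25, 0.75, 1.25, 1.75, 2.5, 3.5, 5.0
    thresholds = (E2M1_VALUES[:-1] + E2M1_VALUES[1:]) / 2
    return np.searchsorted(thresholds, abs_x, side="right").astype(np.uint8)
```

Pulling this into a shared helper, as suggested, would let one kernel body serve both formats with only the casting function swapped.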

The following operation can be replaced with the exp2f_rcp_triton function:

scale_unbiased_row = tl.log2(tl.maximum(amax_rounded, 1e-45)).floor() - 2
scale_unbiased_row = tl.clamp(scale_unbiased_row, min=-127.0, max=127.0)
quant_scale_row = tl.exp2(-scale_unbiased_row)
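For reference, the scale computation above can be re-stated in NumPy (a plain sketch of these three lines, not the exp2f_rcp_triton implementation): the -2 offset aligns the block amax with E2M1's maximum magnitude of 6.0 ≈ 2^2 · 1.5.

```python
import numpy as np

def e8m0_quant_scale(amax: np.ndarray):
    """Power-of-two block scale and its reciprocal, mirroring the kernel:
    floor(log2(amax)) - 2, clamped to the E8M0 unbiased-exponent range."""
    amax = np.maximum(amax, 1e-45)        # avoid log2(0), as in the kernel
    s = np.floor(np.log2(amax)) - 2       # unbiased power-of-two exponent
    s = np.clip(s, -127.0, 127.0)         # E8M0 representable range
    return s, np.exp2(-s)                 # (exponent, multiply-by quant scale)
```

E.g. a block amax of 24.0 yields exponent 2 and quant scale 0.25, mapping 24.0 onto E2M1's top value 6.0.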

Another difference is that the MXFP8 kernel doesn't have the shuffle op fused into it. Do we want that for MXFP8 as well, or is it specific to FP4 data?

Also, @wangye805 noticed: for the column-wise quantization of an MxN input tensor, MXFP8 still produces columnwise output of shape MxN, but the MXFP4 implementation produces NxM. Technically we could add something like a STORE_TRANSPOSE flag to the kernel and invert the strides in the MXFP8 kernel itself.
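The STORE_TRANSPOSE idea can be sketched in NumPy (STORE_TRANSPOSE and store_tile are hypothetical names from this suggestion, not existing kernel code): inverting the indexing writes each tile transposed, so the columnwise output lands directly in [N, M] with no separate transpose pass.

```python
import numpy as np

def store_tile(out: np.ndarray, tile: np.ndarray, r: int, c: int,
               store_transpose: bool) -> None:
    """Write a [BR, BC] tile at element offset (r, c). With store_transpose
    set, the output is laid out [N, M] and the tile is written transposed,
    which is what swapping the output strides achieves in the kernel."""
    br, bc = tile.shape
    if store_transpose:
        out[c:c + bc, r:r + br] = tile.T  # inverted indexing: [N, M] placement
    else:
        out[r:r + br, c:c + bc] = tile    # plain [M, N] placement
```

In the Triton kernel this would be a constexpr flag selecting between the two stride orders rather than an actual `.T`.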

Let me know what you think @sarthak-amd

sudhu2k added 4 commits March 4, 2026 16:42
- Updated `test_cast_mxfp4.py` to simplify quantization output handling by removing unnecessary output tensor creation.
- Introduced `MXFP4BlockScaling` recipe class.
- Enhanced `MXFP4Quantizer` to utilize new scaling methods and updated tensor creation logic.
- Added new quantization kernel `_mxfp4_quantize_32x32_block` to remove redundant work.
- Updated Triton kernel wrapper.
- Updated `mxfp4_quantize_cpu` to include a `SHUFFLE` parameter for conditional scale shuffling.
- Modified tests in `test_cast_mxfp4.py` to accommodate the new shuffling logic and added parameterization for shuffle options.
- Bug fix, removed redundant QuantizedTensorBase
@sudhu2k
Contributor

sudhu2k commented Mar 4, 2026

New changes

  1. Refactor of kernel wrappers and mxfp4 tensors to be consistent with other recipes.
  2. Decomposition of repeated work into a separate function for the mxfp4 quantize kernel.
  3. Added shuffle tests.
  4. Bug fixes.

@sudhu2k sudhu2k marked this pull request as ready for review March 4, 2026 22:18
@sudhu2k sudhu2k requested a review from wangye805 March 4, 2026 22:18
sudhu2k added 2 commits March 4, 2026 22:19
- Removed redundant _empty_tensor function from utils.py.
- Ensured proper newline at the end of the file in quantized_tensor.py.
- Changed `fp8_format` to `fp4_format` for consistency with the new scaling method.
@sudhu2k sudhu2k self-assigned this Mar 5, 2026
@sudhu2k sudhu2k removed their request for review March 5, 2026 16:59
Comment on lines +305 to +321
compare_fp4_data_nibblewise(
    quantized_out._rowwise_data.view(torch.uint8),
    ref_data,
    msg=f"Rowwise FP4 ({shape}, {in_dtype})",
    max_mismatch_rate=0.05,
)
y1_scales_triton = quantized_out._rowwise_scale_inv.view(torch.uint8)
y1_scales_torch = ref_scale
if shuffle_B_matrix_for_aiter:
    y1_scales_triton = un_shuffle_scales(
        y1_scales_triton.view(y1_scales_triton.shape[0] // 32, -1)
    )
    y1_scales_torch = un_shuffle_scales(
        y1_scales_torch.view(y1_scales_torch.shape[0] // 32, -1)
    )


compare_e8m0_scales(
Collaborator

Okay, currently the target vs. ref comparison is not coupled: we validate the quantized data and the scales independently and allow a certain mismatch rate.

A better way is to do the validation jointly by adjusting the quantized value wherever the scale is mismatched:

#ifdef __HIP_PLATFORM_AMD__
  const double abs_tolerable_mismatches_limit = 1.0;
  const double rel_tolerable_mismatches_limit = 1.0e-4;
#else
  const double abs_tolerable_mismatches_limit = 0.0;
  const double rel_tolerable_mismatches_limit = 0.0;
#endif
  std::vector<size_t> mismatches_scales_indices;
  size_t mismatches_scales = 0;
  compare_e8m0_scaling_factors("scales", gpu_scales_ptr, ref_output_scales.get(),
                               unpadded_blocks_Y, unpadded_blocks_X, scales_stride,
                               mismatches_scales_indices, mismatches_scales,
                               scale_diff_abs_tolerance,
                               abs_tolerable_mismatches_limit,
                               rel_tolerable_mismatches_limit);
#ifdef __HIP_PLATFORM_AMD__
  if (::testing::Test::HasFatalFailure()) return;
  adjust_ref_for_e8m0_scale_error("scales", mismatches_scales_indices, gpu_scales_ptr,
                                  ref_output_scales.get(), scales_stride, rows, cols, rowwise,
                                  ref_output_c.get(), otype);
  mismatches_scales = 0;
#endif
  const size_t mismatches_elts = 32 * mismatches_scales;
  auto [atol, rtol] = getTolerances(otype);
  compareResults("output_c", output_c, ref_output_c.get(), rowwise, atol, rtol, true,
                 mismatches_elts);
  if (processing_method == ProcessingMethod::CAST_DBIAS
      || processing_method == ProcessingMethod::CAST_DBIAS_DACT) {
    auto [atol_dbias, rtol_dbias] = getTolerances(itype);
    if (itype == DType::kFloat32) {
      atol_dbias = 1e-4;
      rtol_dbias *= sqrt(static_cast<double>(rows));
    } else {
      rtol_dbias *= 4;
    }
    compareResults("output_dbias", output_dbias, ref_output_dbias.get(), true, atol_dbias,
                   rtol_dbias);
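The adjustment step can be sketched in Python (a hypothetical mirror of what adjust_ref_for_e8m0_scale_error does, under the assumption that a block's quantized values differ by exactly the power-of-two ratio of the two E8M0 exponents):

```python
import numpy as np

def adjust_ref_for_scale_mismatch(ref_vals: np.ndarray, ref_scales: np.ndarray,
                                  gpu_scales: np.ndarray, block: int = 32) -> np.ndarray:
    """Rescale each reference block whose E8M0 exponent disagrees with the
    GPU's, so the subsequent element-wise comparison is done jointly with the
    scale rather than tolerating independent mismatch rates."""
    ref_vals = ref_vals.astype(np.float64).reshape(-1, block)
    for i, (rs, gs) in enumerate(zip(ref_scales.astype(int), gpu_scales.astype(int))):
        if rs != gs:
            # ref stored x * 2^-rs; GPU stored x * 2^-gs, so multiply by 2^(rs-gs)
            ref_vals[i] *= 2.0 ** (rs - gs)
    return ref_vals.reshape(-1)
```

With the reference rescaled this way, the data comparison only flags genuine value errors instead of the benign off-by-one-scale cases.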
