[Common] Tuned NVFP4 cast kernel #2412
Conversation
Greptile Summary
This PR adds a specialized NVFP4 quantization kernel for Blackwell GPUs, achieving 6.4 TB/s for round-to-nearest and 4.5 TB/s for stochastic rounding. The implementation leverages TMA, mbarrier, and cluster launch control instructions.
Key Changes
Issues Found
Confidence Score: 3/5
Important Files Changed
Sequence Diagram
```mermaid
sequenceDiagram
participant User
participant Dispatcher as quantize_transpose()
participant TunedKernel as quantize_transpose_tuned_1D()
participant CUDAKernel as quantize_transpose_nvfp4_tuned_1D_kernel
participant PTX as PTX Instructions
participant GPU
User->>Dispatcher: Call with BF16 input, 1D quantization
Dispatcher->>Dispatcher: Check !use_2d_quantization && BF16
Note over Dispatcher: ⚠️ Missing is_supported_by_CC_100() check
Dispatcher->>TunedKernel: quantize_transpose_tuned_1D()
TunedKernel->>TunedKernel: Validate tensors (rows%32==0, cols%32==0)
TunedKernel->>TunedKernel: Create TMA tensor maps
TunedKernel->>TunedKernel: Configure shared memory
TunedKernel->>CUDAKernel: Launch kernel<<<grid, block, shmem>>>
alt Device is sm_100+
CUDAKernel->>PTX: mbarrier_init (TMA barriers)
CUDAKernel->>PTX: cp_async_bulk_tensor_2d (TMA prefetch)
loop For each stage
CUDAKernel->>PTX: mbarrier_wait (wait for data)
CUDAKernel->>CUDAKernel: rowwise_scaling (find amax, quantize)
CUDAKernel->>PTX: mul_cvt_bf16_to_fp4_8x (BF16→FP4)
opt If transpose requested
CUDAKernel->>CUDAKernel: colwise_scaling
end
CUDAKernel->>PTX: cp_async_bulk_tensor_2d (write results)
CUDAKernel->>PTX: cp_async_bulk_tensor_2d (write transpose)
end
CUDAKernel->>GPU: Store scales to global memory
GPU-->>User: Success
else Device is pre-sm_100
CUDAKernel->>CUDAKernel: NVTE_DEVICE_ERROR
GPU-->>User: Runtime error
end
```
Additional Comments (1)
- transformer_engine/common/cast/dispatch/quantize.cuh, line 250: logic: missing is_supported_by_CC_100() check. The forward pass (line 104) has this check, but the backward pass doesn't. Without it, the persistent kernel can be dispatched on non-Blackwell hardware and hit NVTE_DEVICE_ERROR at runtime.
5 files reviewed, 1 comment
Additional Comments (1)
- transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh, lines 1171-1174: logic: Missing runtime architecture check before dispatching to the persistent kernel. The tuned 1D kernel uses Blackwell-specific PTX instructions (mbarrier variants, cluster launch control, etc.) and requires sm_100+. Should be guarded with an is_supported_by_CC_100() check, similar to other TMA kernel dispatches in the codebase (see quantize_fp8.cuh:535, gated.cuh:48).
4 files reviewed, 1 comment
Additional Comments (1)
- transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh, lines 1171-1174: logic: Missing runtime check for sm_100+ before dispatching to the tuned kernel. The tuned kernel uses Blackwell-specific features (TMA, cluster launch control) that will fail on non-sm_100 devices.
5 files reviewed, 1 comment
Additional Comments (1)
- tests/cpp/operator/test_cast_nvfp4_transpose.cu, lines 125-126: syntax: typo in comment: 'Numercial' should be 'Numerical'.
5 files reviewed, 1 comment
Additional Comments (1)
- transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh, lines 1171-1173: logic: Add a runtime architecture check before dispatching to the tuned kernel. The tuned kernel requires sm_100+, but this dispatch lacks the is_supported_by_CC_100() check that other sm_100+ kernels use (see cast/dispatch/gated.cuh:48, cast/fp8/quantize_fp8.cuh:535). Without this check, the kernel will launch on non-Blackwell devices and hit NVTE_DEVICE_ERROR at runtime.
5 files reviewed, 1 comment
/te-ci
/te-ci
| - const double atol = 0.05;
| - const double rtol = 0.1;
| + const double atol = 1.0E-6;
| + const double rtol = 1.0E-6;
Why is this changing? It looks like it got stricter, so if it passes, that's better. But I don't understand why it was looser before.
| constexpr int THREADS_X_ROWWISE = TILE_DIM_X / ELTS_PER_THREAD;
| constexpr int THREADS_Y_ROWWISE = THREADS_NUM / THREADS_X_ROWWISE;
|
| constexpr int THREADS_X_TRANSP = TILE_DIM_X / 2;
Can we use one naming convention for the transpose macros? Earlier you used THREADS_X_T and BUFF_OUT_T_DIM_Y (the "_T" suffix), but there is also "_TRANSP" and "_TR" (as in BUFFS_NUM_OUT_TR). Just pick one.
Thanks, I’m fine to standardize. Please consolidate any remaining feedback into a single final pass by EOD today so we can merge without another round of incremental comments. This PR has been open since late November and was originally requested as time-sensitive.
| inline void quantize_transpose_tuned_1D(const Tensor &input, const Tensor *noop, Tensor *output,
|                                          const QuantizationConfig *quant_config,
|                                          cudaStream_t stream) {
| #if FP4_TYPE_SUPPORTED
But shouldn't the TunableConfig struct be for exactly that purpose? It would keep these tuned values separate from the common part of the 2 kernels, with the common part in one place only. Then there would be no tight coupling.
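For illustration, a minimal sketch (with hypothetical names and assumed values, not the actual struct in this PR) of how a config struct like that could separate the tuned constants from a shared kernel body:

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch only; the actual TunableConfig in this PR may be shaped
// differently. The idea: per-kernel tuned constants live in one small struct,
// while the common kernel body is written once against those parameters.
struct TunedConfig1D {
  static constexpr int THREADS_NUM = 512;     // assumed value, for illustration
  static constexpr int TILE_DIM_X = 128;      // assumed value
  static constexpr int ELTS_PER_THREAD = 8;   // assumed value
};

template <typename Config>
__global__ void quantize_transpose_common_body() {
  // Derived thread-mapping constants come from the config, so tuned and
  // general kernels can share this body instead of duplicating it.
  constexpr int threads_x_rowwise = Config::TILE_DIM_X / Config::ELTS_PER_THREAD;
  constexpr int threads_y_rowwise = Config::THREADS_NUM / threads_x_rowwise;
  (void)threads_x_rowwise;
  (void)threads_y_rowwise;
  // ... common quantization / transpose logic goes here ...
}

// The dispatcher would pick the config for the tuned path.
inline void launch_tuned_1d(dim3 grid, dim3 block, cudaStream_t stream) {
  quantize_transpose_common_body<TunedConfig1D><<<grid, block, 0, stream>>>();
}
```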
5 files reviewed, 1 comment
| if (!use_2d_quantization && (input.dtype() == DType::kBFloat16)) {
|   quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
|   return;
logic: add runtime check for sm_100+ before dispatching to tuned kernel
Suggested change:
| - if (!use_2d_quantization && (input.dtype() == DType::kBFloat16)) {
| -   quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
| -   return;
| + if (!use_2d_quantization && (input.dtype() == DType::kBFloat16) && is_supported_by_CC_100()) {
| +   quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
| +   return;
| + }
The tuned kernel uses Blackwell-specific instructions (TMA, mbarrier, cluster launch control) that will cause runtime errors on older GPUs. FP4_TYPE_SUPPORTED is only a compile-time check for CUDA version, not device capability.
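For reference, a minimal sketch of what such a runtime compute-capability guard could look like using the CUDA runtime API; the actual is_supported_by_CC_100() helper in Transformer Engine may be implemented differently (e.g. cached per device):

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch: query the current device's compute capability and only
// allow the tuned path on sm_100 or newer.
inline bool device_is_sm100_or_newer() {
  int device = 0;
  if (cudaGetDevice(&device) != cudaSuccess) return false;
  int cc_major = 0;
  if (cudaDeviceGetAttribute(&cc_major, cudaDevAttrComputeCapabilityMajor, device) !=
      cudaSuccess) {
    return false;
  }
  return cc_major >= 10;  // compute capability 10.x == sm_100 family (Blackwell)
}
```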
/te-ci
tdophung left a comment
I have no other issue with this PR. We should merge. Thanks for the change!
5 files reviewed, 1 comment
| if (!use_2d_quantization && (input.dtype() == DType::kBFloat16)) {
|   quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
|   return;
| }
logic: add runtime check for sm_100+ before dispatching to tuned kernel
Suggested change:
| - if (!use_2d_quantization && (input.dtype() == DType::kBFloat16)) {
| -   quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
| -   return;
| - }
| + if (!use_2d_quantization && (input.dtype() == DType::kBFloat16) && is_supported_by_CC_100()) {
| +   quantize_transpose_tuned_1D(input, noop, output, quant_config, stream);
| +   return;
| + }
The tuned kernel uses Blackwell-specific instructions (TMA, mbarrier, cluster launch control) that require sm_100+. Without this check, older GPUs will hit the runtime error inside the kernel at line 651.
* Implemented persistent nvfp4 kernel Signed-off-by: Oleg Goncharov <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix FP4 guard in ptx Signed-off-by: Oleg Goncharov <[email protected]>
* Fix Signed-off-by: Oleg Goncharov <[email protected]>
* Fix in ptx. reduxf32 guard Signed-off-by: Oleg Goncharov <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix Signed-off-by: Oleg Goncharov <[email protected]>
* Fixes per PR review Signed-off-by: Oleg Goncharov <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fixes per PR review. Added parameter to turn off the persistency Signed-off-by: Oleg Goncharov <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Modified reference CPU implementation in C++ unit tests to match GPU (numerical truncation). Tightened the numerical tolerance Signed-off-by: Oleg Goncharov <[email protected]>
* Disabled persistency by default, as the non-persistent kernel is more performant when inputs are large Signed-off-by: Oleg Goncharov <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Use the tuned kernel also for the rowwise-only quantization Signed-off-by: Oleg Goncharov <[email protected]>
* Fixed typo Signed-off-by: Oleg Goncharov <[email protected]>
* Addressed comments from the PR review Signed-off-by: Oleg Goncharov <[email protected]>
* Resolved conflicts Signed-off-by: Oleg Goncharov <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Macros renaming Signed-off-by: Oleg Goncharov <[email protected]>
---------
Signed-off-by: Oleg Goncharov <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Description
This PR introduces a specialized CUDA kernel optimized for NVFP4 quantization of BF16 inputs on the Blackwell architecture (sm100f family). The implementation achieves its performance improvements by leveraging architecture-specific features (TMA, mbarrier, and cluster launch control instructions). Achieved bandwidth:
RN (round-to-nearest): 6.4 TB/s (rowwise only: 7.2 TB/s)
SR (stochastic rounding): 4.5 TB/s (rowwise only: 7.0 TB/s)
Below are the performance measurements for quantizing tensors with dimensions representative of DSv3 [8192×8, 7168] on an internal cluster (B300).
[Benchmark plots: a) round-to-nearest, b) stochastic rounding; each comparing the Rowwise + Colwise (transpose) and Rowwise only variants.]
Using --fast-math can improve performance of the kernel with stochastic rounding (RNG) by up to ~10%.
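For background on the architecture-specific features mentioned above: the kernel's pipeline follows the familiar producer/consumer pattern of asynchronous bulk copies into shared memory synchronized by barriers. Below is a generic sketch of that pattern using the libcu++ cuda::barrier API; the PR itself works at the PTX level (cp.async.bulk.tensor with mbarrier objects and TMA tensor maps), so this is only an analogy, and all names here are illustrative:

```cuda
#include <cooperative_groups.h>
#include <cuda/barrier>

namespace cg = cooperative_groups;

// Generic async-copy + barrier sketch (assumes n is a multiple of blockDim.x
// and that blockDim.x * sizeof(float) bytes of dynamic shared memory are
// provided at launch). Hardware-accelerated async copies require sm_80+.
__global__ void scale_by_two(const float* __restrict__ in, float* __restrict__ out, int n) {
  extern __shared__ float tile[];
  auto block = cg::this_thread_block();

  // Block-scoped barrier living in shared memory, one arrival per thread.
  __shared__ cuda::barrier<cuda::thread_scope_block> ready;
  if (block.thread_rank() == 0) init(&ready, block.size());
  block.sync();

  // Producer side: start the bulk copy of this block's tile into shared memory.
  cuda::memcpy_async(block, tile, in + blockIdx.x * blockDim.x,
                     sizeof(float) * blockDim.x, ready);

  // Consumer side: wait for the copy to complete, then process the tile.
  ready.arrive_and_wait();
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = 2.0f * tile[threadIdx.x];
}
```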
Threads to data mapping (colwise case)
To reduce shared memory bank conflicts, the following mapping is used when reading from and writing to shmem buffers:
where SCALE_DIM = 16. The arrows in the figure below illustrate how thread indices increment, forming a zigzag pattern.
a) Reads from SHMEM Input Buffer
b) Writes to SHMEM Output Transpose Buffer
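The figures above are not reproduced here, but as a rough illustration of the general technique (with assumed tile sizes, and not the kernel's exact zigzag mapping): a skewed shared-memory layout rotates the physical column by the row index so that threads walking a logical column land on distinct banks.

```cuda
// Hypothetical sketch of the general bank-conflict-avoidance idea, not the
// kernel's actual mapping (which is parameterized by SCALE_DIM = 16 and shown
// in the figures referenced above). Tile sizes here are assumed values.
constexpr int TILE_DIM_Y = 32;  // assumed tile height
constexpr int TILE_DIM_X = 32;  // assumed tile width, matching 32 banks of 4-byte words

// Rotate the physical column by the row index, so that a warp walking down a
// logical column touches 32 distinct banks instead of hammering a single one.
__device__ __forceinline__ int skewed_col(int row, int col) {
  return (col + row) % TILE_DIM_X;
}

// Writes (rowwise pass) and reads (colwise/transpose pass) must apply the same
// mapping for the data to line up.
__device__ void write_elt(float (&buff)[TILE_DIM_Y][TILE_DIM_X], int row, int col, float v) {
  buff[row][skewed_col(row, col)] = v;
}

__device__ float read_elt(const float (&buff)[TILE_DIM_Y][TILE_DIM_X], int row, int col) {
  return buff[row][skewed_col(row, col)];
}
```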
Type of change
Changes
Checklist: