feat(deepseek): add --cast_mxfp4_to_nvfp4 to deepseek_v4 quantize step#1653
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a ChangesMXFP4→NVFP4 Lossless Cast
Sequence DiagramsequenceDiagram
participant User
participant DeepSeekScript
participant numeric_utils
participant NVFP4Quantizer
participant HFCheckpoint
User->>DeepSeekScript: run quantize_to_nvfp4.py --cast_mxfp4_to_nvfp4
DeepSeekScript->>NVFP4Quantizer: convert_shard(cast=True)
NVFP4Quantizer->>numeric_utils: mxfp4_to_nvfp4_global_amax / mxfp4_to_nvfp4_per_block_amax
numeric_utils-->>NVFP4Quantizer: amax, k_max, diagnostics
NVFP4Quantizer-->>DeepSeekScript: quantized NVFP4 weights + lossless stats
DeepSeekScript->>HFCheckpoint: write `--output_ckpt` NVFP4 checkpoint
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1653 +/- ##
==========================================
- Coverage 77.51% 76.07% -1.44%
==========================================
Files 489 509 +20
Lines 54498 57850 +3352
==========================================
+ Hits 42242 44008 +1766
- Misses 12256 13842 +1586
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
1d16130 to
460509f
Compare
460509f to
656101f
Compare
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/examples/llm_ptq/test_cast_mxfp4_to_nvfp4.py (1)
278-285: 💤 Low valueConsider moving the import to module level.
The
numeric_utilsimport inside the test function could be moved to the top of the file with other imports. Per coding guidelines, imports belong at module top so import errors surface at collection time. However, since this is a test file and only this single test usesnumeric_utils, the current approach is acceptable.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/examples/llm_ptq/test_cast_mxfp4_to_nvfp4.py` around lines 278 - 285, Move the local import of numeric_utils out of test_e2m1_magnitude_table_cached_per_device and into the module-level imports so import errors surface during test collection; specifically, import modelopt.torch.quantization.utils.numeric_utils at the top of the file and then in the test simply call numeric_utils._e2m1_magnitude_table(torch.device("cpu")) and assert as before (references: test_e2m1_magnitude_table_cached_per_device, numeric_utils, _e2m1_magnitude_table, _E2M1_MAGNITUDE).Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/examples/llm_ptq/test_cast_mxfp4_to_nvfp4.py`:
- Around line 278-285: Move the local import of numeric_utils out of
test_e2m1_magnitude_table_cached_per_device and into the module-level imports so
import errors surface during test collection; specifically, import
modelopt.torch.quantization.utils.numeric_utils at the top of the file and then
in the test simply call numeric_utils._e2m1_magnitude_table(torch.device("cpu"))
and assert as before (references: test_e2m1_magnitude_table_cached_per_device,
numeric_utils, _e2m1_magnitude_table, _E2M1_MAGNITUDE).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ee495359-5a81-4f65-bf13-107e56db31b6
📒 Files selected for processing (7)
CHANGELOG.rstexamples/deepseek/README.mdexamples/deepseek/deepseek_v4/quantize_to_nvfp4.pyexamples/llm_ptq/cast_mxfp4_to_nvfp4.pymodelopt/torch/quantization/utils/__init__.pymodelopt/torch/quantization/utils/numeric_utils.pytests/examples/llm_ptq/test_cast_mxfp4_to_nvfp4.py
✅ Files skipped from review due to trivial changes (2)
- examples/deepseek/README.md
- CHANGELOG.rst
dd1444d to
2a2ac77
Compare
meenchen
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Solid follow-up to #1372: the closed-form MXFP4→NVFP4 numerics are hoisted into modelopt/torch/quantization/utils/numeric_utils.py and reused by both example cast paths instead of duplicated, and the V4 twist (w1/w3 sharing one scale_2 for the fused GEMM1) is the right minimal addition. Design protocol satisfied — the PR body explicitly justifies hoisting rather than copy-pasting, and numeric_utils.py is a thin pure-tensor utility, not a new framework.
Math reads correctly: per_block_scale = 2^(k_j-m) × scale_2 = 2^m ⇒ 2^k_j exactly, BF16 dequant of E2M1 magnitudes is lossless, and the lossless-accounting mask (k >= k_max - 17) | (e8m0 == 0) correctly handles all-zero blocks (whose k_j = -127 would otherwise read as out-of-range). NVFP4QTensor.quantize accepts a pre-supplied per-block scale tensor and that path was already exercised by the GPT-OSS cast.
A few things that warrant a human pass before sign-off:
- No unit tests for the new V4-specific helpers.
_quantize_weight_nvfp4_lossless,_build_w13_kmax_overrides, and_kmax_from_mxfp4_scaleare exercised only by the on-the-real-V4-checkpoint validation reported in the PR body; the moved tests intests/unit/torch/quantization/test_numeric_utils.pycover only the hoisted library helpers. A cheap unit test for the V4 path (synthetic MXFP4 weight + scale → roundtrip lossless count, plus a w1/w3 shared-k_maxcase) would lock in the V4-specific contract. - README quietly promotes
--cast_mxfp4_to_nvfp4to the standard example.examples/deepseek/README.mdnow appends--cast_mxfp4_to_nvfp4to the mainquantize_to_nvfp4.pyinvocation, not as a "lossless variant" callout — that contradicts the PR body's "default export behavior unchanged" framing. Worth either dropping the flag from the top-level usage block (keeping the dedicated subsection that explains the trade-off) or saying explicitly that the flag is now recommended for V4 sources. - Stat naming nit:
stats[f"cast_oor_layers_{block_kind}"]is incremented per expert-projection tensor (w1/w2/w3), not per layer —cast_oor_tensors_*would matchexperts_*/weight_synth_*siblings. - Minor:
_build_w13_kmax_overridesreads each*.scaleonce and the main loop reads it again viaf.get_tensor(scale_key). Cheap (E8M0 is tiny relative to packed nibbles), but if you're already touching this you could reuse the cached read.
Nothing blocking; the cast itself looks right. Just want a human reviewer to land the unit-test ask and the README usage call.
it's ok to skip unittests impl for example code. |
2a2ac77 to
4fe63d0
Compare
| return overrides | ||
|
|
||
|
|
||
| def _quantize_weight_nvfp4_lossless( |
There was a problem hiding this comment.
I suggest to make this a general util in modeopt, I bet the mxfp4-> nvfp4 casting will be a often used util later.
There was a problem hiding this comment.
for the normal PTQ path, we already have the feature integrated. See https://github.com/NVIDIA/Model-Optimizer/blob/main/examples/llm_ptq/cast_mxfp4_to_nvfp4.py
meenchen
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Re-review: prior critical/important comments are addressed.
- 💬 Operator: "We don't need to add unittest for this example" — drops the V4-specific-helpers unit-test ask from the prior nudge.
- README usage call: top-level
quantize_to_nvfp4.pyinvocation no longer appends--cast_mxfp4_to_nvfp4; the flag is now in a dedicated "Lossless MXFP4 → NVFP4 weight cast" subsection with the trade-off explained, matching the "default unchanged" framing. - Stat naming:
cast_oor_layers_*renamed tocast_oor_tensors_*to match the per-tensor increment and theexperts_*/weight_synth_*siblings. - Hoist + math (closed-form
scale_2 = 2^(k_max-8),per_block_scale = 2^(k_j-m), lossless mask(k >= k_max - 17) | (e8m0 == 0), w1/w3 sharedk_max) is consistent with PR #1372 and previously reviewed. numeric_utils.pycarries the standard NVIDIA Apache-2.0 header; no other licensing changes.
Complex PR: spans 7 directories (≥ 5). Looping in a human for approval.
Add a closed-form, bit-exact MXFP4 -> NVFP4 routed-expert weight cast to examples/deepseek/deepseek_v4/quantize_to_nvfp4.py via a --cast_mxfp4_to_nvfp4 flag. Pins scale_2 = 2^(k_max-8) and each per-block E4M3 scale to 2^(k_j-m) from the source E8M0 scales, so the NVFP4 nibbles equal the source MXFP4 nibbles bit-for-bit for every in-range block. w1/w3 share one scale_2 for the fused GEMM1; activation input_scale still comes from --amax_path calibration. Hoist the shared closed-form numerics (mxfp4_to_nvfp4_global_amax, mxfp4_to_nvfp4_per_block_amax, and the E2M1/E4M3/E8M0 constants) out of the GPT-OSS example cast (examples/llm_ptq/cast_mxfp4_to_nvfp4.py, PR #1372) into modelopt.torch.quantization.utils.numeric_utils, so both the GPT-OSS and DeepSeek-V4 cast paths import them from the library. Their unit tests move to tests/unit/torch/quantization/test_numeric_utils.py; the example test keeps the cast-specific cases (quantizer naming, build_amax_map, apply_to_model). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
4fe63d0 to
ada91f0
Compare
Edwardf0t1
left a comment
There was a problem hiding this comment.
It seems there are overlapped contents in readme and docstrings, could we simplify?
What does this PR do?
Type of change: new feature
Brings the GPT-OSS lossless MXFP4 → NVFP4 cast (#1372) to DeepSeek V4's routed-expert export by adding a
--cast_mxfp4_to_nvfp4flag toexamples/deepseek/deepseek_v4/quantize_to_nvfp4.py.To avoid duplicating the closed-form math, the shared numerics —
mxfp4_to_nvfp4_global_amax,mxfp4_to_nvfp4_per_block_amax, and the E2M1/E4M3/E8M0 constants — are hoisted out of the GPT-OSS example cast into the library atmodelopt/torch/quantization/utils/numeric_utils.py. Both the GPT-OSS cast (examples/llm_ptq/cast_mxfp4_to_nvfp4.py) and the new DeepSeek path now import them from there.DeepSeek V4's routed experts ship as MXFP4 (E2M1 nibbles + a power-of-two E8M0 scale per 32-element block). By default the export dequantizes them to BF16 and re-quantizes to NVFP4 using the calibrated per-tensor weight amax, which re-derives per-block scales from the data and is therefore lossy. With the flag, the cast pins
scale_2 = 2^(k_max-8)and each per-block E4M3 scale to2^(k_j-m)straight from the source E8M0 scales, soper_block_scale * scale_2 = 2^k_jand the NVFP4 nibbles equal the source MXFP4 nibbles bit-for-bit (for every block whosek_jlands in E4M3's representable window; rare out-of-range blocks clamp). The one V4-specific addition is that w1/w3 share a singlescale_2for the fused GEMM1, sok_maxis taken over both projections. The flag only affects routed-expert weights — activationinput_scalestill comes from--amax_pathcalibration.Usage
python deepseek_v4/quantize_to_nvfp4.py \ --amax_path ${AMAX} \ --source_ckpt ${DS_V4} \ --output_ckpt ${HF_NVFP4_PATH} \ --cast_mxfp4_to_nvfp4Testing
tests/unit/torch/quantization/test_numeric_utils.py(10 cases: per-tensor global_amax, per-block amax incl. out-of-range, magnitude-table cache) — 10/10 pass. The example testtests/examples/llm_ptq/test_cast_mxfp4_to_nvfp4.pykeeps the cast-specific cases (quantizer naming,build_amax_map,apply_to_model).float8_e8m0fnuscale dtype): 23.5M blocks, 100% lossless, 0 error.[cast] lossless MXFP4->NVFP4 blocks: 8,657,043,456/8,657,043,456 (100.0000%). Output weights match an independently-produced reference cast byte-for-byte (weight_scale,weight_scale_2, packed nibbles modulo the harmless sign-of-zero).Before your PR is "Ready for review"
CONTRIBUTING.md: ✅ N/A (no new deps; shared numerics moved into the library rather than duplicated)tests/unit/torch/quantization/test_numeric_utils.py; end-to-end validated on a real DeepSeek-V4 checkpoint)/claude review)Additional Information
Mirrors and reuses #1372 (GPT-OSS MXFP4 → NVFP4 cast); the closed-form numerics are now shared via
modelopt.torch.quantization.utils.numeric_utils.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
--cast_mxfp4_to_nvfp4flag to perform a closed-form, mostly lossless MXFP4→NVFP4 conversion for routed-expert weights with aggregated lossless/block statistics.Documentation
Chores
Tests