Generic Fused MoE Quantization + Export for transformers 5.0+ #1187
Edwardf0t1 merged 17 commits into main from
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Note: Reviews paused. This branch is under active development; to avoid an influx of review comments on new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Adds detection, quantization wrapping, and export conversion for fused HuggingFace MoE expert modules: a new fused-expert wrapper and helpers, an export routine to split fused gate/up/down weights into per-expert submodules with quantization state, and integration into the unified HF quantization/export pipeline.
Sequence Diagrams

```mermaid
sequenceDiagram
    participant Model as HF Model
    participant Plugin as Quant Plugin
    participant Detector as _is_fused_experts_module()
    participant Registry as QuantModuleRegistry
    participant Wrapper as _QuantFusedExperts
    Model->>Plugin: Load model with fused MoE experts
    Plugin->>Detector: Inspect submodule shape/attrs
    Detector-->>Registry: Register fused-expert wrapper if matched
    Registry->>Wrapper: Instantiate wrapper for expert container
    Wrapper-->>Model: Replace/attach quantized expert wrapper
    Note over Wrapper: shared input quantizers\nper-expert weight quantizers
```
```mermaid
sequenceDiagram
    participant Module as Fused Expert Module
    participant Export as _export_fused_experts()
    participant Split as Weight Slicer
    participant Quant as Quantizer Handler
    participant Register as Submodule Registration
    Export->>Module: Identify fused params & num_experts
    loop for each expert idx
        Export->>Split: Slice gate_up_proj -> gate_proj + up_proj
        Export->>Split: Slice down_proj -> down_proj[idx]
        Export->>Quant: Deep-copy per-expert quantizers, slice _amax or compute if missing
        Quant-->>Export: Exported quantized weight buffers
        Export->>Register: Create/register gate_proj/up_proj/down_proj under module.{idx}
    end
    Export->>Module: Delete fused params and quantizer lists
```
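The per-expert slicing step in this diagram can be sketched as follows. This is a simplified illustration assuming the eager `[N, 2I, H]` / `[N, H, I]` layout and plain `nn.Linear` targets; the helper name `split_fused_experts` is ours, not the PR's API, and quantizer handling is omitted:

```python
import torch
import torch.nn as nn

def split_fused_experts(gate_up_proj: torch.Tensor, down_proj: torch.Tensor) -> nn.ModuleDict:
    """Split fused [N, 2I, H] gate_up and [N, H, I] down weights into per-expert Linears."""
    num_experts, two_i, hidden = gate_up_proj.shape
    inter = two_i // 2
    experts = nn.ModuleDict()
    for idx in range(num_experts):
        expert = nn.Module()
        expert.gate_proj = nn.Linear(hidden, inter, bias=False)
        expert.up_proj = nn.Linear(hidden, inter, bias=False)
        expert.down_proj = nn.Linear(inter, hidden, bias=False)
        with torch.no_grad():
            expert.gate_proj.weight.copy_(gate_up_proj[idx, :inter, :])  # first half -> gate
            expert.up_proj.weight.copy_(gate_up_proj[idx, inter:, :])    # second half -> up
            expert.down_proj.weight.copy_(down_proj[idx])                # [H, I] slice
        experts[str(idx)] = expert
    return experts

gate_up = torch.randn(4, 2 * 16, 32)  # N=4, I=16, H=32
down = torch.randn(4, 32, 16)
experts = split_fused_experts(gate_up, down)
```

The real export additionally deep-copies per-expert quantizers and registers the submodules under `experts.{E}.gate_proj` naming; this sketch only shows the weight split.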
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Important: pre-merge checks failed (1 warning, 1 inconclusive). Please resolve all errors before merging; addressing warnings is optional. 2 checks passed.
Codecov Report

❌ Patch coverage is … Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1187      +/-   ##
==========================================
+ Coverage   75.42%   77.49%    +2.07%
==========================================
  Files         353      353
  Lines       40603    40662       +59
==========================================
+ Hits        30623    31510      +887
+ Misses       9980     9152      -828
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/export/layer_utils.py (1)
968-995: ⚠️ Potential issue | 🟠 Major: Generic support is still incomplete for the AWQ/SVDQuant resmooth path.

`get_expert_linear_names()` now recognizes fused blocks, but `requantize_resmooth_fused_llm_layers()` still goes through `get_experts_list()`, and that helper only accepts Mixtral/Qwen model types. For the new targets called out in this PR (DeepSeek, Jamba, OLMoE, etc.), AWQ/NVFP4_SVDQuant export will still hit `NotImplementedError` before `_export_fused_experts()` is reached. Please add a structural fallback there, or bypass the old model-type path for fused experts entirely.

🤖 Prompt for AI Agents
get_expert_linear_names()now recognizes fused blocks, butrequantize_resmooth_fused_llm_layers()still goes throughget_experts_list(), and that helper only accepts Mixtral/Qwen model types. For the new targets called out in this PR (DeepSeek, Jamba, OLMoE, etc.), AWQ/NVFP4_SVDQuant export will still hitNotImplementedErrorbefore_export_fused_experts()is reached. Please add a structural fallback there, or bypass the old model-type path for fused experts entirely.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/layer_utils.py` around lines 968 - 995, The requantize_resmooth_fused_llm_layers flow fails for new fused expert types because requantize_resmooth_fused_llm_layers still relies on get_experts_list (which only knows Mixtral/Qwen) and raises NotImplementedError before _export_fused_experts is reached; update requantize_resmooth_fused_llm_layers (or get_experts_list) to use the same structural detection as get_expert_linear_names() (e.g., check hasattr(module, "experts") and hasattr(module.experts, "gate_up_proj_weight_quantizers") or call get_expert_linear_names(module)) as a fallback path so fused experts (DeepSeek, Jamba, OLMoE, etc.) are handled generically instead of raising NotImplementedError, or bypass the old model-type branch entirely and directly dispatch to _export_fused_experts() when the structural checks match.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 871-879: _the detector is too permissive and may misidentify
BMM-style expert containers as F.linear-style fused experts; tighten
_is_fused_experts_module() (and the fallback in
_get_fused_expert_intermediate_dim()) by explicitly validating tensor
orientations expected by _QuantFusedExperts: require gate_up_proj and
gate_down_proj to be 2-D, experts to be 3-D, compute candidate_intermediate =
gate_up_proj.shape[1] // 2 and assert gate_up_proj.shape[1] is even, then verify
experts.shape[2] == candidate_intermediate and experts.shape[0] ==
int(getattr(module, "num_experts", experts.shape[0])); only auto-register the
_QuantFusedExperts path in CUSTOM_MODEL_PLUGINS when those orientation checks
pass so alternative layouts fall back to their own handlers (e.g.,
_QuantLlama4TextExperts).
- Around line 906-925: The computed idx must only be trusted if the slice truly
shares the underlying storage with self.gate_up_proj; first check that
weight.storage().data_ptr() (or weight.storage() identity) equals
self.gate_up_proj.storage().data_ptr() and raise a clear error if it does not,
instead of silently proceeding; then compute idx from storage_offset() and
stride as before and keep the existing range assertion using
_get_expert_idx_from_gate_up, gate_up_proj, storage_offset, stride, and
num_experts to locate and fix the code path.
In `@tests/unit/torch/quantization/plugins/test_fused_experts.py`:
- Around line 229-230: The assertion comparing out_ref and out_test is too
strict (atol=1e-5) causing CI flakiness; update the test in
test_fused_experts.py (the comparison of out_ref and out_test) to use
torch.testing.assert_close with a slightly larger tolerance (e.g., atol=2e-5 or
rtol=1e-6) or, alternatively, explicitly disable the new quantizers before
producing out_test so the values match exactly; modify the assertion accordingly
to reference out_ref and out_test.
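For reference, the loosened comparison suggested above would look like this; the tensors here are synthetic stand-ins for `out_ref`/`out_test`:

```python
import torch

out_ref = torch.randn(4, 8)
out_test = out_ref + 5e-6  # constant drift below the suggested atol

# Tolerances from the review suggestion; passes where a bare atol=1e-5 allclose might flake.
torch.testing.assert_close(out_test, out_ref, atol=2e-5, rtol=1e-6)
```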
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 06b7ed0f-85fc-4c97-a13b-76a86a374933
📒 Files selected for processing (5)
- modelopt/torch/export/layer_utils.py
- modelopt/torch/export/moe_utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/plugins/huggingface.py
- tests/unit/torch/quantization/plugins/test_fused_experts.py
🧹 Nitpick comments (1)
tests/unit/torch/quantization/plugins/test_fused_experts.py (1)
160-165: Consider using a pytest fixture for registry cleanup to ensure isolation on test failure.

The current manual cleanup at the start and end of each test works but is fragile if a test fails mid-execution. A fixture with `yield` guarantees cleanup.

♻️ Proposed fixture-based cleanup

```diff
+@pytest.fixture(autouse=True)
+def cleanup_registry():
+    """Ensure registry is clean before and after each test."""
+    yield
+    # Cleanup after test completes (even on failure)
+    for mod_type in [_SyntheticFusedExperts, _SyntheticSparseMoeBlock]:
+        if QuantModuleRegistry.get(mod_type) is not None:
+            QuantModuleRegistry.unregister(mod_type)
+
 class TestQuantFusedExperts:
-    @staticmethod
-    def _cleanup_registry(mod_type):
-        if QuantModuleRegistry.get(mod_type) is not None:
-            QuantModuleRegistry.unregister(mod_type)
```

Then remove the explicit `_cleanup_registry` calls from each test method.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/torch/quantization/plugins/test_fused_experts.py` around lines 160 - 165, Replace the ad-hoc _cleanup_registry pattern with a pytest fixture that yields and performs teardown to guarantee registry cleanup on test failures: create a fixture (e.g., quant_module_registry_cleanup) that checks QuantModuleRegistry.get(mod_type) before yield and calls QuantModuleRegistry.unregister(mod_type) in the finally/teardown section, then use that fixture in TestQuantFusedExperts tests and remove explicit calls to TestQuantFusedExperts._cleanup_registry from each test method.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9d8f0410-80ad-49d4-bed2-3fb4f4022ae4
📒 Files selected for processing (1)
tests/unit/torch/quantization/plugins/test_fused_experts.py
Test failing:

It should be fixed now, could you double check?
Actionable comments posted: 1
♻️ Duplicate comments (2)
modelopt/torch/quantization/plugins/huggingface.py (2)
907-926: ⚠️ Potential issue | 🟠 Major: Reject materialized slices before deriving `idx`.

Line 921 only checks that the computed index is in range. A copied/materialized `gate_up_proj[idx]` can still satisfy that check, typically by collapsing to expert 0, so this path silently reuses the wrong quantizers instead of failing fast. Guard on shared storage, and ideally on expert-boundary alignment, before computing `idx`.

🛠️ Suggested guard

```diff
 def _get_expert_idx_from_gate_up(self, weight: torch.Tensor) -> int:
     ...
     base_offset = self.gate_up_proj.storage_offset()
     stride = self.gate_up_proj.stride(0)
     if stride == 0:
         return 0
-    idx = (weight.storage_offset() - base_offset) // stride
+    if weight.untyped_storage().data_ptr() != self.gate_up_proj.untyped_storage().data_ptr():
+        raise AssertionError(
+            "Expected gate_up_proj[idx] to share storage with self.gate_up_proj."
+        )
+    delta = weight.storage_offset() - base_offset
+    if delta % stride != 0:
+        raise AssertionError("Expected gate_up_proj[idx] to start on an expert boundary.")
+    idx = delta // stride
```

In PyTorch, if an expert index is computed from `(weight.storage_offset() - base_offset) // stride`, can `weight.clone()` or `weight.contiguous()` still yield an in-range index even though the tensor no longer shares storage with the original parameter?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 907 - 926, The _get_expert_idx_from_gate_up method should reject materialized/copy slices before deriving idx: verify that weight shares the same underlying storage object as self.gate_up_proj (e.g., compare storage/data_ptr) and that the offset aligns with expert boundaries ((weight.storage_offset() - base_offset) is divisible by stride) before computing idx; if either check fails (including the stride==0 special case), raise a clear error instead of proceeding so we don't silently reuse the wrong quantizers.
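The guarded index recovery can be exercised standalone. This sketch uses a plain tensor in place of the module parameter; the function name is illustrative, not the PR's:

```python
import torch

def expert_idx_from_slice(fused: torch.Tensor, weight: torch.Tensor) -> int:
    """Recover idx assuming `weight` is the view fused[idx]; reject copies."""
    if weight.untyped_storage().data_ptr() != fused.untyped_storage().data_ptr():
        raise ValueError("slice does not share storage with the fused parameter")
    stride = fused.stride(0)
    if stride == 0:
        return 0
    delta = weight.storage_offset() - fused.storage_offset()
    if delta % stride != 0:
        raise ValueError("slice does not start on an expert boundary")
    idx = delta // stride
    if not 0 <= idx < fused.shape[0]:
        raise ValueError("computed expert index out of range")
    return int(idx)

fused = torch.randn(4, 6, 8)
assert expert_idx_from_slice(fused, fused[2]) == 2  # view: storage offset encodes the expert
try:
    expert_idx_from_slice(fused, fused[2].clone())   # copy: new storage, rejected loudly
except ValueError:
    pass
```

Without the storage check, `fused[2].clone()` has `storage_offset() == 0` and would silently map to expert 0, which is exactly the failure mode the comment describes.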
871-879: ⚠️ Potential issue | 🟠 Major: Auto-register only the exact eager fused-expert layout.

The detector still accepts any module with 3-D `gate_up_proj`/`down_proj`, `num_experts`, and `act_fn`. That is broad enough to register transposed/BMM-style containers or non-eager fused variants, while `_QuantFusedExperts` assumes `[N, 2I, H]`/`[N, H, I]`, an even split, and two alternating `F.linear` calls. Because the registry is keyed by class, one false positive affects every instance of that type. Validate tensor orientation, `num_experts` consistency, and the resolved experts backend before registering. The `shape[1] // 2` fallback should also reject odd widths instead of truncating.

In Hugging Face transformers 5.x fused MoE experts, what tensor layouts are used for eager vs batched_mm/grouped_mm implementations, and where is the active experts implementation exposed on the module or config?

Also applies to: 1458-1505, 1729-1735
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 871 - 879, The current detector in _get_fused_expert_intermediate_dim and the fused-expert registration is too permissive; restrict registration to the exact eager fused-expert layout by validating that module.gate_up_proj and module.gate_down_proj have the expected orientations (e.g., gate_up_proj.shape == (num_experts, 2*I, H) and gate_down_proj.shape == (num_experts, H, I) for eager layout), that num_experts on the module matches the leading dimension of these tensors, that the intermediate dimension (shape[1]) is even (reject odd widths rather than floor-dividing), and that the module’s resolved experts backend/type (from config or exposed attribute) corresponds to the eager implementation before returning or registering _QuantFusedExperts; update _get_fused_expert_intermediate_dim to return None or raise when these validations fail so only exact eager fused-expert modules are auto-registered.
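A tightened detector along the lines requested might look like this. The shape conventions follow the layout quoted in the comment; the function name is ours and backend resolution is left out:

```python
import torch
import torch.nn as nn

def is_eager_fused_experts(module: nn.Module) -> bool:
    """Accept only the eager fused layout: gate_up_proj [N, 2I, H], down_proj [N, H, I]."""
    gate_up = getattr(module, "gate_up_proj", None)
    down = getattr(module, "down_proj", None)
    num_experts = getattr(module, "num_experts", None)
    if not (torch.is_tensor(gate_up) and torch.is_tensor(down)) or num_experts is None:
        return False
    if gate_up.dim() != 3 or down.dim() != 3:
        return False
    n, two_i, hidden = gate_up.shape
    if n != int(num_experts) or two_i % 2 != 0:  # reject odd widths instead of truncating
        return False
    inter = two_i // 2
    return tuple(down.shape) == (n, hidden, inter)

# Synthetic containers for a quick sanity check
good = nn.Module()
good.gate_up_proj = torch.zeros(4, 32, 64)  # [N, 2I, H] with N=4, I=16, H=64
good.down_proj = torch.zeros(4, 64, 16)     # [N, H, I]
good.num_experts = 4

bad = nn.Module()
bad.gate_up_proj = torch.zeros(4, 32, 64)
bad.down_proj = torch.zeros(4, 16, 64)      # transposed/BMM-style: must be rejected
bad.num_experts = 4
```

A real fix would also confirm the module's resolved experts implementation is `"eager"` before registering, as the comment notes.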
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/unit/torch/quantization/plugins/test_fused_experts.py`:
- Around line 255-298: Add a new CPU regression test that mirrors
test_export_creates_per_expert_submodules but uses the legacy non-fused MoE
container (the test suite's non-fused Synthetic expert class), e.g., instantiate
the non-fused container, register it with QuantModuleRegistry, call
QuantModuleRegistry.convert(...) to get converted, run a small forward on CPU to
calibrate, call _export_fused_experts(converted, torch.float32), then assert
per-expert submodules (getattr(converted, str(idx))) exist and contain
gate_proj/up_proj/down_proj with expected shapes and that fused attributes like
gate_up_proj and gate_up_proj_weight_quantizers are removed, and finally
unregister the expert type from QuantModuleRegistry.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 31275e25-b8d3-4777-baaa-ab979d03d640
📒 Files selected for processing (4)
- modelopt/torch/export/layer_utils.py
- modelopt/torch/export/moe_utils.py
- modelopt/torch/quantization/plugins/huggingface.py
- tests/unit/torch/quantization/plugins/test_fused_experts.py
🚧 Files skipped from review as they are similar to previous changes (2)
- modelopt/torch/export/moe_utils.py
- modelopt/torch/export/layer_utils.py
```python
class TestExportFusedExperts:
    def test_export_creates_per_expert_submodules(self):
        """_export_fused_experts should create per-expert submodules with standard naming."""
        from modelopt.torch.export.moe_utils import _export_fused_experts

        experts = _SyntheticFusedExperts()
        expert_type = type(experts)

        # Manually register and convert
        if QuantModuleRegistry.get(expert_type) is None:
            QuantModuleRegistry.register({expert_type: "test.SyntheticFusedExperts"})(
                _QuantFusedExperts
            )
        converted = QuantModuleRegistry.convert(experts)

        # Run a forward pass to calibrate (set amaxes)
        seq_len = 16
        hidden_states = torch.randn(seq_len, HIDDEN_DIM)
        top_k_index = torch.randint(0, NUM_EXPERTS, (seq_len, TOP_K))
        top_k_weights = torch.softmax(torch.randn(seq_len, TOP_K), dim=-1)
        with torch.no_grad():
            converted(hidden_states, top_k_index, top_k_weights)

        _export_fused_experts(converted, torch.float16)

        # Verify per-expert submodules exist
        for idx in range(NUM_EXPERTS):
            expert_mod = getattr(converted, str(idx), None)
            assert expert_mod is not None, f"Missing expert submodule {idx}"
            assert hasattr(expert_mod, "gate_proj"), f"Expert {idx} missing gate_proj"
            assert hasattr(expert_mod, "up_proj"), f"Expert {idx} missing up_proj"
            assert hasattr(expert_mod, "down_proj"), f"Expert {idx} missing down_proj"

            assert expert_mod.gate_proj.weight.shape == (INTERMEDIATE_DIM, HIDDEN_DIM)
            assert expert_mod.up_proj.weight.shape == (INTERMEDIATE_DIM, HIDDEN_DIM)
            assert expert_mod.down_proj.weight.shape == (HIDDEN_DIM, INTERMEDIATE_DIM)

        # Verify fused params are removed
        assert not hasattr(converted, "gate_up_proj")
        assert not hasattr(converted, "down_proj")
        assert not hasattr(converted, "gate_up_proj_weight_quantizers")

        if QuantModuleRegistry.get(expert_type) is not None:
            QuantModuleRegistry.unregister(expert_type)
```
Please add a legacy non-fused export regression.
This export test only exercises the new fused layout. The reported backward-compat break is on the older per-expert expert-module path, so this suite still would not catch an export branch that assumes gate_proj/up_proj naming everywhere. Add one CPU regression for a non-fused MoE expert container as well.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/torch/quantization/plugins/test_fused_experts.py` around lines 255
- 298, Add a new CPU regression test that mirrors
test_export_creates_per_expert_submodules but uses the legacy non-fused MoE
container (the test suite's non-fused Synthetic expert class), e.g., instantiate
the non-fused container, register it with QuantModuleRegistry, call
QuantModuleRegistry.convert(...) to get converted, run a small forward on CPU to
calibrate, call _export_fused_experts(converted, torch.float32), then assert
per-expert submodules (getattr(converted, str(idx))) exist and contain
gate_proj/up_proj/down_proj with expected shapes and that fused attributes like
gate_up_proj and gate_up_proj_weight_quantizers are removed, and finally
unregister the expert type from QuantModuleRegistry.
Force-pushed 196748e to 3e1a0fc (compare)
cjluo-nv left a comment:
Have you tested models 1) with gate and up fused and 2) with gate and up separated?
Yes: fused gate_up_proj is tested via unit tests (_SyntheticFusedExperts) and the new GPU export test entries for Qwen3 MoE (fp8, nvfp4). The separated per-expert path (gate_proj + up_proj as individual nn.Linear) is handled by the existing _QuantSparseSequentialMoe and is not changed by this PR.
| """ | ||
| if not hasattr(module, "gate_up_proj") or not hasattr(module, "down_proj"): | ||
| return False | ||
| if not hasattr(module, "num_experts") or not hasattr(module, "act_fn"): |
Hi @Edwardf0t1, can we safely assume that act_fn will always be present as an attribute in the Expert module?
I noticed that GPTOSS doesn't include the act_fn attribute in its module. If that's the case, wouldn't this cause the function to return False even when we're actually dealing with a fused expert module?
I don't see GPT-OSS in lines 1376-; do we have special handling for that, or do we not consider it?
Apologies if my query is trivial, hoping to ramp up and contributing soon :)
Good question - GPT-OSS is intentionally excluded here:

- It has its own dedicated handler (`_QuantGptOssExperts`) registered explicitly at import time. The `act_fn` check serves as a structural filter to distinguish the standard HF fused pattern (which uses `F.linear` that we intercept) from non-standard layouts like GPT-OSS (which uses `torch.bmm`/`__matmul__` instead).
- Even without the `act_fn` check, `register_fused_experts_on_the_fly` has a second guard: it skips any module type that's already registered.
```python
        return self.num_experts

    def _setup(self):
        n = self.num_experts
        self.gate_up_proj_input_quantizer = TensorQuantizer()
```
Could you test a model that does not have gate_proj?
You mean a dense model? That's covered already.
tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py (outdated; resolved)
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Update test imports and assertions after _is_sparse_moe_block was renamed to _is_sparse_sequaential_moe_block in PR #975. Fused MoE blocks with non-iterable experts are correctly not detected as sequential MoE blocks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Add create_tiny_qwen3_moe_dir helper and add Qwen3 MoE entries (fp8, nvfp4) to test_unified_hf_export_and_check_safetensors so CI covers the fused MoE PTQ + export path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

Add create_tiny_gpt_oss_dir helper and add GPT-OSS entries (fp8, nvfp4) to test_unified_hf_export_and_check_safetensors to verify GPT-OSS PTQ + export works with transformers 5.0+.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Qwen3_5MoeExperts follows the standard fused expert pattern (gate_up_proj 3D + down_proj 3D + num_experts + act_fn) and is now handled generically by _QuantFusedExperts via register_fused_experts_on_the_fly. Both Qwen3.5-397B and Qwen3.5-35B use fused 3D experts in their main MoE layers; MTP per-expert layers are iterable and handled by _QuantSparseSequentialMoe.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Force-pushed 6f5015c to 28c1307 (compare)
Reduce from 12 to 7 test cases by distributing models across qformats: llama for dense-only formats, MoE models for common formats (fp8, nvfp4).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Review feedback

1. GPU tests remove existing dense model coverage

Three existing llama test cases were removed:

These were replaced with 2 MoE tests. The MoE additions are great, but the llama removals are a regression in dense-model export coverage. Please keep the existing llama tests and append the MoE tests to the parametrize list.

2. Silent no-op when amax dim is not divisible

```python
if fused_total % amax_dim0 == 0:
    slice_start = fused_start * amax_dim0 // fused_total
    ...
```

When `fused_total % amax_dim0 != 0`, the slicing is silently skipped; please warn instead of silently no-op'ing.
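A sketch of the proportional per-channel amax slicing with an explicit warning on non-divisible widths, in the spirit of this feedback. The helper name and the scalar-max fallback are assumptions, not the PR's code:

```python
import warnings
import torch

def slice_fused_amax(amax: torch.Tensor, fused_start: int, fused_len: int, fused_total: int) -> torch.Tensor:
    """Slice a per-channel amax proportionally to one expert's sub-range of the fused dim."""
    amax_dim0 = amax.shape[0]
    if fused_total % amax_dim0 != 0:
        warnings.warn(
            f"amax dim {amax_dim0} does not evenly divide fused width {fused_total}; "
            "falling back to a scalar max instead of silently skipping"
        )
        return amax.max().reshape(1)
    start = fused_start * amax_dim0 // fused_total
    end = (fused_start + fused_len) * amax_dim0 // fused_total
    return amax[start:end]

amax = torch.arange(1.0, 9.0)  # 8 per-channel amax values over a fused width of 16
```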
- Restore removed llama test cases (nvfp4, nvfp4_mse, nvfp4_awq) while keeping MoE entries appended, so dense-model export coverage is preserved.
- Add warning when per-channel amax dim is not evenly divisible during fused expert weight slicing, instead of silently skipping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@cjluo-nv Thanks for catching both issues. Fixed in 2db2fb3.
What does this PR do?
Add generic quantization and export support for fused MoE expert modules in HuggingFace transformers 5.0+.
In transformers 5.0+, all major MoE models switched from sequential per-expert `nn.ModuleList` to fused 3D tensor parameters (`gate_up_proj`, `down_proj`). This breaks ModelOpt's existing per-expert quantization and export pipeline, which assumes iterable expert submodules.

Affected models (verified against transformers v5.5.0 source):
- `MixtralExperts` (Mixtral)
- `Qwen2MoeExperts` (Qwen2-MoE)
- `Qwen3MoeExperts` (Qwen3-MoE)
- `Qwen3_5MoeExperts` (Qwen3.5-MoE)
- `DeepseekV3NaiveMoe` (DeepSeek-V3)
- `JambaExperts`, `OlmoeExperts`, and any future model following the same HF standard pattern

Key insight: all these models share an identical fused expert structure and forward pattern. A single generic solution replaces N model-specific implementations.
Context: relationship to PR #975 and PR #1170
- kmorabi/bump-transformers-5.0: adds experimental transformers 5.0 support but explicitly skips batched MoE experts. This PR fills that gap.
- chenjiel/refactor_qwen35: handles only Qwen3.5 using `_QuantFunctionalMixin`. This PR generalizes that approach to all fused MoE models.

Changes
Quantization (modelopt/torch/quantization/plugins/huggingface.py):

- `_QuantFusedExperts(_QuantFunctionalMixin)`: generic wrapper that intercepts `F.linear` calls and applies per-expert quantization via storage-offset-based expert index recovery. Each expert gets its own weight and input quantizers (`nn.ModuleList`).
- `_is_fused_experts_module()`: structural detector: `gate_up_proj` (3D) + `down_proj` (3D) + `num_experts` + `act_fn`.
- `register_fused_experts_on_the_fly()`: auto-registration callback, added to `CUSTOM_MODEL_PLUGINS` before `register_sparse_moe_on_the_fly` so explicit registrations (Llama4, GptOss, etc.) take priority.
- `_get_fused_expert_intermediate_dim()`: helper for cross-version attribute name resolution (`intermediate_dim` / `intermediate_size` / fallback to shape).
Export (modelopt/torch/export/moe_utils.py, unified_export_hf.py, layer_utils.py):

- `_export_fused_experts()`: splits fused 3D weights into per-expert 2D projections (`gate_proj`, `up_proj`, `down_proj`), handles amax fallback for uncalibrated experts, proportionally slices per-channel amax, and registers results under the standard `experts.{E}.gate_proj.weight` naming convention.
- `_process_quantized_modules` and `_export_transformers_checkpoint` now dispatch to `_export_fused_experts` for fused expert modules.
- `get_expert_linear_names` updated for fused experts. Added `MixtralSparseMoeBlock` to the `gate_proj`/`down_proj`/`up_proj` group (transformers 5.0 naming).

Tests (tests/unit/torch/quantization/plugins/test_fused_experts.py)

Two-level registration design
```
SparseMoeBlock --> _QuantSparseMoe       (calibration control, token counting, top_k override)
  .experts --> _QuantFusedExperts        (per-expert F.linear interception + quantization)
```
`register_fused_experts_on_the_fly` runs first to register the inner expert module; `register_sparse_moe_on_the_fly` then registers the outer block. `_QuantSparseMoe.layer_sync_moe_local_experts_amax` skips fused experts (they are not iterable), as per-expert amax is managed internally by `_QuantFusedExperts`.
@use_experts_implementationbackends: TheF.linearinterception only works withexperts_implementation="eager"(default).batched_mm/grouped_mmusetorch.bmm/torch._grouped_mminstead and are not intercepted.storage_offset()breaks under.contiguous(), FSDP2 redistribution, ortorch.compilematerialization. Runtime assertions are included.F.linearcalls per expert. Documented in docstrings.Testing
test_sparse_moe.py)Before your PR is "Ready for review"
Summary by CodeRabbit
New Features
Tests
Changelog