📝 Walkthrough
This PR adds support for Qwen3-VL vision-language models to the Megatron Core export/import framework by introducing model mappings to convert between Hugging Face and Megatron Core structures for quantization workflows, alongside documentation updates and comprehensive unit tests.
## What does this PR do?
This PR implements the RegionSearch class. RegionSearch helps partition a large ONNX model into small regions; QDQ autotuning is then performed on those regions.
## Before your PR is "*Ready for review*"
- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: Yes
- **Did you add or update any necessary documentation?**: No, documentation updates are in Part 4.
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: CHANGELOG will be updated when all changes are ready.
## Summary by CodeRabbit
## Release Notes
**Refactor**
* Improved ONNX quantization backend with new optimization framework and extensive test coverage to enhance internal graph processing capabilities.
---------
Signed-off-by: Will Guo <willg@nvidia.com>
Signed-off-by: Hung-Yueh <hungyueh.chiang@gmail.com>
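A self-contained, hypothetical illustration of the partitioning idea described in the commit above: cut a topologically ordered node list into small regions that can be autotuned independently. The node names and the function are made up for illustration and are not the actual RegionSearch API.

```python
# Hypothetical sketch of region partitioning over an ONNX-like node list.
nodes = [
    ("conv1", "Conv"), ("relu1", "Relu"), ("conv2", "Conv"),
    ("relu2", "Relu"), ("gemm1", "Gemm"), ("softmax", "Softmax"),
]


def partition_into_regions(nodes, max_region_size=3):
    """Greedily group consecutive nodes into regions of at most max_region_size."""
    regions, current = [], []
    for node in nodes:
        current.append(node)
        if len(current) == max_region_size:
            regions.append(current)
            current = []
    if current:
        regions.append(current)
    return regions


for i, region in enumerate(partition_into_regions(nodes)):
    print(f"region {i}: {[name for name, _ in region]}")
```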
## What does this PR do?
This PR implements the RegionInspect tool. It can be used to visualize the regions partitioned by the RegionSearch class and to analyze whether the partitioned regions match the fusion patterns.
## Before your PR is "*Ready for review*"
- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: Yes
- **Did you add or update any necessary documentation?**: No, the documentation update is in Part 4.
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: No, CHANGELOG will be updated when all changes are ready.
## Summary by CodeRabbit
## New Features
* Added a region inspection tool for ONNX models. Analyzes model structure and generates detailed reports including region statistics, hierarchical relationships, node coverage metrics, and size distribution analysis. Available through a command-line interface with configurable parameters.
---------
Signed-off-by: Will Guo <willg@nvidia.com>
Co-authored-by: Ajinkya Rasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Hung-Yueh <hungyueh.chiang@gmail.com>
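For the report side, here is a hypothetical sketch of the statistics the commit above describes (region count, node coverage, size distribution); the data and output format are illustrative, not the actual tool's report.

```python
# Hypothetical region-inspection statistics; regions and totals are made-up data.
from collections import Counter

total_nodes = 6
regions = [["conv1", "relu1", "conv2"], ["relu2", "gemm1"], ["softmax"]]

covered = sum(len(r) for r in regions)          # nodes covered by at least one region
sizes = Counter(len(r) for r in regions)        # histogram of region sizes

print(f"regions: {len(regions)}")
print(f"node coverage: {covered}/{total_nodes} ({covered / total_nodes:.0%})")
print(f"size distribution: {dict(sizes)}")
```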
## What does this PR do?
**Type of change:** Bug fix
**Overview:**
1. Fix the Megatron ignore-module name having an extra `.` in the suffix.
2. Change the Megatron export to save each layer as its own safetensors file (avoiding ghost safetensors).
## Before your PR is "*Ready for review*"
- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes/No
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes/No
## Summary by CodeRabbit
## Release Notes
* **New Features**
  * Export workflow now supports additional model components (EAGLE/Medusa modules)
  * Per-layer model state organization for improved checkpoint management
* **Bug Fixes**
  * More robust Hugging Face configuration, tokenizer, and image processor preservation
  * Enhanced multimodal component extraction and loading
* **Refactor**
  * Optimized model export process with improved per-layer safetensors handling
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Signed-off-by: Hung-Yueh <hungyueh.chiang@gmail.com>
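A minimal sketch of the per-layer safetensors idea from the commit above, assuming a hypothetical layer-sharding scheme; the file naming and grouping here are not the actual ModelOpt export layout.

```python
# Hypothetical per-layer safetensors export: one file per decoder layer, so a
# re-export never leaves stale ("ghost") shards from a previous run behind.
import re
from collections import defaultdict

import torch
from safetensors.torch import save_file

state_dict = {
    "model.layers.0.self_attn.q_proj.weight": torch.zeros(8, 8),
    "model.layers.1.self_attn.q_proj.weight": torch.zeros(8, 8),
    "lm_head.weight": torch.zeros(8, 8),
}

shards = defaultdict(dict)
for name, tensor in state_dict.items():
    m = re.match(r"model\.layers\.(\d+)\.", name)
    shard_name = f"layer_{m.group(1)}" if m else "rest"
    shards[shard_name][name] = tensor

for shard_name, tensors in shards.items():
    save_file(tensors, f"model-{shard_name}.safetensors")
```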
## What does this PR do?
**Type of change:** New model support
**Overview:** Add PTQ support for https://huggingface.co/nvidia/NVIDIA-Nemotron-Parse-v1.1
## Usage
```bash
python3 hf_ptq.py --pyt_ckpt_path /home/omniml_data_3/models/NVIDIA-Nemotron-Parse-v1.1 --qformat fp8 --export_path /home/omniml_data_3/zhiyuc/checkpoints/NVIDIA-Nemotron-Parse-v1.1-FP8 --trust_remote_code --kv_cache_qformat none --attn_implementation eager
```
By default, image-text data will be used in calibration for VLMs.
## Before your PR is "*Ready for review*"
- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Not yet
## Summary by CodeRabbit
* **New Features**
  * Added support for Nemotron-Parse multimodal models, including proper device mapping, processor loading, and generation handling.
* **Improvements**
  * Enhanced quantization robustness with safer handling of quantization attributes and fallback logic.
  * Improved model loading with better device placement and encoder buffer management for vision-language models.
---------
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Hung-Yueh <hungyueh.chiang@gmail.com>
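Roughly the programmatic flow behind the `hf_ptq.py` command in the commit above, sketched with the public `modelopt.torch.quantization` API; the model-loading class and the calibration loop are placeholders, not the script's internals.

```python
# Sketch of FP8 PTQ with ModelOpt; the model class and calibration data are assumptions.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "nvidia/NVIDIA-Nemotron-Parse-v1.1", trust_remote_code=True
)


def forward_loop(model):
    # Run a handful of image-text calibration batches through the model here.
    ...


# Calibrate and insert FP8 quantizers in place.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```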
## What does this PR do?
**Type of change:** Bug fix
## Testing
Nemotron Nano v2 pruned can be saved.
## Summary by CodeRabbit
* **Bug Fixes**
  * Fixed Hugging Face model loading to properly respect the `trust_remote_code` parameter during model instantiation.
* **Improvements**
  * Enhanced distributed training logging with rank-0 aware warning and logging mechanisms for cleaner, non-redundant output in multi-GPU and multi-node scenarios.
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Hung-Yueh <hungyueh.chiang@gmail.com>
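A minimal sketch of the rank-0-aware warning pattern mentioned in the release notes above; the helper name is illustrative, not the actual ModelOpt utility.

```python
# Emit a warning once per job instead of once per distributed process.
import warnings

import torch.distributed as dist


def warn_rank_0(message: str) -> None:
    """Warn only on rank 0 (or when torch.distributed is not initialized)."""
    if not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0:
        warnings.warn(message)


warn_rank_0("Example: this prints once even on a multi-GPU run.")
```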
Signed-off-by: Hung-Yueh <hungyueh.chiang@gmail.com>
## What does this PR do?
[Short term]: Megatron-based tests take a long time, often resulting in CI/CD timeouts. Split the Megatron tests into a dedicated CI/CD job for a faster overall CI/CD run.
[Mid/long term]: Run all Megatron GPU tests using `torchrun` instead of `pytest` so all distributed processes are already created and individual tests no longer need to set up and destroy their own processes, which adds a lot of overhead per test.
## Testing
- [x] 1-GPU CI/CD passing (on this PR)
- [x] 2-GPU CI/CD passing (on nightly run, manually triggered): https://github.com/NVIDIA/Model-Optimizer/actions/runs/22000517688
---------
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Hung-Yueh <hungyueh.chiang@gmail.com>
(NVIDIA#884)
## What does this PR do?
**Type of change:** new feature

**Overview:** Enable full TE spec support for NemotronH (Mamba hybrid) models during HF-to-Megatron weight import via `import_mcore_gpt_from_hf`.

Previously, importing HF weights into a Megatron model built with the full TE spec (`TELayerNormColumnParallelLinear`, `TEGroupedMLP`, etc.) failed for NemotronH models due to two issues:
1. **Grouped expert prefix bug**: The `experts.linear_fc1/fc2` import rules had a hard-coded `mtp.layers.{}` prefix, which was only correct for MTP layers. When regular decoder MoE layers use `TEGroupedMLP` (via the full TE spec), the importer generated incorrect HF keys (e.g., `mtp.layers.27.mixer.experts.0.up_proj.weight` instead of `backbone.layers.27.mixer.experts.0.up_proj.weight`).
2. **Fused layer norm loading**: In the full TE spec, layer norms are fused into `TELayerNormColumnParallelLinear` modules as `layer_norm_weight`. The importer's `_name_remapping` would crash trying to load `layer_norm_weight` from a non-existent HF path (e.g., `backbone.layers.X.mixer.in_proj.layer_norm_weight`), when the actual HF norm weight lives at `backbone.layers.X.norm.weight`.

### Changes
**`mcore_nemotron.py`**:
- Fixed the grouped expert prefix from `mtp.layers.{}` to `backbone.layers.{}`. The `_grouped_mlp_merging` function already handles the `backbone` → `mtp` replacement when `is_mtp=True`, so both decoder and MTP layers work correctly.
- Added `mapping={"layer_norm_weight": None}` to the `in_proj` and `linear_fc1` rules to skip `layer_norm_weight` during `_name_remapping` (it is loaded separately via `fused_norm`).
- Added a `fused_norm` rule (`NameRemapping("backbone.layers.{}.norm.weight")`) to load HF norm weights into fused TE modules.

**`megatron_importer.py`**:
- Added a `source_key is None` check in `_name_remapping` to skip keys mapped to `None` in the mapping dict (keeps the existing value instead of crashing on a missing HF key).
- Added fused norm loading in `_import_mamba_layer`: after loading `in_proj`, loads `layer_norm_weight` from HF via the `fused_norm` rule when `layer.norm` is `IdentityOp`.
- Added fused norm loading in `_import_transformer_layer`: loads `layer_norm_weight` into `linear_qkv` (when `input_layernorm` is `IdentityOp`) and into `linear_fc1` (when `pre_mlp_layernorm` is `IdentityOp`).

## Usage
The full TE spec is enabled via the `--full-te-spec` flag on the Megatron-LM side (separate PR). On the ModelOpt side, no user-facing changes are needed -- the import rules automatically handle both local spec and full TE spec models.
```bash
# Convert HF checkpoint to Megatron with full TE spec (megatron-lm side)
unset MLM_MODEL_CKPT && export MLM_MODEL_SAVE=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16_mlm && export HF_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
export PP=2
export MLM_EXTRA_ARGS="--full-te-spec"
bash convert.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

# Quantize the converted checkpoint (megatron-lm side)
export MLM_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16_mlm
export MLM_MODEL_SAVE=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-fp8_mlm
export HF_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
export PP=2 && export TP=4 && export EP=4 && export ETP=1
export MLM_EXTRA_ARGS="--full-te-spec"
bash quantize.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 FP8_DEFAULT_CFG

# Generate
export PP=2 && export TP=4 && export EP=4 && export ETP=1
export MLM_EXTRA_ARGS="--full-te-spec"
export MLM_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-fp8_mlm && ./generate.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

# MMLU
export PP=2 && export TP=4 && export EP=4 && export ETP=1
export MLM_EXTRA_ARGS="--full-te-spec"
export MLM_MODEL_CKPT=/models/NVIDIA-Nemotron-3-Nano-30B-A3B-fp8_mlm && export MLM_EXTRA_ARGS="--fraction 0.05 --disable-tqdm" && ./mmlu.sh nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
```

## Testing
- Tested end-to-end: HF → Megatron conversion → FP8 quantization → inference (generate) → MMLU evaluation with Nemotron-3-Nano-30B-A3B-BF16.
- Verified the resulting model structure matches Megatron-Bridge's TE spec output (TELayerNormColumnParallelLinear, TEGroupedMLP, IdentityOp norms, etc.).
- Verified the quantized model produces coherent text generation outputs.
- Verified backward compatibility: all changes are no-ops for existing local-spec pipelines (guarded by `IdentityOp` checks, `hasattr` checks, and `"fused_norm" in self.rules` checks).

## Before your PR is "*Ready for review*"
- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes -- all changes are guarded by conditions that only activate for full TE spec models. Local spec models follow the exact same code paths as before.
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: No
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: No

## Additional Information
Companion megatron-lm changes (separate PR):
- `megatron/core/post_training/modelopt/mamba/model_specs.py`: Added `use_full_te_spec` parameter to return the canonical `mamba_stack_spec` from `mamba_layer_specs.py`.
- `megatron/post_training/model_builder.py`: Passes `use_full_te_spec=args.full_te_spec` to `get_mamba_stack_modelopt_spec`.
- `megatron/post_training/arguments.py`: Added the `--full-te-spec` CLI flag.
- `examples/post_training/modelopt/convert_model.py`: Skip the `moe_grouped_gemm=False` override when `--full-te-spec` is set.

## Summary by CodeRabbit
* **New Features**
  * Added support for loading fused normalization weights during model import.
* **Bug Fixes**
  * Improved weight mapping logic to correctly skip redundant layer norm weights in specialized model architectures.
* **Refactor**
  * Reorganized expert model parallel configuration paths for better compatibility with mixed parallel processing settings.

Signed-off-by: James Shen <yueshen@nvidia.com>
Signed-off-by: Hung-Yueh <hungyueh.chiang@gmail.com>
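A self-contained sketch of the `source_key is None` skip described in the commit above: keys mapped to `None` keep the destination module's existing value instead of being looked up in the HF state dict. The names and dictionaries are illustrative, not the actual `megatron_importer.py` code.

```python
# Toy HF state dict and mapping rule; layer_norm_weight is skipped here and
# loaded later via a separate fused_norm rule.
hf_state_dict = {"backbone.layers.0.mixer.in_proj.weight": "in_proj_weight_tensor"}
mapping = {"layer_norm_weight": None}


def name_remapping(prefix: str, dest_keys: list[str]) -> dict[str, str]:
    remapped = {}
    for key in dest_keys:
        source_key = mapping.get(key, key)
        if source_key is None:
            continue  # keep the module's existing value; no HF lookup
        remapped[key] = hf_state_dict[prefix + source_key]
    return remapped


print(name_remapping("backbone.layers.0.mixer.in_proj.", ["weight", "layer_norm_weight"]))
# {'weight': 'in_proj_weight_tensor'}
```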
Signed-off-by: Hung-Yueh <hungyueh.chiang@gmail.com>
## What does this PR do?
**Type of change:** new example
**Overview:**
Adding LTX-2 distillation trainer.
## Usage
```bash
accelerate launch \
--config_file configs/accelerate/fsdp.yaml \
--num_processes 8 \
distillation_trainer.py --config configs/distillation_example.yaml
```
See the README for more details.
## Testing
Ran training on single and multiple nodes.
## Before your PR is "*Ready for review*"
- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: NA
- **Did you add or update any necessary documentation?**: Yes
- **Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**:
Yes
## Additional Information
## Summary by CodeRabbit
## New Features
* Added distillation training support for LTX-2 models with quantization
integration.
* Introduced comprehensive documentation and example configurations for
distillation workflows.
* Includes multi-GPU and multi-node training setup with distributed
training support and customizable configuration templates.
---------
Signed-off-by: Meng Xin <mxin@nvidia.com>
Signed-off-by: Hung-Yueh <hungyueh.chiang@gmail.com>
Signed-off-by: hychiang <kenny5312012@gmail.com>
modelopt-bot left a comment
Review Summary
Overall this is a well-structured PR adding Qwen3-VL support. The mapping logic correctly handles the different weight structure (model.language_model. prefix) compared to Qwen3. The tests are comprehensive.
However, I have a few questions and suggestions that warrant discussion before approval.
modelopt-bot left a comment
Review Summary
Overall this is a well-structured PR adding Qwen3-VL support. The mapping logic correctly handles the different weight structure (model.language_model. prefix) compared to Qwen3. Tests are comprehensive.
Requested Changes:
1. Merge conflicts - 22 files have conflicts that need resolution (as flagged by CodeRabbit)
2. Missing newline - mcore_qwen3vl.py is missing a trailing newline
Questions/Clarifications:
3. Copyright year mismatch (2023-2025 vs 2024 in the test file)
4. Router mapping includes MoE but no shared_expert mappings - is this intentional?
5. Consider adding docstrings to the mapping dictionaries
Please address the merge conflicts first, then the minor formatting issue.
## What does this PR do?
**Type of change:** new feature
**Overview:** Add Qwen3-VL (Vision-Language) model support to the Megatron Core export/import
plugin, enabling HuggingFace-to-mcore weight conversion for PTQ/QAT/QAD workflows.
### Details
Qwen3-VL has a different weight structure from Qwen3 text-only models: the decoder weights live
under a `model.language_model.` prefix, while `lm_head.` stays at the model root.
This PR adds:
- Export/import mappings between Qwen3VLForConditionalGeneration and Megatron Core, handling the
  `language_model` prefix for all decoder layers, QKV merging/slicing, gated MLP merging/slicing,
  and Q/K layer norms (see the sketch below).
- Registration of the new mappings in `all_mcore_hf_export_mapping` and `all_mcore_hf_import_mapping`.
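A minimal sketch of the key translation described above, assuming simplified Megatron Core target names; the real mapping also handles QKV merging, gated MLP merging, and Q/K layer norms.

```python
# Illustrative HF -> mcore key translation for Qwen3-VL; target names are assumptions.
def hf_to_mcore_key(hf_key: str) -> str:
    """Translate a Qwen3-VL HF checkpoint key to a Megatron Core-style name."""
    if hf_key.startswith("lm_head."):
        return "output_layer." + hf_key[len("lm_head."):]
    prefix = "model.language_model."
    if hf_key.startswith(prefix):
        return "decoder." + hf_key[len(prefix):]
    return hf_key


print(hf_to_mcore_key("model.language_model.layers.0.self_attn.q_proj.weight"))
print(hf_to_mcore_key("lm_head.weight"))
```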
## Usage
## Testing
Added unit tests in `tests/unit/torch/export/test_mcore_qwen3vl.py` covering (see the sketch after this list):
- the `model.language_model.` prefix, `lm_head.` at the root, QKVMerging, GatedMLPMerging,
  REPLICATE for layer norms, and TP sharding configs
- `parallel_config`
- prefixes
- the `language_model.` prefix, with `lm_head` unchanged
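A hedged illustration of the kind of prefix check these tests perform; the mapping dictionary and its entries are hypothetical stand-ins for the real Qwen3-VL rules.

```python
# Hypothetical stand-in for the Qwen3-VL import rules.
qwen3vl_import_mapping = {
    "decoder.layers.{}.self_attention.linear_qkv": "model.language_model.layers.{}.self_attn",
    "output_layer": "lm_head",
}


def test_decoder_rules_use_language_model_prefix():
    # Every decoder-layer rule should point at the model.language_model. prefix.
    for mcore_name, hf_name in qwen3vl_import_mapping.items():
        if mcore_name.startswith("decoder.layers."):
            assert hf_name.startswith("model.language_model.layers.")


test_decoder_rules_use_language_model_prefix()
```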
## Before your PR is "*Ready for review*"
- New tests: `tests/unit/torch/export/test_mcore_qwen3vl.py`
- Documentation: `docs/source/deployment/3_unified_hf.rst`
- Changelog: `CHANGELOG.rst`
## Additional Information
A companion Megatron-LM PR adds Qwen3VLModel, Qwen3VLDataset, and pretrain_qwenvl.py; see NVIDIA/Megatron-LM#3444.
Summary by CodeRabbit
New Features
Tests