Conversation
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
…ntizer, NVFP4MSECalibrator (#849)

Release notes (auto-generated by coderabbit.ai):
- **New Features**
  - Added NVFP4StaticQuantizer for improved 4-bit quantization with enhanced precision control
  - Introduced NVFP4MSECalibrator with flexible candidate generation for calibration optimization
- **Improvements**
  - Optimized GPU kernels for Hopper+ graphics cards with better performance
  - Extended Triton support to broader GPU compatibility
  - Enhanced backward compatibility for restoring previously quantized models
- **Tests**
  - Added comprehensive test coverage for new quantizers and calibration methods

Signed-off-by: realAsma <akuriparambi@nvidia.com>
…FP4QTensor
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
📝 Walkthrough

This PR replaces GPTQ-Lite with a new GPTQ implementation featuring Hessian-informed blockwise weight updates, adds sequential calibration checkpointing and resume capabilities, introduces activation MSE measurement and perplexity evaluation to the example quantization script, and extends dataset utilities and vLLM serving configurations to support these features.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/Script
    participant SeqCal as Sequential<br/>Calibrator
    participant ActivCol as Activation<br/>Collector
    participant CheckMgr as Checkpoint<br/>Manager
    participant Model as Quantized<br/>Model
    User->>SeqCal: Start sequential calibration<br/>(with checkpoint_dir)
    SeqCal->>CheckMgr: detect_sequential_resume_layer()
    alt Checkpoint exists
        CheckMgr-->>SeqCal: resume_layer_idx > 0
        SeqCal->>ActivCol: prepare_for_resume(resume_layer_idx)
        ActivCol->>Model: Warmup forward pass<br/>(set layer modes, capture output_meta)
    else No checkpoint
        CheckMgr-->>SeqCal: resume_layer_idx = 0
    end
    loop For each layer [resume_idx..total_layers)
        SeqCal->>ActivCol: Set layer capture mode
        SeqCal->>Model: Forward pass (collect activations)
        SeqCal->>SeqCal: Run calibration (gptq/rtn/etc)
        SeqCal->>Model: Update layer weights
        SeqCal->>CheckMgr: should_save_seq_calib_checkpoint()?
        alt Save interval reached
            CheckMgr->>CheckMgr: save_sequential_checkpoint()
            CheckMgr->>Model: Attach progress metadata
            CheckMgr->>CheckMgr: Invoke registered saver
        end
    end
    SeqCal-->>User: Calibration complete
```
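The checkpoint/resume control flow in the diagram can be sketched as a small driver loop. Note that `checkpoint_mgr`, its methods, and the per-layer callables below are illustrative stand-ins for the actual Sequential Calibrator and Checkpoint Manager components, not the real ModelOpt API:

```python
def sequential_calibrate(layers, checkpoint_mgr, save_interval=4):
    """Sketch of the diagram's resume-aware loop (illustrative names).

    `checkpoint_mgr` is a hypothetical object exposing
    `detect_resume_layer()` and `save(idx)`; `layers` is a list of
    per-layer calibration callables. Returns the indices of the layers
    calibrated in this run.
    """
    # Resume point is 0 when no checkpoint exists.
    start = checkpoint_mgr.detect_resume_layer()
    calibrated = []
    for idx in range(start, len(layers)):
        layers[idx]()  # collect activations + run calibration for this layer
        calibrated.append(idx)
        # Persist progress metadata every `save_interval` layers.
        if (idx + 1) % save_interval == 0:
            checkpoint_mgr.save(idx)
    return calibrated
```

A run that resumes from layer 3 of 8 with `save_interval=4` calibrates layers 3..7 and checkpoints after layers 3 and 7.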
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional. ❌ Failed checks (1 error, 1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
modelopt/torch/utils/dataset_utils.py (1)
221-253: ⚠️ Potential issue | 🟠 Major

`skip_samples` is currently ignored for `.jsonl` datasets (and negative values are not validated). The early return on lines 252-253 bypasses the new `skip_samples` behavior, so resume/offset behavior is inconsistent for JSONL inputs.

Proposed fix

```diff
 def get_dataset_samples(
     dataset_name: str,
     num_samples: int,
     @@
     skip_samples: int = 0,
 ) -> list[str]:
     @@
+    if skip_samples < 0:
+        raise ValueError("skip_samples must be >= 0")
+
     # Local JSONL file path support (each line is a JSON object with a `text` field).
     if dataset_name.endswith(".jsonl"):
-        return get_jsonl_text_samples(dataset_name, num_samples, key="text")
+        return get_jsonl_text_samples(
+            dataset_name, num_samples + skip_samples, key="text"
+        )[skip_samples:]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/utils/dataset_utils.py` around lines 221-253: The .jsonl early-return ignores skip_samples and lacks validation; update the branch where dataset_name.endswith(".jsonl") to call get_jsonl_text_samples with skip_samples applied (i.e., pass the skip/offset so it skips the requested number of records) and validate skip_samples is non-negative before use (raise or clamp on negative). Modify the logic in the function that contains the dataset_name.endswith(".jsonl") check (and any helper call to get_jsonl_text_samples) so JSONL inputs honor the same skip_samples semantics as other dataset types.

tests/gpu/torch/quantization/test_gptq.py (1)
237-259: ⚠️ Potential issue | 🟠 Major

This turns the GPU e2e test into a very heavy job. `use_sequential=True` makes GPTQ replay full-model forwards per decoder layer, and this test still calibrates TinyLlama with 512 samples while being parametrized over three configs. That is a large jump in runtime and flakiness for a `tests/gpu/` check. Please shrink this to a smoke-sized calibration set/model or move the full sequential path behind a slower opt-in test.

As per coding guidelines, "GPU-based unit tests for core ModelOpt library should be placed in tests/gpu/ and run in a few seconds".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu/torch/quantization/test_gptq.py` around lines 237 - 259, The test config currently sets quant_cfg["algorithm"] = {"method": "gptq", "use_sequential": True} and builds a calibration dataloader with get_dataset_dataloader(num_samples=512), which makes the GPU e2e very heavy; change the test to a smoke-sized run by setting use_sequential to False or reducing the calibration dataloader sample size (e.g., num_samples -> 8 or another small value) and/or using a tiny model fixture so the create_forward_loop(dataloader=calib_dataloader) + mtq.quantize(model, quant_cfg, forward_loop=calibrate_loop) executes quickly; alternatively move the full sequential path behind a marked slow/opt-in test. Ensure you update the references to quant_cfg["algorithm"], get_dataset_dataloader(..., num_samples=...), and the mtq.quantize(...) invocation accordingly.
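The skip-then-take pattern proposed for the `dataset_utils.py` comment above can be sanity-checked in isolation. Here, `take_with_skip` and the in-memory `records` list are illustrative stand-ins for `get_jsonl_text_samples` and the JSONL file contents:

```python
def take_with_skip(records, num_samples, skip_samples=0):
    """Skip-then-take pattern from the proposed fix: fetch
    `skip_samples + num_samples` records, then drop the prefix.
    `records` stands in for the JSONL lines read from disk."""
    if skip_samples < 0:
        raise ValueError("skip_samples must be >= 0")
    # Slicing past the end is safe in Python: short files simply
    # yield fewer samples, mirroring the dataset helpers' behavior.
    return records[: num_samples + skip_samples][skip_samples:]
```

For example, with ten records, `num_samples=3`, and `skip_samples=2`, this returns records 2, 3, and 4.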
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 726-736: The vLLM fakequant export branch omits the MTP
extra-state, causing incomplete checkpoints for models using MTP; update the
branch where args.vllm_fakequant_export triggers
export_hf_vllm_fq_checkpoint(full_model, export_dir=export_path) to also pass
the mtp_state_dict returned by load_mtp_weights (same as the non-fakequant
branch passes extra_state_dict=mtp_state_dict to export_hf_checkpoint), i.e.,
modify the call to include the extra_state_dict/mtp_state_dict so
export_hf_vllm_fq_checkpoint receives and writes the MTP weights alongside
full_model and export_path.
- Around line 1088-1102: The plugin-config export branch that uses
full_model.save_pretrained() / export_hf_vllm_fq_checkpoint (when args.qformat
not in QUANT_CFG_CHOICES and hasattr(mtq, args.qformat)) skips the post-export
cleanup done by export_quantized(): restore the tokenizer's original
padding/token settings and copy any custom model files/configs required for
trust_remote_code models. Fix by invoking the same post-export cleanup steps
used by export_quantized() after saving (or by calling a shared helper): restore
the tokenizer's original padding/pad_token/padding_side state and then copy over
any custom/model config files into export_path so the artifact matches the
standard export path; ensure this runs for both the export_hf_vllm_fq_checkpoint
and full_model.save_pretrained branches before tokenizer.save_pretrained.
- Around line 1075-1086: The MSE logging is run on the unfused fake-quant model
because folding is only applied later for perplexity; when args.fold_weights is
set and activation MSE measurement is requested, call
mtq.fold_weight(language_model) before invoking
mse_logger.finish(language_model, mse_data) (i.e., apply folding to the same
language_model used for mse collection), then proceed to delete
mse_logger/mse_data and empty the CUDA cache; keep the existing
folding+perplexity block but ensure folding happens earlier when
args.measure_activation_mse (or equivalent flag) is true.
In `@modelopt/torch/quantization/config.py`:
- Around line 1324-1335: The ModeloptField for block_size currently allows zero
and negative values; add validation by setting gt=0 on the block_size
ModeloptField so config schema rejects non-positive sizes, and where group_size
is available (e.g., in the GPTQ config validation path or an initializer that
uses block_size and group_size) add a check that block_size % group_size == 0
and raise a clear validation error; update references around the block_size
ModeloptField and any GPTQ config validator that uses group_size to enforce this
multiple constraint.
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 1983-1994: The code currently calls GPTQHelper.setup() for each
handle then runs forward_loop(model) but only calls GPTQHelper.cleanup()
afterwards, so if forward_loop raises the patched GPTQ forwards remain; wrap the
forward_loop call in a try/finally: after creating gptq_handles and calling
handle.setup() on each, enter the disabled_weight_quantizers(model) context and
call forward_loop(model) inside a try block, and in the finally iterate over
gptq_handles.values() and call handle.cleanup() to ensure GPTQHelper.cleanup()
always runs even on exceptions (referencing gptq_handles, GPTQHelper, setup,
cleanup, forward_loop, and disabled_weight_quantizers).
- Around line 1535-1545: After flattening to input_flat and computing
batch_size, guard the Hessian update against empty batches: if batch_size == 0,
skip the incremental averaging and outer-product steps entirely (do not modify
n_samples or hessian). Specifically, before the lines that scale down hessian
(hessian *= n_samples / (n_samples + batch_size)) and compute scaled_input and
hessian.add_((scaled_input @ scaled_input.t())...), add a check on batch_size
and return/continue from the surrounding function/block so no division or sqrt
by zero occurs and hessian remains unchanged.
In `@modelopt/torch/quantization/utils/activation_collector.py`:
- Around line 369-410: sequential_calibrate currently never checks or uses
resume helpers so resume state (_seq_calib_progress) and layer modes aren't
rebuilt; update sequential_calibrate to call detect_sequential_resume_layer() at
start, and if it returns a resume_layer_idx call
input_getter.prepare_for_resume(resume_layer_idx, forward_loop) before entering
the main layer loop, then set the loop's start index to resume_layer_idx
(instead of 0) so subsequent iterations continue from the resumed layer; ensure
you reference detect_sequential_resume_layer, input_getter.prepare_for_resume,
forward_loop, resume_layer_idx and the existing _seq_calib_progress handling
when adding this logic.
In `@modelopt/torch/quantization/utils/core_utils.py`:
- Around line 713-726: The helper disabled_weight_quantizers only toggles
module.weight_quantizer and misses non-standard weight attributes; update
disabled_weight_quantizers to iterate weight names from
weight_attr_names(module) for each module (in addition to checking
is_quantized_linear(module)), retrieve each attr (e.g., getattr(module,
attr_name, None)), if it has an is_enabled flag call its disable() and record
(module, attr_name) so the finally block can re-enable by calling enable() on
the same attribute's quantizer; ensure you still check existence and is_enabled
before disabling and re-check before enabling to avoid AttributeError.
In `@tests/gpu/torch/quantization/test_gptq_vq.py`:
- Around line 31-37: The _configs_available() helper currently only checks for
RTN_CFG_NAME and causes a hard failure if GPTQ_CFG_NAME is missing; update
_configs_available() to import modelopt.torch.quantization and return True only
when both getattr(mtq, RTN_CFG_NAME, None) and getattr(mtq, GPTQ_CFG_NAME, None)
are not None so the test is skipped unless both presets (RTN_CFG_NAME and
GPTQ_CFG_NAME) are available; reference the existing function name
_configs_available and the symbols RTN_CFG_NAME and GPTQ_CFG_NAME when making
the change.
---
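The `disabled_weight_quantizers` suggestion for `core_utils.py` above can be sketched as a context manager. The quantizer interface (`is_enabled`, `disable()`, `enable()`) follows the review comment; iterating extra attribute names beyond `weight_quantizer` is the hypothetical generalization the comment asks for:

```python
from contextlib import contextmanager


@contextmanager
def disabled_quantizers(modules, attr_names=("weight_quantizer",)):
    """Sketch of the suggested fix: disable every named quantizer
    attribute and guarantee re-enabling even if the body raises.

    Quantizer objects are assumed to expose `is_enabled`, `disable()`,
    and `enable()`; the module list and attribute names are
    illustrative, not the real ModelOpt helper signature.
    """
    disabled = []
    try:
        for module in modules:
            for name in attr_names:
                quantizer = getattr(module, name, None)
                # Check existence and state before disabling to avoid
                # AttributeError and double-disables.
                if quantizer is not None and getattr(quantizer, "is_enabled", False):
                    quantizer.disable()
                    disabled.append(quantizer)
        yield
    finally:
        # Re-enable exactly the quantizers we disabled, even on error.
        for quantizer in disabled:
            quantizer.enable()
```

The `finally` block is what keeps the model consistent when the calibration forward pass raises mid-run.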
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1ca32742-d698-44d0-a52b-271964afcc02
📒 Files selected for processing (13)
- examples/llm_ptq/hf_ptq.py
- examples/vllm_serve/fakequant_worker.py
- examples/vllm_serve/vllm_serve_fakequant.py
- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/mode.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/model_quant.py
- modelopt/torch/quantization/utils/activation_collector.py
- modelopt/torch/quantization/utils/checkpoint.py
- modelopt/torch/quantization/utils/core_utils.py
- modelopt/torch/utils/dataset_utils.py
- tests/gpu/torch/quantization/test_gptq.py
- tests/gpu/torch/quantization/test_gptq_vq.py
💤 Files with no reviewable changes (1)
- modelopt/torch/quantization/model_quant.py
```python
if args.vllm_fakequant_export:
    export_hf_vllm_fq_checkpoint(
        full_model,
        export_dir=export_path,
    )
else:
    export_hf_checkpoint(
        full_model,
        export_dir=export_path,
        extra_state_dict=mtp_state_dict,
    )
```
The vLLM fakequant export path drops the extra MTP weights.
load_mtp_weights() returns mtp_state_dict specifically for weights that are not in the normal model state dict, and the regular HF export passes that through extra_state_dict. This new branch ignores it, so models with MTP layers can export incomplete checkpoints.
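To illustrate what `extra_state_dict` buys: the export must union the model's own tensors with the out-of-band MTP weights. The helper below is an illustrative sketch of that merge, not the real export API:

```python
def build_export_state_dict(model_state, extra_state_dict=None):
    """Sketch of why `extra_state_dict` matters: MTP weights live
    outside the model's own state dict, so the exported checkpoint
    must be the union of the two. Names here are illustrative."""
    merged = dict(model_state)
    if extra_state_dict:
        # A key collision would silently overwrite a model tensor,
        # so fail loudly instead.
        overlap = merged.keys() & extra_state_dict.keys()
        if overlap:
            raise ValueError(f"extra state collides with model keys: {sorted(overlap)}")
        merged.update(extra_state_dict)
    return merged
```

Dropping the second argument, as the fakequant branch currently does, is exactly what produces an incomplete checkpoint.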
examples/llm_ptq/hf_ptq.py
Outdated
```python
if mse_logger is not None:
    mse_logger.finish(language_model, mse_data)
    del mse_logger, mse_data
    torch.cuda.empty_cache()

if args.eval_perplexity and tokenizer is not None:
    if args.fold_weights:
        print("Folding weights before perplexity evaluation...")
        mtq.fold_weight(language_model)
    eval_data = get_wikitext2(tokenizer, args.eval_perplexity_seq_len)
    ppl = compute_perplexity(full_model, eval_data)
    print(f"Wikitext-2 perplexity: {ppl:.2f}")
```
--fold_weights is wired to perplexity, not activation-MSE collection.
The flag/help text says folding should happen before collecting the quantized activations, but mse_logger.finish() still runs on the unfused fake-quant model. With --measure_activation_mse --fold_weights, the flag currently has no effect on the MSE pass.
Proposed fix

```diff
 if mse_logger is not None:
+    if args.fold_weights:
+        mtq.fold_weight(language_model)
     mse_logger.finish(language_model, mse_data)
     del mse_logger, mse_data
     torch.cuda.empty_cache()
 if args.eval_perplexity and tokenizer is not None:
-    if args.fold_weights:
+    if args.fold_weights:
         print("Folding weights before perplexity evaluation...")
-        mtq.fold_weight(language_model)
     eval_data = get_wikitext2(tokenizer, args.eval_perplexity_seq_len)
```
```python
# Plugin-registered configs (e.g. PSX LUTS from modelopt-internal) are not exportable
# via the standard TRT-LLM / HF export paths. Fall back to save_pretrained().
if args.qformat not in QUANT_CFG_CHOICES and hasattr(mtq, args.qformat):
    export_path = args.export_path
    if args.vllm_fakequant_export:
        print(f"Exporting vLLM fakequant checkpoint (bf16 weights + amax) to: {export_path}")
        export_hf_vllm_fq_checkpoint(full_model, export_dir=export_path)
    else:
        print(
            f"qformat '{args.qformat}' is a plugin-registered config and is not exportable "
            f"via the standard export pipeline. Saving with save_pretrained() instead."
        )
        full_model.save_pretrained(export_path)
    if tokenizer is not None:
        tokenizer.save_pretrained(export_path)
```
This plugin-config export bypass skips the normal artifact cleanup.
By bypassing export_quantized(), this branch never restores the tokenizer's original padding settings and never copies custom model files/configs. That makes the saved artifact materially different from the standard export path, especially for trust_remote_code models.
```diff
 percdamp: float = ModeloptField(
     default=0.01,
     gt=0.0,
     le=1.0,
     title="Percentage damping factor.",
     description="The percentage of average Hessian diagonal used for damping.",
 )
-block_size: int | None = ModeloptField(
+block_size: int = ModeloptField(
     default=128,
     title="Block size for GPTQ weight update.",
     description="""The block size for GPTQ weight update, which must be a multiple of the
     group_size used in the quantization.""",
```
Validate block_size at the config boundary.
block_size now accepts 0 and negative values, so an invalid GPTQ config can get through schema validation and only fail later inside the blockwise update path. Please add at least gt=0 here, and ideally enforce the documented group-size multiple constraint wherever that information is available.
Proposed fix

```diff
 block_size: int = ModeloptField(
     default=128,
+    gt=0,
     title="Block size for GPTQ weight update.",
     description="""The block size for GPTQ weight update, which must be a multiple of the
     group_size used in the quantization.""",
 )
```
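Beyond `gt=0`, the documented group-size multiple constraint can be enforced wherever both values are known. A plain-function sketch of that check (the real config uses `ModeloptField`/pydantic validation, so the function name and shape here are illustrative):

```python
def validate_block_size(block_size, group_size=None):
    """Illustrative validator for the two constraints named in the
    review: block_size must be positive, and (when group_size is
    known) a multiple of group_size."""
    if block_size <= 0:
        raise ValueError(f"block_size must be > 0, got {block_size}")
    if group_size is not None and block_size % group_size != 0:
        raise ValueError(
            f"block_size ({block_size}) must be a multiple of group_size ({group_size})"
        )
    return block_size
```

Failing at config validation time gives a clear error instead of a late failure inside the blockwise update path.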
)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/config.py` around lines 1324 - 1335, The
ModeloptField for block_size currently allows zero and negative values; add
validation by setting gt=0 on the block_size ModeloptField so config schema
rejects non-positive sizes, and where group_size is available (e.g., in the GPTQ
config validation path or an initializer that uses block_size and group_size)
add a check that block_size % group_size == 0 and raise a clear validation
error; update references around the block_size ModeloptField and any GPTQ config
validator that uses group_size to enforce this multiple constraint.
```python
# Flatten to 2D (total_tokens, features) first, so batch_size counts tokens
input_flat = input.reshape(-1, input.shape[-1]).t().float()
batch_size = input_flat.shape[1]

# Incremental averaging: scale down old hessian
hessian *= n_samples / (n_samples + batch_size)
n_samples += batch_size

# Compute outer product: H += (2/n_samples) * X @ X^T
# where X is the flattened input reshaped to (features, batch*seq)
input_flat = input.reshape(-1, input.shape[-1]).t().float()
scaled_input = math.sqrt(2 / n_samples) * input_flat
hessian.add_((scaled_input @ scaled_input.t()).to(hessian.device))
```
Ignore empty activation batches before updating the Hessian.
After the reshape, batch_size can be 0 for empty expert/token batches. The divisions on Lines 1540 and 1544 then introduce NaNs/inf into hessian, and every later GPTQ update for that module is corrupted.
Proposed fix

```diff
 # Flatten to 2D (total_tokens, features) first, so batch_size counts tokens
 input_flat = input.reshape(-1, input.shape[-1]).t().float()
 batch_size = input_flat.shape[1]
+if batch_size == 0:
+    return hessian, n_samples
 # Incremental averaging: scale down old hessian
 hessian *= n_samples / (n_samples + batch_size)
```
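To see why the guard is sufficient, here is a scalar-feature version of the incremental update (illustrative, not the library code). It maintains H = (2/n) * sum(x_i^2) exactly, and an empty batch leaves the state untouched:

```python
def update_hessian(hessian, n_samples, batch):
    """Scalar-feature sketch of the incremental Hessian averaging.

    Mirrors the guarded update from the review: scale down the old
    estimate by n/(n+b), then add the new outer product scaled by 2/n.
    An empty batch returns (hessian, n_samples) unchanged, so no
    division by zero ever occurs.
    """
    batch_size = len(batch)
    if batch_size == 0:
        return hessian, n_samples
    hessian *= n_samples / (n_samples + batch_size)
    n_samples += batch_size
    # Scalar analogue of H += (2/n) * X @ X^T.
    hessian += (2 / n_samples) * sum(x * x for x in batch)
    return hessian, n_samples
```

After any sequence of batches, the result equals the direct formula (2/n) * sum of squares over all tokens seen so far, which is what makes the incremental form safe to checkpoint and resume.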
```python
gptq_handles = {name: GPTQHelper(m, name, offload_to_cpu=True) for name, m in quantized_layers}
for handle in gptq_handles.values():
    handle.setup()

print_rank_0(f"Computing Hessians for {len(gptq_handles)} linear layers...")

with disabled_weight_quantizers(model):
    forward_loop(model)

for handle in gptq_handles.values():
    handle.cleanup()
```
Always unpatch GPTQ forwards in a finally block.
If forward_loop(model) raises here, every patched linear keeps the Hessian wrapper installed and the model is left in a partially mutated state for the rest of the process.
Proposed fix

```diff
 gptq_handles = {name: GPTQHelper(m, name, offload_to_cpu=True) for name, m in quantized_layers}
-for handle in gptq_handles.values():
-    handle.setup()
-
-print_rank_0(f"Computing Hessians for {len(gptq_handles)} linear layers...")
-
-with disabled_weight_quantizers(model):
-    forward_loop(model)
-
-for handle in gptq_handles.values():
-    handle.cleanup()
+try:
+    for handle in gptq_handles.values():
+        handle.setup()
+
+    print_rank_0(f"Computing Hessians for {len(gptq_handles)} linear layers...")
+    with disabled_weight_quantizers(model):
+        forward_loop(model)
+finally:
+    for handle in gptq_handles.values():
+        handle.cleanup()
```
+ handle.cleanup()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/model_calib.py` around lines 1983 - 1994, The
code currently calls GPTQHelper.setup() for each handle then runs
forward_loop(model) but only calls GPTQHelper.cleanup() afterwards, so if
forward_loop raises the patched GPTQ forwards remain; wrap the forward_loop call
in a try/finally: after creating gptq_handles and calling handle.setup() on
each, enter the disabled_weight_quantizers(model) context and call
forward_loop(model) inside a try block, and in the finally iterate over
gptq_handles.values() and call handle.cleanup() to ensure GPTQHelper.cleanup()
always runs even on exceptions (referencing gptq_handles, GPTQHelper, setup,
cleanup, forward_loop, and disabled_weight_quantizers).
```python
@torch.no_grad()
def prepare_for_resume(
    self,
    resume_layer_idx: int,
    forward_loop: ForwardLoop,
):
    """Set up layer states for resuming sequential calibration from a checkpoint.

    Runs a single warm-up forward pass so that the next call to
    :meth:`get_input_activations` for ``resume_layer_idx`` produces the
    correct inputs. Layers ``0 .. K-2`` run in *original* mode (with
    ``capture_output_meta`` enabled so skip-mode metadata is populated),
    layer ``K-1`` in *capture* mode. After the pass, ``0 .. K-2`` switch
    to *skip* and ``K-1`` retains its ``collected_inputs`` for the
    subsequent *run* transition.
    """
    if not self._patched:
        raise RuntimeError(
            "prepare_for_resume() requires _patch_all_layers() to be called first."
        )
    if resume_layer_idx == 0:
        return

    k = resume_layer_idx
    preceding = range(k - 1)

    assert self._decoder_layers is not None
    for i in preceding:
        self._set_layer_mode(i, "original")
        self._decoder_layers[i]._seq_calib.capture_output_meta = True

    print_rank_0(
        f"Running warm-up forward pass for resume "
        f"(layers 0..{k - 2} original, layer {k - 1} capture)"
    )
    self._run_warmup_capture(k - 1, forward_loop)

    for i in preceding:
        self._decoder_layers[i]._seq_calib.capture_output_meta = False
        self._set_layer_mode(i, "skip")

    print_rank_0(f"Warm-up complete. Ready to resume from layer {k}.")
```
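The mode transitions the docstring describes can be sketched as a small bookkeeping function (warmup_modes is purely illustrative; the real class tracks modes per layer via _set_layer_mode):

```python
def warmup_modes(num_layers, k):
    """Illustrative sketch of layer modes around the warm-up pass for resume layer k."""
    # During the warm-up forward pass: layers 0..k-2 run in "original" mode
    # (with capture_output_meta set), layer k-1 runs in "capture" mode.
    during = ["original"] * (k - 1) + ["capture"] + [None] * (num_layers - k)
    # After the pass: 0..k-2 switch to "skip"; k-1 keeps its collected inputs.
    after = ["skip"] * (k - 1) + ["capture"] + [None] * (num_layers - k)
    return during, after


during, after = warmup_modes(num_layers=6, k=3)
assert during == ["original", "original", "capture", None, None, None]
assert after == ["skip", "skip", "capture", None, None, None]
```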
🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
echo "== Resume helper call sites =="
rg -nP --type=py '\b(detect_sequential_resume_layer|prepare_for_resume)\s*\('
echo
echo "== sequential_calibrate implementation =="
sed -n '1545,1615p' modelopt/torch/quantization/model_calib.py
```

Repository: NVIDIA/Model-Optimizer

🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
echo "== Full sequential_calibrate function (lines 1545-1700) =="
wc -l modelopt/torch/quantization/model_calib.py
sed -n '1545,1700p' modelopt/torch/quantization/model_calib.py | head -200
echo
echo "== Search for any code calling sequential_calibrate =="
rg -nP --type=py -B2 -A2 'sequential_calibrate\s*\(' | head -50
echo
echo "== Search for imports or references to checkpoint resume in model_calib =="
rg -nP --type=py '(detect_sequential_resume_layer|prepare_for_resume|checkpoint|resume)' modelopt/torch/quantization/model_calib.py | head -20
```

🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
echo "== Check if any test explicitly tests resume flow =="
rg -nP --type=py '(detect_sequential_resume_layer|prepare_for_resume)' tests/
echo
echo "== Check the full activation_collector.py context around prepare_for_resume =="
rg -nP --type=py -B5 -A25 'def prepare_for_resume' modelopt/torch/quantization/utils/activation_collector.py
echo
echo "== Check if there's any documentation about resume flow =="
find . -name "*.md" -o -name "*.rst" | xargs grep -l "sequential.*resume\|resume.*sequential" 2>/dev/null | head -5
```

🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
echo "== Check activation_collector.py context around prepare_for_resume =="
sed -n '360,420p' modelopt/torch/quantization/utils/activation_collector.py
echo
echo "== Check if prepare_for_resume or detect_sequential_resume_layer are imported anywhere =="
rg -nP --type=py 'from.*activation_collector.*import|from.*checkpoint.*import' | grep -E '(prepare_for_resume|detect_sequential_resume_layer)' | head -20
echo
echo "== Check all imports of checkpoint module =="
rg -nP --type=py 'import.*checkpoint|from.*checkpoint' modelopt/torch/quantization/
echo
echo "== List all public/exported functions in checkpoint.py =="
sed -n '1,100p' modelopt/torch/quantization/utils/checkpoint.py
```

🏁 Script executed:

```shell
#!/bin/bash
set -euo pipefail
echo "== Check all calls to prepare_for_resume method (not just definition) =="
rg -nP --type=py 'prepare_for_resume\s*\(' --no-heading
echo
echo "== Check all calls to detect_sequential_resume_layer =="
rg -nP --type=py 'detect_sequential_resume_layer\s*\(' --no-heading
echo
echo "== Check if LayerActivationCollector is used outside of model_calib.py =="
rg -nP --type=py 'LayerActivationCollector' modelopt/torch/quantization/ | grep -v 'test_' | head -20
echo
echo "== Search for checkpoint module usage in sequential_calibrate context =="
rg -nP --type=py -B10 -A10 'checkpoint' modelopt/torch/quantization/model_calib.py | head -40
```
prepare_for_resume() is not integrated into sequential calibration.
The new resume helpers (prepare_for_resume() and detect_sequential_resume_layer()) are defined but never called anywhere in the codebase. The main sequential_calibrate() function iterates through all layers from start to end without detecting, loading, or reconstructing resume state. No code path—tests included—exercises resume functionality. This means resuming from a checkpoint will not rebuild the skip/run/capture layer states needed to continue from layer K.
To complete resume support, integrate these helpers into `sequential_calibrate()`:

- Call `detect_sequential_resume_layer()` to check for a saved `_seq_calib_progress` checkpoint.
- If resuming, call `input_getter.prepare_for_resume(resume_layer_idx, forward_loop)` before the main layer loop.
- Adjust the loop to start from `resume_layer_idx` instead of 0.
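The integration above can be sketched with stand-in callables (detect_resume_layer and prepare_for_resume mirror the helpers named in the review but are hypothetical here; the real signatures may differ):

```python
def sequential_calibrate(layers, detect_resume_layer, prepare_for_resume, forward_loop):
    """Sketch of a layer loop that honors a saved resume point."""
    processed = []
    # Check for a saved checkpoint; None/0 means start from scratch.
    resume_layer_idx = detect_resume_layer() or 0
    if resume_layer_idx > 0:
        # Rebuild skip/capture layer states with a warm-up pass.
        prepare_for_resume(resume_layer_idx, forward_loop)
    # Start from the resume point instead of 0.
    for idx in range(resume_layer_idx, len(layers)):
        processed.append(idx)  # stand-in for per-layer calibration
    return processed


layers = ["l0", "l1", "l2", "l3"]
done = sequential_calibrate(layers, lambda: 2, lambda k, fl: None, None)
assert done == [2, 3]  # layers 0 and 1 are skipped on resume
```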
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/utils/activation_collector.py` around lines 369 -
410, sequential_calibrate currently never checks or uses resume helpers so
resume state (_seq_calib_progress) and layer modes aren't rebuilt; update
sequential_calibrate to call detect_sequential_resume_layer() at start, and if
it returns a resume_layer_idx call
input_getter.prepare_for_resume(resume_layer_idx, forward_loop) before entering
the main layer loop, then set the loop's start index to resume_layer_idx
(instead of 0) so subsequent iterations continue from the resumed layer; ensure
you reference detect_sequential_resume_layer, input_getter.prepare_for_resume,
forward_loop, resume_layer_idx and the existing _seq_calib_progress handling
when adding this logic.
```python
@contextmanager
def disabled_weight_quantizers(model: nn.Module):
    """Disable weight quantizers during hessian collection."""
    disabled_modules = []
    for module in model.modules():
        if is_quantized_linear(module) and module.weight_quantizer.is_enabled:
            module.weight_quantizer.disable()
            disabled_modules.append(module)
    try:
        yield
    finally:
        for module in disabled_modules:
            module.weight_quantizer.enable()
```
Disable every weight quantizer, not just module.weight_quantizer.
This helper ignores modules that use non-standard weight attributes like gate_up_proj / down_proj, even though weight_attr_names() in the same file explicitly supports them. In GPTQ that means Hessian collection can still run with some weight quantizers enabled, which skews the update for those supported module types.
Proposed fix

```diff
 @contextmanager
 def disabled_weight_quantizers(model: nn.Module):
     """Disable weight quantizers during hessian collection."""
-    disabled_modules = []
+    disabled_quantizers = []
     for module in model.modules():
-        if is_quantized_linear(module) and module.weight_quantizer.is_enabled:
-            module.weight_quantizer.disable()
-            disabled_modules.append(module)
+        for weight_name in weight_attr_names(module):
+            quantizer_name = quantizer_attr_names(weight_name).weight_quantizer
+            quantizer = getattr(module, quantizer_name, None)
+            if quantizer is not None and quantizer.is_enabled:
+                quantizer.disable()
+                disabled_quantizers.append(quantizer)
     try:
         yield
     finally:
-        for module in disabled_modules:
-            module.weight_quantizer.enable()
+        for quantizer in disabled_quantizers:
+            quantizer.enable()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/utils/core_utils.py` around lines 713 - 726, The
helper disabled_weight_quantizers only toggles module.weight_quantizer and
misses non-standard weight attributes; update disabled_weight_quantizers to
iterate weight names from weight_attr_names(module) for each module (in addition
to checking is_quantized_linear(module)), retrieve each attr (e.g.,
getattr(module, attr_name, None)), if it has an is_enabled flag call its
disable() and record (module, attr_name) so the finally block can re-enable by
calling enable() on the same attribute's quantizer; ensure you still check
existence and is_enabled before disabling and re-check before enabling to avoid
AttributeError.
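The disable/re-enable pattern in the proposed fix can be demonstrated in isolation. FakeQuantizer, FakeModule, and the attribute naming below are illustrative stand-ins, not modelopt's real classes or helpers:

```python
from contextlib import contextmanager


class FakeQuantizer:
    """Stand-in for a weight quantizer with enable/disable toggles."""

    def __init__(self):
        self.is_enabled = True

    def disable(self):
        self.is_enabled = False

    def enable(self):
        self.is_enabled = True


class FakeModule:
    """Stand-in module holding one quantizer per weight attribute."""

    def __init__(self, weight_names):
        self._weight_names = weight_names  # mirrors weight_attr_names(module)
        for name in weight_names:
            setattr(self, f"{name}_weight_quantizer", FakeQuantizer())


@contextmanager
def disabled_weight_quantizers(modules):
    disabled = []
    for module in modules:
        for name in module._weight_names:
            quantizer = getattr(module, f"{name}_weight_quantizer", None)
            if quantizer is not None and quantizer.is_enabled:
                quantizer.disable()
                disabled.append(quantizer)
    try:
        yield
    finally:
        # re-enable exactly the quantizers we disabled, even on exceptions
        for quantizer in disabled:
            quantizer.enable()


mod = FakeModule(["gate_up_proj", "down_proj"])
with disabled_weight_quantizers([mod]):
    # both non-standard weight quantizers are off inside the context
    assert not mod.gate_up_proj_weight_quantizer.is_enabled
    assert not mod.down_proj_weight_quantizer.is_enabled
assert mod.down_proj_weight_quantizer.is_enabled  # restored on exit
```

Recording the quantizers themselves (rather than the modules) is what lets the finally block restore only what was actually disabled.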
```python
def _configs_available():
    try:
        import modelopt.torch.quantization as mtq

        return getattr(mtq, RTN_CFG_NAME, None) is not None
    except Exception:
        return False
```
Skip the test only when both presets are unavailable.
_configs_available() only checks RTN_CFG_NAME, but Line 57 also requires GPTQ_CFG_NAME. If RTN is registered and GPTQ is not, this test hard-fails instead of being skipped.
Proposed fix

```diff
 def _configs_available():
     try:
         import modelopt.torch.quantization as mtq
-        return getattr(mtq, RTN_CFG_NAME, None) is not None
+        return (
+            getattr(mtq, RTN_CFG_NAME, None) is not None
+            and getattr(mtq, GPTQ_CFG_NAME, None) is not None
+        )
     except Exception:
         return False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/gpu/torch/quantization/test_gptq_vq.py` around lines 31 - 37, The
_configs_available() helper currently only checks for RTN_CFG_NAME and causes a
hard failure if GPTQ_CFG_NAME is missing; update _configs_available() to import
modelopt.torch.quantization and return True only when both getattr(mtq,
RTN_CFG_NAME, None) and getattr(mtq, GPTQ_CFG_NAME, None) are not None so the
test is skipped unless both presets (RTN_CFG_NAME and GPTQ_CFG_NAME) are
available; reference the existing function name _configs_available and the
symbols RTN_CFG_NAME and GPTQ_CFG_NAME when making the change.
What does this PR do?

Type of change: ?

Usage

```python
# Add a code snippet demonstrating how to use this
```

Testing

Before your PR is "Ready for review"

- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Release Notes

New Features

Refactor