Refine layerwise non-mutating calibration by realAsma · Pull Request #1592 · NVIDIA/Model-Optimizer

realAsma · 2026-06-01T21:45:09Z

Stacked on #1571.

Summary:

Add calib_mutates_weights gating for non-mutating layerwise calibration.
Skip layer weight checkpoint/writeback for quantizer-state-only calibration.
Keep FSDP2/Accelerate writeback conditional and improve layerwise progress reporting.
Make _zeros_from_meta allocate skip placeholders on torch.device("meta") so skip-mode preserves structure without real-device tensor allocation.

Testing:

Pre-commit hooks passed during commit.

Signed-off-by: realAsma <akuriparambi@nvidia.com>

copy-pr-bot · 2026-06-01T21:45:12Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-06-01T21:45:15Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2bc49397-9cd8-4c10-adee-e640efc2f68c

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

realAsma · 2026-06-01T21:48:57Z

+    if not writeback:
+        with _fsdp2_unshard_context(fsdp_module):
+            yield
+        return
+


@sugunav14 here is an easy perf improvement for layerwise FSDP2

@realAsma Claude claims a correctness issue on this branch, it makes sense to me, please check from your side:

Bug: FSDP2 writeback=False path never all-gathers weights (non-mutating layerwise calibration computes on sharded shards)

Where: _fsdp2_unshard_context (core_utils.py:480) reached via persistent_materialization(layer, writeback=False).

Root cause — a collision between two layers of the same context stack.

persistent_materialization enters its context managers in this order:

with (
    _disable_fsdp_unshard_reshard(layer),  # ① enters FIRST
    enable_weight_access_and_writeback(layer, layer, writeback=False),  # ② enters SECOND
    temporarily_remove_accelerate_hook(layer),
):

① monkeypatches the class method FSDPParamGroup.unshard to a no-op (to stop FSDP from re-sharding weights between
calibration batches). This patch is global and is now active.

② with writeback=False routes into _fsdp2_unshard_context, whose gather step is:

if was_sharded:
    fsdp_module.unshard()  # core_utils.py:480

But FSDPModule.unshard() delegates the actual all-gather to fsdp_param_group.unshard() — which is exactly the method ①
just patched to a no-op. So no all-gather happens; the params stay sharded DTensors, and the calibration/capture forward
runs on partial shards → wrong _amax (or a shape/dtype error).

The layer's own FSDP pre-forward hook can't save it either: that hook also calls FSDPParamGroup.unshard, which is still the
no-op.
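The collision described above can be reproduced in miniature without FSDP. The sketch below is a toy model only: `ToyParamGroup`, `ToyModule`, `disable_unshard`, and `unshard_context` are hypothetical stand-ins invented for illustration, not the Model-Optimizer or PyTorch implementations. It assumes only what the report states: the outer context patches a class method to a no-op, and the inner context relies on that same method to gather.

```python
from contextlib import contextmanager


class ToyParamGroup:
    """Hypothetical stand-in for FSDPParamGroup; tracks whether weights are gathered."""

    def __init__(self):
        self.gathered = False

    def unshard(self):
        self.gathered = True  # stands in for the real all-gather


class ToyModule:
    """Hypothetical stand-in for FSDPModule; unshard() delegates to the param group."""

    def __init__(self):
        self.group = ToyParamGroup()

    def unshard(self):
        self.group.unshard()


@contextmanager
def disable_unshard():
    """Patch the class method to a no-op, like step 1 in the report."""
    original = ToyParamGroup.unshard
    ToyParamGroup.unshard = lambda self: None
    try:
        yield
    finally:
        ToyParamGroup.unshard = original  # always restore the real method


@contextmanager
def unshard_context(module):
    """Gather once on entry, like step 2 with writeback=False."""
    module.unshard()
    yield


module = ToyModule()
# Same entry order as the report: the patch is active before the gather runs.
with disable_unshard(), unshard_context(module):
    gathered_inside = module.group.gathered

print(gathered_inside)  # False: the gather silently did nothing
```

Because the patch lives on the class rather than on one instance, any later call that routes through `ToyParamGroup.unshard` is also suppressed, which mirrors why the pre-forward hook cannot recover.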

Why writeback=True (GPTQ/AWQ/SmoothQuant) is unaffected: that branch of fsdp2_weight_access_and_writeback_context gathers
via param.redistribute(... Replicate ...) — a DTensor collective that doesn't touch the patched unshard(). So only the
writeback=False (max/mse/local_hessian) path, i.e. the non-mutating optimization this PR adds, is broken on FSDP2.

_fsdp2_unshard_context is internally self-contradictory under this caller: line 480 calls unshard() (needs it to work) while
line 482 calls _disable_fsdp_unshard_reshard (disables it). It assumes it runs before anyone disables unshard, but its only
production caller disables it first.

Impact / trigger: FSDP2 layerwise calibration with calib_mutates_weights=False. The default (True) takes the redistribute
path and works, so the bug only fires when a user opts into the non-mutating feature.

Test that should catch this (likely unrun, since it is a multi-GPU test on a draft PR):
tests/gpu/torch/quantization/test_fsdp2.py:264

with persistent_materialization(layer, writeback=False):
    assert not isinstance(layer[0].weight, DTensor)  # fails: still a DTensor
    layer(inputs)

Suggested fix (don't double-disable):

Have _fsdp2_unshard_context call the unpatched unshard (_disable_fsdp_unshard_reshard already captures orig_unshard;
expose it), so the gather works while per-forward reshard stays suppressed; or

Skip persistent_materialization's outer _disable_fsdp_unshard_reshard on the FSDP writeback=False path, since
_fsdp2_unshard_context already suppresses reshard internally after a real unshard.
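The first suggestion (call the unpatched `unshard`) could look roughly like the following toy sketch. It is an assumption-laden illustration, not the library's code: `ToyParamGroup` and `disable_unshard` are hypothetical names, and the only idea taken from the comment is that the disabling context manager already holds the original method and could hand it to callers.

```python
from contextlib import contextmanager


class ToyParamGroup:
    """Hypothetical stand-in for a param group with a patchable unshard()."""

    def __init__(self):
        self.gathered = False

    def unshard(self):
        self.gathered = True  # stands in for the real all-gather


@contextmanager
def disable_unshard():
    """Patch unshard to a no-op, but yield the original for callers that need a real gather."""
    original = ToyParamGroup.unshard
    ToyParamGroup.unshard = lambda self: None
    try:
        yield original
    finally:
        ToyParamGroup.unshard = original


group = ToyParamGroup()
with disable_unshard() as real_unshard:
    group.unshard()                    # patched method: does nothing
    after_patched_call = group.gathered
    real_unshard(group)                # unpatched function: performs the gather
    after_real_call = group.gathered

print(after_patched_call, after_real_call)  # False True
```

This keeps per-forward calls suppressed while still allowing one deliberate gather, at the cost of widening the helper's interface.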

This is possible. Can we move _disable_fsdp_unshard_reshard(layer) to _disable_fsdp_reshard(layer)?

🤖 Bot comment.

Fixed in 9bfec17479.

persistent_materialization() now enters enable_weight_access_and_writeback(..., writeback=writeback) before _disable_fsdp_unshard_reshard(...), so the writeback=False FSDP2 path performs the real one-time unshard before per-forward unshard/reshard is suppressed.

I also added regression coverage for the actual public flow: mtq.quantize(...) with layerwise calib_mutates_weights=False on FSDP2 now asserts that the layer is materialized inside persistent_materialization (params are no longer DTensors), and still checks the quantized FSDP2 output against the non-FSDP reference.
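The effect of swapping the entry order can be checked with a small toy model. Everything here is hypothetical (`ToyLayer`, `weight_access`, `suppress_gather`); it only assumes the behaviour described in this thread: one context gathers on entry, and the other globally suppresses gathers while it is active.

```python
from contextlib import ExitStack, contextmanager


class ToyLayer:
    """Hypothetical layer; the class-level flag stands in for the patched method."""

    gather_enabled = True

    def __init__(self):
        self.materialized = False

    def gather(self):
        if ToyLayer.gather_enabled:
            self.materialized = True


@contextmanager
def weight_access(layer):
    """Gather once on entry and release on exit."""
    layer.gather()
    try:
        yield
    finally:
        layer.materialized = False


@contextmanager
def suppress_gather():
    """Globally turn gathers into no-ops while active."""
    ToyLayer.gather_enabled = False
    try:
        yield
    finally:
        ToyLayer.gather_enabled = True


def materialized_inside(order):
    """Enter the two contexts in the given order and report the layer state inside."""
    layer = ToyLayer()
    factories = {"access": lambda: weight_access(layer), "suppress": suppress_gather}
    with ExitStack() as stack:
        for name in order:
            stack.enter_context(factories[name]())
        return layer.materialized


old_order = materialized_inside(["suppress", "access"])
new_order = materialized_inside(["access", "suppress"])
print(old_order, new_order)  # False True
```

A regression test along these lines asserts on the state observed inside the contexts, which is the same shape as the materialization check described above.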

Validated with:

pytest_pwd tests/unit/torch/quantization/test_layerwise_calibrate.py -q
pytest_pwd CUDA_VISIBLE_DEVICES=2,3 tests/gpu/torch/quantization/test_fsdp2.py -k "layerwise_calibrate_fsdp2 or persistent_materialization" -q

Signed-off-by: realAsma <akuriparambi@nvidia.com>

Fridah-nv · 2026-06-01T23:56:05Z

-        if self.layerwise.save_quantizers_only and not self._supports_save_quantizers_only:
+    def _validate_non_mutating_layerwise_supported(self):
+        """Enforce the ``calib_mutates_weights=False`` whitelist."""
+        if not self.layerwise.calib_mutates_weights and not self._supports_save_quantizers_only:


nit: we could rename _supports_save_quantizers_only to something like _calib_is_amax_only to align with the new vocabulary; otherwise a reader has to learn that _supports_save_quantizers_only=True means "amax-only algorithm, so calib_mutates_weights=False is allowed".
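To make the naming point concrete, here is a minimal sketch of the whitelist check using the proposed name. The classes `ToyLayerwiseConfig` and `ToyCalibConfig` are hypothetical and only mirror the condition visible in the diff; the real config classes and where the flag lives are not shown in this thread, so treat the structure as an assumption.

```python
from dataclasses import dataclass


@dataclass
class ToyLayerwiseConfig:
    """Hypothetical nested config carrying the new flag."""

    calib_mutates_weights: bool = True


@dataclass
class ToyCalibConfig:
    """Hypothetical algorithm config; calib_is_amax_only is the reviewer's proposed name."""

    layerwise: ToyLayerwiseConfig
    calib_is_amax_only: bool = False

    def validate(self):
        # Same condition as the diff, with the flag renamed for readability.
        if not self.layerwise.calib_mutates_weights and not self.calib_is_amax_only:
            raise ValueError(
                "calib_mutates_weights=False requires an amax-only calibration algorithm"
            )


def is_valid(config):
    """Return True if validate() passes, False if it raises ValueError."""
    try:
        config.validate()
    except ValueError:
        return False
    return True


non_mutating = ToyLayerwiseConfig(calib_mutates_weights=False)
allowed = is_valid(ToyCalibConfig(non_mutating, calib_is_amax_only=True))
rejected = is_valid(ToyCalibConfig(non_mutating, calib_is_amax_only=False))
default_ok = is_valid(ToyCalibConfig(ToyLayerwiseConfig()))
print(allowed, rejected, default_ok)  # True False True
```

With this name the condition reads directly as "non-mutating calibration is only allowed for amax-only algorithms".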

Fridah-nv · 2026-06-02T00:01:50Z

-            if ckpt:
-                ckpt.save(layer_idx, model, transformer_layers, next_inputs)
+                if ckpt:
+                    ckpt.save(layer_idx, model, transformer_layers, next_inputs)


could you explain why we move ckpt.save inside the persistent_materialization context?

We should materialize a layer once and do all the operations (get inputs, calibrate, save) inside that context. Outside of it, the layer might be sharded or moved to disk, in which case we would need to materialize it again. The point is to avoid redundant materialization.
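The cost being avoided can be illustrated with a toy counter. This is a sketch under stated assumptions: `materialize`, `save`, and the dict-based layer are hypothetical, and `save` is assumed to re-materialize when the layer is not already resident, which is the situation the reply describes.

```python
from contextlib import contextmanager

calls = {"materialize": 0}


@contextmanager
def materialize(layer):
    """Hypothetical materialization: count each entry and mark the layer resident."""
    calls["materialize"] += 1
    layer["on_device"] = True
    try:
        yield layer
    finally:
        layer["on_device"] = False  # layer is sharded or offloaded again


def save(layer):
    """Saving needs real weights; re-materialize if the layer is not resident."""
    if layer["on_device"]:
        return
    with materialize(layer):
        return


def count_materializations(save_inside):
    """Run one layer's calibrate-and-save step and count materializations."""
    calls["materialize"] = 0
    layer = {"on_device": False}
    with materialize(layer):
        # ... get inputs, calibrate ...
        if save_inside:
            save(layer)
    if not save_inside:
        save(layer)
    return calls["materialize"]


outside_count = count_materializations(save_inside=False)
inside_count = count_materializations(save_inside=True)
print(outside_count, inside_count)  # 2 1
```

Over many layers the extra materialization per layer adds up, which is the motivation for keeping the save under the same context.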

realAsma · 2026-06-03T22:20:24Z

+    with (
+        _disable_fsdp_unshard_reshard(layer),
+        enable_weight_access_and_writeback(layer, layer, writeback=writeback),
+        temporarily_remove_accelerate_hook(layer),


I added this during an experiment that kept the zeros_from_meta placeholders on the meta device instead of the actual device.

However, zeros_from_meta on the meta device was breaking a test, so I did not check that in. I think we should modify that test instead.

Signed-off-by: realAsma <akuriparambi@nvidia.com>

The skip-layer-weight-checkpoint optimization (save_quantizers_only) is moved out of this foundation PR; it lands complete in the stacked PR #1592 (as calib_mutates_weights). _CheckpointState now always saves the full layer weights blob. save_every and the rest of the nested LayerwiseConfig are kept. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>

Fridah-nv · 2026-06-05T22:44:44Z

Moving the changes in this PR to #1640, as the base branch changed.

save_every); this PR adds the two optimizations on top: calib_mutates_weights (skip weight checkpoint + writeback for amax-only algorithms) and meta-device skip-layer placeholders. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>

realAsma added 2 commits June 1, 2026 21:41

Fix layerwise skip placeholder memory growth

772fd92

Signed-off-by: realAsma <akuriparambi@nvidia.com>

Refine layerwise non-mutating calibration

289ea73

Signed-off-by: realAsma <akuriparambi@nvidia.com>

realAsma mentioned this pull request Jun 1, 2026

feat: Layerwise calibration: nested config + QDQ-from-prev-layer flag + checkpoint I/O knobs #1571

Open

Keep layerwise checkpoint save under materialization

350f028

Signed-off-by: realAsma <akuriparambi@nvidia.com>

Fix non-mutating layerwise materialization

9bfec17

Signed-off-by: realAsma <akuriparambi@nvidia.com>

Fridah-nv closed this Jun 5, 2026

Refine layerwise non-mutating calibration #1592
realAsma wants to merge 4 commits into fridah/layerwise-config from asma/layerwise_skip_in_meta

2 participants
