
feat: parallelize fakequant export across GPUs via ThreadPoolExecutor #1177

Open
sungsooha wants to merge 4 commits into NVIDIA:main from sungsooha:sungsooh/parallel-fakequant-export-pr

Conversation

Contributor

@sungsooha sungsooha commented Apr 3, 2026

What does this PR do?

Type of change: New feature

Refactors export_hf_vllm_fq_checkpoint() Step 1 (weight fake-quantization) to process weights concurrently across GPUs using ThreadPoolExecutor.

Motivation: When exporting large models with compute-heavy quantization backends (e.g., barboqt exhaustive search), the export loop processes one weight tensor at a time on a single GPU while other GPUs sit idle. On clusters with idle GPU reapers (e.g., cw-dfw), this causes jobs to be killed after 30 minutes of idle GPU time. Parallel export keeps all GPUs active during the quantize phase.

Changes:

  • Collects all quantizer work items, groups by device
  • One thread per GPU, each processes its weight partition concurrently
  • Falls back to sequential on single GPU or CPU (no thread overhead)
  • Adds parallel: bool = True parameter (default on, backwards compatible)
  • GIL released during CUDA kernels → true parallelism
  • state_dict[key] = value writes are atomic (single-key dict assignment)
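The grouping-and-dispatch scheme described in these bullets can be sketched as follows. This is an illustrative pattern only — `process_batches` and the work-item layout are assumptions, not the PR's actual API; the real items would carry a weight tensor and its TensorQuantizer:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def process_batches(work_items, process_item, parallel=True):
    """Group work items by device and run one worker thread per device.

    Sketch of the pattern only: each item here is a dict with a "device" key.
    """
    groups = defaultdict(list)
    for item in work_items:
        groups[item["device"]].append(item)

    # Single device (or parallelism opted out): run inline, no thread overhead.
    if not parallel or len(groups) <= 1:
        return [process_item(i) for items in groups.values() for i in items]

    # One thread per device; CUDA kernels release the GIL, so per-device
    # batches make real concurrent progress.
    with ThreadPoolExecutor(max_workers=len(groups)) as pool:
        futures = [
            pool.submit(lambda batch: [process_item(i) for i in batch], items)
            for items in groups.values()
        ]
        return [r for f in futures for r in f.result()]
```

In the PR the merged results are then written back into `state_dict` on the main thread, which is why single-key dict assignment atomicity matters.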

Measured speedup on Qwen3-8B with barboqt_exhaustive_2b (8× H100 DGX):

  • Quantize step: 293s → 49s (6.0x)
  • Total export: 312s → 66s (4.7x)

Verified bitwise identical output (SHA256 checksums match) across FP8, NVFP4, NVFP4+rotation, psx_luts, psx_formats, luts/lo-bcq, and barboqt backends.
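The bitwise check above amounts to hashing every exported file and comparing digests; a minimal sketch (hypothetical helper, not part of the PR):

```python
import hashlib
from pathlib import Path


def sha256_of_dir(export_dir):
    """Return {relative_path: sha256_hexdigest} for every file under export_dir."""
    root = Path(export_dir)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


# Two exports are bitwise identical iff their digest maps are equal, e.g.:
# assert sha256_of_dir("./export_seq") == sha256_of_dir("./export_par")
```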

Usage

from modelopt.torch.export import export_hf_vllm_fq_checkpoint

# Default: parallel=True (uses all GPUs automatically)
export_hf_vllm_fq_checkpoint(model, export_dir="./export")

# Opt out if needed:
export_hf_vllm_fq_checkpoint(model, export_dir="./export", parallel=False)

No changes needed in hf_ptq.py or downstream scripts — parallel=True is the default and falls back gracefully on single GPU.

Testing

  • Unit test (new): tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py
    • test_parallel_vs_sequential_identical: Requires ≥2 GPUs, verifies bitwise identical HF weights + quantizer state between parallel and sequential export
    • test_single_gpu_fallback: Verifies parallel=True on single GPU works without error
  • Existing test: test_vllm_fakequant_hf_export.py passes unchanged (uses parallel=True default, falls back to sequential on single GPU)
  • Manual validation: SHA256 file checksums verified identical for FP8, NVFP4, NVFP4_rotate presets on 8× H100

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅ (parallel=True default; single GPU falls back to sequential; no API break)
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A (no copied code, no new deps)
  • Did you write any new necessary tests?: ✅ (2 new GPU tests)
  • Did you update Changelog?: ❌ (will update if requested)

Additional Information

  • Concurrent ThreadPoolExecutor is safe because: (1) state_dict is a detached copy, (2) each TensorQuantizer has per-instance state, (3) torch.inference_mode() context, (4) CUDA kernels release GIL, (5) same_device_as() context manager handles per-thread device switching
  • For models loaded with device_map="auto", weights are naturally distributed across GPUs → parallel export uses all GPUs with no configuration needed
  • device_map="sequential" on small models may place everything on GPU 0 → parallel export gracefully falls back to sequential

Summary by CodeRabbit

  • New Features

    • Optional multi‑GPU parallel processing for faster fake‑quantization exports with a new export parameter to toggle parallelism; CPU work handled inline and GPU work dispatched to workers. Added logging of export mode, GPU work distribution, and elapsed time.
  • Bug Fixes

    • Fixed rotate‑restore logic to avoid unintended reassignment during export; ensured consistent application of exported weight results.
  • Tests

    • Added GPU tests validating deterministic equivalence between parallel and sequential exports and a single‑GPU fallback check.

@sungsooha sungsooha requested a review from a team as a code owner April 3, 2026 19:48
@sungsooha sungsooha requested a review from ChenhanYu April 3, 2026 19:48
@copy-pr-bot

copy-pr-bot bot commented Apr 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds pre-collected per-weight fake-quantization work items, optional device-parallel execution using ThreadPoolExecutor (controlled by new parallel: bool flag), per-device batching and merging of results, logging/timing, and a conditional rotation restore; includes GPU tests comparing parallel vs sequential exports.

Changes

Cohort / File(s) Summary
Export refactor & parallelization
modelopt/torch/export/plugins/vllm_fakequant_hf.py
Introduce _WeightQuantWork, _collect_quant_work, _process_weight, _process_device_batch; add parallel: bool = True to export_hf_vllm_fq_checkpoint; group work by device, optionally dispatch per-GPU batches to ThreadPoolExecutor, gather/merge results (parallel vs immediate application in sequential mode), add logger and timing, and restore wq._rotate only when orig_rotate is not None.
GPU tests for parallel export
tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py
Add GPU-only tests that require CUDA devices: compare outputs of export_hf_vllm_fq_checkpoint with parallel=False and parallel=True (tensor-wise equality of HF checkpoints and recursive equality of vllm_fq_modelopt_state.pth); include single-GPU fallback test verifying checkpoint presence and reload.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant Export as export_hf_vllm_fq_checkpoint
    participant Collector as _collect_quant_work
    participant Grouper as Device Grouper
    participant Executor as ThreadPoolExecutor
    participant Worker as _process_device_batch
    participant State as state_dict / vllm_fq_modelopt_state.pth

    Caller->>Export: call export_hf_vllm_fq_checkpoint(parallel=...)
    Export->>Collector: collect per-weight work items
    Collector-->>Export: work_items
    Export->>Grouper: group work_items by device
    Grouper-->>Export: device_groups

    alt parallel && multiple CUDA devices
        Export->>Executor: submit per-device batches
        par per-device processing
            Executor->>Worker: process device_n batch
        and
            Executor->>Worker: process device_m batch
        end
        Worker-->>Executor: per-device results
        Executor-->>Export: merged results
    else sequential or single-device
        Export->>Worker: process batches sequentially inline
        Worker-->>Export: results
    end

    Export->>State: apply results to state_dict and write checkpoint
    State-->>Export: checkpoint persisted
    Export-->>Caller: complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

  • Security Anti-Patterns — ❌ Error: The new test file uses torch.load(..., weights_only=False) without inline comments justifying why it is safe to disable the security feature. Resolution: add an inline comment before each torch.load(..., weights_only=False) call explaining that the file is internally generated and not user-supplied, or refactor to use weights_only=True.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 72.73%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title accurately summarizes the main change (parallelizing fakequant export across GPUs via ThreadPoolExecutor); it is specific, concise, and matches the PR's primary objective.

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov bot commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 18.60465% with 70 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.15%. Comparing base (80d2f02) to head (867b856).
⚠️ Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/export/plugins/vllm_fakequant_hf.py 18.60% 70 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1177      +/-   ##
==========================================
- Coverage   74.77%   74.15%   -0.62%     
==========================================
  Files         351      352       +1     
  Lines       40289    41841    +1552     
==========================================
+ Hits        30125    31027     +902     
- Misses      10164    10814     +650     
Flag Coverage Δ
unit 54.98% <18.60%> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

267-275: ⚠️ Potential issue | 🟠 Major

Restore quantizer state from a finally block.

If any of the state-serialization or save calls raise here, the function exits with weight quantizers still disabled and _rotate possibly overwritten. That leaves the caller's model mutated and makes retrying export on the same instance unreliable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 267 - 275,
Wrap the serialization and model.save_pretrained calls (the block that calls
torch.save(modelopt_state, ...) and model.save_pretrained(export_dir,
state_dict=clean_sd, ...)) in a try/finally and move the wqs_to_restore loop
into the finally so quantizers are always restored even on exception; ensure
wqs_to_restore holds each quantizer and its original state (e.g., (wq,
was_enabled, orig_rotate) or at least (wq, orig_rotate) plus whether it was
enabled) and in the finally set wq._rotate = orig_rotate and re-enable or
disable the quantizer to the original enabled state (use
wq.enable()/wq.disable() or equivalent) so model state is not left mutated after
failures.
🧹 Nitpick comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

115-120: Re-enter torch.inference_mode() inside the worker thread.

torch.inference_mode() is thread-local, so the context opened in export_hf_vllm_fq_checkpoint() does not apply to ThreadPoolExecutor workers. The parallel path therefore runs quantizer forwards under different autograd/inference settings than the sequential path.

♻️ Proposed fix
 def _process_device_batch(items: list[_WeightQuantWork], device: torch.device):
     """Process all weight items on a single GPU. Runs in a dedicated thread."""
-    with torch.cuda.device(device):
+    with torch.inference_mode(), torch.cuda.device(device):
         results = [_process_weight(item) for item in items]
         torch.cuda.synchronize(device)
     return results
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 115 - 120,
The worker thread function _process_device_batch currently runs _process_weight
without the torch.inference_mode() context (which is thread-local), causing
different autograd/inference behavior than the sequential path; fix it by
entering torch.inference_mode() inside _process_device_batch (e.g., wrap the
list comprehension/results creation in a with torch.inference_mode(): block so
each call to _process_weight runs under the same inference mode as
export_hf_vllm_fq_checkpoint), then synchronize and return results as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 79-89: The branch that folds an input's _pre_quant_scale should
also check the quantizer's active flag so we don't fold a scale that was already
applied and then had _enable_pre_quant_scale set False by
disable_pre_quant_scale_and_resmooth(); update the condition in the block that
sets inp_q/inp_q_key (the candidate check around _pre_quant_scale and _disabled)
to also require candidate._enable_pre_quant_scale is True (or invert the logic
to skip folding when _enable_pre_quant_scale is False), referencing candidate,
_pre_quant_scale, _enable_pre_quant_scale, inp_q, inp_q_key and
get_unwrapped_name so the code only folds scales for quantizers that are still
enabled.

In `@tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py`:
- Around line 91-106: Test currently only compares modelopt_state_weights
(seq_qsd vs par_qsd) so changes to
modelopt_state_dict[*]["metadata"]["quantizer_state"] would be missed; update
the test (where seq_state and par_state are produced by
export_hf_vllm_fq_checkpoint) to compare the full exported ModelOpt payload
instead of only modelopt_state_weights: iterate over top-level keys of seq_state
and par_state (including modelopt_state_dict and its metadata), recursively
compare dicts and tensors (use torch.equal for tensors, equality for scalars)
and assert identical keys and values for
modelopt_state_dict[*]["metadata"]["quantizer_state"] as well as
modelopt_state_weights so any regression in metadata is caught.
- Around line 109-129: The test function test_single_gpu_fallback
unconditionally calls model.cuda() and should be skipped when CUDA isn't
available; wrap or guard the test with the standard GPU check (e.g., a
pytest.skipif or pytest.mark.skipif using torch.cuda.is_available()) so the body
(including the .cuda() call and subsequent export_hf_vllm_fq_checkpoint call)
only runs when a CUDA device exists; update the test decorator or add an early
skip at the start of test_single_gpu_fallback to prevent CPU-only runs from
failing.
- Around line 88-89: Add an inline comment next to the two torch.load calls
where weights_only=False (the assignments to seq_state and par_state loading
"vllm_fq_modelopt_state.pth" from seq_dir and par_dir) that documents this
exception: note that these checkpoint files are generated earlier in the test
(trusted/test-generated on lines where the test writes the files) so using
weights_only=False is safe and intentional per guidelines; reference seq_dir,
par_dir and the filename in the comment for clarity.

---

Outside diff comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 267-275: Wrap the serialization and model.save_pretrained calls
(the block that calls torch.save(modelopt_state, ...) and
model.save_pretrained(export_dir, state_dict=clean_sd, ...)) in a try/finally
and move the wqs_to_restore loop into the finally so quantizers are always
restored even on exception; ensure wqs_to_restore holds each quantizer and its
original state (e.g., (wq, was_enabled, orig_rotate) or at least (wq,
orig_rotate) plus whether it was enabled) and in the finally set wq._rotate =
orig_rotate and re-enable or disable the quantizer to the original enabled state
(use wq.enable()/wq.disable() or equivalent) so model state is not left mutated
after failures.

---

Nitpick comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 115-120: The worker thread function _process_device_batch
currently runs _process_weight without the torch.inference_mode() context (which
is thread-local), causing different autograd/inference behavior than the
sequential path; fix it by entering torch.inference_mode() inside
_process_device_batch (e.g., wrap the list comprehension/results creation in a
with torch.inference_mode(): block so each call to _process_weight runs under
the same inference mode as export_hf_vllm_fq_checkpoint), then synchronize and
return results as before.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 551f89ec-1823-498c-a631-9a7529b6cb77

📥 Commits

Reviewing files that changed from the base of the PR and between 18ce04f and d28cda9.

📒 Files selected for processing (2)
  • modelopt/torch/export/plugins/vllm_fakequant_hf.py
  • tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py

Comment on lines +88 to +89
seq_state = torch.load(seq_dir / "vllm_fq_modelopt_state.pth", weights_only=False)
par_state = torch.load(par_dir / "vllm_fq_modelopt_state.pth", weights_only=False)
Contributor


⚠️ Potential issue | 🔴 Critical


Add inline comment documenting why weights_only=False is safe.

The checkpoint files are test-generated on lines 66 and 73, so they are internally trusted. However, per coding guidelines, an inline comment must be added to document this whenever weights_only=False is used.

🔐 Fix
     # Compare modelopt state.
+    # Safe: both files are written by this test above and are not user-supplied.
     seq_state = torch.load(seq_dir / "vllm_fq_modelopt_state.pth", weights_only=False)
     par_state = torch.load(par_dir / "vllm_fq_modelopt_state.pth", weights_only=False)

As per coding guidelines: "Do not use torch.load(..., weights_only=False) unless a documented exception is provided. If weights_only=False is required, provide an inline comment explaining why and confirming the file is internally-generated or trusted."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py` around
lines 88 - 89, Add an inline comment next to the two torch.load calls where
weights_only=False (the assignments to seq_state and par_state loading
"vllm_fq_modelopt_state.pth" from seq_dir and par_dir) that documents this
exception: note that these checkpoint files are generated earlier in the test
(trusted/test-generated on lines where the test writes the files) so using
weights_only=False is safe and intentional per guidelines; reference seq_dir,
par_dir and the filename in the comment for clarity.

@realAsma realAsma requested a review from cjluo-nv April 3, 2026 20:32
@sungsooha sungsooha force-pushed the sungsooh/parallel-fakequant-export-pr branch from d28cda9 to e944e99 Compare April 3, 2026 20:46
Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (3)
tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py (2)

87-90: ⚠️ Potential issue | 🔴 Critical

State why these weights_only=False loads are trusted.

The current comment explains why full unpickling is needed, but it still does not say that both vllm_fq_modelopt_state.pth files were written earlier in this test and are trusted. That trust-boundary justification needs to live inline next to these torch.load() calls.

🔐 Suggested comment
-    # Compare full modelopt state payload (weights_only=False: modelopt_state contains
-    # Python objects — dicts, strings, quantizer configs — that require unpickling).
+    # Safe: both vllm_fq_modelopt_state.pth files were written by this test above and are
+    # trusted. weights_only=False is required because modelopt_state contains Python
+    # objects such as dicts, strings, and quantizer configs.
     seq_state = torch.load(seq_dir / "vllm_fq_modelopt_state.pth", weights_only=False)
     par_state = torch.load(par_dir / "vllm_fq_modelopt_state.pth", weights_only=False)

As per coding guidelines: "Do not use torch.load(..., weights_only=False) unless a documented exception is provided. Always use weights_only=True to prevent arbitrary code execution from malicious checkpoints. If weights_only=False is required, provide an inline comment explaining why and confirming the file is internally-generated or trusted."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py` around
lines 87 - 90, The torch.load calls for vllm_fq_modelopt_state.pth use
weights_only=False which requires an explicit trust justification; edit the
lines that call torch.load (for seq_dir / "vllm_fq_modelopt_state.pth" and
par_dir / "vllm_fq_modelopt_state.pth") to add an inline comment stating these
checkpoint files were created earlier in this test (i.e., are
internally-generated and trusted) and therefore unpickling is safe, and
reference that full modelopt state (not just tensors) is required for
comparison.

92-98: ⚠️ Potential issue | 🟡 Minor

Deep-compare modelopt_state_dict, not just its shape.

This loop only checks list length and mode_str. A regression in m_state["metadata"]["quantizer_state"] would still pass as long as the mode ordering stays the same. Please compare seq_ms and par_ms recursively as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py` around
lines 92 - 98, The test currently only checks modelopt_state_dict length and
mode strings (seq_msd, par_msd and seq_mode/par_mode) but not the actual
metadata content; update the loop over (seq_mode, seq_ms), (par_mode, par_ms) to
perform a deep recursive equality check of seq_ms vs par_ms (the quantizer
metadata such as m_state["metadata"]["quantizer_state"]) and fail with a clear
message on mismatch; use a structural deep-compare (e.g., equality for
dicts/lists or a recursive comparator) so any nested regression in
modelopt_state_dict is detected.
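A recursive comparator of the kind this comment asks for might look like the following. This is a sketch — `deep_equal` is a hypothetical helper; in the actual test the leaf comparison would be a wrapper around torch.equal for tensors:

```python
def deep_equal(a, b, leaf_eq=None):
    """Recursively compare nested dicts/lists/tuples.

    leaf_eq(a, b) handles non-container leaves (e.g. pass a torch.equal
    wrapper to compare tensors); otherwise plain == is used.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(
            deep_equal(a[k], b[k], leaf_eq) for k in a
        )
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        return len(a) == len(b) and all(
            deep_equal(x, y, leaf_eq) for x, y in zip(a, b)
        )
    if leaf_eq is not None:
        return leaf_eq(a, b)
    return a == b
```

In the test, seq_ms and par_ms (including the nested metadata["quantizer_state"]) would be passed through such a comparator so that any regression in the nested payload fails the assertion, not just a change in length or mode ordering.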
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

79-89: ⚠️ Potential issue | 🟠 Major

Gate _pre_quant_scale folding on _enable_pre_quant_scale.

When disable_pre_quant_scale_and_resmooth() has already consumed the scale, _pre_quant_scale can still be populated while _enable_pre_quant_scale is False. This branch will fold that scale again once the input quantizer is disabled, exporting an incorrectly scaled weight.

🛠️ Suggested guard
             if hasattr(module, inp_attr):
                 candidate = getattr(module, inp_attr)
                 if (
+                    getattr(candidate, "_enable_pre_quant_scale", True)
                     and hasattr(candidate, "_pre_quant_scale")
                     and candidate._pre_quant_scale is not None
                     and candidate._disabled
                 ):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 79 - 89, The
branch that folds a module's _pre_quant_scale should also check that the
pre-quant folding is currently enabled; update the condition around
candidate/_pre_quant_scale in the block handling module and inp_attr to also
require candidate._enable_pre_quant_scale is True (in addition to
_pre_quant_scale not None and _disabled) so that scales already consumed by
disable_pre_quant_scale_and_resmooth() are not folded again; ensure you keep the
same logic that sets inp_q and inp_q_key (using get_unwrapped_name with
module_name and model) only when the enable flag is present and true.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 79-89: The branch that folds a module's _pre_quant_scale should
also check that the pre-quant folding is currently enabled; update the condition
around candidate/_pre_quant_scale in the block handling module and inp_attr to
also require candidate._enable_pre_quant_scale is True (in addition to
_pre_quant_scale not None and _disabled) so that scales already consumed by
disable_pre_quant_scale_and_resmooth() are not folded again; ensure you keep the
same logic that sets inp_q and inp_q_key (using get_unwrapped_name with
module_name and model) only when the enable flag is present and true.

In `@tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py`:
- Around line 87-90: The torch.load calls for vllm_fq_modelopt_state.pth use
weights_only=False which requires an explicit trust justification; edit the
lines that call torch.load (for seq_dir / "vllm_fq_modelopt_state.pth" and
par_dir / "vllm_fq_modelopt_state.pth") to add an inline comment stating these
checkpoint files were created earlier in this test (i.e., are
internally-generated and trusted) and therefore unpickling is safe, and
reference that full modelopt state (not just tensors) is required for
comparison.
- Around line 92-98: The test currently only checks modelopt_state_dict length
and mode strings (seq_msd, par_msd and seq_mode/par_mode) but not the actual
metadata content; update the loop over (seq_mode, seq_ms), (par_mode, par_ms) to
perform a deep recursive equality check of seq_ms vs par_ms (the quantizer
metadata such as m_state["metadata"]["quantizer_state"]) and fail with a clear
message on mismatch; use a structural deep-compare (e.g., equality for
dicts/lists or a recursive comparator) so any nested regression in
modelopt_state_dict is detected.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 00bfd862-591a-4904-b57b-72dc8dbbc343

📥 Commits

Reviewing files that changed from the base of the PR and between d28cda9 and e944e99.

📒 Files selected for processing (2)
  • modelopt/torch/export/plugins/vllm_fakequant_hf.py
  • tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py

Contributor

@kinjalpatel27 kinjalpatel27 left a comment


LGTM

Contributor

@realAsma realAsma left a comment


Looks great! Left a comment. Could you please address it before merging?

Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py (1)

87-90: ⚠️ Potential issue | 🟡 Minor

Add trust justification to inline comment for weights_only=False.

The comment explains why weights_only=False is needed (Python objects require unpickling), but per coding guidelines, it should also confirm the files are internally-generated/trusted.

🔐 Suggested fix
-    # Compare full modelopt state payload (weights_only=False: modelopt_state contains
-    # Python objects — dicts, strings, quantizer configs — that require unpickling).
+    # Compare full modelopt state payload.
+    # Safe: files are test-generated above (lines 66, 73) and not user-supplied.
+    # weights_only=False required: modelopt_state contains Python objects (dicts, configs).
     seq_state = torch.load(seq_dir / "vllm_fq_modelopt_state.pth", weights_only=False)
     par_state = torch.load(par_dir / "vllm_fq_modelopt_state.pth", weights_only=False)

As per coding guidelines: "If weights_only=False is required, provide an inline comment explaining why and confirming the file is internally-generated or trusted."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py` around
lines 87 - 90, The inline comment for torch.load(..., weights_only=False) in the
seq_state/par_state loads lacks a trust justification; update the comment near
torch.load calls for seq_dir/"vllm_fq_modelopt_state.pth" and
par_dir/"vllm_fq_modelopt_state.pth" to state that weights_only=False is
required because the modelopt state contains Python objects that must be
unpickled and explicitly confirm these files are produced by the test code
(internally-generated and trusted), e.g., mention they were written earlier in
the test so unpickling is safe.
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

79-89: ⚠️ Potential issue | 🟠 Major

Gate _pre_quant_scale folding on the actual enable state.

The condition should also check candidate._enable_pre_quant_scale to avoid double-folding when disable_pre_quant_scale_and_resmooth() has already resmoothed the weight but left _pre_quant_scale populated with _enable_pre_quant_scale = False.

🛠️ Suggested fix
             if hasattr(module, inp_attr):
                 candidate = getattr(module, inp_attr)
                 if (
+                    getattr(candidate, "_enable_pre_quant_scale", True)
+                    and hasattr(candidate, "_pre_quant_scale")
-                    hasattr(candidate, "_pre_quant_scale")
                     and candidate._pre_quant_scale is not None
                     and candidate._disabled
                 ):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 79 - 89, The
current folding check uses candidate._pre_quant_scale and candidate._disabled
but misses candidate._enable_pre_quant_scale, causing double-folding after
disable_pre_quant_scale_and_resmooth() leaves _pre_quant_scale present but
_enable_pre_quant_scale == False; update the conditional in the block handling
(module, inp_attr) so it also requires candidate._enable_pre_quant_scale to be
truthy before assigning inp_q/inp_q_key (referencing candidate,
_pre_quant_scale, _enable_pre_quant_scale,
disable_pre_quant_scale_and_resmooth(), get_unwrapped_name, module_name, model).
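The double-fold hazard flagged above can be illustrated in plain Python. The class and function names below are illustrative stand-ins, not the actual ModelOpt types: the point is only that folding is safe while the quantizer still "owns" the scale, and must be skipped once a resmooth has already baked it into the weight.

```python
class FakeQuantizer:
    """Illustrative stand-in for an input quantizer holding a pre-quant scale."""

    def __init__(self, scale):
        self._pre_quant_scale = scale
        self._enable_pre_quant_scale = True
        self._disabled = True


def fold_pre_quant_scale(weight, q):
    # Fold only while the quantizer still owns the scale. After a
    # resmooth, _pre_quant_scale may stay populated even though it was
    # already folded into the weight; folding again would over-scale.
    if (
        getattr(q, "_enable_pre_quant_scale", True)
        and getattr(q, "_pre_quant_scale", None) is not None
        and q._disabled
    ):
        return [w * s for w, s in zip(weight, q._pre_quant_scale)]
    return weight


w = fold_pre_quant_scale([1.0, 2.0], FakeQuantizer([2.0, 2.0]))
print(w)  # [2.0, 4.0]: folded once

q = FakeQuantizer([2.0, 2.0])
q._enable_pre_quant_scale = False  # resmooth already consumed the scale
print(fold_pre_quant_scale(w, q))  # [2.0, 4.0]: guard prevents a second fold
```

Without the `_enable_pre_quant_scale` check, the second call would return [4.0, 8.0], which is the over-scaled export the review warns about.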
🧹 Nitpick comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

272-275: Consider simplifying the orig_rotate restoration guard.

Per the TensorQuantizer implementation, _rotate is always initialized (to RotateConfig, dict, or bool) and should never be None. The if orig_rotate is not None guard is defensive but may mask bugs if _rotate is unexpectedly unset.

If this is intentional defensive coding, adding a brief comment would clarify the intent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 272 - 275,
The loop restoring quantizers currently conditionally resets wq._rotate only if
orig_rotate is not None; because TensorQuantizer always sets _rotate, remove the
defensive guard and always restore it (i.e., in the wqs_to_restore loop call
wq.enable() and set wq._rotate = orig_rotate unconditionally) to surface any
unexpected unset state, or if you prefer to keep the guard, add a short comment
referencing TensorQuantizer to explain the defensive intent; update the code
around the wqs_to_restore iteration and the references to orig_rotate and
wq._rotate accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 02ae9151-b318-4488-b80a-d9792440d9f0

📥 Commits

Reviewing files that changed from the base of the PR and between e944e99 and c7dee6a.

📒 Files selected for processing (2)
  • modelopt/torch/export/plugins/vllm_fakequant_hf.py
  • tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py

@realAsma realAsma enabled auto-merge (squash) April 6, 2026 20:30
Contributor

@meenchen meenchen left a comment

Thanks for the PR. I have two questions:

  • Is this approach only effective when models are sharded evenly on all GPUs?
  • Will it work for CPU offloading with HF accelerate?

@sungsooha
Contributor Author

Thanks for the PR. I have two questions:

  • Is this approach only effective when models are sharded evenly on all GPUs?
  • Will it work for CPU offloading with HF accelerate?
  1. It works with any sharding. Uneven sharding means some threads finish earlier, but all GPUs still work concurrently. device_map="auto" balances by memory, which is roughly even for most models.

  2. Yes. Weights on CPU are handled in the sequential fallback path; only CUDA devices get threads. If all weights are on CPU (full offload), the export falls back entirely to sequential. If some weights are on GPU and some on CPU, the GPU weights are processed in parallel while the CPU weights are processed inline.
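The dispatch behavior described above can be sketched in plain Python. The names, the dict-shaped work items, and the device strings are illustrative assumptions, not the actual ModelOpt implementation:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def quantize_all(work_items, process_one):
    """Group work by device and fan out one thread per GPU batch.

    Sketch only: `work_items` are dicts with a "device" key, and
    `process_one` fake-quantizes a single weight item.
    """
    groups = defaultdict(list)
    for item in work_items:
        groups[item["device"]].append(item)

    cpu_items = groups.pop("cpu", [])
    gpu_groups = list(groups.values())

    # Single GPU (or CPU-only): run sequentially, no thread overhead.
    if len(gpu_groups) <= 1:
        flat = [i for batch in gpu_groups for i in batch] + cpu_items
        return [process_one(i) for i in flat]

    results = []
    with ThreadPoolExecutor(max_workers=len(gpu_groups)) as ex:
        # Submit every GPU batch before touching CPU items, so all
        # devices start working immediately.
        futures = [
            ex.submit(lambda b=batch: [process_one(i) for i in b])
            for batch in gpu_groups
        ]
        # CPU shards run inline in this thread while GPUs work.
        results.extend(process_one(i) for i in cpu_items)
        for f in futures:
            results.extend(f.result())
    return results
```

With one GPU group the executor branch is never entered, which matches the "falls back to sequential on single GPU or CPU" behavior the PR describes.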

Contributor

@realAsma realAsma left a comment

torch.inference_mode() is thread-local: the parallel worker threads in _process_device_batch don't re-enter torch.inference_mode(), so the parallel path runs under different autograd settings than the sequential path. Simple one-line fix:

 def _process_device_batch(items: list[_WeightQuantWork], device: torch.device):
     """Process all weight items on a single GPU. Runs in a dedicated thread."""
-    with torch.cuda.device(device):
+    with torch.inference_mode(), torch.cuda.device(device):
         results = [_process_weight(item) for item in items]
         torch.cuda.synchronize(device)
     return results
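The thread-locality issue is easy to reproduce without torch. Python's contextvars behave the same way, in that a value set in the submitting thread is not visible inside ThreadPoolExecutor worker threads. This is a pure-Python analogy, not the actual C++ thread-local state torch.inference_mode() uses:

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a thread-local mode flag such as torch.inference_mode().
mode = contextvars.ContextVar("mode", default=False)


def read_mode():
    return mode.get()


mode.set(True)
print(read_mode())  # True in the submitting thread
with ThreadPoolExecutor(max_workers=1) as ex:
    # Worker threads start from the default context, not the parent's.
    print(ex.submit(read_mode).result())  # False
```

This is why the worker body itself must re-enter torch.inference_mode(), as in the suggested fix above.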


@realAsma
Copy link
Copy Markdown
Contributor

realAsma commented Apr 6, 2026

Apologies for the duplicate comments — glitch in my Claude review workflow. Cleaned up now.

auto-merge was automatically disabled April 7, 2026 17:50

Head branch was pushed to by a user without write access

sungsooha and others added 3 commits April 7, 2026 14:37
Refactors `export_hf_vllm_fq_checkpoint()` Step 1 (weight fake-quantization)
to process weights concurrently across GPUs using ThreadPoolExecutor.

- Collects all quantizer work items, groups by device
- One thread per GPU, each processes its weight partition
- Falls back to sequential on single GPU or CPU (no overhead)
- Adds `parallel: bool = True` parameter (default on, backwards compatible)
- Adds GPU test: parallel vs sequential produces bitwise identical output
- Adds single-GPU fallback test

Measured speedup on Qwen3-8B with barboqt_exhaustive_2b (8x H100):
- Quantize step: 293s → 49s (6.0x)
- Total export: 312s → 66s (4.7x)

Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com>
- Fold pre_quant_scale on GPU before .cpu() move (perf fix)
- Use torch.allclose instead of torch.equal in test (nit)

Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com>
torch.inference_mode() is thread-local — ThreadPoolExecutor workers
don't inherit the parent's context. Without it, parallel workers run
with autograd enabled (extra memory, different semantics than
sequential path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com>
@sungsooha sungsooha force-pushed the sungsooh/parallel-fakequant-export-pr branch from 1d56d31 to 4e16400 Compare April 7, 2026 21:37
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

79-89: ⚠️ Potential issue | 🟠 Major

Still guard _pre_quant_scale folding on _enable_pre_quant_scale.

disable_pre_quant_scale_and_resmooth() leaves _pre_quant_scale populated after it has already been folded into the weight. If that input quantizer is later disabled, this branch will fold the same scale again and export an over-scaled weight unless it also checks _enable_pre_quant_scale.

Suggested guard
             if hasattr(module, inp_attr):
                 candidate = getattr(module, inp_attr)
                 if (
+                    getattr(candidate, "_enable_pre_quant_scale", True)
+                    and hasattr(candidate, "_pre_quant_scale")
-                    hasattr(candidate, "_pre_quant_scale")
                     and candidate._pre_quant_scale is not None
                     and candidate._disabled
                 ):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 79 - 89,
When deciding to fold an input quantizer's _pre_quant_scale into the weight
(inside the branch that checks hasattr(module, inp_attr) and inspects
candidate._pre_quant_scale and candidate._disabled), also require that
candidate._enable_pre_quant_scale is True; update the condition that sets
inp_q/inp_q_key to include "and getattr(candidate, '_enable_pre_quant_scale',
False)" so that inputs for which disable_pre_quant_scale_and_resmooth() only
cleared enabling (but left _pre_quant_scale populated) are not folded again;
touch the same branch where get_unwrapped_name is called to add this guard.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 183-193: The current loop mixes submitting GPU work and processing
CPU shards inline which can cause CPU shards to run before any GPU futures are
launched; change the logic so you first iterate device_groups and submit all
CUDA tasks to the ThreadPoolExecutor (using _process_device_batch), collecting
CPU groups separately (or deferring them), then after all futures are submitted
await futures and finally process CPU shards inline with _process_weight and
extend all_results; update references: device_groups,
ThreadPoolExecutor/futures, _process_device_batch, _process_weight, and
all_results so CPU items are only processed after GPU futures are launched to
avoid sequential export and extra CPU tensor retention.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 074d1b56-02a8-4fc5-8558-559f14194702

📥 Commits

Reviewing files that changed from the base of the PR and between 1d56d31 and 4e16400.

📒 Files selected for processing (2)
  • modelopt/torch/export/plugins/vllm_fakequant_hf.py
  • tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py
✅ Files skipped from review due to trivial changes (1)
  • tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py

…patch

1. Gate _pre_quant_scale folding on _enable_pre_quant_scale to prevent
   double-fold after disable_pre_quant_scale_and_resmooth().

2. Submit GPU futures before processing CPU shard inline. Previously
   CPU weights could block GPU submission when they appear first in
   device_groups (common with offloaded models).

Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com>
@realAsma
Contributor

realAsma commented Apr 8, 2026

/ok
