feat: parallelize fakequant export across GPUs via ThreadPoolExecutor#1177
sungsooha wants to merge 4 commits into NVIDIA:main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough: Adds pre-collected per-weight fake-quantization work items and optional device-parallel execution using ThreadPoolExecutor, controlled by a new `parallel` parameter.
Sequence Diagram(s)
sequenceDiagram
participant Caller
participant Export as export_hf_vllm_fq_checkpoint
participant Collector as _collect_quant_work
participant Grouper as Device Grouper
participant Executor as ThreadPoolExecutor
participant Worker as _process_device_batch
participant State as state_dict / vllm_fq_modelopt_state.pth
Caller->>Export: call export_hf_vllm_fq_checkpoint(parallel=...)
Export->>Collector: collect per-weight work items
Collector-->>Export: work_items
Export->>Grouper: group work_items by device
Grouper-->>Export: device_groups
alt parallel && multiple CUDA devices
Export->>Executor: submit per-device batches
par per-device processing
Executor->>Worker: process device_n batch
and
Executor->>Worker: process device_m batch
end
Worker-->>Executor: per-device results
Executor-->>Export: merged results
else sequential or single-device
Export->>Worker: process batches sequentially inline
Worker-->>Export: results
end
Export->>State: apply results to state_dict and write checkpoint
State-->>Export: checkpoint persisted
Export-->>Caller: complete
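The flow in the diagram above can be sketched in plain Python. This is an illustrative analogue, not the actual ModelOpt implementation: device tags are strings, work items are dicts, and `process_item` stands in for the per-weight fake-quant step.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def process_item(item):
    # Placeholder for the per-weight fake-quantization step (illustrative only).
    return item["name"], item["value"] * 2


def export_parallel(work_items, parallel=True):
    # Group collected work items by their device tag, mirroring the diagram.
    groups = defaultdict(list)
    for item in work_items:
        groups[item["device"]].append(item)

    cuda_groups = {d: g for d, g in groups.items() if d.startswith("cuda")}
    other_items = [i for d, g in groups.items() if not d.startswith("cuda") for i in g]

    results = {}
    if parallel and len(cuda_groups) > 1:
        # One worker thread per device; each thread processes its whole batch.
        with ThreadPoolExecutor(max_workers=len(cuda_groups)) as ex:
            futures = [
                ex.submit(lambda batch: [process_item(i) for i in batch], g)
                for g in cuda_groups.values()
            ]
            for fut in futures:
                results.update(dict(fut.result()))
    else:
        # Sequential fallback: single device or parallel=False.
        for g in cuda_groups.values():
            results.update(dict(process_item(i) for i in g))
    # Non-CUDA (e.g. CPU-offloaded) items are processed inline.
    results.update(dict(process_item(i) for i in other_items))
    return results
```

The key property the PR's test checks holds here too: parallel and sequential paths produce identical results, since each weight is processed independently on its own device.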
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error, 1 warning)
✅ Passed checks (2 passed)
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1177 +/- ##
==========================================
- Coverage 74.77% 74.15% -0.62%
==========================================
Files 351 352 +1
Lines 40289 41841 +1552
==========================================
+ Hits 30125 31027 +902
- Misses 10164 10814 +650
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)
267-275: ⚠️ Potential issue | 🟠 Major: Restore quantizer state from a `finally` block.
If any of the state-serialization or save calls raise here, the function exits with weight quantizers still disabled and `_rotate` possibly overwritten. That leaves the caller's model mutated and makes retrying export on the same instance unreliable.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 267 - 275, Wrap the serialization and model.save_pretrained calls (the block that calls torch.save(modelopt_state, ...) and model.save_pretrained(export_dir, state_dict=clean_sd, ...)) in a try/finally and move the wqs_to_restore loop into the finally so quantizers are always restored even on exception; ensure wqs_to_restore holds each quantizer and its original state (e.g., (wq, was_enabled, orig_rotate) or at least (wq, orig_rotate) plus whether it was enabled) and in the finally set wq._rotate = orig_rotate and re-enable or disable the quantizer to the original enabled state (use wq.enable()/wq.disable() or equivalent) so model state is not left mutated after failures.
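The restore-in-`finally` pattern this comment asks for can be sketched independently of ModelOpt. In this sketch, quantizers are plain dicts rather than real `TensorQuantizer` objects, and `save_fn` stands in for the serialization and `save_pretrained` calls; all names are illustrative.

```python
def export_with_restore(quantizers, save_fn):
    """Disable quantizers for export, then restore them in a finally block.

    Sketch only: quantizers are modeled as dicts, not ModelOpt TensorQuantizers.
    """
    to_restore = []
    for wq in quantizers:
        to_restore.append((wq, wq["enabled"], wq["rotate"]))
        wq["enabled"] = False   # disable weight quantizer for export
        wq["rotate"] = None     # temporarily clear rotation config
    try:
        return save_fn()        # serialization/save may raise
    finally:
        # Runs on success and on exception, so the caller's model is never
        # left mutated after a failed export and retries stay reliable.
        for wq, was_enabled, orig_rotate in to_restore:
            wq["enabled"] = was_enabled
            wq["rotate"] = orig_rotate
```

Even when `save_fn` raises, the original enabled state and rotation config are back in place by the time the exception reaches the caller.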
🧹 Nitpick comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)
115-120: Re-enter `torch.inference_mode()` inside the worker thread.
`torch.inference_mode()` is thread-local, so the context opened in `export_hf_vllm_fq_checkpoint()` does not apply to ThreadPoolExecutor workers. The parallel path therefore runs quantizer forwards under different autograd/inference settings than the sequential path.
♻️ Proposed fix
 def _process_device_batch(items: list[_WeightQuantWork], device: torch.device):
     """Process all weight items on a single GPU. Runs in a dedicated thread."""
-    with torch.cuda.device(device):
+    with torch.inference_mode(), torch.cuda.device(device):
         results = [_process_weight(item) for item in items]
         torch.cuda.synchronize(device)
         return results
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 115 - 120, The worker thread function _process_device_batch currently runs _process_weight without the torch.inference_mode() context (which is thread-local), causing different autograd/inference behavior than the sequential path; fix it by entering torch.inference_mode() inside _process_device_batch (e.g., wrap the list comprehension/results creation in a with torch.inference_mode(): block so each call to _process_weight runs under the same inference mode as export_hf_vllm_fq_checkpoint), then synchronize and return results as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 79-89: The branch that folds an input's _pre_quant_scale should
also check the quantizer's active flag so we don't fold a scale that was already
applied and then had _enable_pre_quant_scale set False by
disable_pre_quant_scale_and_resmooth(); update the condition in the block that
sets inp_q/inp_q_key (the candidate check around _pre_quant_scale and _disabled)
to also require candidate._enable_pre_quant_scale is True (or invert the logic
to skip folding when _enable_pre_quant_scale is False), referencing candidate,
_pre_quant_scale, _enable_pre_quant_scale, inp_q, inp_q_key and
get_unwrapped_name so the code only folds scales for quantizers that are still
enabled.
In `@tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py`:
- Around line 91-106: Test currently only compares modelopt_state_weights
(seq_qsd vs par_qsd) so changes to
modelopt_state_dict[*]["metadata"]["quantizer_state"] would be missed; update
the test (where seq_state and par_state are produced by
export_hf_vllm_fq_checkpoint) to compare the full exported ModelOpt payload
instead of only modelopt_state_weights: iterate over top-level keys of seq_state
and par_state (including modelopt_state_dict and its metadata), recursively
compare dicts and tensors (use torch.equal for tensors, equality for scalars)
and assert identical keys and values for
modelopt_state_dict[*]["metadata"]["quantizer_state"] as well as
modelopt_state_weights so any regression in metadata is caught.
- Around line 109-129: The test function test_single_gpu_fallback
unconditionally calls model.cuda() and should be skipped when CUDA isn't
available; wrap or guard the test with the standard GPU check (e.g., a
pytest.skipif or pytest.mark.skipif using torch.cuda.is_available()) so the body
(including the .cuda() call and subsequent export_hf_vllm_fq_checkpoint call)
only runs when a CUDA device exists; update the test decorator or add an early
skip at the start of test_single_gpu_fallback to prevent CPU-only runs from
failing.
- Around line 88-89: Add an inline comment next to the two torch.load calls
where weights_only=False (the assignments to seq_state and par_state loading
"vllm_fq_modelopt_state.pth" from seq_dir and par_dir) that documents this
exception: note that these checkpoint files are generated earlier in the test
(trusted/test-generated on lines where the test writes the files) so using
weights_only=False is safe and intentional per guidelines; reference seq_dir,
par_dir and the filename in the comment for clarity.
---
Outside diff comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 267-275: Wrap the serialization and model.save_pretrained calls
(the block that calls torch.save(modelopt_state, ...) and
model.save_pretrained(export_dir, state_dict=clean_sd, ...)) in a try/finally
and move the wqs_to_restore loop into the finally so quantizers are always
restored even on exception; ensure wqs_to_restore holds each quantizer and its
original state (e.g., (wq, was_enabled, orig_rotate) or at least (wq,
orig_rotate) plus whether it was enabled) and in the finally set wq._rotate =
orig_rotate and re-enable or disable the quantizer to the original enabled state
(use wq.enable()/wq.disable() or equivalent) so model state is not left mutated
after failures.
---
Nitpick comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 115-120: The worker thread function _process_device_batch
currently runs _process_weight without the torch.inference_mode() context (which
is thread-local), causing different autograd/inference behavior than the
sequential path; fix it by entering torch.inference_mode() inside
_process_device_batch (e.g., wrap the list comprehension/results creation in a
with torch.inference_mode(): block so each call to _process_weight runs under
the same inference mode as export_hf_vllm_fq_checkpoint), then synchronize and
return results as before.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 551f89ec-1823-498c-a631-9a7529b6cb77
📒 Files selected for processing (2)
modelopt/torch/export/plugins/vllm_fakequant_hf.py
tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py
seq_state = torch.load(seq_dir / "vllm_fq_modelopt_state.pth", weights_only=False)
par_state = torch.load(par_dir / "vllm_fq_modelopt_state.pth", weights_only=False)
🧩 Analysis chain
🏁 Script executed:
cat -n tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py | sed -n '50,120p'
Repository: NVIDIA/Model-Optimizer
Length of output: 3520
🏁 Script executed:
head -30 tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py
Repository: NVIDIA/Model-Optimizer
Length of output: 1371
Add inline comment documenting why weights_only=False is safe.
The checkpoint files are test-generated on lines 66 and 73, so they are internally trusted. However, per coding guidelines, an inline comment must be added to document this whenever weights_only=False is used.
🔐 Fix
# Compare modelopt state.
+ # Safe: both files are written by this test above and are not user-supplied.
seq_state = torch.load(seq_dir / "vllm_fq_modelopt_state.pth", weights_only=False)
par_state = torch.load(par_dir / "vllm_fq_modelopt_state.pth", weights_only=False)
As per coding guidelines: "Do not use torch.load(..., weights_only=False) unless a documented exception is provided. If weights_only=False is required, provide an inline comment explaining why and confirming the file is internally-generated or trusted."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py` around
lines 88 - 89, Add an inline comment next to the two torch.load calls where
weights_only=False (the assignments to seq_state and par_state loading
"vllm_fq_modelopt_state.pth" from seq_dir and par_dir) that documents this
exception: note that these checkpoint files are generated earlier in the test
(trusted/test-generated on lines where the test writes the files) so using
weights_only=False is safe and intentional per guidelines; reference seq_dir,
par_dir and the filename in the comment for clarity.
d28cda9 to e944e99
♻️ Duplicate comments (3)
tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py (2)
87-90: ⚠️ Potential issue | 🔴 Critical: State why these `weights_only=False` loads are trusted.
The current comment explains why full unpickling is needed, but it still does not say that both `vllm_fq_modelopt_state.pth` files were written earlier in this test and are trusted. That trust-boundary justification needs to live inline next to these `torch.load()` calls.
🔐 Suggested comment
-# Compare full modelopt state payload (weights_only=False: modelopt_state contains
-# Python objects — dicts, strings, quantizer configs — that require unpickling).
+# Safe: both vllm_fq_modelopt_state.pth files were written by this test above and are
+# trusted. weights_only=False is required because modelopt_state contains Python
+# objects such as dicts, strings, and quantizer configs.
 seq_state = torch.load(seq_dir / "vllm_fq_modelopt_state.pth", weights_only=False)
 par_state = torch.load(par_dir / "vllm_fq_modelopt_state.pth", weights_only=False)
As per coding guidelines: "Do not use torch.load(..., weights_only=False) unless a documented exception is provided. Always use weights_only=True to prevent arbitrary code execution from malicious checkpoints. If weights_only=False is required, provide an inline comment explaining why and confirming the file is internally-generated or trusted."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py` around lines 87 - 90, The torch.load calls for vllm_fq_modelopt_state.pth use weights_only=False which requires an explicit trust justification; edit the lines that call torch.load (for seq_dir / "vllm_fq_modelopt_state.pth" and par_dir / "vllm_fq_modelopt_state.pth") to add an inline comment stating these checkpoint files were created earlier in this test (i.e., are internally-generated and trusted) and therefore unpickling is safe, and reference that full modelopt state (not just tensors) is required for comparison.
92-98: ⚠️ Potential issue | 🟡 Minor: Deep-compare `modelopt_state_dict`, not just its shape.
This loop only checks list length and `mode_str`. A regression in `m_state["metadata"]["quantizer_state"]` would still pass as long as the mode ordering stays the same. Please compare `seq_ms` and `par_ms` recursively as well.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py` around lines 92 - 98, The test currently only checks modelopt_state_dict length and mode strings (seq_msd, par_msd and seq_mode/par_mode) but not the actual metadata content; update the loop over (seq_mode, seq_ms), (par_mode, par_ms) to perform a deep recursive equality check of seq_ms vs par_ms (the quantizer metadata such as m_state["metadata"]["quantizer_state"]) and fail with a clear message on mismatch; use a structural deep-compare (e.g., equality for dicts/lists or a recursive comparator) so any nested regression in modelopt_state_dict is detected.
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)
79-89: ⚠️ Potential issue | 🟠 Major: Gate `_pre_quant_scale` folding on `_enable_pre_quant_scale`.
When `disable_pre_quant_scale_and_resmooth()` has already consumed the scale, `_pre_quant_scale` can still be populated while `_enable_pre_quant_scale` is `False`. This branch will fold that scale again once the input quantizer is disabled, exporting an incorrectly scaled weight.
🛠️ Suggested guard
 if hasattr(module, inp_attr):
     candidate = getattr(module, inp_attr)
     if (
+        getattr(candidate, "_enable_pre_quant_scale", True)
+        and hasattr(candidate, "_pre_quant_scale")
-        hasattr(candidate, "_pre_quant_scale")
         and candidate._pre_quant_scale is not None
         and candidate._disabled
     ):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 79 - 89, The branch that folds a module's _pre_quant_scale should also check that the pre-quant folding is currently enabled; update the condition around candidate/_pre_quant_scale in the block handling module and inp_attr to also require candidate._enable_pre_quant_scale is True (in addition to _pre_quant_scale not None and _disabled) so that scales already consumed by disable_pre_quant_scale_and_resmooth() are not folded again; ensure you keep the same logic that sets inp_q and inp_q_key (using get_unwrapped_name with module_name and model) only when the enable flag is present and true.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 79-89: The branch that folds a module's _pre_quant_scale should
also check that the pre-quant folding is currently enabled; update the condition
around candidate/_pre_quant_scale in the block handling module and inp_attr to
also require candidate._enable_pre_quant_scale is True (in addition to
_pre_quant_scale not None and _disabled) so that scales already consumed by
disable_pre_quant_scale_and_resmooth() are not folded again; ensure you keep the
same logic that sets inp_q and inp_q_key (using get_unwrapped_name with
module_name and model) only when the enable flag is present and true.
In `@tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py`:
- Around line 87-90: The torch.load calls for vllm_fq_modelopt_state.pth use
weights_only=False which requires an explicit trust justification; edit the
lines that call torch.load (for seq_dir / "vllm_fq_modelopt_state.pth" and
par_dir / "vllm_fq_modelopt_state.pth") to add an inline comment stating these
checkpoint files were created earlier in this test (i.e., are
internally-generated and trusted) and therefore unpickling is safe, and
reference that full modelopt state (not just tensors) is required for
comparison.
- Around line 92-98: The test currently only checks modelopt_state_dict length
and mode strings (seq_msd, par_msd and seq_mode/par_mode) but not the actual
metadata content; update the loop over (seq_mode, seq_ms), (par_mode, par_ms) to
perform a deep recursive equality check of seq_ms vs par_ms (the quantizer
metadata such as m_state["metadata"]["quantizer_state"]) and fail with a clear
message on mismatch; use a structural deep-compare (e.g., equality for
dicts/lists or a recursive comparator) so any nested regression in
modelopt_state_dict is detected.
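The structural deep-compare the prompt describes can be written as a short recursive helper. This is a sketch, not the test's actual code; in the real test one would pass `tensor_equal=torch.equal` so tensor leaves are compared element-wise.

```python
def deep_equal(a, b, tensor_equal=None):
    """Recursively compare nested dicts/lists/tuples.

    tensor_equal, if given, handles tensor-like leaves (e.g. torch.equal);
    everything else falls back to ==.
    """
    if tensor_equal is not None and hasattr(a, "shape") and hasattr(b, "shape"):
        return tensor_equal(a, b)
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(
            deep_equal(a[k], b[k], tensor_equal) for k in a
        )
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        return len(a) == len(b) and all(
            deep_equal(x, y, tensor_equal) for x, y in zip(a, b)
        )
    return a == b
```

Unlike a length-and-mode check, this catches a regression buried anywhere inside `metadata["quantizer_state"]`.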
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 00bfd862-591a-4904-b57b-72dc8dbbc343
📒 Files selected for processing (2)
modelopt/torch/export/plugins/vllm_fakequant_hf.pytests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py
tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py
♻️ Duplicate comments (2)
tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py (1)
87-90: ⚠️ Potential issue | 🟡 Minor: Add trust justification to inline comment for `weights_only=False`.
The comment explains why `weights_only=False` is needed (Python objects require unpickling), but per coding guidelines, it should also confirm the files are internally-generated/trusted.
🔐 Suggested fix
-# Compare full modelopt state payload (weights_only=False: modelopt_state contains
-# Python objects — dicts, strings, quantizer configs — that require unpickling).
+# Compare full modelopt state payload.
+# Safe: files are test-generated above (lines 66, 73) and not user-supplied.
+# weights_only=False required: modelopt_state contains Python objects (dicts, configs).
 seq_state = torch.load(seq_dir / "vllm_fq_modelopt_state.pth", weights_only=False)
 par_state = torch.load(par_dir / "vllm_fq_modelopt_state.pth", weights_only=False)
As per coding guidelines: "If weights_only=False is required, provide an inline comment explaining why and confirming the file is internally-generated or trusted."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py` around lines 87 - 90, The inline comment for torch.load(..., weights_only=False) in the seq_state/par_state loads lacks a trust justification; update the comment near torch.load calls for seq_dir/"vllm_fq_modelopt_state.pth" and par_dir/"vllm_fq_modelopt_state.pth" to state that weights_only=False is required because the modelopt state contains Python objects that must be unpickled and explicitly confirm these files are produced by the test code (internally-generated and trusted), e.g., mention they were written earlier in the test so unpickling is safe.
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)
79-89: ⚠️ Potential issue | 🟠 Major: Gate `_pre_quant_scale` folding on the actual enable state.
The condition should also check `candidate._enable_pre_quant_scale` to avoid double-folding when `disable_pre_quant_scale_and_resmooth()` has already resmoothed the weight but left `_pre_quant_scale` populated with `_enable_pre_quant_scale = False`.
🛠️ Suggested fix
 if hasattr(module, inp_attr):
     candidate = getattr(module, inp_attr)
     if (
+        getattr(candidate, "_enable_pre_quant_scale", True)
+        and hasattr(candidate, "_pre_quant_scale")
-        hasattr(candidate, "_pre_quant_scale")
         and candidate._pre_quant_scale is not None
         and candidate._disabled
     ):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 79 - 89, The current folding check uses candidate._pre_quant_scale and candidate._disabled but misses candidate._enable_pre_quant_scale, causing double-folding after disable_pre_quant_scale_and_resmooth() leaves _pre_quant_scale present but _enable_pre_quant_scale == False; update the conditional in the block handling (module, inp_attr) so it also requires candidate._enable_pre_quant_scale to be truthy before assigning inp_q/inp_q_key (referencing candidate, _pre_quant_scale, _enable_pre_quant_scale, disable_pre_quant_scale_and_resmooth(), get_unwrapped_name, module_name, model).
🧹 Nitpick comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)
272-275: Consider simplifying the `orig_rotate` restoration guard.
Per the `TensorQuantizer` implementation, `_rotate` is always initialized (to `RotateConfig`, `dict`, or `bool`) and should never be `None`. The `if orig_rotate is not None` guard is defensive but may mask bugs if `_rotate` is unexpectedly unset. If this is intentional defensive coding, adding a brief comment would clarify the intent.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 272 - 275, The loop restoring quantizers currently conditionally resets wq._rotate only if orig_rotate is not None; because TensorQuantizer always sets _rotate, remove the defensive guard and always restore it (i.e., in the wqs_to_restore loop call wq.enable() and set wq._rotate = orig_rotate unconditionally) to surface any unexpected unset state, or if you prefer to keep the guard, add a short comment referencing TensorQuantizer to explain the defensive intent; update the code around the wqs_to_restore iteration and the references to orig_rotate and wq._rotate accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 79-89: The current folding check uses candidate._pre_quant_scale
and candidate._disabled but misses candidate._enable_pre_quant_scale, causing
double-folding after disable_pre_quant_scale_and_resmooth() leaves
_pre_quant_scale present but _enable_pre_quant_scale == False; update the
conditional in the block handling (module, inp_attr) so it also requires
candidate._enable_pre_quant_scale to be truthy before assigning inp_q/inp_q_key
(referencing candidate, _pre_quant_scale, _enable_pre_quant_scale,
disable_pre_quant_scale_and_resmooth(), get_unwrapped_name, module_name, model).
In `@tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py`:
- Around line 87-90: The inline comment for torch.load(..., weights_only=False)
in the seq_state/par_state loads lacks a trust justification; update the comment
near torch.load calls for seq_dir/"vllm_fq_modelopt_state.pth" and
par_dir/"vllm_fq_modelopt_state.pth" to state that weights_only=False is
required because the modelopt state contains Python objects that must be
unpickled and explicitly confirm these files are produced by the test code
(internally-generated and trusted), e.g., mention they were written earlier in
the test so unpickling is safe.
---
Nitpick comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 272-275: The loop restoring quantizers currently conditionally
resets wq._rotate only if orig_rotate is not None; because TensorQuantizer
always sets _rotate, remove the defensive guard and always restore it (i.e., in
the wqs_to_restore loop call wq.enable() and set wq._rotate = orig_rotate
unconditionally) to surface any unexpected unset state, or if you prefer to keep
the guard, add a short comment referencing TensorQuantizer to explain the
defensive intent; update the code around the wqs_to_restore iteration and the
references to orig_rotate and wq._rotate accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 02ae9151-b318-4488-b80a-d9792440d9f0
📒 Files selected for processing (2)
modelopt/torch/export/plugins/vllm_fakequant_hf.py
tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py
meenchen left a comment:
Thanks for the PR. I have two questions:
- Is this approach only effective when models are sharded evenly on all GPUs?
- Will it work for CPU offloading with HF accelerate?
|
realAsma left a comment:
torch.inference_mode() is thread-local: the parallel worker threads in _process_device_batch don't re-enter torch.inference_mode(), so the parallel path runs under different autograd settings than the sequential path. Simple one-line fix:
def _process_device_batch(items: list[_WeightQuantWork], device: torch.device):
"""Process all weight items on a single GPU. Runs in a dedicated thread."""
- with torch.cuda.device(device):
+ with torch.inference_mode(), torch.cuda.device(device):
results = [_process_weight(item) for item in items]
torch.cuda.synchronize(device)
return results
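The thread-locality the reviewer describes can be demonstrated with a stdlib-only analogue. The toy context manager below stores its flag in `threading.local`, mimicking the per-thread nature of `torch.inference_mode`; it is an analogy only, not the real torch API.

```python
import threading

_tls = threading.local()


class inference_mode_like:
    """Toy context manager whose flag lives in thread-local storage,
    like torch.inference_mode's state (illustrative analogue only)."""

    def __enter__(self):
        _tls.enabled = True
        return self

    def __exit__(self, *exc):
        _tls.enabled = False


def is_enabled():
    return getattr(_tls, "enabled", False)


def flag_seen_by_worker():
    """Spawn a worker thread and report whether it sees the flag."""
    seen = []
    t = threading.Thread(target=lambda: seen.append(is_enabled()))
    t.start()
    t.join()
    return seen[0]
```

Entering the context in the main thread does not change what a worker thread observes, which is exactly why each `_process_device_batch` worker must re-enter `torch.inference_mode()` itself.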
Apologies for the duplicate comments — glitch in my Claude review workflow. Cleaned up now.
Head branch was pushed to by a user without write access
Refactors `export_hf_vllm_fq_checkpoint()` Step 1 (weight fake-quantization) to process weights concurrently across GPUs using ThreadPoolExecutor.
- Collects all quantizer work items, groups by device
- One thread per GPU, each processes its weight partition
- Falls back to sequential on single GPU or CPU (no overhead)
- Adds `parallel: bool = True` parameter (default on, backwards compatible)
- Adds GPU test: parallel vs sequential produces bitwise identical output
- Adds single-GPU fallback test
Measured speedup on Qwen3-8B with barboqt_exhaustive_2b (8x H100):
- Quantize step: 293s → 49s (6.0x)
- Total export: 312s → 66s (4.7x)
Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com>
- Fold pre_quant_scale on GPU before .cpu() move (perf fix)
- Use torch.allclose instead of torch.equal in test (nit)
Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com>
torch.inference_mode() is thread-local — ThreadPoolExecutor workers don't inherit the parent's context. Without it, parallel workers run with autograd enabled (extra memory, different semantics than sequential path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com>
1d56d31 to 4e16400
Actionable comments posted: 1
♻️ Duplicate comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)
79-89: ⚠️ Potential issue | 🟠 Major: Still guard `_pre_quant_scale` folding on `_enable_pre_quant_scale`.
`disable_pre_quant_scale_and_resmooth()` leaves `_pre_quant_scale` populated after it has already been folded into the weight. If that input quantizer is later disabled, this branch will fold the same scale again and export an over-scaled weight unless it also checks `_enable_pre_quant_scale`.
Suggested guard
 if hasattr(module, inp_attr):
     candidate = getattr(module, inp_attr)
     if (
+        getattr(candidate, "_enable_pre_quant_scale", True)
+        and hasattr(candidate, "_pre_quant_scale")
-        hasattr(candidate, "_pre_quant_scale")
         and candidate._pre_quant_scale is not None
         and candidate._disabled
     ):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 79 - 89, When deciding to fold an input quantizer's _pre_quant_scale into the weight (inside the branch that checks hasattr(module, inp_attr) and inspects candidate._pre_quant_scale and candidate._disabled), also require that candidate._enable_pre_quant_scale is True; update the condition that sets inp_q/inp_q_key to include "and getattr(candidate, '_enable_pre_quant_scale', False)" so that inputs for which disable_pre_quant_scale_and_resmooth() only cleared enabling (but left _pre_quant_scale populated) are not folded again; touch the same branch where get_unwrapped_name is called to add this guard.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 183-193: The current loop mixes submitting GPU work and processing
CPU shards inline which can cause CPU shards to run before any GPU futures are
launched; change the logic so you first iterate device_groups and submit all
CUDA tasks to the ThreadPoolExecutor (using _process_device_batch), collecting
CPU groups separately (or deferring them), then after all futures are submitted
await futures and finally process CPU shards inline with _process_weight and
extend all_results; update references: device_groups,
ThreadPoolExecutor/futures, _process_device_batch, _process_weight, and
all_results so CPU items are only processed after GPU futures are launched to
avoid sequential export and extra CPU tensor retention.
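The submit-GPU-futures-first ordering the comment asks for can be sketched as follows. The function bodies and the `(device -> items)` mapping shape are hypothetical stand-ins for the review's `device_groups`, `_process_device_batch`, and `_process_weight` references:

```python
from concurrent.futures import ThreadPoolExecutor


def run_device_groups(device_groups, process_device_batch, process_weight):
    """Submit every CUDA batch before touching CPU work, so no GPU sits idle.

    device_groups: dict mapping a device string (e.g. "cuda:0", "cpu")
    to a list of work items.
    """
    all_results = []
    cuda_groups = {d: items for d, items in device_groups.items() if d.startswith("cuda")}
    cpu_items = [
        it for d, items in device_groups.items() if not d.startswith("cuda") for it in items
    ]

    with ThreadPoolExecutor(max_workers=max(len(cuda_groups), 1)) as pool:
        # First launch all GPU batches as futures...
        futures = [pool.submit(process_device_batch, dev, items) for dev, items in cuda_groups.items()]
        # ...then process CPU shards inline on the main thread while GPUs run.
        for item in cpu_items:
            all_results.append(process_weight(item))
        for fut in futures:
            all_results.extend(fut.result())
    return all_results
```

Because the CPU loop only starts after every `pool.submit` call has returned, a CPU-offloaded shard appearing first in `device_groups` can no longer delay GPU work.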
---
Duplicate comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 79-89: When deciding to fold an input quantizer's _pre_quant_scale
into the weight (inside the branch that checks hasattr(module, inp_attr) and
inspects candidate._pre_quant_scale and candidate._disabled), also require that
candidate._enable_pre_quant_scale is True; update the condition that sets
inp_q/inp_q_key to include "and getattr(candidate, '_enable_pre_quant_scale',
False)" so that inputs for which disable_pre_quant_scale_and_resmooth() only
cleared enabling (but left _pre_quant_scale populated) are not folded again;
touch the same branch where get_unwrapped_name is called to add this guard.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 074d1b56-02a8-4fc5-8558-559f14194702
📒 Files selected for processing (2)
- modelopt/torch/export/plugins/vllm_fakequant_hf.py
- tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py
✅ Files skipped from review due to trivial changes (1)
- tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py
…patch

1. Gate `_pre_quant_scale` folding on `_enable_pre_quant_scale` to prevent double-fold after `disable_pre_quant_scale_and_resmooth()`.
2. Submit GPU futures before processing CPU shards inline. Previously CPU weights could block GPU submission when they appear first in `device_groups` (common with offloaded models).

Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com>
/ok
What does this PR do?
Type of change: New feature
Refactors `export_hf_vllm_fq_checkpoint()` Step 1 (weight fake-quantization) to process weights concurrently across GPUs using `ThreadPoolExecutor`.

Motivation: When exporting large models with compute-heavy quantization backends (e.g., barboqt exhaustive search), the export loop processes one weight tensor at a time on a single GPU while other GPUs sit idle. On clusters with idle-GPU reapers (e.g., cw-dfw), this causes jobs to be killed after 30 minutes of idle GPU time. Parallel export keeps all GPUs active during the quantize phase.
Changes:

- Adds a `parallel: bool = True` parameter (default on, backwards compatible)
- `state_dict[key] = value` writes are atomic (single-key dict assignment)

Measured speedup on Qwen3-8B with `barboqt_exhaustive_2b` (8× H100 DGX):

Verified bitwise identical output (SHA256 checksums match) across FP8, NVFP4, NVFP4+rotation, psx_luts, psx_formats, luts/lo-bcq, and barboqt backends.
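The per-device grouping step that precedes the parallel quantize phase could look like the sketch below. The `(name, tensor)` item shape and function name are assumptions for illustration, not the PR's actual signatures:

```python
from collections import defaultdict


def group_by_device(work_items):
    """Bucket (name, tensor) work items by the device each tensor lives on,
    so every GPU receives one batch and CPU-offloaded shards stay together."""
    groups = defaultdict(list)
    for name, tensor in work_items:
        groups[str(tensor.device)].append((name, tensor))
    return dict(groups)
```

With `device_map="auto"` placement, this naturally yields one batch per GPU; with everything on a single device it yields one group, which is the sequential-fallback case.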
Usage
No changes needed in `hf_ptq.py` or downstream scripts: `parallel=True` is the default and falls back gracefully on a single GPU.

Testing

New test file `tests/gpu/torch/export/test_vllm_fakequant_hf_parallel_export.py`:

- `test_parallel_vs_sequential_identical`: requires ≥2 GPUs; verifies bitwise identical HF weights and quantizer state between parallel and sequential export
- `test_single_gpu_fallback`: verifies `parallel=True` on a single GPU works without error

Existing `test_vllm_fakequant_hf_export.py` passes unchanged (uses the `parallel=True` default, falls back to sequential on single GPU).

Before your PR is "Ready for review"
- Backwards compatible (`parallel=True` default; single GPU falls back to sequential; no API break)
- CONTRIBUTING.md: N/A (no copied code, no new deps)

Additional Information
- `ThreadPoolExecutor` is safe because: (1) `state_dict` is a detached copy, (2) each `TensorQuantizer` has per-instance state, (3) `torch.inference_mode()` context, (4) CUDA kernels release the GIL, (5) the `same_device_as()` context manager handles per-thread device switching
- With `device_map="auto"`, weights are naturally distributed across GPUs → parallel export uses all GPUs with no configuration needed
- `device_map="sequential"` on small models may place everything on GPU 0 → parallel export gracefully falls back to sequential
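The claim that per-key `state_dict` writes need no lock can be illustrated with a pure-Python sketch; `quantize` here is a stand-in for the real per-weight work, which stores tensors rather than strings:

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_state_dict_writes(keys, quantize):
    """Each worker thread assigns its own distinct key. A single-key dict
    assignment is atomic under CPython's GIL, so concurrent writes are safe
    as long as no two workers ever share a key."""
    state_dict = {}

    def work(key):
        state_dict[key] = quantize(key)

    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(work, keys))  # drain the iterator so all writes complete
    return state_dict
```

This is exactly the access pattern of the parallel export: workers partition the weight names up front, so every key is written by exactly one thread.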
New Features
Bug Fixes
Tests