Added support for MoE for vllm >= 0.14.0rc1 #1162
Conversation
📝 Walkthrough

Resolve and cache the runtime vLLM fused-MoE kernel entrypoint name, and rebind quantized expert weights as the kernel's `B` operand.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant QuantMoE as _QuantFusedMoEBase
    participant Quantizer
    participant vLLM as vllm_fused_moe_package

    Caller->>QuantMoE: forward(...)
    Note right of QuantMoE: ensure entrypoint resolved and cached\n(self.invoke_fused_moe_kernel_func)
    QuantMoE->>QuantMoE: enter replace_function -> swap entrypoint to invoke_fused_moe_quantized
    alt weight quantizer enabled
        QuantMoE->>Quantizer: quantize(expert_weight)
        Quantizer-->>QuantMoE: quantized_weight
        QuantMoE->>vLLM: invoke_fused_moe_kernel(..., B=quantized_weight)
    else quantizer disabled
        QuantMoE->>vLLM: invoke_fused_moe_kernel(..., B=original_weight)
    end
    vLLM-->>QuantMoE: kernel result
    QuantMoE->>QuantMoE: exit replace_function -> restore original entrypoint
    QuantMoE-->>Caller: return result
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🧹 Nitpick comments (1)
modelopt/torch/quantization/plugins/vllm.py (1)
387-388: Please add regression tests for both symbol paths and patch restore behavior.

Given the compatibility branch and runtime patching, add tests that cover: (1) `invoke_fused_moe_kernel`, (2) `invoke_fused_moe_triton_kernel`, and (3) restoration on exceptions during `super().forward(...)`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/vllm.py` around lines 387 - 388, Add regression tests that exercise both symbol paths and verify the runtime patching in _patch_moe_kernel is applied and always restored: write tests that (1) trigger the branch where invoke_fused_moe_kernel is used, (2) trigger the branch where invoke_fused_moe_triton_kernel is used, and (3) simulate an exception raised during super().forward(...) to assert the original symbols are restored after the exception. Locate and call the class/method that uses _patch_moe_kernel and forward to run these cases, patch or monkeypatch the target symbols to observable fakes, and assert pre-/post-conditions on the original functions to confirm restore behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/quantization/plugins/vllm.py`:
- Around line 369-383: The current _patch_moe_kernel contextmanager mutates
vllm_fused_moe_package globals unsafely; make it concurrency-safe by serializing
patch/unpatch with a module-level threading.RLock and per-attribute reference
counting (or a reentrancy counter) so nested/concurrent entries don't clobber
originals: on entry, acquire the lock, for each attr
("invoke_fused_moe_kernel","invoke_fused_moe_triton_kernel") save the original
into a local map only if not already saved and replace the attr with
self.invoke_fused_moe_quantized while incrementing a refcount; yield; in
finally, decrement the refcount and only when it reaches zero restore the
original to vllm_fused_moe_package[attr] and remove the saved original, then
release the lock—use the symbols _patch_moe_kernel, vllm_fused_moe_package,
invoke_fused_moe_kernel, invoke_fused_moe_triton_kernel,
_invoke_fused_moe_kernel and invoke_fused_moe_quantized to locate and implement
this change.
- Around line 349-351: The kernel is currently called with the original weight
because B was bound before swapping, and restoration isn't in a finally block so
exceptions leave the quantized tensor in place; fix by assigning orig =
self.w13_weight, replacing self.w13_weight with
self.w13_weight_quantizer(self.w13_weight), then invoke
vllm_fused_moe_package._invoke_fused_moe_kernel using the swapped
self.w13_weight (not the previously bound B), and always restore self.w13_weight
= orig in a finally block; apply the identical pattern to the other symmetric
block (the one that swaps/restores the other weight at lines 359-361).
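The lock-plus-refcount patching scheme described in the first inline comment above can be sketched as follows. This is a minimal illustration under assumed names; `patch_attrs` and the module handle are placeholders, not ModelOpt or vLLM APIs:

```python
import threading
from contextlib import contextmanager

_patch_lock = threading.RLock()
_saved = {}      # attr name -> original callable
_refcounts = {}  # attr name -> number of active patches

@contextmanager
def patch_attrs(module, replacement, attrs):
    """Swap module-level attributes for `replacement`, safely for nested or
    concurrent use: only the first entry saves the original, and only the
    last exit restores it."""
    with _patch_lock:
        for attr in attrs:
            if hasattr(module, attr):
                if attr not in _saved:
                    _saved[attr] = getattr(module, attr)
                    setattr(module, attr, replacement)
                _refcounts[attr] = _refcounts.get(attr, 0) + 1
    try:
        yield
    finally:
        with _patch_lock:
            for attr in attrs:
                if attr in _refcounts:
                    _refcounts[attr] -= 1
                    if _refcounts[attr] == 0:
                        setattr(module, attr, _saved.pop(attr))
                        del _refcounts[attr]
```

Nested entries only bump the refcount, and the `finally` clause guarantees the original symbol comes back even when the body raises.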
---
Nitpick comments:
In `@modelopt/torch/quantization/plugins/vllm.py`:
- Around line 387-388: Add regression tests that exercise both symbol paths and
verify the runtime patching in _patch_moe_kernel is applied and always restored:
write tests that (1) trigger the branch where invoke_fused_moe_kernel is used,
(2) trigger the branch where invoke_fused_moe_triton_kernel is used, and (3)
simulate an exception raised during super().forward(...) to assert the original
symbols are restored after the exception. Locate and call the class/method that
uses _patch_moe_kernel and forward to run these cases, patch or monkeypatch the
target symbols to observable fakes, and assert pre-/post-conditions on the
original functions to confirm restore behavior.
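A regression test along the lines requested above could look like this sketch. It uses toy stand-ins rather than the real vLLM module; `fake_pkg` and `patch_kernel` are hypothetical substitutes for `vllm_fused_moe_package` and `_patch_moe_kernel`:

```python
import types
from contextlib import contextmanager

# Toy stand-in for the vLLM fused-MoE module (hypothetical structure).
fake_pkg = types.SimpleNamespace(invoke_fused_moe_kernel=lambda *a: "original")

@contextmanager
def patch_kernel(replacement):
    # Stand-in for _patch_moe_kernel: swap the entrypoint, restore in finally.
    original = fake_pkg.invoke_fused_moe_kernel
    fake_pkg.invoke_fused_moe_kernel = replacement
    try:
        yield
    finally:
        fake_pkg.invoke_fused_moe_kernel = original

def test_patch_applied_and_restored():
    original = fake_pkg.invoke_fused_moe_kernel
    with patch_kernel(lambda *a: "quantized"):
        assert fake_pkg.invoke_fused_moe_kernel() == "quantized"
    assert fake_pkg.invoke_fused_moe_kernel is original

def test_restored_after_forward_exception():
    original = fake_pkg.invoke_fused_moe_kernel
    try:
        with patch_kernel(lambda *a: "quantized"):
            raise RuntimeError("simulated failure inside super().forward(...)")
    except RuntimeError:
        pass
    assert fake_pkg.invoke_fused_moe_kernel is original
```

The real tests would monkeypatch whichever of `invoke_fused_moe_kernel` / `invoke_fused_moe_triton_kernel` the installed vLLM exposes and assert the same pre-/post-conditions.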
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 476612da-4375-40a3-b6c6-f1dfd76df7b5
📒 Files selected for processing (1)
modelopt/torch/quantization/plugins/vllm.py
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #1162      +/-   ##
==========================================
+ Coverage   71.86%   76.12%    +4.25%
==========================================
  Files         352      353        +1
  Lines       40351    41169      +818
==========================================
+ Hits        28999    31338     +2339
+ Misses      11352     9831     -1521
```

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
a99f27f to
435caab
Compare
♻️ Duplicate comments (1)
modelopt/torch/quantization/plugins/vllm.py (1)
364-373: ⚠️ Potential issue | 🔴 Critical

Restore swapped expert weights in `finally` blocks.

On Line 371 and Line 387, an exception in the kernel call will skip restoration at Line 372 and Line 388, leaving `self.w13_weight` / `self.w2_weight` permanently swapped for the rest of execution.

Proposed fix

```diff
 if B is self.w13_weight:  # First layer of expert
     A = self.w13_input_quantizer(A)  # noqa: N806
     if self.w13_weight_quantizer.is_enabled:
         original_weight, self.w13_weight = (
             self.w13_weight,
             self.w13_weight_quantizer(self.w13_weight),
         )
         # In case the weight quantizer isn't folded yet in vllm_serve_fakequant, pass the
         # quantized weight to the kernel.
         B = self.w13_weight  # noqa: N806
-        vllm_fused_moe_package._invoke_fused_moe_kernel(A, B, C, *args, **kwargs)
-        self.w13_weight = original_weight
+        try:
+            vllm_fused_moe_package._invoke_fused_moe_kernel(A, B, C, *args, **kwargs)
+        finally:
+            self.w13_weight = original_weight
     else:
         vllm_fused_moe_package._invoke_fused_moe_kernel(A, B, C, *args, **kwargs)
@@
 elif B is self.w2_weight:
     A = self.w2_input_quantizer(A)  # noqa: N806
     if self.w2_weight_quantizer.is_enabled:
         original_weight, self.w2_weight = (
             self.w2_weight,
             self.w2_weight_quantizer(self.w2_weight),
         )
         # In case the weight quantizer isn't folded yet in vllm_serve_fakequant, pass the
         # quantized weight to the kernel.
         B = self.w2_weight  # noqa: N806
-        vllm_fused_moe_package._invoke_fused_moe_kernel(A, B, C, *args, **kwargs)
-        self.w2_weight = original_weight
+        try:
+            vllm_fused_moe_package._invoke_fused_moe_kernel(A, B, C, *args, **kwargs)
+        finally:
+            self.w2_weight = original_weight
     else:
         vllm_fused_moe_package._invoke_fused_moe_kernel(A, B, C, *args, **kwargs)
```

Also applies to: 380-389
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/vllm.py` around lines 364 - 373, The kernel invocation swaps in quantized expert weights (e.g., setting original_weight, self.w13_weight = (self.w13_weight, self.w13_weight_quantizer(self.w13_weight)) and similarly for self.w2_weight) but restores them only after the kernel call, so exceptions can leave weights swapped; wrap the kernel call and the restore in a try/finally so that self.w13_weight and self.w2_weight are always set back to original_weight even if vllm_fused_moe_package._invoke_fused_moe_kernel(A, B, C, *args, **kwargs) raises—i.e., perform the swap, call kernel inside try, and put self.w13_weight = original_weight (and self.w2_weight = original_weight for the other block) in the finally block.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@modelopt/torch/quantization/plugins/vllm.py`:
- Around line 364-373: The kernel invocation swaps in quantized expert weights
(e.g., setting original_weight, self.w13_weight = (self.w13_weight,
self.w13_weight_quantizer(self.w13_weight)) and similarly for self.w2_weight)
but restores them only after the kernel call, so exceptions can leave weights
swapped; wrap the kernel call and the restore in a try/finally so that
self.w13_weight and self.w2_weight are always set back to original_weight even
if vllm_fused_moe_package._invoke_fused_moe_kernel(A, B, C, *args, **kwargs)
raises—i.e., perform the swap, call kernel inside try, and put self.w13_weight =
original_weight (and self.w2_weight = original_weight for the other block) in
the finally block.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6d0f9a81-1879-41e8-89c7-1365685a7041
📒 Files selected for processing (1)
modelopt/torch/quantization/plugins/vllm.py
♻️ Duplicate comments (2)
modelopt/torch/quantization/plugins/vllm.py (2)
358-367: ⚠️ Potential issue | 🟠 Major

Restore the temporary expert weights in a `finally`.

A kernel/CUDA failure here leaves `self.w13_weight` / `self.w2_weight` permanently swapped to the quantized tensor, which can poison later forwards on the same module.

🩹 Proposed fix

```diff
 if self.w13_weight_quantizer.is_enabled:  # pragma: no cover
     original_weight, self.w13_weight = (
         self.w13_weight,
         self.w13_weight_quantizer(self.w13_weight),
     )
     # In case the weight quantizer isn't folded yet in vllm_serve_fakequant, pass the
     # quantized weight to the kernel.
     B = self.w13_weight  # noqa: N806
-    vllm_fused_moe_package._invoke_fused_moe_kernel(A, B, C, *args, **kwargs)
-    self.w13_weight = original_weight
+    try:
+        vllm_fused_moe_package._invoke_fused_moe_kernel(A, B, C, *args, **kwargs)
+    finally:
+        self.w13_weight = original_weight
@@
 if self.w2_weight_quantizer.is_enabled:  # pragma: no cover
     original_weight, self.w2_weight = (
         self.w2_weight,
         self.w2_weight_quantizer(self.w2_weight),
     )
     # In case the weight quantizer isn't folded yet in vllm_serve_fakequant, pass the
     # quantized weight to the kernel.
     B = self.w2_weight  # noqa: N806
-    vllm_fused_moe_package._invoke_fused_moe_kernel(A, B, C, *args, **kwargs)
-    self.w2_weight = original_weight
+    try:
+        vllm_fused_moe_package._invoke_fused_moe_kernel(A, B, C, *args, **kwargs)
+    finally:
+        self.w2_weight = original_weight
```

Also applies to: 374-383
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/vllm.py` around lines 358 - 367, The temporary swap of expert weights with their quantized version (e.g., when self.w13_weight_quantizer.is_enabled and similarly for self.w2_weight) must be restored in a finally block so kernel/CUDA exceptions don't leave self.w13_weight/self.w2_weight permanently replaced; update the blocks around self.w13_weight = self.w13_weight_quantizer(...)/B = self.w13_weight and the call to vllm_fused_moe_package._invoke_fused_moe_kernel(A, B, C, ...) so that original_weight is assigned back inside a finally (use try: set quantized, call kernel; finally: self.w13_weight = original_weight) and apply the same change to the corresponding w2_weight section.
392-398: ⚠️ Potential issue | 🔴 Critical

`replace_function()` is not safe enough for this module-global kernel swap.

`replace_function` in `modelopt/torch/quantization/utils/core_utils.py` restores only after `yield`, with no `try`/`finally` or synchronization. A failing or overlapping forward can therefore leave `vllm_fused_moe_package` patched to the wrong bound method and break later MoE calls. Please harden the helper or use a dedicated locked `try`/`finally` patch on this path.

Run this read-only check to confirm the helper is exception-unsafe and non-reentrant:

```bash
#!/bin/bash
set -euo pipefail
sed -n '326,336p' modelopt/torch/quantization/utils/core_utils.py
python - <<'PY'
from contextlib import contextmanager

@contextmanager
def replace(obj, name, new, cache="_cached"):
    old = getattr(obj, name)
    setattr(obj, name, new)
    setattr(obj, cache, old)
    yield
    setattr(obj, name, old)
    delattr(obj, cache)

class Obj:
    pass

o = Obj()
o.f = "orig"
try:
    with replace(o, "f", "A"):
        with replace(o, "f", "B"):
            pass
except Exception as exc:
    print(type(exc).__name__, exc)

events = []

@contextmanager
def cm():
    events.append("enter")
    yield
    events.append("cleanup")

try:
    with cm():
        raise RuntimeError("boom")
except RuntimeError:
    pass
print(events)
PY
```

Expected results:

- The helper source shows restore statements after `yield`, but no `try`/`finally` or locking.
- The nested toy example raises during teardown, showing the pattern is not re-entrant.
- The last line prints `['enter']`, showing code after `yield` is skipped on exceptions.
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/vllm.py` around lines 392 - 398, The current use of replace_function for swapping the module-global vllm_fused_moe_package binding in the forward override is exception-unsafe and non-reentrant; instead, wrap the attribute swap around a locked try/finally in the forward implementation: acquire a module-level threading.RLock, save the original bound method from vllm_fused_moe_package (the symbol cached under "_invoke_fused_moe_kernel"), set the attribute to self.invoke_fused_moe_quantized, call super().forward(hidden_states, router_logits), and in a finally block restore the original attribute and remove any cache attribute; alternatively harden replace_function itself to perform the same lock + try/finally restore behavior to protect invoke_fused_moe_kernel_func / invoke_fused_moe_quantized swaps from exceptions and nested calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@modelopt/torch/quantization/plugins/vllm.py`:
- Around line 358-367: The temporary swap of expert weights with their quantized
version (e.g., when self.w13_weight_quantizer.is_enabled and similarly for
self.w2_weight) must be restored in a finally block so kernel/CUDA exceptions
don't leave self.w13_weight/self.w2_weight permanently replaced; update the
blocks around self.w13_weight = self.w13_weight_quantizer(...)/B =
self.w13_weight and the call to
vllm_fused_moe_package._invoke_fused_moe_kernel(A, B, C, ...) so that
original_weight is assigned back inside a finally (use try: set quantized, call
kernel; finally: self.w13_weight = original_weight) and apply the same change to
the corresponding w2_weight section.
- Around line 392-398: The current use of replace_function for swapping the
module-global vllm_fused_moe_package binding in the forward override is
exception-unsafe and non-reentrant; instead, wrap the attribute swap around a
locked try/finally in the forward implementation: acquire a module-level
threading.RLock, save the original bound method from vllm_fused_moe_package (the
symbol cached under "_invoke_fused_moe_kernel"), set the attribute to
self.invoke_fused_moe_quantized, call super().forward(hidden_states,
router_logits), and in a finally block restore the original attribute and remove
any cache attribute; alternatively harden replace_function itself to perform the
same lock + try/finally restore behavior to protect invoke_fused_moe_kernel_func
/ invoke_fused_moe_quantized swaps from exceptions and nested calls.
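A hardened variant of the helper, per the suggestion above, is small. This is a sketch only; the real `replace_function` in `core_utils.py` also caches the original on the object, which is omitted here:

```python
from contextlib import contextmanager

@contextmanager
def replace_function(obj, name, new_func):
    """Temporarily replace obj.<name>; the finally clause guarantees the
    original is restored even if the with-body raises."""
    original = getattr(obj, name)
    setattr(obj, name, new_func)
    try:
        yield
    finally:
        setattr(obj, name, original)
```

Because each invocation saves its own `original` as a local, nested uses also unwind correctly in LIFO order; only the cross-thread case still needs a lock.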
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 12eed472-d865-4c37-bdc1-2eaa551309b2
📒 Files selected for processing (1)
modelopt/torch/quantization/plugins/vllm.py
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
96d081a to
80c400e
Compare
Actionable comments posted: 1
♻️ Duplicate comments (1)
modelopt/torch/quantization/plugins/vllm.py (1)
358-367: ⚠️ Potential issue | 🟠 Major

Restore the temporary expert weights in `finally`.

These branches still leave `self.w13_weight` / `self.w2_weight` swapped to the temporary quantized tensor if `_invoke_fused_moe_kernel(...)` throws, so later fake-quant calls run with corrupted layer state.

♻️ Proposed fix

```diff
 if self.w13_weight_quantizer.is_enabled:  # pragma: no cover
     original_weight, self.w13_weight = (
         self.w13_weight,
         self.w13_weight_quantizer(self.w13_weight),
     )
     # In case the weight quantizer isn't folded yet in vllm_serve_fakequant, pass the
     # quantized weight to the kernel.
     B = self.w13_weight  # noqa: N806
-    vllm_fused_moe_package._invoke_fused_moe_kernel(A, B, C, *args, **kwargs)
-    self.w13_weight = original_weight
+    try:
+        vllm_fused_moe_package._invoke_fused_moe_kernel(A, B, C, *args, **kwargs)
+    finally:
+        self.w13_weight = original_weight
@@
 if self.w2_weight_quantizer.is_enabled:  # pragma: no cover
     original_weight, self.w2_weight = (
         self.w2_weight,
         self.w2_weight_quantizer(self.w2_weight),
     )
     # In case the weight quantizer isn't folded yet in vllm_serve_fakequant, pass the
     # quantized weight to the kernel.
     B = self.w2_weight  # noqa: N806
-    vllm_fused_moe_package._invoke_fused_moe_kernel(A, B, C, *args, **kwargs)
-    self.w2_weight = original_weight
+    try:
+        vllm_fused_moe_package._invoke_fused_moe_kernel(A, B, C, *args, **kwargs)
+    finally:
+        self.w2_weight = original_weight
```

Also applies to: 374-383
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/vllm.py` around lines 358 - 367, The temporary assignment of quantized expert weights (e.g., swapping self.w13_weight and self.w2_weight before calling vllm_fused_moe_package._invoke_fused_moe_kernel) can leave the model in a corrupted state if the kernel throws; wrap the kernel invocation in a try/finally so the original weight is always restored (store original_weight = self.w13_weight / self.w2_weight, assign the quantized tensor, call vllm_fused_moe_package._invoke_fused_moe_kernel(...) inside try, and restore self.w13_weight / self.w2_weight in finally). Apply the same try/finally pattern to both the w13 branch (uses self.w13_weight_quantizer and B variable) and the w2 branch to ensure consistent restoration.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/quantization/plugins/vllm.py`:
- Around line 392-398: The context manager replace_function currently yields
without guaranteeing cleanup on exceptions, which can leave
vllm_fused_moe_package permanently patched and the _invoke_fused_moe_kernel
cache attribute lingering; change replace_function to wrap the yield in a
try/finally so the patch/unpatch and any attribute restore/deletion (the same
cleanup logic already present after the yield) always runs even if code inside
the context (e.g., super().forward in the caller that uses
invoke_fused_moe_kernel_func/invoke_fused_moe_quantized) raises an exception.
---
Duplicate comments:
In `@modelopt/torch/quantization/plugins/vllm.py`:
- Around line 358-367: The temporary assignment of quantized expert weights
(e.g., swapping self.w13_weight and self.w2_weight before calling
vllm_fused_moe_package._invoke_fused_moe_kernel) can leave the model in a
corrupted state if the kernel throws; wrap the kernel invocation in a
try/finally so the original weight is always restored (store original_weight =
self.w13_weight / self.w2_weight, assign the quantized tensor, call
vllm_fused_moe_package._invoke_fused_moe_kernel(...) inside try, and restore
self.w13_weight / self.w2_weight in finally). Apply the same try/finally pattern
to both the w13 branch (uses self.w13_weight_quantizer and B variable) and the
w2 branch to ensure consistent restoration.
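The swap-then-restore-in-`finally` pattern requested above can be illustrated with a runnable toy; `ToyMoE`, its weight list, and `_quantize` are placeholders, not the actual ModelOpt module:

```python
class ToyMoE:
    """Toy module mirroring the review's fix: quantize the weight, run the
    kernel, and always restore the original weight in a finally block."""

    def __init__(self):
        self.w13_weight = [1.0, 1.0]  # stand-in for the expert weight tensor

    def _quantize(self, w):
        return [x * 0.5 for x in w]  # stand-in for w13_weight_quantizer

    def invoke_kernel(self, fail=False):
        # Swap in the quantized weight so the "kernel" sees it.
        original_weight, self.w13_weight = self.w13_weight, self._quantize(self.w13_weight)
        try:
            if fail:
                raise RuntimeError("simulated kernel failure")
            return sum(self.w13_weight)  # the kernel call, operating on quantized values
        finally:
            self.w13_weight = original_weight  # restored even on exceptions
```

Without the `finally`, the `fail=True` path would leave the halved weights in place for every subsequent forward, which is exactly the corruption the review describes.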
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 96b8d5b6-4e3c-4e72-a2b8-32129861fd90
📒 Files selected for processing (1)
modelopt/torch/quantization/plugins/vllm.py
31be877 to
c42e844
Compare
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
c42e844 to
037f9a9
Compare
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
What does this PR do?
Type of change: Bug fix
`_QuantFusedMoEBase.forward()` previously replaced `vllm_fused_moe_package.invoke_fused_moe_kernel`, which was renamed to `dispatch_fused_moe_kernel` starting in vLLM v0.14.0rc1. This caused an `AttributeError` / assertion failure for any MoE model quantized with vLLM ≥ v0.14.0rc1.

The fix refactors the kernel-patching logic into a `_patch_moe_kernel()` context manager that probes for both attribute names (the two names are mutually exclusive across vLLM versions — confirmed by inspecting every release from v0.10.0 to v0.18.1).

Usage
NA
Testing
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

CONTRIBUTING.md: N/A

Additional Information