[None][fix] Revert "Merge Eagle3 and MTP-eagle one-model workers (#12353)"#15217
[None][fix] Revert "Merge Eagle3 and MTP-eagle one-model workers (#12353)"#15217tensorrt-cicd wants to merge 2 commits into
Conversation
📝 WalkthroughWalkthroughThis PR refactors speculative decoding to separate Eagle3 linear-tree logic from MTP one-model paths by removing unified worker patterns, inlining spec metadata helpers, introducing a dedicated ChangesSpeculative Decoding Eagle3/MTP Refactoring
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/speculative/utils.py (1)
46-55:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep the checkpoint MTP layer count separate from the runtime draft cap.
update_spec_config_from_model_config()explicitly allowsmax_draft_lento be smaller thannum_nextn_predict_layers, but these two helpers now swap those meanings:get_spec_metadata()reportsmtp_num_modules=max_draft_len, whileget_num_spec_layers()reports the checkpoint count. A config likenum_nextn_predict_layers=4, max_draft_len=2will make the one-model MTP metadata and layer-count helpers disagree about the same model, which is a good way to mis-size capture/allocation logic.Suggested fix
if spec_config.spec_dec_mode.is_mtp_one_model(): return MTPSpecMetadata( max_draft_len=spec_config.max_draft_len, max_total_draft_tokens=spec_config.tokens_per_gen_step - 1, spec_dec_mode=spec_config.spec_dec_mode, - mtp_num_modules=spec_config.max_draft_len, + mtp_num_modules=spec_config.num_nextn_predict_layers, max_num_requests=max_num_requests, mtp_hidden_states_manager=spec_resource_manager, allow_advanced_sampling=spec_config.allow_advanced_sampling, ) @@ def get_num_spec_layers(spec_config): if spec_config.spec_dec_mode.is_mtp_one_model(): - return spec_config.num_nextn_predict_layers + return spec_config.max_draft_lenAlso applies to: 317-318
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/speculative/utils.py` around lines 46 - 55, The MTPSpecMetadata currently sets mtp_num_modules from spec_config.max_draft_len which conflates runtime draft cap with the checkpoint layer count; change the assignment in get_spec_metadata() so mtp_num_modules is derived from the model checkpoint count (spec_config.num_nextn_predict_layers or equivalent) while leaving max_draft_len intact for runtime caps, and make the same correction at the other occurrence (lines 317-318) so get_num_spec_layers() and MTPSpecMetadata agree on the checkpoint MTP layer count; update any references to mtp_num_modules accordingly (functions/constructors: get_spec_metadata(), MTPSpecMetadata, get_num_spec_layers(), update_spec_config_from_model_config()).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/models/modeling_speculative.py`:
- Around line 406-414: The top-level assert unconditionally requires
self.embed_tokens even when inputs_embeds is provided; change it to only assert
when inputs_embeds is None (e.g., move or wrap the check so that if
inputs_embeds is None: assert self.embed_tokens is not None), then proceed to
materialize embeddings with self.embed_tokens(input_ids).to(self.dtype) as
before; make the identical change in MistralLarge3DraftModel.forward to avoid
rejecting valid calls that pass inputs_embeds or use lazy/shared embed_tokens.
In `@tensorrt_llm/_torch/pyexecutor/model_engine.py`:
- Around line 2249-2254: The removed helper that wrapped tp_cp_allgather did
more than unpack values: it also populated the per-rank bookkeeping fields
required for one-model MTP/MTPEagle generation (used during CUDA-graph warmup).
Restore that helper (or reintroduce its logic inline at each call site) so that
after calling tp_cp_allgather you set spec_metadata.all_rank_num_tokens,
spec_metadata.all_rank_num_seqs and the additional mode-specific per-rank fields
the helper previously populated for MTP/MTPEagle generation (the fields consumed
by the generation-side spec worker and CUDA-graph warmup); update the call sites
where you replaced the helper (the locations that currently only set
all_rank_num_tokens/all_rank_num_seqs) to use the full restored helper behavior.
- Around line 3644-3653: The code overwrites attn_metadata.all_rank_num_tokens
with unpadded values (from all_rank_num_tokens) which can conflict with the
padded counts produced by _get_padding_params; to fix, do not assign
attn_metadata.all_rank_num_tokens here—either set only
spec_metadata.all_rank_num_tokens and spec_metadata.all_rank_num_seqs (leave
attn_metadata alone), or if you must update attn_metadata use
attn_metadata.padded_num_tokens (or the padded result from _get_padding_params)
instead of attn_all_rank_num_tokens; update the block that currently assigns
attn_metadata.all_rank_num_tokens, spec_metadata.all_rank_num_tokens, and
spec_metadata.all_rank_num_seqs to follow one of these approaches so downstream
feed-forward/allreduce sees the padded attention counts.
In `@tensorrt_llm/_torch/speculative/eagle3_dynamic_tree.py`:
- Around line 183-186: The constructor of Eagle3OneModelDynamicTreeWorker
currently only asserts use_dynamic_tree but allows spec_config.sa_config, which
causes silent degradation for EAGLE3+SA; update
Eagle3OneModelDynamicTreeWorker.__init__ to explicitly reject SA by asserting or
raising when spec_config.sa_config (or spec_config.sa_enabled) is present, e.g.
check the sa_config field and fail with a clear message stating dynamic-tree
EAGLE3 does not support SA yet so an sa_manager won't be created and
maybe_override_all_draft_tokens won't be called; reference the class/constructor
(Eagle3OneModelDynamicTreeWorker.__init__) and the related symbols sa_config,
sa_manager, and maybe_override_all_draft_tokens so the guard is obvious and
prevents instantiation with SA until full support is implemented.
In `@tensorrt_llm/_torch/speculative/mtp.py`:
- Around line 1236-1256: The loop over self.mtp_num_modules incorrectly always
calls draft_model.mtp_layers[0] (and its shared_head) so later draft steps reuse
the first MTP predictor; change the call sites in the loop to use the loop index
(i) so each iteration invokes draft_model.mtp_layers[i] (and the corresponding
shared_head) instead of index 0; apply the same fix to the other occurrence
noted (the similar block around the 1287-1303 range) to ensure each draft step
uses the matching MTP module.
- Around line 1216-1217: The MTPEagle speculative/draft loop calls
_prepare_attn_metadata_for_spec_dec(attn_metadata) and later
_restore_attn_metadata_from_spec_dec(attn_metadata) but those helpers only
snapshot _seq_lens*; explicitly save attn_metadata.kv_lens_cuda before the draft
loop and restore it after the loop (i.e., wrap the MTPEagle draft/replay section
with a local saved_kv = attn_metadata.kv_lens_cuda and reassign
attn_metadata.kv_lens_cuda = saved_kv on exit) — mirror the explicit
save/restore added in Eagle3OneModelWorker to prevent KV-length leakage between
drafts and replays.
In `@tensorrt_llm/llmapi/llm_args.py`:
- Around line 1756-1762: MTPDecodingConfig currently exposes unconstrained
numeric fields (relaxed_topk, relaxed_delta, begin_thinking_phase_token,
end_thinking_phase_token) which can produce runtime errors in torch.topk and
trtllm.mtp_relaxed_acceptance_op; update the Pydantic declarations in
MTPDecodingConfig to enforce valid ranges (e.g., relaxed_topk: int >= 1,
relaxed_delta: float >= 0.0, begin_thinking_phase_token and
end_thinking_phase_token: int >= 0) by adding Field constraints (or use pydantic
types like conint/confloat) so invalid configs fail-fast during validation.
Ensure you change the Field(...) definitions for relaxed_topk, relaxed_delta,
begin_thinking_phase_token, and end_thinking_phase_token in MTPDecodingConfig
accordingly and run tests that exercise torch.topk(...) and
trtllm.mtp_relaxed_acceptance_op usage paths.
---
Outside diff comments:
In `@tensorrt_llm/_torch/speculative/utils.py`:
- Around line 46-55: The MTPSpecMetadata currently sets mtp_num_modules from
spec_config.max_draft_len which conflates runtime draft cap with the checkpoint
layer count; change the assignment in get_spec_metadata() so mtp_num_modules is
derived from the model checkpoint count (spec_config.num_nextn_predict_layers or
equivalent) while leaving max_draft_len intact for runtime caps, and make the
same correction at the other occurrence (lines 317-318) so get_num_spec_layers()
and MTPSpecMetadata agree on the checkpoint MTP layer count; update any
references to mtp_num_modules accordingly (functions/constructors:
get_spec_metadata(), MTPSpecMetadata, get_num_spec_layers(),
update_spec_config_from_model_config()).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 566938ed-de65-4364-aa05-ef76dd6dd8f3
📒 Files selected for processing (10)
tensorrt_llm/_torch/models/modeling_exaone_moe.pytensorrt_llm/_torch/models/modeling_speculative.pytensorrt_llm/_torch/pyexecutor/model_engine.pytensorrt_llm/_torch/speculative/__init__.pytensorrt_llm/_torch/speculative/eagle3.pytensorrt_llm/_torch/speculative/eagle3_dynamic_tree.pytensorrt_llm/_torch/speculative/interface.pytensorrt_llm/_torch/speculative/mtp.pytensorrt_llm/_torch/speculative/utils.pytensorrt_llm/llmapi/llm_args.py
💤 Files with no reviewable changes (1)
- tensorrt_llm/_torch/models/modeling_exaone_moe.py
| assert self.embed_tokens is not None | ||
|
|
||
| try: | ||
| if (input_ids is None) ^ (inputs_embeds is not None): | ||
| raise ValueError( | ||
| "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one" | ||
| ) | ||
| if (input_ids is None) ^ (inputs_embeds is not None): | ||
| raise ValueError( | ||
| "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one" | ||
| ) | ||
|
|
||
| if inputs_embeds is None: | ||
| assert self.embed_tokens is not None | ||
| inputs_embeds = self.embed_tokens(input_ids).to(self.dtype) | ||
|
|
||
| assert hidden_states is not None | ||
| # NOTE: If hidden states from the target model have to be concatenated, | ||
| # ideally, we expect that to happen outside the model definition. This | ||
| # helps us avoid data-dependent control flow and gives us better CUDA | ||
| # graph coverage. | ||
| if self._eh_proj_before_attn: | ||
| input_embeds = self.enorm(inputs_embeds) | ||
| hidden_states = torch.cat([input_embeds, hidden_states], dim=-1) | ||
| hidden_states = self.eh_proj(hidden_states) | ||
|
|
||
| residual = None | ||
| if self.num_layers > 1: | ||
| for layer in self.midlayer: | ||
| if residual is not None: | ||
| hidden_states = hidden_states + residual | ||
| hidden_states, residual = layer( | ||
| position_ids=position_ids, | ||
| embeds=inputs_embeds, | ||
| hidden_states=hidden_states, | ||
| attn_metadata=attn_metadata, | ||
| spec_metadata=spec_metadata, | ||
| ) | ||
| else: | ||
| hidden_states, residual = self.midlayer( | ||
| if inputs_embeds is None: | ||
| inputs_embeds = self.embed_tokens(input_ids).to(self.dtype) |
There was a problem hiding this comment.
Only require embed_tokens when inputs_embeds is absent.
Both forward() methods still support the inputs_embeds path, but the new top-level assert now rejects that valid call pattern whenever embeddings are shared or lazily wired from the target model. In those cases, embed_tokens can legitimately be None here and should only be required if you actually need to materialize embeddings from input_ids.
Suggested fix
- assert self.embed_tokens is not None
-
if (input_ids is None) ^ (inputs_embeds is not None):
raise ValueError(
"You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
)
if inputs_embeds is None:
+ if self.embed_tokens is None:
+ raise ValueError("input_ids requires embed_tokens to be initialized")
inputs_embeds = self.embed_tokens(input_ids).to(self.dtype)Apply the same change in MistralLarge3DraftModel.forward().
Also applies to: 635-643
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/models/modeling_speculative.py` around lines 406 - 414,
The top-level assert unconditionally requires self.embed_tokens even when
inputs_embeds is provided; change it to only assert when inputs_embeds is None
(e.g., move or wrap the check so that if inputs_embeds is None: assert
self.embed_tokens is not None), then proceed to materialize embeddings with
self.embed_tokens(input_ids).to(self.dtype) as before; make the identical change
in MistralLarge3DraftModel.forward to avoid rejecting valid calls that pass
inputs_embeds or use lazy/shared embed_tokens.
| spec_metadata.all_rank_num_tokens = [ | ||
| item[0] for item in all_rank_num_tokens | ||
| ] | ||
| spec_metadata.all_rank_num_seqs = [ | ||
| item[1] for item in all_rank_num_tokens | ||
| ] |
There was a problem hiding this comment.
Restore the mode-specific per-rank spec metadata setup.
The deleted helper did more than unpack tp_cp_allgather(...): it also populated the extra per-rank bookkeeping needed by the one-model MTP/MTPEagle path. These inline replacements only set all_rank_num_tokens and all_rank_num_seqs, so the generation-side spec worker now sees incomplete distributed metadata during CUDA-graph warmup. That lines up with the MTP GEN-capture crash called out in this PR. Please move the full logic back behind a shared helper or inline the missing mode-specific assignment at every call site.
Also applies to: 3477-3482, 3647-3653
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 2249 - 2254, The
removed helper that wrapped tp_cp_allgather did more than unpack values: it also
populated the per-rank bookkeeping fields required for one-model MTP/MTPEagle
generation (used during CUDA-graph warmup). Restore that helper (or reintroduce
its logic inline at each call site) so that after calling tp_cp_allgather you
set spec_metadata.all_rank_num_tokens, spec_metadata.all_rank_num_seqs and the
additional mode-specific per-rank fields the helper previously populated for
MTP/MTPEagle generation (the fields consumed by the generation-side spec worker
and CUDA-graph warmup); update the call sites where you replaced the helper (the
locations that currently only set all_rank_num_tokens/all_rank_num_seqs) to use
the full restored helper behavior.
Source: Pipeline failures
| attn_all_rank_num_tokens = [ | ||
| item[0] for item in all_rank_num_tokens | ||
| ] | ||
| self._set_spec_metadata_all_rank_num_tokens( | ||
| spec_metadata, [item[1] for item in all_rank_num_tokens], | ||
| [item[2] for item in all_rank_num_tokens]) | ||
| spec_all_rank_num_tokens = [ | ||
| item[1] for item in all_rank_num_tokens | ||
| ] | ||
| all_rank_num_seqs = [item[2] for item in all_rank_num_tokens] | ||
| attn_metadata.all_rank_num_tokens = attn_all_rank_num_tokens | ||
| spec_metadata.all_rank_num_tokens = spec_all_rank_num_tokens | ||
| spec_metadata.all_rank_num_seqs = all_rank_num_seqs |
There was a problem hiding this comment.
Don't overwrite the padded attention-DP token counts here.
Lines 3583-3590 already run _get_padding_params(...), which may replace the raw per-rank attention token counts with the padded capture bucket. This new gather writes attn_metadata.all_rank_num_tokens back from unpadded attn_metadata.num_tokens, so attn_metadata.padded_num_tokens and attn_metadata.all_rank_num_tokens can diverge when piecewise padding is active. Downstream feed-forward/allreduce code reads attn_metadata.all_rank_num_tokens, so this breaks the attention/spec metadata contract. Only populate spec_metadata here, or preserve the padded attention counts computed earlier.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 3644 - 3653, The
code overwrites attn_metadata.all_rank_num_tokens with unpadded values (from
all_rank_num_tokens) which can conflict with the padded counts produced by
_get_padding_params; to fix, do not assign attn_metadata.all_rank_num_tokens
here—either set only spec_metadata.all_rank_num_tokens and
spec_metadata.all_rank_num_seqs (leave attn_metadata alone), or if you must
update attn_metadata use attn_metadata.padded_num_tokens (or the padded result
from _get_padding_params) instead of attn_all_rank_num_tokens; update the block
that currently assigns attn_metadata.all_rank_num_tokens,
spec_metadata.all_rank_num_tokens, and spec_metadata.all_rank_num_seqs to follow
one of these approaches so downstream feed-forward/allreduce sees the padded
attention counts.
| super().__init__(spec_config, mapping, use_separate_draft_kv_cache) | ||
| assert self.use_dynamic_tree, ( | ||
| "Eagle3OneModelDynamicTreeWorker requires use_dynamic_tree=True" | ||
| ) |
There was a problem hiding this comment.
Keep rejecting sa_config here until dynamic-tree EAGLE3 actually supports it.
Removing the guard now lets this worker instantiate with SA enabled, but the dynamic-tree path still never gets an sa_manager and never applies maybe_override_all_draft_tokens() after drafting. That silently degrades EAGLE3+SA dynamic-tree runs back to plain dynamic tree instead of failing fast.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/speculative/eagle3_dynamic_tree.py` around lines 183 -
186, The constructor of Eagle3OneModelDynamicTreeWorker currently only asserts
use_dynamic_tree but allows spec_config.sa_config, which causes silent
degradation for EAGLE3+SA; update Eagle3OneModelDynamicTreeWorker.__init__ to
explicitly reject SA by asserting or raising when spec_config.sa_config (or
spec_config.sa_enabled) is present, e.g. check the sa_config field and fail with
a clear message stating dynamic-tree EAGLE3 does not support SA yet so an
sa_manager won't be created and maybe_override_all_draft_tokens won't be called;
reference the class/constructor (Eagle3OneModelDynamicTreeWorker.__init__) and
the related symbols sa_config, sa_manager, and maybe_override_all_draft_tokens
so the guard is obvious and prevents instantiation with SA until full support is
implemented.
| # Save the old attn_metadata and spec_metadata | ||
| self._prepare_attn_metadata_for_spec_dec(attn_metadata) |
There was a problem hiding this comment.
Restore kv_lens_cuda around the MTPEagle draft loop.
This path decrements/increments attn_metadata.kv_lens_cuda, but it only uses the base _prepare/_restore_attn_metadata_from_spec_dec() helpers. Those snapshot _seq_lens*, not the KV lengths, so the target step can inherit draft-loop KV metadata on the next replay. Eagle3OneModelWorker already had to add an explicit save/restore path for the same field.
Also applies to: 1324-1328, 1344-1360
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/speculative/mtp.py` around lines 1216 - 1217, The
MTPEagle speculative/draft loop calls
_prepare_attn_metadata_for_spec_dec(attn_metadata) and later
_restore_attn_metadata_from_spec_dec(attn_metadata) but those helpers only
snapshot _seq_lens*; explicitly save attn_metadata.kv_lens_cuda before the draft
loop and restore it after the loop (i.e., wrap the MTPEagle draft/replay section
with a local saved_kv = attn_metadata.kv_lens_cuda and reassign
attn_metadata.kv_lens_cuda = saved_kv on exit) — mirror the explicit
save/restore added in Eagle3OneModelWorker to prevent KV-length leakage between
drafts and replays.
| for i in range(self.mtp_num_modules): | ||
| if i == 0: | ||
| hidden_states = draft_model.mtp_layers[0]( | ||
| embed_tokens=draft_model.embed_tokens, | ||
| all_rank_num_tokens=spec_metadata.all_rank_num_tokens, | ||
| **inputs) | ||
|
|
||
| start_ids_gen = ( | ||
| spec_metadata.batch_indices_cuda[:num_gens] * | ||
| (self.mtp_num_modules + 1)).long() | ||
| gather_ids_gen = (start_ids_gen + | ||
| num_accepted_tokens[num_contexts:] - 1 + | ||
| attn_metadata.num_ctx_tokens) | ||
| gather_ids = torch.concat( | ||
| [last_tokens_idx[:num_contexts], gather_ids_gen], dim=0) | ||
| else: | ||
| hidden_states = draft_model.mtp_layers[0]( | ||
| embed_tokens=draft_model.embed_tokens, | ||
| all_rank_num_tokens=spec_metadata. | ||
| subseq_all_rank_num_tokens, | ||
| **inputs) |
There was a problem hiding this comment.
Use the matching MTP module on each draft step.
This loop always calls draft_model.mtp_layers[0] and its shared_head, regardless of i. For max_draft_len > 1, steps 1..N keep reusing the first predictor instead of the later trained MTP modules, so the drafter emits the wrong token sequence.
Suggested fix
- with self.draft_kv_cache_context(attn_metadata, draft_kv_cache_manager):
- for i in range(self.mtp_num_modules):
+ with self.draft_kv_cache_context(attn_metadata, draft_kv_cache_manager):
+ for i, mtp_layer in enumerate(draft_model.mtp_layers):
if i == 0:
- hidden_states = draft_model.mtp_layers[0](
+ hidden_states = mtp_layer(
embed_tokens=draft_model.embed_tokens,
all_rank_num_tokens=spec_metadata.all_rank_num_tokens,
**inputs)
@@
else:
- hidden_states = draft_model.mtp_layers[0](
+ hidden_states = mtp_layer(
embed_tokens=draft_model.embed_tokens,
all_rank_num_tokens=spec_metadata.
subseq_all_rank_num_tokens,
**inputs)
@@
- logits = draft_model.mtp_layers[0].shared_head(
+ logits = mtp_layer.shared_head(
padded_hidden_states, draft_model.lm_head,
attn_metadata, True)
else:
- logits = draft_model.mtp_layers[0].shared_head(
+ logits = mtp_layer.shared_head(
hidden_states[gather_ids], draft_model.lm_head,
attn_metadata, True)
@@
- mapping_lm_head_tp = draft_model.mtp_layers[
- 0].shared_head.mapping_lm_head_tp
+ mapping_lm_head_tp = mtp_layer.shared_head.mapping_lm_head_tpAlso applies to: 1287-1303
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/_torch/speculative/mtp.py` around lines 1236 - 1256, The loop
over self.mtp_num_modules incorrectly always calls draft_model.mtp_layers[0]
(and its shared_head) so later draft steps reuse the first MTP predictor; change
the call sites in the loop to use the loop index (i) so each iteration invokes
draft_model.mtp_layers[i] (and the corresponding shared_head) instead of index
0; apply the same fix to the other occurrence noted (the similar block around
the 1287-1303 range) to ensure each draft step uses the matching MTP module.
| relaxed_topk: int = Field( | ||
| default=1, | ||
| description= | ||
| "Number of top candidate tokens to consider for relaxed acceptance. Draft token is accepted if it matches any of these." | ||
| ) | ||
| relaxed_delta: NonNegativeFloat = Field( | ||
| default=0.0, | ||
| relaxed_delta: float = Field( | ||
| default=0., |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify unconstrained field types in MTPDecodingConfig.
rg -n -A4 -B4 "class MTPDecodingConfig|relaxed_topk|relaxed_delta|begin_thinking_phase_token|end_thinking_phase_token" tensorrt_llm/llmapi/llm_args.py
# Verify these fields are consumed directly by MTP runtime ops.
rg -n -A4 -B4 "torch.topk|mtp_relaxed_acceptance_op|relaxed_topk|relaxed_delta|begin_thinking_phase_token|end_thinking_phase_token" tensorrt_llm/_torch/speculative/mtp.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 6343
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the full MTPDecodingConfig block for surrounding patterns (line window kept small)
python3 - <<'PY'
import itertools, pathlib
path = pathlib.Path("tensorrt_llm/llmapi/llm_args.py")
start, end = 1725, 1840
with path.open("r", encoding="utf-8") as f:
for i, line in enumerate(f, start=1):
if start <= i <= end:
print(f"{i}:{line.rstrip()}")
if i > end:
break
PY
# Find nearby typing patterns for token-id-like fields
rg -n "token(_id|Id)?|phase_token|begin_.*token|end_.*token" tensorrt_llm/llmapi/llm_args.py
# Check how PositiveInt/NonNegativeInt are used in this file (to align with repo style)
rg -n "NonNegativeInt|PositiveInt|Field\\(.*(ge|gt)=" tensorrt_llm/llmapi/llm_args.py | head -n 80Repository: NVIDIA/TensorRT-LLM
Length of output: 25048
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python3 - <<'PY'
import pathlib
path = pathlib.Path("tensorrt_llm/_torch/speculative/mtp.py")
# Print only the relevant windows around the earlier matches
for start in (650, 770, 795, 815, 820):
end = start + 35
print(f"\n---- {start+1}-{end} ----")
with path.open("r", encoding="utf-8") as f:
for i, line in enumerate(f, start=1):
if start < i <= end:
print(f"{i}:{line.rstrip()}")
if i > end:
break
PY
# Also directly locate begin/end token usage
rg -n "begin_thinking_phase_token|end_thinking_phase_token|relaxed_delta|relaxed_topk" tensorrt_llm/_torch/speculative/mtp.pyRepository: NVIDIA/TensorRT-LLM
Length of output: 11439
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find other top-k / threshold-like config fields and their types
rg -n "topk|top_k|top-k|delta|threshold|thinking_phase_token|phase_token" tensorrt_llm/llmapi/llm_args.py
# Narrow to likely Pydantic config fields (same region around MTP and related speculative decoding)
rg -n "relaxed_topk|relaxed_delta|begin_thinking_phase_token|end_thinking_phase_token" -n tensorrt_llm/llmapi/llm_args.py
# Find other fields that use PositiveInt / NonNegativeFloat specifically
rg -n "PositiveInt|NonNegativeInt|NonNegativeFloat|PositiveFloat" tensorrt_llm/llmapi/llm_args.py | head -n 120Repository: NVIDIA/TensorRT-LLM
Length of output: 6160
Reinstate numeric Pydantic constraints for MTP relaxed-acceptance config
MTPDecodingConfig defines relaxed_topk/begin_thinking_phase_token/end_thinking_phase_token as unconstrained int and relaxed_delta as unconstrained float in tensorrt_llm/llmapi/llm_args.py (lines ~1756-1802). These values are consumed directly at runtime in tensorrt_llm/_torch/speculative/mtp.py (torch.topk(..., k=self.spec_config.relaxed_topk) and passed into trtllm.mtp_relaxed_acceptance_op), so invalid configs (e.g. relaxed_topk <= 0, negative deltas/tokens) can bypass fail-fast validation and lead to runtime errors or incorrect acceptance behavior.
💡 Suggested fix
- relaxed_topk: int = Field(
+ relaxed_topk: PositiveInt = Field(
default=1,
description=
"Number of top candidate tokens to consider for relaxed acceptance. Draft token is accepted if it matches any of these."
)
- relaxed_delta: float = Field(
+ relaxed_delta: NonNegativeFloat = Field(
default=0.,
description=
"Probability threshold for relaxed acceptance. Only candidates with prob >= (top-1 prob - delta) are kept."
)
@@
- begin_thinking_phase_token: int = Field(
+ begin_thinking_phase_token: NonNegativeInt = Field(
default=128798,
description=
"Token ID marking start of thinking phase. Relaxed acceptance only applies within this phase."
)
- end_thinking_phase_token: int = Field(
+ end_thinking_phase_token: NonNegativeInt = Field(
default=128799,
description=
"Token ID marking end of thinking phase. Strict acceptance resumes after this."
)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tensorrt_llm/llmapi/llm_args.py` around lines 1756 - 1762, MTPDecodingConfig
currently exposes unconstrained numeric fields (relaxed_topk, relaxed_delta,
begin_thinking_phase_token, end_thinking_phase_token) which can produce runtime
errors in torch.topk and trtllm.mtp_relaxed_acceptance_op; update the Pydantic
declarations in MTPDecodingConfig to enforce valid ranges (e.g., relaxed_topk:
int >= 1, relaxed_delta: float >= 0.0, begin_thinking_phase_token and
end_thinking_phase_token: int >= 0) by adding Field constraints (or use pydantic
types like conint/confloat) so invalid configs fail-fast during validation.
Ensure you change the Field(...) definitions for relaxed_topk, relaxed_delta,
begin_thinking_phase_token, and end_thinking_phase_token in MTPDecodingConfig
accordingly and run tests that exercise torch.topk(...) and
trtllm.mtp_relaxed_acceptance_op usage paths.
Source: Coding guidelines
…DIA#12353)" This reverts commit 8e5d9e2 (NVIDIA#12353). The MTP-eagle / Eagle3 one-model worker merge regresses the GB200 2-node disagg perf-sanity gen-only test: test_perf_sanity.py::test_e2e[disagg_upload-gen_only- gb200_deepseek-v32-fp4_1k1k_con2048_ctx1_dep4_gen1_dep4_eplb0_mtp1_ccb-NIXL] The GEN server (MTP speculative decoding, num_nextn_predict_layers=1) crashes during executor init while capturing the generation-phase CUDA graphs: _run_cuda_graph_warmup -> _capture_generation_cuda_graphs -> torch.cuda.synchronize() torch.AcceleratorError: CUDA error: unspecified launch failure All 4 gen ranks abort (exit 134). Bisect between the last passing post-merge commit (33b0a32) and the first failing one (316430f) isolates NVIDIA#12353 as the only change touching the spec-decode / model_engine CUDA-graph capture path; the remaining commits in the range are AutoDeploy, test, and infra changes. Reverting restores the gen server startup and unblocks the perf-sanity test. Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
…revert Remove all 14 deepseek-v32-fp4 perf-sanity test waivers (nvbugs 6280721, 6280649, 6085022, 6200257). These were skipped due to the gen-server CUDA-graph capture crash introduced by NVIDIA#12353, which is reverted in this branch, so the cases can run again. Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
b1d54c7 to
628ff8a
Compare
|
/bot run --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-4" |
|
PR_Github #53306 [ run ] triggered by Bot. Commit: |
|
PR_Github #53306 [ run ] completed with state
|
This reverts commit 8e5d9e2 (#12353).
Summary
The MTP-eagle / Eagle3 one-model worker merge (#12353) regresses the GB200 2-node disagg perf-sanity gen-only test:
Failure signature
The GEN server (MTP speculative decoding,
num_nextn_predict_layers=1) crashes during executor init while capturing the generation-phase CUDA graphs. All 4 gen ranks abort with exit code 134, so the benchmark never runs:Bisect
Comparing the last passing post-merge commit
33b0a32with the first failing one316430f(12 commits): #12353 is the only change touching the spec-decode /model_engineCUDA-graph capture path. The remaining commits in the range are AutoDeploy, test, and infra changes (none touch deepseek-v32 / MoE / attention / CUDA-graph / spec-decode kernels). The cutlass-dsl version is unchanged across the range.Reverting restores gen-server startup and unblocks the perf-sanity test.
Summary by CodeRabbit
Refactor
Chores
MTPDecodingConfigfrom constrained to standard numeric types.