Skip to content

[None][fix] Revert "Merge Eagle3 and MTP-eagle one-model workers (#12353)"#15217

Open
tensorrt-cicd wants to merge 2 commits into
NVIDIA:mainfrom
chenfeiz0326:revert-mtp-eagle-12353
Open

[None][fix] Revert "Merge Eagle3 and MTP-eagle one-model workers (#12353)"#15217
tensorrt-cicd wants to merge 2 commits into
NVIDIA:mainfrom
chenfeiz0326:revert-mtp-eagle-12353

Conversation

@tensorrt-cicd

@tensorrt-cicd tensorrt-cicd commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

This reverts commit 8e5d9e2 (#12353).

Summary

The MTP-eagle / Eagle3 one-model worker merge (#12353) regresses the GB200 2-node disagg perf-sanity gen-only test:

test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_deepseek-v32-fp4_1k1k_con2048_ctx1_dep4_gen1_dep4_eplb0_mtp1_ccb-NIXL]

Failure signature

The GEN server (MTP speculative decoding, num_nextn_predict_layers=1) crashes during executor init while capturing the generation-phase CUDA graphs. All 4 gen ranks abort with exit code 134, so the benchmark never runs:

File "model_engine.py", line 1169, in _run_cuda_graph_warmup
    self._capture_generation_cuda_graphs(resource_manager)
File "model_engine.py", line 1240, in _capture_generation_cuda_graphs
    torch.cuda.synchronize()
torch.AcceleratorError: CUDA error: unspecified launch failure

Bisect

Comparing the last passing post-merge commit 33b0a32 with the first failing one 316430f (12 commits): #12353 is the only change touching the spec-decode / model_engine CUDA-graph capture path. The remaining commits in the range are AutoDeploy, test, and infra changes (none touch deepseek-v32 / MoE / attention / CUDA-graph / spec-decode kernels). The cutlass-dsl version is unchanged across the range.

Reverting restores gen-server startup and unblocks the perf-sanity test.

Summary by CodeRabbit

  • Refactor

    • Refactored EAGLE3 one-model speculative decoding to streamline linear-tree processing and remove legacy code paths.
    • Simplified speculative decoding worker constructors and method signatures for improved maintainability.
    • Consolidated speculative metadata handling and removed redundant parameters.
  • Chores

    • Cleaned up license header comments and reorganized import statements across speculative decoding modules.
    • Updated configuration field types in MTPDecodingConfig from constrained to standard numeric types.

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR refactors speculative decoding to separate Eagle3 linear-tree logic from MTP one-model paths by removing unified worker patterns, inlining spec metadata helpers, introducing a dedicated MTPEagleWorker class, and updating mode routing across the subsystem.

Changes

Speculative Decoding Eagle3/MTP Refactoring

Layer / File(s) Summary
Spec metadata helper inlining and draft model signature
tensorrt_llm/_torch/models/modeling_speculative.py, tensorrt_llm/_torch/pyexecutor/model_engine.py
Eagle3DraftModel.forward removes the all_rank_num_tokens override parameter and its try/finally restoration logic. The _set_spec_metadata_all_rank_num_tokens helper is removed and its token/sequence assignments are inlined at three call sites in _prepare_incremental_update_metadata, _prepare_tp_inputs, and _prepare_tp_inputs_no_cache using direct list comprehensions.
Eagle3 one-model worker refactoring and simplification
tensorrt_llm/_torch/speculative/eagle3.py
Module is simplified to focus on linear-tree decoding: imports trimmed, Eagle3OneModelSpecMetadata loses slot_ids and subseq_all_rank_num_tokens fields, Eagle3OneModelWorker constructor simplified (removes unified worker parameters), attention metadata is cached/restored via kv_lens_cuda preservation, linear draft loop is rewritten with dynamic attn_metadata.all_rank_num_tokens updates, sampling/acceptance helpers consolidated, prepare_1st_drafter_inputs always applies Eagle3 FC projection, and MTPEagleWorker alias removed.
MTPEagleWorker introduction
tensorrt_llm/_torch/speculative/mtp.py
New MTPEagleWorker class implements Eagle-mode draft prediction: adds compiled helpers for draft token updates, position ID derivation, and drafter input preparation; implements forward() with Eagle-specific gather/index logic across MTP layers, metadata updates, optional Mamba cache state handling, and SA draft-token overrides.
Speculative interface and dynamic tree signature updates
tensorrt_llm/_torch/speculative/eagle3_dynamic_tree.py, tensorrt_llm/_torch/speculative/interface.py
Method signatures updated: sample_and_accept_draft_tokens removes input_ids parameter in both dynamic-tree worker and base interface; SpecWorkerBase.skip_forward removes resource_manager parameter; support_dynamic_draft_len now returns only for EAGLE3_ONE_MODEL, dropping MTP_EAGLE_ONE_MODEL.
Mode routing and resource manager consolidation
tensorrt_llm/_torch/speculative/utils.py
Mode predicate checks switch from is_mtp_eagle_one_model() to is_mtp_one_model() for metadata, decoder, and layer-count routing. Resource manager logic refactored: removes Eagle3 fallback for old predicate, adds dedicated MTP-one-model branch returning MTPHiddenStatesManager, and conditionally applies SuffixAutomatonManager.
Configuration type relaxation and cleanup
tensorrt_llm/llmapi/llm_args.py, tensorrt_llm/_torch/models/modeling_exaone_moe.py, tensorrt_llm/_torch/speculative/__init__.py
Remove Pydantic constrained types (PositiveInt, NonNegativeFloat, NonNegativeInt) from MTPDecodingConfig fields (relaxed_topk, relaxed_delta, thinking-phase tokens) and replace with unconstrained int and float. Remove SPDX/license headers from module files. Reorganize __init__.py imports: move Eagle3SpecMetadata to separate eagle3 import and MTPEagleWorker to mtp import.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#12353: Reverses unified Eagle3+MTP-eagle one-model plumbing by removing all_rank_num_tokens override support and adjusting spec metadata population helpers.

Suggested reviewers

  • mikeiovine
  • schetlur-nv
  • hlu1
  • nv-guomingz
  • ruodil
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 27.91% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: reverting a previous merge that caused a regression in GB200 disaggregated perf-sanity tests.
Description check ✅ Passed The description provides a clear summary of why the revert is necessary, including the regression signature, bisect analysis confirming the root cause, and evidence that reverting fixes the issue.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/speculative/utils.py (1)

46-55: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the checkpoint MTP layer count separate from the runtime draft cap.

update_spec_config_from_model_config() explicitly allows max_draft_len to be smaller than num_nextn_predict_layers, but these two helpers now swap those meanings: get_spec_metadata() reports mtp_num_modules=max_draft_len, while get_num_spec_layers() reports the checkpoint count. A config like num_nextn_predict_layers=4, max_draft_len=2 will make the one-model MTP metadata and layer-count helpers disagree about the same model, which is a good way to mis-size capture/allocation logic.

Suggested fix
     if spec_config.spec_dec_mode.is_mtp_one_model():
         return MTPSpecMetadata(
             max_draft_len=spec_config.max_draft_len,
             max_total_draft_tokens=spec_config.tokens_per_gen_step - 1,
             spec_dec_mode=spec_config.spec_dec_mode,
-            mtp_num_modules=spec_config.max_draft_len,
+            mtp_num_modules=spec_config.num_nextn_predict_layers,
             max_num_requests=max_num_requests,
             mtp_hidden_states_manager=spec_resource_manager,
             allow_advanced_sampling=spec_config.allow_advanced_sampling,
         )
@@
 def get_num_spec_layers(spec_config):
     if spec_config.spec_dec_mode.is_mtp_one_model():
-        return spec_config.num_nextn_predict_layers
+        return spec_config.max_draft_len

Also applies to: 317-318

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/speculative/utils.py` around lines 46 - 55, The
MTPSpecMetadata currently sets mtp_num_modules from spec_config.max_draft_len
which conflates runtime draft cap with the checkpoint layer count; change the
assignment in get_spec_metadata() so mtp_num_modules is derived from the model
checkpoint count (spec_config.num_nextn_predict_layers or equivalent) while
leaving max_draft_len intact for runtime caps, and make the same correction at
the other occurrence (lines 317-318) so get_num_spec_layers() and
MTPSpecMetadata agree on the checkpoint MTP layer count; update any references
to mtp_num_modules accordingly (functions/constructors: get_spec_metadata(),
MTPSpecMetadata, get_num_spec_layers(), update_spec_config_from_model_config()).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/models/modeling_speculative.py`:
- Around line 406-414: The top-level assert unconditionally requires
self.embed_tokens even when inputs_embeds is provided; change it to only assert
when inputs_embeds is None (e.g., move or wrap the check so that if
inputs_embeds is None: assert self.embed_tokens is not None), then proceed to
materialize embeddings with self.embed_tokens(input_ids).to(self.dtype) as
before; make the identical change in MistralLarge3DraftModel.forward to avoid
rejecting valid calls that pass inputs_embeds or use lazy/shared embed_tokens.

In `@tensorrt_llm/_torch/pyexecutor/model_engine.py`:
- Around line 2249-2254: The removed helper that wrapped tp_cp_allgather did
more than unpack values: it also populated the per-rank bookkeeping fields
required for one-model MTP/MTPEagle generation (used during CUDA-graph warmup).
Restore that helper (or reintroduce its logic inline at each call site) so that
after calling tp_cp_allgather you set spec_metadata.all_rank_num_tokens,
spec_metadata.all_rank_num_seqs and the additional mode-specific per-rank fields
the helper previously populated for MTP/MTPEagle generation (the fields consumed
by the generation-side spec worker and CUDA-graph warmup); update the call sites
where you replaced the helper (the locations that currently only set
all_rank_num_tokens/all_rank_num_seqs) to use the full restored helper behavior.
- Around line 3644-3653: The code overwrites attn_metadata.all_rank_num_tokens
with unpadded values (from all_rank_num_tokens) which can conflict with the
padded counts produced by _get_padding_params; to fix, do not assign
attn_metadata.all_rank_num_tokens here—either set only
spec_metadata.all_rank_num_tokens and spec_metadata.all_rank_num_seqs (leave
attn_metadata alone), or if you must update attn_metadata use
attn_metadata.padded_num_tokens (or the padded result from _get_padding_params)
instead of attn_all_rank_num_tokens; update the block that currently assigns
attn_metadata.all_rank_num_tokens, spec_metadata.all_rank_num_tokens, and
spec_metadata.all_rank_num_seqs to follow one of these approaches so downstream
feed-forward/allreduce sees the padded attention counts.

In `@tensorrt_llm/_torch/speculative/eagle3_dynamic_tree.py`:
- Around line 183-186: The constructor of Eagle3OneModelDynamicTreeWorker
currently only asserts use_dynamic_tree but allows spec_config.sa_config, which
causes silent degradation for EAGLE3+SA; update
Eagle3OneModelDynamicTreeWorker.__init__ to explicitly reject SA by asserting or
raising when spec_config.sa_config (or spec_config.sa_enabled) is present, e.g.
check the sa_config field and fail with a clear message stating dynamic-tree
EAGLE3 does not support SA yet so an sa_manager won't be created and
maybe_override_all_draft_tokens won't be called; reference the class/constructor
(Eagle3OneModelDynamicTreeWorker.__init__) and the related symbols sa_config,
sa_manager, and maybe_override_all_draft_tokens so the guard is obvious and
prevents instantiation with SA until full support is implemented.

In `@tensorrt_llm/_torch/speculative/mtp.py`:
- Around line 1236-1256: The loop over self.mtp_num_modules incorrectly always
calls draft_model.mtp_layers[0] (and its shared_head) so later draft steps reuse
the first MTP predictor; change the call sites in the loop to use the loop index
(i) so each iteration invokes draft_model.mtp_layers[i] (and the corresponding
shared_head) instead of index 0; apply the same fix to the other occurrence
noted (the similar block around the 1287-1303 range) to ensure each draft step
uses the matching MTP module.
- Around line 1216-1217: The MTPEagle speculative/draft loop calls
_prepare_attn_metadata_for_spec_dec(attn_metadata) and later
_restore_attn_metadata_from_spec_dec(attn_metadata) but those helpers only
snapshot _seq_lens*; explicitly save attn_metadata.kv_lens_cuda before the draft
loop and restore it after the loop (i.e., wrap the MTPEagle draft/replay section
with a local saved_kv = attn_metadata.kv_lens_cuda and reassign
attn_metadata.kv_lens_cuda = saved_kv on exit) — mirror the explicit
save/restore added in Eagle3OneModelWorker to prevent KV-length leakage between
drafts and replays.

In `@tensorrt_llm/llmapi/llm_args.py`:
- Around line 1756-1762: MTPDecodingConfig currently exposes unconstrained
numeric fields (relaxed_topk, relaxed_delta, begin_thinking_phase_token,
end_thinking_phase_token) which can produce runtime errors in torch.topk and
trtllm.mtp_relaxed_acceptance_op; update the Pydantic declarations in
MTPDecodingConfig to enforce valid ranges (e.g., relaxed_topk: int >= 1,
relaxed_delta: float >= 0.0, begin_thinking_phase_token and
end_thinking_phase_token: int >= 0) by adding Field constraints (or use pydantic
types like conint/confloat) so invalid configs fail-fast during validation.
Ensure you change the Field(...) definitions for relaxed_topk, relaxed_delta,
begin_thinking_phase_token, and end_thinking_phase_token in MTPDecodingConfig
accordingly and run tests that exercise torch.topk(...) and
trtllm.mtp_relaxed_acceptance_op usage paths.

---

Outside diff comments:
In `@tensorrt_llm/_torch/speculative/utils.py`:
- Around line 46-55: The MTPSpecMetadata currently sets mtp_num_modules from
spec_config.max_draft_len which conflates runtime draft cap with the checkpoint
layer count; change the assignment in get_spec_metadata() so mtp_num_modules is
derived from the model checkpoint count (spec_config.num_nextn_predict_layers or
equivalent) while leaving max_draft_len intact for runtime caps, and make the
same correction at the other occurrence (lines 317-318) so get_num_spec_layers()
and MTPSpecMetadata agree on the checkpoint MTP layer count; update any
references to mtp_num_modules accordingly (functions/constructors:
get_spec_metadata(), MTPSpecMetadata, get_num_spec_layers(),
update_spec_config_from_model_config()).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 566938ed-de65-4364-aa05-ef76dd6dd8f3

📥 Commits

Reviewing files that changed from the base of the PR and between 62c6521 and b09dcef.

📒 Files selected for processing (10)
  • tensorrt_llm/_torch/models/modeling_exaone_moe.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/speculative/__init__.py
  • tensorrt_llm/_torch/speculative/eagle3.py
  • tensorrt_llm/_torch/speculative/eagle3_dynamic_tree.py
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/_torch/speculative/mtp.py
  • tensorrt_llm/_torch/speculative/utils.py
  • tensorrt_llm/llmapi/llm_args.py
💤 Files with no reviewable changes (1)
  • tensorrt_llm/_torch/models/modeling_exaone_moe.py

Comment on lines +406 to +414
assert self.embed_tokens is not None

try:
if (input_ids is None) ^ (inputs_embeds is not None):
raise ValueError(
"You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
)
if (input_ids is None) ^ (inputs_embeds is not None):
raise ValueError(
"You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
)

if inputs_embeds is None:
assert self.embed_tokens is not None
inputs_embeds = self.embed_tokens(input_ids).to(self.dtype)

assert hidden_states is not None
# NOTE: If hidden states from the target model have to be concatenated,
# ideally, we expect that to happen outside the model definition. This
# helps us avoid data-dependent control flow and gives us better CUDA
# graph coverage.
if self._eh_proj_before_attn:
input_embeds = self.enorm(inputs_embeds)
hidden_states = torch.cat([input_embeds, hidden_states], dim=-1)
hidden_states = self.eh_proj(hidden_states)

residual = None
if self.num_layers > 1:
for layer in self.midlayer:
if residual is not None:
hidden_states = hidden_states + residual
hidden_states, residual = layer(
position_ids=position_ids,
embeds=inputs_embeds,
hidden_states=hidden_states,
attn_metadata=attn_metadata,
spec_metadata=spec_metadata,
)
else:
hidden_states, residual = self.midlayer(
if inputs_embeds is None:
inputs_embeds = self.embed_tokens(input_ids).to(self.dtype)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only require embed_tokens when inputs_embeds is absent.

Both forward() methods still support the inputs_embeds path, but the new top-level assert now rejects that valid call pattern whenever embeddings are shared or lazily wired from the target model. In those cases, embed_tokens can legitimately be None here and should only be required if you actually need to materialize embeddings from input_ids.

Suggested fix
-        assert self.embed_tokens is not None
-
         if (input_ids is None) ^ (inputs_embeds is not None):
             raise ValueError(
                 "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
             )
 
         if inputs_embeds is None:
+            if self.embed_tokens is None:
+                raise ValueError("input_ids requires embed_tokens to be initialized")
             inputs_embeds = self.embed_tokens(input_ids).to(self.dtype)

Apply the same change in MistralLarge3DraftModel.forward().

Also applies to: 635-643

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/models/modeling_speculative.py` around lines 406 - 414,
The top-level assert unconditionally requires self.embed_tokens even when
inputs_embeds is provided; change it to only assert when inputs_embeds is None
(e.g., move or wrap the check so that if inputs_embeds is None: assert
self.embed_tokens is not None), then proceed to materialize embeddings with
self.embed_tokens(input_ids).to(self.dtype) as before; make the identical change
in MistralLarge3DraftModel.forward to avoid rejecting valid calls that pass
inputs_embeds or use lazy/shared embed_tokens.

Comment on lines +2249 to +2254
spec_metadata.all_rank_num_tokens = [
item[0] for item in all_rank_num_tokens
]
spec_metadata.all_rank_num_seqs = [
item[1] for item in all_rank_num_tokens
]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Restore the mode-specific per-rank spec metadata setup.

The deleted helper did more than unpack tp_cp_allgather(...): it also populated the extra per-rank bookkeeping needed by the one-model MTP/MTPEagle path. These inline replacements only set all_rank_num_tokens and all_rank_num_seqs, so the generation-side spec worker now sees incomplete distributed metadata during CUDA-graph warmup. That lines up with the MTP GEN-capture crash called out in this PR. Please move the full logic back behind a shared helper or inline the missing mode-specific assignment at every call site.

Also applies to: 3477-3482, 3647-3653

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 2249 - 2254, The
removed helper that wrapped tp_cp_allgather did more than unpack values: it also
populated the per-rank bookkeeping fields required for one-model MTP/MTPEagle
generation (used during CUDA-graph warmup). Restore that helper (or reintroduce
its logic inline at each call site) so that after calling tp_cp_allgather you
set spec_metadata.all_rank_num_tokens, spec_metadata.all_rank_num_seqs and the
additional mode-specific per-rank fields the helper previously populated for
MTP/MTPEagle generation (the fields consumed by the generation-side spec worker
and CUDA-graph warmup); update the call sites where you replaced the helper (the
locations that currently only set all_rank_num_tokens/all_rank_num_seqs) to use
the full restored helper behavior.

Source: Pipeline failures

Comment on lines +3644 to +3653
attn_all_rank_num_tokens = [
item[0] for item in all_rank_num_tokens
]
self._set_spec_metadata_all_rank_num_tokens(
spec_metadata, [item[1] for item in all_rank_num_tokens],
[item[2] for item in all_rank_num_tokens])
spec_all_rank_num_tokens = [
item[1] for item in all_rank_num_tokens
]
all_rank_num_seqs = [item[2] for item in all_rank_num_tokens]
attn_metadata.all_rank_num_tokens = attn_all_rank_num_tokens
spec_metadata.all_rank_num_tokens = spec_all_rank_num_tokens
spec_metadata.all_rank_num_seqs = all_rank_num_seqs

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't overwrite the padded attention-DP token counts here.

Lines 3583-3590 already run _get_padding_params(...), which may replace the raw per-rank attention token counts with the padded capture bucket. This new gather writes attn_metadata.all_rank_num_tokens back from unpadded attn_metadata.num_tokens, so attn_metadata.padded_num_tokens and attn_metadata.all_rank_num_tokens can diverge when piecewise padding is active. Downstream feed-forward/allreduce code reads attn_metadata.all_rank_num_tokens, so this breaks the attention/spec metadata contract. Only populate spec_metadata here, or preserve the padded attention counts computed earlier.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 3644 - 3653, The
code overwrites attn_metadata.all_rank_num_tokens with unpadded values (from
all_rank_num_tokens) which can conflict with the padded counts produced by
_get_padding_params; to fix, do not assign attn_metadata.all_rank_num_tokens
here—either set only spec_metadata.all_rank_num_tokens and
spec_metadata.all_rank_num_seqs (leave attn_metadata alone), or if you must
update attn_metadata use attn_metadata.padded_num_tokens (or the padded result
from _get_padding_params) instead of attn_all_rank_num_tokens; update the block
that currently assigns attn_metadata.all_rank_num_tokens,
spec_metadata.all_rank_num_tokens, and spec_metadata.all_rank_num_seqs to follow
one of these approaches so downstream feed-forward/allreduce sees the padded
attention counts.

Comment on lines +183 to 186
super().__init__(spec_config, mapping, use_separate_draft_kv_cache)
assert self.use_dynamic_tree, (
"Eagle3OneModelDynamicTreeWorker requires use_dynamic_tree=True"
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep rejecting sa_config here until dynamic-tree EAGLE3 actually supports it.

Removing the guard now lets this worker instantiate with SA enabled, but the dynamic-tree path still never gets an sa_manager and never applies maybe_override_all_draft_tokens() after drafting. That silently degrades EAGLE3+SA dynamic-tree runs back to plain dynamic tree instead of failing fast.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/speculative/eagle3_dynamic_tree.py` around lines 183 -
186, The constructor of Eagle3OneModelDynamicTreeWorker currently only asserts
use_dynamic_tree but allows spec_config.sa_config, which causes silent
degradation for EAGLE3+SA; update Eagle3OneModelDynamicTreeWorker.__init__ to
explicitly reject SA by asserting or raising when spec_config.sa_config (or
spec_config.sa_enabled) is present, e.g. check the sa_config field and fail with
a clear message stating dynamic-tree EAGLE3 does not support SA yet so an
sa_manager won't be created and maybe_override_all_draft_tokens won't be called;
reference the class/constructor (Eagle3OneModelDynamicTreeWorker.__init__) and
the related symbols sa_config, sa_manager, and maybe_override_all_draft_tokens
so the guard is obvious and prevents instantiation with SA until full support is
implemented.

Comment on lines +1216 to +1217
# Save the old attn_metadata and spec_metadata
self._prepare_attn_metadata_for_spec_dec(attn_metadata)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restore kv_lens_cuda around the MTPEagle draft loop.

This path decrements/increments attn_metadata.kv_lens_cuda, but it only uses the base _prepare/_restore_attn_metadata_from_spec_dec() helpers. Those snapshot _seq_lens*, not the KV lengths, so the target step can inherit draft-loop KV metadata on the next replay. Eagle3OneModelWorker already had to add an explicit save/restore path for the same field.

Also applies to: 1324-1328, 1344-1360

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/speculative/mtp.py` around lines 1216 - 1217, The
MTPEagle speculative/draft loop calls
_prepare_attn_metadata_for_spec_dec(attn_metadata) and later
_restore_attn_metadata_from_spec_dec(attn_metadata) but those helpers only
snapshot _seq_lens*; explicitly save attn_metadata.kv_lens_cuda before the draft
loop and restore it after the loop (i.e., wrap the MTPEagle draft/replay section
with a local saved_kv = attn_metadata.kv_lens_cuda and reassign
attn_metadata.kv_lens_cuda = saved_kv on exit) — mirror the explicit
save/restore added in Eagle3OneModelWorker to prevent KV-length leakage between
drafts and replays.

Comment on lines +1236 to +1256
for i in range(self.mtp_num_modules):
if i == 0:
hidden_states = draft_model.mtp_layers[0](
embed_tokens=draft_model.embed_tokens,
all_rank_num_tokens=spec_metadata.all_rank_num_tokens,
**inputs)

start_ids_gen = (
spec_metadata.batch_indices_cuda[:num_gens] *
(self.mtp_num_modules + 1)).long()
gather_ids_gen = (start_ids_gen +
num_accepted_tokens[num_contexts:] - 1 +
attn_metadata.num_ctx_tokens)
gather_ids = torch.concat(
[last_tokens_idx[:num_contexts], gather_ids_gen], dim=0)
else:
hidden_states = draft_model.mtp_layers[0](
embed_tokens=draft_model.embed_tokens,
all_rank_num_tokens=spec_metadata.
subseq_all_rank_num_tokens,
**inputs)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the matching MTP module on each draft step.

This loop always calls draft_model.mtp_layers[0] and its shared_head, regardless of i. For max_draft_len > 1, steps 1..N keep reusing the first predictor instead of the later trained MTP modules, so the drafter emits the wrong token sequence.

Suggested fix
-        with self.draft_kv_cache_context(attn_metadata, draft_kv_cache_manager):
-            for i in range(self.mtp_num_modules):
+        with self.draft_kv_cache_context(attn_metadata, draft_kv_cache_manager):
+            for i, mtp_layer in enumerate(draft_model.mtp_layers):
                 if i == 0:
-                    hidden_states = draft_model.mtp_layers[0](
+                    hidden_states = mtp_layer(
                         embed_tokens=draft_model.embed_tokens,
                         all_rank_num_tokens=spec_metadata.all_rank_num_tokens,
                         **inputs)
@@
                 else:
-                    hidden_states = draft_model.mtp_layers[0](
+                    hidden_states = mtp_layer(
                         embed_tokens=draft_model.embed_tokens,
                         all_rank_num_tokens=spec_metadata.
                         subseq_all_rank_num_tokens,
                         **inputs)
@@
-                    logits = draft_model.mtp_layers[0].shared_head(
+                    logits = mtp_layer.shared_head(
                         padded_hidden_states, draft_model.lm_head,
                         attn_metadata, True)
                 else:
-                    logits = draft_model.mtp_layers[0].shared_head(
+                    logits = mtp_layer.shared_head(
                         hidden_states[gather_ids], draft_model.lm_head,
                         attn_metadata, True)
@@
-                    mapping_lm_head_tp = draft_model.mtp_layers[
-                        0].shared_head.mapping_lm_head_tp
+                    mapping_lm_head_tp = mtp_layer.shared_head.mapping_lm_head_tp

Also applies to: 1287-1303

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/speculative/mtp.py` around lines 1236 - 1256, The loop
over self.mtp_num_modules incorrectly always calls draft_model.mtp_layers[0]
(and its shared_head) so later draft steps reuse the first MTP predictor; change
the call sites in the loop to use the loop index (i) so each iteration invokes
draft_model.mtp_layers[i] (and the corresponding shared_head) instead of index
0; apply the same fix to the other occurrence noted (the similar block around
the 1287-1303 range) to ensure each draft step uses the matching MTP module.

Comment on lines +1756 to +1762
relaxed_topk: int = Field(
default=1,
description=
"Number of top candidate tokens to consider for relaxed acceptance. Draft token is accepted if it matches any of these."
)
relaxed_delta: NonNegativeFloat = Field(
default=0.0,
relaxed_delta: float = Field(
default=0.,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify unconstrained field types in MTPDecodingConfig.
rg -n -A4 -B4 "class MTPDecodingConfig|relaxed_topk|relaxed_delta|begin_thinking_phase_token|end_thinking_phase_token" tensorrt_llm/llmapi/llm_args.py

# Verify these fields are consumed directly by MTP runtime ops.
rg -n -A4 -B4 "torch.topk|mtp_relaxed_acceptance_op|relaxed_topk|relaxed_delta|begin_thinking_phase_token|end_thinking_phase_token" tensorrt_llm/_torch/speculative/mtp.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 6343


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the full MTPDecodingConfig block for surrounding patterns (line window kept small)
python3 - <<'PY'
import itertools, pathlib
path = pathlib.Path("tensorrt_llm/llmapi/llm_args.py")
start, end = 1725, 1840
with path.open("r", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        if start <= i <= end:
            print(f"{i}:{line.rstrip()}")
        if i > end:
            break
PY

# Find nearby typing patterns for token-id-like fields
rg -n "token(_id|Id)?|phase_token|begin_.*token|end_.*token" tensorrt_llm/llmapi/llm_args.py

# Check how PositiveInt/NonNegativeInt are used in this file (to align with repo style)
rg -n "NonNegativeInt|PositiveInt|Field\\(.*(ge|gt)=" tensorrt_llm/llmapi/llm_args.py | head -n 80

Repository: NVIDIA/TensorRT-LLM

Length of output: 25048


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python3 - <<'PY'
import pathlib
path = pathlib.Path("tensorrt_llm/_torch/speculative/mtp.py")
# Print only the relevant windows around the earlier matches
for start in (650, 770, 795, 815, 820):
    end = start + 35
    print(f"\n---- {start+1}-{end} ----")
    with path.open("r", encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if start < i <= end:
                print(f"{i}:{line.rstrip()}")
            if i > end:
                break
PY

# Also directly locate begin/end token usage
rg -n "begin_thinking_phase_token|end_thinking_phase_token|relaxed_delta|relaxed_topk" tensorrt_llm/_torch/speculative/mtp.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 11439


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find other top-k / threshold-like config fields and their types
rg -n "topk|top_k|top-k|delta|threshold|thinking_phase_token|phase_token" tensorrt_llm/llmapi/llm_args.py

# Narrow to likely Pydantic config fields (same region around MTP and related speculative decoding)
rg -n "relaxed_topk|relaxed_delta|begin_thinking_phase_token|end_thinking_phase_token" -n tensorrt_llm/llmapi/llm_args.py

# Find other fields that use PositiveInt / NonNegativeFloat specifically
rg -n "PositiveInt|NonNegativeInt|NonNegativeFloat|PositiveFloat" tensorrt_llm/llmapi/llm_args.py | head -n 120

Repository: NVIDIA/TensorRT-LLM

Length of output: 6160


Reinstate numeric Pydantic constraints for MTP relaxed-acceptance config

MTPDecodingConfig defines relaxed_topk/begin_thinking_phase_token/end_thinking_phase_token as unconstrained int and relaxed_delta as unconstrained float in tensorrt_llm/llmapi/llm_args.py (lines ~1756-1802). These values are consumed directly at runtime in tensorrt_llm/_torch/speculative/mtp.py (torch.topk(..., k=self.spec_config.relaxed_topk) and passed into trtllm.mtp_relaxed_acceptance_op), so invalid configs (e.g. relaxed_topk <= 0, negative deltas/tokens) can bypass fail-fast validation and lead to runtime errors or incorrect acceptance behavior.

💡 Suggested fix
-    relaxed_topk: int = Field(
+    relaxed_topk: PositiveInt = Field(
         default=1,
         description=
         "Number of top candidate tokens to consider for relaxed acceptance. Draft token is accepted if it matches any of these."
     )
-    relaxed_delta: float = Field(
+    relaxed_delta: NonNegativeFloat = Field(
         default=0.,
         description=
         "Probability threshold for relaxed acceptance. Only candidates with prob >= (top-1 prob - delta) are kept."
     )
@@
-    begin_thinking_phase_token: int = Field(
+    begin_thinking_phase_token: NonNegativeInt = Field(
         default=128798,
         description=
         "Token ID marking start of thinking phase. Relaxed acceptance only applies within this phase."
     )
-    end_thinking_phase_token: int = Field(
+    end_thinking_phase_token: NonNegativeInt = Field(
         default=128799,
         description=
         "Token ID marking end of thinking phase. Strict acceptance resumes after this."
     )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/llmapi/llm_args.py` around lines 1756 - 1762, MTPDecodingConfig
currently exposes unconstrained numeric fields (relaxed_topk, relaxed_delta,
begin_thinking_phase_token, end_thinking_phase_token) which can produce runtime
errors in torch.topk and trtllm.mtp_relaxed_acceptance_op; update the Pydantic
declarations in MTPDecodingConfig to enforce valid ranges (e.g., relaxed_topk:
int >= 1, relaxed_delta: float >= 0.0, begin_thinking_phase_token and
end_thinking_phase_token: int >= 0) by adding Field constraints (or use pydantic
types like conint/confloat) so invalid configs fail-fast during validation.
Ensure you change the Field(...) definitions for relaxed_topk, relaxed_delta,
begin_thinking_phase_token, and end_thinking_phase_token in MTPDecodingConfig
accordingly and run tests that exercise torch.topk(...) and
trtllm.mtp_relaxed_acceptance_op usage paths.

Source: Coding guidelines

…DIA#12353)"

This reverts commit 8e5d9e2 (NVIDIA#12353).

The MTP-eagle / Eagle3 one-model worker merge regresses the GB200 2-node
disagg perf-sanity gen-only test:

  test_perf_sanity.py::test_e2e[disagg_upload-gen_only-
  gb200_deepseek-v32-fp4_1k1k_con2048_ctx1_dep4_gen1_dep4_eplb0_mtp1_ccb-NIXL]

The GEN server (MTP speculative decoding, num_nextn_predict_layers=1) crashes
during executor init while capturing the generation-phase CUDA graphs:

  _run_cuda_graph_warmup -> _capture_generation_cuda_graphs ->
  torch.cuda.synchronize()
  torch.AcceleratorError: CUDA error: unspecified launch failure

All 4 gen ranks abort (exit 134). Bisect between the last passing post-merge
commit (33b0a32) and the first failing one (316430f) isolates NVIDIA#12353 as the
only change touching the spec-decode / model_engine CUDA-graph capture path;
the remaining commits in the range are AutoDeploy, test, and infra changes.

Reverting restores the gen server startup and unblocks the perf-sanity test.

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
…revert

Remove all 14 deepseek-v32-fp4 perf-sanity test waivers (nvbugs 6280721,
6280649, 6085022, 6200257). These were skipped due to the gen-server
CUDA-graph capture crash introduced by NVIDIA#12353, which is reverted in this
branch, so the cases can run again.

Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>
@chenfeiz0326 chenfeiz0326 force-pushed the revert-mtp-eagle-12353 branch from b1d54c7 to 628ff8a Compare June 10, 2026 10:40
@chenfeiz0326

Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-4"

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator Author

PR_Github #53306 [ run ] triggered by Bot. Commit: 628ff8a Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator Author

PR_Github #53306 [ run ] completed with state FAILURE. Commit: 628ff8a
/LLM/main/L0_MergeRequest_PR pipeline #42493 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants