[None][fix] Revert "Merge Eagle3 and MTP-eagle one-model workers (#12353)" by tensorrt-cicd · Pull Request #15217 · NVIDIA/TensorRT-LLM

tensorrt-cicd · 2026-06-10T10:13:41Z

This reverts commit 8e5d9e2 (#12353).

Summary

The MTP-eagle / Eagle3 one-model worker merge (#12353) regresses the GB200 2-node disagg perf-sanity gen-only test:

test_perf_sanity.py::test_e2e[disagg_upload-gen_only-gb200_deepseek-v32-fp4_1k1k_con2048_ctx1_dep4_gen1_dep4_eplb0_mtp1_ccb-NIXL]

Failure signature

The GEN server (MTP speculative decoding, num_nextn_predict_layers=1) crashes during executor init while capturing the generation-phase CUDA graphs. All 4 gen ranks abort with exit code 134, so the benchmark never runs:

File "model_engine.py", line 1169, in _run_cuda_graph_warmup
    self._capture_generation_cuda_graphs(resource_manager)
File "model_engine.py", line 1240, in _capture_generation_cuda_graphs
    torch.cuda.synchronize()
torch.AcceleratorError: CUDA error: unspecified launch failure

Bisect

Comparing the last passing post-merge commit 33b0a32 with the first failing one 316430f (12 commits): #12353 is the only change touching the spec-decode / model_engine CUDA-graph capture path. The remaining commits in the range are AutoDeploy, test, and infra changes (none touch deepseek-v32 / MoE / attention / CUDA-graph / spec-decode kernels). The cutlass-dsl version is unchanged across the range.

Reverting restores gen-server startup and unblocks the perf-sanity test.

Summary by CodeRabbit

Refactor
- Refactored EAGLE3 one-model speculative decoding to streamline linear-tree processing and remove legacy code paths.
- Simplified speculative decoding worker constructors and method signatures for improved maintainability.
- Consolidated speculative metadata handling and removed redundant parameters.
Chores
- Cleaned up license header comments and reorganized import statements across speculative decoding modules.
- Updated configuration field types in MTPDecodingConfig from constrained to standard numeric types.

coderabbitai · 2026-06-10T10:24:01Z

📝 Walkthrough

Walkthrough

This PR refactors speculative decoding to separate Eagle3 linear-tree logic from MTP one-model paths by removing unified worker patterns, inlining spec metadata helpers, introducing a dedicated MTPEagleWorker class, and updating mode routing across the subsystem.

Changes

Speculative Decoding Eagle3/MTP Refactoring

Layer / File(s)	Summary
Spec metadata helper inlining and draft model signature `tensorrt_llm/_torch/models/modeling_speculative.py`, `tensorrt_llm/_torch/pyexecutor/model_engine.py`	`Eagle3DraftModel.forward` removes the `all_rank_num_tokens` override parameter and its `try/finally` restoration logic. The `_set_spec_metadata_all_rank_num_tokens` helper is removed and its token/sequence assignments are inlined at three call sites in `_prepare_incremental_update_metadata`, `_prepare_tp_inputs`, and `_prepare_tp_inputs_no_cache` using direct list comprehensions.
Eagle3 one-model worker refactoring and simplification `tensorrt_llm/_torch/speculative/eagle3.py`	Module is simplified to focus on linear-tree decoding: imports trimmed, `Eagle3OneModelSpecMetadata` loses `slot_ids` and `subseq_all_rank_num_tokens` fields, `Eagle3OneModelWorker` constructor simplified (removes unified worker parameters), attention metadata is cached/restored via `kv_lens_cuda` preservation, linear draft loop is rewritten with dynamic `attn_metadata.all_rank_num_tokens` updates, sampling/acceptance helpers consolidated, `prepare_1st_drafter_inputs` always applies Eagle3 FC projection, and `MTPEagleWorker` alias removed.
MTPEagleWorker introduction `tensorrt_llm/_torch/speculative/mtp.py`	New `MTPEagleWorker` class implements Eagle-mode draft prediction: adds compiled helpers for draft token updates, position ID derivation, and drafter input preparation; implements `forward()` with Eagle-specific gather/index logic across MTP layers, metadata updates, optional Mamba cache state handling, and SA draft-token overrides.
Speculative interface and dynamic tree signature updates `tensorrt_llm/_torch/speculative/eagle3_dynamic_tree.py`, `tensorrt_llm/_torch/speculative/interface.py`	Method signatures updated: `sample_and_accept_draft_tokens` removes `input_ids` parameter in both dynamic-tree worker and base interface; `SpecWorkerBase.skip_forward` removes `resource_manager` parameter; `support_dynamic_draft_len` now returns only for `EAGLE3_ONE_MODEL`, dropping `MTP_EAGLE_ONE_MODEL`.
Mode routing and resource manager consolidation `tensorrt_llm/_torch/speculative/utils.py`	Mode predicate checks switch from `is_mtp_eagle_one_model()` to `is_mtp_one_model()` for metadata, decoder, and layer-count routing. Resource manager logic refactored: removes Eagle3 fallback for old predicate, adds dedicated MTP-one-model branch returning `MTPHiddenStatesManager`, and conditionally applies `SuffixAutomatonManager`.
Configuration type relaxation and cleanup `tensorrt_llm/llmapi/llm_args.py`, `tensorrt_llm/_torch/models/modeling_exaone_moe.py`, `tensorrt_llm/_torch/speculative/__init__.py`	Remove Pydantic constrained types (`PositiveInt`, `NonNegativeFloat`, `NonNegativeInt`) from `MTPDecodingConfig` fields (`relaxed_topk`, `relaxed_delta`, thinking-phase tokens) and replace with unconstrained `int` and `float`. Remove SPDX/license headers from module files. Reorganize `__init__.py` imports: move `Eagle3SpecMetadata` to separate `eagle3` import and `MTPEagleWorker` to `mtp` import.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#12353: Reverses unified Eagle3+MTP-eagle one-model plumbing by removing all_rank_num_tokens override support and adjusting spec metadata population helpers.

Suggested reviewers

mikeiovine
schetlur-nv
hlu1
nv-guomingz
ruodil

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 27.91% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main change: reverting a previous merge that caused a regression in GB200 disaggregated perf-sanity tests.
Description check	✅ Passed	The description provides a clear summary of why the revert is necessary, including the regression signature, bisect analysis confirming the root cause, and evidence that reverting fixes the issue.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/_torch/speculative/utils.py (1)

46-55: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the checkpoint MTP layer count separate from the runtime draft cap.

update_spec_config_from_model_config() explicitly allows max_draft_len to be smaller than num_nextn_predict_layers, but these two helpers now swap those meanings: get_spec_metadata() reports mtp_num_modules=max_draft_len, while get_num_spec_layers() reports the checkpoint count. A config like num_nextn_predict_layers=4, max_draft_len=2 will make the one-model MTP metadata and layer-count helpers disagree about the same model, which is a good way to mis-size capture/allocation logic.

Suggested fix

     if spec_config.spec_dec_mode.is_mtp_one_model():
         return MTPSpecMetadata(
             max_draft_len=spec_config.max_draft_len,
             max_total_draft_tokens=spec_config.tokens_per_gen_step - 1,
             spec_dec_mode=spec_config.spec_dec_mode,
-            mtp_num_modules=spec_config.max_draft_len,
+            mtp_num_modules=spec_config.num_nextn_predict_layers,
             max_num_requests=max_num_requests,
             mtp_hidden_states_manager=spec_resource_manager,
             allow_advanced_sampling=spec_config.allow_advanced_sampling,
         )
@@
 def get_num_spec_layers(spec_config):
     if spec_config.spec_dec_mode.is_mtp_one_model():
-        return spec_config.num_nextn_predict_layers
+        return spec_config.max_draft_len

Also applies to: 317-318

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/speculative/utils.py` around lines 46 - 55, The
MTPSpecMetadata currently sets mtp_num_modules from spec_config.max_draft_len
which conflates runtime draft cap with the checkpoint layer count; change the
assignment in get_spec_metadata() so mtp_num_modules is derived from the model
checkpoint count (spec_config.num_nextn_predict_layers or equivalent) while
leaving max_draft_len intact for runtime caps, and make the same correction at
the other occurrence (lines 317-318) so get_num_spec_layers() and
MTPSpecMetadata agree on the checkpoint MTP layer count; update any references
to mtp_num_modules accordingly (functions/constructors: get_spec_metadata(),
MTPSpecMetadata, get_num_spec_layers(), update_spec_config_from_model_config()).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/models/modeling_speculative.py`:
- Around line 406-414: The top-level assert unconditionally requires
self.embed_tokens even when inputs_embeds is provided; change it to only assert
when inputs_embeds is None (e.g., move or wrap the check so that if
inputs_embeds is None: assert self.embed_tokens is not None), then proceed to
materialize embeddings with self.embed_tokens(input_ids).to(self.dtype) as
before; make the identical change in MistralLarge3DraftModel.forward to avoid
rejecting valid calls that pass inputs_embeds or use lazy/shared embed_tokens.

In `@tensorrt_llm/_torch/pyexecutor/model_engine.py`:
- Around line 2249-2254: The removed helper that wrapped tp_cp_allgather did
more than unpack values: it also populated the per-rank bookkeeping fields
required for one-model MTP/MTPEagle generation (used during CUDA-graph warmup).
Restore that helper (or reintroduce its logic inline at each call site) so that
after calling tp_cp_allgather you set spec_metadata.all_rank_num_tokens,
spec_metadata.all_rank_num_seqs and the additional mode-specific per-rank fields
the helper previously populated for MTP/MTPEagle generation (the fields consumed
by the generation-side spec worker and CUDA-graph warmup); update the call sites
where you replaced the helper (the locations that currently only set
all_rank_num_tokens/all_rank_num_seqs) to use the full restored helper behavior.
- Around line 3644-3653: The code overwrites attn_metadata.all_rank_num_tokens
with unpadded values (from all_rank_num_tokens) which can conflict with the
padded counts produced by _get_padding_params; to fix, do not assign
attn_metadata.all_rank_num_tokens here—either set only
spec_metadata.all_rank_num_tokens and spec_metadata.all_rank_num_seqs (leave
attn_metadata alone), or if you must update attn_metadata use
attn_metadata.padded_num_tokens (or the padded result from _get_padding_params)
instead of attn_all_rank_num_tokens; update the block that currently assigns
attn_metadata.all_rank_num_tokens, spec_metadata.all_rank_num_tokens, and
spec_metadata.all_rank_num_seqs to follow one of these approaches so downstream
feed-forward/allreduce sees the padded attention counts.

In `@tensorrt_llm/_torch/speculative/eagle3_dynamic_tree.py`:
- Around line 183-186: The constructor of Eagle3OneModelDynamicTreeWorker
currently only asserts use_dynamic_tree but allows spec_config.sa_config, which
causes silent degradation for EAGLE3+SA; update
Eagle3OneModelDynamicTreeWorker.__init__ to explicitly reject SA by asserting or
raising when spec_config.sa_config (or spec_config.sa_enabled) is present, e.g.
check the sa_config field and fail with a clear message stating dynamic-tree
EAGLE3 does not support SA yet so an sa_manager won't be created and
maybe_override_all_draft_tokens won't be called; reference the class/constructor
(Eagle3OneModelDynamicTreeWorker.__init__) and the related symbols sa_config,
sa_manager, and maybe_override_all_draft_tokens so the guard is obvious and
prevents instantiation with SA until full support is implemented.

In `@tensorrt_llm/_torch/speculative/mtp.py`:
- Around line 1236-1256: The loop over self.mtp_num_modules incorrectly always
calls draft_model.mtp_layers[0] (and its shared_head) so later draft steps reuse
the first MTP predictor; change the call sites in the loop to use the loop index
(i) so each iteration invokes draft_model.mtp_layers[i] (and the corresponding
shared_head) instead of index 0; apply the same fix to the other occurrence
noted (the similar block around the 1287-1303 range) to ensure each draft step
uses the matching MTP module.
- Around line 1216-1217: The MTPEagle speculative/draft loop calls
_prepare_attn_metadata_for_spec_dec(attn_metadata) and later
_restore_attn_metadata_from_spec_dec(attn_metadata) but those helpers only
snapshot _seq_lens*; explicitly save attn_metadata.kv_lens_cuda before the draft
loop and restore it after the loop (i.e., wrap the MTPEagle draft/replay section
with a local saved_kv = attn_metadata.kv_lens_cuda and reassign
attn_metadata.kv_lens_cuda = saved_kv on exit) — mirror the explicit
save/restore added in Eagle3OneModelWorker to prevent KV-length leakage between
drafts and replays.

In `@tensorrt_llm/llmapi/llm_args.py`:
- Around line 1756-1762: MTPDecodingConfig currently exposes unconstrained
numeric fields (relaxed_topk, relaxed_delta, begin_thinking_phase_token,
end_thinking_phase_token) which can produce runtime errors in torch.topk and
trtllm.mtp_relaxed_acceptance_op; update the Pydantic declarations in
MTPDecodingConfig to enforce valid ranges (e.g., relaxed_topk: int >= 1,
relaxed_delta: float >= 0.0, begin_thinking_phase_token and
end_thinking_phase_token: int >= 0) by adding Field constraints (or use pydantic
types like conint/confloat) so invalid configs fail-fast during validation.
Ensure you change the Field(...) definitions for relaxed_topk, relaxed_delta,
begin_thinking_phase_token, and end_thinking_phase_token in MTPDecodingConfig
accordingly and run tests that exercise torch.topk(...) and
trtllm.mtp_relaxed_acceptance_op usage paths.

---

Outside diff comments:
In `@tensorrt_llm/_torch/speculative/utils.py`:
- Around line 46-55: The MTPSpecMetadata currently sets mtp_num_modules from
spec_config.max_draft_len which conflates runtime draft cap with the checkpoint
layer count; change the assignment in get_spec_metadata() so mtp_num_modules is
derived from the model checkpoint count (spec_config.num_nextn_predict_layers or
equivalent) while leaving max_draft_len intact for runtime caps, and make the
same correction at the other occurrence (lines 317-318) so get_num_spec_layers()
and MTPSpecMetadata agree on the checkpoint MTP layer count; update any
references to mtp_num_modules accordingly (functions/constructors:
get_spec_metadata(), MTPSpecMetadata, get_num_spec_layers(),
update_spec_config_from_model_config()).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 566938ed-de65-4364-aa05-ef76dd6dd8f3

📥 Commits

Reviewing files that changed from the base of the PR and between 62c6521 and b09dcef.

📒 Files selected for processing (10)

tensorrt_llm/_torch/models/modeling_exaone_moe.py
tensorrt_llm/_torch/models/modeling_speculative.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/speculative/__init__.py
tensorrt_llm/_torch/speculative/eagle3.py
tensorrt_llm/_torch/speculative/eagle3_dynamic_tree.py
tensorrt_llm/_torch/speculative/interface.py
tensorrt_llm/_torch/speculative/mtp.py
tensorrt_llm/_torch/speculative/utils.py
tensorrt_llm/llmapi/llm_args.py

💤 Files with no reviewable changes (1)

tensorrt_llm/_torch/models/modeling_exaone_moe.py

coderabbitai · 2026-06-10T10:24:04Z

+        assert self.embed_tokens is not None

-        try:
-            if (input_ids is None) ^ (inputs_embeds is not None):
-                raise ValueError(
-                    "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
-                )
+        if (input_ids is None) ^ (inputs_embeds is not None):
+            raise ValueError(
+                "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
+            )

-            if inputs_embeds is None:
-                assert self.embed_tokens is not None
-                inputs_embeds = self.embed_tokens(input_ids).to(self.dtype)
-
-            assert hidden_states is not None
-            # NOTE: If hidden states from the target model have to be concatenated,
-            # ideally, we expect that to happen outside the model definition. This
-            # helps us avoid data-dependent control flow and gives us better CUDA
-            # graph coverage.
-            if self._eh_proj_before_attn:
-                input_embeds = self.enorm(inputs_embeds)
-                hidden_states = torch.cat([input_embeds, hidden_states], dim=-1)
-                hidden_states = self.eh_proj(hidden_states)
-
-            residual = None
-            if self.num_layers > 1:
-                for layer in self.midlayer:
-                    if residual is not None:
-                        hidden_states = hidden_states + residual
-                    hidden_states, residual = layer(
-                        position_ids=position_ids,
-                        embeds=inputs_embeds,
-                        hidden_states=hidden_states,
-                        attn_metadata=attn_metadata,
-                        spec_metadata=spec_metadata,
-                    )
-            else:
-                hidden_states, residual = self.midlayer(
+        if inputs_embeds is None:
+            inputs_embeds = self.embed_tokens(input_ids).to(self.dtype)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Only require embed_tokens when inputs_embeds is absent.

Both forward() methods still support the inputs_embeds path, but the new top-level assert now rejects that valid call pattern whenever embeddings are shared or lazily wired from the target model. In those cases, embed_tokens can legitimately be None here and should only be required if you actually need to materialize embeddings from input_ids.

Suggested fix

- assert self.embed_tokens is not None - if (input_ids is None) ^ (inputs_embeds is not None): raise ValueError( "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one" ) if inputs_embeds is None: + if self.embed_tokens is None: + raise ValueError("input_ids requires embed_tokens to be initialized") inputs_embeds = self.embed_tokens(input_ids).to(self.dtype)

Apply the same change in MistralLarge3DraftModel.forward().

Also applies to: 635-643

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/models/modeling_speculative.py` around lines 406 - 414, The top-level assert unconditionally requires self.embed_tokens even when inputs_embeds is provided; change it to only assert when inputs_embeds is None (e.g., move or wrap the check so that if inputs_embeds is None: assert self.embed_tokens is not None), then proceed to materialize embeddings with self.embed_tokens(input_ids).to(self.dtype) as before; make the identical change in MistralLarge3DraftModel.forward to avoid rejecting valid calls that pass inputs_embeds or use lazy/shared embed_tokens.

coderabbitai · 2026-06-10T10:24:04Z

+                spec_metadata.all_rank_num_tokens = [
+                    item[0] for item in all_rank_num_tokens
+                ]
+                spec_metadata.all_rank_num_seqs = [
+                    item[1] for item in all_rank_num_tokens
+                ]


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Restore the mode-specific per-rank spec metadata setup.

The deleted helper did more than unpack tp_cp_allgather(...): it also populated the extra per-rank bookkeeping needed by the one-model MTP/MTPEagle path. These inline replacements only set all_rank_num_tokens and all_rank_num_seqs, so the generation-side spec worker now sees incomplete distributed metadata during CUDA-graph warmup. That lines up with the MTP GEN-capture crash called out in this PR. Please move the full logic back behind a shared helper or inline the missing mode-specific assignment at every call site.

Also applies to: 3477-3482, 3647-3653

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 2249 - 2254, The removed helper that wrapped tp_cp_allgather did more than unpack values: it also populated the per-rank bookkeeping fields required for one-model MTP/MTPEagle generation (used during CUDA-graph warmup). Restore that helper (or reintroduce its logic inline at each call site) so that after calling tp_cp_allgather you set spec_metadata.all_rank_num_tokens, spec_metadata.all_rank_num_seqs and the additional mode-specific per-rank fields the helper previously populated for MTP/MTPEagle generation (the fields consumed by the generation-side spec worker and CUDA-graph warmup); update the call sites where you replaced the helper (the locations that currently only set all_rank_num_tokens/all_rank_num_seqs) to use the full restored helper behavior.

Source: Pipeline failures

coderabbitai · 2026-06-10T10:24:04Z

+                attn_all_rank_num_tokens = [
                    item[0] for item in all_rank_num_tokens
                ]
-                self._set_spec_metadata_all_rank_num_tokens(
-                    spec_metadata, [item[1] for item in all_rank_num_tokens],
-                    [item[2] for item in all_rank_num_tokens])
+                spec_all_rank_num_tokens = [
+                    item[1] for item in all_rank_num_tokens
+                ]
+                all_rank_num_seqs = [item[2] for item in all_rank_num_tokens]
+                attn_metadata.all_rank_num_tokens = attn_all_rank_num_tokens
+                spec_metadata.all_rank_num_tokens = spec_all_rank_num_tokens
+                spec_metadata.all_rank_num_seqs = all_rank_num_seqs


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't overwrite the padded attention-DP token counts here.

Lines 3583-3590 already run _get_padding_params(...), which may replace the raw per-rank attention token counts with the padded capture bucket. This new gather writes attn_metadata.all_rank_num_tokens back from unpadded attn_metadata.num_tokens, so attn_metadata.padded_num_tokens and attn_metadata.all_rank_num_tokens can diverge when piecewise padding is active. Downstream feed-forward/allreduce code reads attn_metadata.all_rank_num_tokens, so this breaks the attention/spec metadata contract. Only populate spec_metadata here, or preserve the padded attention counts computed earlier.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/model_engine.py` around lines 3644 - 3653, The code overwrites attn_metadata.all_rank_num_tokens with unpadded values (from all_rank_num_tokens) which can conflict with the padded counts produced by _get_padding_params; to fix, do not assign attn_metadata.all_rank_num_tokens here—either set only spec_metadata.all_rank_num_tokens and spec_metadata.all_rank_num_seqs (leave attn_metadata alone), or if you must update attn_metadata use attn_metadata.padded_num_tokens (or the padded result from _get_padding_params) instead of attn_all_rank_num_tokens; update the block that currently assigns attn_metadata.all_rank_num_tokens, spec_metadata.all_rank_num_tokens, and spec_metadata.all_rank_num_seqs to follow one of these approaches so downstream feed-forward/allreduce sees the padded attention counts.

coderabbitai · 2026-06-10T10:24:04Z

+        super().__init__(spec_config, mapping, use_separate_draft_kv_cache)
        assert self.use_dynamic_tree, (
            "Eagle3OneModelDynamicTreeWorker requires use_dynamic_tree=True"
        )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep rejecting sa_config here until dynamic-tree EAGLE3 actually supports it.

Removing the guard now lets this worker instantiate with SA enabled, but the dynamic-tree path still never gets an sa_manager and never applies maybe_override_all_draft_tokens() after drafting. That silently degrades EAGLE3+SA dynamic-tree runs back to plain dynamic tree instead of failing fast.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/speculative/eagle3_dynamic_tree.py` around lines 183 - 186, The constructor of Eagle3OneModelDynamicTreeWorker currently only asserts use_dynamic_tree but allows spec_config.sa_config, which causes silent degradation for EAGLE3+SA; update Eagle3OneModelDynamicTreeWorker.__init__ to explicitly reject SA by asserting or raising when spec_config.sa_config (or spec_config.sa_enabled) is present, e.g. check the sa_config field and fail with a clear message stating dynamic-tree EAGLE3 does not support SA yet so an sa_manager won't be created and maybe_override_all_draft_tokens won't be called; reference the class/constructor (Eagle3OneModelDynamicTreeWorker.__init__) and the related symbols sa_config, sa_manager, and maybe_override_all_draft_tokens so the guard is obvious and prevents instantiation with SA until full support is implemented.

coderabbitai · 2026-06-10T10:24:04Z

+        # Save the old attn_metadata and spec_metadata
+        self._prepare_attn_metadata_for_spec_dec(attn_metadata)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Restore kv_lens_cuda around the MTPEagle draft loop.

This path decrements/increments attn_metadata.kv_lens_cuda, but it only uses the base _prepare/_restore_attn_metadata_from_spec_dec() helpers. Those snapshot _seq_lens*, not the KV lengths, so the target step can inherit draft-loop KV metadata on the next replay. Eagle3OneModelWorker already had to add an explicit save/restore path for the same field.

Also applies to: 1324-1328, 1344-1360

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/speculative/mtp.py` around lines 1216 - 1217, The MTPEagle speculative/draft loop calls _prepare_attn_metadata_for_spec_dec(attn_metadata) and later _restore_attn_metadata_from_spec_dec(attn_metadata) but those helpers only snapshot _seq_lens*; explicitly save attn_metadata.kv_lens_cuda before the draft loop and restore it after the loop (i.e., wrap the MTPEagle draft/replay section with a local saved_kv = attn_metadata.kv_lens_cuda and reassign attn_metadata.kv_lens_cuda = saved_kv on exit) — mirror the explicit save/restore added in Eagle3OneModelWorker to prevent KV-length leakage between drafts and replays.

coderabbitai · 2026-06-10T10:24:04Z

+            for i in range(self.mtp_num_modules):
+                if i == 0:
+                    hidden_states = draft_model.mtp_layers[0](
+                        embed_tokens=draft_model.embed_tokens,
+                        all_rank_num_tokens=spec_metadata.all_rank_num_tokens,
+                        **inputs)
+
+                    start_ids_gen = (
+                        spec_metadata.batch_indices_cuda[:num_gens] *
+                        (self.mtp_num_modules + 1)).long()
+                    gather_ids_gen = (start_ids_gen +
+                                      num_accepted_tokens[num_contexts:] - 1 +
+                                      attn_metadata.num_ctx_tokens)
+                    gather_ids = torch.concat(
+                        [last_tokens_idx[:num_contexts], gather_ids_gen], dim=0)
+                else:
+                    hidden_states = draft_model.mtp_layers[0](
+                        embed_tokens=draft_model.embed_tokens,
+                        all_rank_num_tokens=spec_metadata.
+                        subseq_all_rank_num_tokens,
+                        **inputs)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the matching MTP module on each draft step.

This loop always calls draft_model.mtp_layers[0] and its shared_head, regardless of i. For max_draft_len > 1, steps 1..N keep reusing the first predictor instead of the later trained MTP modules, so the drafter emits the wrong token sequence.

Suggested fix

- with self.draft_kv_cache_context(attn_metadata, draft_kv_cache_manager): - for i in range(self.mtp_num_modules): + with self.draft_kv_cache_context(attn_metadata, draft_kv_cache_manager): + for i, mtp_layer in enumerate(draft_model.mtp_layers): if i == 0: - hidden_states = draft_model.mtp_layers[0]( + hidden_states = mtp_layer( embed_tokens=draft_model.embed_tokens, all_rank_num_tokens=spec_metadata.all_rank_num_tokens, **inputs) @@ else: - hidden_states = draft_model.mtp_layers[0]( + hidden_states = mtp_layer( embed_tokens=draft_model.embed_tokens, all_rank_num_tokens=spec_metadata. subseq_all_rank_num_tokens, **inputs) @@ - logits = draft_model.mtp_layers[0].shared_head( + logits = mtp_layer.shared_head( padded_hidden_states, draft_model.lm_head, attn_metadata, True) else: - logits = draft_model.mtp_layers[0].shared_head( + logits = mtp_layer.shared_head( hidden_states[gather_ids], draft_model.lm_head, attn_metadata, True) @@ - mapping_lm_head_tp = draft_model.mtp_layers[ - 0].shared_head.mapping_lm_head_tp + mapping_lm_head_tp = mtp_layer.shared_head.mapping_lm_head_tp

Also applies to: 1287-1303

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/speculative/mtp.py` around lines 1236 - 1256, The loop over self.mtp_num_modules incorrectly always calls draft_model.mtp_layers[0] (and its shared_head) so later draft steps reuse the first MTP predictor; change the call sites in the loop to use the loop index (i) so each iteration invokes draft_model.mtp_layers[i] (and the corresponding shared_head) instead of index 0; apply the same fix to the other occurrence noted (the similar block around the 1287-1303 range) to ensure each draft step uses the matching MTP module.

coderabbitai · 2026-06-10T10:24:04Z

+    relaxed_topk: int = Field(
        default=1,
        description=
        "Number of top candidate tokens to consider for relaxed acceptance. Draft token is accepted if it matches any of these."
    )
-    relaxed_delta: NonNegativeFloat = Field(
-        default=0.0,
+    relaxed_delta: float = Field(
+        default=0.,


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Verify unconstrained field types in MTPDecodingConfig. rg -n -A4 -B4 "class MTPDecodingConfig|relaxed_topk|relaxed_delta|begin_thinking_phase_token|end_thinking_phase_token" tensorrt_llm/llmapi/llm_args.py # Verify these fields are consumed directly by MTP runtime ops. rg -n -A4 -B4 "torch.topk|mtp_relaxed_acceptance_op|relaxed_topk|relaxed_delta|begin_thinking_phase_token|end_thinking_phase_token" tensorrt_llm/_torch/speculative/mtp.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 6343

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the full MTPDecodingConfig block for surrounding patterns (line window kept small) python3 - <<'PY' import itertools, pathlib path = pathlib.Path("tensorrt_llm/llmapi/llm_args.py") start, end = 1725, 1840 with path.open("r", encoding="utf-8") as f: for i, line in enumerate(f, start=1): if start <= i <= end: print(f"{i}:{line.rstrip()}") if i > end: break PY # Find nearby typing patterns for token-id-like fields rg -n "token(_id|Id)?|phase_token|begin_.*token|end_.*token" tensorrt_llm/llmapi/llm_args.py # Check how PositiveInt/NonNegativeInt are used in this file (to align with repo style) rg -n "NonNegativeInt|PositiveInt|Field\\(.*(ge|gt)=" tensorrt_llm/llmapi/llm_args.py | head -n 80

Repository: NVIDIA/TensorRT-LLM

Length of output: 25048

🏁 Script executed:

#!/bin/bash set -euo pipefail python3 - <<'PY' import pathlib path = pathlib.Path("tensorrt_llm/_torch/speculative/mtp.py") # Print only the relevant windows around the earlier matches for start in (650, 770, 795, 815, 820): end = start + 35 print(f"\n---- {start+1}-{end} ----") with path.open("r", encoding="utf-8") as f: for i, line in enumerate(f, start=1): if start < i <= end: print(f"{i}:{line.rstrip()}") if i > end: break PY # Also directly locate begin/end token usage rg -n "begin_thinking_phase_token|end_thinking_phase_token|relaxed_delta|relaxed_topk" tensorrt_llm/_torch/speculative/mtp.py

Repository: NVIDIA/TensorRT-LLM

Length of output: 11439

🏁 Script executed:

#!/bin/bash set -euo pipefail # Find other top-k / threshold-like config fields and their types rg -n "topk|top_k|top-k|delta|threshold|thinking_phase_token|phase_token" tensorrt_llm/llmapi/llm_args.py # Narrow to likely Pydantic config fields (same region around MTP and related speculative decoding) rg -n "relaxed_topk|relaxed_delta|begin_thinking_phase_token|end_thinking_phase_token" -n tensorrt_llm/llmapi/llm_args.py # Find other fields that use PositiveInt / NonNegativeFloat specifically rg -n "PositiveInt|NonNegativeInt|NonNegativeFloat|PositiveFloat" tensorrt_llm/llmapi/llm_args.py | head -n 120

Repository: NVIDIA/TensorRT-LLM

Length of output: 6160

Reinstate numeric Pydantic constraints for MTP relaxed-acceptance config

MTPDecodingConfig defines relaxed_topk/begin_thinking_phase_token/end_thinking_phase_token as unconstrained int and relaxed_delta as unconstrained float in tensorrt_llm/llmapi/llm_args.py (lines ~1756-1802). These values are consumed directly at runtime in tensorrt_llm/_torch/speculative/mtp.py (torch.topk(..., k=self.spec_config.relaxed_topk) and passed into trtllm.mtp_relaxed_acceptance_op), so invalid configs (e.g. relaxed_topk <= 0, negative deltas/tokens) can bypass fail-fast validation and lead to runtime errors or incorrect acceptance behavior.

💡 Suggested fix

- relaxed_topk: int = Field( + relaxed_topk: PositiveInt = Field( default=1, description= "Number of top candidate tokens to consider for relaxed acceptance. Draft token is accepted if it matches any of these." ) - relaxed_delta: float = Field( + relaxed_delta: NonNegativeFloat = Field( default=0., description= "Probability threshold for relaxed acceptance. Only candidates with prob >= (top-1 prob - delta) are kept." ) @@ - begin_thinking_phase_token: int = Field( + begin_thinking_phase_token: NonNegativeInt = Field( default=128798, description= "Token ID marking start of thinking phase. Relaxed acceptance only applies within this phase." ) - end_thinking_phase_token: int = Field( + end_thinking_phase_token: NonNegativeInt = Field( default=128799, description= "Token ID marking end of thinking phase. Strict acceptance resumes after this." )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/llmapi/llm_args.py` around lines 1756 - 1762, MTPDecodingConfig currently exposes unconstrained numeric fields (relaxed_topk, relaxed_delta, begin_thinking_phase_token, end_thinking_phase_token) which can produce runtime errors in torch.topk and trtllm.mtp_relaxed_acceptance_op; update the Pydantic declarations in MTPDecodingConfig to enforce valid ranges (e.g., relaxed_topk: int >= 1, relaxed_delta: float >= 0.0, begin_thinking_phase_token and end_thinking_phase_token: int >= 0) by adding Field constraints (or use pydantic types like conint/confloat) so invalid configs fail-fast during validation. Ensure you change the Field(...) definitions for relaxed_topk, relaxed_delta, begin_thinking_phase_token, and end_thinking_phase_token in MTPDecodingConfig accordingly and run tests that exercise torch.topk(...) and trtllm.mtp_relaxed_acceptance_op usage paths.

Source: Coding guidelines

…DIA#12353)" This reverts commit 8e5d9e2 (NVIDIA#12353). The MTP-eagle / Eagle3 one-model worker merge regresses the GB200 2-node disagg perf-sanity gen-only test: test_perf_sanity.py::test_e2e[disagg_upload-gen_only- gb200_deepseek-v32-fp4_1k1k_con2048_ctx1_dep4_gen1_dep4_eplb0_mtp1_ccb-NIXL] The GEN server (MTP speculative decoding, num_nextn_predict_layers=1) crashes during executor init while capturing the generation-phase CUDA graphs: _run_cuda_graph_warmup -> _capture_generation_cuda_graphs -> torch.cuda.synchronize() torch.AcceleratorError: CUDA error: unspecified launch failure All 4 gen ranks abort (exit 134). Bisect between the last passing post-merge commit (33b0a32) and the first failing one (316430f) isolates NVIDIA#12353 as the only change touching the spec-decode / model_engine CUDA-graph capture path; the remaining commits in the range are AutoDeploy, test, and infra changes. Reverting restores the gen server startup and unblocks the perf-sanity test. Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

…revert Remove all 14 deepseek-v32-fp4 perf-sanity test waivers (nvbugs 6280721, 6280649, 6085022, 6200257). These were skipped due to the gen-server CUDA-graph capture crash introduced by NVIDIA#12353, which is reverted in this branch, so the cases can run again. Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com>

chenfeiz0326 · 2026-06-10T11:45:13Z

/bot run --disable-fail-fast --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Disagg-PerfSanity-CTX1-NODE1-GPU4-GEN1-NODE1-GPU4-Post-Merge-4"

tensorrt-cicd · 2026-06-10T11:50:54Z

PR_Github #53306 [ run ] triggered by Bot. Commit: 628ff8a Link to invocation

tensorrt-cicd · 2026-06-10T13:10:26Z

PR_Github #53306 [ run ] completed with state FAILURE. Commit: 628ff8a
/LLM/main/L0_MergeRequest_PR pipeline #42493 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

tensorrt-cicd requested review from a team as code owners June 10, 2026 10:13

tensorrt-cicd requested review from mikeiovine and symphonylyh June 10, 2026 10:13

github-actions Bot assigned tensorrt-cicd Jun 10, 2026

coderabbitai Bot reviewed Jun 10, 2026

View reviewed changes

chenfeiz0326 added 2 commits June 10, 2026 10:39

chenfeiz0326 force-pushed the revert-mtp-eagle-12353 branch from b1d54c7 to 628ff8a Compare June 10, 2026 10:40

karljang mentioned this pull request Jun 10, 2026

fix(moe): fix IndexError in _reconcile_and_compute_alphas with EPLB #15145

Open

		# Save the old attn_metadata and spec_metadata
		self._prepare_attn_metadata_for_spec_dec(attn_metadata)

Conversation

tensorrt-cicd commented Jun 10, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Failure signature

Bisect

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 10, 2026

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

chenfeiz0326 commented Jun 10, 2026

Uh oh!

tensorrt-cicd commented Jun 10, 2026

Uh oh!

tensorrt-cicd commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tensorrt-cicd commented Jun 10, 2026 •

edited by coderabbitai Bot

Loading