
[TRTLLM-12291][feat] New sharding infrastructure #12419

Merged
greg-kwasniewski1 merged 17 commits into NVIDIA:main from nv-auto-deploy:gk/nemotron-h-new-sharding-poc
Apr 20, 2026

Conversation

greg-kwasniewski1 (Collaborator) commented Mar 21, 2026

Fixes #12291

Summary by CodeRabbit

Release Notes

  • New Features

    • Added deployment configurations for multiple models including DeepSeek, Qwen, Llama, Mistral, and others with optimized tensor parallelism and kernel fusion settings.
    • Added comprehensive model implementations for auto-deploy export across 30+ model architectures.
    • Enhanced tensor parallelism support with improved sharding hints and custom operations.
  • Improvements

    • Updated MoE fusion configuration to support models with varying quantization scales.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@greg-kwasniewski1
Collaborator Author

/bot run


coderabbitai Bot commented Mar 21, 2026

📝 Walkthrough

This pull request introduces a sharding IR (Intermediate Representation) system for AutoDeploy, enabling tensor and expert parallelism across distributed model execution. Changes include new POC configurations for multiple model architectures, sharding-aware metadata parameters added to custom operations, new sharding-specific custom ops, updated model implementations optimized for AutoDeploy export, and configuration updates to support the apply_sharding_hints transform pipeline.

Changes

Cohort / File(s) Summary
Sharding POC Configurations
examples/auto_deploy/new_sharding/*/.../*.yaml
Added 9 new YAML POC configurations for DeepSeek (R1, V2.5), InternLM3, Llama3, Mistral, Nemotron (FP8, NVFP4), Qwen (3, 3.5-MoE), and SmolLM3, each specifying world size, runtime, compute limits, KV cache settings, and sharding transform pipeline with apply_sharding_hints enabled and explicit sharding detection/executor disabled.
Core Infrastructure Configuration
tensorrt_llm/_torch/auto_deploy/config/default.yaml, tensorrt_llm/_torch/auto_deploy/llm_args.py
Updated MoE fusion config to allow differing input scales for FP8 MoE passes; modified init_mapping_from_config() to prefer apply_sharding_hints parameters when available, falling back to detect_sharding.
Sharding Custom Operations
tensorrt_llm/_torch/auto_deploy/custom_ops/sharding_ops.py
Added three new sharding-aware custom ops (view, split_with_sizes, all_reduce) that accept sharding metadata parameters (tp_scaled_dim, layer_type, shardable) for later transformation by apply_sharding_hints graph pass; runtime behavior is identity-like tensor transformations.
Linear and Quantization Custom Ops
tensorrt_llm/_torch/auto_deploy/custom_ops/linear/linear.py, tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/*.py
Extended torch_linear_simple, quantization ops (FP8, FP4, INT4, finegrained), and related fakes with new metadata parameters (tp_mode, output_sizes, tp_min_local_shape, layer_type); added cdiv() helper for ceiling division in block-size computations; updated INT4 eager path to forward tensor-parallelism parameters.
Attention and Advanced Ops
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/torch_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mla/torch_mla.py, tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/torch_moe.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/*.py, tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_gated_delta.py, tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/rms_norm.py
Extended signatures of attention, MLA, MoE, Mamba, and normalization custom ops to accept sharding metadata (layer_type, shardable, tp_mode, output_sizes); replaced mapping deserialization with DistConfig in MoE/MOE-related ops; updated corresponding fake/meta implementations.
MoE Infrastructure
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py
Updated MoE alltoall checking to use DistConfig deserialization instead of legacy deserialize_mapping, computing enable_alltoall from DistConfig fields.
Attention/RoPE Related
tensorrt_llm/_torch/auto_deploy/models/custom/mla_rope_utils.py
Added float8 CPU workaround for index-select operations via byte-view reinterpretation; updated RoPE permutation hooks to handle temporary dtype conversion for float8 weights.
Dense Models: Llama Family
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_llama3.py, modeling_llama4.py
Implemented Llama3 and Llama4 decoder-only causal LMs using AutoDeploy custom ops for RMSNorm, RoPE, and attention; Llama4 adds L2 QK norm, complex RoPE, MoE routing, and multimodal wrapper support; registered with AutoModelForCausalLMFactory.
Dense Models: Mistral & Variants
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_mistral.py, modeling_mistral3.py
Implemented Mistral (base and v3.1 text/multimodal variants) using AutoDeploy ops; v3.1 adds vision tower integration and conditional generation support; registered with factories.
Dense Models: Gemma Family
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma.py, modeling_gemma2.py, modeling_gemma3.py
Implemented Gemma (v1, v2, v3) models with varying attention mechanisms (causal, sliding-window, attention softcapping); Gemma3 adds multimodal wrapper; all registered with AutoModelForCausalLMFactory.
Dense Models: Qwen Family
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen2.py, modeling_qwen3.py, modeling_qwen3_5.py, modeling_qwen3_moe.py, modeling_qwen3_next.py
Implemented Qwen2/3 dense and Qwen3-MoE/3-Next variants with AutoDeploy ops; Qwen3_5 adds GatedDeltaNet linear attention and vision support; Qwen3-Next adds gated RMSNorm and complex linear attention; all registered with factories.
Dense Models: Other Architectures
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_cohere.py, modeling_exaone.py, modeling_internlm3.py, modeling_phi4.py, modeling_qwen2.py, modeling_starcoder2.py, modeling_seed_oss.py, modeling_olmo3.py
Implemented standalone dense causal LMs (Cohere, EXAONE, InternLM3, Phi4, Starcoder2, Seed-OSS, OLMo3) with AutoDeploy ops for normalization, RoPE, and attention; all registered with AutoModelForCausalLMFactory.
MoE Models: DeepSeek V2 & Variants
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_v2.py
Implemented DeepSeekV2 with MoE gating (greedy/group-limited-greedy top-k), multi-head latent attention (MLA), YaRN RoPE scaling, and shared expert paths using AutoDeploy torch_moe op; registered with factory.
MoE Models: HunYuan & Variants
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_hunyuan_moe.py, modeling_hunyuan_dense.py
Implemented HunYuan MoE (softmax top-k routing) and HunYuan Dense with GQA attention using AutoDeploy ops; both registered with AutoModelForCausalLMFactory.
MoE Models: GLM Variants
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm4_moe.py, modeling_glm_moe_dsa.py
Implemented GLM4-MoE with Q/K normalization and GLM-MoE-DSA (supports MLA + MoE with fused routing ops); both registered with AutoModelForCausalLMFactory.
Specialized/Hybrid Models
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_granite.py, modeling_granite_moe_hybrid.py, modeling_gpt_oss.py, modeling_minimax_m2.py
Implemented Granite (dense, with residual/attention/embedding multipliers), GraniteMoeHybrid (attention/Mamba selection per layer with optional MoE), GPT-OSS (MoE with MXFP4 dequant), and MiniMax-M2 (MoE with sigmoid routing); all registered with factories.
Vision/Multimodal Models
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4_visionr.py, modeling_phi4mm.py, modeling_phi4flash.py
Implemented Phi-4 vision-reasoning (flattened config, SigLIP2 vision + text), Phi-4 multimodal (image/audio LoRA paths with model-switching), and Phi-4-Flash (mini-reasoning with Mamba/SSM layers); registered with multiple factories including image-text-to-text paths.
Special Model Updates
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_decilm.py, modeling_skywork_r1v2.py
Implemented DeciLM (Nemotron-NAS Llama variant with optional per-layer attention/FFN skipping) and Skywork-R1V2 (LLM + eager vision tower); both registered with AutoModelForCausalLMFactory.
Module Initialization
tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/__init__.py
Added empty module init file with Apache-2.0 license header for new sharding subpackage.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes


coderabbitai Bot left a comment


Actionable comments posted: 17

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/torch_moe.py (1)

429-448: ⚠️ Potential issue | 🔴 Critical

Add layer_type parameter to FP8/NVFP4 MoE ops to prevent crashes when shard_layers filtering is used.

The torch_quant_fp8_moe and torch_quant_nvfp4_moe ops lack the layer_type parameter present in torch_moe. This causes apply_sharding_hints to crash with a RuntimeError when shard_layers config filtering is enabled, because it unconditionally attempts to extract layer_type from all MoE ops via extract_op_args(node, "layer_type"), which fails if the parameter is not in the schema (line 3927 of sharding.py).

Add layer_type: str = "unknown" parameter to both torch_quant_fp8_moe and torch_quant_nvfp4_moe function signatures and their corresponding fake implementations to match torch_moe.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/torch_moe.py` around
lines 429 - 448, The FP8/NVFP4 MoE custom ops are missing the layer_type
parameter which causes apply_sharding_hints to crash when shard_layers filtering
calls extract_op_args(node, "layer_type"); update both function signatures
torch_quant_fp8_moe and torch_quant_nvfp4_moe to include layer_type: str =
"unknown" and add the same parameter to their corresponding fake implementations
so their schemas match torch_moe (use the default "unknown"); ensure any
internal usages or return signatures remain unchanged.
tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py (1)

560-566: ⚠️ Potential issue | 🟠 Major

Use ceil-division when inferring fine-grained FP8 block sizes.

weight_scale_inv now represents ceil(N / block_n) x ceil(K / block_k). The floor divisions here under-estimate the block size for tiny or non-divisible projections, so _safe_act_quant() and w8a8_block_fp8_matmul_triton() run with the wrong geometry.

Suggested fix
-    block_n = N // scale_n
-    block_k = K // scale_k
+    block_n = cdiv(N, scale_n)
+    block_k = cdiv(K, scale_k)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py`
around lines 560 - 566, The code infers block_size using floor division which
underestimates FP8 block dimensions for non-divisible shapes; change the
computation in the block where N, K = weight_quantized.shape and scale_n,
scale_k = weight_scale_inv.shape to use ceil-division so block_n = ceil(N /
scale_n) and block_k = ceil(K / scale_k) (ensure math.ceil is
available/imported) and set block_size = [block_n, block_k]; this will align the
inferred block geometry used by _safe_act_quant() and
w8a8_block_fp8_matmul_triton() with the ceil-based shape of weight_scale_inv.
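The ceil-division fix suggested above can be illustrated with a standalone sketch. The cdiv here is a plausible stand-in for the PR's helper (the actual implementation may differ); the assertions show why floor division under-estimates block sizes for non-divisible shapes:

```python
def cdiv(n: int, d: int) -> int:
    """Ceiling division without floats: smallest integer >= n / d (d > 0)."""
    return -(-n // d)

# Suppose a weight has N = 7 rows quantized with block_n = 4, so
# weight_scale_inv stores ceil(7 / 4) = 2 scale entries along that dim.
N, scale_n = 7, 2
assert N // scale_n == 3      # floor division infers block_n = 3 (wrong geometry)
assert cdiv(N, scale_n) == 4  # ceil division recovers the true block_n = 4
```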
🟡 Minor comments (6)
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_mamba.py-193-196 (1)

193-196: ⚠️ Potential issue | 🟡 Minor

Docstring references non-existent tp_mode parameter.

The docstring on line 196 mentions tp_mode but the actual parameters added are shardable and layer_type. This appears to be a copy-paste error.

📝 Proposed fix for docstring
-    """Mamba SSM mixer forward; accepts ``tp_mode`` for sharding-aware AutoDeploy behavior."""
+    """Mamba SSM mixer forward; accepts sharding metadata for AutoDeploy behavior."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_mamba.py` around lines
193 - 196, The docstring for the "Mamba SSM mixer forward" function incorrectly
references a non-existent parameter `tp_mode`; update the docstring to reflect
the actual parameters (`shardable: bool` and `layer_type: str`) and remove any
mention of `tp_mode`. Edit the docstring in torch_mamba.py (the Mamba SSM mixer
forward docstring) to document `shardable` and `layer_type` briefly and
accurately (their types and purpose) so the signature and docstring match.
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_causal_conv.py-34-38 (1)

34-38: ⚠️ Potential issue | 🟡 Minor

Docstring references non-existent tp_mode parameter.

Similar to torch_mamba.py, the docstring mentions tp_mode but the actual parameters are shardable, output_sizes, and layer_type.

📝 Proposed fix for docstring
-    """Causal 1D convolution; accepts ``tp_mode`` for sharding-aware AutoDeploy behavior."""
+    """Causal 1D convolution; accepts sharding metadata for AutoDeploy behavior."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_causal_conv.py` around
lines 34 - 38, The docstring incorrectly mentions a non-existent parameter
`tp_mode`; update the docstring in torch_causal_conv.py to remove `tp_mode` and
instead describe the actual parameters `shardable`, `output_sizes`, and
`layer_type` and their effect (e.g., sharding-aware AutoDeploy behavior). Locate
the function whose signature includes shardable: bool = False, output_sizes:
Optional[List[int]] = None, layer_type: str = "unknown" and revise its docstring
text to accurately reflect those parameters and intended behavior.
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4flash.py-1-1 (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Add full Apache 2.0 license header.

The file has only a short copyright line but is missing the full Apache 2.0 license block required by coding guidelines for all TensorRT-LLM source files.

📝 Suggested license header
-# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

As per coding guidelines: "All TensorRT-LLM source files should contain an NVIDIA copyright header with the year of the latest meaningful modification."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4flash.py` at line
1, The file modeling_phi4flash.py currently only has a short copyright line;
replace or prepend it with the full Apache 2.0 license header used across
TensorRT-LLM source files (including the NVIDIA copyright line with the latest
modification year and the full license text and notice), ensuring the complete
header appears at the top of
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4flash.py and follows
the same formatting and year as other project files.
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm_moe_dsa.py-1-1 (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Update copyright year and add full license header.

The copyright year is 2025 but should be 2026 (current year). Also, the file is missing the full Apache 2.0 license block.

📝 Suggested fix
-# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm_moe_dsa.py` at
line 1, Update the file header to use the current year and include the complete
Apache License 2.0 header: change the copyright year from 2025 to 2026 at the
top of modeling_glm_moe_dsa.py and replace the existing short copyright line
with the full Apache-2.0 license block (including copyright statement, "Licensed
under the Apache License, Version 2.0 (the "License")" wording, link to the
license, and the standard disclaimer and permissions paragraphs) so the file
contains the standard full license header.
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_hunyuan_moe.py-419-419 (1)

419-419: ⚠️ Potential issue | 🟡 Minor

Update the misleading comment about dynamic assignment.

The comment "set dynamically below when used with real HunYuanConfig" is inaccurate—config_class is not assigned anywhere in the file. The config_class = None is intentional for custom models using HuggingFace configs loaded via trust_remote_code. Update the comment to clarify this pattern, similar to the approach in modeling_skywork_r1v2.py: config_class = None # HunYuanConfig uses trust_remote_code; not imported here.

The instantiation concern is unfounded because the actual config is passed directly to __init__() and passed to super().__init__(config), so config_class being None does not prevent initialization.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_hunyuan_moe.py` at
line 419, Update the misleading inline comment for the module-level variable
config_class = None: replace "set dynamically below when used with real
HunYuanConfig" with a clear explanation that this pattern intentionally leaves
config_class None because HunYuanConfig is provided via trust_remote_code and
not imported here (e.g., "config_class = None  # HunYuanConfig uses
trust_remote_code; not imported here"). Edit the comment near the config_class
declaration in modeling_hunyuan_moe.py to match the wording used in
modeling_skywork_r1v2.py so readers understand the config is passed to __init__
and super().__init__(config) rather than assigned in this file.
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_decilm.py-320-324 (1)

320-324: ⚠️ Potential issue | 🟡 Minor

Add explicit config_class = None assignment with explanatory comment.

To match the pattern used in similar models that load config via trust_remote_code (e.g., SkyworkR1V2), explicitly set config_class = None instead of omitting it. This makes the intentional design choice explicit and avoids any ambiguity:

class DeciLMPreTrainedModel(PreTrainedModel):
    config_class = None  # Config loaded via trust_remote_code from HF checkpoint
    base_model_prefix = "model"
    _no_split_modules = ["DeciLMDecoderLayer"]
    _supports_flash_attn_2 = True
    supports_gradient_checkpointing = False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_decilm.py` around
lines 320 - 324, Add an explicit config_class assignment to the
DeciLMPreTrainedModel class to document that config is loaded via
trust_remote_code: set config_class = None with a short comment (e.g., "Config
loaded via trust_remote_code from HF checkpoint") inside the class
DeciLMPreTrainedModel so the intent is explicit alongside existing attributes
like base_model_prefix, _no_split_modules, _supports_flash_attn_2, and
supports_gradient_checkpointing.
🧹 Nitpick comments (22)
tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/__init__.py (1)

1-14: Consider adding a module docstring to document the package.

The file is valid as an empty package initializer. However, adding a docstring after the license header would help developers understand the purpose of the new_sharding package, especially given that this PR introduces new sharding infrastructure.

📝 Example package docstring
 # See the License for the specific language governing permissions and
 # limitations under the License.
+"""
+New sharding infrastructure for AutoDeploy.
+
+This package provides sharding-aware transformations and configurations
+for distributed model execution with tensor and expert parallelism.
+"""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/__init__.py`
around lines 1 - 14, Add a concise module-level docstring to the __init__.py for
the new_sharding package (the package introduced here) immediately after the
license header: include a short description of the package purpose (new sharding
infrastructure), list exported submodules/classes/functions or where to find
them (e.g., sharding strategies, utilities), and any usage/compatibility notes
or maintainer/contact info; implement this as a top-of-file triple-quoted string
so tools like pydoc and IDEs surface it.
tensorrt_llm/_torch/auto_deploy/custom_ops/sharding_ops.py (1)

84-97: Inconsistent fake implementation pattern for all_reduce.

The real all_reduce implementation returns x.clone() (line 92), but the fake implementation returns torch.empty_like(x) (line 97). This differs from view and split_with_sizes where both real and fake implementations use clone().

While empty_like is sufficient for shape inference during tracing, this inconsistency could cause issues if fake tensor values are ever inspected during debugging or validation. Consider using x.clone() for consistency.

♻️ Proposed fix for consistency
 `@all_reduce.register_fake`
 def _all_reduce_fake(x: torch.Tensor, layer_type: str = "unknown") -> torch.Tensor:
-    return torch.empty_like(x)
+    return x.clone()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/sharding_ops.py` around lines 84 -
97, The fake implementation of the custom op all_reduce (_all_reduce_fake)
returns torch.empty_like(x) which is inconsistent with the real all_reduce
(which returns x.clone()); change _all_reduce_fake to return x.clone() so fake
and real implementations match value-preserving behavior—update the
_all_reduce_fake function in sharding_ops.py to return a clone of the input
tensor (use the all_reduce and _all_reduce_fake identifiers to find the code).
tensorrt_llm/_torch/auto_deploy/models/custom/mla_rope_utils.py (1)

62-70: Redundant workaround call after bfloat16 cast.

After casting w to bfloat16 (line 64), calling _index_select_with_float8_cpu_workaround on line 68 will never trigger the workaround path since w_rope.dtype is now bfloat16, not a float8 dtype. The function will simply fall through to the regular index_select. While this is not incorrect, it adds unnecessary overhead of checking the dtype condition.

Consider either:

  1. Calling the regular index_select directly when the bfloat16 cast path is taken, or
  2. Documenting that this is intentional for code uniformity
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/mla_rope_utils.py` around lines
62 - 70, The code casts tensor w to bfloat16 and then calls
_index_select_with_float8_cpu_workaround on w_rope which will never hit its
float8 branch, causing an unnecessary dtype check; in the block around variables
w, w_rope and q_key inside mla_rope_utils.py either (A) perform the index
selection with torch.index_select (or torch.take/advanced indexing) directly on
w_rope when orig_dtype was converted to torch.bfloat16, or (B) defer the
bfloat16 cast until after calling _index_select_with_float8_cpu_workaround so
that the workaround can still run for float8 dtypes—update the code to choose
one of these paths and remove the redundant dtype-checking call to
_index_select_with_float8_cpu_workaround when w has already been cast to
bfloat16.
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py (1)

23-43: Use module-level import for DistConfig to match the repo import rule.

Please import the module and reference DistConfig via namespace instead of importing the class directly.

Proposed diff
-from tensorrt_llm._torch.auto_deploy.utils.dist_config import DistConfig
+from tensorrt_llm._torch.auto_deploy.utils import dist_config
@@
-    dc = DistConfig.deserialize(mapping_config)
+    dc = dist_config.DistConfig.deserialize(mapping_config)

As per coding guidelines: "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py` around
lines 23 - 43, Replace the direct class import with a module-level import and
update usages accordingly: change the import of DistConfig in the module to
import the dist_config module (e.g.
tensorrt_llm._torch.auto_deploy.utils.dist_config) and update the call in
_check_moe_alltoall to use the module namespace (e.g.
dist_config.DistConfig.deserialize(...)) and any other references to DistConfig
in this file to dist_config.DistConfig so the repo import rule (module-level
imports) is followed.
tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/rms_norm.py (1)

83-104: Non-gated RMSNorm ops currently lack sharding metadata and are not processed by apply_sharding_hints.

The non-gated variants (torch_rmsnorm, triton_rmsnorm, flashinfer_rmsnorm) have only 3 parameters: (input, weight, eps). In contrast, their gated counterparts (torch_rmsnorm_gated, triton_rmsnorm_gated) include tp_mode and layer_type metadata that allows apply_sharding_hints to apply tensor-parallel sharding.

is_any_shardable_op currently detects only the gated variants for ShardableOp.NORM, so non-gated variants are never passed to _apply_hint_norm. If future changes require sharding non-gated variants, consider adding the same tp_mode and layer_type parameters to enable consistent sharding treatment across all RMSNorm backends.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/rms_norm.py` around
lines 83 - 104, Non-gated RMSNorm ops (torch_rmsnorm, triton_rmsnorm,
flashinfer_rmsnorm) lack the tp_mode/layer_type metadata so apply_sharding_hints
and is_any_shardable_op (which currently checks only gated variants like
torch_rmsnorm_gated and triton_rmsnorm_gated for ShardableOp.NORM) never route
them to _apply_hint_norm; fix by adding the same tp_mode and layer_type
parameters to the non-gated custom ops' signatures and fake registrations, and
update is_any_shardable_op (and any dispatch that filters on ShardableOp.NORM)
to recognize these updated non-gated names so they receive sharding metadata and
are processed by _apply_hint_norm.
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_smollm3.py (1)

182-194: Missing layer_type parameter in torch_attention call.

The torch_attention custom op now accepts an optional layer_type: str = "unknown" parameter for sharding hints metadata (as shown in torch_attention.py lines 109-229). While it defaults to "unknown", other model implementations in this PR may benefit from passing an explicit value for consistency with the sharding infrastructure.

Consider adding the layer_type parameter for consistency:

         attn_output = torch.ops.auto_deploy.torch_attention(
             q,  # [B, S, N, head_dim]
             k,  # [B, S, N_kv, head_dim]
             v,  # [B, S, N_kv, head_dim]
             None,  # attn_mask
             0.0,  # dropout_p
             True,  # is_causal
             self.scaling,  # scale
             None,  # sinks
             None,  # sliding_window
             None,  # logit_cap
             "bsnd",  # layout
+            "unknown",  # layer_type
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_smollm3.py` around
lines 182 - 194, The torch_attention custom op call in modeling_smollm3.py (the
attn_output assignment using torch.ops.auto_deploy.torch_attention) is missing
the optional layer_type parameter; update that call to pass an explicit
layer_type string (e.g., "self_attn" or another descriptive name) as the final
argument to match the new signature in torch_attention and provide sharding
hints metadata, keeping all other arguments (q, k, v, attn_mask, dropout_p,
is_causal, scaling, sinks, sliding_window, logit_cap, layout) unchanged.
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_mistral.py (1)

170-182: Missing layer_type parameter in torch_attention call.

For consistency with the sharding infrastructure, consider adding:

         attn_output = torch.ops.auto_deploy.torch_attention(
             q,  # [B, S, N, head_dim]
             k,  # [B, S, N_kv, head_dim]
             v,  # [B, S, N_kv, head_dim]
             None,  # attn_mask
             0.0,  # dropout_p
             True,  # is_causal
             self.scaling,  # scale
             None,  # sinks
             self.sliding_window,  # sliding_window
             None,  # logit_cap
             "bsnd",  # layout
+            "unknown",  # layer_type
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_mistral.py` around
lines 170 - 182, The call to torch.ops.auto_deploy.torch_attention in
modeling_mistral.py (the attn_output assignment) is missing the required
layer_type parameter expected by the sharding infrastructure; update the
torch_attention call to pass the appropriate layer_type argument (e.g., the
current block's type or a constant like "self_attn"/"cross_attn" as used
elsewhere) as the next parameter after "layout" so the function signature
matches the operator and sharding logic in the codebase.
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_exaone.py (1)

243-255: Missing layer_type parameter in torch_attention call.

For consistency with the sharding infrastructure, consider adding:

         attn_output = torch.ops.auto_deploy.torch_attention(
             q,  # [B, S, N, head_dim]
             k,  # [B, S, N_kv, head_dim]
             v,  # [B, S, N_kv, head_dim]
             None,  # attn_mask
             0.0,  # dropout_p
             True,  # is_causal
             self.scaling,  # scale
             None,  # sinks
             None,  # sliding_window
             None,  # logit_cap
             "bsnd",  # layout
+            "unknown",  # layer_type
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_exaone.py` around
lines 243 - 255, The call to torch.ops.auto_deploy.torch_attention in
modeling_exaone.py is missing the required layer_type argument for sharding
metadata; update the call in the method that constructs attn_output to pass the
layer type (e.g., add self.layer_type as the final parameter) and, if the class
lacks a layer_type attribute, add one (or compute it) on the model class so
torch_attention receives a valid layer_type string; touch the torch_attention
invocation and the containing class (where attn_output is built) to ensure the
new parameter is populated consistently.
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_skywork_r1v2.py (1)

372-384: Missing layer_type parameter in torch_attention call.

Same as other models in this PR, consider adding the explicit layer_type parameter for consistency with the sharding infrastructure:

         attn_output = torch.ops.auto_deploy.torch_attention(
             q,
             k,
             v,
             None,  # attn_mask
             0.0,  # dropout_p
             True,  # is_causal
             self.scaling,
             None,  # sinks
             None,  # sliding_window
             None,  # logit_cap
             "bsnd",
+            "unknown",  # layer_type
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_skywork_r1v2.py`
around lines 372 - 384, The torch.ops.auto_deploy.torch_attention call (assigned
to attn_output) is missing the explicit layer_type kwarg required by the
sharding infra; update the call to pass layer_type with the same string used in
other models in this PR (e.g., the value used elsewhere for self-attention
layers) as a keyword argument so torch_attention receives layer_type
consistently across models.
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_starcoder2.py (1)

167-179: Missing layer_type parameter in torch_attention call.

For consistency with the sharding infrastructure, consider adding the explicit parameter:

         attn_output = torch.ops.auto_deploy.torch_attention(
             q,  # [B, S, N, head_dim]
             k,  # [B, S, N_kv, head_dim]
             v,  # [B, S, N_kv, head_dim]
             None,  # attn_mask
             0.0,  # dropout_p
             True,  # is_causal
             self.scaling,  # scale
             None,  # sinks
             self.sliding_window,  # sliding_window
             None,  # logit_cap
             "bsnd",  # layout
+            "unknown",  # layer_type
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_starcoder2.py` around
lines 167 - 179, The torch.ops.auto_deploy.torch_attention call (assigned to
attn_output) is missing the layer_type parameter required by the sharding infra;
update the call in modeling_starcoder2.py to pass the layer type (e.g., use
self.layer_type or the appropriate string literal) as the layer_type argument to
torch_attention so the function signature matches other usages and the sharding
logic can identify this attention layer.
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_seed_oss.py (1)

178-190: Missing layer_type parameter in torch_attention call.

For consistency with the sharding infrastructure, consider adding:

         attn_output = torch.ops.auto_deploy.torch_attention(
             q,  # [B, S, N, head_dim]
             k,  # [B, S, N_kv, head_dim]
             v,  # [B, S, N_kv, head_dim]
             None,  # attn_mask
             0.0,  # dropout_p
             True,  # is_causal
             self.scaling,  # scale
             None,  # sinks
             None,  # sliding_window
             None,  # logit_cap
             "bsnd",  # layout
+            "unknown",  # layer_type
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_seed_oss.py` around
lines 178 - 190, The torch.ops.auto_deploy.torch_attention call in
modeling_seed_oss.py is missing the required layer_type argument; update the
call site (the torch_attention invocation that passes q, k, v, ..., "bsnd") to
include the layer_type parameter (e.g., pass self.layer_type or an explicit
string like "attention" as the final argument) so the call matches the expected
signature used by the sharding infrastructure.
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_llama3.py (1)

179-191: Missing layer_type parameter in torch_attention call.

For consistency with the sharding infrastructure, consider adding:

         attn_output = torch.ops.auto_deploy.torch_attention(
             q,  # [B, S, N, head_dim]
             k,  # [B, S, N_kv, head_dim]
             v,  # [B, S, N_kv, head_dim]
             None,  # attn_mask
             0.0,  # dropout_p
             True,  # is_causal
             self.scaling,  # scale
             None,  # sinks
             None,  # sliding_window
             None,  # logit_cap
             "bsnd",  # layout
+            "unknown",  # layer_type
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_llama3.py` around
lines 179 - 191, The torch.ops.auto_deploy.torch_attention call (the attn_output
assignment) is missing the layer_type argument required by the sharding infra;
update the call to pass the layer type (e.g., add layer_type=self.layer_type or
the appropriate LayerType enum constant) as the final argument after the
"layout" string, and ensure the class has a self.layer_type attribute or
otherwise provide the correct symbol name to reflect the current layer (so
torch_attention receives the layer_type parameter).
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_decilm.py (1)

229-236: Consider using keyword arguments for torch_attention.

Similar to other files, positional arguments reduce readability.

♻️ Use keyword arguments
         # Attention via canonical AD IR op (bsnd layout, handles GQA internally)
         attn_output = torch.ops.auto_deploy.torch_attention(
             query_states,
             key_states,
             value_states,
-            is_causal=True,
-            dropout_p=0.0,
-            layout="bsnd",
+            attn_mask=None,
+            dropout_p=0.0,
+            is_causal=True,
+            scale=None,
+            sinks=None,
+            sliding_window=None,
+            logit_cap=None,
+            layout="bsnd",
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_decilm.py` around
lines 229 - 236, The call to torch.ops.auto_deploy.torch_attention currently
passes tensors positionally which hurts readability; change the call site in
modeling_decilm.py where torch.ops.auto_deploy.torch_attention(query_states,
key_states, value_states, is_causal=True, dropout_p=0.0, layout="bsnd") is
invoked to pass the three inputs and all options as explicit keyword arguments
(e.g., query_states=..., key_states=..., value_states=..., is_causal=True,
dropout_p=0.0, layout="bsnd") so the parameters are self-documenting and match
other files' style.
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_minimax_m2.py (2)

474-474: Use raise instead of assert for input validation in public API.

Validation via assert can be silently disabled: assert statements are stripped when Python runs with the -O flag. For a public API method, raise ValueError for proper error handling.

♻️ Replace assert with raise
-        assert position_ids is not None, "position_ids must be provided for AD export"
+        if position_ids is None:
+            raise ValueError("position_ids must be provided for AD export")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_minimax_m2.py` at line
474, Replace the runtime-optimizable assertion with explicit input validation:
instead of using "assert position_ids is not None, ..." in the method that
checks position_ids (the assertion on position_ids in modeling_minimax_m2.py),
add a conditional that raises ValueError when position_ids is None (e.g., if
position_ids is None: raise ValueError("position_ids must be provided for AD
export")). This ensures the public API fails reliably even when Python is run
with optimizations disabled.
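The -O behavior behind this recommendation is easy to demonstrate: compiling with optimize=1 (the effect of running python -O) strips assert statements entirely. A minimal, standalone sketch, not tied to the model code:

```python
# Why `assert` is unsuitable for public-API validation: under -O
# (equivalently, compile(..., optimize=1)), asserts are stripped.
src = "assert False, 'this check silently disappears under -O'"

code_default = compile(src, "<demo>", "exec", optimize=0)
code_optimized = compile(src, "<demo>", "exec", optimize=1)

try:
    exec(code_default)
    default_raised = False
except AssertionError:
    default_raised = True  # check fires in a normal interpreter

try:
    exec(code_optimized)
    optimized_raised = False  # check vanished under optimization
except AssertionError:
    optimized_raised = True

print(default_raised, optimized_raised)
```

A raise ValueError path, by contrast, survives both modes.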

333-345: Consider using keyword arguments for torch_attention call.

Using positional arguments for a function with many optional parameters reduces readability and is prone to errors if the signature changes. Consider using keyword arguments for clarity.

♻️ Use keyword arguments
         # Attention using canonical op with GQA support (BSND layout)
         attn_output = torch.ops.auto_deploy.torch_attention(
-            q,
-            k,
-            v,
-            None,  # attn_mask
-            0.0,  # dropout_p
-            True,  # is_causal
-            self.scaling,  # scale
-            None,  # sinks
-            None,  # sliding_window
-            None,  # logit_cap
-            "bsnd",  # layout
+            q,
+            k,
+            v,
+            attn_mask=None,
+            dropout_p=0.0,
+            is_causal=True,
+            scale=self.scaling,
+            sinks=None,
+            sliding_window=None,
+            logit_cap=None,
+            layout="bsnd",
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_minimax_m2.py` around
lines 333 - 345, The torch.ops.auto_deploy.torch_attention call in the
attn_output assignment uses many positional arguments which harms readability
and is fragile; change it to call torch.ops.auto_deploy.torch_attention using
explicit keyword arguments (e.g., q= q, k= k, v= v, attn_mask= None, dropout_p=
0.0, is_causal= True, scale= self.scaling, sinks= None, sliding_window= None,
logit_cap= None, layout= "bsnd") so each parameter is clearly labeled and future
signature changes are less error-prone.
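To illustrate the fragility this comment describes, here is a standalone sketch with hypothetical signatures (attention_v1/attention_v2 are illustrative, not the real torch_attention op): inserting a parameter mid-signature silently rebinds a positional call, while the keyword call keeps its meaning.

```python
# Hypothetical signatures illustrating the hazard; not the real
# torch_attention op. v2 inserts `scale` ahead of `dropout_p`.
def attention_v1(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False):
    return ("v1", attn_mask, dropout_p, is_causal)

def attention_v2(q, k, v, attn_mask=None, scale=None, dropout_p=0.0, is_causal=False):
    return ("v2", attn_mask, scale, dropout_p, is_causal)

# A positional call written for v1 still runs against v2, but 0.0 now
# binds to `scale` and True to `dropout_p` -- silently wrong.
pos = attention_v2(1, 2, 3, None, 0.0, True)

# The keyword call keeps its meaning across the signature change.
kw = attention_v2(1, 2, 3, attn_mask=None, dropout_p=0.0, is_causal=True)

print(pos)
print(kw)
```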
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm4_moe.py (2)

147-149: Prefix unused variables with underscore.

The variables bsz and seq_len are unpacked but unused.

♻️ Prefix unused variables
     def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
-        bsz, seq_len, hidden_dim = hidden_states.shape
+        _bsz, _seq_len, hidden_dim = hidden_states.shape
         hidden_states_flat = hidden_states.view(-1, hidden_dim)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm4_moe.py` around
lines 147 - 149, In the forward method, the unpacked variables bsz and seq_len
are unused; rename them to _bsz and _seq_len (or use underscores) when unpacking
hidden_states.shape in def forward(self, hidden_states: torch.Tensor) ->
Tuple[torch.Tensor, torch.Tensor]: so the line becomes something like "_bsz,
_seq_len, hidden_dim = hidden_states.shape" to satisfy the lint rule while
keeping the rest of the method (e.g., hidden_states_flat) unchanged.

426-426: Use raise instead of assert for input validation.

Same issue as in modeling_minimax_m2.py: assert statements are stripped when Python runs with -O.

♻️ Replace assert with raise
-        assert position_ids is not None, "position_ids must be provided for AD export"
+        if position_ids is None:
+            raise ValueError("position_ids must be provided for AD export")

This pattern appears in multiple files in this PR (lines 426, 484). Consider applying consistently.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm4_moe.py` at line
426, Replace the runtime-assertion "assert position_ids is not None,
'position_ids must be provided for AD export'" with explicit input validation
that always runs (e.g., if position_ids is None: raise ValueError("position_ids
must be provided for AD export")) so the check isn't skipped under -O; update
the same pattern in this module (modeling_glm4_moe.py) and the other affected
file (modeling_minimax_m2.py) where the same assert is used, ensuring you raise
an appropriate exception (ValueError or TypeError) in the relevant
function/method that performs AD export or forward.
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_hunyuan_moe.py (2)

179-188: Prefix unused variables with underscore.

The variables T and D are unpacked but unused.

♻️ Prefix unused variables
     def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
         """Return (topk_indices, topk_weights), shapes [T, K] each."""
-        T, D = hidden_states.shape
+        _T, _D = hidden_states.shape
         # Cast both input and weight to float32 for gate computation
         logits = F.linear(hidden_states.float(), self.wg.weight.float())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_hunyuan_moe.py` around
lines 179 - 188, In the forward method of modeling_hunyuan_moe.py, the unpacked
variables T and D from hidden_states.shape are unused; change their names to _T
and _D (or use a single underscore assignment) to mark them as intentionally
unused—update the tuple unpack in def forward(self, hidden_states: torch.Tensor)
-> Tuple[torch.Tensor, torch.Tensor]: so hidden_states.shape is assigned to _T,
_D (or _) while leaving the rest of the logic in forward (logits =
F.linear(...), gates = F.softmax(...), topk = gates.topk(...), normalization,
and return) unchanged.

70-81: Consider using canonical torch_rmsnorm op for consistency.

Similar to the Phi4Flash model, this uses manual PyTorch RMSNorm instead of the canonical AD op. Other models in this PR use torch.ops.auto_deploy.torch_rmsnorm.

♻️ Use canonical AD op
     def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
-        input_dtype = hidden_states.dtype
-        hidden_states = hidden_states.to(torch.float32)
-        variance = hidden_states.pow(2).mean(-1, keepdim=True)
-        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
-        return self.weight * hidden_states.to(input_dtype)
+        return torch.ops.auto_deploy.torch_rmsnorm(
+            hidden_states, self.weight, self.variance_epsilon
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_hunyuan_moe.py` around
lines 70 - 81, The HunYuanMoERMSNorm implementation manually computes RMSNorm in
forward; replace it with the canonical AD op torch.ops.auto_deploy.torch_rmsnorm
to match other models. Update the forward of class HunYuanMoERMSNorm to convert
hidden_states to float32 for the op if necessary, call
torch.ops.auto_deploy.torch_rmsnorm(hidden_states, self.weight,
self.variance_epsilon), and then cast the result back to the original input
dtype (preserve input_dtype). Ensure the parameter name self.weight and
attribute self.variance_epsilon remain unchanged so other code using this class
continues to work.
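Whichever implementation is kept, both paths must compute the same quantity, RMSNorm: y = weight * x / sqrt(mean(x^2) + eps). A dependency-free reference (plain Python, illustrative values) that either variant can be sanity-checked against:

```python
import math

def rms_norm(xs, weight, eps=1e-5):
    # y_i = weight_i * x_i / sqrt(mean(x^2) + eps)
    mean_sq = sum(x * x for x in xs) / len(xs)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * x * inv_rms for w, x in zip(weight, xs)]

out = rms_norm([3.0, 4.0], [1.0, 1.0])
print(out)
```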
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4flash.py (1)

149-160: Consider using the canonical torch_rmsnorm op for consistency.

Other auto-deploy models in this PR use torch.ops.auto_deploy.torch_rmsnorm for RMSNorm. This implementation uses a manual PyTorch approach, which works but may miss optimization opportunities in the export pipeline.

♻️ Optional: Use canonical AD op
 class Phi4FlashRMSNorm(nn.Module):
     def __init__(self, hidden_size: int, eps: float = 1e-5):
         super().__init__()
         self.weight = nn.Parameter(torch.ones(hidden_size))
         self.variance_epsilon = eps

     def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
-        input_dtype = hidden_states.dtype
-        hidden_states = hidden_states.to(torch.float32)
-        variance = hidden_states.pow(2).mean(-1, keepdim=True)
-        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
-        return self.weight * hidden_states.to(input_dtype)
+        return torch.ops.auto_deploy.torch_rmsnorm(
+            hidden_states, self.weight, self.variance_epsilon
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4flash.py` around
lines 149 - 160, The custom Phi4FlashRMSNorm implements RMSNorm manually;
replace the manual computation in Phi4FlashRMSNorm.forward with the canonical op
torch.ops.auto_deploy.torch_rmsnorm to match other models and enable export
optimizations: keep the class and the self.weight nn.Parameter and
self.variance_epsilon, capture input_dtype, cast hidden_states to float32 if
needed, call torch.ops.auto_deploy.torch_rmsnorm(hidden_states, self.weight,
self.variance_epsilon) (or the op's exact arg order used elsewhere in the repo)
and then cast the result back to input_dtype before returning; ensure the weight
shape matches hidden_size and behavior (eps) remains the same.
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm_moe_dsa.py (1)

279-281: Prefix unused variables with underscore.

The static analysis correctly identifies that bsz and seq_len are unpacked but unused.

♻️ Prefix unused variables
     def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
-        bsz, seq_len, hidden_dim = hidden_states.shape
+        _bsz, _seq_len, hidden_dim = hidden_states.shape
         hidden_states_flat = hidden_states.view(-1, hidden_dim)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm_moe_dsa.py` around
lines 279 - 281, The forward method unpacks bsz and seq_len but doesn't use
them; rename them to _bsz and _seq_len (or prefix with underscores) in the tuple
unpacking inside forward to mark them as intentionally unused and avoid
static-analysis warnings; update the unpacking line in def forward(self,
hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]: where bsz,
seq_len, hidden_dim = hidden_states.shape to _bsz, _seq_len, hidden_dim =
hidden_states.shape and leave subsequent usage of hidden_states and
hidden_states_flat unchanged.
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4mm.py (1)

49-60: Consider adding error handling around spec.loader.exec_module(module).

The function loads and executes Python code dynamically from HuggingFace checkpoints. While the null checks on spec and spec.loader provide basic safety, wrapping the exec_module call in try-except would improve robustness—similar to the pattern used in tensorrt_llm/serve/cluster_storage.py. This is especially relevant since arbitrary code execution from checkpoints is inherent to the trust_remote_code pattern and should fail gracefully if module execution encounters errors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4mm.py` around
lines 49 - 60, The _load_hf_aux_module function currently calls
spec.loader.exec_module(module) without guarding runtime errors; wrap that
exec_module call in a try/except that catches Exception, and re-raise a clearer
ImportError (or raise after logging) that includes module_name and module_path
plus the original exception info so failures while executing checkpoint code
fail gracefully; follow the same pattern used in
tensorrt_llm/serve/cluster_storage.py and ensure spec.loader.exec_module(module)
is the protected call.
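A guarded loader along the lines this comment suggests, sketched with only the stdlib importlib machinery (load_module_guarded and the temp-file modules are illustrative, not the repo's helper):

```python
import importlib.util
import os
import tempfile

def load_module_guarded(module_name, module_path):
    spec = importlib.util.spec_from_file_location(module_name, module_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build import spec for {module_name!r} at {module_path!r}")
    module = importlib.util.module_from_spec(spec)
    try:
        # Executes arbitrary checkpoint code; failures are re-raised
        # with the module name/path attached instead of propagating raw.
        spec.loader.exec_module(module)
    except Exception as e:
        raise ImportError(
            f"failed to execute {module_name!r} from {module_path!r}: {e}"
        ) from e
    return module

# Success path: a well-formed auxiliary module.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("VALUE = 42\n")
    good_path = f.name
good = load_module_guarded("good_mod", good_path)
os.remove(good_path)

# Failure path: top-level code raises, standing in for a broken checkpoint.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("raise RuntimeError('broken checkpoint code')\n")
    bad_path = f.name
try:
    load_module_guarded("bad_mod", bad_path)
    failed_gracefully = False
except ImportError:
    failed_gracefully = True
finally:
    os.remove(bad_path)

print(good.VALUE, failed_gracefully)
```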

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1166a167-d8ff-43e5-be58-962d10f8ccad

📥 Commits

Reviewing files that changed from the base of the PR and between f31b45b and acbb440.

⛔ Files ignored due to path filters (1)
  • tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/porting_log.csv is excluded by !**/*.csv
📒 Files selected for processing (87)
  • examples/auto_deploy/new_sharding/deepseek/deepseek_r1_sharding_poc.yaml
  • examples/auto_deploy/new_sharding/deepseek/deepseek_v2_5_sharding_poc.yaml
  • examples/auto_deploy/new_sharding/internlm/internlm3_sharding_poc.yaml
  • examples/auto_deploy/new_sharding/llama/llama3_sharding_poc.yaml
  • examples/auto_deploy/new_sharding/mistral/mistral_sharding_poc.yaml
  • examples/auto_deploy/new_sharding/nemotron/nemotron_sharding_poc_fp8.yaml
  • examples/auto_deploy/new_sharding/nemotron/nemotron_sharding_poc_nvfp4.yaml
  • examples/auto_deploy/new_sharding/qwen/qwen3_5_moe_sharding_poc.yaml
  • examples/auto_deploy/new_sharding/qwen/qwen3_sharding_poc.yaml
  • examples/auto_deploy/new_sharding/smollm/smollm3_sharding_poc.yaml
  • tensorrt_llm/_torch/auto_deploy/config/default.yaml
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/torch_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_gated_delta.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/torch_moe.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/linear/linear.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_mamba.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mla/torch_mla.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/rms_norm.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/quant.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/sharding_ops.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/mla_rope_utils.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_cohere.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_decilm.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_v2.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_exaone.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma2.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma3.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm4_moe.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm_moe_dsa.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gpt_oss.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_granite.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_granite_moe_hybrid.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_hunyuan_dense.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_hunyuan_moe.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_internlm3.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_llama3.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_llama4.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_minimax_m2.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_mistral.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_mistral3.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_olmo3.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4_visionr.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4flash.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4mm.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen2.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_moe.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_next.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_seed_oss.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_skywork_r1v2.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_smollm3.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_starcoder2.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/__init__.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/modeling_deepseek.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/modeling_deepseek_v2.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/modeling_internlm3.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/modeling_llama3.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/modeling_mistral.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/modeling_nemotron_h.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/modeling_qwen3.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/modeling_qwen3_5_moe.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/modeling_smollm3.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/porting_instructions.md
  • tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/register_sharded_models.py
  • tensorrt_llm/_torch/auto_deploy/transform/interface.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/collectives.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_quant.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py
  • tensorrt_llm/_torch/auto_deploy/transform/optimizer.py
  • tensorrt_llm/_torch/auto_deploy/utils/dist_config.py
  • tensorrt_llm/_torch/auto_deploy/utils/mapping_utils.py
  • tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
  • tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py
  • tests/unittest/auto_deploy/multigpu/transformations/library/test_apply_sharding_hints.py
  • tests/unittest/auto_deploy/multigpu/transformations/library/test_tp_sharding.py
  • tests/unittest/auto_deploy/singlegpu/custom_ops/test_sharding_ops.py
  • tests/unittest/auto_deploy/singlegpu/utils/test_dist_config.py
  • tests/unittest/auto_deploy/singlegpu/utils/test_mapping_utils.py
  • tests/unittest/auto_deploy/singlegpu/utils/test_node_utils_sharding.py

Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_v2.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma2.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma3.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_granite_moe_hybrid.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_next.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_next.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3.py
@greg-kwasniewski1
Collaborator Author

/bot run

@greg-kwasniewski1
Collaborator Author

/bot help

@github-actions

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.


@tensorrt-cicd
Collaborator

PR_Github #39792 [ run ] triggered by Bot. Commit: 9db20e2 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39792 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 3/21.

Link to invocation

@greg-kwasniewski1
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #39810 [ run ] triggered by Bot. Commit: 9db20e2 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39810 [ run ] completed with state SUCCESS. Commit: 9db20e2
/LLM/main/L0_MergeRequest_PR pipeline #30987 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@greg-kwasniewski1
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #39832 [ run ] triggered by Bot. Commit: 9db20e2 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39832 [ run ] completed with state FAILURE. Commit: 9db20e2
/LLM/main/L0_MergeRequest_PR pipeline #31009 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@greg-kwasniewski1 greg-kwasniewski1 force-pushed the gk/nemotron-h-new-sharding-poc branch from 9db20e2 to f79caad Compare March 22, 2026 14:55
@greg-kwasniewski1
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #39834 [ run ] triggered by Bot. Commit: f79caad Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39834 [ run ] completed with state SUCCESS. Commit: f79caad
/LLM/main/L0_MergeRequest_PR pipeline #31010 completed with status: 'SUCCESS'

CI Report

Link to invocation

@greg-kwasniewski1
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #39885 [ run ] triggered by Bot. Commit: d7b20b5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39885 [ run ] completed with state SUCCESS. Commit: d7b20b5
/LLM/main/L0_MergeRequest_PR pipeline #31054 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@greg-kwasniewski1
Collaborator Author

/bot run --reuse-test --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39950 [ run ] triggered by Bot. Commit: d7b20b5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39950 [ run ] completed with state SUCCESS. Commit: d7b20b5
/LLM/main/L0_MergeRequest_PR pipeline #31115 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@greg-kwasniewski1
Collaborator Author

/bot run --reuse-test --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #40073 [ run ] triggered by Bot. Commit: d7b20b5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #40073 [ run ] completed with state FAILURE. Commit: d7b20b5
/LLM/main/L0_MergeRequest_PR pipeline #31226 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43467 [ run ] triggered by Bot. Commit: 02a2867 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43467 [ run ] completed with state SUCCESS. Commit: 02a2867
/LLM/main/L0_MergeRequest_PR pipeline #33986 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@greg-kwasniewski1
Collaborator Author

/bot run --disable-fail-fast

…dist_config

Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com>
Made-with: Cursor
@tensorrt-cicd
Collaborator

PR_Github #43565 [ run ] triggered by Bot. Commit: 21e4fc3 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43565 [ run ] completed with state SUCCESS. Commit: 21e4fc3
/LLM/main/L0_MergeRequest_PR pipeline #34064 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@greg-kwasniewski1
Collaborator Author

/bot run --reuse-test --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #43718 [ run ] triggered by Bot. Commit: dfd3823 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43718 [ run ] completed with state FAILURE. Commit: dfd3823
/LLM/main/L0_MergeRequest_PR pipeline #34203 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@greg-kwasniewski1
Collaborator Author

/bot run --reuse-test --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #43764 [ run ] triggered by Bot. Commit: dfd3823 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43764 [ run ] completed with state FAILURE. Commit: dfd3823
/LLM/main/L0_MergeRequest_PR pipeline #34245 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@greg-kwasniewski1
Collaborator Author

/bot run --reuse-test --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44018 [ run ] triggered by Bot. Commit: 82da57d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44018 [ run ] completed with state ABORTED. Commit: 82da57d

Link to invocation

@greg-kwasniewski1
Collaborator Author

/bot run --reuse-test --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #44109 [ run ] triggered by Bot. Commit: e1159d8 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44109 [ run ] completed with state SUCCESS. Commit: e1159d8
/LLM/main/L0_MergeRequest_PR pipeline #34536 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@greg-kwasniewski1
Collaborator Author

/bot help

@github-actions

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
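As an illustration of how the flags above combine (the stage and GPU names are the placeholder examples from the help text, not real stage names), a few typical bot comments might look like:

```
/bot run --disable-fail-fast --test-backend "pytorch"
/bot run --reuse-test --gpu-type "A30, H100_PCIe"
/bot run --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1"
/bot skip --comment "Reason for skipping build/test"
```

Each comment is posted on the pull request; the bot replies with a PR_Github invocation link, as seen throughout this conversation.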

@greg-kwasniewski1
Collaborator Author

/bot skip --comment "All single GPU and multi GPU tests pass, except for test_llama_executor[llama-leader-90] in DGX_H100-4_GPUs-CPP-1 test. This is unrelated to this PR. All tests that may have been touched by this PR passed. The code touches only auto-deploy infrastructure, and auto-deploy single and multigpu tests pass".

@tensorrt-cicd
Collaborator

PR_Github #44442 [ skip ] triggered by Bot. Commit: e1159d8 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #44442 [ skip ] completed with state SUCCESS. Commit: e1159d8
Skipping testing for commit e1159d8

Link to invocation

@greg-kwasniewski1 greg-kwasniewski1 merged commit a8bd7b3 into NVIDIA:main Apr 20, 2026
5 checks passed


Development

Successfully merging this pull request may close these issues.

[AutoDeploy][Feature]: Sharding IR

4 participants