[TRTLLM-12291][feat] New sharding infrastructure #12419
greg-kwasniewski1 merged 17 commits into NVIDIA:main from
Conversation
/bot run
📝 Walkthrough

This pull request introduces a sharding IR (Intermediate Representation) system for AutoDeploy, enabling tensor and expert parallelism across distributed model execution. Changes include new POC configurations for multiple model architectures, sharding-aware metadata parameters added to custom operations, new sharding-specific custom ops, updated model implementations optimized for AutoDeploy export, and configuration updates to support the new sharding infrastructure.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Actionable comments posted: 17
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/torch_moe.py (1)
429-448: ⚠️ Potential issue | 🔴 Critical

Add `layer_type` parameter to FP8/NVFP4 MoE ops to prevent crashes when `shard_layers` filtering is used.

The `torch_quant_fp8_moe` and `torch_quant_nvfp4_moe` ops lack the `layer_type` parameter present in `torch_moe`. This causes `apply_sharding_hints` to crash with a `RuntimeError` when `shard_layers` config filtering is enabled, because it unconditionally attempts to extract `layer_type` from all MoE ops via `extract_op_args(node, "layer_type")`, which fails if the parameter is not in the schema (line 3927 in sharding.py).

Add a `layer_type: str = "unknown"` parameter to both `torch_quant_fp8_moe` and `torch_quant_nvfp4_moe` function signatures and their corresponding fake implementations to match `torch_moe`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/torch_moe.py` around lines 429-448: the FP8/NVFP4 MoE custom ops are missing the `layer_type` parameter, which causes `apply_sharding_hints` to crash when `shard_layers` filtering calls `extract_op_args(node, "layer_type")`. Update both function signatures, `torch_quant_fp8_moe` and `torch_quant_nvfp4_moe`, to include `layer_type: str = "unknown"` and add the same parameter to their corresponding fake implementations so their schemas match `torch_moe` (use the default `"unknown"`); ensure any internal usages or return signatures remain unchanged.

tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py (1)
560-566: ⚠️ Potential issue | 🟠 Major

Use ceil-division when inferring fine-grained FP8 block sizes.

`weight_scale_inv` now represents `ceil(N / block_n) x ceil(K / block_k)`. The floor divisions here under-estimate the block size for tiny or non-divisible projections, so `_safe_act_quant()` and `w8a8_block_fp8_matmul_triton()` run with the wrong geometry.

Suggested fix

```diff
-    block_n = N // scale_n
-    block_k = K // scale_k
+    block_n = cdiv(N, scale_n)
+    block_k = cdiv(K, scale_k)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py` around lines 560-566: the code infers `block_size` using floor division, which under-estimates FP8 block dimensions for non-divisible shapes. Change the computation in the block where `N, K = weight_quantized.shape` and `scale_n, scale_k = weight_scale_inv.shape` to use ceil-division so `block_n = ceil(N / scale_n)` and `block_k = ceil(K / scale_k)` (ensure `math.ceil` is available/imported) and set `block_size = [block_n, block_k]`; this aligns the inferred block geometry used by `_safe_act_quant()` and `w8a8_block_fp8_matmul_triton()` with the ceil-based shape of `weight_scale_inv`.
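The floor-vs-ceil difference is easy to check standalone. A minimal sketch in plain Python, with a local `cdiv` standing in for whatever helper the module actually provides (the shapes below are made-up):

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division without going through floats."""
    return -(-a // b)

# Example: a 10x64 weight quantized with 4x32 blocks has a
# ceil(10/4) x ceil(64/32) = 3 x 2 scale tensor.
N, K = 10, 64            # weight_quantized.shape (illustrative)
scale_n, scale_k = 3, 2  # weight_scale_inv.shape (illustrative)

# Floor division under-estimates block_n for the non-divisible dim...
assert (N // scale_n, K // scale_k) == (3, 32)
# ...while ceil-division recovers the original 4x32 geometry.
assert (cdiv(N, scale_n), cdiv(K, scale_k)) == (4, 32)
```

Only the non-divisible dimension differs between the two variants, which is why the bug surfaces on tiny or oddly shaped projections.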
🟡 Minor comments (6)
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_mamba.py-193-196 (1)
193-196: ⚠️ Potential issue | 🟡 Minor

Docstring references non-existent `tp_mode` parameter.

The docstring on line 196 mentions `tp_mode`, but the actual parameters added are `shardable` and `layer_type`. This appears to be a copy-paste error.

📝 Proposed fix for docstring

```diff
-    """Mamba SSM mixer forward; accepts ``tp_mode`` for sharding-aware AutoDeploy behavior."""
+    """Mamba SSM mixer forward; accepts sharding metadata for AutoDeploy behavior."""
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_mamba.py` around lines 193-196: the docstring for the Mamba SSM mixer forward function incorrectly references a non-existent parameter `tp_mode`; update the docstring to reflect the actual parameters (`shardable: bool` and `layer_type: str`) and remove any mention of `tp_mode`. Edit the docstring so it documents `shardable` and `layer_type` briefly and accurately (their types and purpose) and the signature and docstring match.

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_causal_conv.py-34-38 (1)
34-38: ⚠️ Potential issue | 🟡 Minor

Docstring references non-existent `tp_mode` parameter.

Similar to `torch_mamba.py`, the docstring mentions `tp_mode` but the actual parameters are `shardable`, `output_sizes`, and `layer_type`.

📝 Proposed fix for docstring

```diff
-    """Causal 1D convolution; accepts ``tp_mode`` for sharding-aware AutoDeploy behavior."""
+    """Causal 1D convolution; accepts sharding metadata for AutoDeploy behavior."""
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_causal_conv.py` around lines 34-38: the docstring incorrectly mentions a non-existent parameter `tp_mode`; update the docstring to remove `tp_mode` and instead describe the actual parameters `shardable`, `output_sizes`, and `layer_type` and their effect (sharding-aware AutoDeploy behavior). Locate the function whose signature includes `shardable: bool = False, output_sizes: Optional[List[int]] = None, layer_type: str = "unknown"` and revise its docstring to accurately reflect those parameters and intended behavior.

tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4flash.py-1-1 (1)
1-1: ⚠️ Potential issue | 🟡 Minor

Add full Apache 2.0 license header.

The file has only a short copyright line but is missing the full Apache 2.0 license block required by coding guidelines for all TensorRT-LLM source files.

📝 Suggested license header

```diff
-# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
```

As per coding guidelines: "All TensorRT-LLM source files should contain an NVIDIA copyright header with the year of the latest meaningful modification."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4flash.py` at line 1: the file currently has only a short copyright line; replace or prepend it with the full Apache 2.0 license header used across TensorRT-LLM source files (including the NVIDIA copyright line with the latest modification year and the full license text and notice), ensuring the complete header appears at the top of the file and follows the same formatting and year as other project files.

tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm_moe_dsa.py-1-1 (1)
1-1: ⚠️ Potential issue | 🟡 Minor

Update copyright year and add full license header.

The copyright year is 2025 but should be 2026 (current year). The file is also missing the full Apache 2.0 license block.

📝 Suggested fix

```diff
-# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm_moe_dsa.py` at line 1: update the file header to use the current year and include the complete Apache License 2.0 header. Change the copyright year from 2025 to 2026 at the top of the file and replace the existing short copyright line with the full Apache-2.0 license block (copyright statement, "Licensed under the Apache License, Version 2.0" wording, link to the license, and the standard disclaimer paragraphs) so the file contains the standard full license header.

tensorrt_llm/_torch/auto_deploy/models/custom/modeling_hunyuan_moe.py-419-419 (1)
419-419: ⚠️ Potential issue | 🟡 Minor

Update the misleading comment about dynamic assignment.

The comment "set dynamically below when used with real HunYuanConfig" is inaccurate: `config_class` is not assigned anywhere in the file. The `config_class = None` is intentional for custom models using HuggingFace configs loaded via `trust_remote_code`. Update the comment to clarify this pattern, similar to the approach in `modeling_skywork_r1v2.py`:

```python
config_class = None  # HunYuanConfig uses trust_remote_code; not imported here.
```

The instantiation concern is unfounded because the actual config is passed directly to `__init__()` and on to `super().__init__(config)`, so `config_class` being `None` does not prevent initialization.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_hunyuan_moe.py` at line 419: update the misleading inline comment for the module-level variable `config_class = None`. Replace "set dynamically below when used with real HunYuanConfig" with a clear explanation that this pattern intentionally leaves `config_class` as `None` because HunYuanConfig is provided via `trust_remote_code` and not imported here (e.g., `config_class = None  # HunYuanConfig uses trust_remote_code; not imported here`). Match the wording used in modeling_skywork_r1v2.py so readers understand the config is passed to `__init__` and `super().__init__(config)` rather than assigned in this file.

tensorrt_llm/_torch/auto_deploy/models/custom/modeling_decilm.py-320-324 (1)
320-324: ⚠️ Potential issue | 🟡 Minor

Add explicit `config_class = None` assignment with explanatory comment.

To match the pattern used in similar models that load config via `trust_remote_code` (e.g., `SkyworkR1V2`), explicitly set `config_class = None` instead of omitting it. This makes the intentional design choice explicit and avoids any ambiguity:

```python
class DeciLMPreTrainedModel(PreTrainedModel):
    config_class = None  # Config loaded via trust_remote_code from HF checkpoint
    base_model_prefix = "model"
    _no_split_modules = ["DeciLMDecoderLayer"]
    _supports_flash_attn_2 = True
    supports_gradient_checkpointing = False
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_decilm.py` around lines 320 - 324, Add an explicit config_class assignment to the DeciLMPreTrainedModel class to document that config is loaded via trust_remote_code: set config_class = None with a short comment (e.g., "Config loaded via trust_remote_code from HF checkpoint") inside the class DeciLMPreTrainedModel so the intent is explicit alongside existing attributes like base_model_prefix, _no_split_modules, _supports_flash_attn_2, and supports_gradient_checkpointing.
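The claim that `config_class = None` does not block instantiation can be checked with a small stub in place of `transformers.PreTrainedModel` (every name below is illustrative, not the real classes):

```python
# Minimal stand-in for the transformers base class. Assumption: the real
# PreTrainedModel consults config_class in from_pretrained-style helpers,
# not in __init__, so None is safe when the config object is passed directly.
class PreTrainedModelStub:
    config_class = None

    def __init__(self, config):
        self.config = config


class DeciLMPreTrainedModelSketch(PreTrainedModelStub):
    config_class = None  # Config loaded via trust_remote_code from HF checkpoint
    base_model_prefix = "model"


class FakeConfig:
    hidden_size = 8


# The config instance is handed to __init__ directly, so config_class=None
# never participates in construction.
model = DeciLMPreTrainedModelSketch(FakeConfig())
print(model.config.hidden_size)  # → 8
```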
🧹 Nitpick comments (22)
tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/__init__.py (1)
1-14: Consider adding a module docstring to document the package.

The file is valid as an empty package initializer. However, adding a docstring after the license header would help developers understand the purpose of the `new_sharding` package, especially given that this PR introduces new sharding infrastructure.

📝 Example package docstring

```diff
 # See the License for the specific language governing permissions and
 # limitations under the License.
+"""
+New sharding infrastructure for AutoDeploy.
+
+This package provides sharding-aware transformations and configurations
+for distributed model execution with tensor and expert parallelism.
+"""
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/__init__.py` around lines 1-14: add a concise module-level docstring to the `__init__.py` of the new `new_sharding` package immediately after the license header. Include a short description of the package purpose (new sharding infrastructure), point to its key submodules (e.g., sharding strategies, utilities), and any usage or compatibility notes; implement this as a top-of-file triple-quoted string so tools like pydoc and IDEs surface it.

tensorrt_llm/_torch/auto_deploy/custom_ops/sharding_ops.py (1)
84-97: Inconsistent fake implementation pattern for `all_reduce`.

The real `all_reduce` implementation returns `x.clone()` (line 92), but the fake implementation returns `torch.empty_like(x)` (line 97). This differs from `view` and `split_with_sizes`, where both real and fake implementations use `clone()`.

While `empty_like` is sufficient for shape inference during tracing, this inconsistency could cause issues if fake tensor values are ever inspected during debugging or validation. Consider using `x.clone()` for consistency.

♻️ Proposed fix for consistency

```diff
 @all_reduce.register_fake
 def _all_reduce_fake(x: torch.Tensor, layer_type: str = "unknown") -> torch.Tensor:
-    return torch.empty_like(x)
+    return x.clone()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/sharding_ops.py` around lines 84-97: the fake implementation of the custom op `all_reduce` (`_all_reduce_fake`) returns `torch.empty_like(x)`, which is inconsistent with the real `all_reduce` (which returns `x.clone()`); change `_all_reduce_fake` to return `x.clone()` so fake and real implementations match value-preserving behavior.

tensorrt_llm/_torch/auto_deploy/models/custom/mla_rope_utils.py (1)
62-70: Redundant workaround call after bfloat16 cast.

After casting `w` to `bfloat16` (line 64), calling `_index_select_with_float8_cpu_workaround` on line 68 will never trigger the workaround path, since `w_rope.dtype` is now `bfloat16`, not a float8 dtype. The function will simply fall through to the regular `index_select`. While this is not incorrect, it adds the unnecessary overhead of checking the dtype condition.

Consider either:

- Calling the regular `index_select` directly when the bfloat16 cast path is taken, or
- Documenting that this is intentional for code uniformity

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/mla_rope_utils.py` around lines 62-70: the code casts tensor `w` to bfloat16 and then calls `_index_select_with_float8_cpu_workaround` on `w_rope`, which will never hit its float8 branch, causing an unnecessary dtype check. Either (A) perform the index selection with `torch.index_select` (or advanced indexing) directly on `w_rope` when `orig_dtype` was converted to `torch.bfloat16`, or (B) defer the bfloat16 cast until after calling `_index_select_with_float8_cpu_workaround` so the workaround can still run for float8 dtypes. Choose one of these paths and remove the redundant dtype-checking call when `w` has already been cast to bfloat16.

tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py (1)
23-43: Use module-level import for `DistConfig` to match the repo import rule.

Please import the module and reference `DistConfig` via namespace instead of importing the class directly.

Proposed diff

```diff
-from tensorrt_llm._torch.auto_deploy.utils.dist_config import DistConfig
+from tensorrt_llm._torch.auto_deploy.utils import dist_config
@@
-    dc = DistConfig.deserialize(mapping_config)
+    dc = dist_config.DistConfig.deserialize(mapping_config)
```

As per coding guidelines: "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions."
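The guideline can be illustrated with a stdlib module (`json` stands in here for `auto_deploy.utils.dist_config`, which is not importable outside the repo):

```python
import json  # module import keeps the namespace

payload = json.dumps({"tp_size": 2})
restored = json.loads(payload)  # call sites read as module.function
assert restored == {"tp_size": 2}

# The discouraged form, for contrast: the bare name loses its module context,
# so readers can no longer tell where `loads` comes from at the call site.
from json import loads

assert loads(payload) == restored
```

The namespaced form also makes monkeypatching in tests predictable, since every caller resolves the attribute through the module object at call time.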
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py` around lines 23-43: replace the direct class import with a module-level import and update usages accordingly. Import the `dist_config` module (e.g. `tensorrt_llm._torch.auto_deploy.utils.dist_config`), update the call in `_check_moe_alltoall` to use the module namespace (e.g. `dist_config.DistConfig.deserialize(...)`), and change any other references to `DistConfig` in this file to `dist_config.DistConfig` so the repo import rule (module-level imports) is followed.

tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/rms_norm.py (1)
83-104: Non-gated RMSNorm ops currently lack sharding metadata and are not processed by `apply_sharding_hints`.

The non-gated variants (`torch_rmsnorm`, `triton_rmsnorm`, `flashinfer_rmsnorm`) have only 3 parameters: `(input, weight, eps)`. In contrast, their gated counterparts (`torch_rmsnorm_gated`, `triton_rmsnorm_gated`) include `tp_mode` and `layer_type` metadata that allows `apply_sharding_hints` to apply tensor-parallel sharding.

`is_any_shardable_op` currently detects only the gated variants for `ShardableOp.NORM`, so non-gated variants are never passed to `_apply_hint_norm`. If future changes require sharding non-gated variants, consider adding the same `tp_mode` and `layer_type` parameters to enable consistent sharding treatment across all RMSNorm backends.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/rms_norm.py` around lines 83-104: the non-gated RMSNorm ops (`torch_rmsnorm`, `triton_rmsnorm`, `flashinfer_rmsnorm`) lack the `tp_mode`/`layer_type` metadata, so `apply_sharding_hints` and `is_any_shardable_op` (which currently checks only gated variants like `torch_rmsnorm_gated` and `triton_rmsnorm_gated` for `ShardableOp.NORM`) never route them to `_apply_hint_norm`. Fix by adding the same `tp_mode` and `layer_type` parameters to the non-gated custom ops' signatures and fake registrations, and update `is_any_shardable_op` (and any dispatch that filters on `ShardableOp.NORM`) to recognize the updated non-gated variants so they receive sharding metadata and are processed by `_apply_hint_norm`.

tensorrt_llm/_torch/auto_deploy/models/custom/modeling_smollm3.py (1)
182-194: Missing `layer_type` parameter in `torch_attention` call.

The `torch_attention` custom op now accepts an optional `layer_type: str = "unknown"` parameter for sharding-hints metadata (as shown in `torch_attention.py`, lines 109-229). While it defaults to `"unknown"`, other model implementations in this PR may benefit from passing an explicit value for consistency with the sharding infrastructure.

Consider adding the `layer_type` parameter for consistency:

```diff
 attn_output = torch.ops.auto_deploy.torch_attention(
     q,  # [B, S, N, head_dim]
     k,  # [B, S, N_kv, head_dim]
     v,  # [B, S, N_kv, head_dim]
     None,  # attn_mask
     0.0,  # dropout_p
     True,  # is_causal
     self.scaling,  # scale
     None,  # sinks
     None,  # sliding_window
     None,  # logit_cap
     "bsnd",  # layout
+    "unknown",  # layer_type
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_smollm3.py` around lines 182-194: the `torch_attention` custom op call (the `attn_output` assignment using `torch.ops.auto_deploy.torch_attention`) is missing the optional `layer_type` parameter; update the call to pass an explicit `layer_type` string (e.g., "self_attn" or another descriptive name) as the final argument to match the new signature and provide sharding-hints metadata, keeping all other arguments unchanged.

tensorrt_llm/_torch/auto_deploy/models/custom/modeling_mistral.py (1)
170-182: Missing `layer_type` parameter in `torch_attention` call.

For consistency with the sharding infrastructure, consider adding:

```diff
 attn_output = torch.ops.auto_deploy.torch_attention(
     q,  # [B, S, N, head_dim]
     k,  # [B, S, N_kv, head_dim]
     v,  # [B, S, N_kv, head_dim]
     None,  # attn_mask
     0.0,  # dropout_p
     True,  # is_causal
     self.scaling,  # scale
     None,  # sinks
     self.sliding_window,  # sliding_window
     None,  # logit_cap
     "bsnd",  # layout
+    "unknown",  # layer_type
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_mistral.py` around lines 170-182: the call to `torch.ops.auto_deploy.torch_attention` (the `attn_output` assignment) is missing the `layer_type` parameter expected by the sharding infrastructure; update the call to pass an appropriate `layer_type` argument (e.g., "self_attn", as used elsewhere) as the next parameter after `"bsnd"` so the call matches the operator signature and sharding logic in the codebase.

tensorrt_llm/_torch/auto_deploy/models/custom/modeling_exaone.py (1)
243-255: Missing `layer_type` parameter in `torch_attention` call.

For consistency with the sharding infrastructure, consider adding:

```diff
 attn_output = torch.ops.auto_deploy.torch_attention(
     q,  # [B, S, N, head_dim]
     k,  # [B, S, N_kv, head_dim]
     v,  # [B, S, N_kv, head_dim]
     None,  # attn_mask
     0.0,  # dropout_p
     True,  # is_causal
     self.scaling,  # scale
     None,  # sinks
     None,  # sliding_window
     None,  # logit_cap
     "bsnd",  # layout
+    "unknown",  # layer_type
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_exaone.py` around lines 243-255: the call to `torch.ops.auto_deploy.torch_attention` is missing the `layer_type` argument for sharding metadata; update the call that constructs `attn_output` to pass a layer type as the final parameter, and if the class lacks a suitable attribute, add or compute one so `torch_attention` receives a valid `layer_type` string.

tensorrt_llm/_torch/auto_deploy/models/custom/modeling_skywork_r1v2.py (1)
372-384: Missing `layer_type` parameter in `torch_attention` call.

Same as other models in this PR, consider adding the explicit `layer_type` parameter for consistency with the sharding infrastructure:

```diff
 attn_output = torch.ops.auto_deploy.torch_attention(
     q,
     k,
     v,
     None,  # attn_mask
     0.0,  # dropout_p
     True,  # is_causal
     self.scaling,
     None,  # sinks
     None,  # sliding_window
     None,  # logit_cap
     "bsnd",
+    "unknown",  # layer_type
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_skywork_r1v2.py` around lines 372-384: the `torch.ops.auto_deploy.torch_attention` call (assigned to `attn_output`) is missing the explicit `layer_type` kwarg expected by the sharding infra; update the call to pass `layer_type` with the same string used in the other models in this PR so `torch_attention` receives it consistently across models.

tensorrt_llm/_torch/auto_deploy/models/custom/modeling_starcoder2.py (1)
167-179: Missing `layer_type` parameter in `torch_attention` call.

For consistency with the sharding infrastructure, consider adding the explicit parameter:

```diff
 attn_output = torch.ops.auto_deploy.torch_attention(
     q,  # [B, S, N, head_dim]
     k,  # [B, S, N_kv, head_dim]
     v,  # [B, S, N_kv, head_dim]
     None,  # attn_mask
     0.0,  # dropout_p
     True,  # is_causal
     self.scaling,  # scale
     None,  # sinks
     self.sliding_window,  # sliding_window
     None,  # logit_cap
     "bsnd",  # layout
+    "unknown",  # layer_type
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_starcoder2.py` around lines 167-179: the `torch.ops.auto_deploy.torch_attention` call (assigned to `attn_output`) is missing the `layer_type` parameter required by the sharding infra; update the call to pass the layer type (e.g., an appropriate string literal as used elsewhere) as the `layer_type` argument so the call matches other usages and the sharding logic can identify this attention layer.

tensorrt_llm/_torch/auto_deploy/models/custom/modeling_seed_oss.py (1)
178-190: Missing `layer_type` parameter in `torch_attention` call.

For consistency with the sharding infrastructure, consider adding:

```diff
 attn_output = torch.ops.auto_deploy.torch_attention(
     q,  # [B, S, N, head_dim]
     k,  # [B, S, N_kv, head_dim]
     v,  # [B, S, N_kv, head_dim]
     None,  # attn_mask
     0.0,  # dropout_p
     True,  # is_causal
     self.scaling,  # scale
     None,  # sinks
     None,  # sliding_window
     None,  # logit_cap
     "bsnd",  # layout
+    "unknown",  # layer_type
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_seed_oss.py` around lines 178-190: the `torch.ops.auto_deploy.torch_attention` call is missing the `layer_type` argument; update the call site (the invocation that passes q, k, v, ..., "bsnd") to include the `layer_type` parameter (e.g., an explicit string as the final argument) so the call matches the expected signature used by the sharding infrastructure.

tensorrt_llm/_torch/auto_deploy/models/custom/modeling_llama3.py (1)
179-191: Missing `layer_type` parameter in `torch_attention` call.

For consistency with the sharding infrastructure, consider adding:

```diff
 attn_output = torch.ops.auto_deploy.torch_attention(
     q,  # [B, S, N, head_dim]
     k,  # [B, S, N_kv, head_dim]
     v,  # [B, S, N_kv, head_dim]
     None,  # attn_mask
     0.0,  # dropout_p
     True,  # is_causal
     self.scaling,  # scale
     None,  # sinks
     None,  # sliding_window
     None,  # logit_cap
     "bsnd",  # layout
+    "unknown",  # layer_type
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_llama3.py` around lines 179-191: the `torch.ops.auto_deploy.torch_attention` call (the `attn_output` assignment) is missing the `layer_type` argument required by the sharding infra; update the call to pass the layer type as the final argument after the `"bsnd"` layout string, providing the correct value to reflect the current layer so `torch_attention` receives the `layer_type` parameter.

tensorrt_llm/_torch/auto_deploy/models/custom/modeling_decilm.py (1)
229-236: Consider using keyword arguments for `torch_attention`.

Similar to other files, positional arguments reduce readability.

♻️ Use keyword arguments

```diff
 # Attention via canonical AD IR op (bsnd layout, handles GQA internally)
 attn_output = torch.ops.auto_deploy.torch_attention(
     query_states,
     key_states,
     value_states,
-    is_causal=True,
-    dropout_p=0.0,
-    layout="bsnd",
+    attn_mask=None,
+    dropout_p=0.0,
+    is_causal=True,
+    scale=None,
+    sinks=None,
+    sliding_window=None,
+    logit_cap=None,
+    layout="bsnd",
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_decilm.py` around lines 229-236: the call to `torch.ops.auto_deploy.torch_attention` currently passes arguments positionally, which hurts readability; change the call site so the options are passed as explicit keyword arguments (e.g., `is_causal=True, dropout_p=0.0, layout="bsnd"`) so the parameters are self-documenting and match other files' style.

tensorrt_llm/_torch/auto_deploy/models/custom/modeling_minimax_m2.py (2)
474-474: Use `raise` instead of `assert` for input validation in public API.

Using `assert` for input validation can be disabled with the `-O` flag. For a public API method, use `raise ValueError` for proper error handling.

♻️ Replace assert with raise

```diff
-    assert position_ids is not None, "position_ids must be provided for AD export"
+    if position_ids is None:
+        raise ValueError("position_ids must be provided for AD export")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_minimax_m2.py` at line 474: replace the runtime-optimizable assertion with explicit input validation. Instead of `assert position_ids is not None, ...`, add a conditional that raises `ValueError` when `position_ids` is `None` (e.g., `if position_ids is None: raise ValueError("position_ids must be provided for AD export")`). This ensures the public API fails reliably even when Python is run with `-O`, which strips assertions.
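The behavioral difference between the two checks can be demonstrated in isolation (function names here are illustrative, not from the PR):

```python
def export_with_assert(position_ids):
    # This check disappears entirely when Python runs with -O / PYTHONOPTIMIZE,
    # silently letting None flow onward.
    assert position_ids is not None, "position_ids must be provided for AD export"
    return position_ids


def export_with_raise(position_ids):
    # This check always runs, regardless of interpreter flags.
    if position_ids is None:
        raise ValueError("position_ids must be provided for AD export")
    return position_ids


try:
    export_with_raise(None)
except ValueError as e:
    print(e)  # → position_ids must be provided for AD export
```

Under a normal interpreter both functions reject `None`; only the `raise`-based version still does so under `python -O`.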
333-345: Consider using keyword arguments for `torch_attention` call.

Using positional arguments for a function with many optional parameters reduces readability and is prone to errors if the signature changes. Consider using keyword arguments for clarity.

♻️ Use keyword arguments

```diff
 # Attention using canonical op with GQA support (BSND layout)
 attn_output = torch.ops.auto_deploy.torch_attention(
     q,
     k,
     v,
-    None,  # attn_mask
-    0.0,  # dropout_p
-    True,  # is_causal
-    self.scaling,  # scale
-    None,  # sinks
-    None,  # sliding_window
-    None,  # logit_cap
-    "bsnd",  # layout
+    attn_mask=None,
+    dropout_p=0.0,
+    is_causal=True,
+    scale=self.scaling,
+    sinks=None,
+    sliding_window=None,
+    logit_cap=None,
+    layout="bsnd",
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_minimax_m2.py` around lines 333-345: the `torch.ops.auto_deploy.torch_attention` call in the `attn_output` assignment uses many positional arguments, which harms readability and is fragile; change it to use explicit keyword arguments (e.g., `attn_mask=None, dropout_p=0.0, is_causal=True, scale=self.scaling, sinks=None, sliding_window=None, logit_cap=None, layout="bsnd"`) so each parameter is clearly labeled and future signature changes are less error-prone.

tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm4_moe.py (2)
147-149: Prefix unused variables with underscore.

The variables `bsz` and `seq_len` are unpacked but unused.

♻️ Prefix unused variables
```diff
  def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
-     bsz, seq_len, hidden_dim = hidden_states.shape
+     _bsz, _seq_len, hidden_dim = hidden_states.shape
      hidden_states_flat = hidden_states.view(-1, hidden_dim)
```
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm4_moe.py` around lines 147 - 149, In the forward method, the unpacked variables bsz and seq_len are unused; rename them to _bsz and _seq_len (or use underscores) when unpacking hidden_states.shape in def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]: so the line becomes something like "_bsz, _seq_len, hidden_dim = hidden_states.shape" to satisfy the lint rule while keeping the rest of the method (e.g., hidden_states_flat) unchanged.
426-426: Use `raise` instead of `assert` for input validation.

Same issue as in modeling_minimax_m2.py - `assert` can be disabled with `-O`.

♻️ Replace assert with raise
```diff
- assert position_ids is not None, "position_ids must be provided for AD export"
+ if position_ids is None:
+     raise ValueError("position_ids must be provided for AD export")
```

This pattern appears in multiple files in this PR (lines 426, 484). Consider applying consistently.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm4_moe.py` at line 426, Replace the runtime-assertion "assert position_ids is not None, 'position_ids must be provided for AD export'" with explicit input validation that always runs (e.g., if position_ids is None: raise ValueError("position_ids must be provided for AD export")) so the check isn't skipped under -O; update the same pattern in this module (modeling_glm4_moe.py) and the other affected file (modeling_minimax_m2.py) where the same assert is used, ensuring you raise an appropriate exception (ValueError or TypeError) in the relevant function/method that performs AD export or forward.tensorrt_llm/_torch/auto_deploy/models/custom/modeling_hunyuan_moe.py (2)
179-188: Prefix unused variables with underscore.

The variables `T` and `D` are unpacked but unused.

♻️ Prefix unused variables
```diff
  def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
      """Return (topk_indices, topk_weights), shapes [T, K] each."""
-     T, D = hidden_states.shape
+     _T, _D = hidden_states.shape
      # Cast both input and weight to float32 for gate computation
      logits = F.linear(hidden_states.float(), self.wg.weight.float())
```
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_hunyuan_moe.py` around lines 179 - 188, In the forward method of modeling_hunyuan_moe.py, the unpacked variables T and D from hidden_states.shape are unused; change their names to _T and _D (or use a single underscore assignment) to mark them as intentionally unused—update the tuple unpack in def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]: so hidden_states.shape is assigned to _T, _D (or _) while leaving the rest of the logic in forward (logits = F.linear(...), gates = F.softmax(...), topk = gates.topk(...), normalization, and return) unchanged.
70-81: Consider using canonical `torch_rmsnorm` op for consistency.

Similar to the Phi4Flash model, this uses manual PyTorch RMSNorm instead of the canonical AD op. Other models in this PR use `torch.ops.auto_deploy.torch_rmsnorm`.

♻️ Use canonical AD op
```diff
  def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
-     input_dtype = hidden_states.dtype
-     hidden_states = hidden_states.to(torch.float32)
-     variance = hidden_states.pow(2).mean(-1, keepdim=True)
-     hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
-     return self.weight * hidden_states.to(input_dtype)
+     return torch.ops.auto_deploy.torch_rmsnorm(
+         hidden_states, self.weight, self.variance_epsilon
+     )
```
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_hunyuan_moe.py` around lines 70 - 81, The HunYuanMoERMSNorm implementation manually computes RMSNorm in forward; replace it with the canonical AD op torch.ops.auto_deploy.torch_rmsnorm to match other models. Update the forward of class HunYuanMoERMSNorm to convert hidden_states to float32 for the op if necessary, call torch.ops.auto_deploy.torch_rmsnorm(hidden_states, self.weight, self.variance_epsilon), and then cast the result back to the original input dtype (preserve input_dtype). Ensure the parameter name self.weight and attribute self.variance_epsilon remain unchanged so other code using this class continues to work.tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4flash.py (1)
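For reference, both the manual PyTorch version and the canonical op compute the same formula, y_i = w_i * x_i / sqrt(mean(x^2) + eps). A dependency-free sketch of that math (pure Python over a single vector, illustrative only; the real code operates on tensors):

```python
import math


def rms_norm(values, weight, eps=1e-6):
    """RMSNorm over one vector: y_i = w_i * x_i / sqrt(mean(x^2) + eps).

    Pure-Python sketch of the computation shared by the manual PyTorch
    code and torch_rmsnorm; not the actual implementation.
    """
    variance = sum(v * v for v in values) / len(values)
    inv_rms = 1.0 / math.sqrt(variance + eps)
    return [w * v * inv_rms for w, v in zip(weight, values)]


out = rms_norm([1.0, 2.0, 2.0], [1.0, 1.0, 1.0])
# With unit weights, the output vector has (approximately) unit RMS.
rms = math.sqrt(sum(v * v for v in out) / len(out))
```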
149-160: Consider using the canonical `torch_rmsnorm` op for consistency.

Other auto-deploy models in this PR use `torch.ops.auto_deploy.torch_rmsnorm` for RMSNorm. This implementation uses a manual PyTorch approach, which works but may miss optimization opportunities in the export pipeline.

♻️ Optional: Use canonical AD op
```diff
  class Phi4FlashRMSNorm(nn.Module):
      def __init__(self, hidden_size: int, eps: float = 1e-5):
          super().__init__()
          self.weight = nn.Parameter(torch.ones(hidden_size))
          self.variance_epsilon = eps

      def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
-         input_dtype = hidden_states.dtype
-         hidden_states = hidden_states.to(torch.float32)
-         variance = hidden_states.pow(2).mean(-1, keepdim=True)
-         hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
-         return self.weight * hidden_states.to(input_dtype)
+         return torch.ops.auto_deploy.torch_rmsnorm(
+             hidden_states, self.weight, self.variance_epsilon
+         )
```
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4flash.py` around lines 149 - 160, The custom Phi4FlashRMSNorm implements RMSNorm manually; replace the manual computation in Phi4FlashRMSNorm.forward with the canonical op torch.ops.auto_deploy.torch_rmsnorm to match other models and enable export optimizations: keep the class and the self.weight nn.Parameter and self.variance_epsilon, capture input_dtype, cast hidden_states to float32 if needed, call torch.ops.auto_deploy.torch_rmsnorm(hidden_states, self.weight, self.variance_epsilon) (or the op's exact arg order used elsewhere in the repo) and then cast the result back to input_dtype before returning; ensure the weight shape matches hidden_size and behavior (eps) remains the same.tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm_moe_dsa.py (1)
279-281: Prefix unused variables with underscore.

The static analysis correctly identifies that `bsz` and `seq_len` are unpacked but unused.

♻️ Prefix unused variables
```diff
  def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
-     bsz, seq_len, hidden_dim = hidden_states.shape
+     _bsz, _seq_len, hidden_dim = hidden_states.shape
      hidden_states_flat = hidden_states.view(-1, hidden_dim)
```
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm_moe_dsa.py` around lines 279 - 281, The forward method unpacks bsz and seq_len but doesn't use them; rename them to _bsz and _seq_len (or prefix with underscores) in the tuple unpacking inside forward to mark them as intentionally unused and avoid static-analysis warnings; update the unpacking line in def forward(self, hidden_states: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]: where bsz, seq_len, hidden_dim = hidden_states.shape to _bsz, _seq_len, hidden_dim = hidden_states.shape and leave subsequent usage of hidden_states and hidden_states_flat unchanged.tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4mm.py (1)
49-60: Consider adding error handling around `spec.loader.exec_module(module)`.

The function loads and executes Python code dynamically from HuggingFace checkpoints. While the null checks on `spec` and `spec.loader` provide basic safety, wrapping the `exec_module` call in try-except would improve robustness, similar to the pattern used in tensorrt_llm/serve/cluster_storage.py. This is especially relevant since arbitrary code execution from checkpoints is inherent to the `trust_remote_code` pattern and should fail gracefully if module execution encounters errors.
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4mm.py` around lines 49 - 60, The _load_hf_aux_module function currently calls spec.loader.exec_module(module) without guarding runtime errors; wrap that exec_module call in a try/except that catches Exception, and re-raise a clearer ImportError (or raise after logging) that includes module_name and module_path plus the original exception info so failures while executing checkpoint code fail gracefully; follow the same pattern used in tensorrt_llm/serve/cluster_storage.py and ensure spec.loader.exec_module(module) is the protected call.
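The suggested wrapping can be sketched with the standard `importlib` machinery. This is an illustrative pattern only; the function and module names below are hypothetical and do not match the actual `_load_hf_aux_module` implementation:

```python
import importlib.util
import os
import tempfile


def load_module_from_path(module_name, module_path):
    """Load a Python file as a module, failing with a clear ImportError.

    Hypothetical sketch of the review suggestion: guard both spec creation
    and module execution so failures name the offending file.
    """
    spec = importlib.util.spec_from_file_location(module_name, module_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot create spec for {module_name} at {module_path}")
    module = importlib.util.module_from_spec(spec)
    try:
        spec.loader.exec_module(module)
    except Exception as e:
        # Surface which file failed instead of a bare traceback.
        raise ImportError(
            f"failed to execute {module_name} from {module_path}"
        ) from e
    return module


# Demo with a throwaway module file.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "aux_mod.py")
    with open(path, "w") as f:
        f.write("ANSWER = 42\n")
    mod = load_module_from_path("aux_mod", path)
    print(mod.ANSWER)  # 42
```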
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1166a167-d8ff-43e5-be58-962d10f8ccad
⛔ Files ignored due to path filters (1)
`tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/porting_log.csv` is excluded by `!**/*.csv`
📒 Files selected for processing (87)
examples/auto_deploy/new_sharding/deepseek/deepseek_r1_sharding_poc.yaml
examples/auto_deploy/new_sharding/deepseek/deepseek_v2_5_sharding_poc.yaml
examples/auto_deploy/new_sharding/internlm/internlm3_sharding_poc.yaml
examples/auto_deploy/new_sharding/llama/llama3_sharding_poc.yaml
examples/auto_deploy/new_sharding/mistral/mistral_sharding_poc.yaml
examples/auto_deploy/new_sharding/nemotron/nemotron_sharding_poc_fp8.yaml
examples/auto_deploy/new_sharding/nemotron/nemotron_sharding_poc_nvfp4.yaml
examples/auto_deploy/new_sharding/qwen/qwen3_5_moe_sharding_poc.yaml
examples/auto_deploy/new_sharding/qwen/qwen3_sharding_poc.yaml
examples/auto_deploy/new_sharding/smollm/smollm3_sharding_poc.yaml
tensorrt_llm/_torch/auto_deploy/config/default.yaml
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/torch_attention.py
tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_gated_delta.py
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/torch_moe.py
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py
tensorrt_llm/_torch/auto_deploy/custom_ops/linear/linear.py
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_causal_conv.py
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_mamba.py
tensorrt_llm/_torch/auto_deploy/custom_ops/mla/torch_mla.py
tensorrt_llm/_torch/auto_deploy/custom_ops/normalization/rms_norm.py
tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/quant.py
tensorrt_llm/_torch/auto_deploy/custom_ops/quantization/torch_quant.py
tensorrt_llm/_torch/auto_deploy/custom_ops/sharding_ops.py
tensorrt_llm/_torch/auto_deploy/llm_args.py
tensorrt_llm/_torch/auto_deploy/models/custom/mla_rope_utils.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_cohere.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_decilm.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_deepseek_v2.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_exaone.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma2.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gemma3.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm4_moe.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_glm_moe_dsa.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_gpt_oss.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_granite.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_granite_moe_hybrid.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_hunyuan_dense.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_hunyuan_moe.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_internlm3.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_llama3.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_llama4.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_minimax_m2.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_mistral.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_mistral3.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_olmo3.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4_visionr.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4flash.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_phi4mm.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen2.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_5.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_moe.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_qwen3_next.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_seed_oss.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_skywork_r1v2.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_smollm3.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_starcoder2.py
tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/__init__.py
tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/modeling_deepseek.py
tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/modeling_deepseek_v2.py
tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/modeling_internlm3.py
tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/modeling_llama3.py
tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/modeling_mistral.py
tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/modeling_nemotron_h.py
tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/modeling_qwen3.py
tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/modeling_qwen3_5_moe.py
tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/modeling_smollm3.py
tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/porting_instructions.md
tensorrt_llm/_torch/auto_deploy/models/custom/new_sharding/register_sharded_models.py
tensorrt_llm/_torch/auto_deploy/transform/interface.py
tensorrt_llm/_torch/auto_deploy/transform/library/collectives.py
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_quant.py
tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py
tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py
tensorrt_llm/_torch/auto_deploy/transform/optimizer.py
tensorrt_llm/_torch/auto_deploy/utils/dist_config.py
tensorrt_llm/_torch/auto_deploy/utils/mapping_utils.py
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py
tests/unittest/auto_deploy/multigpu/transformations/library/test_apply_sharding_hints.py
tests/unittest/auto_deploy/multigpu/transformations/library/test_tp_sharding.py
tests/unittest/auto_deploy/singlegpu/custom_ops/test_sharding_ops.py
tests/unittest/auto_deploy/singlegpu/utils/test_dist_config.py
tests/unittest/auto_deploy/singlegpu/utils/test_mapping_utils.py
tests/unittest/auto_deploy/singlegpu/utils/test_node_utils_sharding.py
|
/bot run |
|
/bot help |
GitHub Bot Help
Provide a user friendly way for developers to interact with a Jenkins server.

See details below for each supported subcommand.

run

Launch build/test pipelines. All previously running jobs will be killed.

kill

Kill all running builds associated with pull request.

skip

Skip testing for latest commit on pull request.

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break. |
|
PR_Github #39792 [ run ] triggered by Bot. Commit: |
|
PR_Github #39792 [ run ] completed with state |
|
/bot run |
|
PR_Github #39810 [ run ] triggered by Bot. Commit: |
|
PR_Github #39810 [ run ] completed with state
|
|
/bot run |
|
PR_Github #39832 [ run ] triggered by Bot. Commit: |
|
PR_Github #39832 [ run ] completed with state
|
9db20e2 to
f79caad
Compare
|
/bot run |
|
PR_Github #39834 [ run ] triggered by Bot. Commit: |
|
PR_Github #39834 [ run ] completed with state |
|
/bot run |
|
PR_Github #39885 [ run ] triggered by Bot. Commit: |
|
PR_Github #39885 [ run ] completed with state
|
|
/bot run --reuse-test --disable-fail-fast |
|
PR_Github #39950 [ run ] triggered by Bot. Commit: |
|
PR_Github #39950 [ run ] completed with state
|
|
/bot run --reuse-test --disable-fail-fast |
|
PR_Github #40073 [ run ] triggered by Bot. Commit: |
|
PR_Github #40073 [ run ] completed with state
|
|
PR_Github #43467 [ run ] triggered by Bot. Commit: |
|
PR_Github #43467 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
…dist_config Signed-off-by: greg-kwasniewski1 <213329731+greg-kwasniewski1@users.noreply.github.com> Made-with: Cursor
|
PR_Github #43565 [ run ] triggered by Bot. Commit: |
|
PR_Github #43565 [ run ] completed with state
|
|
/bot run --reuse-test --disable-fail-fast |
|
PR_Github #43718 [ run ] triggered by Bot. Commit: |
|
PR_Github #43718 [ run ] completed with state
|
|
/bot run --reuse-test --disable-fail-fast |
|
PR_Github #43764 [ run ] triggered by Bot. Commit: |
|
PR_Github #43764 [ run ] completed with state
|
|
/bot run --reuse-test --disable-fail-fast |
|
PR_Github #44018 [ run ] triggered by Bot. Commit: |
|
PR_Github #44018 [ run ] completed with state |
|
/bot run --reuse-test --disable-fail-fast |
|
PR_Github #44109 [ run ] triggered by Bot. Commit: |
|
PR_Github #44109 [ run ] completed with state
|
|
/bot help |
|
/bot skip --comment "All single GPU and multi GPU tests pass, except for test_llama_executor[llama-leader-90] in DGX_H100-4_GPUs-CPP-1 test. This is unrelated to this PR. All tests that may have been touched by this PR passed. The code touches only auto-deploy infrastructure, and auto-deploy single and multigpu tests pass". |
|
PR_Github #44442 [ skip ] triggered by Bot. Commit: |
|
PR_Github #44442 [ skip ] completed with state |
Fixes #12291
Summary by CodeRabbit
Release Notes
New Features
Improvements
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.