
[None][DO NOT REVIEW] Trigger CI Only #11463

Draft

yizhang-nv wants to merge 22 commits into NVIDIA:main from yizhang-nv:enable-v2-by-default

Conversation

@yizhang-nv
Member

@coderabbitai summary

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update the tava architecture diagram if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break the top of tree.

@yizhang-nv
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #35703 [ run ] triggered by Bot. Commit: 9c0eb21

@yizhang-nv
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #35725 [ run ] triggered by Bot. Commit: c63b434

@tensorrt-cicd
Collaborator

PR_Github #35725 [ run ] completed with state SUCCESS. Commit: c63b434
/LLM/main/L0_MergeRequest_PR pipeline #27593 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yizhang-nv
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #36446 [ run ] triggered by Bot. Commit: 39d0d39 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #36446 [ run ] completed with state SUCCESS. Commit: 39d0d39
/LLM/main/L0_MergeRequest_PR pipeline #28194 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yizhang-nv
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #36493 [ run ] triggered by Bot. Commit: 6177848 Link to invocation

@yizhang-nv
Member Author

/bot run --disable-fail-fast

1 similar comment
@yizhang-nv
Member Author

/bot run --disable-fail-fast

@yizhang-nv
Member Author

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #36497 [ run ] triggered by Bot. Commit: 2b7c494 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #36499 [ run ] triggered by Bot. Commit: 2b7c494 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #36500 [ kill ] triggered by Bot. Commit: 2b7c494 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #36499 [ run ] completed with state ABORTED. Commit: 2b7c494

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #36500 [ kill ] completed with state SUCCESS. Commit: 2b7c494
Successfully killed previous jobs for commit 2b7c494

Link to invocation

@yizhang-nv
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #36501 [ run ] triggered by Bot. Commit: 2b7c494 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #36501 [ run ] completed with state SUCCESS. Commit: 2b7c494
/LLM/main/L0_MergeRequest_PR pipeline #28239 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yizhang-nv
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #36591 [ run ] triggered by Bot. Commit: 44b6a1a Link to invocation

@yizhang-nv
Member Author

/bot run --help

@yizhang-nv
Member Author

/bot run --disable-fail-fast --post-merge

yizhang-nv and others added 8 commits March 22, 2026 09:15
A
Signed-off-by: Yi Zhang <yizhan@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
B
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
C
Un-skip tests that pass with v2 KV cache, remove stale waives,
re-enable RTX Pro 6000 Nemotron tests, and add multimodal-aware
block reuse token augmentation for KVCacheManagerV2.

Signed-off-by: Yi Zhang <yizhan@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
…ible speculative tests

Fix prepare_resources/rewind mismatch when CUDA graph padding extends
draft tokens beyond what was allocated. Add _extend_kv_cache_for_padding
hook so NGram and two-model drafters extend KV cache capacity after
padding, matching the rewind amount computed by TorchSampler.

Skip two-model eagle3 and pard tests that OOM with v2 KV cache (v2
does not support two-model budget splitting). Enable speculative test
suite on B200.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
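The padding/rewind bookkeeping described above can be sketched in a few lines. The function names here are illustrative, not the actual TRT-LLM API: the point is that the KV cache must grow by exactly the number of tokens the sampler will later rewind.

```python
# Hypothetical sketch of the CUDA-graph padding / rewind invariant.
# Names are placeholders, not real TRT-LLM identifiers.

def pad_draft_tokens(num_draft_tokens: int, cuda_graph_draft_len: int) -> int:
    """CUDA graph replay pads draft tokens up to the captured length."""
    return max(num_draft_tokens, cuda_graph_draft_len)

def extra_kv_capacity_needed(num_draft_tokens: int, cuda_graph_draft_len: int) -> int:
    """The KV cache must be extended by the same amount the sampler rewinds."""
    return pad_draft_tokens(num_draft_tokens, cuda_graph_draft_len) - num_draft_tokens

# A drafter that allocated 3 draft tokens but replays a graph captured
# for 8 must extend KV cache capacity by 5 slots before the forward pass.
```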
…2 KV cache

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
In _commit_block, when a partial block hits UselessBlockError against
a full tree block, the rebase path incorrectly swaps the request's
copy page with the shared tree page. Any subsequent writes by the
request (e.g. during generation) then corrupt the tree page, breaking
other active requests that share it.

Add `and is_full` to the rebase condition so only full-to-full rebase
is allowed. Partial blocks now fall through to VIRTUAL_STOP instead.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
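The tightened rebase condition can be illustrated as a two-flag predicate. This is a hedged sketch; the real `_commit_block` logic is more involved, and these names are hypothetical:

```python
# Illustrative guard for the rebase path described above.

def may_rebase(request_block_is_full: bool, tree_block_is_full: bool) -> bool:
    # Rebasing swaps the request's private copy page for the shared tree
    # page. That is only safe when the request's block is full, i.e. the
    # request will never write to that page again. A partial block would
    # keep writing and corrupt the shared page, so it must fall through
    # to VIRTUAL_STOP instead.
    return tree_block_is_full and request_block_is_full
```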
On-the-fly CUDA graph capture during generation can resize the shared
cuda_graph_workspace tensor, invalidating addresses baked into previously
captured graphs and causing illegal memory access on replay.

This happens because create_cuda_graph_metadata() uses copy.copy(), so
all CG metadata objects share the same cuda_graph_workspace tensor. When
a later capture needs a larger workspace, resize_() changes the tensor
address, but earlier graphs still reference the old address.

Fix: disable CUDA graph capture by default. Only allow capture during
the warmup phase via the new allow_capture() context manager. Uncaptured
batch sizes fall back to eager execution instead of on-the-fly capture.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
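The aliasing at the root of this bug can be reproduced in plain Python: `copy.copy()` is a shallow copy, so every metadata clone shares the same workspace object. A `bytearray` stands in for the CUDA graph workspace tensor here; the class name is illustrative.

```python
import copy

class CGMetadata:
    def __init__(self) -> None:
        self.cuda_graph_workspace = bytearray(16)  # stand-in for the tensor

base = CGMetadata()
clone = copy.copy(base)  # shallow copy: attributes are shared, not duplicated

# Both objects reference the SAME workspace, so growing it for one
# captured graph silently changes what the other graph sees as well --
# the analogue of resize_() moving the tensor's device address out from
# under previously captured graphs.
assert clone.cuda_graph_workspace is base.cuda_graph_workspace
```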
@yizhang-nv yizhang-nv force-pushed the enable-v2-by-default branch from d90da6e to 6a06738 Compare March 22, 2026 16:18
…ic in V2

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
@yizhang-nv
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39840 [ run ] triggered by Bot. Commit: 4148a6b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39840 [ run ] completed with state SUCCESS. Commit: 4148a6b
/LLM/main/L0_MergeRequest_PR pipeline #31015 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

_get_model_kv_cache_manager_cls() was added by PR NVIDIA#12242 but bypassed
the V2→V1 fallback logic (beam width > 1, kv_connector, etc.). Move the
fallback into that method so all callers get consistent behavior.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
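The consolidation pattern described above routes every lookup through one accessor that applies the V2 → V1 fallback rules. Class names and predicates below are placeholders, not the actual TRT-LLM code:

```python
# Hedged sketch: a single accessor keeps fallback behavior consistent
# across all callers, instead of each call site re-implementing it.

class KVCacheManagerV1: ...
class KVCacheManagerV2: ...

def get_kv_cache_manager_cls(beam_width: int, has_kv_connector: bool):
    # Configurations V2 does not yet support fall back to V1.
    if beam_width > 1 or has_kv_connector:
        return KVCacheManagerV1
    return KVCacheManagerV2
```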
@yizhang-nv
Member Author

/bot run --disable-fail-fast

Pass None instead of new_capacity to kv_cache.resize() during context
phase, allowing the cache to determine the appropriate capacity.

Signed-off-by: Yi Zhang <yizhan@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Remove stale waives for test_openai_completions_example and
test_openai_misc_example on A10 — these are no longer flaky.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
@yizhang-nv
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39874 [ run ] triggered by Bot. Commit: 5595e98 Link to invocation

V2 scheduler BudgetTracker doesn't account for peft pages occupied
by ongoing generation requests, causing cache full errors when
context requests with different adapters are scheduled first.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
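The accounting fix above amounts to counting pages that generation requests already pin. A minimal sketch, with hypothetical names:

```python
# Illustrative PEFT page budget check: pages pinned by ongoing generation
# requests must count against the total before admitting a context
# request with a different adapter.

def peft_pages_available(total_pages: int, generation_pages_in_use: int,
                         requested_pages: int) -> bool:
    return generation_pages_in_use + requested_pages <= total_pages
```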
V2 KV cache manager resolves memory pressure that previously caused OOM
on RTX Pro 6000D. All four tests (test_auto_dtype, test_auto_dtype_long_rope,
test_fp4, test_fp8) verified passing on RTX 6000D.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Two-phase scheduling: defer all context/encoder requests to phase 2
so generation requests' PEFT pages are fully committed before context
requests compete for device cache space.

Pre-claim PEFT pages for GENERATION_TO_COMPLETE requests whose adapters
are still active on device but not yet released (mark_request_done runs
after prepare_resources in the overlap executor's next iteration).

Removes pytest.skip on test_llama_7b_multi_lora_evict_and_reload_lora_gpu_cache
which is now fixed by these changes.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
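The two-phase ordering can be sketched as a simple partition: generation requests commit their PEFT pages first, and context/encoder requests only compete for whatever device cache remains. All names below are hypothetical:

```python
# Illustrative two-phase scheduling order, per the description above.

def schedule_two_phase(requests):
    generation = [r for r in requests if r["kind"] == "generation"]
    context = [r for r in requests if r["kind"] in ("context", "encoder")]
    # Phase 1: generation requests (including pre-claimed PEFT pages for
    # adapters still resident on device). Phase 2: deferred context work.
    return generation + context

order = schedule_two_phase([
    {"id": 1, "kind": "context"},
    {"id": 2, "kind": "generation"},
    {"id": 3, "kind": "encoder"},
])
```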
Method was accidentally dropped during rebase in 4148a6b. Required
by NGram and two-model drafters for CUDA graph padding KV cache
extension after pad_draft_tokens_for_cuda_graph.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Test verified to pass on single H100 with 83.17% GSM8K accuracy.

Signed-off-by: Yi Zhang <yizhang@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Move max_blocks_per_seq computation after max_seq_len clamping and
include num_extra_kv_tokens + max_total_draft_tokens in the
calculation. This ensures the host page-index buffer is large enough
for the maximum capacity a single sequence can reach during warmup
or normal operation.

Previously, the draft V2 KV cache manager received a clamped
max_seq_len that did not account for extra speculative decoding
tokens, resulting in a max_blocks_per_seq that was too small. During
warmup, draft_kv_cache.resize() would fail with "User-provided base
page indices is too short" because the resize needed more blocks
than the buffer could hold.

Signed-off-by: Yi Zhang <yizhan@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
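The sizing rule described above can be written out as arithmetic; function and parameter names are illustrative, not the actual code:

```python
import math

# Clamp max_seq_len first, then size the host page-index buffer for the
# largest capacity a single sequence can reach, including speculative
# draft tokens and extra KV tokens.

def max_blocks_per_seq(max_seq_len: int, num_extra_kv_tokens: int,
                       max_total_draft_tokens: int, tokens_per_block: int) -> int:
    peak_tokens = max_seq_len + num_extra_kv_tokens + max_total_draft_tokens
    return math.ceil(peak_tokens / tokens_per_block)

# With 32-token blocks, max_seq_len=2048, 1 extra KV token and 3 draft
# tokens, the buffer needs ceil(2052 / 32) = 65 blocks -- one more than
# the 64 a draft-unaware computation would reserve.
```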
The V2 MAX_UTILIZATION scheduler relies on suspend/resume to evict and
later restore KV cache pages when GPU memory is tight. Without a host
cache tier, suspended pages have nowhere to be offloaded and resume()
always fails, causing a scheduling deadlock where no generation request
can ever make progress.

Automatically provision a host tier matching the GPU quota (capped at
50% of available host memory) so suspend/resume works out of the box.
This fixes the PARD speculative decoding test which previously
deadlocked with max_tokens=2048 and 3 concurrent requests.

Also re-enable the test_pard unit test that was skipped due to this
issue.

Signed-off-by: Yi Zhang <yizhan@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
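The provisioning rule above reduces to a one-line cap. A hedged sketch, with assumed names:

```python
# Auto-provision the host tier at the GPU quota, capped at 50% of
# available host memory, so suspend/resume works out of the box.

def host_tier_bytes(gpu_quota_bytes: int, available_host_bytes: int) -> int:
    return min(gpu_quota_bytes, available_host_bytes // 2)
```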
Add a deadlock check in KVCacheV2Scheduler: if generation requests
are active but none could be scheduled or evicted, raise a clear
RuntimeError instead of spinning forever. This replaces silent hangs
with an actionable error message pointing to host cache or max_tokens
configuration.

Also remove pytest.skip from test cases that have been verified to
pass with V2 KV cache enabled by default.

Signed-off-by: Yi Zhang <yizhan@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
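The deadlock check described above can be sketched as a no-progress guard; the function and message wording are illustrative, not the actual KVCacheV2Scheduler code:

```python
# Fail loudly instead of spinning when generation requests exist but
# none could be scheduled or evicted this iteration.

def check_scheduling_progress(num_active_generation: int,
                              num_scheduled: int, num_evicted: int) -> None:
    if num_active_generation > 0 and num_scheduled == 0 and num_evicted == 0:
        raise RuntimeError(
            "V2 scheduler made no progress: no generation request could be "
            "scheduled or evicted. Consider enabling a host cache tier or "
            "lowering max_tokens.")

check_scheduling_progress(2, 1, 0)      # progress was made: no error
try:
    check_scheduling_progress(2, 0, 0)  # stuck: raises instead of hanging
    deadlock_detected = False
except RuntimeError:
    deadlock_detected = True
```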
_get_token_num_for_estimation() computes max_num_tokens_in_memory via
free_gpu_memory_fraction (float) * free_mem (int), producing a float
that propagates through floor-division and multiplication. This float
ends up in kv_cache_config.max_tokens, then KVCacheManagerV2._gpu_max_tokens,
causing min(int, float) to return float for max_seq_len. The float
max_seq_len eventually reaches the C++ attention() nanobind call as
attention_window_size, which expects int, triggering a TypeError.

Cast max_num_tokens_in_memory to int to ensure the token count stays
integral throughout the KV cache configuration pipeline.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
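The type drift described above reproduces in plain Python, since `min(int, float)` returns whichever value is smaller, type included. Variable names here are illustrative, not the real config fields:

```python
free_mem_tokens = 100_000                # int: tokens that fit in free memory
free_gpu_memory_fraction = 0.9           # float

max_num_tokens = free_gpu_memory_fraction * free_mem_tokens   # 90000.0 -- a float
max_seq_len = min(131_072, max_num_tokens)                    # min(int, float) -> 90000.0

assert isinstance(max_seq_len, float)    # this float later reaches a nanobind
                                         # call that requires int -> TypeError

# The fix: cast once at the source so the count stays integral throughout.
max_num_tokens = int(free_gpu_memory_fraction * free_mem_tokens)
max_seq_len = min(131_072, max_num_tokens)
assert isinstance(max_seq_len, int)
```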
@yizhang-nv
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39916 [ run ] triggered by Bot. Commit: f08def5 Link to invocation
