
[None][DO NOT REVIEW] Trigger CI Only #11463

Draft

yizhang-nv wants to merge 22 commits into NVIDIA:main from yizhang-nv:enable-v2-by-default

Conversation

@yizhang-nv
Member

@coderabbitai summary

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update the tava architecture diagram if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break the top of tree.

@yizhang-nv
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #35703 [ run ] triggered by Bot. Commit: 9c0eb21

@yizhang-nv
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #35725 [ run ] triggered by Bot. Commit: c63b434

@tensorrt-cicd
Collaborator

PR_Github #35725 [ run ] completed with state SUCCESS. Commit: c63b434
/LLM/main/L0_MergeRequest_PR pipeline #27593 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@yizhang-nv
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #36446 [ run ] triggered by Bot. Commit: 39d0d39 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #36446 [ run ] completed with state SUCCESS. Commit: 39d0d39
/LLM/main/L0_MergeRequest_PR pipeline #28194 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yizhang-nv
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #36493 [ run ] triggered by Bot. Commit: 6177848 Link to invocation

@yizhang-nv
Member Author

/bot run --disable-fail-fast

1 similar comment
@yizhang-nv
Member Author

/bot run --disable-fail-fast

@yizhang-nv
Member Author

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #36497 [ run ] triggered by Bot. Commit: 2b7c494 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #36499 [ run ] triggered by Bot. Commit: 2b7c494 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #36500 [ kill ] triggered by Bot. Commit: 2b7c494 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #36499 [ run ] completed with state ABORTED. Commit: 2b7c494

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #36500 [ kill ] completed with state SUCCESS. Commit: 2b7c494
Successfully killed previous jobs for commit 2b7c494

Link to invocation

@yizhang-nv
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #36501 [ run ] triggered by Bot. Commit: 2b7c494 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #36501 [ run ] completed with state SUCCESS. Commit: 2b7c494
/LLM/main/L0_MergeRequest_PR pipeline #28239 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yizhang-nv
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #36591 [ run ] triggered by Bot. Commit: 44b6a1a Link to invocation

@yizhang-nv
Member Author

/bot run --help

@yizhang-nv
Member Author

/bot run --disable-fail-fast --post-merge

yizhang-nv and others added 8 commits March 22, 2026 09:15
A
Signed-off-by: Yi Zhang <yizhan@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
B
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
C
Un-skip tests that pass with v2 KV cache, remove stale waives,
re-enable RTX Pro 6000 Nemotron tests, and add multimodal-aware
block reuse token augmentation for KVCacheManagerV2.

Signed-off-by: Yi Zhang <yizhan@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
…ible speculative tests

Fix prepare_resources/rewind mismatch when CUDA graph padding extends
draft tokens beyond what was allocated. Add _extend_kv_cache_for_padding
hook so NGram and two-model drafters extend KV cache capacity after
padding, matching the rewind amount computed by TorchSampler.

Skip two-model eagle3 and pard tests that OOM with v2 KV cache (v2
does not support two-model budget splitting). Enable speculative test
suite on B200.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
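The padding/rewind bookkeeping described above can be sketched in a few lines. The function names here are illustrative, not the actual TRT-LLM API: the point is that the KV cache must grow by exactly the number of tokens the sampler will later rewind.

```python
# Hypothetical sketch of the CUDA-graph padding / rewind invariant.
# Names are placeholders, not real TRT-LLM identifiers.

def pad_draft_tokens(num_draft_tokens: int, cuda_graph_draft_len: int) -> int:
    """CUDA graph replay pads draft tokens up to the captured length."""
    return max(num_draft_tokens, cuda_graph_draft_len)

def extra_kv_capacity_needed(num_draft_tokens: int, cuda_graph_draft_len: int) -> int:
    """The KV cache must be extended by the same amount the sampler rewinds."""
    return pad_draft_tokens(num_draft_tokens, cuda_graph_draft_len) - num_draft_tokens

# A drafter that allocated 3 draft tokens but replays a graph captured
# for 8 must extend KV cache capacity by 5 slots before the forward pass.
```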
…2 KV cache

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
In _commit_block, when a partial block hits UselessBlockError against
a full tree block, the rebase path incorrectly swaps the request's
copy page with the shared tree page. Any subsequent writes by the
request (e.g. during generation) then corrupt the tree page, breaking
other active requests that share it.

Add `and is_full` to the rebase condition so only full-to-full rebase
is allowed. Partial blocks now fall through to VIRTUAL_STOP instead.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
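The tightened rebase condition can be illustrated as a two-flag predicate. This is a hedged sketch; the real `_commit_block` logic is more involved, and these names are hypothetical:

```python
# Illustrative guard for the rebase path described above.

def may_rebase(request_block_is_full: bool, tree_block_is_full: bool) -> bool:
    # Rebasing swaps the request's private copy page for the shared tree
    # page. That is only safe when the request's block is full, i.e. the
    # request will never write to that page again. A partial block would
    # keep writing and corrupt the shared page, so it must fall through
    # to VIRTUAL_STOP instead.
    return tree_block_is_full and request_block_is_full
```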
On-the-fly CUDA graph capture during generation can resize the shared
cuda_graph_workspace tensor, invalidating addresses baked into previously
captured graphs and causing illegal memory access on replay.

This happens because create_cuda_graph_metadata() uses copy.copy(), so
all CG metadata objects share the same cuda_graph_workspace tensor. When
a later capture needs a larger workspace, resize_() changes the tensor
address, but earlier graphs still reference the old address.

Fix: disable CUDA graph capture by default. Only allow capture during
the warmup phase via the new allow_capture() context manager. Uncaptured
batch sizes fall back to eager execution instead of on-the-fly capture.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
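The aliasing at the root of this bug can be reproduced in plain Python: `copy.copy()` is a shallow copy, so every metadata clone shares the same workspace object. A `bytearray` stands in for the CUDA graph workspace tensor here; the class name is illustrative.

```python
import copy

class CGMetadata:
    def __init__(self) -> None:
        self.cuda_graph_workspace = bytearray(16)  # stand-in for the tensor

base = CGMetadata()
clone = copy.copy(base)  # shallow copy: attributes are shared, not duplicated

# Both objects reference the SAME workspace, so growing it for one
# captured graph silently changes what the other graph sees as well --
# the analogue of resize_() moving the tensor's device address out from
# under previously captured graphs.
assert clone.cuda_graph_workspace is base.cuda_graph_workspace
```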
@yizhang-nv yizhang-nv force-pushed the enable-v2-by-default branch from d90da6e to 6a06738 Compare March 22, 2026 16:18
…ic in V2

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
@yizhang-nv
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39840 [ run ] triggered by Bot. Commit: 4148a6b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39840 [ run ] completed with state SUCCESS. Commit: 4148a6b
/LLM/main/L0_MergeRequest_PR pipeline #31015 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

_get_model_kv_cache_manager_cls() was added by PR NVIDIA#12242 but bypassed
the V2→V1 fallback logic (beam width > 1, kv_connector, etc.). Move the
fallback into that method so all callers get consistent behavior.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
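The consolidation pattern described above routes every lookup through one accessor that applies the V2 → V1 fallback rules. Class names and predicates below are placeholders, not the actual TRT-LLM code:

```python
# Hedged sketch: a single accessor keeps fallback behavior consistent
# across all callers, instead of each call site re-implementing it.

class KVCacheManagerV1: ...
class KVCacheManagerV2: ...

def get_kv_cache_manager_cls(beam_width: int, has_kv_connector: bool):
    # Configurations V2 does not yet support fall back to V1.
    if beam_width > 1 or has_kv_connector:
        return KVCacheManagerV1
    return KVCacheManagerV2
```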
@yizhang-nv
Member Author

/bot run --disable-fail-fast

Pass None instead of new_capacity to kv_cache.resize() during context
phase, allowing the cache to determine the appropriate capacity.

Signed-off-by: Yi Zhang <yizhan@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Remove stale waives for test_openai_completions_example and
test_openai_misc_example on A10 — these are no longer flaky.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
@yizhang-nv
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39874 [ run ] triggered by Bot. Commit: 5595e98 Link to invocation

V2 scheduler BudgetTracker doesn't account for peft pages occupied
by ongoing generation requests, causing cache full errors when
context requests with different adapters are scheduled first.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
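The accounting fix above amounts to counting pages that generation requests already pin. A minimal sketch, with hypothetical names:

```python
# Illustrative PEFT page budget check: pages pinned by ongoing generation
# requests must count against the total before admitting a context
# request with a different adapter.

def peft_pages_available(total_pages: int, generation_pages_in_use: int,
                         requested_pages: int) -> bool:
    return generation_pages_in_use + requested_pages <= total_pages
```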
V2 KV cache manager resolves memory pressure that previously caused OOM
on RTX Pro 6000D. All four tests (test_auto_dtype, test_auto_dtype_long_rope,
test_fp4, test_fp8) verified passing on RTX 6000D.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Two-phase scheduling: defer all context/encoder requests to phase 2
so generation requests' PEFT pages are fully committed before context
requests compete for device cache space.

Pre-claim PEFT pages for GENERATION_TO_COMPLETE requests whose adapters
are still active on device but not yet released (mark_request_done runs
after prepare_resources in the overlap executor's next iteration).

Removes pytest.skip on test_llama_7b_multi_lora_evict_and_reload_lora_gpu_cache
which is now fixed by these changes.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
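The two-phase ordering can be sketched as a simple partition: generation requests commit their PEFT pages first, and context/encoder requests only compete for whatever device cache remains. All names below are hypothetical:

```python
# Illustrative two-phase scheduling order, per the description above.

def schedule_two_phase(requests):
    generation = [r for r in requests if r["kind"] == "generation"]
    context = [r for r in requests if r["kind"] in ("context", "encoder")]
    # Phase 1: generation requests (including pre-claimed PEFT pages for
    # adapters still resident on device). Phase 2: deferred context work.
    return generation + context

order = schedule_two_phase([
    {"id": 1, "kind": "context"},
    {"id": 2, "kind": "generation"},
    {"id": 3, "kind": "encoder"},
])
```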
Method was accidentally dropped during rebase in 4148a6b. Required
by NGram and two-model drafters for CUDA graph padding KV cache
extension after pad_draft_tokens_for_cuda_graph.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Test verified to pass on single H100 with 83.17% GSM8K accuracy.

Signed-off-by: Yi Zhang <yizhang@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Move max_blocks_per_seq computation after max_seq_len clamping and
include num_extra_kv_tokens + max_total_draft_tokens in the
calculation. This ensures the host page-index buffer is large enough
for the maximum capacity a single sequence can reach during warmup
or normal operation.

Previously, the draft V2 KV cache manager received a clamped
max_seq_len that did not account for extra speculative decoding
tokens, resulting in a max_blocks_per_seq that was too small. During
warmup, draft_kv_cache.resize() would fail with "User-provided base
page indices is too short" because the resize needed more blocks
than the buffer could hold.

Signed-off-by: Yi Zhang <yizhan@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
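The sizing rule described above can be written out as arithmetic; function and parameter names are illustrative, not the actual code:

```python
import math

# Clamp max_seq_len first, then size the host page-index buffer for the
# largest capacity a single sequence can reach, including speculative
# draft tokens and extra KV tokens.

def max_blocks_per_seq(max_seq_len: int, num_extra_kv_tokens: int,
                       max_total_draft_tokens: int, tokens_per_block: int) -> int:
    peak_tokens = max_seq_len + num_extra_kv_tokens + max_total_draft_tokens
    return math.ceil(peak_tokens / tokens_per_block)

# With 32-token blocks, max_seq_len=2048, 1 extra KV token and 3 draft
# tokens, the buffer needs ceil(2052 / 32) = 65 blocks -- one more than
# the 64 a draft-unaware computation would reserve.
```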
The V2 MAX_UTILIZATION scheduler relies on suspend/resume to evict and
later restore KV cache pages when GPU memory is tight. Without a host
cache tier, suspended pages have nowhere to be offloaded and resume()
always fails, causing a scheduling deadlock where no generation request
can ever make progress.

Automatically provision a host tier matching the GPU quota (capped at
50% of available host memory) so suspend/resume works out of the box.
This fixes the PARD speculative decoding test which previously
deadlocked with max_tokens=2048 and 3 concurrent requests.

Also re-enable the test_pard unit test that was skipped due to this
issue.

Signed-off-by: Yi Zhang <yizhan@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
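The provisioning rule above reduces to a one-line cap. A hedged sketch, with assumed names:

```python
# Auto-provision the host tier at the GPU quota, capped at 50% of
# available host memory, so suspend/resume works out of the box.

def host_tier_bytes(gpu_quota_bytes: int, available_host_bytes: int) -> int:
    return min(gpu_quota_bytes, available_host_bytes // 2)
```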
Add a deadlock check in KVCacheV2Scheduler: if generation requests
are active but none could be scheduled or evicted, raise a clear
RuntimeError instead of spinning forever. This replaces silent hangs
with an actionable error message pointing to host cache or max_tokens
configuration.

Also remove pytest.skip from test cases that have been verified to
pass with V2 KV cache enabled by default.

Signed-off-by: Yi Zhang <yizhan@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
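The deadlock check described above can be sketched as a no-progress guard; the function and message wording are illustrative, not the actual KVCacheV2Scheduler code:

```python
# Fail loudly instead of spinning when generation requests exist but
# none could be scheduled or evicted this iteration.

def check_scheduling_progress(num_active_generation: int,
                              num_scheduled: int, num_evicted: int) -> None:
    if num_active_generation > 0 and num_scheduled == 0 and num_evicted == 0:
        raise RuntimeError(
            "V2 scheduler made no progress: no generation request could be "
            "scheduled or evicted. Consider enabling a host cache tier or "
            "lowering max_tokens.")

check_scheduling_progress(2, 1, 0)      # progress was made: no error
try:
    check_scheduling_progress(2, 0, 0)  # stuck: raises instead of hanging
    deadlock_detected = False
except RuntimeError:
    deadlock_detected = True
```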
_get_token_num_for_estimation() computes max_num_tokens_in_memory via
free_gpu_memory_fraction (float) * free_mem (int), producing a float
that propagates through floor-division and multiplication. This float
ends up in kv_cache_config.max_tokens, then KVCacheManagerV2._gpu_max_tokens,
causing min(int, float) to return float for max_seq_len. The float
max_seq_len eventually reaches the C++ attention() nanobind call as
attention_window_size, which expects int, triggering a TypeError.

Cast max_num_tokens_in_memory to int to ensure the token count stays
integral throughout the KV cache configuration pipeline.

Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
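The type drift described above reproduces in plain Python, since `min(int, float)` returns whichever value is smaller, type included. Variable names here are illustrative, not the real config fields:

```python
free_mem_tokens = 100_000                # int: tokens that fit in free memory
free_gpu_memory_fraction = 0.9           # float

max_num_tokens = free_gpu_memory_fraction * free_mem_tokens   # 90000.0 -- a float
max_seq_len = min(131_072, max_num_tokens)                    # min(int, float) -> 90000.0

assert isinstance(max_seq_len, float)    # this float later reaches a nanobind
                                         # call that requires int -> TypeError

# The fix: cast once at the source so the count stays integral throughout.
max_num_tokens = int(free_gpu_memory_fraction * free_mem_tokens)
max_seq_len = min(131_072, max_num_tokens)
assert isinstance(max_seq_len, int)
```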
@yizhang-nv
Member Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39916 [ run ] triggered by Bot. Commit: f08def5 Link to invocation
