[None][DO NOT REVIEW] Trigger CI Only #11463
Conversation
Force-pushed from 1de7c1a to 9c0eb21
/bot run --disable-fail-fast
PR_Github #35703 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #35725 [ run ] triggered by Bot. Commit:
PR_Github #35725 [ run ] completed with state
Force-pushed from c63b434 to 39d0d39
/bot run --disable-fail-fast
PR_Github #36446 [ run ] triggered by Bot. Commit:
PR_Github #36446 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #36493 [ run ] triggered by Bot. Commit:
Force-pushed from dbffd28 to fddeb28
/bot run --disable-fail-fast
1 similar comment
/bot run --disable-fail-fast
Force-pushed from fddeb28 to 2b7c494
/bot kill
PR_Github #36497 [ run ] triggered by Bot. Commit:
PR_Github #36499 [ run ] triggered by Bot. Commit:
PR_Github #36500 [ kill ] triggered by Bot. Commit:
PR_Github #36499 [ run ] completed with state
PR_Github #36500 [ kill ] completed with state
/bot run --disable-fail-fast
PR_Github #36501 [ run ] triggered by Bot. Commit:
PR_Github #36501 [ run ] completed with state
Force-pushed from 2b7c494 to 44b6a1a
/bot run --disable-fail-fast
PR_Github #36591 [ run ] triggered by Bot. Commit:
/bot run --help
/bot run --disable-fail-fast --post-merge
Un-skip tests that pass with v2 KV cache, remove stale waives, re-enable RTX Pro 6000 Nemotron tests, and add multimodal-aware block reuse token augmentation for KVCacheManagerV2.
Signed-off-by: Yi Zhang <yizhan@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
…ible speculative tests
Fix prepare_resources/rewind mismatch when CUDA graph padding extends draft tokens beyond what was allocated. Add _extend_kv_cache_for_padding hook so NGram and two-model drafters extend KV cache capacity after padding, matching the rewind amount computed by TorchSampler. Skip two-model eagle3 and pard tests that OOM with v2 KV cache (v2 does not support two-model budget splitting). Enable the speculative test suite on B200.
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
…2 KV cache
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
In _commit_block, when a partial block hits UselessBlockError against a full tree block, the rebase path incorrectly swaps the request's copy page with the shared tree page. Any subsequent writes by the request (e.g. during generation) then corrupt the tree page, breaking other active requests that share it. Add `and is_full` to the rebase condition so only full-to-full rebase is allowed. Partial blocks now fall through to VIRTUAL_STOP instead.
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
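The guard described above can be illustrated with a minimal sketch. All names here are hypothetical stand-ins; the real _commit_block logic is considerably more involved:

```python
def resolve_useless_block(block_is_full: bool, tree_block_is_full: bool) -> str:
    """Return the commit action for a block that raised UselessBlockError.

    Hedged sketch: before the fix, the condition lacked the
    `and block_is_full` clause, so a partial block could be rebased onto the
    shared tree page, and later writes by its owner would corrupt data still
    read by other requests sharing that page.
    """
    if tree_block_is_full and block_is_full:
        return "rebase"       # full-to-full: safe to share the tree page
    return "virtual_stop"     # partial block: fall through to VIRTUAL_STOP
```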
On-the-fly CUDA graph capture during generation can resize the shared cuda_graph_workspace tensor, invalidating addresses baked into previously captured graphs and causing illegal memory access on replay. This happens because create_cuda_graph_metadata() uses copy.copy(), so all CG metadata objects share the same cuda_graph_workspace tensor. When a later capture needs a larger workspace, resize_() changes the tensor address, but earlier graphs still reference the old address. Fix: disable CUDA graph capture by default. Only allow capture during the warmup phase via the new allow_capture() context manager. Uncaptured batch sizes fall back to eager execution instead of on-the-fly capture.
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
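The capture policy can be sketched as a toy state machine (class and string returns are illustrative assumptions, not the actual TRT-LLM API): capture is permitted only inside an allow_capture() context entered during warmup, and any uncaptured batch size seen afterwards runs eagerly rather than triggering a workspace-resizing capture.

```python
from contextlib import contextmanager

class GraphRunner:
    """Toy model of the capture-gating policy described above."""

    def __init__(self):
        self._capture_allowed = False
        self._graphs = {}  # batch_size -> captured "graph" handle

    @contextmanager
    def allow_capture(self):
        # Only the warmup phase enters this context, so workspace resizes can
        # no longer invalidate graphs that were captured earlier at runtime.
        self._capture_allowed = True
        try:
            yield
        finally:
            self._capture_allowed = False

    def run(self, batch_size: int) -> str:
        if batch_size in self._graphs:
            return f"replay graph bs={batch_size}"
        if self._capture_allowed:
            self._graphs[batch_size] = object()
            return f"capture graph bs={batch_size}"
        # Uncaptured batch size outside warmup: eager fallback, no capture.
        return f"eager bs={batch_size}"
```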
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Force-pushed from d90da6e to 6a06738
…ic in V2
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
/bot run --disable-fail-fast
PR_Github #39840 [ run ] triggered by Bot. Commit:
PR_Github #39840 [ run ] completed with state
_get_model_kv_cache_manager_cls() was added by PR NVIDIA#12242 but bypassed the V2→V1 fallback logic (beam width > 1, kv_connector, etc.). Move the fallback into that method so all callers get consistent behavior.
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
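A hedged sketch of the consolidated fallback: the function signature and condition names below are assumptions for illustration; the point is that the V2→V1 decision lives in one place so every caller sees the same result.

```python
def get_model_kv_cache_manager_cls(beam_width: int,
                                   has_kv_connector: bool,
                                   prefer_v2: bool = True) -> str:
    """Pick the KV cache manager class, applying the V2 -> V1 fallback.

    Sketch only: the real method returns classes, not names, and checks
    additional unsupported-feature conditions.
    """
    if not prefer_v2:
        return "KVCacheManager"       # V1 requested explicitly
    # Features V2 does not yet support force a fallback to V1; because the
    # check lives inside this method, all callers behave consistently.
    if beam_width > 1 or has_kv_connector:
        return "KVCacheManager"       # V1 fallback
    return "KVCacheManagerV2"
```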
/bot run --disable-fail-fast
Pass None instead of new_capacity to kv_cache.resize() during the context phase, allowing the cache to determine the appropriate capacity.
Signed-off-by: Yi Zhang <yizhan@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Remove stale waives for test_openai_completions_example and test_openai_misc_example on A10; these are no longer flaky.
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
/bot run --disable-fail-fast
PR_Github #39874 [ run ] triggered by Bot. Commit:
The V2 scheduler BudgetTracker doesn't account for PEFT pages occupied by ongoing generation requests, causing cache-full errors when context requests with different adapters are scheduled first.
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
The V2 KV cache manager resolves memory pressure that previously caused OOM on RTX Pro 6000D. All four tests (test_auto_dtype, test_auto_dtype_long_rope, test_fp4, test_fp8) verified passing on RTX 6000D.
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Two-phase scheduling: defer all context/encoder requests to phase 2 so generation requests' PEFT pages are fully committed before context requests compete for device cache space. Pre-claim PEFT pages for GENERATION_TO_COMPLETE requests whose adapters are still active on device but not yet released (mark_request_done runs after prepare_resources in the overlap executor's next iteration). Removes pytest.skip on test_llama_7b_multi_lora_evict_and_reload_lora_gpu_cache, which is now fixed by these changes.
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
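The ordering constraint above can be sketched as a toy two-phase pass (request representation and kind names are assumptions; the real scheduler also tracks budgets and pre-claims PEFT pages):

```python
def schedule_two_phase(requests):
    """Toy two-phase pass: generation requests first, then context/encoder.

    Hedged sketch of the idea only: committing generation requests' PEFT
    pages in phase 1 means context requests in phase 2 compete only for the
    device cache space that is genuinely free.
    """
    order = []
    # Phase 1: generation requests commit (or pre-claim) their PEFT pages.
    for req in requests:
        if req["kind"] == "generation":
            order.append(req["id"])
    # Phase 2: context/encoder requests are deferred until after phase 1.
    for req in requests:
        if req["kind"] in ("context", "encoder"):
            order.append(req["id"])
    return order
```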
Method was accidentally dropped during rebase in 4148a6b. Required by NGram and two-model drafters for CUDA graph padding KV cache extension after pad_draft_tokens_for_cuda_graph.
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Test verified to pass on a single H100 with 83.17% GSM8K accuracy.
Signed-off-by: Yi Zhang <yizhang@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Move the max_blocks_per_seq computation after max_seq_len clamping and include num_extra_kv_tokens + max_total_draft_tokens in the calculation. This ensures the host page-index buffer is large enough for the maximum capacity a single sequence can reach during warmup or normal operation. Previously, the draft V2 KV cache manager received a clamped max_seq_len that did not account for extra speculative decoding tokens, resulting in a max_blocks_per_seq that was too small. During warmup, draft_kv_cache.resize() would fail with "User-provided base page indices is too short" because the resize needed more blocks than the buffer could hold.
Signed-off-by: Yi Zhang <yizhan@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
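The sizing arithmetic reduces to one ceiling division. The function below is a hedged sketch with assumed parameter names; the fix is that the extra speculative tokens are included and max_seq_len has already been clamped before this runs:

```python
import math

def max_blocks_per_seq(max_seq_len: int, num_extra_kv_tokens: int,
                       max_total_draft_tokens: int,
                       tokens_per_block: int) -> int:
    """Size the host page-index buffer for the largest KV capacity a single
    sequence can reach (sketch; the real computation clamps max_seq_len
    first, which is the ordering fix described above)."""
    peak_tokens = max_seq_len + num_extra_kv_tokens + max_total_draft_tokens
    return math.ceil(peak_tokens / tokens_per_block)
```

Dropping the two extra terms is exactly the undersizing bug: a sequence that pads draft tokens past max_seq_len needs more page slots than the buffer holds.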
The V2 MAX_UTILIZATION scheduler relies on suspend/resume to evict and later restore KV cache pages when GPU memory is tight. Without a host cache tier, suspended pages have nowhere to be offloaded and resume() always fails, causing a scheduling deadlock where no generation request can ever make progress. Automatically provision a host tier matching the GPU quota (capped at 50% of available host memory) so suspend/resume works out of the box. This fixes the PARD speculative decoding test, which previously deadlocked with max_tokens=2048 and 3 concurrent requests. Also re-enable the test_pard unit test that was skipped due to this issue.
Signed-off-by: Yi Zhang <yizhan@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
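The provisioning rule reduces to a min with a 50% cap. Names below are assumptions for illustration:

```python
def host_tier_tokens(gpu_quota_tokens: int, available_host_bytes: int,
                     bytes_per_token: int) -> int:
    """Provision a host cache tier matching the GPU quota, capped at 50% of
    available host memory (hedged sketch of the sizing rule above)."""
    host_cap_tokens = (available_host_bytes // 2) // bytes_per_token
    return min(gpu_quota_tokens, host_cap_tokens)
```

With the cap in place, suspend() always has somewhere to offload pages, so resume() can succeed and the MAX_UTILIZATION scheduler never stalls for lack of a host tier.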
Add a deadlock check in KVCacheV2Scheduler: if generation requests are active but none could be scheduled or evicted, raise a clear RuntimeError instead of spinning forever. This replaces silent hangs with an actionable error message pointing to host cache or max_tokens configuration. Also remove pytest.skip from test cases that have been verified to pass with V2 KV cache enabled by default.
Signed-off-by: Yi Zhang <yizhan@nvidia.com>
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
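The shape of such a check is simple: detect "active but zero progress" and fail loudly. A minimal sketch, with all names assumed rather than taken from the actual scheduler:

```python
class SchedulingDeadlockError(RuntimeError):
    """Raised when active generation requests can make no progress."""

def check_scheduling_progress(num_active_generation: int,
                              num_scheduled: int,
                              num_evicted: int) -> None:
    """Raise instead of spinning when no generation request could be
    scheduled or evicted this iteration (sketch of the check above)."""
    if num_active_generation > 0 and num_scheduled == 0 and num_evicted == 0:
        raise SchedulingDeadlockError(
            "No generation request could be scheduled or evicted; "
            "consider enabling a host cache tier or lowering max_tokens.")
```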
_get_token_num_for_estimation() computes max_num_tokens_in_memory via free_gpu_memory_fraction (float) * free_mem (int), producing a float that propagates through floor division and multiplication. This float ends up in kv_cache_config.max_tokens, then KVCacheManagerV2._gpu_max_tokens, causing min(int, float) to return float for max_seq_len. The float max_seq_len eventually reaches the C++ attention() nanobind call as attention_window_size, which expects int, triggering a TypeError. Cast max_num_tokens_in_memory to int to ensure the token count stays integral throughout the KV cache configuration pipeline.
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
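The propagation is easy to reproduce in plain Python (the numbers below are illustrative, not the real memory figures): multiplying by a float fraction makes the whole budget a float, floor division preserves floatness, and min(int, float) returns the float.

```python
free_mem = 10_000_000           # int, bytes (illustrative value)
free_gpu_memory_fraction = 0.9  # float
bytes_per_token = 4096

# Buggy path: the float fraction contaminates every downstream value.
max_num_tokens_in_memory = free_gpu_memory_fraction * free_mem  # 9000000.0
budget = max_num_tokens_in_memory // bytes_per_token            # 2197.0, float
assert isinstance(min(4096, budget), float)   # min(int, float) -> float

# Fixed path: cast once at the source so everything stays integral.
max_num_tokens_in_memory = int(free_gpu_memory_fraction * free_mem)
budget = max_num_tokens_in_memory // bytes_per_token            # 2197, int
assert isinstance(min(4096, budget), int)
```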
/bot run --disable-fail-fast
PR_Github #39916 [ run ] triggered by Bot. Commit:
@coderabbitai summary
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update the tava architecture diagram if there is a significant design change in the PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] {run, kill, skip, reuse-pipeline} ...
Provides a user-friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.
--disable-fail-fast (OPTIONAL): Disable fail-fast on build/test/infra failures.
--skip-test (OPTIONAL): Skip all test stages, but still run build, package, and sanity-check stages. Note: does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL): Force-run the multi-GPU tests in addition to the L0 pre-merge pipeline.
--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline plus the specified test stages. Example: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL): Enable flushing all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
Kill all running builds associated with the pull request.
skip
skip --comment COMMENT
Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause top of tree to break.
reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause top of tree to break.