[TRTLLM-9526][feat] optimize host perf for python cache transceiver #12273
chuangz0 merged 6 commits into NVIDIA:main from
Conversation
46e84f8 to
a271356
Compare
📝 Walkthrough
This pull request systematically converts Python list-based data structures to NumPy arrays throughout the disaggregated cache transmission system, including updates to C++ bindings, base classes, implementations, serialization logic, and tests to support vectorized operations.
Changes
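Not part of the walkthrough itself, but a minimal sketch of what the list-to-NumPy conversion looks like in practice. The names `block_ids`, `base_ptr`, and `block_size` are hypothetical, not taken from the PR:

```python
import numpy as np

# Hypothetical block ids and layout; not the PR's actual data.
block_ids = [3, 7, 1, 12]
base_ptr, block_size = 0x10000, 4096

# List-based version: a per-element Python loop.
addrs_list = [base_ptr + b * block_size for b in block_ids]

# NumPy version: one vectorized expression over an int64 array,
# the direction this PR moves the cache-transceiver code.
ids = np.asarray(block_ids, dtype=np.int64)
addrs = base_ptr + ids * block_size

assert addrs.tolist() == addrs_list
```

The vectorized form avoids per-element Python interpreter overhead, which is where the host-side perf win comes from.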
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 1
🧹 Nitpick comments (3)
tensorrt_llm/_torch/disaggregation/base/agent.py (1)
41-48: Consider adding `strict=True` to `zip()` for safety.
The `zip()` call on line 45 silently truncates if `descs_or_addrs`, `sizes`, and `device_ids` have different lengths. While upstream code typically ensures equal lengths, adding `strict=True` (Python 3.10+) would catch mismatches early:
💡 Proposed fix
 def __init__(self, type, descs_or_addrs, sizes=None, device_ids=None):
     self.type = type
     if sizes is not None:
         self.descs = [
-            (int(a), int(s), int(d)) for a, s, d in zip(descs_or_addrs, sizes, device_ids)
+            (int(a), int(s), int(d)) for a, s, d in zip(descs_or_addrs, sizes, device_ids, strict=True)
         ]
     else:
         self.descs = descs_or_addrs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/disaggregation/base/agent.py` around lines 41-48: In the __init__ method of the agent (where self.descs is built), change the zip call that combines descs_or_addrs, sizes, and device_ids to use zip(..., strict=True) so mismatched input lengths raise immediately; update the tuple-building logic around the zip in __init__ (referencing the __init__ method and self.descs) to use strict=True to catch length mismatches early.

tensorrt_llm/_torch/disaggregation/native/transfer.py (1)
1640-1645: Minor: Consider using `.size` instead of `len()` for consistency.
While `len()` works on NumPy arrays, `.size` is the more idiomatic NumPy approach and would be consistent with the rest of this PR's changes.
💡 Proposed fix
 def _register_aux_buffer(self):
     aux_meta = self._aux_buffer.meta
-    ptr_num = len(aux_meta.ptrs)
+    ptr_num = aux_meta.ptrs.size
     ptr_descs = []
     for i in range(ptr_num):
🤖 Prompt for AI Agents
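For background on the suggestion (illustrative arrays, not the actual `aux_meta`): `len()` and `.size` agree on 1-D NumPy arrays, but diverge on multi-dimensional ones, where `len()` is only the first dimension:

```python
import numpy as np

# 1-D: len() and .size are the same number.
ptrs = np.asarray([0x1000, 0x2000, 0x3000], dtype=np.int64)
assert len(ptrs) == ptrs.size == 3

# Multi-dimensional: len() is the first dimension, .size the total count.
m = np.zeros((2, 4))
assert len(m) == 2
assert m.size == 8
```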
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/disaggregation/native/transfer.py` around lines 1640-1645: Replace the len() call with the NumPy .size property for consistency: change the assignment to ptr_num so it uses aux_meta.ptrs.size (referencing the local variable ptr_num and the attribute aux_meta.ptrs in transfer.py) and keep the rest of the loop building ptr_descs unchanged.

tests/unittest/disaggregated/test_kv_transfer.py (1)
444-473: Return type annotation is outdated.
The function now returns `List[np.ndarray]` (each element is `np.asarray(..., dtype=np.int64)`), but the type hint still declares `List[List[int]]`.
📝 Proposed fix for type annotation
 def get_block_ids_per_layer_groups(
     kv_cache_manager, transfer_worker, request_id: int, use_v2: bool, tokens_per_block: int
-) -> List[List[int]]:
+) -> List[np.ndarray]:
     """Get block_ids for each layer group with window_size filtering."""
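A brief illustration of why the annotation matters (hypothetical values): an ndarray and a list can be value-equal yet are not interchangeable types:

```python
import numpy as np
from typing import List

a = np.asarray([1, 2, 3], dtype=np.int64)  # what the function now returns
lst: List[int] = [1, 2, 3]                 # what the old hint promised

assert a.tolist() == lst        # same values
assert (a == lst).all()         # == is element-wise on ndarrays
assert not isinstance(a, list)  # isinstance(..., list) checks would now fail
```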
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/disaggregated/test_kv_transfer.py` around lines 444 - 473, The return type annotation for get_block_ids_per_layer_groups is outdated (it returns numpy arrays); update its signature to reflect List[np.ndarray] (or Sequence[np.ndarray] if you prefer immutability) instead of List[List[int]]. Locate get_block_ids_per_layer_groups and change the annotation and any related docstring/comment to List[np.ndarray]; ensure imports include numpy as np and typing.List is used consistently with np.ndarray. Verify callers of get_block_ids_per_layer_groups (e.g., uses of block_ids_per_layer_groups) still work with numpy arrays and adjust any type checks if necessary.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/tensorrt_llm/executor/cache_transmission/nixl_utils/agentBindings.cpp`:
- Around line 87-108: The lambda __init__ for kvc::MemoryDescs reads n elements
from addrs, sizes, and deviceIds but only uses addrs.shape(0) for n; add
explicit validation at the start of that lambda: compute n = addrs.shape(0) then
assert sizes.shape(0) == n and deviceIds.shape(0) == n and if not, throw a clear
exception (e.g., std::invalid_argument or nb::value_error) indicating mismatched
array lengths; keep the rest of the logic unchanged so you avoid out-of-bounds
reads when constructing descs.
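A Python sketch of the length check this comment asks for; the real fix belongs in the C++ nanobind lambda, and `validate_lengths` here is a hypothetical stand-in, not the binding code:

```python
import numpy as np

def validate_lengths(addrs, sizes, device_ids):
    # Take n from addrs, as the binding does, then reject mismatches
    # up front instead of reading out of bounds later.
    n = addrs.shape[0]
    if sizes.shape[0] != n or device_ids.shape[0] != n:
        raise ValueError(
            f"mismatched array lengths: addrs={n}, "
            f"sizes={sizes.shape[0]}, device_ids={device_ids.shape[0]}")
    return n

ok_n = validate_lengths(np.zeros(3), np.zeros(3), np.zeros(3))
try:
    validate_lengths(np.zeros(3), np.zeros(2), np.zeros(3))
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```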
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c636e8b5-bfb5-4228-a6c5-d26fe10c0491
📒 Files selected for processing (14)
cpp/tensorrt_llm/executor/cache_transmission/nixl_utils/agentBindings.cpp
tensorrt_llm/_torch/disaggregation/base/agent.py
tensorrt_llm/_torch/disaggregation/base/region.py
tensorrt_llm/_torch/disaggregation/base/transfer.py
tensorrt_llm/_torch/disaggregation/native/auxiliary.py
tensorrt_llm/_torch/disaggregation/native/mixers/attention/peer.py
tensorrt_llm/_torch/disaggregation/native/mixers/ssm/peer.py
tensorrt_llm/_torch/disaggregation/native/py_cache_transceiver.py
tensorrt_llm/_torch/disaggregation/native/transfer.py
tensorrt_llm/_torch/disaggregation/resource/kv_extractor.py
tests/unittest/disaggregated/region/test_block.py
tests/unittest/disaggregated/test_extractor.py
tests/unittest/disaggregated/test_kv_transfer.py
tests/unittest/disaggregated/test_kv_transfer_mp.py
/bot run
PR_Github #39355 [ run ] triggered by Bot. Commit:
PR_Github #39355 [ run ] completed with state
215aed1 to a271356 Compare
/bot run
PR_Github #39444 [ run ] triggered by Bot. Commit:
PR_Github #39444 [ run ] completed with state
31748f2 to 83d0a29 Compare
/bot run
PR_Github #39575 [ run ] triggered by Bot. Commit:
PR_Github #39575 [ run ] completed with state
83d0a29 to 225c455 Compare
/bot run
PR_Github #39678 [ run ] triggered by Bot. Commit:
PR_Github #39678 [ run ] completed with state
225c455 to 68aba48 Compare
/bot help
GitHub Bot Help
Provide a user-friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.
run
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request.
skip
Skip testing for latest commit on pull request.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
265b3eb to 93728ee Compare
eb1465f to 7062fcb Compare
PR_Github #40669 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #40675 [ run ] triggered by Bot. Commit:
PR_Github #40669 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #40679 [ run ] triggered by Bot. Commit:
PR_Github #40675 [ run ] completed with state
PR_Github #40679 [ run ] completed with state
98d4f41 to 62ed37c Compare
/bot run --stage-list "A30-PyTorch-1,A30-PyTorch-2, DGX_B200-8_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-2"
PR_Github #40841 [ run ] triggered by Bot. Commit:
/bot run --stage-list "A30-PyTorch-1, A30-PyTorch-2, DGX_B200-8_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-2"
PR_Github #40841 [ run ] completed with state
/bot run --stage-list "A30-PyTorch-1, DGX_B200-8_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-2"
PR_Github #40896 [ run ] triggered by Bot. Commit:
PR_Github #40896 [ run ] completed with state
Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
62ed37c to b220b62 Compare
/bot skip --comment "all test has passed"
PR_Github #41049 [ skip ] triggered by Bot. Commit:
PR_Github #41049 [ skip ] completed with state
…VIDIA#12273) Signed-off-by: Chuang Zhu <111838961+chuangz0@users.noreply.github.com>
Summary by CodeRabbit
Release Notes
Description
With this PR and #12490: optimize host performance for the Python KV transfer.
model and config:
opt1 (4e505e2)
opt2 (b10c86d)
KV transfer perf
E2E
Concurrency = 1127
Concurrency = 1229
If KV transmission speeds up, the batch size on the generation side grows faster and the number of iterations drops slightly, but TPOT increases.
config 2 deepseek
deepseek R1 ctx4_dep4_gen1_dep8
8k1k
Baseline is py_cache transceiver without host optimization
1. Output Token Throughput (tok/s)
Output throughput is essentially identical between the two configurations (within +/-0.8% noise).
2. User Token Throughput (tok/s)
kv transfer
config 3 deepseek
deepseek R1 ctx4_dep4_gen1_dep8
8k1k
Baseline is cpp cache transceiver
1. Output Token Throughput (tok/s)
2. User Token Throughput (tok/s)
3. TTFT(ms)
The C++ cache transceiver has better KV-transfer perf, but gets worse context forward perf.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.