[None][perf] executor: memcpy int32 token buffer into tle::Request ctor by hyukn · Pull Request #15211 · NVIDIA/TensorRT-LLM

hyukn · 2026-06-10T08:25:45Z

Description

RpcWorker.submit → BaseWorker._enqueue_request builds a tle::Request from prompt_token_ids on a thread that holds the GIL. nanobind casts the Python list[int] → std::vector<int32> element by element (one PyLong read per token), which is O(ISL) in the input sequence length. In the decode phase, where forward steps are short and each iteration is host-bound, this per-element cast on the submit thread stalls the PyExecutor loop on the GIL (~15.7 ms/iter observed in nsys on a disagg GEN worker).

This PR adds a buffer fast-path to the Request constructor binding: a 1-D contiguous int32 ndarray is memcpy'd into VecTokens (no per-element cast). list[int] still works via the default sequence cast, so the change is back-compatible. On the Python side, _enqueue_request passes prompt_token_ids straight through when it is already an ndarray.
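The eligibility rule described above (1-D, contiguous, int32) can be sketched on the Python side as follows. This is a minimal illustration, not the code in this PR: `is_fast_path_buffer` and `prepare_prompt_tokens` are hypothetical names, and converting a non-eligible ndarray to a list so that it takes the default sequence cast is an assumption about how a caller might handle that case.

```python
import numpy as np


def is_fast_path_buffer(tokens) -> bool:
    """Return True if tokens is a 1-D, C-contiguous int32 ndarray."""
    return (
        isinstance(tokens, np.ndarray)
        and tokens.ndim == 1
        and tokens.dtype == np.int32
        and bool(tokens.flags.c_contiguous)
    )


def prepare_prompt_tokens(prompt_token_ids):
    """Hypothetical helper: choose what to hand to the Request constructor.

    Eligible buffers are returned unchanged (same object), so the binding
    could memcpy them. Everything else becomes a plain list[int], which
    goes through the default per-element sequence cast.
    """
    if is_fast_path_buffer(prompt_token_ids):
        return prompt_token_ids
    if isinstance(prompt_token_ids, np.ndarray):
        # Assumption: fall back to a list for non-int32 or strided arrays.
        return prompt_token_ids.tolist()
    return list(prompt_token_ids)
```

Returning the eligible array unchanged, rather than copying it, keeps the Python side free of any per-token work; the only copy would be the memcpy inside the binding.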

Microbenchmark (numpy proxy; real bindings not built in dev env)

Per-submit token handling: the current list→vector cast is O(ISL) (≈4 µs at 128 tokens → ≈480 µs at 16k tokens); the int32 buffer → memcpy path is bandwidth-bound (low single-digit µs), giving ~12×–240× on that component. The fast path activates only when tokens reach the worker as a buffer (e.g. a bytes-serialized request on the wire); otherwise it is a no-op and back-compatible.
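A numpy proxy of this kind can be reproduced with a short script. The sketch below is an assumption about how such a proxy might look, not the script used for the numbers above: `np.array(list, dtype=np.int32)` stands in for the per-element list cast, `ndarray.copy()` stands in for the memcpy, and the helper names are hypothetical. Absolute timings depend on the machine.

```python
import timeit

import numpy as np


def cast_per_element(tokens: list) -> np.ndarray:
    # Proxy for list[int] -> std::vector<int32>: every element is a
    # Python int that has to be read and converted individually.
    return np.array(tokens, dtype=np.int32)


def copy_buffer(tokens: np.ndarray) -> np.ndarray:
    # Proxy for the memcpy fast path: one contiguous block copy.
    return tokens.copy()


def bench(isl: int, repeats: int = 200):
    """Return (list_cast_us, buffer_copy_us) per call for isl tokens."""
    rng = np.random.default_rng(0)
    buf = rng.integers(0, 32000, size=isl, dtype=np.int32)
    as_list = buf.tolist()
    t_list = timeit.timeit(lambda: cast_per_element(as_list), number=repeats)
    t_buf = timeit.timeit(lambda: copy_buffer(buf), number=repeats)
    return t_list / repeats * 1e6, t_buf / repeats * 1e6


if __name__ == "__main__":
    for isl in (128, 1024, 16384):
        list_us, buf_us = bench(isl)
        print(f"ISL={isl:6d}  list cast {list_us:8.2f} us  buffer copy {buf_us:6.2f} us")
```

Because this measures numpy rather than nanobind, it only indicates the shape of the scaling (linear in ISL for the list cast, nearly flat for the buffer copy), not the exact cost inside the real binding.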

Status / TODO (draft)

- C++ not yet built/verified in CI: authored against the exact 38-arg ctor signature; needs a build to confirm that the nanobind try_cast/ndarray usage compiles.
- On feat/deepseek_v4 alone the fast-path is dormant (tokens arrive as list); the win is realized when paired with bytes-on-the-wire request serialization (keeps tokens as ndarray end-to-end).
- Add NVTX split (copy / build / enqueue) to confirm the exact submit reduction in a real profile.

Test Coverage

TBD — back-compatible path is exercised by existing executor tests; buffer path needs a unit test passing an int32 ndarray to tllm.Request.
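One possible shape for the missing buffer-path test is sketched below. Since the real bindings are not exercised here, the check takes the constructor and a token accessor as parameters; `StandInRequest` is a hypothetical stand-in that only mimics the two input paths and is not the real `tllm.Request`. In an actual test, the real constructor and whatever accessor exposes its tokens would be passed instead.

```python
import numpy as np


def check_buffer_matches_list(make_request, get_tokens, isl: int = 64) -> None:
    """Both input forms must produce a request holding identical tokens."""
    tokens = list(range(isl))
    from_list = make_request(tokens)
    from_buf = make_request(np.asarray(tokens, dtype=np.int32))
    assert get_tokens(from_list) == tokens
    assert get_tokens(from_buf) == tokens


class StandInRequest:
    """Hypothetical stand-in for the binding; NOT the real tle::Request."""

    def __init__(self, input_token_ids):
        if isinstance(input_token_ids, np.ndarray):
            # Buffer path: accept only 1-D int32 arrays.
            if input_token_ids.ndim != 1 or input_token_ids.dtype != np.int32:
                raise TypeError("expected a 1-D int32 ndarray")
            self.tokens = np.ascontiguousarray(input_token_ids).tolist()
        else:
            # Sequence path: per-element conversion, as with list[int].
            self.tokens = [int(t) for t in input_token_ids]


if __name__ == "__main__":
    check_buffer_matches_list(StandInRequest, lambda r: r.tokens)
    print("buffer path and list path agree")
```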

🤖 Generated with Claude Code

RpcWorker.submit constructs a tle::Request from prompt_token_ids on a GIL-held thread; nanobind casts the Python list[int] -> std::vector<int32> element-by-element (one PyLong read per token). This is O(ISL) and, in the decode phase (short forward steps), the submit thread stalls the executor loop on the GIL (~15.7 ms/iter observed in nsys).

Add a buffer fast-path to the Request constructor: a 1-D contiguous int32 ndarray is memcpy'd into VecTokens (no per-element cast); list[int] still works via the default sequence cast, so the change is back-compatible. base_worker._enqueue_request passes prompt_token_ids straight through when it is already an ndarray.

This is the construction-path complement to NVIDIA#15134, which bytes-encodes Request pickling (the DP/TP broadcast). Here we target Request construction on the RpcWorker.submit / _enqueue_request path, which NVIDIA#15134 does not touch.

NOTE: the C++ change requires a rebuild to verify (not built in this env).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>

hyukn · 2026-06-10T09:06:09Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-10T09:13:14Z

PR_Github #53284 [ run ] triggered by Bot. Commit: 8fe0a80 Link to invocation

tensorrt-cicd · 2026-06-10T12:22:29Z

PR_Github #53284 [ run ] completed with state SUCCESS. Commit: 8fe0a80
/LLM/main/L0_MergeRequest_PR pipeline #42474 completed with status: 'SUCCESS'

CI Report

Link to invocation

github-actions Bot assigned hyukn Jun 10, 2026

hyukn wants to merge 1 commit into NVIDIA:feat/deepseek_v4 from hyukn:perf/request-ctor-int32-buffer

2 participants
