[None][perf] executor: memcpy int32 token buffer into tle::Request ctor#15211

Draft
hyukn wants to merge 1 commit into NVIDIA:feat/deepseek_v4 from hyukn:perf/request-ctor-int32-buffer

hyukn (Collaborator) commented Jun 10, 2026

Description

RpcWorker.submit → BaseWorker._enqueue_request builds a tle::Request from prompt_token_ids on a GIL-held thread. nanobind casts the Python list[int] → std::vector<int32> element by element (one PyLong read per token), which is O(ISL). In the decode phase (short forward steps, host-bound iteration), the submit thread's per-element cast stalls the PyExecutor loop on the GIL (~15.7 ms/iter observed in nsys on a disagg GEN worker).

This PR adds a buffer fast-path to the Request constructor binding: a 1-D contiguous int32 ndarray is memcpy'd into VecTokens (no per-element cast). list[int] still works via the default sequence cast, so the change is back-compatible. On the Python side, _enqueue_request passes prompt_token_ids straight through when it is already an ndarray.
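The eligibility rule described above (1-D, contiguous, int32) can be illustrated with a small stdlib-only sketch. It is not the binding code from this PR: `qualifies_for_buffer_fast_path` is a hypothetical helper, and `array.array("i")` stands in for an int32 ndarray, assuming a platform where a C `int` is 4 bytes. Anything that does not expose such a buffer, including a plain list, would fall back to the default sequence cast.

```python
from array import array


def qualifies_for_buffer_fast_path(tokens) -> bool:
    """Hypothetical check: does `tokens` expose a 1-D C-contiguous int32 buffer?"""
    try:
        view = memoryview(tokens)
    except TypeError:
        # Plain lists do not support the buffer protocol; they would take
        # the existing per-element sequence cast instead.
        return False
    return (
        view.ndim == 1
        and view.c_contiguous
        and view.itemsize == 4
        and view.format in ("i", "l")  # signed 4-byte int on common platforms
    )


if __name__ == "__main__":
    print(qualifies_for_buffer_fast_path([1, 2, 3]))              # list: slow path
    print(qualifies_for_buffer_fast_path(array("i", [1, 2, 3])))  # int32 buffer
    print(qualifies_for_buffer_fast_path(array("q", [1, 2, 3])))  # int64: rejected
```

A strided view such as `memoryview(array("i", range(8)))[::2]` is rejected as well, since a single memcpy cannot copy non-contiguous data.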

Microbenchmark (numpy proxy; real bindings not built in dev env)

Per-submit token handling: the current list→vector cast is O(ISL) (≈4 µs @128 → ≈480 µs @16k tokens); the fixed int32 buffer → memcpy path is bandwidth-bound (low single-digit µs), roughly 12×–240× faster on that component. The fast path activates when tokens reach the worker as a buffer (e.g. a bytes-serialized request on the wire); otherwise it is a no-op and back-compatible.
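For readers who want to reproduce the shape of this comparison without numpy or the real bindings, here is a rough stdlib proxy. It is not the benchmark used for the numbers above: `per_element_copy`, `buffer_copy`, and `compare` are illustrative names, `array("i", list)` approximates the per-token PyLong read, and `frombytes` approximates the bulk copy. Absolute timings will differ from the nanobind path.

```python
import timeit
from array import array


def per_element_copy(tokens: list) -> array:
    # Converts every Python int individually, like the list -> vector cast.
    return array("i", tokens)


def buffer_copy(raw: bytes) -> array:
    # Copies the packed int32 bytes in bulk, like the memcpy fast path.
    out = array("i")
    out.frombytes(raw)
    return out


def compare(isl: int, repeats: int = 200):
    """Return mean seconds per call for (per-element, buffer) at length `isl`."""
    tokens = list(range(isl))
    raw = array("i", tokens).tobytes()
    # Both paths must produce identical token buffers.
    assert per_element_copy(tokens) == buffer_copy(raw)
    t_list = timeit.timeit(lambda: per_element_copy(tokens), number=repeats) / repeats
    t_buf = timeit.timeit(lambda: buffer_copy(raw), number=repeats) / repeats
    return t_list, t_buf


if __name__ == "__main__":
    for isl in (128, 16384):
        t_list, t_buf = compare(isl)
        print(f"ISL={isl}: per-element {t_list * 1e6:.2f} us, buffer {t_buf * 1e6:.2f} us")
```

The per-element cost should grow roughly linearly with ISL, while the bulk copy stays comparatively flat at these sizes; the exact ratio depends on the machine and interpreter.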

Status / TODO (draft)

  • C++ not yet built/verified in CI — authored against the exact 38-arg ctor signature; needs a build to confirm nanobind try_cast/ndarray usage compiles.
  • On feat/deepseek_v4 alone the fast-path is dormant (tokens arrive as list); the win is realized when paired with bytes-on-the-wire request serialization (keeps tokens as ndarray end-to-end).
  • Add NVTX split (copy / build / enqueue) to confirm the exact submit reduction in a real profile.

Test Coverage

TBD — back-compatible path is exercised by existing executor tests; buffer path needs a unit test passing an int32 ndarray to tllm.Request.

🤖 Generated with Claude Code

RpcWorker.submit constructs a tle::Request from prompt_token_ids on a
GIL-held thread; nanobind casts the Python list[int] -> std::vector<int32>
element-by-element (one PyLong read per token). This is O(ISL) and, in the
decode phase (short forward steps), the submit thread stalls the executor
loop on the GIL (~15.7 ms/iter observed in nsys).

Add a buffer fast-path to the Request constructor: a 1-D contiguous int32
ndarray is memcpy'd into VecTokens (no per-element cast); list[int] still
works via the default sequence cast, so the change is back-compatible.
base_worker._enqueue_request passes prompt_token_ids straight through when
it is already an ndarray.

This is the construction-path complement to NVIDIA#15134, which bytes-encodes
Request pickling (the DP/TP broadcast). Here we target Request construction
on the RpcWorker.submit / _enqueue_request path, which NVIDIA#15134 does not touch.

NOTE: the C++ change requires a rebuild to verify (not built in this env).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Yukun He <23156053+hyukn@users.noreply.github.com>
hyukn (Collaborator, Author) commented Jun 10, 2026

/bot run --disable-fail-fast

tensorrt-cicd (Collaborator) commented

PR_Github #53284 [ run ] triggered by Bot. Commit: 8fe0a80 Link to invocation

tensorrt-cicd (Collaborator) commented

PR_Github #53284 [ run ] completed with state SUCCESS. Commit: 8fe0a80
/LLM/main/L0_MergeRequest_PR pipeline #42474 completed with status: 'SUCCESS'

CI Report

Link to invocation
