Skip to content

cuda: add per-session mutable state rebinding#20241

Open
mergennachin wants to merge 1 commit into
mainfrom
llm-pr-a-cuda-mutable-state
Open

cuda: add per-session mutable state rebinding#20241
mergennachin wants to merge 1 commit into
mainfrom
llm-pr-a-cuda-mutable-state

Conversation

@mergennachin

@mergennachin mergennachin commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Local agent serving needs to host multiple logical conversations on one CUDA-resident model without multiplying the model weights. Loading one AOTI module per conversation is not viable for large local models, while sharing the default mutable state across conversations would let KV/recurrent/conv buffers bleed between users.

This adds the CUDA-private foundation for separating those concerns: weights remain owned by the loaded AOTI container, while mutable buffer FQNs can be registered as per-session state and rebound before execution. The path is fail-closed and dormant until a model opts in by creating a mutable-state context and validating coverage, so existing CUDA models keep their current behavior.

The branch also wires the new source and fall-closed unit test into both Buck and CMake so the primitive can land independently before any model-specific engine consumes it.

#20001

Copilot AI review requested due to automatic review settings June 12, 2026 20:11
@pytorch-bot

pytorch-bot Bot commented Jun 12, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20241

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

❌ 142 Cancelled Jobs, 2 Pending, 2 Unclassified Failures

As of commit 4632963 with merge base d7ca5db (image):

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 12, 2026
@mergennachin mergennachin requested a review from Gasoonjia June 12, 2026 20:11
@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds CUDA-private infrastructure to support per-session rebinding of mutable model state (e.g., KV/conv/recurrent buffers) while keeping model weights shared across logical conversations, enabling multi-session local serving without duplicating large CUDA-resident weights.

Changes:

  • Introduces cuda_mutable_state context/session manager with descriptor capture and per-execute rebinding hooks.
  • Integrates mutable-state hooks into CudaBackend::init (handle noting) and CudaBackend::execute (rebind-before-run).
  • Adds a fail-closed unit test and wires it into both Buck and CMake builds.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
backends/cuda/runtime/cuda_mutable_state.h Declares CUDA-private context/session API for mutable-state rebinding.
backends/cuda/runtime/cuda_mutable_state.cpp Implements context/session tracking, template capture, buffer allocation, and rebind logic.
backends/cuda/runtime/cuda_backend.cpp Calls into mutable-state hooks during init and before execute.
backends/cuda/runtime/test/test_cuda_mutable_state.cpp Adds unit test covering fail-closed defaults.
backends/cuda/runtime/TARGETS Adds new source/header to CUDA backend and registers Buck unit test target.
backends/cuda/CMakeLists.txt Adds new source to CMake backend build and registers CMake unit test target.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread backends/cuda/runtime/cuda_mutable_state.cpp Outdated
Comment thread backends/cuda/runtime/cuda_mutable_state.cpp
Comment thread backends/cuda/runtime/cuda_mutable_state.cpp
@Gasoonjia

Copy link
Copy Markdown
Contributor

@claude review the PR

@claude

claude Bot commented Jun 12, 2026

Copy link
Copy Markdown

Claude finished @Gasoonjia's task in 1m 53s —— View job


Code Review: cuda — per-session mutable state rebinding

Reviewed all 6 changed files against main. This is a clean, well-isolated foundational primitive. The design is genuinely fail-closed: it stays dormant unless an engine opts in via create_context + begin_load/end_load + register_fqns + validate_coverage, and every error path returns a non-Ok Error rather than silently allocating or binding.

The three earlier Copilot comments are already resolved in the current revision, each with a matching regression test:

  • mutable_state_rebind_for_execute validates the token against c.sessions before allocating (cuda_mutable_state.cpp:495), covered by RebindRejectsUncreatedSessionToken.
  • mutable_state_begin_load guards against nested scopes and fails closed by poisoning build_error (:303), covered by NestedBeginLoadFailsClosed.
  • mutable_state_forget_handle exists and is called from CudaBackend::destroy (cuda_backend.cpp:840), covered by ForgetHandleDropsAssociation.

Nice work tightening those.


Findings

1. Shared per-FQN template/buffer can be under-sized if two methods declare the same mutable FQN with different shapescuda_mutable_state.cpp:165 / :216

The template (template_ptr/template_nbytes) and the session buffer are keyed by FQN and captured/sized once, from the first handle that discovers the FQN (:165). But the wrapper in ensure_bound uses this handle's Desc sizes/strides/dtype (:233-238) over that shared buffer. If a later method's container exposes the same FQN with a larger nbytes, the SlimTensor describes more bytes than were allocated → out-of-bounds device access / silent corruption, with no diagnostic.

Per-session KV/conv FQNs likely have consistent shapes across methods today, so this is latent rather than active, but it's exactly the kind of mismatch this fail-closed layer should catch. A cheap guard in ensure_bound would close it:

if (d.nbytes > c.template_nbytes[fqn]) {
  ET_LOG(Error,
      "mutable_state: descriptor '%s' (%zu B) exceeds shared template buffer (%zu B)",
      fqn.c_str(), d.nbytes, c.template_nbytes[fqn]);
  return Error::Internal;
}

Fix this →

2. mutable_state_destroy_context leaves stale thread-local active sessioncuda_mutable_state.cpp:275

If a thread has set_active(ctx, token) and the context is destroyed (e.g. teardown races, or a buggy engine), tl_active_ctx/tl_active_token stay set. This is not a memory-safety bug — the next rebind_for_execute fails closed at the handle_ctx lookup (the handle's association was erased during destroy → Error::Internal at :463). Worth confirming this is the intended contract; the header already states the engine owns lifecycle/serialization, so a one-line note that destroy does not reset other threads' active selection would help future callers.

3. Multi-session is serialized, not concurrent — make sure this is loud enoughcuda_mutable_state.h:84

Because the AOTI container is shared and update_user_managed_constant_buffer_pairs rebinds it globally, two threads executing different sessions on the same container would race on the container's constant pointers. The header does document "the engine must hold its serialization lock across set + execute + read-out," which is the right contract. This is the key architectural constraint for the consuming engine; flagging it here so reviewers of the future engine PR don't miss it. No change needed in this PR.


Minor / non-blocking

  • mutable_state_rebind_for_execute calls update_user_managed_constant_buffer_pairs on every execute (not just first bind). This is correct for true multi-session interleaving (the container may have been rebound by another session since), but adds a per-execute rebind cost even in the common single-active-session case. Fine for a foundation; a future optimization could skip the rebind when the container is already pointed at this session's buffers.
  • GPU allocation paths (build_descriptors, ensure_session_buffers, ensure_bound with non-empty templates) are not exercised by the unit test — the fake handle reports get_num_constants == 0. The fail-closed control flow is well covered; the actual clone/rebind logic will only get coverage once a real model or a fake with non-zero constants drives it. Reasonable for a primitive landing ahead of its consumer, but worth a follow-up.
  • Build wiring (CMake :242-250, Buck :142-156) looks correct and symmetric; the test is registered in both with CUDA_AVAILABLE=1.

Overall this is solid, defensive, and well-documented. Finding #1 is the only one I'd consider addressing before the consuming engine lands; the rest are confirmations/notes.
• Branch: llm-pr-a-cuda-mutable-state

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comment thread backends/cuda/runtime/cuda_mutable_state.cpp
Comment thread backends/cuda/runtime/cuda_mutable_state.cpp
Comment thread backends/cuda/runtime/cuda_mutable_state.cpp
Comment thread backends/cuda/runtime/test/test_cuda_mutable_state.cpp
@mergennachin mergennachin force-pushed the llm-pr-a-cuda-mutable-state branch from a9c517c to 0be0912 Compare June 12, 2026 20:50
@mergennachin

Copy link
Copy Markdown
Contributor Author

@claude Review the PR

Double check if it addressed previous feedbacks from Copilot and Claude

@claude

claude Bot commented Jun 12, 2026

Copy link
Copy Markdown

Claude finished @mergennachin's task in 1m 19s —— View job


Re-review: did the PR address prior Copilot & Claude feedback?

Checked all 6 changed files at 0be0912 against every prior review comment. Verdict: all 7 prior findings are resolved, each with a fail-closed code path and (where applicable) a matching regression test.

Copilot round 1 (2026-06-12T20:16)

  • Implicit session_buf[token] allocation for uncreated tokensrebind_for_execute now validates c.sessions.find(token) and returns InvalidArgument before any allocation (cuda_mutable_state.cpp:509-512). Test: RebindRejectsUncreatedSessionToken.
  • begin_load overwrites tl_loading_ctx unconditionally (nesting) → guarded; nested entry poisons both contexts' build_error = InvalidState, logs, and clears the thread-local (:314-329). Test: NestedBeginLoadFailsClosed.
  • No way to unregister a destroyed handlemutable_state_forget_handle added (:457-470), called from CudaBackend::destroy (cuda_backend.cpp:840). Test: ForgetHandleDropsAssociation.

Copilot round 2 (2026-06-12T20:44)

  • Rebinding under CUDA graph capture/replay silently no-ops → rejected with NotSupported when cuda_graph_state.phase != Disabled (:513-518). Test: RebindRejectsCudaGraphHandle (the test Copilot explicitly asked for).
  • create_session succeeds despite earlier build_error → now returns build_error first (:402-404), matching validate_coverage.
  • available() ignores build_error → now folds it in: build_error == Error::Ok && symbols_available (:355-356).
  • Missing CUDA-graph regression test → added (see above).

Claude round 1 (2026-06-12T20:38) — Finding #1 (the only one flagged as worth fixing pre-merge)

  • Shared per-FQN buffer under-sized if a later method declares the same FQN with a larger nbytesensure_bound now guards d.nbytes > template_nbytes[fqn] (and missing-template), returning Error::Internal with a diagnostic before wrapping the buffer (:232-243). This closes the out-of-bounds GPU access path.

Findings #2 (stale thread-local active session after destroy) and #3 (multi-session serialization contract) were notes/confirmations, not change requests — both remain correctly fail-closed at the handle_ctx/build_error lookups.

Build wiring

  • CMake: source CMakeLists.txt:188, test target :246-249 with CUDA_AVAILABLE=1.
  • Buck: source TARGETS:109/113, test target :143-145. Symmetric with CMake.

Remaining observations (non-blocking, no prior comment)

  1. cudaMalloc template leak on partial build_descriptors failure:165-181: each FQN's template is cudaMalloc'd and recorded in c.template_ptr as the loop proceeds. If a later FQN's cudaMalloc/cudaMemcpy fails and returns Error::Internal, the already-allocated earlier templates are only freed later by mutable_state_destroy_context. Since the engine is contractually required to destroy the context on a non-Ok build, this is reclaimed — but worth a one-line note that cleanup is deferred to destroy_context, not immediate.

  2. GPU clone/rebind logic still has no positive-path coverage — the fake handle reports get_num_constants == 0, so build_descriptors/ensure_session_buffers/ensure_bound (including the new size guard at :232-243) never execute their allocation bodies in tests. The fail-closed control flow is thoroughly covered; the actual copy-and-bind path will only get exercised once a real model or a fake with non-zero constants drives it. Reasonable for a primitive landing ahead of its consumer — flagging for a follow-up. Add coverage →

Summary: every actionable item from both Copilot rounds and Claude's blocking Finding #1 is addressed in code with corresponding tests. No regressions introduced. The two observations above are follow-ups, not merge blockers.
• Branch: llm-pr-a-cuda-mutable-state

@mergennachin mergennachin requested a review from shoumikhin June 12, 2026 20:53
Copilot AI review requested due to automatic review settings June 12, 2026 21:00
@mergennachin

Copy link
Copy Markdown
Contributor Author

@claude Review again

@claude

claude Bot commented Jun 12, 2026

Copy link
Copy Markdown

Claude finished @mergennachin's task in 1m 33s —— View job


Re-review: cuda — per-session mutable state rebinding

Re-read all 6 files at the current tip (b0ec6fc). Everything from both Copilot rounds, Claude's blocking Finding #1, and both of my prior follow-up observations is now resolved in code with matching tests. No regressions, no new blocking issues.

What changed since my last review

The two items I'd flagged as follow-ups are now closed:

  • Positive-path GPU coverageCapturesClonesAndRebindsDeviceBuffer (test_cuda_mutable_state.cpp:308) drives a real FakeContainer with a non-zero constant, so build_descriptorsensure_session_buffersensure_boundupdate_user_managed_constant_buffer_pairs all execute their allocation/clone bodies. It asserts the bound buffer is a distinct device pointer from the source, has the right nbytes, and contains the cloned contents (D2H copy compared to {1,2,3,4}). This is exactly the clone-and-rebind path that previously had no coverage.
  • Size-guard regression testRebindRejectsLargerDescriptorForSharedFqn (:365) registers the same FQN across two handles where the second declares a larger nbytes, and confirms rebind_for_execute returns Error::Internal with update_calls == 0. This locks in the d.nbytes > template_nbytes[fqn] guard at cuda_mutable_state.cpp:232-243 (Claude Finding Add support for quantized LeakyReLU #1).

Both are CUDA-gated via cuda_device_available() / GTEST_SKIP, so they're real coverage on CUDA CI and harmless elsewhere.

Prior findings — still confirmed resolved

  • Copilot R1: uncreated-token allocation (:509), nested begin_load (:314), forget_handle from destroy (:457, cuda_backend.cpp:840).
  • Copilot R2: CUDA-graph rebind rejected NotSupported (:513); create_session honors build_error (:402); available() folds in build_error (:355).
  • Claude R1 Add support for quantized LeakyReLU #1: shared-FQN under-size guard (:232) — now also test-locked.

Cross-check on execute ordering

The mutable_state_rebind_for_execute call sits at cuda_backend.cpp:551, before the CudaGraphPhase::Replay branch (:556). Since rebind returns NotSupported whenever phase != Disabled, an active session on a graph-captured handle fails closed before replay — consistent with the design. ✅

Remaining non-blocking notes (carried over, no action required for this PR)

  1. Deferred template cleanupbuild_descriptors (:165-181) cudaMallocs per-FQN templates as it goes; on a later-FQN failure the earlier templates are reclaimed only by mutable_state_destroy_context, not immediately. The header contract requires the engine to destroy the context on a non-Ok build, so memory is reclaimed — just worth a one-line comment that cleanup is deferred.
  2. Serialized multi-session — the shared AOTI container means concurrent sessions on one container would race on constant pointers; the header documents that the engine must hold its serialization lock across set+execute+read-out. This is the key constraint for the future consuming engine PR, not this one.

Verdict: This foundational primitive is fail-closed, well-documented, and now has both the control-flow and the allocation/clone paths under test. LGTM to land ahead of its consumer.
• Branch: llm-pr-a-cuda-mutable-state

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Comment thread backends/cuda/runtime/cuda_mutable_state.cpp
Comment thread backends/cuda/runtime/cuda_mutable_state.cpp Outdated
Comment thread backends/cuda/runtime/TARGETS
Comment thread backends/cuda/runtime/TARGETS
@Gasoonjia

Copy link
Copy Markdown
Contributor

Thanks Mergen for this! Looks good.

I've one question regarding kv cache sharing across methods: like prefill and decode are separate methods (each its own AOTI container / CudaDelegateHandle) exporting the same KV-cache FQN: since session_buf is keyed (token → fqn → buffer) and ensure_bound points every handle's wrapper at the same session_buf[token][fqn], can you confirm the intended behavior is that, under one active session token, both handles rebind to the same GPU buffer so decode continues exactly what prefill wrote? Or is the V1 primitive only meant for single-method per-session isolation? Any test you can add to guard the target behavoir?

Another thing im worried about whether the session support impact cuda graph inference, since cuda graph inference needs to capture a static path but now the path may not be static since we are gonna change session, maybe we should move the cuda graph infortaion from CudaBackendHandle to cuda_mutable_state.

Local agent serving needs to host multiple logical conversations on one CUDA-resident model without multiplying the model weights. Loading one AOTI module per conversation is not viable for large local models, while sharing the default mutable state across conversations would let KV/recurrent/conv buffers bleed between users.

This adds the CUDA-private foundation for separating those concerns: weights remain owned by the loaded AOTI container, while mutable buffer FQNs can be registered as per-session state and rebound before execution. The path is fail-closed and dormant until a model opts in by creating a mutable-state context and validating coverage, so existing CUDA models keep their current behavior.

The branch also wires the new source and unit coverage into both Buck and CMake so the primitive can land independently before any model-specific engine consumes it.
Copilot AI review requested due to automatic review settings June 12, 2026 21:33
@mergennachin mergennachin force-pushed the llm-pr-a-cuda-mutable-state branch from 4344f00 to de9d3cd Compare June 12, 2026 21:33

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Comment on lines +174 to +179
// A successful call may still report an unusable (null/empty) name --
// that's a method-scoped constant, not an error: skip it (another container
// owns it). A non-OK return code above is a real failure and falls closed.
if (internal && fqn && fqn[0] != '\0') {
fqn_to_internal[fqn] = internal;
}
Comment on lines +213 to +224
void* tpl = nullptr;
if (cudaMalloc(&tpl, t->nbytes()) != cudaSuccess) {
ET_LOG(Error, "mutable_state: cudaMalloc template '%s'", fqn.c_str());
return Error::Internal;
}
if (cudaMemcpy(
tpl, t->data_ptr(), t->nbytes(), cudaMemcpyDeviceToDevice) !=
cudaSuccess) {
ET_LOG(Error, "mutable_state: cudaMemcpy template '%s'", fqn.c_str());
cudaFree(tpl);
return Error::Internal;
}
Comment on lines +243 to +256
void* tpl = kv.second;
size_t nbytes = c.template_nbytes[fqn];
void* p = nullptr;
if (cudaMalloc(&p, nbytes) != cudaSuccess) {
ET_LOG(
Error, "mutable_state: cudaMalloc session buffer '%s'", fqn.c_str());
return Error::Internal;
}
if (cudaMemcpy(p, tpl, nbytes, cudaMemcpyDeviceToDevice) != cudaSuccess) {
ET_LOG(
Error, "mutable_state: cudaMemcpy session buffer '%s'", fqn.c_str());
cudaFree(p);
return Error::Internal;
}
Comment on lines +308 to +317
void free_session_buffers(Context& c, int token) {
auto it = c.session_buf.find(token);
if (it != c.session_buf.end()) {
for (auto& kv : it->second) {
if (kv.second) {
cudaFree(kv.second);
}
}
c.session_buf.erase(it);
}
Comment on lines +341 to +353
Context& c = it->second;
for (auto& kv : c.template_ptr) {
if (kv.second) {
cudaFree(kv.second);
}
}
for (auto& sb : c.session_buf) {
for (auto& kv : sb.second) {
if (kv.second) {
cudaFree(kv.second);
}
}
}
}

void mutable_state_set_active(MutableStateContext ctx, int token) {
tl_active_ctx = ctx;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The active session is stored in a thread-local that is set before execute but never cleared afterward. If a reset is ever missed (an exception, an early return, or another execute on the same thread), the next run rebinds the container to the previous session's buffers. When that next run is the same model there is no error, and one conversation's state silently leaks into another's. Don't rely on the caller resetting it: add an RAII guard that sets the active session in its constructor and clears it in its destructor, and reset it right after the run in execute. The same applies to begin_load/end_load: if load_method throws before end_load, the leftover loading state poisons the next unrelated load on that thread, so wrap that in an RAII scope too.


Error mutable_state_rebind_for_execute(CudaDelegateHandle* handle) {
if (tl_active_token == kNoMutableSession) {
return Error::Ok; // single-session / legacy: nothing to rebind

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebinding changes the shared container in place and the change persists. Once a session has run, the container keeps pointing at that session's buffers. A later run with no active session hits this early return, does nothing, and silently executes on the last session's buffers. So mixing session and non-session runs on the same model corrupts both. Either assert that once a model uses sessions every run must select a session, or rebind back to the model's default buffers in this no-session path.

}
auto& m = mgr();
std::lock_guard<std::mutex> g(m.mu);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This holds one process-wide lock across the external rebind call and the GPU allocations below. That forces every model in the process to take turns on the hottest part of execute, and it can stall everything if the rebind call blocks or re-enters. Use a per-context lock instead, and copy out the values you need (the pairs, container handle, function pointer) under the lock, then release it before making the rebind call and the allocations.

}

void mutable_state_destroy_session(MutableStateContext ctx, int token) {
auto& m = mgr();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This frees the session's GPU buffers right away, but a run that used them may still be finishing on the GPU since some methods return without waiting on the stream. Freeing memory the GPU is still writing is a use-after-free and can crash or corrupt data. Wait on the relevant stream or event before freeing, or require the caller to ensure all runs are done first and assert it. The same applies to destroy_context.

}

Error fake_container_update_user_managed_pairs(
aoti::AOTInductorModelContainerHandle container,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fake only records its arguments and never stores the bound buffer, so the GPU tests cannot prove a rebind actually took effect or that a second session's rebind replaces the first. There is also no test for the main guarantee that two sessions stay isolated. Please make this fake remember the bound buffer per name, then add a GPU test that rebinds session A and writes a pattern, rebinds session B and writes a different pattern, rebinds A again, and asserts A's buffer is the same one and still holds A's data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/cuda CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants