cuda: add per-session mutable state rebinding by mergennachin · Pull Request #20241 · pytorch/executorch

mergennachin · 2026-06-12T20:11:32Z

Local agent serving needs to host multiple logical conversations on one CUDA-resident model without multiplying the model weights. Loading one AOTI module per conversation is not viable for large local models, while sharing the default mutable state across conversations would let KV/recurrent/conv buffers bleed between users.

This adds the CUDA-private foundation for separating those concerns: weights remain owned by the loaded AOTI container, while mutable buffer FQNs can be registered as per-session state and rebound before execution. The path is fail-closed and dormant until a model opts in by creating a mutable-state context and validating coverage, so existing CUDA models keep their current behavior.

The branch also wires the new source and fall-closed unit test into both Buck and CMake so the primitive can land independently before any model-specific engine consumes it.

#20001

pytorch-bot · 2026-06-12T20:11:36Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20241

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

[ROCm] MI350 CI jobs will have longer queue times due to CI migration

❌ 142 Cancelled Jobs, 2 Pending, 2 Unclassified Failures

As of commit 4632963 with merge base d7ca5db ():

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

Test CUDA Builds (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Test CUDA Windows Export and E2E (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)

CANCELLED JOBS - The following jobs were cancelled. Please retry:

Build documentation / build (buck2) / Build doc (gh)
##[error]The operation was canceled.
Build Presets / apple (ios-simulator) / build (gh)
##[error]The operation was canceled.
Build Presets / apple (ios) / build (gh)
##[error]The operation was canceled.
Build Presets / apple (llm) / build (gh)
##[error]The operation was canceled.
Build Presets / apple (macos) / build (gh)
##[error]The operation was canceled.
Build Presets / apple (profiling) / build (gh)
##[error]The operation was canceled.
Build Presets / apple (pybind) / build (gh)
##[error]The operation was canceled.
Build Presets / linux (linux, linux.2xlarge, executorch-ubuntu-22.04-clang12) / build (gh)
##[error]The operation was canceled.
Build Presets / linux (linux, linux.arm64.2xlarge, executorch-ubuntu-22.04-gcc11-aarch64) / build (gh)
##[error]The operation was canceled.
Build Presets / linux (llm, linux.2xlarge, executorch-ubuntu-22.04-clang12) / build (gh)
##[error]The operation was canceled.
Build Presets / linux (llm, linux.arm64.2xlarge, executorch-ubuntu-22.04-gcc11-aarch64) / build (gh)
##[error]The operation was canceled.
Build Presets / linux (pybind, linux.2xlarge, executorch-ubuntu-22.04-clang12) / build (gh)
##[error]The operation was canceled.
Build Presets / linux (pybind, linux.arm64.2xlarge, executorch-ubuntu-22.04-gcc11-aarch64) / build (gh)
##[error]The operation was canceled.
Build Presets / windows (pybind) / build (gh)
Build Presets / windows (windows) / build (gh)
Build Presets / zephyr (zephyr) / build (gh)
##[error]The operation was canceled.
Cadence Build & Test / cpu-build / build (gh)
##[error]The operation was canceled.
Cadence Build & Test / hifi-build / hifi4 (gh)
##[error]The operation was canceled.
pull / android / build-android (gh)
##[error]The operation was canceled.
pull / android / run-emulator (gh)
pull / nxp-build-test / linux-job (gh)
##[error]The operation was canceled.
pull / test-arm-backend-no-driver (test_pytest_models_tosa) / linux-job (gh)
##[error]The operation was canceled.
pull / test-arm-backend-no-driver (test_pytest_ops_no_target) / linux-job (gh)
##[error]The operation was canceled.
pull / test-arm-backend-no-driver (test_pytest_ops_tosa) / linux-job (gh)
##[error]The operation was canceled.
pull / test-arm-backend-no-driver (test_run_tosa) / linux-job (gh)
##[error]The operation was canceled.
pull / test-arm-cortex-m-size-test (bare_metal) / linux-job (gh)
##[error]The operation was canceled.
pull / test-arm-cortex-m-size-test (zephyr-preset) / linux-job (gh)
##[error]The operation was canceled.
pull / test-binary-size-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-binary-size-linux-gcc / linux-job (gh)
##[error]The operation was canceled.
pull / test-build-wasm-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-custom-ops-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-eval_llama-wikitext-linux / linux-job (gh)
pull / test-llama_runner_eager-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-llama-runner-linux (bf16, custom, linux.2xlarge, executorch-ubuntu-22.04-clang12) / linux-job (gh)
##[error]The operation was canceled.
pull / test-llama-runner-linux (fp32, xnnpack+custom+qe, linux.2xlarge, executorch-ubuntu-22.04-clang12) / linux-job (gh)
##[error]The operation was canceled.
pull / test-llama-runner-linux (fp32, xnnpack+custom+qe, linux.arm64.2xlarge, executorch-ubuntu-22.04-gc... / linux-job (gh)
##[error]The operation was canceled.
pull / test-llama-runner-linux (fp32, xnnpack+custom+quantize_kv, linux.2xlarge, executorch-ubuntu-22.04... / linux-job (gh)
##[error]The operation was canceled.
pull / test-llama-runner-linux (fp32, xnnpack+custom+quantize_kv, linux.arm64.2xlarge, executorch-ubuntu... / linux-job (gh)
##[error]The operation was canceled.
pull / test-llama-runner-linux (fp32, xnnpack+quantize_kv, linux.2xlarge, executorch-ubuntu-22.04-clang12) / linux-job (gh)
##[error]The operation was canceled.
pull / test-llama-runner-linux (fp32, xnnpack+quantize_kv, linux.arm64.2xlarge, executorch-ubuntu-22.04-... / linux-job (gh)
##[error]The operation was canceled.
pull / test-llama-runner-linux-android / linux-job (gh)
##[error]The operation was canceled.
pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh)
##[error]The operation was canceled.
pull / test-llama-runner-qnn-linux (fp32, qnn_8a8w, qnn) / linux-job (gh)
##[error]The operation was canceled.
pull / test-lora-linux / linux-job (gh)
pull / test-lora-multimethod-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-mcu-cortex-m-backend / linux-job (gh)
pull / test-mediatek-models-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (add_mul, portable, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (add_mul, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (add, portable, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (add, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (emformer_join, portable, linux.4xlarge.memory) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (emformer_join, xnnpack-quantization-delegation, linux.4xlarge.memory) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (emformer_transcribe, portable, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (emformer_transcribe, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (ic3, portable, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (ic3, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (ic4, portable, linux.4xlarge.memory) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (ic4, xnnpack-quantization-delegation, linux.4xlarge.memory) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (linear, portable, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (linear, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (llama3_2_vision_encoder, portable, linux.4xlarge.memory) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (mobilebert, portable, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (mobilebert, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (mv2, portable, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (mv2, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (phi_4_mini, portable, linux.4xlarge.memory) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (resnet18, portable, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (resnet18, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (resnet50, portable, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (resnet50, xnnpack-quantization-delegation, linux.2xlarge) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux (w2l, portable, linux.4xlarge.memory) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux-basic (mv3, portable, buck2, linux.2xlarge, executorch-ubuntu-22.04-clang12) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux-basic (mv3, portable, cmake, linux.2xlarge, executorch-ubuntu-22.04-clang12) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux-basic (mv3, portable, cmake, linux.arm64.2xlarge, executorch-ubuntu-22.04-gcc11... / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux-basic (mv3, xnnpack-quantization-delegation, buck2, linux.2xlarge, executorch-u... / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux-basic (mv3, xnnpack-quantization-delegation, cmake, linux.2xlarge, executorch-u... / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux-basic (mv3, xnnpack-quantization-delegation, cmake, linux.arm64.2xlarge, execut... / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux-basic (vit, portable, buck2, linux.2xlarge, executorch-ubuntu-22.04-clang12) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux-basic (vit, portable, cmake, linux.2xlarge, executorch-ubuntu-22.04-clang12) / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux-basic (vit, portable, cmake, linux.arm64.2xlarge, executorch-ubuntu-22.04-gcc11... / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux-basic (vit, xnnpack-quantization-delegation, buck2, linux.2xlarge, executorch-u... / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux-basic (vit, xnnpack-quantization-delegation, cmake, linux.2xlarge, executorch-u... / linux-job (gh)
##[error]The operation was canceled.
pull / test-models-linux-basic (vit, xnnpack-quantization-delegation, cmake, linux.arm64.2xlarge, execut... / linux-job (gh)
##[error]The operation was canceled.
pull / test-moshi-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-multimodal-linux (gemma3-4b) / linux-job (gh)
##[error]The operation was canceled.
pull / test-openvino-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-parakeet-xnnpack-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-phi-3-mini-runner-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-qnn-buck-build-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-qnn-delegate-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-qnn-direct-build-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-qnn-models-linux (dl3) / linux-job (gh)
##[error]The operation was canceled.
pull / test-qnn-models-linux (mv2) / linux-job (gh)
##[error]The operation was canceled.
pull / test-qnn-models-linux (mv3) / linux-job (gh)
##[error]The operation was canceled.
pull / test-qnn-passes-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-qnn-python-imports-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-qnn-testsuite-linux / package-golden-artifacts (gh)
pull / test-qnn-testsuite-linux / test-backend-linux (qnn, models) / linux-job (gh)
##[error]The operation was canceled.
pull / test-qnn-testsuite-linux / test-backend-linux (qnn, operators) / linux-job (gh)
##[error]The operation was canceled.
pull / test-qnn-wheel-packages-linux (3.10) / linux-job (gh)
##[error]The operation was canceled.
pull / test-qnn-wheel-packages-linux (3.11) / linux-job (gh)
##[error]The operation was canceled.
pull / test-qnn-wheel-packages-linux (3.12) / linux-job (gh)
##[error]The operation was canceled.
pull / test-qnn-wheel-packages-linux (3.13) / linux-job (gh)
##[error]The operation was canceled.
pull / test-quantized-aot-lib-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-samsung-models-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-samsung-quantmodels-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-selective-build-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-setup-linux-gcc / linux-job (gh)
##[error]The operation was canceled.
pull / test-sqnr-static-llm-qnn-linux (smollm2_135m) / linux-job (gh)
##[error]The operation was canceled.
pull / test-static-llama-qnn-linux (stories_110m) / linux-job (gh)
##[error]The operation was canceled.
pull / test-static-llama-qnn-linux (stories_260k_bc) / linux-job (gh)
##[error]The operation was canceled.
pull / test-voxtral-realtime-xnnpack-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-vulkan-models-linux / linux-job (gh)
##[error]The operation was canceled.
pull / test-vulkan-operators-linux / linux-job (gh)
##[error]The operation was canceled.
pull / unittest / linux / linux-job (gh)
pull / unittest / macos / macos-job (gh)
pull / unittest / windows / windows-job (gh)
pull / unittest-buck / linux / linux-job (gh)
pull / unittest-buck / macos / macos-job (gh)
pull / unittest-editable / linux / linux-job (gh)
pull / unittest-editable / macos / macos-job (gh)
pull / unittest-editable / windows / windows-job (gh)
pull / unittest-nxp-neutron / linux-job (gh)
##[error]The operation was canceled.
pull / unittest-wasm-bindings (--enable-etdump) / linux-job (gh)
##[error]The operation was canceled.
Test ARM Backend / test-arm / package-golden-artifacts (gh)
Test ARM Backend / test-arm / test-backend-linux (arm_tosa_fp, models) / linux-job (gh)
##[error]The operation was canceled.
Test ARM Backend / test-arm / test-backend-linux (arm_tosa_fp, operators) / linux-job (gh)
##[error]The operation was canceled.
Test ARM Backend / test-arm / test-backend-linux (arm_vgf_fp, models) / linux-job (gh)
##[error]The operation was canceled.
Test ARM Backend / test-arm / test-backend-linux (arm_vgf_fp, operators) / linux-job (gh)
##[error]The operation was canceled.
Test QNN Backend / test-qnn / package-golden-artifacts (gh)
Test QNN Backend / test-qnn / test-backend-linux (qnn, models) / linux-job (gh)
##[error]The operation was canceled.
Test QNN Backend / test-qnn / test-backend-linux (qnn, operators) / linux-job (gh)
Test Vulkan Backend / test-vulkan / package-golden-artifacts (gh)
Test Vulkan Backend / test-vulkan / test-backend-linux (vulkan, models) / linux-job (gh)
##[error]The operation was canceled.
Test Vulkan Backend / test-vulkan / test-backend-linux (vulkan, operators) / linux-job (gh)
##[error]The operation was canceled.
Test WebGPU Backend / test-webgpu / package-golden-artifacts (gh)
Test WebGPU Backend / test-webgpu / test-backend-linux (webgpu, models) / linux-job (gh)
##[error]The operation was canceled.
Test WebGPU Backend / test-webgpu / test-backend-linux (webgpu, operators) / linux-job (gh)
##[error]The operation was canceled.
Test XNNPACK Backend / test-xnnpack / package-golden-artifacts (gh)
Test XNNPACK Backend / test-xnnpack / test-backend-linux (xnnpack, models) / linux-job (gh)
##[error]The operation was canceled.
Test XNNPACK Backend / test-xnnpack / test-backend-linux (xnnpack, operators) / linux-job (gh)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-12T20:12:24Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot

Pull request overview

Adds CUDA-private infrastructure to support per-session rebinding of mutable model state (e.g., KV/conv/recurrent buffers) while keeping model weights shared across logical conversations, enabling multi-session local serving without duplicating large CUDA-resident weights.

Changes:

Introduces cuda_mutable_state context/session manager with descriptor capture and per-execute rebinding hooks.
Integrates mutable-state hooks into CudaBackend::init (handle noting) and CudaBackend::execute (rebind-before-run).
Adds a fail-closed unit test and wires it into both Buck and CMake builds.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
backends/cuda/runtime/cuda_mutable_state.h	Declares CUDA-private context/session API for mutable-state rebinding.
backends/cuda/runtime/cuda_mutable_state.cpp	Implements context/session tracking, template capture, buffer allocation, and rebind logic.
backends/cuda/runtime/cuda_backend.cpp	Calls into mutable-state hooks during init and before execute.
backends/cuda/runtime/test/test_cuda_mutable_state.cpp	Adds unit test covering fail-closed defaults.
backends/cuda/runtime/TARGETS	Adds new source/header to CUDA backend and registers Buck unit test target.
backends/cuda/CMakeLists.txt	Adds new source to CMake backend build and registers CMake unit test target.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Gasoonjia · 2026-06-12T20:37:49Z

@claude review the PR

claude · 2026-06-12T20:38:41Z

Claude finished @Gasoonjia's task in 1m 53s —— View job

Code Review: cuda — per-session mutable state rebinding

Reviewed all 6 changed files against main. This is a clean, well-isolated foundational primitive. The design is genuinely fail-closed: it stays dormant unless an engine opts in via create_context + begin_load/end_load + register_fqns + validate_coverage, and every error path returns a non-Ok Error rather than silently allocating or binding.

The three earlier Copilot comments are already resolved in the current revision, each with a matching regression test:

mutable_state_rebind_for_execute validates the token against c.sessions before allocating (cuda_mutable_state.cpp:495), covered by RebindRejectsUncreatedSessionToken.
mutable_state_begin_load guards against nested scopes and fails closed by poisoning build_error (:303), covered by NestedBeginLoadFailsClosed.
mutable_state_forget_handle exists and is called from CudaBackend::destroy (cuda_backend.cpp:840), covered by ForgetHandleDropsAssociation.

Nice work tightening those.

Findings

1. Shared per-FQN template/buffer can be under-sized if two methods declare the same mutable FQN with different shapes — cuda_mutable_state.cpp:165 / :216

The template (template_ptr/template_nbytes) and the session buffer are keyed by FQN and captured/sized once, from the first handle that discovers the FQN (:165). But the wrapper in ensure_bound uses this handle's Desc sizes/strides/dtype (:233-238) over that shared buffer. If a later method's container exposes the same FQN with a larger nbytes, the SlimTensor describes more bytes than were allocated → out-of-bounds device access / silent corruption, with no diagnostic.

Per-session KV/conv FQNs likely have consistent shapes across methods today, so this is latent rather than active, but it's exactly the kind of mismatch this fail-closed layer should catch. A cheap guard in ensure_bound would close it:

if (d.nbytes > c.template_nbytes[fqn]) {
  ET_LOG(Error,
      "mutable_state: descriptor '%s' (%zu B) exceeds shared template buffer (%zu B)",
      fqn.c_str(), d.nbytes, c.template_nbytes[fqn]);
  return Error::Internal;
}

Fix this →

2. mutable_state_destroy_context leaves stale thread-local active session — cuda_mutable_state.cpp:275

If a thread has set_active(ctx, token) and the context is destroyed (e.g. teardown races, or a buggy engine), tl_active_ctx/tl_active_token stay set. This is not a memory-safety bug — the next rebind_for_execute fails closed at the handle_ctx lookup (the handle's association was erased during destroy → Error::Internal at :463). Worth confirming this is the intended contract; the header already states the engine owns lifecycle/serialization, so a one-line note that destroy does not reset other threads' active selection would help future callers.

3. Multi-session is serialized, not concurrent — make sure this is loud enough — cuda_mutable_state.h:84

Because the AOTI container is shared and update_user_managed_constant_buffer_pairs rebinds it globally, two threads executing different sessions on the same container would race on the container's constant pointers. The header does document "the engine must hold its serialization lock across set + execute + read-out," which is the right contract. This is the key architectural constraint for the consuming engine; flagging it here so reviewers of the future engine PR don't miss it. No change needed in this PR.

Minor / non-blocking

mutable_state_rebind_for_execute calls update_user_managed_constant_buffer_pairs on every execute (not just first bind). This is correct for true multi-session interleaving (the container may have been rebound by another session since), but adds a per-execute rebind cost even in the common single-active-session case. Fine for a foundation; a future optimization could skip the rebind when the container is already pointed at this session's buffers.
GPU allocation paths (build_descriptors, ensure_session_buffers, ensure_bound with non-empty templates) are not exercised by the unit test — the fake handle reports get_num_constants == 0. The fail-closed control flow is well covered; the actual clone/rebind logic will only get coverage once a real model or a fake with non-zero constants drives it. Reasonable for a primitive landing ahead of its consumer, but worth a follow-up.
Build wiring (CMake :242-250, Buck :142-156) looks correct and symmetric; the test is registered in both with CUDA_AVAILABLE=1.

Overall this is solid, defensive, and well-documented. Finding #1 is the only one I'd consider addressing before the consuming engine lands; the rest are confirmations/notes.
• Branch: llm-pr-a-cuda-mutable-state

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

mergennachin · 2026-06-12T20:51:20Z

@claude Review the PR

Double check if it addressed previous feedbacks from Copilot and Claude

claude · 2026-06-12T20:52:10Z

Claude finished @mergennachin's task in 1m 19s —— View job

Re-review: did the PR address prior Copilot & Claude feedback?

Checked all 6 changed files at 0be0912 against every prior review comment. Verdict: all 7 prior findings are resolved, each with a fail-closed code path and (where applicable) a matching regression test.

Copilot round 1 (2026-06-12T20:16)

Implicit session_buf[token] allocation for uncreated tokens → rebind_for_execute now validates c.sessions.find(token) and returns InvalidArgument before any allocation (cuda_mutable_state.cpp:509-512). Test: RebindRejectsUncreatedSessionToken.
begin_load overwrites tl_loading_ctx unconditionally (nesting) → guarded; nested entry poisons both contexts' build_error = InvalidState, logs, and clears the thread-local (:314-329). Test: NestedBeginLoadFailsClosed.
No way to unregister a destroyed handle → mutable_state_forget_handle added (:457-470), called from CudaBackend::destroy (cuda_backend.cpp:840). Test: ForgetHandleDropsAssociation.

Copilot round 2 (2026-06-12T20:44)

Rebinding under CUDA graph capture/replay silently no-ops → rejected with NotSupported when cuda_graph_state.phase != Disabled (:513-518). Test: RebindRejectsCudaGraphHandle (the test Copilot explicitly asked for).
create_session succeeds despite earlier build_error → now returns build_error first (:402-404), matching validate_coverage.
available() ignores build_error → now folds it in: build_error == Error::Ok && symbols_available (:355-356).
Missing CUDA-graph regression test → added (see above).

Claude round 1 (2026-06-12T20:38) — Finding #1 (the only one flagged as worth fixing pre-merge)

Shared per-FQN buffer under-sized if a later method declares the same FQN with a larger nbytes → ensure_bound now guards d.nbytes > template_nbytes[fqn] (and missing-template), returning Error::Internal with a diagnostic before wrapping the buffer (:232-243). This closes the out-of-bounds GPU access path.

Findings #2 (stale thread-local active session after destroy) and #3 (multi-session serialization contract) were notes/confirmations, not change requests — both remain correctly fail-closed at the handle_ctx/build_error lookups.

Build wiring

CMake: source CMakeLists.txt:188, test target :246-249 with CUDA_AVAILABLE=1.
Buck: source TARGETS:109/113, test target :143-145. Symmetric with CMake.

Remaining observations (non-blocking, no prior comment)

cudaMalloc template leak on partial build_descriptors failure — :165-181: each FQN's template is cudaMalloc'd and recorded in c.template_ptr as the loop proceeds. If a later FQN's cudaMalloc/cudaMemcpy fails and returns Error::Internal, the already-allocated earlier templates are only freed later by mutable_state_destroy_context. Since the engine is contractually required to destroy the context on a non-Ok build, this is reclaimed — but worth a one-line note that cleanup is deferred to destroy_context, not immediate.
GPU clone/rebind logic still has no positive-path coverage — the fake handle reports get_num_constants == 0, so build_descriptors/ensure_session_buffers/ensure_bound (including the new size guard at :232-243) never execute their allocation bodies in tests. The fail-closed control flow is thoroughly covered; the actual copy-and-bind path will only get exercised once a real model or a fake with non-zero constants drives it. Reasonable for a primitive landing ahead of its consumer — flagging for a follow-up. Add coverage →

Summary: every actionable item from both Copilot rounds and Claude's blocking Finding #1 is addressed in code with corresponding tests. No regressions introduced. The two observations above are follow-ups, not merge blockers.
• Branch: llm-pr-a-cuda-mutable-state

mergennachin · 2026-06-12T21:06:54Z

@claude Review again

claude · 2026-06-12T21:07:44Z

Claude finished @mergennachin's task in 1m 33s —— View job

Re-review: cuda — per-session mutable state rebinding

Re-read all 6 files at the current tip (b0ec6fc). Everything from both Copilot rounds, Claude's blocking Finding #1, and both of my prior follow-up observations is now resolved in code with matching tests. No regressions, no new blocking issues.

What changed since my last review

The two items I'd flagged as follow-ups are now closed:

Positive-path GPU coverage → CapturesClonesAndRebindsDeviceBuffer (test_cuda_mutable_state.cpp:308) drives a real FakeContainer with a non-zero constant, so build_descriptors → ensure_session_buffers → ensure_bound → update_user_managed_constant_buffer_pairs all execute their allocation/clone bodies. It asserts the bound buffer is a distinct device pointer from the source, has the right nbytes, and contains the cloned contents (D2H copy compared to {1,2,3,4}). This is exactly the clone-and-rebind path that previously had no coverage.
Size-guard regression test → RebindRejectsLargerDescriptorForSharedFqn (:365) registers the same FQN across two handles where the second declares a larger nbytes, and confirms rebind_for_execute returns Error::Internal with update_calls == 0. This locks in the d.nbytes > template_nbytes[fqn] guard at cuda_mutable_state.cpp:232-243 (Claude Finding Add support for quantized LeakyReLU #1).

Both are CUDA-gated via cuda_device_available() / GTEST_SKIP, so they're real coverage on CUDA CI and harmless elsewhere.

Prior findings — still confirmed resolved

Copilot R1: uncreated-token allocation (:509), nested begin_load (:314), forget_handle from destroy (:457, cuda_backend.cpp:840).
Copilot R2: CUDA-graph rebind rejected NotSupported (:513); create_session honors build_error (:402); available() folds in build_error (:355).
Claude R1 Add support for quantized LeakyReLU #1: shared-FQN under-size guard (:232) — now also test-locked.

Cross-check on execute ordering

The mutable_state_rebind_for_execute call sits at cuda_backend.cpp:551, before the CudaGraphPhase::Replay branch (:556). Since rebind returns NotSupported whenever phase != Disabled, an active session on a graph-captured handle fails closed before replay — consistent with the design. ✅

Remaining non-blocking notes (carried over, no action required for this PR)

Deferred template cleanup — build_descriptors (:165-181) cudaMallocs per-FQN templates as it goes; on a later-FQN failure the earlier templates are reclaimed only by mutable_state_destroy_context, not immediately. The header contract requires the engine to destroy the context on a non-Ok build, so memory is reclaimed — just worth a one-line comment that cleanup is deferred.
Serialized multi-session — the shared AOTI container means concurrent sessions on one container would race on constant pointers; the header documents that the engine must hold its serialization lock across set+execute+read-out. This is the key constraint for the future consuming engine PR, not this one.

Verdict: This foundational primitive is fail-closed, well-documented, and now has both the control-flow and the allocation/clone paths under test. LGTM to land ahead of its consumer.
• Branch: llm-pr-a-cuda-mutable-state

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Gasoonjia · 2026-06-12T21:27:43Z

Thanks Mergen for this! Looks good.

I've one question regarding kv cache sharing across methods: like prefill and decode are separate methods (each its own AOTI container / CudaDelegateHandle) exporting the same KV-cache FQN: since session_buf is keyed (token → fqn → buffer) and ensure_bound points every handle's wrapper at the same session_buf[token][fqn], can you confirm the intended behavior is that, under one active session token, both handles rebind to the same GPU buffer so decode continues exactly what prefill wrote? Or is the V1 primitive only meant for single-method per-session isolation? Any test you can add to guard the target behavoir?

Another thing im worried about whether the session support impact cuda graph inference, since cuda graph inference needs to capture a static path but now the path may not be static since we are gonna change session, maybe we should move the cuda graph infortaion from CudaBackendHandle to cuda_mutable_state.

Local agent serving needs to host multiple logical conversations on one CUDA-resident model without multiplying the model weights. Loading one AOTI module per conversation is not viable for large local models, while sharing the default mutable state across conversations would let KV/recurrent/conv buffers bleed between users. This adds the CUDA-private foundation for separating those concerns: weights remain owned by the loaded AOTI container, while mutable buffer FQNs can be registered as per-session state and rebound before execution. The path is fail-closed and dormant until a model opts in by creating a mutable-state context and validating coverage, so existing CUDA models keep their current behavior. The branch also wires the new source and unit coverage into both Buck and CMake so the primitive can land independently before any model-specific engine consumes it.

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

+    // A successful call may still report an unusable (null/empty) name --
+    // that's a method-scoped constant, not an error: skip it (another container
+    // owns it). A non-OK return code above is a real failure and falls closed.
+    if (internal && fqn && fqn[0] != '\0') {
+      fqn_to_internal[fqn] = internal;
+    }


+      void* tpl = nullptr;
+      if (cudaMalloc(&tpl, t->nbytes()) != cudaSuccess) {
+        ET_LOG(Error, "mutable_state: cudaMalloc template '%s'", fqn.c_str());
+        return Error::Internal;
+      }
+      if (cudaMemcpy(
+              tpl, t->data_ptr(), t->nbytes(), cudaMemcpyDeviceToDevice) !=
+          cudaSuccess) {
+        ET_LOG(Error, "mutable_state: cudaMemcpy template '%s'", fqn.c_str());
+        cudaFree(tpl);
+        return Error::Internal;
+      }


+    void* tpl = kv.second;
+    size_t nbytes = c.template_nbytes[fqn];
+    void* p = nullptr;
+    if (cudaMalloc(&p, nbytes) != cudaSuccess) {
+      ET_LOG(
+          Error, "mutable_state: cudaMalloc session buffer '%s'", fqn.c_str());
+      return Error::Internal;
+    }
+    if (cudaMemcpy(p, tpl, nbytes, cudaMemcpyDeviceToDevice) != cudaSuccess) {
+      ET_LOG(
+          Error, "mutable_state: cudaMemcpy session buffer '%s'", fqn.c_str());
+      cudaFree(p);
+      return Error::Internal;
+    }


+void free_session_buffers(Context& c, int token) {
+  auto it = c.session_buf.find(token);
+  if (it != c.session_buf.end()) {
+    for (auto& kv : it->second) {
+      if (kv.second) {
+        cudaFree(kv.second);
+      }
+    }
+    c.session_buf.erase(it);
+  }


+  Context& c = it->second;
+  for (auto& kv : c.template_ptr) {
+    if (kv.second) {
+      cudaFree(kv.second);
+    }
+  }
+  for (auto& sb : c.session_buf) {
+    for (auto& kv : sb.second) {
+      if (kv.second) {
+        cudaFree(kv.second);
+      }
+    }
+  }


shoumikhin · 2026-06-12T22:05:29Z

+}
+
+void mutable_state_set_active(MutableStateContext ctx, int token) {
+  tl_active_ctx = ctx;


The active session is stored in a thread-local that is set before execute but never cleared afterward. If a reset is ever missed (an exception, an early return, or another execute on the same thread), the next run rebinds the container to the previous session's buffers. When that next run is the same model there is no error, and one conversation's state silently leaks into another's. Don't rely on the caller resetting it: add an RAII guard that sets the active session in its constructor and clears it in its destructor, and reset it right after the run in execute. The same applies to begin_load/end_load: if load_method throws before end_load, the leftover loading state poisons the next unrelated load on that thread, so wrap that in an RAII scope too.

shoumikhin · 2026-06-12T22:05:30Z

+
+Error mutable_state_rebind_for_execute(CudaDelegateHandle* handle) {
+  if (tl_active_token == kNoMutableSession) {
+    return Error::Ok; // single-session / legacy: nothing to rebind


Rebinding changes the shared container in place and the change persists. Once a session has run, the container keeps pointing at that session's buffers. A later run with no active session hits this early return, does nothing, and silently executes on the last session's buffers. So mixing session and non-session runs on the same model corrupts both. Either assert that once a model uses sessions every run must select a session, or rebind back to the model's default buffers in this no-session path.

shoumikhin · 2026-06-12T22:05:31Z

+  }
+  auto& m = mgr();
+  std::lock_guard<std::mutex> g(m.mu);
+


This holds one process-wide lock across the external rebind call and the GPU allocations below. That forces every model in the process to take turns on the hottest part of execute, and it can stall everything if the rebind call blocks or re-enters. Use a per-context lock instead, and copy out the values you need (the pairs, container handle, function pointer) under the lock, then release it before making the rebind call and the allocations.

shoumikhin · 2026-06-12T22:05:32Z

+}
+
+void mutable_state_destroy_session(MutableStateContext ctx, int token) {
+  auto& m = mgr();


This frees the session's GPU buffers right away, but a run that used them may still be finishing on the GPU since some methods return without waiting on the stream. Freeing memory the GPU is still writing is a use-after-free and can crash or corrupt data. Wait on the relevant stream or event before freeing, or require the caller to ensure all runs are done first and assert it. The same applies to destroy_context.

shoumikhin · 2026-06-12T22:05:33Z

+}
+
+Error fake_container_update_user_managed_pairs(
+    aoti::AOTInductorModelContainerHandle container,


This fake only records its arguments and never stores the bound buffer, so the GPU tests cannot prove a rebind actually took effect or that a second session's rebind replaces the first. There is also no test for the main guarantee that two sessions stay isolated. Please make this fake remember the bound buffer per name, then add a GPU test that rebinds session A and writes a pattern, rebinds session B and writes a different pattern, rebinds A again, and asserts A's buffer is the same one and still holds A's data.

mergennachin requested review from kirklandsign and larryliu0820 as code owners June 12, 2026 20:11

Copilot AI review requested due to automatic review settings June 12, 2026 20:11

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 12, 2026

mergennachin had a problem deploying to cadence June 12, 2026 20:11 — with GitHub Actions Error

mergennachin temporarily deployed to cadence June 12, 2026 20:11 — with GitHub Actions Inactive

mergennachin requested a review from Gasoonjia June 12, 2026 20:11

Copilot started reviewing on behalf of mergennachin June 12, 2026 20:12 View session

Copilot AI reviewed Jun 12, 2026

View reviewed changes

Comment thread backends/cuda/runtime/cuda_mutable_state.cpp Outdated

Comment thread backends/cuda/runtime/cuda_mutable_state.cpp

Comment thread backends/cuda/runtime/cuda_mutable_state.cpp

mergennachin force-pushed the llm-pr-a-cuda-mutable-state branch from 73fb7d1 to 4632963 Compare June 12, 2026 20:35

mergennachin had a problem deploying to cadence June 12, 2026 20:35 — with GitHub Actions Error

Copilot AI review requested due to automatic review settings June 12, 2026 20:39

mergennachin force-pushed the llm-pr-a-cuda-mutable-state branch from 4632963 to a9c517c Compare June 12, 2026 20:39

Copilot started reviewing on behalf of mergennachin June 12, 2026 20:40 View session

mergennachin had a problem deploying to cadence June 12, 2026 20:40 — with GitHub Actions Error

Copilot AI reviewed Jun 12, 2026

View reviewed changes

Comment thread backends/cuda/runtime/cuda_mutable_state.cpp

Comment thread backends/cuda/runtime/cuda_mutable_state.cpp

Comment thread backends/cuda/runtime/cuda_mutable_state.cpp

Comment thread backends/cuda/runtime/test/test_cuda_mutable_state.cpp

mergennachin force-pushed the llm-pr-a-cuda-mutable-state branch from a9c517c to 0be0912 Compare June 12, 2026 20:50

mergennachin had a problem deploying to cadence June 12, 2026 20:51 — with GitHub Actions Error

mergennachin requested a review from shoumikhin June 12, 2026 20:53

Gasoonjia added the ciflow/cuda label Jun 12, 2026

Copilot AI review requested due to automatic review settings June 12, 2026 21:00

mergennachin force-pushed the llm-pr-a-cuda-mutable-state branch from 0be0912 to b0ec6fc Compare June 12, 2026 21:00

Copilot started reviewing on behalf of mergennachin June 12, 2026 21:01 View session

mergennachin had a problem deploying to cadence June 12, 2026 21:01 — with GitHub Actions Error

Copilot AI reviewed Jun 12, 2026

View reviewed changes

Comment thread backends/cuda/runtime/cuda_mutable_state.cpp

Comment thread backends/cuda/runtime/cuda_mutable_state.cpp Outdated

Comment thread backends/cuda/runtime/TARGETS

Comment thread backends/cuda/runtime/TARGETS

mergennachin force-pushed the llm-pr-a-cuda-mutable-state branch from b0ec6fc to 4344f00 Compare June 12, 2026 21:18

mergennachin had a problem deploying to cadence June 12, 2026 21:19 — with GitHub Actions Error

Gasoonjia approved these changes Jun 12, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings June 12, 2026 21:33

mergennachin force-pushed the llm-pr-a-cuda-mutable-state branch from 4344f00 to de9d3cd Compare June 12, 2026 21:33

Copilot started reviewing on behalf of mergennachin June 12, 2026 21:33 View session

mergennachin temporarily deployed to cadence June 12, 2026 21:34 — with GitHub Actions Inactive

Copilot AI reviewed Jun 12, 2026

View reviewed changes

shoumikhin reviewed Jun 12, 2026

View reviewed changes

Conversation

mergennachin commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20241

❗ 1 Active SEVs

❌ 142 Cancelled Jobs, 2 Pending, 2 Unclassified Failures

Uh oh!

github-actions Bot commented Jun 12, 2026

This PR needs a release notes: label

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Gasoonjia commented Jun 12, 2026

Uh oh!

claude Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: cuda — per-session mutable state rebinding

Findings

Minor / non-blocking

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mergennachin commented Jun 12, 2026

Uh oh!

claude Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Re-review: did the PR address prior Copilot & Claude feedback?

Copilot round 1 (2026-06-12T20:16)

Copilot round 2 (2026-06-12T20:44)

Claude round 1 (2026-06-12T20:38) — Finding #1 (the only one flagged as worth fixing pre-merge)

Build wiring

Remaining observations (non-blocking, no prior comment)

Uh oh!

mergennachin commented Jun 12, 2026

Uh oh!

claude Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Re-review: cuda — per-session mutable state rebinding

What changed since my last review

Prior findings — still confirmed resolved

Cross-check on execute ordering

Remaining non-blocking notes (carried over, no action required for this PR)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Gasoonjia commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

shoumikhin Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

shoumikhin Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

shoumikhin Jun 12, 2026

mergennachin commented Jun 12, 2026 •

edited

Loading

pytorch-bot Bot commented Jun 12, 2026 •

edited

Loading

This PR needs a `release notes:` label

claude Bot commented Jun 12, 2026 •

edited

Loading

claude Bot commented Jun 12, 2026 •

edited

Loading

claude Bot commented Jun 12, 2026 •

edited

Loading