cuda: add per-session mutable state rebinding#20241
Conversation
This PR needs a
|
There was a problem hiding this comment.
Pull request overview
Adds CUDA-private infrastructure to support per-session rebinding of mutable model state (e.g., KV/conv/recurrent buffers) while keeping model weights shared across logical conversations, enabling multi-session local serving without duplicating large CUDA-resident weights.
Changes:
- Introduces
cuda_mutable_statecontext/session manager with descriptor capture and per-execute rebinding hooks. - Integrates mutable-state hooks into
CudaBackend::init(handle noting) andCudaBackend::execute(rebind-before-run). - Adds a fail-closed unit test and wires it into both Buck and CMake builds.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| backends/cuda/runtime/cuda_mutable_state.h | Declares CUDA-private context/session API for mutable-state rebinding. |
| backends/cuda/runtime/cuda_mutable_state.cpp | Implements context/session tracking, template capture, buffer allocation, and rebind logic. |
| backends/cuda/runtime/cuda_backend.cpp | Calls into mutable-state hooks during init and before execute. |
| backends/cuda/runtime/test/test_cuda_mutable_state.cpp | Adds unit test covering fail-closed defaults. |
| backends/cuda/runtime/TARGETS | Adds new source/header to CUDA backend and registers Buck unit test target. |
| backends/cuda/CMakeLists.txt | Adds new source to CMake backend build and registers CMake unit test target. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
73fb7d1 to
4632963
Compare
|
@claude review the PR |
|
Claude finished @Gasoonjia's task in 1m 53s —— View job Code Review: cuda — per-session mutable state rebindingReviewed all 6 changed files against The three earlier Copilot comments are already resolved in the current revision, each with a matching regression test:
Nice work tightening those. Findings1. Shared per-FQN template/buffer can be under-sized if two methods declare the same mutable FQN with different shapes — The template ( Per-session KV/conv FQNs likely have consistent shapes across methods today, so this is latent rather than active, but it's exactly the kind of mismatch this fail-closed layer should catch. A cheap guard in if (d.nbytes > c.template_nbytes[fqn]) {
ET_LOG(Error,
"mutable_state: descriptor '%s' (%zu B) exceeds shared template buffer (%zu B)",
fqn.c_str(), d.nbytes, c.template_nbytes[fqn]);
return Error::Internal;
}2. If a thread has 3. Multi-session is serialized, not concurrent — make sure this is loud enough — Because the AOTI container is shared and Minor / non-blocking
Overall this is solid, defensive, and well-documented. Finding #1 is the only one I'd consider addressing before the consuming engine lands; the rest are confirmations/notes. |
4632963 to
a9c517c
Compare
a9c517c to
0be0912
Compare
|
@claude Review the PR Double check if it addressed previous feedbacks from Copilot and Claude |
|
Claude finished @mergennachin's task in 1m 19s —— View job Re-review: did the PR address prior Copilot & Claude feedback?Checked all 6 changed files at Copilot round 1 (2026-06-12T20:16)
Copilot round 2 (2026-06-12T20:44)
Claude round 1 (2026-06-12T20:38) — Finding #1 (the only one flagged as worth fixing pre-merge)
Findings #2 (stale thread-local active session after destroy) and #3 (multi-session serialization contract) were notes/confirmations, not change requests — both remain correctly fail-closed at the Build wiring
Remaining observations (non-blocking, no prior comment)
Summary: every actionable item from both Copilot rounds and Claude's blocking Finding #1 is addressed in code with corresponding tests. No regressions introduced. The two observations above are follow-ups, not merge blockers. |
0be0912 to
b0ec6fc
Compare
|
@claude Review again |
|
Claude finished @mergennachin's task in 1m 33s —— View job Re-review: cuda — per-session mutable state rebindingRe-read all 6 files at the current tip ( What changed since my last reviewThe two items I'd flagged as follow-ups are now closed:
Both are CUDA-gated via Prior findings — still confirmed resolved
Cross-check on execute orderingThe Remaining non-blocking notes (carried over, no action required for this PR)
Verdict: This foundational primitive is fail-closed, well-documented, and now has both the control-flow and the allocation/clone paths under test. LGTM to land ahead of its consumer. |
b0ec6fc to
4344f00
Compare
|
Thanks Mergen for this! Looks good. I've one question regarding kv cache sharing across methods: like prefill and decode are separate methods (each its own AOTI container / CudaDelegateHandle) exporting the same KV-cache FQN: since session_buf is keyed (token → fqn → buffer) and ensure_bound points every handle's wrapper at the same session_buf[token][fqn], can you confirm the intended behavior is that, under one active session token, both handles rebind to the same GPU buffer so decode continues exactly what prefill wrote? Or is the V1 primitive only meant for single-method per-session isolation? Any test you can add to guard the target behavoir? Another thing im worried about whether the session support impact cuda graph inference, since cuda graph inference needs to capture a static path but now the path may not be static since we are gonna change session, maybe we should move the cuda graph infortaion from CudaBackendHandle to cuda_mutable_state. |
Local agent serving needs to host multiple logical conversations on one CUDA-resident model without multiplying the model weights. Loading one AOTI module per conversation is not viable for large local models, while sharing the default mutable state across conversations would let KV/recurrent/conv buffers bleed between users. This adds the CUDA-private foundation for separating those concerns: weights remain owned by the loaded AOTI container, while mutable buffer FQNs can be registered as per-session state and rebound before execution. The path is fail-closed and dormant until a model opts in by creating a mutable-state context and validating coverage, so existing CUDA models keep their current behavior. The branch also wires the new source and unit coverage into both Buck and CMake so the primitive can land independently before any model-specific engine consumes it.
4344f00 to
de9d3cd
Compare
| // A successful call may still report an unusable (null/empty) name -- | ||
| // that's a method-scoped constant, not an error: skip it (another container | ||
| // owns it). A non-OK return code above is a real failure and falls closed. | ||
| if (internal && fqn && fqn[0] != '\0') { | ||
| fqn_to_internal[fqn] = internal; | ||
| } |
| void* tpl = nullptr; | ||
| if (cudaMalloc(&tpl, t->nbytes()) != cudaSuccess) { | ||
| ET_LOG(Error, "mutable_state: cudaMalloc template '%s'", fqn.c_str()); | ||
| return Error::Internal; | ||
| } | ||
| if (cudaMemcpy( | ||
| tpl, t->data_ptr(), t->nbytes(), cudaMemcpyDeviceToDevice) != | ||
| cudaSuccess) { | ||
| ET_LOG(Error, "mutable_state: cudaMemcpy template '%s'", fqn.c_str()); | ||
| cudaFree(tpl); | ||
| return Error::Internal; | ||
| } |
| void* tpl = kv.second; | ||
| size_t nbytes = c.template_nbytes[fqn]; | ||
| void* p = nullptr; | ||
| if (cudaMalloc(&p, nbytes) != cudaSuccess) { | ||
| ET_LOG( | ||
| Error, "mutable_state: cudaMalloc session buffer '%s'", fqn.c_str()); | ||
| return Error::Internal; | ||
| } | ||
| if (cudaMemcpy(p, tpl, nbytes, cudaMemcpyDeviceToDevice) != cudaSuccess) { | ||
| ET_LOG( | ||
| Error, "mutable_state: cudaMemcpy session buffer '%s'", fqn.c_str()); | ||
| cudaFree(p); | ||
| return Error::Internal; | ||
| } |
| void free_session_buffers(Context& c, int token) { | ||
| auto it = c.session_buf.find(token); | ||
| if (it != c.session_buf.end()) { | ||
| for (auto& kv : it->second) { | ||
| if (kv.second) { | ||
| cudaFree(kv.second); | ||
| } | ||
| } | ||
| c.session_buf.erase(it); | ||
| } |
| Context& c = it->second; | ||
| for (auto& kv : c.template_ptr) { | ||
| if (kv.second) { | ||
| cudaFree(kv.second); | ||
| } | ||
| } | ||
| for (auto& sb : c.session_buf) { | ||
| for (auto& kv : sb.second) { | ||
| if (kv.second) { | ||
| cudaFree(kv.second); | ||
| } | ||
| } | ||
| } |
| } | ||
|
|
||
| void mutable_state_set_active(MutableStateContext ctx, int token) { | ||
| tl_active_ctx = ctx; |
There was a problem hiding this comment.
The active session is stored in a thread-local that is set before execute but never cleared afterward. If a reset is ever missed (an exception, an early return, or another execute on the same thread), the next run rebinds the container to the previous session's buffers. When that next run is the same model there is no error, and one conversation's state silently leaks into another's. Don't rely on the caller resetting it: add an RAII guard that sets the active session in its constructor and clears it in its destructor, and reset it right after the run in execute. The same applies to begin_load/end_load: if load_method throws before end_load, the leftover loading state poisons the next unrelated load on that thread, so wrap that in an RAII scope too.
|
|
||
| Error mutable_state_rebind_for_execute(CudaDelegateHandle* handle) { | ||
| if (tl_active_token == kNoMutableSession) { | ||
| return Error::Ok; // single-session / legacy: nothing to rebind |
There was a problem hiding this comment.
Rebinding changes the shared container in place and the change persists. Once a session has run, the container keeps pointing at that session's buffers. A later run with no active session hits this early return, does nothing, and silently executes on the last session's buffers. So mixing session and non-session runs on the same model corrupts both. Either assert that once a model uses sessions every run must select a session, or rebind back to the model's default buffers in this no-session path.
| } | ||
| auto& m = mgr(); | ||
| std::lock_guard<std::mutex> g(m.mu); | ||
|
|
There was a problem hiding this comment.
This holds one process-wide lock across the external rebind call and the GPU allocations below. That forces every model in the process to take turns on the hottest part of execute, and it can stall everything if the rebind call blocks or re-enters. Use a per-context lock instead, and copy out the values you need (the pairs, container handle, function pointer) under the lock, then release it before making the rebind call and the allocations.
| } | ||
|
|
||
| void mutable_state_destroy_session(MutableStateContext ctx, int token) { | ||
| auto& m = mgr(); |
There was a problem hiding this comment.
This frees the session's GPU buffers right away, but a run that used them may still be finishing on the GPU since some methods return without waiting on the stream. Freeing memory the GPU is still writing is a use-after-free and can crash or corrupt data. Wait on the relevant stream or event before freeing, or require the caller to ensure all runs are done first and assert it. The same applies to destroy_context.
| } | ||
|
|
||
| Error fake_container_update_user_managed_pairs( | ||
| aoti::AOTInductorModelContainerHandle container, |
There was a problem hiding this comment.
This fake only records its arguments and never stores the bound buffer, so the GPU tests cannot prove a rebind actually took effect or that a second session's rebind replaces the first. There is also no test for the main guarantee that two sessions stay isolated. Please make this fake remember the bound buffer per name, then add a GPU test that rebinds session A and writes a pattern, rebinds session B and writes a different pattern, rebinds A again, and asserts A's buffer is the same one and still holds A's data.
Local agent serving needs to host multiple logical conversations on one CUDA-resident model without multiplying the model weights. Loading one AOTI module per conversation is not viable for large local models, while sharing the default mutable state across conversations would let KV/recurrent/conv buffers bleed between users.
This adds the CUDA-private foundation for separating those concerns: weights remain owned by the loaded AOTI container, while mutable buffer FQNs can be registered as per-session state and rebound before execution. The path is fail-closed and dormant until a model opts in by creating a mutable-state context and validating coverage, so existing CUDA models keep their current behavior.
The branch also wires the new source and fall-closed unit test into both Buck and CMake so the primitive can land independently before any model-specific engine consumes it.
#20001