fix: release asyncio.Lock before streaming to prevent orphan on client disconnect#235
fix: release asyncio.Lock before streaming to prevent orphan on client disconnect

Narrow the lock scope in `ContextWebSocket.execute()` so that the lock is only held during the prepare+send phase, not during result streaming.

Previously, the lock was held for the entire generator lifetime, including the `_wait_for_result()` streaming loop. When a client disconnected (e.g. SDK timeout), Starlette abandoned the generator while it was blocked at `await queue.get()`. The lock stayed held until the kernel finished internally, blocking all subsequent executions on the same context and causing cascading timeouts.

The fix moves the streaming phase (Phase B) outside the `async with self._lock` block. This is safe because results are routed by unique `message_id` in `_process_message()`: no shared state is accessed during streaming. A `try/finally` ensures the execution entry is cleaned up even if the generator is abandoned.

Fixes #213
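The two-phase shape described above can be sketched as a minimal, self-contained example. The class and attribute names (`ContextWebSocket`, `execute`, `_executions`) mirror the PR description; the bodies are illustrative stand-ins, not the real server code.

```python
import asyncio

class ContextWebSocket:
    def __init__(self):
        self._lock = asyncio.Lock()
        self._executions = {}  # message_id -> queue of streamed results
        self._counter = 0

    async def execute(self, code):
        # Phase A: prepare + send, under the lock.
        async with self._lock:
            self._counter += 1
            message_id = f"msg-{self._counter}"
            queue = asyncio.Queue()
            self._executions[message_id] = queue
            # Stand-in for sending the request to the kernel; the "kernel"
            # replies shortly with one result and an end-of-stream marker.
            loop = asyncio.get_running_loop()
            loop.call_later(0.01, queue.put_nowait, f"result of {code}")
            loop.call_later(0.02, queue.put_nowait, None)
        # Phase B: stream results with NO lock held. Results are routed by
        # unique message_id, so concurrent executions cannot interfere.
        try:
            while True:
                item = await queue.get()
                if item is None:
                    return
                yield item
        finally:
            # Runs even if the consumer abandons the generator mid-stream.
            self._executions.pop(message_id, None)

async def main():
    ws = ContextWebSocket()
    out = [item async for item in ws.execute("1+1")]
    print(out)              # streamed results
    print(ws._executions)   # entry removed by the finally block

asyncio.run(main())
```

Because the `finally` sits on the generator itself, cleanup of the `_executions` entry happens whether the stream completes, raises, or is abandoned by a disconnected client.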
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 15ad5ab02e
```python
# Phase B: stream results (no lock held)
# Results are routed by unique message_id in _process_message(),
# so no shared state is accessed during streaming.
```
Serialize env-var cleanup before admitting next execution
Releasing `self._lock` before result streaming allows a second `execute()` call to enter Phase A and send its request before the first call schedules `_cleanup_task` in its `finally` block. When the first call used `env_vars`, that cleanup request is now queued after the second execute request, so the second execution can run with stale per-request env vars from the first. This is a correctness/isolation regression for overlapping requests on the same context: keep cleanup scheduling serialized with the next execute admission, or otherwise gate new requests until the prior env cleanup is registered.
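One way to get the serialization the review asks for, sketched under assumptions: the env-var cleanup owed by the previous execution is flushed inside the locked admission phase, so a second `execute()` cannot send its request ahead of the first execution's cleanup. `Context`, `_pending_cleanup`, and the `sent` log are hypothetical stand-ins for the real messaging code.

```python
import asyncio

class Context:
    def __init__(self):
        self._lock = asyncio.Lock()
        self._pending_cleanup = None  # env var names owed a cleanup, or None
        self.sent = []                # order of requests as the kernel sees them

    async def execute(self, code, env_vars=None):
        # Phase A: admission is serialized by the lock.
        async with self._lock:
            if self._pending_cleanup:
                # Queue the previous cleanup BEFORE admitting this execution.
                self.sent.append(("cleanup", self._pending_cleanup))
                self._pending_cleanup = None
            if env_vars:
                self.sent.append(("set_env", sorted(env_vars)))
                self._pending_cleanup = sorted(env_vars)
            self.sent.append(("exec", code))
        # Phase B: streaming would happen here with the lock released.
        await asyncio.sleep(0)

async def demo():
    ctx = Context()
    await ctx.execute("first", env_vars={"TOKEN": "secret"})
    await ctx.execute("second")  # must not run with TOKEN still set
    return ctx.sent

print(asyncio.run(demo()))
```

With this shape the cleanup request is always enqueued ahead of the next execution's request, even though the lock is still released during streaming.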
Summary
Fixes #213: the `asyncio.Lock` in `messaging.py` was not released on client disconnect, causing cascading timeouts.

**Root cause:** `execute()` holds `self._lock` for the entire duration of code execution + result streaming. When a client disconnects (SDK timeout), the lock stays held until the Jupyter kernel finishes internally. All subsequent executions on the same context block behind the orphaned lock.

**Fix:** Split `execute()` into two phases:

- Phase A (lock held): prepare and send the execution request.
- Phase B (no lock): stream results. Results are routed by unique `message_id` in `_process_message()`, so no shared state needs protection.

A `try/finally` ensures the `_executions` entry is cleaned up even if the generator is abandoned on client disconnect.

Reproduction
Tested against a live sandbox built from the `code-interpreter` template.

Before fix: bug confirmed.

After fix: resolved.
The remaining ~30s wait is expected: the Jupyter kernel serializes execution, so the next request waits for `sleep(30)` to finish. The key difference is that the lock is free, so the request is queued at the kernel level, not stuck at the Python lock level. This eliminates the cascading timeout problem where each retry adds another full timeout duration.

What changed
`template/server/messaging.py`: Narrowed the lock scope in `execute()` to cover only the prepare+send phase, not result streaming. Added `try/finally` for cleanup.

Test plan

- `print('hello')` executes in <1s
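The key claim of the fix, that a mid-stream client disconnect leaves the lock free, can be reproduced with a small asyncio sketch. `stream_narrow` is a hypothetical stand-in with the post-fix shape: the lock is released at the end of the send phase, before streaming begins.

```python
import asyncio

async def stream_narrow(lock, queue):
    async with lock:           # Phase A only: prepare + send
        pass
    while True:                # Phase B: no lock held
        item = await queue.get()
        if item is None:
            return
        yield item

async def main():
    lock = asyncio.Lock()
    queue = asyncio.Queue()    # never filled: simulates a stalled kernel
    gen = stream_narrow(lock, queue)

    async def consume_first():
        return await gen.__anext__()

    task = asyncio.create_task(consume_first())
    await asyncio.sleep(0.01)  # let the stream block at queue.get()
    task.cancel()              # simulate the client disconnecting
    try:
        await task
    except asyncio.CancelledError:
        pass
    # The lock must be free even though the stream never finished;
    # wait_for would raise TimeoutError if it were still held.
    await asyncio.wait_for(lock.acquire(), timeout=0.1)
    lock.release()
    print("lock free after disconnect: True")

asyncio.run(main())
```

With the pre-fix shape (the `while` loop inside the `async with`), the same cancellation would leave the lock held until the generator is finalized, which is exactly the orphan described in #213.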