
fix: release asyncio.Lock before streaming to prevent orphan on client disconnect #235

Closed
beran-t wants to merge 1 commit into main from fix/lock-orphan-on-disconnect

Conversation


beran-t commented Mar 24, 2026

Summary

Fixes #213: asyncio.Lock in messaging.py not released on client disconnect, causing cascading timeouts.

Root cause: execute() holds self._lock for the entire duration of code execution + result streaming. When a client disconnects (SDK timeout), the lock stays held until the Jupyter kernel finishes internally. All subsequent executions on the same context block behind the orphaned lock.

Fix: Split execute() into two phases:

  • Phase A (under lock): Prepare env vars, register execution, send request to kernel. Lock is released after the send.
  • Phase B (no lock): Stream results back to client. Results are routed by unique message_id in _process_message(), so no shared state needs protection.

A try/finally ensures the _executions entry is cleaned up even if the generator is abandoned on client disconnect.
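The two-phase split can be sketched as a minimal model. The class shape, `_send_execute_request`, and the queue-per-`message_id` routing below are illustrative stand-ins inferred from this description, not the actual messaging.py code:

```python
import asyncio
import uuid


class ContextWebSocket:
    """Illustrative model of the two-phase split, not the repo's code."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self._executions = {}  # message_id -> queue of result parts

    async def _send_execute_request(self, message_id, code):
        # Stand-in for sending the request to the Jupyter kernel; here we
        # immediately enqueue a fake result followed by a done sentinel.
        queue = self._executions[message_id]
        queue.put_nowait(f"result of {code}")
        queue.put_nowait(None)

    async def execute(self, code):
        message_id = str(uuid.uuid4())
        # Phase A (under lock): register the execution and send the request.
        async with self._lock:
            self._executions[message_id] = asyncio.Queue()
            await self._send_execute_request(message_id, code)
        # The lock is released here, before any result streaming begins.
        try:
            # Phase B (no lock): stream results routed by message_id.
            while True:
                part = await self._executions[message_id].get()
                if part is None:
                    return
                yield part
        finally:
            # Runs even if the client disconnects and the generator is
            # abandoned mid-stream.
            del self._executions[message_id]
```

Because the cleanup lives in `finally`, closing the generator early (as Starlette does when the client goes away) still removes the `_executions` entry.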

Reproduction

Tested against a live sandbox built from the code-interpreter template:

Before fix — bug confirmed:

[T+2.1s]  Submitting: time.sleep(30) with 5s SDK timeout
[T+7.3s]  Expected timeout: TimeoutException
[T+7.3s]  Now running: print('hello') on same context
[T+32.4s] BUG CONFIRMED - took 25.1s (lock was orphaned)

Cascade test: 3/3 retries TIMED OUT (each blocked behind lock)

After fix — resolved:

[T+2.0s]  Submitting: time.sleep(30) with 5s SDK timeout
[T+7.1s]  Expected timeout: TimeoutException
[T+7.1s]  Now running: print('hello') on same context
[T+37.2s] Result: 'hello after disconnect' — took 30.1s
          (waits for kernel, but lock is NOT orphaned)

Cascade test: retries queue at kernel level, no lock deadlock

The remaining ~30s wait is expected — the Jupyter kernel serializes execution, so the next request waits for sleep(30) to finish. The key difference: the lock is free, so the request is queued at the kernel level, not stuck at the Python lock level. This eliminates the cascading timeout problem where each retry adds another full timeout duration.
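The distinction the logs show (queued at the kernel, not stuck at the lock) can be demonstrated with a small self-contained asyncio sketch. `NarrowLockStreamer` and its queues are toy stand-ins, not the actual messaging.py types:

```python
import asyncio


class NarrowLockStreamer:
    """Toy model: the lock covers only the prepare+send phase."""

    def __init__(self):
        self._lock = asyncio.Lock()

    async def execute(self, results):
        async with self._lock:
            pass  # Phase A stand-in: prepare + send, then release the lock
        while True:  # Phase B: stream without holding the lock
            item = await results.get()
            if item is None:
                return
            yield item


async def main():
    streamer = NarrowLockStreamer()

    # First execution: the "kernel" never replies, and the client abandons
    # the generator mid-stream without closing it (like a disconnect).
    hung = streamer.execute(asyncio.Queue())
    abandoned = asyncio.ensure_future(hung.__anext__())
    await asyncio.sleep(0)  # let it block inside Phase B

    # Second execution on the same context must not block on the lock.
    replies = asyncio.Queue()
    replies.put_nowait("hello")
    replies.put_nowait(None)
    collected = []

    async def consume():
        async for item in streamer.execute(replies):
            collected.append(item)

    # With the old broad lock this would hang until the kernel finished;
    # with the narrowed scope it completes well inside the timeout.
    await asyncio.wait_for(consume(), timeout=1.0)

    abandoned.cancel()  # tidy up the abandoned first execution
    return collected


print(asyncio.run(main()))  # prints ['hello']
```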

What changed

  • template/server/messaging.py: Narrowed lock scope in execute() to cover only the prepare+send phase, not result streaming. Added try/finally for cleanup.

Test plan

  • Baseline: normal print('hello') executes in <1s
  • Lock orphan: after SDK timeout on long-running code, next execution is no longer blocked by orphaned lock
  • Cascade: retries no longer cascade into infinite timeouts
  • Env var cleanup still runs after execution completes

…t disconnect

Narrow the lock scope in ContextWebSocket.execute() so that the lock
is only held during the prepare+send phase, not during result streaming.

Previously, the lock was held for the entire generator lifetime including
the _wait_for_result() streaming loop.  When a client disconnected
(e.g. SDK timeout), Starlette abandoned the generator while it was
blocked at `await queue.get()`.  The lock stayed held until the kernel
finished internally, blocking all subsequent executions on the same
context and causing cascading timeouts.

The fix moves the streaming phase (Phase B) outside the `async with
self._lock` block.  This is safe because results are routed by unique
message_id in _process_message() — no shared state is accessed during
streaming.  A try/finally ensures the execution entry is cleaned up
even if the generator is abandoned.

Fixes #213

chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 15ad5ab02e


Comment on lines +372 to +374
# Phase B: stream results (no lock held)
# Results are routed by unique message_id in _process_message(),
# so no shared state is accessed during streaming.


P1: Serialize env-var cleanup before admitting the next execution

Releasing self._lock before result streaming allows a second execute() call to enter Phase A and send its request before the first call schedules _cleanup_task in its finally block. If the first call used env_vars, that cleanup request is now queued after the second execute request, so the second execution can run with stale per-request env vars from the first. This is a correctness/isolation regression for overlapping requests on the same context; keep cleanup scheduling serialized with the next execute admission (or otherwise gate new requests until the prior env cleanup is registered).
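One way to implement such a gate, sketched under the assumption of an asyncio.Event per context (`EnvVarGatedContext` and all of its members are hypothetical, not the repo's API): Phase A does not admit a new execution until the previous execution's env-var cleanup request has been enqueued, so the kernel sees exec-1, cleanup-1, exec-2 in order.

```python
import asyncio


class EnvVarGatedContext:
    """Sketch of gating admissions on prior env-var cleanup (hypothetical)."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self._cleanup_enqueued = asyncio.Event()
        self._cleanup_enqueued.set()  # no prior cleanup outstanding
        self.kernel_queue = []  # stand-in for requests sent to the kernel

    async def execute(self, code, env_vars=None):
        async with self._lock:
            # Admission gate: the previous execution's cleanup request must
            # already be in the kernel queue before we send a new request.
            await self._cleanup_enqueued.wait()
            if env_vars:
                self._cleanup_enqueued.clear()
            self.kernel_queue.append(("exec", code))
        try:
            # Phase B stand-in: stream one fake result without the lock.
            yield f"result of {code}"
        finally:
            if env_vars:
                # Enqueue cleanup, then admit the next execution.
                self.kernel_queue.append(("cleanup", code))
                self._cleanup_enqueued.set()


async def main():
    ctx = EnvVarGatedContext()

    # First execution uses per-request env vars and is abandoned mid-stream.
    first = ctx.execute("first", env_vars={"TOKEN": "abc"})
    await first.__anext__()

    # Second execution arrives while the first is still streaming: it waits
    # at the admission gate instead of running with stale env vars.
    async def run_second():
        return [part async for part in ctx.execute("second")]

    second = asyncio.ensure_future(run_second())
    await asyncio.sleep(0)
    assert ("exec", "second") not in ctx.kernel_queue  # still gated

    await first.aclose()  # disconnect: finally enqueues the cleanup
    await second
    return ctx.kernel_queue


print(asyncio.run(main()))
```

A real implementation would also need a timeout or guaranteed finalization of the abandoned generator, since this gate deadlocks if the cleanup never gets enqueued.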


@beran-t beran-t closed this Mar 24, 2026


Development

Successfully merging this pull request may close these issues.

asyncio.Lock in messaging.py not released on client disconnect → cascading timeouts
