
fix: release asyncio.Lock before streaming to prevent orphan on client disconnect #235

Closed
beran-t wants to merge 1 commit into main from fix/lock-orphan-on-disconnect

Conversation


beran-t commented Mar 24, 2026

Summary

Fixes #213: asyncio.Lock in messaging.py not released on client disconnect, causing cascading timeouts.

Root cause: execute() holds self._lock for the entire duration of code execution + result streaming. When a client disconnects (SDK timeout), the lock stays held until the Jupyter kernel finishes internally. All subsequent executions on the same context block behind the orphaned lock.

Fix: Split execute() into two phases:

  • Phase A (under lock): Prepare env vars, register execution, send request to kernel. Lock is released after the send.
  • Phase B (no lock): Stream results back to client. Results are routed by unique message_id in _process_message(), so no shared state needs protection.

A try/finally ensures the _executions entry is cleaned up even if the generator is abandoned on client disconnect.
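The two-phase split can be sketched as a minimal model. The class shape, `_send_execute_request`, and the queue-per-`message_id` routing below are illustrative stand-ins inferred from this description, not the actual messaging.py code:

```python
import asyncio
import uuid


class ContextWebSocket:
    """Illustrative model of the two-phase split, not the repo's code."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self._executions = {}  # message_id -> queue of result parts

    async def _send_execute_request(self, message_id, code):
        # Stand-in for sending the request to the Jupyter kernel; here we
        # immediately enqueue a fake result followed by a done sentinel.
        queue = self._executions[message_id]
        queue.put_nowait(f"result of {code}")
        queue.put_nowait(None)

    async def execute(self, code):
        message_id = str(uuid.uuid4())
        # Phase A (under lock): register the execution and send the request.
        async with self._lock:
            self._executions[message_id] = asyncio.Queue()
            await self._send_execute_request(message_id, code)
        # The lock is released here, before any result streaming begins.
        try:
            # Phase B (no lock): stream results routed by message_id.
            while True:
                part = await self._executions[message_id].get()
                if part is None:
                    return
                yield part
        finally:
            # Runs even if the client disconnects and the generator is
            # abandoned mid-stream.
            del self._executions[message_id]
```

Because the cleanup lives in `finally`, closing the generator early (as Starlette does when the client goes away) still removes the `_executions` entry.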

Reproduction

Tested against a live sandbox built from the code-interpreter template:

Before fix — bug confirmed:

[T+2.1s]  Submitting: time.sleep(30) with 5s SDK timeout
[T+7.3s]  Expected timeout: TimeoutException
[T+7.3s]  Now running: print('hello') on same context
[T+32.4s] BUG CONFIRMED - took 25.1s (lock was orphaned)

Cascade test: 3/3 retries TIMED OUT (each blocked behind lock)

After fix — resolved:

[T+2.0s]  Submitting: time.sleep(30) with 5s SDK timeout
[T+7.1s]  Expected timeout: TimeoutException
[T+7.1s]  Now running: print('hello') on same context
[T+37.2s] Result: 'hello after disconnect' — took 30.1s
          (waits for kernel, but lock is NOT orphaned)

Cascade test: retries queue at kernel level, no lock deadlock

The remaining ~30s wait is expected — the Jupyter kernel serializes execution, so the next request waits for sleep(30) to finish. The key difference: the lock is free, so the request is queued at the kernel level, not stuck at the Python lock level. This eliminates the cascading timeout problem where each retry adds another full timeout duration.
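The distinction the logs show (queued at the kernel, not stuck at the lock) can be demonstrated with a small self-contained asyncio sketch. `NarrowLockStreamer` and its queues are toy stand-ins, not the actual messaging.py types:

```python
import asyncio


class NarrowLockStreamer:
    """Toy model: the lock covers only the prepare+send phase."""

    def __init__(self):
        self._lock = asyncio.Lock()

    async def execute(self, results):
        async with self._lock:
            pass  # Phase A stand-in: prepare + send, then release the lock
        while True:  # Phase B: stream without holding the lock
            item = await results.get()
            if item is None:
                return
            yield item


async def main():
    streamer = NarrowLockStreamer()

    # First execution: the "kernel" never replies, and the client abandons
    # the generator mid-stream without closing it (like a disconnect).
    hung = streamer.execute(asyncio.Queue())
    abandoned = asyncio.ensure_future(hung.__anext__())
    await asyncio.sleep(0)  # let it block inside Phase B

    # Second execution on the same context must not block on the lock.
    replies = asyncio.Queue()
    replies.put_nowait("hello")
    replies.put_nowait(None)
    collected = []

    async def consume():
        async for item in streamer.execute(replies):
            collected.append(item)

    # With the old broad lock this would hang until the kernel finished;
    # with the narrowed scope it completes well inside the timeout.
    await asyncio.wait_for(consume(), timeout=1.0)

    abandoned.cancel()  # tidy up the abandoned first execution
    return collected


print(asyncio.run(main()))  # prints ['hello']
```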

What changed

  • template/server/messaging.py: Narrowed lock scope in execute() to cover only the prepare+send phase, not result streaming. Added try/finally for cleanup.

Test plan

  • Baseline: normal print('hello') executes in <1s
  • Lock orphan: after SDK timeout on long-running code, next execution is no longer blocked by orphaned lock
  • Cascade: retries no longer cascade into infinite timeouts
  • Env var cleanup still runs after execution completes

…t disconnect

Narrow the lock scope in ContextWebSocket.execute() so that the lock
is only held during the prepare+send phase, not during result streaming.

Previously, the lock was held for the entire generator lifetime including
the _wait_for_result() streaming loop.  When a client disconnected
(e.g. SDK timeout), Starlette abandoned the generator while it was
blocked at `await queue.get()`.  The lock stayed held until the kernel
finished internally, blocking all subsequent executions on the same
context and causing cascading timeouts.

The fix moves the streaming phase (Phase B) outside the `async with
self._lock` block.  This is safe because results are routed by unique
message_id in _process_message() — no shared state is accessed during
streaming.  A try/finally ensures the execution entry is cleaned up
even if the generator is abandoned.

Fixes #213

chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 15ad5ab02e


Comment on lines +372 to +374
# Phase B: stream results (no lock held)
# Results are routed by unique message_id in _process_message(),
# so no shared state is accessed during streaming.


P1: Serialize env-var cleanup before admitting the next execution

Releasing self._lock before result streaming allows a second execute() call to enter Phase A and send its request before the first call schedules _cleanup_task in its finally block. If the first call used env_vars, that cleanup request is now queued after the second execute request, so the second execution can run with stale per-request env vars from the first. This is a correctness/isolation regression for overlapping requests on the same context; keep cleanup scheduling serialized with the next execute admission (or otherwise gate new requests until the prior env cleanup is registered).
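One way to implement such a gate, sketched under the assumption of an asyncio.Event per context (`EnvVarGatedContext` and all of its members are hypothetical, not the repo's API): Phase A does not admit a new execution until the previous execution's env-var cleanup request has been enqueued, so the kernel sees exec-1, cleanup-1, exec-2 in order.

```python
import asyncio


class EnvVarGatedContext:
    """Sketch of gating admissions on prior env-var cleanup (hypothetical)."""

    def __init__(self):
        self._lock = asyncio.Lock()
        self._cleanup_enqueued = asyncio.Event()
        self._cleanup_enqueued.set()  # no prior cleanup outstanding
        self.kernel_queue = []  # stand-in for requests sent to the kernel

    async def execute(self, code, env_vars=None):
        async with self._lock:
            # Admission gate: the previous execution's cleanup request must
            # already be in the kernel queue before we send a new request.
            await self._cleanup_enqueued.wait()
            if env_vars:
                self._cleanup_enqueued.clear()
            self.kernel_queue.append(("exec", code))
        try:
            # Phase B stand-in: stream one fake result without the lock.
            yield f"result of {code}"
        finally:
            if env_vars:
                # Enqueue cleanup, then admit the next execution.
                self.kernel_queue.append(("cleanup", code))
                self._cleanup_enqueued.set()


async def main():
    ctx = EnvVarGatedContext()

    # First execution uses per-request env vars and is abandoned mid-stream.
    first = ctx.execute("first", env_vars={"TOKEN": "abc"})
    await first.__anext__()

    # Second execution arrives while the first is still streaming: it waits
    # at the admission gate instead of running with stale env vars.
    async def run_second():
        return [part async for part in ctx.execute("second")]

    second = asyncio.ensure_future(run_second())
    await asyncio.sleep(0)
    assert ("exec", "second") not in ctx.kernel_queue  # still gated

    await first.aclose()  # disconnect: finally enqueues the cleanup
    await second
    return ctx.kernel_queue


print(asyncio.run(main()))
```

A real implementation would also need a timeout or guaranteed finalization of the abandoned generator, since this gate deadlocks if the cleanup never gets enqueued.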


@beran-t beran-t closed this Mar 24, 2026


Development

Successfully merging this pull request may close these issues.

asyncio.Lock in messaging.py not released on client disconnect → cascading timeouts
