
fix(rtc): auto-reconnect signaling WebSocket on unexpected loss #241

Merged
aliev merged 2 commits into main from fix/auto-reconnect-signaling-ws on May 12, 2026

Conversation

@aliev (Member) commented May 11, 2026

Why

WebSocketClient in getstream/video/rtc/signaling.py has no reconnect logic of its own. On an unexpected close, an error after the handshake, or a health-check timeout, it logs the failure and stops. The session then hangs until the frontend's own timeout fires a DELETE /sessions/... and the agent goes away.

Under concurrent load this is now the dominant cause of "the agent disconnected randomly" reports on a production deploy: a single brief TCP reset or missed heartbeat between agent and SFU is fatal because nothing reconnects.

We can drive the fix through the existing ReconnectionManager — it already handles strategy escalation (FAST → REJOIN → MIGRATE), locking, and the disconnection-timeout deadline. The missing piece is just the notification.

Changes

  • signaling.py: emit a connection_lost event (with a reason string) from the WS worker thread when the connection drops outside of a user-initiated close(). Three sites:
    • _on_close — unexpected close.
    • _on_error — error after first_message_event is set (i.e. past the initial handshake; pre-handshake errors continue to surface via first_message).
    • _ping_loop — health-check timeout. Notify before self.close(), otherwise the not self.closed guard suppresses the event.
    Emission is idempotent via _connection_lost_sent, so a chained _on_error → _on_close only fires once (see the sketch below).
  • connection_manager.py: register on_event("connection_lost", self._on_signaling_connection_lost) after the other handlers. The handler invokes self._reconnector.reconnect(ReconnectionStrategy.FAST, …).

No new public API; consumers that don't care about the event are unaffected.
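
For orientation, here is a minimal sketch of the emission path. It assumes an event-emitter-style client; apart from connection_lost, _connection_lost_sent, and the worker-thread callback names taken from this PR, the attributes and structure are illustrative rather than the SDK's actual internals.

```python
import asyncio


class SignalingClientSketch:
    """Illustrative only; the real WebSocketClient in signaling.py differs."""

    def __init__(self, loop: asyncio.AbstractEventLoop):
        self._loop = loop                    # main asyncio loop
        self.closed = False                  # set by a user-initiated close()
        self._connection_lost_sent = False   # idempotency flag
        self._handlers: dict[str, list] = {}

    def on_event(self, name, handler):
        self._handlers.setdefault(name, []).append(handler)

    async def emit(self, name, *args):
        for handler in self._handlers.get(name, []):
            await handler(*args)

    def _notify_connection_lost(self, reason: str) -> None:
        # Runs on the WS worker thread: skip user-initiated closes and keep the
        # emission idempotent so a chained _on_error -> _on_close fires once.
        if self.closed or self._connection_lost_sent:
            return
        self._connection_lost_sent = True
        asyncio.run_coroutine_threadsafe(
            self.emit("connection_lost", reason), self._loop
        )

    def _on_close(self, ws, code, msg):
        self._notify_connection_lost(f"websocket closed unexpectedly (code={code})")

    def _on_error(self, ws, error):
        # The real client additionally requires first_message_event to be set,
        # i.e. only errors after the initial handshake are treated as a loss.
        self._notify_connection_lost(f"websocket error: {error}")
```

In the health-check path (_ping_loop), the notify call has to come before self.close() for the reason stated above: once closed is set, the guard treats the drop as user-initiated and suppresses the event.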

Test plan

  • Existing tests/test_signaling.py — all 7 cases pass locally (error path during connect, event callbacks, threading, traces) and don't trip on the new emit.
  • E2E: verified in a deployed environment that prior to this fix, transient WS drops under load caused the symptom this PR addresses. Will re-verify after deploy with the fix.

Summary by CodeRabbit

  • New Features

    • The system now automatically attempts to reconnect when signaling WebSocket connections drop unexpectedly, improving reliability during network disruptions.
    • Connection loss events are emitted when WebSocket errors or unexpected closures occur, enabling proper detection and response to disconnections.
  • Tests

    • Added tests to verify automatic reconnection is triggered on connection loss and that events are properly emitted.


@coderabbitai (Bot) commented May 11, 2026

Warning

Rate limit exceeded

@aliev has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 26 minutes and 48 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9c28fc40-ea57-4eb1-8f41-9ab5a38bfa60

📥 Commits

Reviewing files that changed from the base of the PR and between 20777a7 and 9c9d8b6.

📒 Files selected for processing (2)
  • getstream/video/rtc/signaling.py
  • tests/test_signaling.py
📝 Walkthrough

When the SFU signaling WebSocket drops unexpectedly, WebSocketClient now emits a connection_lost event that ConnectionManager listens to and responds with automatic fast reconnection. Event emission is driven by raw socket/transport loss from worker threads.

Changes

Signaling WebSocket reconnection on loss

  • WebSocketClient connection_lost event emission (getstream/video/rtc/signaling.py, tests/test_signaling.py): WebSocketClient detects WebSocket errors and unexpected closes, schedules a connection_lost event onto the main asyncio loop from worker threads, and emits the event before close during the health-check timeout. A test verifies the event is emitted on a code 1006 close with the close code in the reason.
  • ConnectionManager reconnection handler (getstream/video/rtc/connection_manager.py, tests/rtc/test_connection_manager.py): ConnectionManager imports ReconnectionStrategy, registers _on_signaling_connection_lost to receive the WebSocket client's connection_lost events, and triggers the internal reconnector with ReconnectionStrategy.FAST when the manager is running. A test verifies the reconnector is called with the fast strategy and the provided reason.

Sequence Diagram

sequenceDiagram
  participant WebSocketTransport
  participant WebSocketClient
  participant AsyncioLoop
  participant ConnectionManager
  participant Reconnector
  WebSocketTransport->>WebSocketClient: error or unexpected close
  WebSocketClient->>AsyncioLoop: run_coroutine_threadsafe(emit)
  AsyncioLoop->>ConnectionManager: connection_lost event
  ConnectionManager->>Reconnector: reconnect(strategy=FAST, reason)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 When websockets drop in the dead of night,
Our speedy reconnector sets things right,
From thread to loop, the signal flies,
Connection lost—but quick, it tries!
Fast strategy saves the streaming day! 🚀

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: Passed. The title 'fix(rtc): auto-reconnect signaling WebSocket on unexpected loss' clearly and specifically describes the main change: adding automatic reconnection for the signaling WebSocket when it unexpectedly closes.
  • Docstring Coverage: Passed. Docstring coverage is 81.25%, which is sufficient; the required threshold is 80.00%.
  • Linked Issues Check: Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@aliev aliev force-pushed the fix/auto-reconnect-signaling-ws branch from 7be299a to a81381f on May 11, 2026 at 14:47
@aliev aliev force-pushed the fix/auto-reconnect-signaling-ws branch from c983be2 to 2f6797e on May 12, 2026 at 11:10
`WebSocketClient` had no reconnect logic of its own — `_on_close`,
`_on_error`, and the health-check timeout in `_ping_loop` only logged
and stopped the connection. When the signaling WS dropped (transient
TCP reset, missed health check, server-side close without an SFU error
event), the session sat hanging until the frontend's own timeout fired
a DELETE.

Under concurrent load this is the dominant remaining cause of "the
agent disconnected randomly" reports: even with the late-offer fix in
place, a single brief WS blip is fatal because nothing kicks
reconnection.

- `signaling.py`: emit a `connection_lost` event with a reason string
  on (a) unexpected `_on_close`, (b) `_on_error` after the initial
  handshake completed, and (c) `_ping_loop` health-check timeout
  (notify *before* `self.close()` so the user-initiated guard does not
  suppress it). Idempotent via `_connection_lost_sent`.
- `connection_manager.py`: subscribe to `connection_lost` on the WS
  client and route it into the existing `ReconnectionManager`
  (`ReconnectionStrategy.FAST`). The manager already handles strategy
  escalation, locking, and the disconnection-timeout deadline.
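
The connection_manager.py half described above might look roughly like the following sketch. The ReconnectionStrategy enum here is a stand-in mirroring the strategy names mentioned in this PR, and the async reconnect(strategy, reason=...) signature is an assumption, not the SDK's literal API.

```python
from enum import Enum


class ReconnectionStrategy(Enum):
    # Stand-in for the SDK enum referenced in this PR; values are illustrative.
    FAST = "fast"
    REJOIN = "rejoin"
    MIGRATE = "migrate"


class ConnectionManagerSketch:
    def __init__(self, ws_client, reconnector):
        self._ws = ws_client
        self._reconnector = reconnector
        self._running = True
        # Registered after the other signaling handlers, per the PR description.
        self._ws.on_event("connection_lost", self._on_signaling_connection_lost)

    async def _on_signaling_connection_lost(self, reason: str) -> None:
        if not self._running:
            return
        # ReconnectionManager already owns locking, strategy escalation
        # (FAST -> REJOIN -> MIGRATE) and the disconnection-timeout deadline;
        # this handler only supplies the missing trigger.
        await self._reconnector.reconnect(ReconnectionStrategy.FAST, reason=reason)
```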

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@getstream/video/rtc/signaling.py`:
- Around line 217-218: The issue is duplicate connection_lost events in
error→close sequences; add an idempotency flag (e.g.,
self._connection_lost_emitted initialized to False in __init__) and update
_notify_connection_lost to return immediately if the flag is True and otherwise
set it to True when emitting the event (so subsequent calls from _on_error and
_on_close won't re-emit); also reset the flag to False at the start of connect()
so a fresh connection can emit again.

In `@tests/test_signaling.py`:
- Around line 298-349: Add a new async test that covers the error→close sequence
to ensure connection_lost is emitted only once: create a WebSocketClient (same
pattern as test_connection_lost_emitted_on_unexpected_close), register an async
handler via client.on_event("connection_lost", on_lost) that appends to
received, finish the handshake using the join_response and client.connect(),
then simulate on_error followed by on_close using
mock_websocket.call_args[1]["on_error"](...) and
mock_websocket.call_args[1]["on_close"](...); after small asyncio.sleep()s
assert len(received) == 1 to catch duplicate emissions from _on_error and
_on_close. Ensure the new test is named
test_connection_lost_no_duplicate_on_error_then_close and cleans up by calling
client.close().
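
As a stdlib-only approximation of that regression test (it reuses the dedupe pattern sketched in the PR description instead of the repo's mock_websocket / join_response fixtures, which are not shown here), the key assertion looks like this:

```python
import asyncio


class FakeSignalingClient:
    """Stand-in client with the idempotent connection_lost emission under test."""

    def __init__(self, loop: asyncio.AbstractEventLoop):
        self._loop = loop
        self._sent = False
        self._handlers = []

    def on_event(self, handler):
        self._handlers.append(handler)

    def _notify_connection_lost(self, reason: str) -> None:
        if self._sent:          # the idempotency flag being exercised
            return
        self._sent = True
        asyncio.run_coroutine_threadsafe(self._emit(reason), self._loop)

    async def _emit(self, reason):
        for handler in self._handlers:
            await handler(reason)

    # Worker-thread callbacks, mirroring signaling.py's _on_error / _on_close
    def _on_error(self, error):
        self._notify_connection_lost(f"websocket error: {error}")

    def _on_close(self, code):
        self._notify_connection_lost(f"unexpected close (code={code})")


async def main() -> None:
    received = []

    async def on_lost(reason):
        received.append(reason)

    client = FakeSignalingClient(asyncio.get_running_loop())
    client.on_event(on_lost)
    client._on_error(RuntimeError("boom"))  # error ...
    client._on_close(1006)                  # ... then close from the same drop
    await asyncio.sleep(0.05)               # let the scheduled emit run
    assert len(received) == 1, received     # exactly one connection_lost


if __name__ == "__main__":
    asyncio.run(main())
```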

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7777ec03-9bfc-4f44-b494-2628881c7a34

📥 Commits

Reviewing files that changed from the base of the PR and between c71b551 and 20777a7.

📒 Files selected for processing (4)
  • getstream/video/rtc/connection_manager.py
  • getstream/video/rtc/signaling.py
  • tests/rtc/test_connection_manager.py
  • tests/test_signaling.py

Comment threads on getstream/video/rtc/signaling.py and tests/test_signaling.py:
The previous refactor removed the idempotency flag on the grounds that
`ReconnectionManager.reconnect` is already locked + state-guarded, so a
duplicate `connection_lost` from a chained `_on_error` → `_on_close`
was harmless. That reasoning only held for the SDK-internal path —
because of the `on_wildcard("*", self.emit)` re-emit in
`ConnectionManager`, the event also leaks to any user subscriber, who
shouldn't have to dedupe.

Bring the flag back, but split the check+set into a small
`_claim_connection_lost()` helper so the responsibilities stay
separated. Add a regression test exercising the actual
`_on_error` → `_on_close` chain.
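
A minimal sketch of what such a check-and-set helper could look like (the _claim_connection_lost name comes from the comment above; the lock and the reset-on-connect detail are assumptions):

```python
import threading


class ConnectionLostClaimSketch:
    """Illustrative check-and-set helper; the real one is a method on WebSocketClient."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._sent = False

    def claim(self) -> bool:
        # Returns True exactly once per connection: the first of a chained
        # _on_error -> _on_close wins, later callers skip the duplicate emit.
        with self._lock:
            if self._sent:
                return False
            self._sent = True
            return True

    def reset(self) -> None:
        # Called when a fresh connection is established so the next loss can emit again.
        with self._lock:
            self._sent = False
```

The emit sites then reduce to a single `if self._claim_connection_lost():` check, keeping the dedupe decision in one place and duplicate events out of the user-facing stream.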