
fix(rtc): auto-reconnect signaling WebSocket on unexpected loss #241

Merged
aliev merged 2 commits into main from fix/auto-reconnect-signaling-ws on May 12, 2026

Conversation

@aliev (Member) commented May 11, 2026

Why

WebSocketClient in getstream/video/rtc/signaling.py has no reconnect logic of its own. On an unexpected close, an error after the handshake, or a health-check timeout, it logs the failure and stops. The session then hangs until the frontend's own timeout fires a DELETE /sessions/... and the agent goes away.

Under concurrent load this is now the dominant cause of "the agent disconnected randomly" reports on a production deploy: a single brief TCP reset or missed heartbeat between agent and SFU is fatal because nothing reconnects.

We can drive the fix through the existing ReconnectionManager — it already handles strategy escalation (FAST → REJOIN → MIGRATE), locking, and the disconnection-timeout deadline. The missing piece is just the notification.

Changes

  • signaling.py: emit a connection_lost event (with a reason string) from the WS worker thread when the connection drops outside of a user-initiated close(). Three sites:
    • _on_close — unexpected close.
    • _on_error — error after first_message_event is set (i.e. past the initial handshake; pre-handshake errors continue to surface via first_message).
    • _ping_loop — health-check timeout. Notify before self.close(), otherwise the not self.closed guard suppresses the event.
    Emission is idempotent via _connection_lost_sent, so a chained _on_error → _on_close only fires once (see the sketch below).
  • connection_manager.py: register on_event("connection_lost", self._on_signaling_connection_lost) after the other handlers. The handler invokes self._reconnector.reconnect(ReconnectionStrategy.FAST, …).

No new public API; consumers that don't care about the event are unaffected.
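
For orientation, here is a minimal sketch of the emission path. It assumes an event-emitter-style client; apart from connection_lost, _connection_lost_sent, and the worker-thread callback names taken from this PR, the attributes and structure are illustrative rather than the SDK's actual internals.

```python
import asyncio


class SignalingClientSketch:
    """Illustrative only; the real WebSocketClient in signaling.py differs."""

    def __init__(self, loop: asyncio.AbstractEventLoop):
        self._loop = loop                    # main asyncio loop
        self.closed = False                  # set by a user-initiated close()
        self._connection_lost_sent = False   # idempotency flag
        self._handlers: dict[str, list] = {}

    def on_event(self, name, handler):
        self._handlers.setdefault(name, []).append(handler)

    async def emit(self, name, *args):
        for handler in self._handlers.get(name, []):
            await handler(*args)

    def _notify_connection_lost(self, reason: str) -> None:
        # Runs on the WS worker thread: skip user-initiated closes and keep the
        # emission idempotent so a chained _on_error -> _on_close fires once.
        if self.closed or self._connection_lost_sent:
            return
        self._connection_lost_sent = True
        asyncio.run_coroutine_threadsafe(
            self.emit("connection_lost", reason), self._loop
        )

    def _on_close(self, ws, code, msg):
        self._notify_connection_lost(f"websocket closed unexpectedly (code={code})")

    def _on_error(self, ws, error):
        # The real client additionally requires first_message_event to be set,
        # i.e. only errors after the initial handshake are treated as a loss.
        self._notify_connection_lost(f"websocket error: {error}")
```

In the health-check path (_ping_loop), the notify call has to come before self.close() for the reason stated above: once closed is set, the guard treats the drop as user-initiated and suppresses the event.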

Test plan

  • Existing tests/test_signaling.py — all 7 cases pass locally (error path during connect, event callbacks, threading, traces) and don't trip on the new emit.
  • E2E: verified in a deployed environment that prior to this fix, transient WS drops under load caused the symptom this PR addresses. Will re-verify after deploy with the fix.

Summary by CodeRabbit

  • New Features

    • The system now automatically attempts to reconnect when signaling WebSocket connections drop unexpectedly, improving reliability during network disruptions.
    • Connection loss events are emitted when WebSocket errors or unexpected closures occur, enabling proper detection and response to disconnections.
  • Tests

    • Added tests to verify automatic reconnection is triggered on connection loss and that events are properly emitted.


@coderabbitai (Bot) commented May 11, 2026

Warning

Rate limit exceeded

@aliev has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 26 minutes and 48 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9c28fc40-ea57-4eb1-8f41-9ab5a38bfa60

📥 Commits

Reviewing files that changed from the base of the PR and between 20777a7 and 9c9d8b6.

📒 Files selected for processing (2)
  • getstream/video/rtc/signaling.py
  • tests/test_signaling.py
📝 Walkthrough

When the SFU signaling WebSocket drops unexpectedly, WebSocketClient now emits a connection_lost event that ConnectionManager listens to and responds with automatic fast reconnection. Event emission is driven by raw socket/transport loss from worker threads.

Changes

Signaling WebSocket reconnection on loss

  • WebSocketClient connection_lost event emission (getstream/video/rtc/signaling.py, tests/test_signaling.py): WebSocketClient detects WebSocket errors and unexpected closes, schedules a connection_lost event onto the main asyncio loop from worker threads, and emits the event before close during the health-check timeout. A test verifies the event is emitted on a code 1006 close with the close code in the reason.
  • ConnectionManager reconnection handler (getstream/video/rtc/connection_manager.py, tests/rtc/test_connection_manager.py): ConnectionManager imports ReconnectionStrategy, registers _on_signaling_connection_lost to receive the WebSocket client's connection_lost events, and triggers the internal reconnector with ReconnectionStrategy.FAST when the manager is running. A test verifies the reconnector is called with the fast strategy and the provided reason.

Sequence Diagram

sequenceDiagram
  participant WebSocketTransport
  participant WebSocketClient
  participant AsyncioLoop
  participant ConnectionManager
  participant Reconnector
  WebSocketTransport->>WebSocketClient: error or unexpected close
  WebSocketClient->>AsyncioLoop: run_coroutine_threadsafe(emit)
  AsyncioLoop->>ConnectionManager: connection_lost event
  ConnectionManager->>Reconnector: reconnect(strategy=FAST, reason)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 When websockets drop in the dead of night,
Our speedy reconnector sets things right,
From thread to loop, the signal flies,
Connection lost—but quick, it tries!
Fast strategy saves the streaming day! 🚀

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: Passed. The title 'fix(rtc): auto-reconnect signaling WebSocket on unexpected loss' clearly and specifically describes the main change: adding automatic reconnection for the signaling WebSocket when it unexpectedly closes.
  • Docstring Coverage: Passed. Docstring coverage is 81.25%, which is sufficient; the required threshold is 80.00%.
  • Linked Issues Check: Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@aliev aliev force-pushed the fix/auto-reconnect-signaling-ws branch from 7be299a to a81381f on May 11, 2026 at 14:47
@aliev aliev force-pushed the fix/auto-reconnect-signaling-ws branch from c983be2 to 2f6797e on May 12, 2026 at 11:10
`WebSocketClient` had no reconnect logic of its own — `_on_close`,
`_on_error`, and the health-check timeout in `_ping_loop` only logged
and stopped the connection. When the signaling WS dropped (transient
TCP reset, missed health check, server-side close without an SFU error
event), the session sat hanging until the frontend's own timeout fired
a DELETE.

Under concurrent load this is the dominant remaining cause of "the
agent disconnected randomly" reports: even with the late-offer fix in
place, a single brief WS blip is fatal because nothing kicks
reconnection.

- `signaling.py`: emit a `connection_lost` event with a reason string
  on (a) unexpected `_on_close`, (b) `_on_error` after the initial
  handshake completed, and (c) `_ping_loop` health-check timeout
  (notify *before* `self.close()` so the user-initiated guard does not
  suppress it). Idempotent via `_connection_lost_sent`.
- `connection_manager.py`: subscribe to `connection_lost` on the WS
  client and route it into the existing `ReconnectionManager`
  (`ReconnectionStrategy.FAST`). The manager already handles strategy
  escalation, locking, and the disconnection-timeout deadline.
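
The connection_manager.py half described above might look roughly like the following sketch. The ReconnectionStrategy enum here is a stand-in mirroring the strategy names mentioned in this PR, and the async reconnect(strategy, reason=...) signature is an assumption, not the SDK's literal API.

```python
from enum import Enum


class ReconnectionStrategy(Enum):
    # Stand-in for the SDK enum referenced in this PR; values are illustrative.
    FAST = "fast"
    REJOIN = "rejoin"
    MIGRATE = "migrate"


class ConnectionManagerSketch:
    def __init__(self, ws_client, reconnector):
        self._ws = ws_client
        self._reconnector = reconnector
        self._running = True
        # Registered after the other signaling handlers, per the PR description.
        self._ws.on_event("connection_lost", self._on_signaling_connection_lost)

    async def _on_signaling_connection_lost(self, reason: str) -> None:
        if not self._running:
            return
        # ReconnectionManager already owns locking, strategy escalation
        # (FAST -> REJOIN -> MIGRATE) and the disconnection-timeout deadline;
        # this handler only supplies the missing trigger.
        await self._reconnector.reconnect(ReconnectionStrategy.FAST, reason=reason)
```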

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@getstream/video/rtc/signaling.py`:
- Around line 217-218: The issue is duplicate connection_lost events in
error→close sequences; add an idempotency flag (e.g.,
self._connection_lost_emitted initialized to False in __init__) and update
_notify_connection_lost to return immediately if the flag is True and otherwise
set it to True when emitting the event (so subsequent calls from _on_error and
_on_close won't re-emit); also reset the flag to False at the start of connect()
so a fresh connection can emit again.

In `@tests/test_signaling.py`:
- Around line 298-349: Add a new async test that covers the error→close sequence
to ensure connection_lost is emitted only once: create a WebSocketClient (same
pattern as test_connection_lost_emitted_on_unexpected_close), register an async
handler via client.on_event("connection_lost", on_lost) that appends to
received, finish the handshake using the join_response and client.connect(),
then simulate on_error followed by on_close using
mock_websocket.call_args[1]["on_error"](...) and
mock_websocket.call_args[1]["on_close"](...); after small asyncio.sleep()s
assert len(received) == 1 to catch duplicate emissions from _on_error and
_on_close. Ensure the new test is named
test_connection_lost_no_duplicate_on_error_then_close and cleans up by calling
client.close().
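
As a stdlib-only approximation of that regression test (it reuses the dedupe pattern sketched in the PR description instead of the repo's mock_websocket / join_response fixtures, which are not shown here), the key assertion looks like this:

```python
import asyncio


class FakeSignalingClient:
    """Stand-in client with the idempotent connection_lost emission under test."""

    def __init__(self, loop: asyncio.AbstractEventLoop):
        self._loop = loop
        self._sent = False
        self._handlers = []

    def on_event(self, handler):
        self._handlers.append(handler)

    def _notify_connection_lost(self, reason: str) -> None:
        if self._sent:          # the idempotency flag being exercised
            return
        self._sent = True
        asyncio.run_coroutine_threadsafe(self._emit(reason), self._loop)

    async def _emit(self, reason):
        for handler in self._handlers:
            await handler(reason)

    # Worker-thread callbacks, mirroring signaling.py's _on_error / _on_close
    def _on_error(self, error):
        self._notify_connection_lost(f"websocket error: {error}")

    def _on_close(self, code):
        self._notify_connection_lost(f"unexpected close (code={code})")


async def main() -> None:
    received = []

    async def on_lost(reason):
        received.append(reason)

    client = FakeSignalingClient(asyncio.get_running_loop())
    client.on_event(on_lost)
    client._on_error(RuntimeError("boom"))  # error ...
    client._on_close(1006)                  # ... then close from the same drop
    await asyncio.sleep(0.05)               # let the scheduled emit run
    assert len(received) == 1, received     # exactly one connection_lost


if __name__ == "__main__":
    asyncio.run(main())
```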

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7777ec03-9bfc-4f44-b494-2628881c7a34

📥 Commits

Reviewing files that changed from the base of the PR and between c71b551 and 20777a7.

📒 Files selected for processing (4)
  • getstream/video/rtc/connection_manager.py
  • getstream/video/rtc/signaling.py
  • tests/rtc/test_connection_manager.py
  • tests/test_signaling.py

Comment threads on getstream/video/rtc/signaling.py and tests/test_signaling.py:
The previous refactor removed the idempotency flag on the grounds that
`ReconnectionManager.reconnect` is already locked + state-guarded, so a
duplicate `connection_lost` from a chained `_on_error` → `_on_close`
was harmless. That reasoning only held for the SDK-internal path —
because of the `on_wildcard("*", self.emit)` re-emit in
`ConnectionManager`, the event also leaks to any user subscriber, who
shouldn't have to dedupe.

Bring the flag back, but split the check+set into a small
`_claim_connection_lost()` helper so the responsibilities stay
separated. Add a regression test exercising the actual
`_on_error` → `_on_close` chain.
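
A minimal sketch of what such a check-and-set helper could look like (the _claim_connection_lost name comes from the comment above; the lock and the reset-on-connect detail are assumptions):

```python
import threading


class ConnectionLostClaimSketch:
    """Illustrative check-and-set helper; the real one is a method on WebSocketClient."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._sent = False

    def claim(self) -> bool:
        # Returns True exactly once per connection: the first of a chained
        # _on_error -> _on_close wins, later callers skip the duplicate emit.
        with self._lock:
            if self._sent:
                return False
            self._sent = True
            return True

    def reset(self) -> None:
        # Called when a fresh connection is established so the next loss can emit again.
        with self._lock:
            self._sent = False
```

The emit sites then reduce to a single `if self._claim_connection_lost():` check, keeping the dedupe decision in one place and duplicate events out of the user-facing stream.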