fix(rtc): auto-reconnect signaling WebSocket on unexpected loss (#241)
Conversation
📝 Walkthrough
When the SFU signaling WebSocket drops unexpectedly, the client now emits a connection_lost event that the connection manager routes into the existing reconnection flow.
Changes
Signaling WebSocket reconnection on loss
Sequence Diagram
sequenceDiagram
participant WebSocketTransport
participant WebSocketClient
participant AsyncioLoop
participant ConnectionManager
participant Reconnector
WebSocketTransport->>WebSocketClient: error or unexpected close
WebSocketClient->>AsyncioLoop: run_coroutine_threadsafe(emit)
AsyncioLoop->>ConnectionManager: connection_lost event
ConnectionManager->>Reconnector: reconnect(strategy=FAST, reason)
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 5 checks passed
Force-pushed from 7be299a to a81381f (Compare)
Force-pushed from c983be2 to 2f6797e (Compare)
`WebSocketClient` had no reconnect logic of its own: `_on_close`, `_on_error`, and the health-check timeout in `_ping_loop` only logged and stopped the connection. When the signaling WS dropped (transient TCP reset, missed health check, server-side close without an SFU error event), the session sat hanging until the frontend's own timeout fired a DELETE. Under concurrent load this is the dominant remaining cause of "the agent disconnected randomly" reports: even with the late-offer fix in place, a single brief WS blip is fatal because nothing kicks reconnection.

- `signaling.py`: emit a `connection_lost` event with a reason string on (a) unexpected `_on_close`, (b) `_on_error` after the initial handshake completed, and (c) `_ping_loop` health-check timeout (notify *before* `self.close()` so the user-initiated guard does not suppress it). Idempotent via `_connection_lost_sent`. See the sketch below.
- `connection_manager.py`: subscribe to `connection_lost` on the WS client and route it into the existing `ReconnectionManager` (`ReconnectionStrategy.FAST`). The manager already handles strategy escalation, locking, and the disconnection-timeout deadline.
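A minimal sketch of the signaling-side change, for orientation only. The method name `_notify_connection_lost`, the `self._loop`, `self._user_closed` attributes, and an async `emit` are assumptions; the three call sites, the `connection_lost` event, and the `_connection_lost_sent` guard are the parts described above.

```python
import asyncio


class WebSocketClient:  # excerpt; bodies abbreviated, several names assumed
    def _notify_connection_lost(self, reason: str) -> None:
        """Emit connection_lost at most once per connection, from the WS worker thread."""
        if self._connection_lost_sent or self._user_closed:
            return
        self._connection_lost_sent = True
        # Callbacks arrive on the websocket library's thread, so hop back
        # onto the SDK's asyncio loop before emitting the event.
        asyncio.run_coroutine_threadsafe(
            self.emit("connection_lost", {"reason": reason}), self._loop
        )

    def _on_close(self, ws, status_code, msg):
        self._notify_connection_lost(f"unexpected close: {status_code} {msg}")

    def _on_error(self, ws, error):
        if self.first_message_event.is_set():  # only after the initial handshake
            self._notify_connection_lost(f"websocket error: {error}")

    async def _ping_loop(self):
        # ...when a health check times out: notify *before* close() so the
        # user-initiated guard does not suppress the event, then tear down.
        self._notify_connection_lost("health-check timeout")
        self.close()
```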
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@getstream/video/rtc/signaling.py`:
- Around line 217-218: The issue is duplicate connection_lost events in
error→close sequences; add an idempotency flag (e.g.,
self._connection_lost_emitted initialized to False in __init__) and update
_notify_connection_lost to return immediately if the flag is True and otherwise
set it to True when emitting the event (so subsequent calls from _on_error and
_on_close won't re-emit); also reset the flag to False at the start of connect()
so a fresh connection can emit again.
In `@tests/test_signaling.py`:
- Around line 298-349: Add a new async test that covers the error→close sequence
to ensure connection_lost is emitted only once: create a WebSocketClient (same
pattern as test_connection_lost_emitted_on_unexpected_close), register an async
handler via client.on_event("connection_lost", on_lost) that appends to
received, finish the handshake using the join_response and client.connect(),
then simulate on_error followed by on_close using
mock_websocket.call_args[1]["on_error"](...) and
mock_websocket.call_args[1]["on_close"](...); after small asyncio.sleep()s
assert len(received) == 1 to catch duplicate emissions from _on_error and
_on_close. Ensure the new test is named
test_connection_lost_no_duplicate_on_error_then_close and cleans up by calling
client.close().
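For reference, a sketch of that test, assuming the same fixtures and client construction as the existing `test_connection_lost_emitted_on_unexpected_close` (the `mock_websocket`/`join_response` fixtures and the constructor arguments below are placeholders, not the real signatures):

```python
import asyncio

import pytest

from getstream.video.rtc.signaling import WebSocketClient


@pytest.mark.asyncio
async def test_connection_lost_no_duplicate_on_error_then_close(
    mock_websocket, join_response
):
    # Construct the client exactly as in the existing unexpected-close test.
    client = WebSocketClient(...)  # placeholder arguments
    received = []

    async def on_lost(event):
        received.append(event)

    client.on_event("connection_lost", on_lost)

    # Finish the handshake (join_response as the first message) so _on_error
    # is treated as a post-handshake failure rather than a connect error.
    await client.connect()

    # Simulate the chained error -> close callbacks from the WS thread.
    mock_websocket.call_args[1]["on_error"](None, RuntimeError("boom"))
    await asyncio.sleep(0.05)
    mock_websocket.call_args[1]["on_close"](None, 1006, "abnormal closure")
    await asyncio.sleep(0.05)

    # The idempotency flag must collapse the chain into a single event.
    assert len(received) == 1

    client.close()
```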
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7777ec03-9bfc-4f44-b494-2628881c7a34
📒 Files selected for processing (4)
getstream/video/rtc/connection_manager.py
getstream/video/rtc/signaling.py
tests/rtc/test_connection_manager.py
tests/test_signaling.py
The previous refactor removed the idempotency flag on the grounds that
`ReconnectionManager.reconnect` is already locked + state-guarded, so a
duplicate `connection_lost` from a chained `_on_error` → `_on_close`
was harmless. That reasoning only held for the SDK-internal path —
because of the `on_wildcard("*", self.emit)` re-emit in
`ConnectionManager`, the event also leaks to any user subscriber, who
shouldn't have to dedupe.
Bring the flag back, but split the check+set into a small
`_claim_connection_lost()` helper so the responsibilities stay
separated. Add a regression test exercising the actual
`_on_error` → `_on_close` chain.
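A minimal sketch of that split, assuming the flag keeps the `_connection_lost_sent` name and that the existing emit path stays as it is:

```python
class WebSocketClient:  # excerpt: check-and-set separated from the emit
    def _claim_connection_lost(self) -> bool:
        """Return True exactly once per connection; later calls return False."""
        if self._connection_lost_sent:
            return False
        self._connection_lost_sent = True
        return True

    def _notify_connection_lost(self, reason: str) -> None:
        # A chained _on_error -> _on_close claims the flag only once, so the
        # wildcard re-emit in ConnectionManager never forwards duplicates.
        if not self._claim_connection_lost():
            return
        # ...existing thread-safe emit of the connection_lost event...
```

`connect()` can reset the flag so a fresh connection is allowed to emit again, as the review comment suggests.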
Why
`WebSocketClient` in `getstream/video/rtc/signaling.py` has no reconnect logic of its own. On unexpected close, error after handshake, or health-check timeout, it logs the failure and stops. The session then sits hanging until the frontend's own timeout fires a `DELETE /sessions/...` and the agent goes away.

Under concurrent load this is now the dominant cause of "the agent disconnected randomly" reports on a production deploy: a single brief TCP reset or missed heartbeat between agent and SFU is fatal because nothing reconnects.

We can route this into the existing `ReconnectionManager`: it already handles strategy escalation (FAST → REJOIN → MIGRATE), locking, and the disconnection-timeout deadline. The missing piece is just the notification.

Changes

- `signaling.py`: emit a `connection_lost` event (with a reason string) from the WS worker thread when the connection drops outside of a user-initiated `close()`. Three sites:
  - `_on_close`: unexpected close.
  - `_on_error`: error after `first_message_event` is set (i.e. past the initial handshake; pre-handshake errors continue to surface via `first_message`).
  - `_ping_loop`: health-check timeout. Notify before `self.close()`, otherwise the `not self.closed` guard suppresses the event.
  A chained `_on_error` → `_on_close` only fires once, via `_connection_lost_sent`.
- `connection_manager.py`: register `on_event("connection_lost", self._on_signaling_connection_lost)` after the other handlers. The handler invokes `self._reconnector.reconnect(ReconnectionStrategy.FAST, …)`; a sketch follows below.

No new public API; consumers that don't care about the event are unaffected.
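A sketch of the manager-side wiring. The registration method name, the dict-shaped event payload, the `reason` keyword, the import path, and the awaitable `reconnect` are assumptions; the handler name, event name, and `ReconnectionStrategy.FAST` come from the list above.

```python
from getstream.video.rtc.reconnection import ReconnectionStrategy  # import path assumed


class ConnectionManager:  # excerpt
    def _register_ws_handlers(self, ws_client) -> None:  # method name assumed
        # ...existing handlers registered first...
        ws_client.on_event("connection_lost", self._on_signaling_connection_lost)

    async def _on_signaling_connection_lost(self, event) -> None:
        reason = event.get("reason", "signaling websocket lost")
        # Route into the existing ReconnectionManager; it already handles
        # strategy escalation (FAST -> REJOIN -> MIGRATE), locking, and the
        # disconnection-timeout deadline.
        await self._reconnector.reconnect(ReconnectionStrategy.FAST, reason=reason)
```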
Test plan
`tests/test_signaling.py`: all 7 cases pass locally (error path during connect, event callbacks, threading, traces) and don't trip on the new emit.

Summary by CodeRabbit
New Features
Tests