Skip to content

Resolve room handle synchronously in on_disconnect#1066

Closed
MaxHeimbrock wants to merge 2 commits into
mainfrom
max/fix-on-disconnect-handle-race
Closed

Resolve room handle synchronously in on_disconnect#1066
MaxHeimbrock wants to merge 2 commits into
mainfrom
max/fix-on-disconnect-handle-race

Conversation

@MaxHeimbrock
Copy link
Copy Markdown
Contributor

@MaxHeimbrock MaxHeimbrock commented May 8, 2026

Summary

on_disconnect spawned an async task that retrieved the FfiRoom via the DashMap inside the spawn body and unwrap()'d the result. The synchronous DisconnectResponse returned to the foreign-language client before the spawn was scheduled, so a client that disposed its room handle as soon as the request returned would race the tokio scheduler. When the spawn finally polled, retrieve_handle saw the entry already removed by drop_handle and the unwrap aborted the process via SIGABRT inside the runtime worker. On iOS the race lost roughly half the time.

Move the retrieve_handle + clone before the spawn. The spawn now owns its Arc<FfiRoom> clone, so the close runs even if the foreign side has already removed the DashMap entry. Also replace the unwrap with ? so a genuinely missing handle surfaces as a proper FfiResult error rather than an abort.

Surfaced in the Unity SDK after a separate change started disposing the room handle synchronously on disconnect (rather than waiting for the Disconnected event + GC of the SafeHandle). Either change is reasonable on the foreign side; the FFI server should be robust to it.

Test plan

  • Reproduced the SIGABRT on iOS by disposing the room handle immediately after the request returns; ~50% crash rate before fix, 0 crashes after fix across multiple connect/disconnect cycles.
  • macOS still works.
  • CI tests pass.

Unity

Found in this Unity PR:
livekit/client-sdk-unity#278

The handler spawned an async task that retrieved the FfiRoom via the
DashMap inside the spawn body and unwrap()'d the result. The synchronous
DisconnectResponse returned to the foreign-language client before the
spawn was scheduled, so a client that disposed its room handle as soon
as the request returned would race the tokio scheduler. When the spawn
finally polled, retrieve_handle saw the entry already removed by
drop_handle and the unwrap aborted the process via SIGABRT inside the
runtime worker. On iOS the race lost roughly half the time.

Move the retrieve_handle + clone before the spawn. The spawn now owns
its Arc<FfiRoom> clone, so the close runs even if the foreign side has
already removed the DashMap entry. Also replace the unwrap with `?` so
a genuinely missing handle surfaces as a proper FfiResult error rather
than an abort.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

No changeset found

This PR modifies the following packages but doesn't include a changeset:

Directly changed:

  • livekit-ffi

Click here to create a changeset

The link pre-populates a changeset file with patch bumps for all affected packages.
Edit the description and bump types as needed before committing.

If this change doesn't require a version bump, add the internal label to this PR.

@MaxHeimbrock
Copy link
Copy Markdown
Contributor Author

MaxHeimbrock commented May 8, 2026

Would the fix in max/fix-on-disconnect-handle-race not be needed if we dispose the room only after the disconnect event arrives back in Unity?
⎿ ⧉ Selected 1 lines from ../../client-sdk-rust~/livekit-ffi/src/server/requests.rs in Visual Studio Code

⏺ Yes, you're right — that's an alternative fix that would also avoid the race. If C# waited for the Disconnected room event (or DisconnectCallback) before    
  disposing the handle, the Rust spawn would have already retrieved its Arc<FfiRoom> clone before the C# drop happened, so the panic wouldn't trigger.         
                                                                                                                                                               
  But I'd argue the Rust fix is still worth landing, for a few reasons:                                                                                        
                                                                                                                                                               
  1. Defensive vs. polite                                                                                                                                      
                                                                                                                                                               
  The Rust function panics on user-controllable input (room_handle). Other handlers in the same file use ? and return proper errors. The current behavior is   
  "if anyone passes me a stale handle, I abort the whole process" — which is too sharp. Fixing it makes the FFI honest about what it does.                     
                                                                                                                                                               
  2. Other foreign clients                                                                                                                                     
                                                                                                                                                               
  Python, Node, the official Swift SDK, etc. each have their own threading models and lifecycle assumptions. We can't expect every binding to know "wait for   
  the Disconnected event before drop_handle, or you abort the host process." The fix lets the FFI server be safe regardless of how the binding chooses to      
  handle the disconnect signal.                                                                                                                                
                                                                                                                                                               
  3. Other races we haven't enumerated                                                                                                                         
  
  - Network drop → Rust may emit a Disconnected event independently of any C# Disconnect call. If C# decides to drop the handle in response, there could still 
  be in-flight requests against that room. Same race shape, different trigger.
  - Concurrent calls (Disconnect + something else) — some other request retrieves the handle, races with drop.                                                 
                                                                                                                                                               
  4. Cost
                                                                                                                                                               
  The Rust fix is 6 lines that strictly tightens behavior. Same compile, no perf change, no semantic change for well-behaved clients.                          
   
  Trade-off summary                                                                                                                                            
                  
  ┌───────────────────────────────────────┬───────────────────┬───────────────────────────────────────────────────────────────────────┐                        
  │               Approach                │       Scope       │                                 Risk                                  │
  ├───────────────────────────────────────┼───────────────────┼───────────────────────────────────────────────────────────────────────┤
  │ C# waits for Disconnected before drop │ Unity-only        │ Need a timeout fallback if event is lost; more complex Disconnect API │
  ├───────────────────────────────────────┼───────────────────┼───────────────────────────────────────────────────────────────────────┤
  │ Rust resolves handle synchronously    │ All FFI clients   │ None I can see                                                        │                        
  ├───────────────────────────────────────┼───────────────────┼───────────────────────────────────────────────────────────────────────┤                        
  │ Both                                  │ Belt + suspenders │ Slightly more code                                                    │                        
  └───────────────────────────────────────┴───────────────────┴───────────────────────────────────────────────────────────────────────┘                        
                  
  I'd recommend keeping the Rust fix regardless of whether you also change the C# side. They're not mutually exclusive — the Rust fix is just hardening that   
  benefits everyone using the FFI.
                                                                                                                                                               
  But if you'd rather have the C# fix instead and close the Rust PR, that's a defensible call too. Up to you. Want to discuss either path further?  

@MaxHeimbrock
Copy link
Copy Markdown
Contributor Author

What about the ffi_room.close call on an already disposed room? Isn't the ffi_room now cached and in the spawn it could be invalid already? What kind of error behaviour will happen if we lose the race now?

⏺ Good question. Let me trace ffi_room.close() to be sure it doesn't re-look-up itself or otherwise depend on the DashMap entry.                 
                                                                                                                                                 
  Searched for 1 pattern, read 1 file (ctrl+o to expand)     
                                                                                                                                                 
⏺ Good thing to verify. Let me walk through close() line by line:
                                                                                                                                                 
  pub async fn close(&self, server: &'static FfiServer, reason: DisconnectReason) {                                               
      // drop associated track handles                                                                                                           
      for (_, &handle) in self.inner.track_handle_lookup.lock().iter() {                                                                         
          if server.drop_handle(handle) {                                                                                                        
              // Store an empty handle for the FFI client that assumes a handle exists for this id.                                              
              server.store_handle(handle, ());                                                                                                   
          }                                                                                                                                      
      }                     

      let _ = self.inner.room.close_with_reason(reason.into()).await;                                                                            
  
      let handle = self.handle.lock().await.take();                                                                                              
      if let Some(handle) = handle {
          let _ = handle.close_tx.send(());
          let _ = handle.event_handle.await;                                                                                                     
          ...
      }                                                                                                                                          
  }                         

  Does anything inside close() need the FfiRoom's DashMap entry to still exist?                                                                  
  
  ┌──────────────────────────────────────────────┬────────────────────────────────────────────────┬──────────────────────────────────────────┐   
  │                     Step                     │                  Operates on                   │  Depends on DashMap entry for the room?  │
  ├──────────────────────────────────────────────┼────────────────────────────────────────────────┼──────────────────────────────────────────┤
  │ self.inner.track_handle_lookup.lock().iter() │ self.inner (Arc we own)                        │ No                                       │
  ├──────────────────────────────────────────────┼────────────────────────────────────────────────┼──────────────────────────────────────────┤
  │ server.drop_handle(handle)                   │ DashMap, but for track handles                 │ No — and returns false benignly if       │   
  │                                              │                                                │ already gone                             │   
  ├──────────────────────────────────────────────┼────────────────────────────────────────────────┼──────────────────────────────────────────┤   
  │ server.store_handle(handle, ())              │ Only runs if drop succeeded — sentinel for     │ No                                       │   
  │                                              │ late drops                                     │                                          │
  ├──────────────────────────────────────────────┼────────────────────────────────────────────────┼──────────────────────────────────────────┤   
  │ self.inner.room.close_with_reason(...)       │ The owned webrtc Room inside RoomInner         │ No                                       │
  ├──────────────────────────────────────────────┼────────────────────────────────────────────────┼──────────────────────────────────────────┤   
  │ self.handle.lock().await.take()              │ self.handle (Arc we own)                       │ No                                       │
  ├──────────────────────────────────────────────┼────────────────────────────────────────────────┼──────────────────────────────────────────┤   
  │ Tokio task close signals + awaits            │ Owned JoinHandles                              │ No                                       │
  └──────────────────────────────────────────────┴────────────────────────────────────────────────┴──────────────────────────────────────────┘   
                            
  Everything operates on data reachable through the Arc we cloned before the spawn. None of it does a DashMap lookup for the room itself.        
                            
  Interaction with the C# cascade                                                                                                                
                            
  Specifically interesting because my C# Cleanup() also disposes the track handles:                                                              
                            
  1. C# Cleanup disposes each Track.Handle → Rust DashMap entries for those tracks gone.                                                         
  2. Rust spawn eventually runs close().
  3. The for ... server.drop_handle(handle) loop sees false for every entry (already dropped) → no panic, no store_handle runs. Just no-op       
  iterations.                                                                                                                                    
                                                                                                                                                 
  Note that drop_handle is specifically written to be safe on missing IDs:                                                                       
  pub fn drop_handle(&self, id: FfiHandleId) -> bool {
      let existed = self.ffi_handles.remove(&id).is_some();                                                                                      
      ...                   
  }                                                                                                                                              
  Returns a bool, never panics.
                                                                                                                                                 
  What happens to events fired during close
                                                                                                                                                 
  Events flow through server.send_event → C# FFICallback → OnEventReceived in the Room. After Cleanup(), my null-guard short-circuits any late   
  RoomEvent:                                                                                                                                     
  if (RoomHandle == null) return;                                                                                                                
                            
  So late events are silently dropped on the C# side too.                                                                                        
  
  Net answer                                                                                                                                     
                            
  If we lose the race now: nothing user-visible happens. close() runs to completion on the owned Arc<RoomInner>, emits some events that get      
  filtered out by C#, and returns. The Arc count then hits zero, RoomInner drops, all tokio tasks have been awaited, clean teardown.
                                                                                                                                                 
  The only thing that's technically different is that those drop_handle calls inside close() become no-ops (return false) instead of removing    
  entries — because C# beat us to it. That's not a bug; it's idempotent cleanup.
                                                                                                                                                 
  So the Rust fix is safe both for "well-behaved" clients (wait for Disconnected) and "eager" clients (drop right after request returns). That's 
  the property we wanted.

@MaxHeimbrock
Copy link
Copy Markdown
Contributor Author

Already fixed by DZ in #1069

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant