Skip to content

feat(telemetry): instrument connection lifecycle#942

Open
EhabY wants to merge 7 commits intomainfrom
feat/issue-905-connection-telemetry
Open

feat(telemetry): instrument connection lifecycle#942
EhabY wants to merge 7 commits intomainfrom
feat/issue-905-connection-telemetry

Conversation

@EhabY
Copy link
Copy Markdown
Collaborator

@EhabY EhabY commented May 7, 2026

Summary

  • instrument SSH process discovery, loss/recovery, and sampled network info telemetry
  • instrument reconnecting WebSocket open/drop/reconnect/state-transition lifecycle telemetry
  • wire telemetry into primary and remote-scoped Coder API clients while leaving throwaway clients opt-in
  • add TestSink assertions for the requested lifecycle events and sampling behavior

Closes #905.

Validation

  • pnpm test:extension ./test/unit/remote/sshProcess.test.ts
  • pnpm test:extension ./test/unit/websocket/reconnectingWebSocket.test.ts
  • pnpm typecheck
  • pnpm lint
  • pnpm format:check
Implementation plan and decisions

Confirmed decisions:

  1. Telemetry injection is optional for throwaway CoderApi instances; long-lived primary/remote clients receive telemetry.
  2. connection.open uses the sanitized route (pathname + search) rather than the full URL.
  3. Reconnect aggregation emits telemetry for terminal DISCONNECTED outcomes as well as successful recovery.
  4. SSH loss causes use stale_network_info, missing_network_info, process_changed, disposed.
  5. Network derp uses the preferred_derp string, empty string if unavailable.

Implemented:

  • SshProcessMonitor: ssh.process.discovered, ssh.process.lost, ssh.process.recovered, and sampled ssh.network.info.
  • ReconnectingWebSocket: connection.state_transition, connection.open, connection.drop, and aggregated connection.reconnect.
  • Unit coverage for requested event behavior and sampling cadence.

Generated by Coder Agents.

@EhabY EhabY self-assigned this May 7, 2026
@EhabY EhabY force-pushed the feat/issue-905-connection-telemetry branch from 11b393a to 2903de2 Compare May 7, 2026 15:05
@EhabY EhabY force-pushed the feat/issue-905-connection-telemetry branch from 2903de2 to 5e25405 Compare May 7, 2026 15:07
@EhabY EhabY force-pushed the feat/issue-905-connection-telemetry branch from 4115b83 to cbffa11 Compare May 8, 2026 11:52
@EhabY
Copy link
Copy Markdown
Collaborator Author

EhabY commented May 8, 2026

/coder-agents-review

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean instrumentation layer. SshTelemetry and WebSocketTelemetry absorb real complexity (sampling thresholds, reconnect cycle aggregation, dedup guards) that would otherwise pollute their consumers. The TelemetryReporter interface extraction with NOOP default is well-designed ISP. Test coverage is approximately 1:1 with source, and assertions verify event shapes rather than just counts. The network sampling logic is well-calibrated (p2p flips, DERP changes, latency swings, heartbeat).

Severity summary: 1 P2, 10 P3, 6 Nit. The P2 is a one-line fix. Most P3s are telemetry data model concerns that are cheapest to address before the events ship to dashboards.

"Someone querying ssh.process.discovered events to measure discovery success rate or latency would count these ghost discoveries as successes." (Chopper)

🤖 This review was automatically generated with Coder Agents.

Comment thread src/instrumentation/ssh.ts
Comment thread src/instrumentation/ssh.ts Outdated
Comment thread src/websocket/reconnectingWebSocket.ts Outdated
Comment thread src/websocket/reconnectingWebSocket.ts Outdated
Comment thread src/remote/sshProcess.ts Outdated
Comment thread src/instrumentation/ssh.ts Outdated
Comment thread src/websocket/reconnectingWebSocket.ts Outdated
Comment thread src/instrumentation/ssh.ts Outdated
Comment thread test/unit/websocket/reconnectingWebSocket.test.ts
Comment thread src/websocket/reconnectingWebSocket.ts
@EhabY
Copy link
Copy Markdown
Collaborator Author

EhabY commented May 8, 2026

/coder-agents-review

- ssh.process.discovered: emit found=true|false so abandoned searches are distinguishable
- Add ssh.process.disposed as a dedicated terminal event with wasLost flag
- ssh.process.replaced: always emit, never conflate with recovered
- Remove dead process_changed and disposed members from ProcessLossCause
- Rename connection events to past-tense and url -> route property
- Route normal_close through disconnectWithReason for state-machine consistency
- Distinguish certificate_refresh from manual_reconnect on cycle reason
- Clear #reconnectCycle in WebSocketTelemetry.reset()
- Restore Only attempt refresh once per connection cycle comment
- Unblock SshProcessMonitor.delay() on dispose so abandoned traces complete
- Add unit and integration coverage for the new event shapes
Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 17 R1 findings verified fixed in 5ed31ab. Clean work. The refactored disposed() method, #reconnectInternal cert-refresh path, past-tense event naming, and expanded integration test coverage all address root causes.

Three new P3 findings surfaced during re-review: one concurrent-access regression in the new pendingDelayResolve mechanism, and two test accuracy gaps. The integration test for stale-network recovery exercises the wrong event chain (4/6 reviewers converged on this independently).

Severity summary: 0 P2, 3 P3.

"These events are from two completely different monitoring cycles. The recovery is not 'around' the stale-network reconnect." (Bisky)

🤖 This review was automatically generated with Coder Agents.

Comment thread test/unit/remote/sshProcess.test.ts
Comment thread src/remote/sshProcess.ts Outdated
Comment thread test/unit/remote/sshProcess.test.ts
- Replace single pendingDelayResolve with a Set so dispose unblocks
  every concurrent searchForLogFile/monitorNetwork delay, not just the
  most recent one.
- Use a single PID throughout the stale-network reconnect test so the
  asserted recovered event matches the stale-network cycle the test
  name describes.
- Dispose the monitor before the negative recovered assertion in the
  process-replaced test to keep the 888 monitoring loop from racing.
- Add dedicated processRecovered unit tests (early return, emission
  shape, single-emit guard).
@EhabY
Copy link
Copy Markdown
Collaborator Author

EhabY commented May 8, 2026

/coder-agents-review

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 20 prior findings (R1 + R3) verified fixed. The pendingDelays Set, processRecovered() unit tests, and deterministic dispose-before-assert all address root causes correctly. Code quality is high; 114 tests pass, every assertion has teeth.

One new P3 on event naming semantics, two Nits from the mechanical pass. The P3 is cheapest to address before the events ship to dashboards.

Severity summary: 0 P2, 1 P3, 2 Nit.

"The verb is the event. 'connection.reconnected' breaks this pattern: it emits with result: 'error' when the reconnect cycle terminates without reconnecting." (Mafuuu)

🤖 This review was automatically generated with Coder Agents.

Comment thread src/instrumentation/websocket.ts Outdated
Comment thread test/unit/remote/sshProcess.test.ts Outdated
Comment thread src/instrumentation/ssh.ts Outdated
- Rename connection.reconnected to connection.reconnect_resolved so the
  event name reflects "cycle ended, outcome in result" instead of
  implying success. Failed cycles emit with result=error and are no
  longer miscounted by name-only consumers.
- Drop unused export on ProcessDiscoveryResult.
- Replace em-dash in sshProcess.test.ts comment.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Telemetry: instrument connection lifecycle (SSH process, WebSocket)

1 participant