
feat(gastown): add debug replay-events endpoint for reconciler phase 5 #1373

Open
jrf0110 wants to merge 6 commits into main from convoy/reconciler-phase-5-debug-endpoints-grafa/4763028e/head


jrf0110 commented Mar 21, 2026

Summary

Adds a POST /debug/towns/:townId/replay-events endpoint that replays town events from a given time range for debugging purposes. The endpoint:

  • Accepts from/to ISO timestamps, queries all town_events in that range (regardless of processed_at)
  • Applies each event via reconciler.applyEvent() to reconstruct state transitions
  • Runs reconciler.reconcile() against the resulting state to compute what actions would be emitted
  • Captures agent and non-terminal bead snapshots
  • Rolls back all mutations via SQLite SAVEPOINT so the endpoint is fully side-effect-free
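
The replay flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from the diff: the `Sql`, `Reconciler`, and `TownEvent` shapes and the savepoint name `debug_replay` are assumptions made for the example, and the snapshot capture step is omitted.

```typescript
// Hypothetical minimal interfaces; the real TownDO, reconciler, and SQL
// helpers in this PR are not shown here and may differ.
type TownEvent = { id: string; createdAt: string; type: string };
type Action = { type: string };

interface Sql {
  exec(statement: string): void;
}

interface Reconciler {
  applyEvent(event: TownEvent): void;
  reconcile(): Action[];
}

function replayInSavepoint(
  sql: Sql,
  reconciler: Reconciler,
  events: TownEvent[],
): { eventsReplayed: number; actions: Action[] } {
  // Everything after this statement can be undone as a unit.
  sql.exec("SAVEPOINT debug_replay");
  try {
    for (const event of events) {
      reconciler.applyEvent(event);
    }
    // Compute the actions that would be emitted, without applying them.
    const actions = reconciler.reconcile();
    return { eventsReplayed: events.length, actions };
  } finally {
    // Undo every mutation made since the savepoint, then drop the savepoint.
    // Runs on both the success path and the error path.
    sql.exec("ROLLBACK TO debug_replay");
    sql.exec("RELEASE debug_replay");
  }
}
```

Putting the rollback in `finally` means a throwing event handler still leaves the database untouched, which is what keeps the endpoint side-effect-free.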

Also includes the preceding commits on this convoy branch: dry-run reconciler endpoint, debug dry-run with event draining, and a fix for skipping container_status events.

Verification

  • Code review: all patterns match existing debugDryRun endpoint conventions (SAVEPOINT/ROLLBACK, parameterized queries, Zod validation, eslint-disable comments)
  • Imports verified: town_events, TownEventRecord, reconciler, query, Action all correctly imported
  • SQL injection safe: user inputs passed as parameterized ? placeholders
  • Input validation: missing fields (400), invalid dates (400), reversed range (400)
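
The three 400 cases listed above could look something like the sketch below. The PR itself uses Zod; this version uses plain TypeScript to stay dependency-free, and the function name, error messages, and the choice to accept `from === to` are assumptions rather than details taken from the diff.

```typescript
type ReplayRange = { from: Date; to: Date };

type ValidationResult =
  | { ok: true; range: ReplayRange }
  | { ok: false; status: 400; error: string };

// Hypothetical helper: validates a parsed JSON body of the form { from, to }.
function validateReplayBody(body: unknown): ValidationResult {
  if (typeof body !== "object" || body === null) {
    return { ok: false, status: 400, error: "body must be a JSON object" };
  }
  const { from, to } = body as { from?: unknown; to?: unknown };

  // Missing (or non-string) fields.
  if (typeof from !== "string" || typeof to !== "string") {
    return { ok: false, status: 400, error: "from and to are required" };
  }

  // Invalid dates: new Date() yields an Invalid Date whose time is NaN.
  const fromDate = new Date(from);
  const toDate = new Date(to);
  if (Number.isNaN(fromDate.getTime()) || Number.isNaN(toDate.getTime())) {
    return { ok: false, status: 400, error: "from and to must be valid dates" };
  }

  // Reversed range. An empty range (from === to) is accepted here by assumption.
  if (fromDate.getTime() > toDate.getTime()) {
    return { ok: false, status: 400, error: "from must not be after to" };
  }

  return { ok: true, range: { from: fromDate, to: toDate } };
}
```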

Visual Changes

N/A

Reviewer Notes

  • This is an unauthenticated /debug/ route, consistent with existing debug endpoints marked for removal after debugging
  • Unlike debugDryRun, this endpoint does NOT call events.markProcessed() — this is intentional since it replays historical (already-processed) events rather than draining pending ones
  • The SAVEPOINT pattern (SAVEPOINT → try/finally → ROLLBACK TO → RELEASE) is identical to the existing debugDryRun method

jrf0110 and others added 4 commits March 21, 2026 11:15
Filter out 'running' status in the alarm pre-phase before calling
upsertContainerStatus(). Running is the steady-state for healthy agents
and a no-op in applyEvent(), so recording it just bloats the event table
(~720 events/hour/agent). Non-running statuses (stopped, error, unknown)
still get inserted for reconciler detection.
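
A rough sketch of the filter described in this commit message, under assumed names: `StatusObservation`, `recordNonRunning`, and the `upsert` callback are hypothetical stand-ins for the alarm pre-phase and upsertContainerStatus(). For scale, ~720 events/hour/agent works out to about one status event every 5 seconds per agent.

```typescript
type ContainerStatus = "running" | "stopped" | "error" | "unknown";
type StatusObservation = { agentId: string; status: ContainerStatus };

// "running" is the steady state and a no-op for the reconciler, so it is
// dropped before anything is written to the event table.
function shouldRecordStatus(observation: StatusObservation): boolean {
  return observation.status !== "running";
}

// Hypothetical pre-phase step: forwards only non-running observations to the
// supplied upsert callback and returns how many were recorded.
function recordNonRunning(
  observations: StatusObservation[],
  upsert: (observation: StatusObservation) => void,
): number {
  let recorded = 0;
  for (const observation of observations) {
    if (shouldRecordStatus(observation)) {
      upsert(observation);
      recorded += 1;
    }
  }
  return recorded;
}
```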
Add a debug endpoint that runs the reconciler against current live state
and returns the actions it would emit without applying them. This enables
inspecting what the reconciler thinks should happen at any given moment.

- Add debugDryRun() method to TownDO that calls reconciler.reconcile()
  and returns actions + metrics without calling applyAction()
- Add POST /debug/towns/:townId/reconcile-dry-run route following the
  same unauthenticated debug pattern as GET /debug/towns/:townId/status
- Response includes actions array, actionsEmitted count, actionsByType
  breakdown, and pendingEventCount
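
As an illustration of the response shape listed above, the per-type breakdown could be derived from the actions array along these lines. The `Action` type and `summarizeActions` helper are assumptions for the example; pendingEventCount is left out because it comes from the event table rather than from the actions.

```typescript
// Hypothetical minimal action shape; real actions carry more fields.
type Action = { type: string };

type DryRunSummary = {
  actions: Action[];
  actionsEmitted: number;
  actionsByType: Record<string, number>;
};

function summarizeActions(actions: Action[]): DryRunSummary {
  const actionsByType: Record<string, number> = {};
  for (const action of actions) {
    // Count occurrences of each action type.
    actionsByType[action.type] = (actionsByType[action.type] ?? 0) + 1;
  }
  return { actions, actionsEmitted: actions.length, actionsByType };
}
```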
* feat(claw): evaluate button-vs-card feature flag for PostHog experiment tracking

* fix(claw): move button-vs-card flag eval to CreateInstanceCard

Moves useFeatureFlagVariantKey('button-vs-card') from ClawDashboard
(which renders for all users including those with existing instances)
to CreateInstanceCard (which only renders for users who haven't
provisioned yet). This scopes the experiment exposure to users who
can actually see the create CTA, avoiding population dilution.

* feat(gastown): add POST /debug/reconcile-dry-run endpoint

Add a debug endpoint that runs the reconciler against current live state
and returns the actions it would emit without applying them. This enables
inspecting what the reconciler thinks should happen at any given moment.

- Add debugDryRun() method to TownDO that calls reconciler.reconcile()
  and returns actions + metrics without calling applyAction()
- Add POST /debug/towns/:townId/reconcile-dry-run route following the
  same unauthenticated debug pattern as GET /debug/towns/:townId/status
- Response includes actions array, actionsEmitted count, actionsByType
  breakdown, and pendingEventCount

* fix(gastown): drain pending events in debugDryRun() before reconciling

Wrap debugDryRun() in a SQLite savepoint so it can drain and apply
pending town_events (Phase 0) before running reconcile (Phase 1),
matching the real alarm loop behavior. The savepoint is rolled back
in a finally block so the endpoint remains fully side-effect-free.

Adds eventsDrained to the returned metrics.

---------

Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>
Co-authored-by: Pedro Heyerdahl <pedro@kilocode.ai>
Co-authored-by: Pedro Heyerdahl <61753986+pedroheyerdahl@users.noreply.github.com>
…y debugging

Adds debugReplayEvents(from, to) method to Town.do.ts that queries all
town_events in a time range (regardless of processed_at), applies them
to reconstruct state transitions, runs the reconciler, and returns the
computed actions and a state snapshot. Uses a SQLite SAVEPOINT that is
rolled back so the endpoint remains fully side-effect-free.

Route: POST /debug/towns/:townId/replay-events
Body: { from: ISO, to: ISO }
Response: { eventsReplayed, actions, stateSnapshot }

kilo-code-bot bot commented Mar 21, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

Severity    Count
CRITICAL    0
WARNING     1
SUGGESTION  0


Issue Details

WARNING

File:  cloudflare-gastown/src/dos/Town.do.ts
Line:  3795
Issue: debugReplayEvents() re-applies historical events on top of current state, so non-idempotent handlers can return misleading actions and snapshots.

Other Observations (not in diff)

N/A

Files Reviewed (3 files)
  • cloudflare-gastown/gastown-grafana-dash-1.json - 0 issues
  • cloudflare-gastown/src/dos/Town.do.ts - 1 issue
  • cloudflare-gastown/src/gastown.worker.ts - 0 issues

Reviewed by gpt-5.4-20260305 · 173,075 tokens

…afana dashboard panels (#1372)

- Extend writeEvent() to support double3-double10 fields for reconciler metrics
- Emit reconciler_tick event after each alarm tick with all 9 metrics
- Add Reconciler row to Grafana dashboard with 6 panels:
  1. Events drained per tick (timeseries)
  2. Actions emitted per tick by type (stacked bar)
  3. Side effects attempted/succeeded/failed (timeseries)
  4. Invariant violations (stat with >0 alert threshold)
  5. Reconciler wall clock time (timeseries with >500ms threshold)
  6. Pending event queue depth (gauge with >50 threshold)
…query

Add a caveat comment and response field to debugReplayEvents explaining
that events are re-applied on top of live state, not from a pre-window
snapshot — results are approximate, useful for debugging event flow but
not faithful historical reconstruction.
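
To make the caveat concrete, here is a toy model of why replaying onto live state is only approximate. The `State` shape and the two event types are invented for illustration and are not the real town_events schema: one handler is idempotent (setting a status), the other is not (incrementing a counter).

```typescript
// Hypothetical state and events, for illustration only.
type State = { status: string; restartCount: number };
type Ev = { type: "status_set"; status: string } | { type: "restarted" };

function applyEvent(state: State, ev: Ev): State {
  switch (ev.type) {
    case "status_set":
      // Idempotent: applying it twice gives the same result as once.
      return { ...state, status: ev.status };
    case "restarted":
      // Non-idempotent: every application increments the counter again.
      return { ...state, restartCount: state.restartCount + 1 };
  }
}

function replay(start: State, events: Ev[]): State {
  return events.reduce(applyEvent, start);
}

const windowEvents: Ev[] = [{ type: "restarted" }, { type: "status_set", status: "idle" }];

// State as it was before the window, versus live state that already
// reflects the window's events.
const preWindow: State = { status: "working", restartCount: 0 };
const live: State = replay(preWindow, windowEvents);

const faithful = replay(preWindow, windowEvents); // restartCount is 1
const approximate = replay(live, windowEvents); // restartCount is 2 (double-counted)
```

The idempotent field (`status`) comes out the same either way, while the counter is inflated. That is the sense in which the endpoint is useful for inspecting event flow but not for exact historical reconstruction.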

Fix the Grafana 'Pending Event Queue Depth' gauge to show the latest
row's double8 value instead of averaging across the time window.
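
The difference between the two gauge behaviours can be shown with a small TypeScript model. This is not the Grafana query itself; the `TickRow` shape is an assumption, with `double8` standing in for the pending queue depth written by each reconciler_tick event.

```typescript
// Hypothetical row shape: one reconciler_tick event per row.
type TickRow = { timestamp: number; double8: number };

// Old behaviour: average depth across every row in the time window.
function averageDepth(rows: TickRow[]): number | null {
  if (rows.length === 0) return null;
  const total = rows.reduce((sum, row) => sum + row.double8, 0);
  return total / rows.length;
}

// New behaviour: depth reported by the most recent row only.
function latestDepth(rows: TickRow[]): number | null {
  let latest: TickRow | undefined;
  for (const row of rows) {
    if (latest === undefined || row.timestamp > latest.timestamp) {
      latest = row;
    }
  }
  return latest === undefined ? null : latest.double8;
}
```

If a backlog of 40 events has drained to 0 by the end of the window, the averaged gauge still shows a non-zero depth, while the latest-row gauge shows the queue as empty.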