feat: workflow replay #4411

Open
NathanFlurry wants to merge 5 commits into main from workflow-step-resume

@NathanFlurry
Member

Description

Add workflow rerun controls to RivetKit workflows through the inspector by introducing a v4 workflow rerun message, HTTP endpoint, and workflow-engine reset helper. Update the standalone Inspector UI with a current-step rerun button, previous-step right-click rerun, and helper text, and make the HTTP inspector route usable with actor inspector tokens so the standalone Inspector can trigger reruns without engine credentials. Also preserve workflow metadata in storage and document the new inspector API.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • pnpm --dir rivetkit-typescript/packages/workflow-engine exec vitest run tests/rerun.test.ts
  • pnpm --dir rivetkit-typescript/packages/rivetkit test driver-memory -t "POST /inspector/workflow/rerun reruns a workflow from the beginning|inspector endpoints require auth in non-dev mode|failed workflow steps sleep instead of surfacing as run errors"
  • Verified in the standalone Inspector frontend against a local serve-test-suite server, including the current-step rerun button and right-click rerun from a previous step.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app

railway-app bot commented Mar 12, 2026

🚅 Deployed to the rivet-pr-4411 environment in rivet-frontend

| Service | Status | Web | Updated (UTC) |
| --- | --- | --- | --- |
| frontend-cloud | 😴 Sleeping (View Logs) | Web | Mar 23, 2026 at 5:37 am |
| frontend-inspector | 😴 Sleeping (View Logs) | Web | Mar 22, 2026 at 8:50 pm |
| website | 😴 Sleeping (View Logs) | Web | Mar 21, 2026 at 12:33 am |
| mcp-hub | ✅ Success (View Logs) | Web | Mar 21, 2026 at 12:22 am |
| ladle | ❌ Build Failed (View Logs) | Web | Mar 21, 2026 at 12:21 am |

@NathanFlurry NathanFlurry requested review from jog1t and removed request for jog1t March 12, 2026 20:01
@NathanFlurry
Member Author

Follow-up Inspector UI verification after the replay rename:

  • Hidden running-step case now disables "Replay from this step" and shows the tooltip "Step currently in progress".
  • Failed-step case still leaves "Replay from this step" enabled, so operators can bypass the pending retry immediately.
  • Verified in the standalone Inspector frontend against the local serve-test-suite, not the engine UI.

Screenshots captured in the workspace:

  • /Users/nathan/conductor/workspaces/rivet/kathmandu/.context/disabled-replay-tooltip.png
  • /Users/nathan/conductor/workspaces/rivet/kathmandu/.context/enabled-replay-button.png

@NathanFlurry
Member Author

NathanFlurry commented Mar 13, 2026

Standalone Inspector screenshots for the final replay flows:

  • Failed-step case still leaves "Replay from this step" enabled, so operators can bypass the pending retry immediately.
  • Replay is blocked when any workflow step is currently running, with the tooltip "Step currently in progress".
Screenshots: enabled-replay-button, disabled-replay-tooltip

@pkg-pr-new

pkg-pr-new bot commented Mar 13, 2026

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@4411

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@4411

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@4411

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@4411

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@4411

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@4411

@rivetkit/sqlite-vfs

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sqlite-vfs@4411

@rivetkit/traces

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/traces@4411

@rivetkit/workflow-engine

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/workflow-engine@4411

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@4411

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@4411

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@4411

commit: 481dd2d

@claude

claude bot commented Mar 13, 2026

PR 4411 Review: feat workflow replay

--- SUMMARY ---

This PR adds workflow replay/rerun functionality through the inspector, allowing developers to re-execute a workflow from a specific step or from the beginning without redeploying. The implementation touches the workflow engine, RivetKit inspector, the actor HTTP router, and the standalone Inspector frontend.

SUMMARY OF CHANGES

  • New replayWorkflowFromStep in workflow-engine/src/index.ts that deletes entries from a target step onward and resets workflow state to sleeping.
  • New ActorWorkflowControlDriver in rivetkit/src/workflow/driver.ts that provides KV access to the workflow engine for replay operations without a RunContext.
  • Inspector protocol bumped to v4 with a new WorkflowReplayRequest / WorkflowReplayResponse message pair.
  • New POST /inspector/workflow/replay HTTP endpoint in actor/router.ts.
  • Updated auth middleware in router.ts to also accept actor-specific inspector tokens.
  • Frontend additions: replay button in node detail panel, right-click-from-previous-step replay, polling loop to sync UI after replay.
  • Entry metadata is now eagerly loaded in loadStorage so status is available after actor wakes.
  • Tests covering full replay, in-flight rejection, and loop boundary rewinding.
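
For orientation, a client call to the new endpoint might be assembled roughly as below. Only the path and the optional entryId field come from this PR. The helper name buildReplayRequest, the Bearer authorization scheme, and the reading of a missing entryId as "replay from the beginning" are assumptions made for illustration.

```typescript
// Hypothetical helper: only the path and the optional `entryId` field are taken
// from the PR; the Bearer scheme and the request shape are assumptions.
interface ReplayRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildReplayRequest(
  baseUrl: string,
  inspectorToken: string,
  entryId?: string,
): ReplayRequest {
  // Assumption: omitting entryId asks for a replay from the beginning.
  const body = entryId === undefined ? {} : { entryId };
  return {
    // Strip trailing slashes so the joined URL has exactly one separator.
    url: baseUrl.replace(/\/+$/, "") + "/inspector/workflow/replay",
    init: {
      method: "POST",
      headers: {
        "content-type": "application/json",
        authorization: `Bearer ${inspectorToken}`,
      },
      body: JSON.stringify(body),
    },
  };
}
```

The returned url and init can be passed directly to fetch(url, init).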

--- ISSUES AND OBSERVATIONS ---

ISSUE 1. Race condition in restartRunHandler - potential double-run

rivetkit-typescript/packages/rivetkit/src/actor/instance/mod.ts (new method restartRunHandler):

The check actor.isRunHandlerActive() in workflow/mod.ts and the subsequent call to restartRunHandler are not atomic. Between those two calls, another concurrent replay or an internal wake could make the run handler active again, so restartRunHandler would wait for that run to finish before re-checking and returning early. The guard at the engine level (metadata.status check in replayWorkflowFromStep) helps, but the in-memory isRunHandlerActive() flag is checked before the KV mutation completes, leaving a narrow window.
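
One possible way to close the window is to fold the check and the claim into a single synchronous step, so nothing can interleave between them on the event loop. The sketch below is illustrative only: ReplayGuard and replayExclusively are hypothetical names, not RivetKit APIs, and it does not describe how restartRunHandler is actually implemented.

```typescript
// Hypothetical per-actor guard; not part of RivetKit.
class ReplayGuard {
  private busy = false;

  // Check and claim in one synchronous step: no await separates the two,
  // so no other task on the same event loop can slip in between.
  tryClaim(): boolean {
    if (this.busy) return false;
    this.busy = true;
    return true;
  }

  release(): void {
    this.busy = false;
  }
}

// Run a replay only if the guard could be claimed, and always release it.
async function replayExclusively<T>(
  guard: ReplayGuard,
  replay: () => Promise<T>,
): Promise<T> {
  if (!guard.tryClaim()) {
    throw new Error("replay already in progress");
  }
  try {
    return await replay();
  } finally {
    guard.release();
  }
}
```

A second caller arriving while the first replay is still awaiting KV would be rejected immediately rather than racing it.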

ISSUE 2. Error from WebSocket WorkflowReplayRequest is not caught (FIX BEFORE MERGE)

rivetkit-typescript/packages/rivetkit/src/inspector/handler.ts (line ~179-193):

Compare to DatabaseSchemaRequest (line ~194), which wraps the await in a try/catch and sends an Error response on failure. WorkflowReplayRequest does not have that try/catch. If replayWorkflowFromStep throws (e.g., step not found, workflow in flight), the exception will propagate to the outer message handler and could silently drop the message or crash the WebSocket handler without informing the client. The in-flight rejection test in the driver test suite only covers the HTTP path, not the WebSocket path.
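
A minimal sketch of the missing guard, assuming simplified message shapes: the tag names echo the protocol messages mentioned above, but the rid field, the send callback, and the function names are hypothetical and will not match the real handler.

```typescript
// Hypothetical message shapes; the real inspector protocol types will differ.
type ReplayOutcome =
  | { tag: "WorkflowReplayResponse"; rid: number; history: unknown }
  | { tag: "Error"; rid: number; message: string };

// Normalize any thrown value into a message string for the Error response.
function errorMessage(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

async function handleWorkflowReplay(
  rid: number,
  replay: () => Promise<unknown>,
  send: (msg: ReplayOutcome) => void,
): Promise<void> {
  try {
    const history = await replay();
    send({ tag: "WorkflowReplayResponse", rid, history });
  } catch (err) {
    // Always answer the client instead of letting the failure escape
    // into the outer message handler.
    send({ tag: "Error", rid, message: errorMessage(err) });
  }
}
```

With this shape, a "step not found" or in-flight rejection reaches the client as an Error message tied to the same request id.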

ISSUE 3. loadStorage now eagerly loads all entry metadata - potential KV scan cost on hot path

rivetkit-typescript/packages/workflow-engine/src/storage.ts (line ~158-166):

Previously loadMetadata lazily fetched individual metadata entries. Now every loadStorage call (including the normal live workflow execution path) does a full prefix scan over all entry metadata. For workflows with many steps, this can be a significant extra read on the hot path. The replay function needs this data, but the observers (inspector) are the minority case. Consider whether this should remain lazy and be loaded on-demand for replay only, or document the performance trade-off explicitly.
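
If the lazy route is preferred, one option is a memoizing accessor that reads an entry's metadata on first use and lets replay request everything explicitly. This is a sketch under assumptions: EntryMetadata, MetadataLoader, and LazyEntryMetadata are hypothetical names and do not reflect the workflow-engine's storage API.

```typescript
// Hypothetical metadata shape and loader; not the workflow-engine storage API.
interface EntryMetadata {
  status: string;
}
type MetadataLoader = (entryId: string) => Promise<EntryMetadata>;

class LazyEntryMetadata {
  private readonly cache = new Map<string, Promise<EntryMetadata>>();
  private readonly load: MetadataLoader;

  constructor(load: MetadataLoader) {
    this.load = load;
  }

  // Read one entry's metadata on first use and memoize the in-flight promise,
  // so concurrent callers share a single read.
  get(entryId: string): Promise<EntryMetadata> {
    let pending = this.cache.get(entryId);
    if (pending === undefined) {
      pending = this.load(entryId);
      this.cache.set(entryId, pending);
    }
    return pending;
  }

  // Only callers that need every entry (for example replay) pay for all reads.
  getAll(entryIds: readonly string[]): Promise<EntryMetadata[]> {
    return Promise.all(entryIds.map((id) => this.get(id)));
  }
}
```

Note that a rejected read stays cached in this sketch; a real implementation would probably evict failed entries so they can be retried.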

ISSUE 4. HTTP replay endpoint does not validate entryId type at runtime

rivetkit-typescript/packages/rivetkit/src/actor/router.ts (line ~304):

There is no runtime validation that body.entryId is actually a string when present. A malformed body with entryId: 12345 would pass through to the engine. Given that replay is a destructive operation (deletes KV entries), adding a runtime check is warranted even if other inspector endpoints follow the same loose pattern.
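
A small runtime check along these lines would be enough. The function name parseReplayBody and the rule that an empty string is rejected are assumptions for illustration; the router may prefer its existing validation tooling.

```typescript
interface ReplayBody {
  entryId?: string;
}

// Validate an untrusted JSON body before it reaches the engine.
function parseReplayBody(body: unknown): ReplayBody {
  if (body === null || typeof body !== "object" || Array.isArray(body)) {
    throw new TypeError("replay body must be a JSON object");
  }
  const entryId = (body as Record<string, unknown>).entryId;
  if (entryId === undefined) {
    return {};
  }
  // Assumption: an empty string is treated as invalid rather than as "no entry".
  if (typeof entryId !== "string" || entryId.length === 0) {
    throw new TypeError("entryId must be a non-empty string when present");
  }
  return { entryId };
}
```

With this in place, a body such as { entryId: 12345 } is rejected before any KV entries are touched.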

ISSUE 5. Test for in-flight rejection expects internal_error but implementation throws a plain Error

rivetkit-typescript/packages/rivetkit/src/driver-test-suite/tests/actor-inspector.ts (line ~1465):

replayWorkflowFromStep throws new Error with the message "Cannot replay a workflow while a step is currently running". Whether this becomes { code: internal_error } depends on the actor router's global error handler. The workflow-specific errors should ideally be surfaced as a distinct error code (e.g., workflow_in_flight) so the frontend can display a user-friendly message.
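
As a rough illustration of the suggestion, a dedicated error class could carry the code and be mapped explicitly at the boundary. WorkflowInFlightError, ErrorPayload, and toErrorPayload are hypothetical names; the real router's error handling may be structured differently.

```typescript
// Hypothetical error class; the code string follows the suggestion above.
class WorkflowInFlightError extends Error {
  readonly code = "workflow_in_flight";

  constructor() {
    super("Cannot replay a workflow while a step is currently running");
    this.name = "WorkflowInFlightError";
  }
}

interface ErrorPayload {
  code: string;
  message: string;
}

// Map known workflow errors to their own code; everything else stays generic.
function toErrorPayload(err: unknown): ErrorPayload {
  if (err instanceof WorkflowInFlightError) {
    return { code: err.code, message: err.message };
  }
  // Unrecognized errors keep the generic code and do not leak their message.
  return { code: "internal_error", message: "Internal error" };
}
```

The frontend could then branch on code === "workflow_in_flight" to show a friendly message instead of a generic failure.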

ISSUE 6. syncWorkflowHistoryAfterReplay polling does not cancel on unmount

frontend/src/components/actors/workflow/actor-workflow-tab.tsx (line ~158-198):

The polling loop is fire-and-forget via void syncWorkflowHistoryAfterReplay. If the user navigates away from the workflow tab before the loop finishes, the loop continues fetching and setting query data for an unmounted component. A useEffect cleanup or an AbortController passed into the loop would be cleaner.
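
The AbortController variant could look roughly like this. pollUntilSynced, its parameters, and the default interval and attempt count are all hypothetical; the actual loop in actor-workflow-tab.tsx fetches and writes query data rather than returning a boolean.

```typescript
// Hypothetical polling helper; fetchOnce resolves to true once the UI is in sync.
async function pollUntilSynced(
  fetchOnce: () => Promise<boolean>,
  signal: AbortSignal,
  intervalMs = 500,
  maxAttempts = 20,
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Stop before issuing another fetch once the caller has aborted.
    if (signal.aborted) return false;
    const synced = await fetchOnce();
    // Ignore a result that arrives after the caller has gone away.
    if (signal.aborted) return false;
    if (synced) return true;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}
```

In the component, the controller would be created when the replay starts and aborted from a useEffect cleanup, so navigating away ends the loop at its next check.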

ISSUE 7. getInspectorProtocolVersion nesting is fragile

frontend/src/components/actors/actor-inspector-context.tsx (version check block):

The v4 check is nested inside the v3 check because MIN_RIVETKIT_VERSION_WORKFLOW_REPLAY is newer than MIN_RIVETKIT_VERSION_DATABASE. The nesting is correct but fragile. A flat if-else-if chain ordered from newest to oldest would make the version hierarchy explicit and prevent a future reviewer from accidentally extracting the v4 check to the top level.
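
A flat, newest-first chain might read as follows. The two constant names are from the frontend, but their values here are placeholders, and the function signature, the isAtLeast helper, and the fallback of 2 for older actors are assumptions.

```typescript
// Placeholder thresholds: the real values live in the frontend and are not shown here.
const MIN_RIVETKIT_VERSION_WORKFLOW_REPLAY = "2.1.0"; // hypothetical value
const MIN_RIVETKIT_VERSION_DATABASE = "2.0.0"; // hypothetical value

// Compare plain "major.minor.patch" strings numerically (no pre-release handling).
function isAtLeast(version: string, minimum: string): boolean {
  const a = version.split(".").map(Number);
  const b = minimum.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    const diff = (a[i] ?? 0) - (b[i] ?? 0);
    if (diff !== 0) return diff > 0;
  }
  return true;
}

// Newest first, so each branch is reached only when every newer check failed.
function getInspectorProtocolVersion(rivetkitVersion: string): number {
  if (isAtLeast(rivetkitVersion, MIN_RIVETKIT_VERSION_WORKFLOW_REPLAY)) return 4;
  if (isAtLeast(rivetkitVersion, MIN_RIVETKIT_VERSION_DATABASE)) return 3;
  return 2; // assumed fallback for older actors
}
```

Adding a future v5 then means inserting one line at the top of the chain, with no nesting to reason about.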

ISSUE 8. Duplicate query data updates between handleReplay and WebSocket response handler

For the embedded WebSocket inspector path, queryClient.setQueryData is called once when the mutation resolves in handleReplay and again when the WorkflowReplayResponse arrives on the WebSocket. This is harmless but results in a redundant double-render with the same data.

--- POSITIVE HIGHLIGHTS ---

  • The per-actor WeakMap approach correctly handles multiple actor instances sharing the same workflow() definition without leaking state between them.
  • The findReplayBoundaryEntry logic for loop rewinding is well-reasoned: finding the deepest enclosing loop ensures full loop re-execution rather than partial state.
  • The versioned protocol upgrade follows the existing pattern cleanly. v3 clients that receive a WorkflowReplayResponse get a graceful WORKFLOW_HISTORY_DROPPED_ERROR error message.
  • Tests cover the key scenarios: full replay, mid-workflow replay, loop boundary replay, and the sleeping-loop edge case.
  • The dual-token auth extension (global RIVET_INSPECTOR_TOKEN OR actor-specific inspector token) uses timingSafeEqual for both paths, which is correct.
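
For readers unfamiliar with the pattern, a dual-token check built on Node's timingSafeEqual can be sketched as below. Hashing both sides first is a choice made here to avoid the length-mismatch exception, not necessarily what the PR does, and the function names are hypothetical.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hash both sides first so timingSafeEqual always sees equal-length buffers
// (it throws when the byte lengths differ).
function tokenMatches(provided: string, expected: string): boolean {
  const a = createHash("sha256").update(provided).digest();
  const b = createHash("sha256").update(expected).digest();
  return timingSafeEqual(a, b);
}

// Accept the request if either configured token matches. Every configured
// token is compared, so a match on the first does not skip the second.
function isInspectorAuthorized(
  provided: string,
  globalToken: string | undefined,
  actorToken: string | undefined,
): boolean {
  const globalOk = globalToken !== undefined && tokenMatches(provided, globalToken);
  const actorOk = actorToken !== undefined && tokenMatches(provided, actorToken);
  return globalOk || actorOk;
}
```

When neither token is configured, the sketch denies the request rather than treating the endpoint as open.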

--- VERDICT ---

Most important before merge: Issue 2 (unhandled WebSocket replay errors). Issues 3 (eager metadata load on hot path) and 5 (error code specificity) are also worth addressing. The rest are polish items.

Contributor

@jog1t jog1t left a comment

react part looks good!

<div className="mt-4 flex justify-end">
<MaybeTooltip
content={replayState.tooltip}
disabled={!replayState.tooltip}
Contributor

I think we could also show a message when the user needs to update the actor to the latest RivetKit before they can replay steps.

@NathanFlurry NathanFlurry changed the title from "Add workflow rerun controls to RivetKit Inspector" to "feat: workflow replay" on Mar 20, 2026