Skip to content

Make jobs cancelled due to a soft stop immediately available#1290

Open
brandur wants to merge 1 commit into
masterfrom
brandur-soft-stopped-jobs-immediately-available
Open

Make jobs cancelled due to a soft stop immediately available#1290
brandur wants to merge 1 commit into
masterfrom
brandur-soft-stopped-jobs-immediately-available

Conversation

@brandur

@brandur brandur commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

While working on #1289, I realized that jobs which are "soft stopped"
via context cancellation are still prone to the same side effects as if
they errored in any other way:

  • Their number of attempts is incremented.
  • They may be discarded if reaching max attempts.
  • They'll have to wait to be retried according to retry policy.

This doesn't really seem right because these jobs didn't actually
misbehave in any way, but were rather just slow-to-run jobs that
couldn't finish cleanly inside the default stop allowance while a client
was restarting or being deployed.

The proper behavior should probably be more like a snooze. i.e. The soft
timeout cancellation doesn't count and the jobs get a chance to be
retried immediately. Here, make that change.

@brandur brandur force-pushed the brandur-soft-stopped-jobs-immediately-available branch from 00b244b to 3c09064 Compare June 19, 2026 20:16
@brandur brandur requested a review from bgentry June 19, 2026 20:16
@brandur

brandur commented Jun 19, 2026

Copy link
Copy Markdown
Contributor Author

@bgentry Do you agree with the rationale for this one?

@brandur

brandur commented Jun 20, 2026

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c090646e1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread internal/jobexecutor/job_executor.go Outdated
}

if softStopped {
params := riverdriver.JobSetStateErrorAvailable(jobRow.ID, now, ptrutil.Ptr(max(jobRow.Attempt-1, 0)), errData, metadataUpdates)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep soft-stop records out of retry backoff

For jobs that are soft-stopped and then later fail for a real reason, this path resets attempt but still appends errData to errors. The default retry policy uses len(job.Errors)+1 rather than attempt, so a deployment stop still changes future retry behavior: the first genuine failure after one soft stop is treated like the second error and backs off for 16s instead of 1s, with repeated stops inflating it further. This means soft-stop cancellations still count for retry scheduling even though the attempt count is restored.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed! Changed this to be more like snooze and not persist the cancellation error at all.

While working on #1289, I realized that jobs which are "soft stopped"
via context cancellation are still prone to the same side effects as if
they errored in any other way:

* Their number of attempts is incremented.
* They may be discarded if reaching max attempts.
* They'll have to wait to be retried according to retry policy.

This doesn't really seem right because these jobs didn't actually
misbehave in any way, but were rather just slow-to-run jobs that
couldn't finish cleanly inside the default stop allowance while a client
was restarting or being deployed.

The proper behavior should probably be more like a snooze. i.e. The soft
timeout cancellation doesn't count and the jobs get a chance to be
retried immediately. Here, make that change.
@brandur brandur force-pushed the brandur-soft-stopped-jobs-immediately-available branch from 3c09064 to 2e49010 Compare June 20, 2026 03:34

@bgentry bgentry left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the general direction makes some sense and it's something I've also pondered in the past, but there are some tradeoffs we should discuss before adopting it.

First, this path still appears to publish a job_failed event because the job is completed back into available state with Snoozed=false. That feels inconsistent with the new model where this cancellation is intentionally not counted as an error, leaves errors unchanged, and restores attempt. Users consuming events for alerting or metrics may see a “failed” event for a job that no longer has failure state/history to explain it.

Second, skipping ErrorHandler removes a user-visible observation/control point. That may be the right outcome if River-owned stop cancellation is not considered a job error, but some users may currently rely on the handler to record shutdown interruptions, cleanup failures, or partial-work cases where the worker returned ctx.Err() because it did not reach a safe checkpoint.

Third, this bypasses retry policy and max-attempt accounting entirely. That is the desired property for normal deploy/restart interruption, but it also means repeated stops can keep a context-sensitive long-running job alive forever with no backoff and no eventual discard. That's intentional I think, but it’s a real semantic tradeoff we should decide on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants