Skip to content

FIX: per-frame watchdog so a nested timed call can't disarm the outer timeout#423

Open
ursm wants to merge 1 commit into
rubyjs:mainfrom
ursm:fix/nested-watchdog-timeout
Open

FIX: per-frame watchdog so a nested timed call can't disarm the outer timeout#423
ursm wants to merge 1 commit into
rubyjs:mainfrom
ursm:fix/nested-watchdog-timeout

Conversation

@ursm
Copy link
Copy Markdown
Contributor

@ursm ursm commented May 31, 2026

Summary

The per-Context watchdog state (c->wdmtx/cv/cancel) was shared by every v8_timedwait frame. When a timed dispatch re-enters Ruby and triggers another timed dispatch on the same Context — e.g. a host function calling eval/call again (and, with the in-flight ES module API, an instantiate resolver calling compile_module) — the inner frame's teardown did:

pthread_mutex_lock(&c->wd.mtx);
c->wd.cancel = 1;
pthread_cond_signal(&c->wd.cv);   // wakes BOTH watchdogs (shared cv)
...
c->wd.cancel = 0;                 // also written outside the mutex (data race)

The shared cancel/cv meant cancelling the inner (usually fast) call's watchdog also woke the outer frame's watchdog, which saw cancel == 1, broke out, and exited — so the outer call then ran with no timeout at all.

This is pre-existing on main; it reproduces with just a timeout and a host fn that re-enters eval:

ctx = MiniRacer::Context.new(timeout: 100)
ctx.attach("quick", proc { ctx.eval("1 + 1") })  # fast nested timed call
ctx.eval("quick(); while (true) {}")             # outer 100ms timeout never fires → spins forever

Fix

Move the watchdog into a stack-allocated struct Watchdog owned by each v8_timedwait frame (its own mtx/cv/cancel/deadline), created when the call arms a timeout and torn down before that frame returns. Cancelling an inner watchdog now touches only the inner state, so it can no longer disarm the outer one. The now-unused c->wd field and its init/teardown are dropped, and the cancel data race is gone (the flag is per-frame and only written under its own mutex). The per-frame condvar uses the same clock as deadline_ms (CLOCK_MONOTONIC off Apple).

Test

test_nested_timed_call_does_not_disarm_outer_timeout runs a fast nested eval from a host fn, then spins in the outer script, and asserts the outer timeout still fires. A safety stop turns a regression into a slow failure instead of a hang: against the shared watchdog it runs to the 5 s safety stop and fails (elapsed ~5s); with this fix it terminates in ~0.1 s.

Known follow-up (out of scope here)

TerminateExecution is isolate-global, so if an outer deadline fires while a nested call is still running, the nested handler's CancelTerminateExecution() can still clear it. Closing that requires nesting-aware termination ownership and is left for a separate change.

🤖 Generated with Claude Code

…imeout

The per-Context watchdog state (c->wd: mtx/cv/cancel) was shared by every
v8_timedwait frame. When a timed dispatch re-enters Ruby and triggers another
timed dispatch on the same Context — a host function (or, with the module API,
an instantiate resolver) calling eval/call/compile_module — the inner frame's
`cancel = 1; pthread_cond_signal()` also woke the OUTER frame's watchdog, which
saw the shared cancel flag, broke out, and exited. The outer call then ran with
no timeout at all. (The `cancel = 0` reset was also written outside the mutex,
a data race with a still-waiting watchdog.)

Move the watchdog state into a stack-allocated `struct Watchdog` owned by each
v8_timedwait frame and torn down before it returns, so cancelling an inner
watchdog can no longer disarm the outer one. Drops the now-unused c->wd field
and its init/teardown.

Adds a regression test (host fn does a fast nested eval, then the outer script
spins) asserting the outer timeout still fires; it fails against the shared
watchdog (the spin runs to the test's safety stop) and passes here in ~0.1s.

Note: a separate, narrower subtlety remains — because TerminateExecution is
isolate-global, if an outer deadline fires *while* a nested call is running,
the nested handler's CancelTerminateExecution can still clear it. That needs
nesting-aware termination ownership and is left as a follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant