feat(mollifier): trigger burst smoothing — Phase 1 (monitoring)#3614
d-cs wants to merge 25 commits into
Conversation
🦋 Changeset detected. Latest commit: adc29fc. The changes in this PR will be included in the next version bump. This PR includes changesets to release 31 packages.
Note: Reviews paused. This branch is under active development, so CodeRabbit has automatically paused its review to avoid an influx of comments on new commits.
Walkthrough

This PR implements the first two phases of a trigger burst-smoothing system ("mollifier"): it adds a Redis-backed MollifierBuffer and MollifierDrainer, Zod schemas and payload (de)serialization, a real trip evaluator and mollifier gate wired into RunEngineTriggerTaskService, OpenTelemetry metrics, worker startup drainer wiring, environment configuration and feature flag, package re-exports, and comprehensive tests (unit, integration, and fuzz) validating buffer, drainer, gate, and evaluator behavior while keeping fail-open semantics and deferring full activation to later phases.

Estimated code review effort: 4 (Complex), ~60 minutes.
Pre-merge checks: 3 passed, 2 failed (1 warning, 1 inconclusive).
Code review

Found 2 issues:
trigger.dev/apps/webapp/app/v3/mollifier/mollifierGate.server.ts Lines 83 to 88 in 452ebda
trigger.dev/apps/webapp/app/runEngine/services/triggerTask.server.ts Lines 338 to 346 in 452ebda
Nitpicks (test-mock removal) also addressed in 98c1520:
Code review (follow-up)

Four additional issues from the earlier scan, posting now since the gate-naming fix landed:
trigger.dev/apps/webapp/test/engine/triggerTask.test.ts Lines 1343 to 1347 in 98c1520
trigger.dev/apps/webapp/app/services/worker.server.ts Lines 129 to 152 in 98c1520
trigger.dev/apps/webapp/app/v3/mollifier/mollifierGate.server.ts Lines 11 to 16 in 98c1520
trigger.dev/packages/redis-worker/src/mollifier/buffer.ts Lines 294 to 312 in 98c1520
Force-pushed from 088e071 to 2a0d5ee
Redis-backed burst-smoothing layer behind MOLLIFIER_ENABLED=0 (default).
With the kill switch off, the gate short-circuits on its first env check
and production behaviour is identical to main.
@trigger.dev/redis-worker:
- MollifierBuffer: atomic Lua-backed FIFO with accept / pop / ack /
requeue / fail + TTL. Per-env queues with HSET entry storage,
atomic RPOP + status transition, FIFO retry ordering.
- MollifierDrainer: generic round-robin worker with concurrency cap,
retry semantics, and a stop deadline to avoid livelock on a hung
handler. Phase 3 will wire the handler to engine.trigger().
- Full testcontainers-backed test suite (21 tests).
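The buffer contract above (accept / pop / ack / requeue plus per-env FIFO queues) can be sketched against an in-memory map. This is purely illustrative of the described semantics — the real implementation is atomic Lua against Redis, and all names and shapes here are assumptions, not the package's actual exports:

```typescript
// In-memory sketch of the MollifierBuffer contract (illustrative only; the
// real buffer is Lua-backed Redis with HSET entry storage and TTLs).
type Entry = { runId: string; payload: string; status: "queued" | "draining" };

class SketchBuffer {
  private queues = new Map<string, string[]>(); // envId -> FIFO of runIds
  private entries = new Map<string, Entry>();   // runId -> entry hash

  // accept is idempotent on duplicate runId
  accept(envId: string, runId: string, payload: string): boolean {
    if (this.entries.has(runId)) return false;
    this.entries.set(runId, { runId, payload, status: "queued" });
    const q = this.queues.get(envId) ?? [];
    q.push(runId);
    this.queues.set(envId, q);
    return true;
  }

  // atomic pop + status transition; skips orphan queue references
  pop(envId: string): Entry | null {
    const q = this.queues.get(envId);
    while (q && q.length > 0) {
      const runId = q.shift()!;
      const entry = this.entries.get(runId);
      if (!entry) continue; // orphan reference: entry expired while queued
      entry.status = "draining";
      if (q.length === 0) this.queues.delete(envId); // env leaves rotation when empty
      return entry;
    }
    return null;
  }

  ack(runId: string): void {
    this.entries.delete(runId);
  }

  // requeue restores the entry (and its env) to the rotation for retry
  requeue(envId: string, runId: string): void {
    const entry = this.entries.get(runId);
    if (!entry) return;
    entry.status = "queued";
    const q = this.queues.get(envId) ?? [];
    q.push(runId);
    this.queues.set(envId, q);
  }

  listEnvs(): string[] {
    return [...this.queues.keys()];
  }
}
```

The single-threaded map stands in for what the Lua scripts guarantee under concurrency: each runId is stored once, popped once, and an env disappears from the rotation the moment its queue empties.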
apps/webapp:
- evaluateGate cascade-check (kill switch -> org feature flag ->
shadow mode -> trip evaluator -> mollify / shadow_log / pass_through).
Dependencies injected for testability; the trip evaluator stub
returns { divert: false } in phase 1.
- Inserted into RunEngineTriggerTaskService.call() before
traceEventConcern.traceRun. The mollify branch throws (unreachable
in phase 1).
- Lazy MollifierBuffer + MollifierDrainer singletons; no Redis
connection unless MOLLIFIER_ENABLED=1.
- 12 MOLLIFIER_* env vars (all safe defaults) and a mollifierEnabled
feature flag in the global catalog.
- Drainer booted from worker.server.ts on first import.
- Read-fallback stub for phase 3.
- Gate cascade tests + .env loader so env.server validates in vitest
workers.
Phase 2 will land the real trip evaluator; phase 3 will activate the
buffer-write + drain path.
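The cascade described above (kill switch -> org feature flag -> shadow mode -> trip evaluator -> mollify / shadow_log / pass_through) can be sketched as a pure function. The real gate is async and dependency-injected; this synchronous sketch uses illustrative names and shows only the fail-open ordering:

```typescript
// Synchronous sketch of the evaluateGate cascade (illustrative; the real gate
// is async with injected dependencies).
type GateDecision = "pass_through" | "shadow_log" | "mollify";

interface GateDeps {
  killSwitchOn: () => boolean;             // MOLLIFIER_ENABLED check
  orgFlagEnabled: () => boolean;           // per-org mollifierEnabled flag
  shadowMode: () => boolean;
  tripEvaluator: () => { divert: boolean };
}

function evaluateGateSketch(d: GateDeps): GateDecision {
  // Kill switch off: short-circuit on the first env check.
  if (!d.killSwitchOn()) return "pass_through";

  // Flag resolution fails open to "disabled" on error.
  let flagOn = false;
  try {
    flagOn = d.orgFlagEnabled();
  } catch {
    return "pass_through";
  }
  if (!flagOn) return "pass_through";

  // A throwing evaluator falls back to no-divert (fail open).
  let divert = false;
  try {
    divert = d.tripEvaluator().divert;
  } catch {
    return "pass_through";
  }
  if (!divert) return "pass_through";

  // Shadow mode logs the would-be diversion instead of mollifying.
  return d.shadowMode() ? "shadow_log" : "mollify";
}
```

Every error path resolves to pass_through, which is the fail-open contract the commit messages below keep referring to: nothing the gate does may break the trigger hot path.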
…dual-write monitoring + drainer ack loop)

Phase 1 of the trigger-burst smoothing initiative. Adds the A-side trip evaluator (atomic Lua sliding-window per env) and wires it into the trigger hot path. When the per-org mollifierEnabled feature flag is on AND the evaluator says divert, the canonical replay payload is buffered to Redis (via buffer.accept) AND the trigger continues through engine.trigger — i.e. dual-write. The drainer pops + acks (no-op handler) to prove the dequeue mechanism works end-to-end. Operators audit by joining mollifier.buffered (write) and mollifier.drained (consume) logs by runId.

Buffer primitives hardened:
- accept is idempotent on duplicate runId (Lua EXISTS guard)
- pop skips orphan queue references (entry HASH TTL'd while runId queued)
- fail no-ops on missing entry (no partial FAILED hash leak)
- mollifier:envs set pruned on draining pop, restored on requeue
- 16-row truth-table test enumerates the gate cascade
- BufferedTriggerPayload defines the canonical replay shape Phase 2 will use to invoke engine.trigger
- payload hash for audit-equivalence computed off the hot path (in the drainer) to avoid CPU during a spike

Regression tests in apps/webapp/test/engine/triggerTask.test.ts pin the mollifier integration:
- validation throws BEFORE the gate runs (no orphan buffer write on rejected triggers)
- mollify dual-write happy path (Postgres + Redis both reflect the run)
- pass_through path does NOT call buffer.accept
- engine.trigger throwing AFTER buffer.accept leaves an orphan (documented behaviour — drainer auto-cleans; audit-trail surfaces it)
- idempotency-key match short-circuits BEFORE the gate is consulted
- debounce match produces an orphan (documented behaviour — Phase 2 must lift handleDebounce upfront before buffer.accept)

Behaviour with MOLLIFIER_ENABLED=0 (default) is byte-identical to main. With MOLLIFIER_ENABLED=1 and the flag off, only mollifier.would_mollify logs fire (no buffer state). With the flag on, dual-write activates.

Includes two opt-in *.fuzz.test.ts suites (gated on FUZZ=1) that randomise operation sequences against evaluateTrip and the drainer to find timing edges. They are clearly marked TEMPORARY in their headers.
- changeset: drop "deferred" wording — phase-1 actively dual-writes + runs the drainer ack loop.
- worker.server.ts: wrap mollifier drainer init in try/catch + register SIGTERM/SIGINT handlers so the polling loop stops cleanly on shutdown.
- bufferedTriggerPayload: only serialise idempotencyKeyExpiresAt when an idempotencyKey is present (avoid impossible orphan-expiry payloads).
- mollifierTelemetry: narrow the recordDecision reason to the DecisionReason union to keep OTEL attribute cardinality bounded.
- mollifierGate: rename resolveOrgFlag → resolveFlag. The underlying FeatureFlag table is global by key, so the "org" prefix was misleading; per-org gating is out of scope for phase-1.
- tests: drop vi.fn mocks. mollifierGate now uses plain closure spies; mollifierTripEvaluator runs against a real MollifierBuffer backed by a redisTest container (a closed client exercises the fail-open path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…stacking

Worker.init() is called per request from entry.server.tsx, so the process.once SIGTERM/SIGINT pair added in 98c1520 would stack a fresh listener every request under dev hot-reload (process.once only removes a listener after it fires). Gate registration on a process-global flag, matching the existing __worker__ pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion.featureFlags

The mollifier gate's resolveOrgFlag was a global feature-flag lookup named as if org-scoped. The phase-1 plan and design doc both intended per-org gating; the implementation regressed because the global flag() helper has no orgId parameter. Adopt the existing per-org feature-flag pattern (used by canAccessAi, canAccessPrivateConnections, compute beta gating): pass `Organization.featureFlags` through as `flag()` overrides. Per-org opt-in now works, admin-toggleable via the existing Organization.featureFlags JSON column — no schema migration needed.

- mollifierGate: revert resolveFlag/flagEnabled back to resolveOrgFlag/orgFlagEnabled (the name now matches reality). GateInputs gains `orgFeatureFlags`; the default resolver passes them as overrides to `flag()`.
- triggerTask.server.ts: thread `environment.organization.featureFlags` into the gate call.
- tests: three new postgresTest cases exercise the real DB-backed resolveOrgFlag end-to-end, proving (a) per-org opt-in isolation, (b) unrelated beta flags don't bleed across, (c) per-org overrides take precedence over the global FeatureFlag row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nect

The unit cascade tests in mollifierGate.test.ts import the gate module, which transitively pulls in ~/db.server. That module constructs the prisma singleton at import time and eagerly calls $connect(), which fails against localhost:5432 in the unit-test shard and surfaces as an unhandled rejection that fails the whole vitest run. Mocking the module keeps the cascade tests pure and leaves the postgresTest cases on the testcontainer-fixture prisma untouched.
- Gate drainer init on WORKER_ENABLED so only worker replicas run the polling loop.
- Update the enqueueSystem TTL comment now that delayed/pending-version are first enqueues.
- Correct the mollifier gate docstring to describe the fixed-window counter and tripped-key rearm.
- Swap findUnique for findFirst in the trigger task test to match the webapp Prisma rule.
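A fixed-window counter with a tripped-key rearm, as the corrected docstring describes it, can be sketched in a few lines. This is an illustrative in-memory model only — the real evaluator is an atomic Lua script against Redis, and all names and constants here are assumptions:

```typescript
// Illustrative fixed-window trip counter: increments land in the current
// window bucket; crossing the threshold sets a "tripped" marker that keeps
// diverting until it expires (the rearm).
function makeTripEvaluatorSketch(threshold: number, windowMs: number, trippedTtlMs: number) {
  let windowStart = 0;
  let count = 0;
  let trippedUntil = 0;

  return function evaluateTrip(nowMs: number): { divert: boolean } {
    if (nowMs < trippedUntil) return { divert: true }; // tripped marker still live

    if (nowMs - windowStart >= windowMs) {
      windowStart = nowMs - (nowMs % windowMs); // align to the fixed window boundary
      count = 0;
    }
    count += 1;

    if (count > threshold) {
      trippedUntil = nowMs + trippedTtlMs; // rearm: divert until the marker expires
      return { divert: true };
    }
    return { divert: false };
  };
}
```

The fixed window (as opposed to a true sliding window) means the counter resets at window boundaries; the tripped marker smooths over that edge so a burst that crossed the threshold keeps diverting even as a new window opens.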
…eFlags

The gate's `GateInputs` now requires `orgFeatureFlags`, but the surface type used by the trigger service was still the pre-org-scope shape, so the default evaluator wasn't assignable and the call site couldn't pass the flag overrides.
…est startup

The per-org isolation suite uses `postgresTest`, which spins up a fresh Postgres testcontainer per case. On CI the 5s vitest default regularly times out on container start before the test body runs. Match the 30s `vi.setConfig` used by other postgresTest suites in this app.
…rrors

resolveOrgFlag now checks the per-org Organization.featureFlags override in-memory before falling back to the global flag() helper, so the common per-org enablement path resolves without a Prisma round-trip on every trigger call. evaluateGate also wraps the flag resolution in try/catch and fails open to false on error, mirroring the trip evaluator.
…exit

Pass a configurable timeout to drainer.stop() so SIGTERM/SIGINT can't hang forever if an in-flight handler is wedged. Matches the precedent set by BATCH_TRIGGER_WORKER_SHUTDOWN_TIMEOUT_MS (default 30s).
processOneFromEnv now catches buffer.pop() failures so one env's hiccup doesn't reject Promise.all and bubble up to the loop's outer catch. The polling loop itself wraps each runOnce in try/catch and backs off with capped exponential delay (up to 5s) instead of exiting permanently on the first listEnvs/pop error. Stop semantics are unchanged: only the stopping flag breaks the loop. Adds two regression tests using a stub buffer (no Redis container) so fault injection is deterministic.
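The resilient loop described above can be sketched as follows: every tick's error is caught and backed off with a capped exponential delay, and only the stopping flag exits the loop. Delay values and names are illustrative, not the real drainer's:

```typescript
// Capped exponential backoff: base * 2^failures, never more than capMs.
function backoffDelayMs(consecutiveFailures: number, baseMs = 100, capMs = 5_000): number {
  return Math.min(baseMs * 2 ** consecutiveFailures, capMs);
}

// Polling loop that survives tick failures; only `stopping` breaks it.
async function pollLoop(
  runOnce: () => Promise<void>,
  state: { stopping: boolean },
  baseMs = 100,
  capMs = 5_000
): Promise<void> {
  let failures = 0;
  while (!state.stopping) {
    try {
      await runOnce();
      failures = 0; // a clean tick resets the backoff
    } catch {
      failures += 1; // never exit permanently on error
    }
    if (failures > 0) {
      await new Promise((r) => setTimeout(r, backoffDelayMs(failures, baseMs, capMs)));
    }
  }
}
```

The cap matters operationally: without it a long Redis outage would push retry delays out indefinitely, so recovery after the outage would be slow even though the dependency is healthy again.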
The phase-1 scaffolding referenced MollifierBuffer, getMollifierBuffer, and deserialiseMollifierSnapshot without importing them — CI typecheck fails with TS2304. The runtime path is gated behind MOLLIFIER_ENABLED=0 so this never produced a runtime symptom, but the types must resolve.
…luator fail-open
The TripDecision header comment claimed each webapp instance maintained its own rate counter — wrong. evaluateTrip writes to mollifier:rate:${envId} with no per-instance prefix, so all replicas pointing at the same Redis share the key. The threshold is the fleet-wide ceiling.
Also wrap d.evaluator() in evaluateGate in try/catch so a throwing
evaluator falls back to no-divert. The default createRealTripEvaluator
catches its own errors, but the contract should be symmetric with the
already-wrapped resolveOrgFlag call so a future evaluator can't break
the trigger hot path's fail-open contract.
The two notes describe the same PR's behaviour from two angles; merging them into one entry gives a cleaner changelog line and matches how the PR is presented to reviewers.
drainer.fuzz.test.ts and evaluateTrip.fuzz.test.ts are valuable as ongoing property checks but aren't load-bearing for the phase-1 review. Moving them to a follow-up keeps this PR smaller without losing coverage of the production paths (buffer.test.ts and drainer.test.ts together cover the contract surface).
The enqueueSystem.ts comment touch-up was an unrelated drive-by during phase-1 review and doesn't belong in this PR. Will land separately.
External changelog readers don't have context on internal phase numbering; describe the feature itself (opt-in burst protection, default-off env vars, shadow mode, dual-write activation) instead of "phase 1".
…iversion

The previous wording implied the buffer/drainer was active protection; in this release they're audit-only. Spell out that no trigger calls are diverted or rate-limited yet, and that active smoothing follows later.
MollifierEvaluateGate and MollifierGetBuffer were defined in the consumer (triggerTask.server.ts) but described the surface of the gate and the buffer accessor respectively. Move each to the module that owns the underlying implementation so the type lives with the producer, not the caller. No behavioural change.
…-redis fallback

Two operational guards for misconfigured rollouts:

1. Drop the MOLLIFIER_REDIS_* fallback to the main REDIS_* cluster. The mollifier writes to a dedicated Redis to keep burst traffic off the engine's primary queue — silently colocating with the main Redis when MOLLIFIER_REDIS_HOST is unset defeats the design.
2. Degrade gracefully instead of crashing the pod. If MOLLIFIER_ENABLED was flipped on without setting MOLLIFIER_REDIS_HOST, the buffer returns null (with a one-shot warn log per process) and the drainer no-ops. No crash loops, no failed deploys, no traffic impact — operators see the warn line and fix the misconfig in a follow-up deploy.

The drainer's previously-unreachable "env vars inconsistent" throw becomes reachable in this degraded mode; replace it with a null return so worker.server.ts's existing null check short-circuits cleanly.
mollifier:envs is a Redis SET that grows with the count of envs that currently have buffered entries. Under normal operation that's small, but an extended drainer outage can leave entries piled up across thousands of envs — at which point runOnce would queue one processOneFromEnv per env through pLimit, ballooning per-tick latency and event-loop queue depth. Cap per-tick fan-out at MOLLIFIER_DRAIN_MAX_ENVS_PER_TICK (default 500). When the set fits within the cap, behaviour is unchanged (take all, rotate cursor by 1 for fairness). When the set exceeds the cap, take a rotating slice and advance the cursor by the slice size so successive ticks sweep through the full set. Tests use a stub buffer to drive listEnvs() deterministically with thousands of envs without provisioning a real Redis.
…tchQueue

The MollifierDrainer's stop() was polling `isRunning` every 20ms until the loop exited, which differs from the codebase's convention for similar polling loops (FairQueue and BatchQueue both hold the loop promise as a field and await it directly on stop). Switch to the same pattern: store the loop promise on start(), then in stop() race it against the timeout via Promise.race. With no timeout we just await the loop directly. With a timeout the warn-and-return behaviour is unchanged. No polling, no separate `isRunning` poll loop. Behaviour is identical to the previous implementation, including the hung-handler timeout path (covered by the existing "stop returns after timeoutMs even if a handler is hung" test).
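The hold-the-loop-promise pattern can be sketched as a minimal runner. Class and method names are illustrative, not the drainer's actual surface:

```typescript
// Minimal sketch: start() stores the loop promise; stop() races it against an
// optional timeout instead of polling a flag on an interval.
class LoopRunner {
  private stopping = false;
  private loopPromise?: Promise<void>;

  start(tick: () => Promise<void>, intervalMs = 10): void {
    this.loopPromise = (async () => {
      while (!this.stopping) {
        await tick();
        await new Promise((r) => setTimeout(r, intervalMs));
      }
    })();
  }

  async stop(timeoutMs?: number): Promise<"stopped" | "timed_out"> {
    this.stopping = true;
    if (!this.loopPromise) return "stopped";
    if (timeoutMs === undefined) {
      await this.loopPromise; // no timeout: await the loop directly
      return "stopped";
    }
    // Timeout path: warn-and-return behaviour when a handler is hung.
    const timeout = new Promise<"timed_out">((r) => setTimeout(() => r("timed_out"), timeoutMs));
    return Promise.race([this.loopPromise.then(() => "stopped" as const), timeout]);
  }
}
```

Compared with polling `isRunning` on an interval, racing the stored promise resolves stop() the instant the loop actually exits, with no residual timer churn.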
The previous chunking advanced the cursor by sliceSize each tick, producing fixed disjoint slices like [0..3], [4..7], [0..3], ... With that pattern env_0 was always at slice position 0 (first into pLimit) and env_3 always at position 3 (last), reinstating the head-of-line bias that rotation was meant to prevent. Advance the cursor by 1 instead. Slices now overlap across consecutive ticks (e.g. [0..3], [1..4], [2..5], ...) so every env reaches every slot position 0..sliceSize-1 across one envs.length-tick cycle. Drainage rate per env is unchanged: each env still appears in exactly sliceSize of every envs.length ticks. A new regression test pins the fairness property by asserting each env touches every slot at least once per cycle.
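The advance-by-1 slicing can be sketched as a pure function over the env list. The function name and shape are illustrative of the behaviour described, not the drainer's real code:

```typescript
// Per-tick env slice with a cursor that advances by 1, so consecutive slices
// overlap and every env rotates through every slot position over one cycle.
function takeSlice<T>(
  envs: T[],
  cursor: number,
  maxPerTick: number
): { slice: T[]; nextCursor: number } {
  if (envs.length === 0) return { slice: [], nextCursor: 0 };
  const size = Math.min(maxPerTick, envs.length);
  const slice: T[] = [];
  for (let i = 0; i < size; i++) {
    slice.push(envs[(cursor + i) % envs.length]); // wrap around the set
  }
  return { slice, nextCursor: (cursor + 1) % envs.length };
}
```

With 5 envs and a cap of 3, successive ticks yield [a,b,c], [b,c,d], [c,d,e], [d,e,a], [e,a,b]: each env leads a slice exactly once per 5-tick cycle, and each still appears in exactly 3 of every 5 ticks.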
…y envs

Adds a regression test that proves a light env (single buffered entry) is drained within (envs.length - sliceSize + 1) ticks regardless of how many entries the heavy envs have queued. The test uses a stub buffer whose listEnvs/pop pair mirrors the production atomic-Lua semantic: an env disappears from listEnvs the moment its queue empties, so the light env exits the rotation as soon as it's popped — while the heavy envs stay in the rotation until their thousands of entries are drained. Together with the head-of-line fairness test this pins both fairness properties: (1) every env touches every slice slot per cycle (no within-slice bias), and (2) no env's drainage latency depends on the queue depth of other envs (no across-slice starvation).
Summary
- `trigger()` API calls during traffic spikes, with a per-env trip evaluator and a drainer ack-loop.
- `engine.trigger`. No customer-facing behaviour change.
- `mollifier.would_mollify`, `mollifier.buffered`, `mollifier.drained`, plus the `mollifier.decisions` counter.

Test plan

- `pnpm run test --filter @trigger.dev/redis-worker`
- `pnpm run test --filter webapp -- mollifier`
- `mollifier.buffered` + `mollifier.drained` log pairs with matching `runId`