feat(agents): add /goal slash command with token-budget enforcement #4552
kevin-dp wants to merge 22 commits into
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #4552 +/- ##
===========================================
- Coverage 72.77% 58.31% -14.47%
===========================================
Files 86 373 +287
Lines 9779 41150 +31371
Branches 2982 11681 +8699
===========================================
+ Hits 7117 23997 +16880
- Misses 2608 17078 +14470
- Partials 54 75 +21
Electric Agents Mobile Build
Local mobile checks ran for commit
The EAS Android preview build was skipped because the
CI failures investigation
Three test jobs went red on the first push. After investigation:
Fixed in this PR (commits
Not introduced by this PR (pre-existing on the base branch):
These reproduce on the base branch ( My diff to
36ccc20 to 2802169
a046b12 to 6ae1334
Claude Code Review
Summary
Reviewed the
Critical
1. Budget enforcement is gated on goal existence, not
Review follow-up: all findings addressed
One commit per finding, on top of the rebased branch:
Housekeeping note (base still containing #4502's commits) is inherent to the stacked-PR setup; GitHub will retarget when #4502 merges.
🤖 Generated with Claude Code
Lets the user set a session-scoped objective with an optional token cap. Horton works autonomously toward the goal and stops when it calls `mark_goal_complete` or when the run exceeds the budget. Cap is enforced mid-run via an `onStepEnd` hook on the outbound bridge so the abort fires within a step, not after the agent decides to stop.
- `/goal set "..." [--tokens N|unlimited]` (default 50k)
- `/goal show | clear | complete`
- One goal per session, persisted as a `kind: 'goal'` manifest entry; resumes across desktop restarts via the existing Electric sync.
- New `ctx.replyText` helper synthesizes a complete runs+texts sequence so slash-command responses and the budget-limit notice render as ordinary agent replies.
- `AgentHandle.run` gains an optional `abortSignal`, combined with the runtime's `runSignal` so a budget abort can co-exist with SIGINT-style user aborts.
- State-changing `/goal` commands typed mid-run also signal SIGINT so the prior run interrupts instead of finishing old work first.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
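The mid-run cap described in this commit can be pictured with a small sketch. Everything here is hypothetical and written only for illustration: `StepUsage`, `createBudgetTracker`, and `combineSignals` are assumed names, not the PR's actual code, and the real hook payload may carry different fields.

```typescript
// Hypothetical per-step usage shape; the real onStepEnd payload may differ.
interface StepUsage {
  inputTokens: number;
  outputTokens: number;
}

interface BudgetTracker {
  signal: AbortSignal;
  onStepEnd: (usage: StepUsage) => void;
  tokensUsed: () => number;
}

// Accumulates per-step usage and aborts once the budget is reached.
// A `null` budget stands in for an unlimited goal (an assumption of this sketch).
function createBudgetTracker(tokenBudget: number | null): BudgetTracker {
  const controller = new AbortController();
  let used = 0;
  return {
    signal: controller.signal,
    onStepEnd(usage: StepUsage): void {
      used += usage.inputTokens + usage.outputTokens;
      if (tokenBudget !== null && used >= tokenBudget && !controller.signal.aborted) {
        controller.abort();
      }
    },
    tokensUsed: () => used,
  };
}

// Returns a signal that aborts as soon as any of the given signals aborts,
// so a budget abort and a user abort can share one run.
function combineSignals(...signals: AbortSignal[]): AbortSignal {
  const combined = new AbortController();
  for (const s of signals) {
    if (s.aborted) {
      combined.abort();
      break;
    }
    s.addEventListener("abort", () => combined.abort(), { once: true });
  }
  return combined.signal;
}
```

In this sketch the agent run would receive `combineSignals(tracker.signal, userAbort.signal)`, so either source can stop the run without knowing about the other.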
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The assistantHandler now reads the active goal up front and wires budget enforcement via ctx.updateGoalUsage, so the tool-composition test stub needs the goal-related methods present. Stubs to a no-goal state so the captured agent config reflects the tool path only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same root cause as the previous commit — the assistantHandler now calls ctx.getGoal at the top, so any test that exercises the handler through a stubbed context needs those methods present. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review #1 (critical): enforcement was gated on goal *existence*, so a budget_limited goal kept tripping on every subsequent chat turn (abort + stop message until /goal clear), and a complete goal kept accumulating tokensUsed from unrelated runs — eventually flipping back to budget_limited. Derive one `enforcedGoal` (status === 'active') and use it for the prompt info, the onStepEnd hook, and the post-run usage write. Adds gating regression tests for all four states. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
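A minimal sketch of the gating idea, assuming a simplified goal shape. The `Goal` interface and `deriveEnforcedGoal` helper are illustrative names only; the commit itself just says one `enforcedGoal` is derived from `status === 'active'`.

```typescript
type GoalStatus = "active" | "complete" | "budget_limited";

// Simplified, hypothetical goal shape for illustration.
interface Goal {
  objective: string;
  status: GoalStatus;
  tokensUsed: number;
}

// Only an active goal takes part in prompt info, the step hook, and usage writes.
function deriveEnforcedGoal(goal: Goal | undefined): Goal | undefined {
  return goal?.status === "active" ? goal : undefined;
}
```

With a single derived value, a `budget_limited` or `complete` goal is treated the same as no goal for enforcement purposes, which is the behaviour the regression tests are described as covering.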
… seed Review #2 (critical): ctx.replyText/ctx.recordRun seeded their run-N counter from runs.toArray, but writeEvent has no synchronous local apply — events only land in the collection after a round-trip. A synthetic reply written right after agent.run (e.g. the budget-stop notice) could reuse a run-N key the bridge had just allocated. New allocateRunKey consults and advances the bridge's shared id-seed cache (keyed by the runs collection id) with a caller-local floor for monotonicity when the collection lags. Test harness now uses unique collection ids per case, matching production where every entity has its own runs collection. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
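One way to picture the allocation scheme is sketched below. This is an assumption-laden illustration, not the PR's `allocateRunKey`: the `idSeedCache` map, the `createRunKeyAllocator` factory, and the `countPersistedRuns` callback are all hypothetical stand-ins.

```typescript
// Hypothetical stand-in for the bridge's shared id-seed cache, keyed by collection id.
const idSeedCache = new Map<string, number>();

// Builds an allocator for one runs collection. `countPersistedRuns` models the
// possibly lagging local collection (an assumption of this sketch).
function createRunKeyAllocator(
  collectionId: string,
  countPersistedRuns: () => number,
): () => string {
  let localFloor = 0; // highest index this caller has handed out so far
  return function allocateRunKey(): string {
    const cached = idSeedCache.get(collectionId) ?? 0;
    // Take the highest index known from any source, then advance past it.
    const next = Math.max(cached, localFloor, countPersistedRuns()) + 1;
    idSeedCache.set(collectionId, next);
    localFloor = next;
    return `run-${next}`;
  };
}
```

The point of consulting the shared cache is that a key the bridge has just allocated is visible immediately, even though the corresponding event has not yet round-tripped into the collection.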
Review #3: the persisted input_tokens (correctly, for display) sums input + cacheRead + cacheWrite, but budgeting on that re-counts the entire conversation on every warm-cache step — a session with ~20k of context would exhaust the 50k default budget in 2-3 steps regardless of new work. The step-end hook now also carries `uncachedInput` (raw `usage.input`, falling back to the flat legacy counter for providers without cache columns), and Horton accumulates `uncachedInput + output` toward the budget. Display semantics are unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
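The arithmetic behind this finding can be checked with a short sketch. The step figures are made up for illustration (20k tokens of cached context, a few hundred new tokens per step), and `StepTokens`, `displayTotal`, `budgetTotal`, and `stepsUntilExhausted` are hypothetical names.

```typescript
interface StepTokens {
  input: number; // raw uncached input
  cacheRead: number;
  cacheWrite: number;
  output: number;
}

// Display-style total: every input column plus output.
const displayTotal = (s: StepTokens): number =>
  s.input + s.cacheRead + s.cacheWrite + s.output;

// Budget-style total as described in this commit: raw input plus output.
const budgetTotal = (s: StepTokens): number => s.input + s.output;

// Returns the 1-based step at which the budget is reached, or null if never.
function stepsUntilExhausted(
  steps: StepTokens[],
  budget: number,
  cost: (s: StepTokens) => number,
): number | null {
  let used = 0;
  for (let i = 0; i < steps.length; i++) {
    used += cost(steps[i]);
    if (used >= budget) return i + 1;
  }
  return null;
}

// Three warm-cache steps over roughly 20k tokens of existing context.
const warmSteps: StepTokens[] = Array.from({ length: 3 }, () => ({
  input: 500,
  cacheRead: 20_000,
  cacheWrite: 0,
  output: 700,
}));
```

Under these assumed numbers, budgeting on the display total hits a 50,000-token cap on the third step (21,200 per step), while the uncached figure has used only 3,600 tokens after the same three steps.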
Review #4: per-step updates went through writeEvent (live) while the end-of-wake refreshGoalUsage recomputed from possibly-stale collections and wrote via the staged wake-session transaction — which commits last and could clobber the live value. Horton's in-memory accumulator is the authority, so drop the fallback recompute entirely: refreshGoalUsage, the tokensAtCreation baseline it depended on, and sumStepTokens are all removed. updateGoalUsage (never-decrease, writeEvent-direct) is now the only usage writer. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
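The never-decrease rule for the single usage writer can be sketched as a pure function. The `GoalUsage` shape and `applyUsageUpdate` helper below are hypothetical; the real `updateGoalUsage` also performs the `writeEvent` call, which is left out here.

```typescript
type GoalStatus = "active" | "complete" | "budget_limited";

interface GoalUsage {
  tokensUsed: number;
  status: GoalStatus;
}

// Computes the next persisted usage, or null when nothing would change.
// tokensUsed is monotonic: a smaller incoming value is ignored.
function applyUsageUpdate(
  current: GoalUsage,
  tokensUsed: number,
  patch?: { status?: GoalStatus },
): GoalUsage | null {
  const nextTokens = Math.max(current.tokensUsed, tokensUsed);
  const nextStatus = patch?.status ?? current.status;
  if (nextTokens === current.tokensUsed && nextStatus === current.status) {
    return null; // no-op: skip the write entirely
  }
  return { tokensUsed: nextTokens, status: nextStatus };
}
```

Returning `null` for an unchanged value is one way to model skipping no-op writes; the actual implementation may signal this differently.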
Review #5: defined, exposed on HandlerContext, stubbed in tests — never called. Horton flips status via updateGoalUsage(..., { status: 'budget_limited' }). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Review #6: /goal worked but was invisible — no composer autocomplete, and serializeComposerInput flagged it unknown. Define GOAL_SLASH_COMMAND next to the parser and register it alongside the skill commands on the horton entity type. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Review #7: MessageInput hand-rolled the /goal prefix + subcommand parsing (hardcoding the `done` alias) because goal-command.ts wasn't exported from the /client entry. The parser is pure — export it and delegate, so UI abort behavior and runtime dispatch share one grammar. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Review #8: the tool's description promised the summary was recorded but execute() discarded it, and the system prompt told the model to use a summary when blocked. Add an optional summary field to the goal entry, thread it through markGoalComplete(summary?), echo it in the tool result, and surface it from /goal show. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Review #9: the goal kind alone used epoch-ms numbers, forcing `createdAt?: string | number` onto the flattened Manifest type while every other manifest kind uses ISO strings. Switch to ISO strings — the widening disappears. Also carries the completion summary through updateGoalUsage's rebuilt entry, which previously dropped it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Review suggestion: paused/blocked existed in the enum and CSS but no code path ever set them (and statusClass didn't even handle paused). Three real states remain: active, complete, budget_limited — which also makes statusClass exhaustive without a default arm. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
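A sketch of what an exhaustive `statusClass` without a default arm can look like once only three states remain. The CSS class names are invented for illustration and are not taken from the PR.

```typescript
type GoalStatus = "active" | "complete" | "budget_limited";

// No default arm: if a new status is added to the union, TypeScript reports
// that not all code paths return a value, which flags the missing case.
function statusClass(status: GoalStatus): string {
  switch (status) {
    case "active":
      return "goal-status-active"; // hypothetical class name
    case "complete":
      return "goal-status-complete"; // hypothetical class name
    case "budget_limited":
      return "goal-status-budget-limited"; // hypothetical class name
  }
}
```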
Review suggestion: three hand-rolled formatTokenCount copies (goal-command, horton, GoalBanner) diverged from TokenUsage's Intl-compact version — 12,500 rendered as "13k" in the banner and "12.5k" in the meta row of the same UI. One Intl-compact-based helper now lives in token-budget.ts (exported from index and /client) and all four call sites use it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
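A possible shape for a shared Intl-compact helper is sketched below. The options shown (one fraction digit, `en` locale) and the lowercasing step are assumptions made to match the "12.5k" rendering mentioned above; the helper in `token-budget.ts` may be configured differently.

```typescript
// Compact notation with at most one fraction digit, e.g. 12500 -> "12.5K".
const compactFormatter = new Intl.NumberFormat("en", {
  notation: "compact",
  maximumFractionDigits: 1,
});

// Lowercasing the suffix is an assumption of this sketch.
function formatTokenCount(tokens: number): string {
  return compactFormatter.format(tokens).toLowerCase();
}
```

Routing every call site through one formatter is what removes the "13k" versus "12.5k" mismatch: the difference came from rounding choices in the separate copies, not from the data.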
Review suggestion: the tool was registered unconditionally, so with no goal the model saw a tool whose only possible answer was "No active goal to mark complete." Tools are rebuilt per wake, so gate on ctx.getGoal()?.status === 'active' like the other conditional tools. Gating tests now also assert tool presence/absence. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Review test-gap note: the parser/dispatcher had coverage but the goal API itself had none. Covers setGoal default/unlimited budgets, same-objective carryover vs new-objective reset, updateGoalUsage never-decrease + status flip + no-op writes + the writeEvent live path + summary preservation, markGoalComplete summary trimming, and clear/get round-trips. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Rebase integration: main added observe-pg-sync-tool.test.ts, which calls createHortonTools with a minimal ctx stub. Tool composition is now goal-aware (mark_goal_complete only registers for an active goal), so the stub needs getGoal present. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
6427778 to 655cc46
Rebase integration bug: the native composer (#4533) sends a WireComposerInputPayload — raw text in `source` plus parsed `nodes` — instead of `{ text }`. extractWakeText only read `text`, so /goal messages fell through to the LLM, whose skills guidance made it reply "I don't have a /goal skill in my catalog". Read `source` as the fallback; regression tests cover both payload shapes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
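The fallback described here can be sketched as follows. Only the `text` and `source` field names come from the commit message; the function body and the loose `unknown` payload type are assumptions for illustration.

```typescript
// Reads the wake text from either payload shape:
//   legacy:   { text: "..." }
//   composer: { source: "...", nodes: [...] }
function extractWakeText(payload: unknown): string | undefined {
  if (typeof payload !== "object" || payload === null) return undefined;
  const candidate = payload as { text?: unknown; source?: unknown };
  if (typeof candidate.text === "string") return candidate.text;
  // Fallback for the native composer payload, which carries raw text in `source`.
  if (typeof candidate.source === "string") return candidate.source;
  return undefined;
}
```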
A goal mutation firing mid-run (e.g. the mark_goal_complete tool) read the local manifests collection — which lags live writeEvent writes by a stream round-trip — and persisted through the wake-session's staged transaction, which replays at end-of-wake. The stale snapshot landed after, and overwrote, the fresher per-step tokensUsed written live (observed: bar showed 3.3k after the counter had reached 5.7k). Route every goal mutation through a single ordered channel (direct writeEvent upserts when wired; the staged path remains only as a test fallback) and add an in-wake read-your-writes cache so same-wake reads always observe the latest write. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
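The read-your-writes idea can be illustrated with a small wrapper around a lagging read source. `InWakeGoalStore`, `readLagging`, and `writeLive` are hypothetical names; the real cache is wired into the goal API and `writeEvent`.

```typescript
// Wraps a lagging collection read with an in-wake cache so that reads issued
// after a write in the same wake always see that write.
class InWakeGoalStore<T> {
  private readonly readLagging: () => T | undefined;
  private readonly writeLive: (value: T) => void;
  private cached: { value: T } | null = null;

  constructor(readLagging: () => T | undefined, writeLive: (value: T) => void) {
    this.readLagging = readLagging;
    this.writeLive = writeLive;
  }

  get(): T | undefined {
    // Prefer the latest local write over the possibly stale collection snapshot.
    return this.cached ? this.cached.value : this.readLagging();
  }

  set(value: T): void {
    this.writeLive(value); // single ordered write channel
    this.cached = { value };
  }
}
```

In the observed bug, a read after a live write still returned the older snapshot; with a cache like this, the same-wake read returns the value that was just written.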
On cache-enabled providers the newly appended prompt tokens of a warm turn are reported as cacheWrite, with usage.input collapsing to ~0 — so the budget's "uncached input" side was effectively zero and the cap tracked output only. Include cacheWrite in the uncached-input figure the onStepEnd hook reports; cache reads stay excluded. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
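As a sketch of the adjusted accounting, assuming a usage object with optional cache columns (the `ProviderUsage` shape and `budgetedStepTokens` name are illustrative only):

```typescript
interface ProviderUsage {
  input: number;
  output: number;
  cacheRead?: number;
  cacheWrite?: number;
}

// Tokens counted toward the goal budget for one step:
// raw input + cache writes + output. Cache reads are excluded.
function budgetedStepTokens(usage: ProviderUsage): number {
  const uncachedInput = usage.input + (usage.cacheWrite ?? 0);
  return uncachedInput + usage.output;
}
```

For a warm turn reported as `input: 3`, `cacheRead: 20000`, `cacheWrite: 800`, `output: 400`, this counts 1,203 tokens, so newly appended prompt tokens are no longer invisible to the cap.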
Summary
Adds a `/goal` slash command to Horton sessions. The user sets an objective with an optional token cap; the agent works autonomously toward it and stops when `mark_goal_complete` is called or when the run exceeds the budget. Budget enforcement is mid-run (via an `onStepEnd` hook on the outbound bridge), not at run boundaries, so a goal with a small budget actually halts the agent within a step rather than letting it finish whatever giant tool-loop is in flight.
Behaviour
- `kind: 'goal'` entry on the `manifests` collection: resumes across desktop restarts via the existing Electric sync, no schema migration.
- `onStepEnd` hook on the outbound bridge surfaces per-step input/output tokens; Horton accumulates them and aborts `ctx.agent.run()` via an `AbortController` once `tokensUsed >= tokenBudget`. The cap covers the sum of input + output across all steps since the goal was set.
- `writeEvent` directly (not the wake-session's staged manifest transaction, which only commits at end-of-wake, too late for a long-running run).
- `mark_goal_complete` tool: registered on Horton's tool list; flips status to `complete`. The chat reply renders via the new `ctx.replyText` helper, which synthesizes a complete `runs + texts + textDeltas` sequence.
- `/goal` commands interrupt the active run: `/goal complete`, `/goal clear`, and `/goal set` typed while a run is in flight signal SIGINT alongside sending the message, so the prior run aborts instead of finishing its old work first. `/goal show` is read-only and never interrupts.
Plumbing
- `entity-schema.ts`: new `ManifestGoalEntryValue` (objective, status, tokenBudget, tokensUsed, tokensAtCreation, createdAt, updatedAt) added to the manifest discriminated union.
- `goal-api.ts` (new): `setGoal` / `clearGoal` / `getGoal` / `markGoalComplete` / `markGoalBudgetLimited` / `updateGoalUsage` / `refreshGoalUsage`. `updateGoalUsage` writes the manifest update directly through `writeEvent` for live UI; `refreshGoalUsage` never decreases `tokensUsed` so a stale collection sum can't clobber an authoritative in-memory value.
- `goal-command.ts` (new): `/goal` parser (`--tokens N|50k|1.2m|unlimited`, `--unlimited` flag, subcommand aliases `done` / `status`) and dispatcher.
- `tools/goal-tools.ts` (new): `createMarkGoalCompleteTool` exposes the completion signal to the LLM.
- `outbound-bridge.ts`: new optional `OutboundBridgeHooks.onStepEnd` callback, threaded through `pi-adapter` and the `AgentConfig` passed to `useAgent`.
- `context-factory.ts`: `AgentHandle.run` now accepts an optional `abortSignal` and combines it with the runtime's `runSignal`. New `ctx.replyText(text)` writes a complete runs+texts+textDeltas sequence so synthetic replies render in the chat. New goal-related methods exposed on `HandlerContext`.
- `horton.ts`: `tryHandleSlashCommand` intercepts `/goal *` before the LLM; `/goal set` enqueues a one-shot kickoff so the agent starts immediately; `assistantHandler` wires the budget-enforcing `onStepEnd`, aborts on overflow, and posts the explanation reply.
- `agents-server-ui`: new `GoalBanner` component above the timeline (objective + budget bar + status badge). `MessageInput` aborts the active run when a state-changing `/goal` command is submitted. `EntityTimeline` / `EntityContextDrawer` handle the new `goal` manifest kind.
Stacking
Branched off `kevin/agent-token-usage` (#4502) since this depends on the persisted per-step `input_tokens` / `output_tokens` columns from that PR. Base will retarget to `main` when #4502 merges.
Test plan
- `npx tsc --noEmit` clean in `agents-runtime`, `agents`, and `agents-server-ui`
- `packages/agents-runtime/test/goal-command.test.ts` (parser + dispatcher, including `--tokens` formats, `unlimited`, error paths, and all subcommands)
- `/goal set "..." --tokens 100k`: banner appears, agent kicks off, tokens tick up live during the run, goal completes when model calls `mark_goal_complete`
- `/goal set "..." --tokens 5k` on a large task: agent halts mid-run with `budget_limited` status and the explanation reply
- `/goal complete` typed while a run is active: prior run aborts via SIGINT, goal flips to complete
- `/goal set "B"` typed while goal A is running: prior run aborts, goal A replaced, agent kicks off on B
- `/goal clear` removes the banner
- `/goal show` reports current state (no abort)
Followups (deferred)
- `/goal set` kickoff message (`Start working toward the active goal now...`) is currently visible in the chat as a self-sent inbox message. Could be filtered from the LLM's view or styled differently.
- `mark_goal_complete` when it decides it's done.
🤖 Generated with Claude Code
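For readers unfamiliar with the `--tokens` formats listed under Plumbing (`N`, `50k`, `1.2m`, `unlimited`), a parser along these lines would accept them. This is an illustrative sketch, not the code in `goal-command.ts`: `parseTokenBudget` is a hypothetical name, and representing `unlimited` as `null` is an assumption.

```typescript
// Parses a --tokens value. Returns a token count, or null for "unlimited".
// Throws on anything that is not a positive number with an optional k/m suffix.
function parseTokenBudget(raw: string): number | null {
  const value = raw.trim().toLowerCase();
  if (value === "unlimited") return null;

  const match = /^(\d+(?:\.\d+)?)([km]?)$/.exec(value);
  if (!match) throw new Error(`Invalid token budget: ${raw}`);

  const multiplier = match[2] === "k" ? 1_000 : match[2] === "m" ? 1_000_000 : 1;
  const tokens = Math.round(Number(match[1]) * multiplier);
  if (tokens <= 0) throw new Error(`Token budget must be positive: ${raw}`);
  return tokens;
}
```

Rounding after the multiplication keeps fractional inputs such as `1.2m` on whole token counts.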