feat(safer-shell): autonomous blast-radius estimation for the shell tool#3223
feat(safer-shell): autonomous blast-radius estimation for the shell tool#3223Sayt-0 wants to merge 5 commits into
Conversation
When safer: true is set on a shell toolset, every shell command is checked against an embedded safety-pattern taxonomy before the normal approval flow. Matched destructive commands surface a blast-radius level in the confirmation UI; unmatched commands still warn with an unknown blast radius. The forced confirmation cannot be bypassed by --yolo or permission allow rules. Assisted-By: Claude
When safer mode's regex pass in assessDestructiveShellCommand returns
no match, an optional Judge is now consulted before falling back to
BlastRadiusUnknown. The judge is gated behind a small lexical-signal
trigger (wipe / destroy / drop / purge / nuke / obliterate / erase /
clobber / reset / annihilate) so a clean stream of inspection
commands never pays an LLM round-trip — only commands that look
possibly-destructive but don't match the embedded pattern set
escalate.
Behaviour:
- Pattern match -> deterministic safety wins, judge skipped.
- Pattern miss, no signal -> BlastRadiusUnknown fall-through, no LLM call.
- Pattern miss, signal -> 500 ms-cap LLM call; refined verdict wins.
- Judge timeout/error/nil -> BlastRadiusUnknown fall-through (fail-closed).
New files:
- pkg/tools/builtin/shell/judge.go
Judge interface, LexicalSignals, shouldConsultJudge gate,
ProviderJudge default impl wrapping provider.Provider with a
tight JSON-only system prompt and trailing-object verdict parser.
- pkg/tools/builtin/shell/judge_test.go
Gate semantics, the five validator branches, and parseJudgeVerdict
response-shape coverage (low downgrades to non-destructive, high
and medium escalate, unknown / blank / unparseable fall through).
Modified:
- pkg/tools/builtin/shell/shell.go
Added judge field on shellHandler, SetJudge wiring on ToolSet,
and the consultation between the regex miss and the existing
BlastRadiusUnknown return in ValidateShellToolCall.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… field
Plumbs the Judge interface added in the previous commit into the toolset
config so users can opt in to the residual classifier from agent YAML.
When safer_judge_model is set on a shell toolset with safer:true, the
runtime constructs a Provider from the "provider/model" string and
calls ToolSet.SetJudge(NewProviderJudge(p)) during toolset load. Unset
keeps the existing behaviour: pattern misses fall through to
BlastRadiusUnknown without an LLM round-trip.
toolsets:
- type: shell
safer: true
safer_judge_model: anthropic/claude-haiku-4-5 # opt-in residual judge
Provider construction at wire-time is best-effort: a failure (unknown
provider, missing credentials, etc.) is logged at WARN and the toolset
proceeds without the judge — safer mode still runs, just without the
refined verdict on pattern misses.
Changes:
- pkg/config/latest/types.go add SaferJudgeModel *string on Toolset
- pkg/config/latest/validate.go reject when non-shell, no safer:true, or empty
- pkg/config/toolset_validate_test.go five new cases
- agent-schema.json sibling block for safer_judge_model
- pkg/tools/builtin/shell/shell.go buildSaferJudge helper + CreateToolSet wiring
- examples/shell_safer.yaml demonstrate the field
End-to-end flow reachable from the example:
ANTHROPIC_API_KEY=... docker-agent run examples/shell_safer.yaml
> "Drop my local database, run \`bun run drop-db\`"
→ safer-mode regex misses → lexical-signal trips → ProviderJudge
classifies → confirmation dialog shows refined blast radius with
reason "Safer-mode LLM judge: ...".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a deterministic blast-radius estimator so safer mode derives a real tier for shell commands that miss the curated safety_patterns.json set, instead of a blanket BlastRadiusUnknown. The estimator is wired as a layer between the pattern pass and the residual judge in ValidateShellToolCall. It analyses command structure (a lexer for compound commands, redirections and substitution; recursion/force flags; target path scope; a read-only allowlist; a destructive-op table for rm/dd/mkfs/git/docker/...) and runs a bounded, read-only probe of the working directory to sharpen recursive deletes (a large tree scores higher than a missing target). The probe stays inside the working directory, is bounded, never follows symlinks, and never runs the command. Behavior change: confidently read-only commands (ls, cat, git status, docker ps, ...) are no longer force-gated and go through the normal approval flow; recognized destructive commands get a real tier; anything unresolved stays fail-closed (gated). The read-only verdict is refused for command-executing flags, code-injecting environment assignments, non-shell interpreters, wrapper smuggling, and command substitution. Update safer_test.go and judge_test.go for the new read-only behavior and document the layered estimation and safety posture in docs/tools/shell.
docker-agent
left a comment
There was a problem hiding this comment.
Assessment: 🟡 NEEDS ATTENTION
3 medium-severity issues found in the new safer-mode shell validator and residual LLM judge (all in new + lines). No false-safe bypass of the read-only fast path was found in the estimator itself.
| // | ||
| // Returns a non-destructive verdict only on explicit "low". | ||
| func parseJudgeVerdict(response string) *tools.ToolCallSafety { | ||
| start := strings.LastIndex(response, "{") |
There was a problem hiding this comment.
[MEDIUM] parseJudgeVerdict uses LastIndex — a multi-object LLM response can silently downgrade a destructive verdict to low
strings.LastIndex(response, "{") finds the last { in the string. If the LLM produces more than one JSON-like object in its response (e.g. a re-statement of the input command that happens to contain {}, followed by the actual verdict), the parser extracts the last object instead of the verdict. In the worst case:
response = `{"blast_radius":"high","reason":"..."} {"blast_radius":"low","reason":"safe"}`
// ^^^^^ first object — the real verdict ^^^^^ last object — parsed instead
json.Unmarshal(response[start:], &v) would succeed with blast_radius = "low", and parseJudgeVerdict would return a non-destructive ToolCallSafety{Destructive: false} — bypassing the safer-mode confirmation gate for a command the judge actually rated high.
The system prompt says "Output the JSON object and nothing else", but LLMs that produce chain-of-thought or preamble text (e.g. Haiku sometimes emits a brief sentence before the JSON) can violate this, and an adversarially crafted command name embedding a JSON snippet (e.g. a command named {"blast_radius":"low",...}; rm -rf /) could reliably trigger it.
Fix: use strings.Index (first {) instead of strings.LastIndex, or find the balanced JSON object starting from the first { (e.g. scan for the matching }).
| args = "{}" | ||
| } | ||
| if err := json.Unmarshal([]byte(args), ¶ms); err != nil { | ||
| return nil |
There was a problem hiding this comment.
[MEDIUM] Safer-mode validator returns nil (allow-through) when tool-call JSON arguments fail to parse
When json.Unmarshal cannot decode toolCall.Function.Arguments, ValidateShellToolCall returns nil. A nil return means "no safety concern" — the call proceeds through the safer-mode gate without human confirmation.
The fail-closed contract of safer mode requires that anything the validator cannot classify be gated, not allowed. An unparseable argument string is, by definition, unclassifiable. Returning nil here is the opposite of fail-closed.
Practical window: an LLM that sends structurally valid arguments will never trigger this, so the immediate risk is low. But the principle matters: any future code path (argument pre-processing, encoding quirks, partial writes) that produces malformed JSON would silently disable safer-mode gating for the shell tool.
Fix: return a conservative *tools.ToolCallSafety{Destructive: true, BlastRadius: tools.BlastRadiusUnknown, Reason: "safer: could not parse shell tool-call arguments"} instead of nil when unmarshal fails.
| // invocations; redo the wiring during agent reload rather than mid-run | ||
| // when avoidable. | ||
| func (t *ToolSet) SetJudge(j Judge) { | ||
| t.handler.judge = j |
There was a problem hiding this comment.
[MEDIUM] Data race on shellHandler.judge — unsynchronized write in SetJudge, unsynchronized read in ValidateShellToolCall
SetJudge writes t.handler.judge without holding any lock; ValidateShellToolCall reads h.judge on every tool-call validation without holding any lock either. The Go memory model requires explicit synchronization for concurrent field accesses where at least one is a write — this code has none.
The code comment acknowledges the race ("races with in-flight validator invocations") but treats it as advisory. In practice:
- A torn read of the two-word interface value
(type, data)can yield a non-nil type with a nil data pointer, causing a nil-pointer dereference insideh.judge.Refine(...). - The race detector will flag this as soon as any integration test wires a judge and then runs concurrent tool-call validation.
Fix: protect judge with an atomic.Value (stores the interface) or a sync/atomic.Pointer[Judge]; both give safe concurrent reads with rare writes at zero lock contention:
var judge atomic.Pointer[Judge]
func (t *ToolSet) SetJudge(j Judge) { t.handler.judge.Store(&j) }
// in ValidateShellToolCall:
if jp := h.judge.Load(); jp != nil && shouldConsultJudge(params.Cmd) {
// use (*jp)
}Address the three medium-severity review findings and the failing lint and build-image checks on the safer-shell stack. Review findings: - judge: parseJudgeVerdict now decodes the first JSON object (json.Decoder from the first brace) instead of strings.LastIndex, so a response that emits several JSON-like objects cannot append a lower "low" verdict to downgrade a genuine one and slip past the gate. - shell: ValidateShellToolCall fails closed (gates with BlastRadiusUnknown) when tool-call arguments cannot be decoded, instead of returning nil. - shell: the residual judge is held in an atomic.Pointer, removing the data race between SetJudge and concurrent ValidateShellToolCall reads. CI: - estimator, estimator_test, judge_test: clear the golangci-lint findings (dogsled, gocritic builtinShadow/ifElseChain/appendCombine, gofumpt, modernize stringscut, staticcheck SA9003/QF1001). - .dockerignore: allowlist safety_patterns.json so the go:embed in safer.go resolves inside the Docker build context. Add regression tests for the first-object verdict and the fail-closed unparseable-arguments path.
Depends on #3216 (safer mode + residual judge). This branch is stacked on that work, so until #3216 merges the diff also shows its three commits; only the final commit (the estimator) is new here. Review the estimator commit, or rebase once #3216 lands.
What
Safer mode gates destructive shell commands against a curated
safety_patterns.jsonset and falls back to a blanketunknownblast radius for anything not in it. This adds a deterministic estimator so the shell tool derives a real tier for those pattern misses, on its own.How it works
ValidateShellToolCallruns four layers, most precise first:unknownThe estimator derives the tier from program identity, recursion/force flags, target path scope (root, system dirs,
$HOME, inside vs outside the working directory, symlink-resolved), wildcard breadth, redirections, pipelines into interpreters,sh -cbodies, and command substitution. The filesystem probe stays inside the working directory, is bounded (5000 entries / depth 8), never follows symlinks, and never runs the command.Behavior change
lsunknownsafer_test.goandjudge_test.goare updated for the read-only-passes behavior.Safety posture
The read-only fast path is conservative and fail-closed. A command only skips the gate when it uses a known read-only program with no overwrite redirect, no command-executing flag (
rg --pre,git --upload-pack, ...), no code-injecting environment assignment (LD_PRELOAD=...), no pipe into an interpreter, and no command substitution. It is best-effort, not a sandbox; untrusted agents should still run under sandbox mode. The estimator was hardened against false-safes through several rounds of adversarial review, with the broad bypass classes closed by general rules; a narrow residual (an unaudited obscure command-executing flag on an allowlisted program) is documented and accepted.Tests
estimator_test.go: classification tables, filesystem-probe cases, lexer units, and a regression suite per review round.go test ./pkg/tools/builtin/shell/,go vet ./..., andgo build ./...pass.