feat(safer-shell): autonomous blast-radius estimation for the shell tool by Sayt-0 · Pull Request #3223 · docker/docker-agent

Sayt-0 · 2026-06-24T12:10:38Z

Depends on #3216 (safer mode + residual judge). This branch is stacked on that work, so until #3216 merges the diff also shows its three commits; only the final commit (the estimator) is new here. Review the estimator commit, or rebase once #3216 lands.

What

Safer mode gates destructive shell commands against a curated safety_patterns.json set and falls back to a blanket unknown blast radius for anything not in it. This adds a deterministic estimator so the shell tool derives a real tier for those pattern misses, on its own.

How it works

ValidateShellToolCall runs four layers, most precise first:

Layer	Role
Curated patterns	exact known destructive commands
Autonomous estimator (new)	read-only / destructive(tier) / uncertain, from command structure plus a bounded read-only working-directory probe
Residual LLM judge	refines genuinely ambiguous cases (unchanged trigger)
Fail-closed fallback	tier-enriched gate instead of a blanket `unknown`

The estimator derives the tier from program identity, recursion/force flags, target path scope (root, system dirs, $HOME, inside vs outside the working directory, symlink-resolved), wildcard breadth, redirections, pipelines into interpreters, sh -c bodies, and command substitution. The filesystem probe stays inside the working directory, is bounded (5000 entries / depth 8), never follows symlinks, and never runs the command.

Behavior change

Before	After
safer mode force-gated every shell command, even `ls`	confidently read-only commands are not force-gated and use the normal approval flow
pattern miss results in `unknown`	recognized destructive gets a real tier; unresolved stays a fail-closed gate

safer_test.go and judge_test.go are updated for the read-only-passes behavior.

Safety posture

The read-only fast path is conservative and fail-closed. A command only skips the gate when it uses a known read-only program with no overwrite redirect, no command-executing flag (rg --pre, git --upload-pack, ...), no code-injecting environment assignment (LD_PRELOAD=...), no pipe into an interpreter, and no command substitution. It is best-effort, not a sandbox; untrusted agents should still run under sandbox mode. The estimator was hardened against false-safes through several rounds of adversarial review, with the broad bypass classes closed by general rules; a narrow residual (an unaudited obscure command-executing flag on an allowlisted program) is documented and accepted.

Tests

estimator_test.go: classification tables, filesystem-probe cases, lexer units, and a regression suite per review round.
go test ./pkg/tools/builtin/shell/, go vet ./..., and go build ./... pass.

When safer: true is set on a shell toolset, every shell command is checked against an embedded safety-pattern taxonomy before the normal approval flow. Matched destructive commands surface a blast-radius level in the confirmation UI; unmatched commands still warn with an unknown blast radius. The forced confirmation cannot be bypassed by --yolo or permission allow rules. Assisted-By: Claude

When safer mode's regex pass in assessDestructiveShellCommand returns no match, an optional Judge is now consulted before falling back to BlastRadiusUnknown. The judge is gated behind a small lexical-signal trigger (wipe / destroy / drop / purge / nuke / obliterate / erase / clobber / reset / annihilate) so a clean stream of inspection commands never pays an LLM round-trip — only commands that look possibly-destructive but don't match the embedded pattern set escalate. Behaviour: - Pattern match -> deterministic safety wins, judge skipped. - Pattern miss, no signal -> BlastRadiusUnknown fall-through, no LLM call. - Pattern miss, signal -> 500 ms-cap LLM call; refined verdict wins. - Judge timeout/error/nil -> BlastRadiusUnknown fall-through (fail-closed). New files: - pkg/tools/builtin/shell/judge.go Judge interface, LexicalSignals, shouldConsultJudge gate, ProviderJudge default impl wrapping provider.Provider with a tight JSON-only system prompt and trailing-object verdict parser. - pkg/tools/builtin/shell/judge_test.go Gate semantics, the five validator branches, and parseJudgeVerdict response-shape coverage (low downgrades to non-destructive, high and medium escalate, unknown / blank / unparseable fall through). Modified: - pkg/tools/builtin/shell/shell.go Added judge field on shellHandler, SetJudge wiring on ToolSet, and the consultation between the regex miss and the existing BlastRadiusUnknown return in ValidateShellToolCall. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… field Plumbs the Judge interface added in the previous commit into the toolset config so users can opt in to the residual classifier from agent YAML. When safer_judge_model is set on a shell toolset with safer:true, the runtime constructs a Provider from the "provider/model" string and calls ToolSet.SetJudge(NewProviderJudge(p)) during toolset load. Unset keeps the existing behaviour: pattern misses fall through to BlastRadiusUnknown without an LLM round-trip. toolsets: - type: shell safer: true safer_judge_model: anthropic/claude-haiku-4-5 # opt-in residual judge Provider construction at wire-time is best-effort: a failure (unknown provider, missing credentials, etc.) is logged at WARN and the toolset proceeds without the judge — safer mode still runs, just without the refined verdict on pattern misses. Changes: - pkg/config/latest/types.go add SaferJudgeModel *string on Toolset - pkg/config/latest/validate.go reject when non-shell, no safer:true, or empty - pkg/config/toolset_validate_test.go five new cases - agent-schema.json sibling block for safer_judge_model - pkg/tools/builtin/shell/shell.go buildSaferJudge helper + CreateToolSet wiring - examples/shell_safer.yaml demonstrate the field End-to-end flow reachable from the example: ANTHROPIC_API_KEY=... docker-agent run examples/shell_safer.yaml > "Drop my local database, run \`bun run drop-db\`" → safer-mode regex misses → lexical-signal trips → ProviderJudge classifies → confirmation dialog shows refined blast radius with reason "Safer-mode LLM judge: ...". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a deterministic blast-radius estimator so safer mode derives a real tier for shell commands that miss the curated safety_patterns.json set, instead of a blanket BlastRadiusUnknown. The estimator is wired as a layer between the pattern pass and the residual judge in ValidateShellToolCall. It analyses command structure (a lexer for compound commands, redirections and substitution; recursion/force flags; target path scope; a read-only allowlist; a destructive-op table for rm/dd/mkfs/git/docker/...) and runs a bounded, read-only probe of the working directory to sharpen recursive deletes (a large tree scores higher than a missing target). The probe stays inside the working directory, is bounded, never follows symlinks, and never runs the command. Behavior change: confidently read-only commands (ls, cat, git status, docker ps, ...) are no longer force-gated and go through the normal approval flow; recognized destructive commands get a real tier; anything unresolved stays fail-closed (gated). The read-only verdict is refused for command-executing flags, code-injecting environment assignments, non-shell interpreters, wrapper smuggling, and command substitution. Update safer_test.go and judge_test.go for the new read-only behavior and document the layered estimation and safety posture in docs/tools/shell.

docker-agent

Assessment: 🟡 NEEDS ATTENTION

3 medium-severity issues found in the new safer-mode shell validator and residual LLM judge (all in new + lines). No false-safe bypass of the read-only fast path was found in the estimator itself.

docker-agent · 2026-06-24T12:18:23Z

+//
+// Returns a non-destructive verdict only on explicit "low".
+func parseJudgeVerdict(response string) *tools.ToolCallSafety {
+	start := strings.LastIndex(response, "{")


[MEDIUM] parseJudgeVerdict uses LastIndex — a multi-object LLM response can silently downgrade a destructive verdict to low

strings.LastIndex(response, "{") finds the last { in the string. If the LLM produces more than one JSON-like object in its response (e.g. a re-statement of the input command that happens to contain {}, followed by the actual verdict), the parser extracts the last object instead of the verdict. In the worst case:

response = `{"blast_radius":"high","reason":"..."} {"blast_radius":"low","reason":"safe"}` // ^^^^^ first object — the real verdict ^^^^^ last object — parsed instead

json.Unmarshal(response[start:], &v) would succeed with blast_radius = "low", and parseJudgeVerdict would return a non-destructive ToolCallSafety{Destructive: false} — bypassing the safer-mode confirmation gate for a command the judge actually rated high.

The system prompt says "Output the JSON object and nothing else", but LLMs that produce chain-of-thought or preamble text (e.g. Haiku sometimes emits a brief sentence before the JSON) can violate this, and an adversarially crafted command name embedding a JSON snippet (e.g. a command named {"blast_radius":"low",...}; rm -rf /) could reliably trigger it.

Fix: use strings.Index (first {) instead of strings.LastIndex, or find the balanced JSON object starting from the first { (e.g. scan for the matching }).

docker-agent · 2026-06-24T12:18:23Z

+		args = "{}"
+	}
+	if err := json.Unmarshal([]byte(args), &params); err != nil {
+		return nil


[MEDIUM] Safer-mode validator returns nil (allow-through) when tool-call JSON arguments fail to parse

When json.Unmarshal cannot decode toolCall.Function.Arguments, ValidateShellToolCall returns nil. A nil return means "no safety concern" — the call proceeds through the safer-mode gate without human confirmation.

The fail-closed contract of safer mode requires that anything the validator cannot classify be gated, not allowed. An unparseable argument string is, by definition, unclassifiable. Returning nil here is the opposite of fail-closed.

Practical window: an LLM that sends structurally valid arguments will never trigger this, so the immediate risk is low. But the principle matters: any future code path (argument pre-processing, encoding quirks, partial writes) that produces malformed JSON would silently disable safer-mode gating for the shell tool.

Fix: return a conservative *tools.ToolCallSafety{Destructive: true, BlastRadius: tools.BlastRadiusUnknown, Reason: "safer: could not parse shell tool-call arguments"} instead of nil when unmarshal fails.

docker-agent · 2026-06-24T12:18:23Z

+// invocations; redo the wiring during agent reload rather than mid-run
+// when avoidable.
+func (t *ToolSet) SetJudge(j Judge) {
+	t.handler.judge = j


[MEDIUM] Data race on shellHandler.judge — unsynchronized write in SetJudge, unsynchronized read in ValidateShellToolCall

SetJudge writes t.handler.judge without holding any lock; ValidateShellToolCall reads h.judge on every tool-call validation without holding any lock either. The Go memory model requires explicit synchronization for concurrent field accesses where at least one is a write — this code has none.

The code comment acknowledges the race ("races with in-flight validator invocations") but treats it as advisory. In practice:

A torn read of the two-word interface value (type, data) can yield a non-nil type with a nil data pointer, causing a nil-pointer dereference inside h.judge.Refine(...).

The race detector will flag this as soon as any integration test wires a judge and then runs concurrent tool-call validation.

Fix: protect judge with an atomic.Value (stores the interface) or a sync/atomic.Pointer[Judge]; both give safe concurrent reads with rare writes at zero lock contention:

var judge atomic.Pointer[Judge] func (t *ToolSet) SetJudge(j Judge) { t.handler.judge.Store(&j) } // in ValidateShellToolCall: if jp := h.judge.Load(); jp != nil && shouldConsultJudge(params.Cmd) { // use (*jp) }

Address the three medium-severity review findings and the failing lint and build-image checks on the safer-shell stack. Review findings: - judge: parseJudgeVerdict now decodes the first JSON object (json.Decoder from the first brace) instead of strings.LastIndex, so a response that emits several JSON-like objects cannot append a lower "low" verdict to downgrade a genuine one and slip past the gate. - shell: ValidateShellToolCall fails closed (gates with BlastRadiusUnknown) when tool-call arguments cannot be decoded, instead of returning nil. - shell: the residual judge is held in an atomic.Pointer, removing the data race between SetJudge and concurrent ValidateShellToolCall reads. CI: - estimator, estimator_test, judge_test: clear the golangci-lint findings (dogsled, gocritic builtinShadow/ifElseChain/appendCombine, gofumpt, modernize stringscut, staticcheck SA9003/QF1001). - .dockerignore: allowlist safety_patterns.json so the go:embed in safer.go resolves inside the Docker build context. Add regression tests for the first-object verdict and the fail-closed unparseable-arguments path.

dgageot and others added 4 commits June 24, 2026 09:23

Sayt-0 requested a review from a team as a code owner June 24, 2026 12:10

docker-agent reviewed Jun 24, 2026

View reviewed changes

aheritier added area/tools For features/issues/fixes related to the usage of built-in and MCP tools kind/feat PR adds a new feature (maps to feat: commit prefix) labels Jun 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(safer-shell): autonomous blast-radius estimation for the shell tool#3223

feat(safer-shell): autonomous blast-radius estimation for the shell tool#3223
Sayt-0 wants to merge 5 commits into
mainfrom
feat/shell-blast-radius-estimator

Sayt-0 commented Jun 24, 2026

Uh oh!

docker-agent left a comment

Uh oh!

docker-agent Jun 24, 2026

Uh oh!

docker-agent Jun 24, 2026

Uh oh!

docker-agent Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Conversation

Sayt-0 commented Jun 24, 2026

What

How it works

Behavior change

Safety posture

Tests

Uh oh!

docker-agent left a comment

Choose a reason for hiding this comment

Assessment: 🟡 NEEDS ATTENTION

Uh oh!

docker-agent Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

docker-agent Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

docker-agent Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants