Skip to content

feat(safer-shell): autonomous blast-radius estimation for the shell tool#3223

Open
Sayt-0 wants to merge 5 commits into
mainfrom
feat/shell-blast-radius-estimator
Open

feat(safer-shell): autonomous blast-radius estimation for the shell tool#3223
Sayt-0 wants to merge 5 commits into
mainfrom
feat/shell-blast-radius-estimator

Conversation

@Sayt-0

@Sayt-0 Sayt-0 commented Jun 24, 2026

Copy link
Copy Markdown
Member

Depends on #3216 (safer mode + residual judge). This branch is stacked on that work, so until #3216 merges the diff also shows its three commits; only the final commit (the estimator) is new here. Review the estimator commit, or rebase once #3216 lands.

What

Safer mode gates destructive shell commands against a curated safety_patterns.json set and falls back to a blanket unknown blast radius for anything not in it. This adds a deterministic estimator so the shell tool derives a real tier for those pattern misses, on its own.

How it works

ValidateShellToolCall runs four layers, most precise first:

Layer Role
Curated patterns exact known destructive commands
Autonomous estimator (new) read-only / destructive(tier) / uncertain, from command structure plus a bounded read-only working-directory probe
Residual LLM judge refines genuinely ambiguous cases (unchanged trigger)
Fail-closed fallback tier-enriched gate instead of a blanket unknown

The estimator derives the tier from program identity, recursion/force flags, target path scope (root, system dirs, $HOME, inside vs outside the working directory, symlink-resolved), wildcard breadth, redirections, pipelines into interpreters, sh -c bodies, and command substitution. The filesystem probe stays inside the working directory, is bounded (5000 entries / depth 8), never follows symlinks, and never runs the command.

Behavior change

Before After
safer mode force-gated every shell command, even ls confidently read-only commands are not force-gated and use the normal approval flow
pattern miss results in unknown recognized destructive gets a real tier; unresolved stays a fail-closed gate

safer_test.go and judge_test.go are updated for the read-only-passes behavior.

Safety posture

The read-only fast path is conservative and fail-closed. A command only skips the gate when it uses a known read-only program with no overwrite redirect, no command-executing flag (rg --pre, git --upload-pack, ...), no code-injecting environment assignment (LD_PRELOAD=...), no pipe into an interpreter, and no command substitution. It is best-effort, not a sandbox; untrusted agents should still run under sandbox mode. The estimator was hardened against false-safes through several rounds of adversarial review, with the broad bypass classes closed by general rules; a narrow residual (an unaudited obscure command-executing flag on an allowlisted program) is documented and accepted.

Tests

  • estimator_test.go: classification tables, filesystem-probe cases, lexer units, and a regression suite per review round.
  • go test ./pkg/tools/builtin/shell/, go vet ./..., and go build ./... pass.

dgageot and others added 4 commits June 24, 2026 09:23
When safer: true is set on a shell toolset, every shell command is checked
against an embedded safety-pattern taxonomy before the normal approval flow.
Matched destructive commands surface a blast-radius level in the confirmation
UI; unmatched commands still warn with an unknown blast radius. The forced
confirmation cannot be bypassed by --yolo or permission allow rules.

Assisted-By: Claude
When safer mode's regex pass in assessDestructiveShellCommand returns
no match, an optional Judge is now consulted before falling back to
BlastRadiusUnknown. The judge is gated behind a small lexical-signal
trigger (wipe / destroy / drop / purge / nuke / obliterate / erase /
clobber / reset / annihilate) so a clean stream of inspection
commands never pays an LLM round-trip — only commands that look
possibly-destructive but don't match the embedded pattern set
escalate.

Behaviour:

  - Pattern match           -> deterministic safety wins, judge skipped.
  - Pattern miss, no signal -> BlastRadiusUnknown fall-through, no LLM call.
  - Pattern miss, signal    -> 500 ms-cap LLM call; refined verdict wins.
  - Judge timeout/error/nil -> BlastRadiusUnknown fall-through (fail-closed).

New files:

  - pkg/tools/builtin/shell/judge.go
      Judge interface, LexicalSignals, shouldConsultJudge gate,
      ProviderJudge default impl wrapping provider.Provider with a
      tight JSON-only system prompt and trailing-object verdict parser.

  - pkg/tools/builtin/shell/judge_test.go
      Gate semantics, the five validator branches, and parseJudgeVerdict
      response-shape coverage (low downgrades to non-destructive, high
      and medium escalate, unknown / blank / unparseable fall through).

Modified:

  - pkg/tools/builtin/shell/shell.go
      Added judge field on shellHandler, SetJudge wiring on ToolSet,
      and the consultation between the regex miss and the existing
      BlastRadiusUnknown return in ValidateShellToolCall.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… field

Plumbs the Judge interface added in the previous commit into the toolset
config so users can opt in to the residual classifier from agent YAML.
When safer_judge_model is set on a shell toolset with safer:true, the
runtime constructs a Provider from the "provider/model" string and
calls ToolSet.SetJudge(NewProviderJudge(p)) during toolset load. Unset
keeps the existing behaviour: pattern misses fall through to
BlastRadiusUnknown without an LLM round-trip.

  toolsets:
    - type: shell
      safer: true
      safer_judge_model: anthropic/claude-haiku-4-5   # opt-in residual judge

Provider construction at wire-time is best-effort: a failure (unknown
provider, missing credentials, etc.) is logged at WARN and the toolset
proceeds without the judge — safer mode still runs, just without the
refined verdict on pattern misses.

Changes:
  - pkg/config/latest/types.go      add SaferJudgeModel *string on Toolset
  - pkg/config/latest/validate.go   reject when non-shell, no safer:true, or empty
  - pkg/config/toolset_validate_test.go five new cases
  - agent-schema.json               sibling block for safer_judge_model
  - pkg/tools/builtin/shell/shell.go buildSaferJudge helper + CreateToolSet wiring
  - examples/shell_safer.yaml       demonstrate the field

End-to-end flow reachable from the example:

  ANTHROPIC_API_KEY=... docker-agent run examples/shell_safer.yaml
  > "Drop my local database, run \`bun run drop-db\`"
  → safer-mode regex misses → lexical-signal trips → ProviderJudge
    classifies → confirmation dialog shows refined blast radius with
    reason "Safer-mode LLM judge: ...".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a deterministic blast-radius estimator so safer mode derives a real
tier for shell commands that miss the curated safety_patterns.json set,
instead of a blanket BlastRadiusUnknown. The estimator is wired as a layer
between the pattern pass and the residual judge in ValidateShellToolCall.

It analyses command structure (a lexer for compound commands, redirections
and substitution; recursion/force flags; target path scope; a read-only
allowlist; a destructive-op table for rm/dd/mkfs/git/docker/...) and runs a
bounded, read-only probe of the working directory to sharpen recursive
deletes (a large tree scores higher than a missing target). The probe stays
inside the working directory, is bounded, never follows symlinks, and never
runs the command.

Behavior change: confidently read-only commands (ls, cat, git status,
docker ps, ...) are no longer force-gated and go through the normal approval
flow; recognized destructive commands get a real tier; anything unresolved
stays fail-closed (gated). The read-only verdict is refused for
command-executing flags, code-injecting environment assignments, non-shell
interpreters, wrapper smuggling, and command substitution.

Update safer_test.go and judge_test.go for the new read-only behavior and
document the layered estimation and safety posture in docs/tools/shell.
@Sayt-0 Sayt-0 requested a review from a team as a code owner June 24, 2026 12:10

@docker-agent docker-agent left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🟡 NEEDS ATTENTION

3 medium-severity issues found in the new safer-mode shell validator and residual LLM judge (all in new + lines). No false-safe bypass of the read-only fast path was found in the estimator itself.

Comment thread pkg/tools/builtin/shell/judge.go Outdated
//
// Returns a non-destructive verdict only on explicit "low".
func parseJudgeVerdict(response string) *tools.ToolCallSafety {
start := strings.LastIndex(response, "{")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] parseJudgeVerdict uses LastIndex — a multi-object LLM response can silently downgrade a destructive verdict to low

strings.LastIndex(response, "{") finds the last { in the string. If the LLM produces more than one JSON-like object in its response (e.g. a re-statement of the input command that happens to contain {}, followed by the actual verdict), the parser extracts the last object instead of the verdict. In the worst case:

response = `{"blast_radius":"high","reason":"..."} {"blast_radius":"low","reason":"safe"}`
//           ^^^^^ first object — the real verdict     ^^^^^ last object — parsed instead

json.Unmarshal(response[start:], &v) would succeed with blast_radius = "low", and parseJudgeVerdict would return a non-destructive ToolCallSafety{Destructive: false} — bypassing the safer-mode confirmation gate for a command the judge actually rated high.

The system prompt says "Output the JSON object and nothing else", but LLMs that produce chain-of-thought or preamble text (e.g. Haiku sometimes emits a brief sentence before the JSON) can violate this, and an adversarially crafted command name embedding a JSON snippet (e.g. a command named {"blast_radius":"low",...}; rm -rf /) could reliably trigger it.

Fix: use strings.Index (first {) instead of strings.LastIndex, or find the balanced JSON object starting from the first { (e.g. scan for the matching }).

Comment thread pkg/tools/builtin/shell/shell.go Outdated
args = "{}"
}
if err := json.Unmarshal([]byte(args), &params); err != nil {
return nil

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] Safer-mode validator returns nil (allow-through) when tool-call JSON arguments fail to parse

When json.Unmarshal cannot decode toolCall.Function.Arguments, ValidateShellToolCall returns nil. A nil return means "no safety concern" — the call proceeds through the safer-mode gate without human confirmation.

The fail-closed contract of safer mode requires that anything the validator cannot classify be gated, not allowed. An unparseable argument string is, by definition, unclassifiable. Returning nil here is the opposite of fail-closed.

Practical window: an LLM that sends structurally valid arguments will never trigger this, so the immediate risk is low. But the principle matters: any future code path (argument pre-processing, encoding quirks, partial writes) that produces malformed JSON would silently disable safer-mode gating for the shell tool.

Fix: return a conservative *tools.ToolCallSafety{Destructive: true, BlastRadius: tools.BlastRadiusUnknown, Reason: "safer: could not parse shell tool-call arguments"} instead of nil when unmarshal fails.

Comment thread pkg/tools/builtin/shell/shell.go Outdated
// invocations; redo the wiring during agent reload rather than mid-run
// when avoidable.
func (t *ToolSet) SetJudge(j Judge) {
t.handler.judge = j

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] Data race on shellHandler.judge — unsynchronized write in SetJudge, unsynchronized read in ValidateShellToolCall

SetJudge writes t.handler.judge without holding any lock; ValidateShellToolCall reads h.judge on every tool-call validation without holding any lock either. The Go memory model requires explicit synchronization for concurrent field accesses where at least one is a write — this code has none.

The code comment acknowledges the race ("races with in-flight validator invocations") but treats it as advisory. In practice:

  • A torn read of the two-word interface value (type, data) can yield a non-nil type with a nil data pointer, causing a nil-pointer dereference inside h.judge.Refine(...).
  • The race detector will flag this as soon as any integration test wires a judge and then runs concurrent tool-call validation.

Fix: protect judge with an atomic.Value (stores the interface) or a sync/atomic.Pointer[Judge]; both give safe concurrent reads with rare writes at zero lock contention:

var judge atomic.Pointer[Judge]

func (t *ToolSet) SetJudge(j Judge) { t.handler.judge.Store(&j) }

// in ValidateShellToolCall:
if jp := h.judge.Load(); jp != nil && shouldConsultJudge(params.Cmd) {
    // use (*jp)
}

@aheritier aheritier added area/tools For features/issues/fixes related to the usage of built-in and MCP tools kind/feat PR adds a new feature (maps to feat: commit prefix) labels Jun 24, 2026
Address the three medium-severity review findings and the failing lint
and build-image checks on the safer-shell stack.

Review findings:
- judge: parseJudgeVerdict now decodes the first JSON object (json.Decoder
  from the first brace) instead of strings.LastIndex, so a response that
  emits several JSON-like objects cannot append a lower "low" verdict to
  downgrade a genuine one and slip past the gate.
- shell: ValidateShellToolCall fails closed (gates with BlastRadiusUnknown)
  when tool-call arguments cannot be decoded, instead of returning nil.
- shell: the residual judge is held in an atomic.Pointer, removing the data
  race between SetJudge and concurrent ValidateShellToolCall reads.

CI:
- estimator, estimator_test, judge_test: clear the golangci-lint findings
  (dogsled, gocritic builtinShadow/ifElseChain/appendCombine, gofumpt,
  modernize stringscut, staticcheck SA9003/QF1001).
- .dockerignore: allowlist safety_patterns.json so the go:embed in safer.go
  resolves inside the Docker build context.

Add regression tests for the first-object verdict and the fail-closed
unparseable-arguments path.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/tools For features/issues/fixes related to the usage of built-in and MCP tools kind/feat PR adds a new feature (maps to feat: commit prefix)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants