feat(aiagents): poll backend for hook enable/disable state #74

Merged
ashishkurmi merged 2 commits into step-security:main from swarit-stepsecurity:swarit/feat/wt/ai-hooks-integration
May 16, 2026

Member

swarit-stepsecurity commented on May 14, 2026

Summary

Adds server-driven enable/disable of AI agent hooks. A toggle in the
Dev Machine Guard console writes desired state to agent-api; this
branch makes DMG observe and act on that state during its existing
scheduled telemetry tick, with no daemon, no websocket, and no new
configuration. The design lets two transports converge through a
single on-disk cache: today only the scheduled poll writes it, and a
WS client could write the same file later without touching the hot
path.

What this PR adds

A new internal/aiagents/state package and the wire-up to call it.

internal/aiagents/state/
  doc.go         package overview
  state.go       State / Hooks types, SchemaVersion, Source* constants, Default()
  cache.go       ~/.stepsecurity/hooks-state.json read/write (atomic, mode 0600)
  fetcher.go     HTTPFetcher → GET /developer-mdm-agent/features
  reconciler.go  Reconciler.Reconcile = fetch → cache → idempotent Install/Uninstall
  + unit tests for each

Plus:

  • Hot-path cache check in internal/aiagents/hook/runtime.go
    11 new lines that read hooks-state.json and short-circuit Run
    to an allow response when enabled=false. Missing or unparseable
    cache reads as Default() (enabled) so first-run after install
    keeps working. New test TestRunHonorsDisabledStateCache covers
    the disabled path and asserts no upload / no enrichment / no
    error-log entries.

  • runHookStateReconcile in cmd/.../main.go — invoked after
    telemetry.Run in both send-telemetry and install paths.
    Silent no-op in community mode (ingest.Snapshot returns false).
    Failures are logged via cli.AppendError but never crash main.
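
For reviewers, the hot-path check described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in runtime.go: the type shapes, JSON field names (`schema_version`, `hooks.enabled`), and the function names `ParseState`, `ReadCache`, and `ShouldShortCircuit` are assumptions made for the example. Only the behavior (missing or unparseable cache reads as enabled, disabled cache short-circuits) comes from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Hooks and State are hypothetical stand-ins for the types in state.go.
// The JSON field names are assumed for this sketch.
type Hooks struct {
	Enabled bool `json:"enabled"`
}

type State struct {
	SchemaVersion int   `json:"schema_version"`
	Hooks         Hooks `json:"hooks"`
}

// Default returns the fallback state: hooks enabled.
func Default() State {
	return State{SchemaVersion: 1, Hooks: Hooks{Enabled: true}}
}

// ParseState decodes cache bytes. Unparseable input yields Default().
// Decoding starts from Default(), so fields absent from the JSON keep
// their default values instead of silently becoming "disabled".
func ParseState(data []byte) State {
	s := Default()
	if err := json.Unmarshal(data, &s); err != nil {
		return Default()
	}
	return s
}

// ReadCache returns Default() when the file cannot be read.
func ReadCache(path string) State {
	data, err := os.ReadFile(path)
	if err != nil {
		return Default()
	}
	return ParseState(data)
}

// ShouldShortCircuit reports whether the hook should skip all work and
// emit the allow response immediately.
func ShouldShortCircuit(cachePath string) bool {
	return !ReadCache(cachePath).Hooks.Enabled
}

func main() {
	// A path that does not exist: the hook keeps running as enabled.
	fmt.Println(ShouldShortCircuit("/nonexistent/hooks-state.json"))
	// An explicit disabled state: the hook would short-circuit.
	fmt.Println(!ParseState([]byte(`{"hooks":{"enabled":false}}`)).Hooks.Enabled)
}
```

Decoding on top of `Default()` is one way to keep an empty or partial cache file from being read as a disable signal; whether the real package does this is not stated in the PR.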

Architecture

  UI toggle ──PUT──▶  agent-api  ──persist──▶  DDB
                          │
                          │ GET (Bearer TenantAPIKey, device_id)
                          │
                       DMG (scheduled telemetry tick)
                          │
                          ├─ writes  ~/.stepsecurity/hooks-state.json
                          └─ calls   RunInstall / RunUninstall (idempotent)

  Each `_hook` invocation:
       reads  ~/.stepsecurity/hooks-state.json
       short-circuits to allow if disabled

The cache file is the single source of truth for the hot path. Both
the polling reconciler (this PR) and any future WebSocket transport
are expected to converge on the same file, so the hot path never
has to know which transport is active.
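
Because every writer converges on one file that the hot path reads on each invocation, the write needs to be atomic so a hook never observes a half-written cache. A minimal sketch of the usual temp-file-plus-rename approach, assuming hypothetical `State` fields and a hypothetical `"poll"` source value (cache.go may differ in detail):

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Hooks and State are hypothetical, trimmed-down versions of the cached
// document; the field names are assumptions for this sketch.
type Hooks struct {
	Enabled bool `json:"enabled"`
}

type State struct {
	SchemaVersion int    `json:"schema_version"`
	Source        string `json:"source"` // which transport wrote the file
	Hooks         Hooks  `json:"hooks"`
}

// WriteCache writes s to path atomically: it writes a temp file in the
// same directory, syncs it, then renames it over the destination.
func WriteCache(path string, s State) error {
	data, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		return err
	}
	dir := filepath.Dir(path)
	if err := os.MkdirAll(dir, 0o700); err != nil {
		return err
	}
	// os.CreateTemp creates the file with mode 0600.
	tmp, err := os.CreateTemp(dir, ".hooks-state-*.tmp")
	if err != nil {
		return err
	}
	// Cleans up on failure; fails harmlessly after a successful rename.
	defer os.Remove(tmp.Name())
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// ReadCache loads the state back, returning any read or decode error.
func ReadCache(path string) (State, error) {
	var s State
	data, err := os.ReadFile(path)
	if err != nil {
		return s, err
	}
	err = json.Unmarshal(data, &s)
	return s, err
}

func main() {
	dir, err := os.MkdirTemp("", "dmg-cache")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "hooks-state.json")
	want := State{SchemaVersion: 1, Source: "poll", Hooks: Hooks{Enabled: false}}
	if err := WriteCache(path, want); err != nil {
		panic(err)
	}
	got, err := ReadCache(path)
	fmt.Println(got.Hooks.Enabled, got.Source, err)
}
```

Creating the temp file in the destination directory keeps the rename on one filesystem, which is what makes it atomic on POSIX systems.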

Design decisions

| Question | Choice | Why |
| --- | --- | --- |
| Cache missing | Hot path runs enabled | Settings entry exists ⇒ someone installed it deliberately. Default-deny would silently break first-run before reconciler ticks. |
| API unreachable during reconcile | Keep prior cache, no settings change | Don't flap. Last-known-good wins. |
| Cache says enabled, settings missing | Reconciler reinstalls | Admin policy is authoritative. |
| Cache says disabled, settings present | Reconciler uninstalls AND hot path short-circuits | Reconciler is the slow truth; hot-path check gives instant convergence in the gap between API toggle and next tick. |
| User runs hooks uninstall while backend says enabled | Reconciler reinstalls on next tick | Admin wins, intentional. |
| Granularity | Global (enabled: bool) for v1 | Schema reserves room for per-agent (hooks.per_agent: {claude-code, codex}). |
| Newly-registered device | Backend default enabled: false | UI must explicitly opt in. |
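
The reconciler rows of the table above can be read as a small pure function. The sketch below is illustrative only: `Decide`, `Decision`, and the `Action` values are hypothetical names, and the real Reconciler may simply call the idempotent RunInstall / RunUninstall every tick rather than inspecting whether settings are present.

```go
package main

import "fmt"

// Action is what the reconciler does to the agent settings files.
type Action string

const (
	ActionNone      Action = "none"
	ActionInstall   Action = "install"
	ActionUninstall Action = "uninstall"
)

// Decision is the outcome of one reconcile tick.
type Decision struct {
	WriteCache bool   // whether the cache file is rewritten
	Action     Action // what happens to the settings entries
}

// Decide maps the inputs of a tick to a decision.
//   - fetchOK is false when the API was unreachable.
//   - desiredEnabled is the backend's resolved state.
//   - settingsPresent is whether hook entries currently exist locally.
func Decide(fetchOK, desiredEnabled, settingsPresent bool) Decision {
	switch {
	case !fetchOK:
		// Last-known-good wins: keep prior cache, no settings change.
		return Decision{WriteCache: false, Action: ActionNone}
	case desiredEnabled && !settingsPresent:
		// Covers both a fresh enable and a manual local uninstall.
		return Decision{WriteCache: true, Action: ActionInstall}
	case !desiredEnabled && settingsPresent:
		return Decision{WriteCache: true, Action: ActionUninstall}
	default:
		// Already converged; only the cache is refreshed.
		return Decision{WriteCache: true, Action: ActionNone}
	}
}

func main() {
	fmt.Printf("%+v\n", Decide(false, true, false)) // API unreachable
	fmt.Printf("%+v\n", Decide(true, true, false))  // enabled, settings missing
	fmt.Printf("%+v\n", Decide(true, false, true))  // disabled, settings present
}
```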

Backend dependency

The endpoint we call already exists in the agent-api int environment:

GET /v1/:customer/developer-mdm-agent/features?device_id=<id>
Auth: Bearer <TenantAPIKey>
200:  {"features": {"ai_agents_hooks_install": {"enabled": bool}, ...}}

The feature key constant is the same on both sides:
ai_agents_hooks_install. The UI PUT path
(PUT /v1/:customer/developer-mdm/features) is also live; the console
PR adds the toggle UI on top of it.
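
A rough sketch of a client for the GET endpoint above, run against a local test server. The function names `FetchEnabled` and `ParseEnabled`, the parameter order, and the error text are assumptions for illustration; the URL shape, the Bearer auth header, the response body, and the "missing key means disabled" rule are taken from this description.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"time"
)

const featureKey = "ai_agents_hooks_install"

// featuresResponse models only the part of the body this sketch needs.
type featuresResponse struct {
	Features map[string]struct {
		Enabled bool `json:"enabled"`
	} `json:"features"`
}

// ParseEnabled extracts the flag. A missing key reads as disabled.
func ParseEnabled(body []byte) (bool, error) {
	var r featuresResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return false, err
	}
	return r.Features[featureKey].Enabled, nil
}

// FetchEnabled performs the GET and treats any non-200 status as an error,
// so the caller can leave its cache untouched.
func FetchEnabled(ctx context.Context, c *http.Client, baseURL, customer, deviceID, apiKey string) (bool, error) {
	u := fmt.Sprintf("%s/v1/%s/developer-mdm-agent/features?device_id=%s",
		baseURL, url.PathEscape(customer), url.QueryEscape(deviceID))
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
	if err != nil {
		return false, err
	}
	req.Header.Set("Authorization", "Bearer "+apiKey)
	resp, err := c.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("features: unexpected status %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return ParseEnabled(body)
}

func main() {
	// Stand-in for agent-api: rejects a wrong key, otherwise returns enabled.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != "Bearer test-key" {
			w.WriteHeader(http.StatusUnauthorized)
			return
		}
		fmt.Fprint(w, `{"features":{"ai_agents_hooks_install":{"enabled":true}}}`)
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	enabled, err := FetchEnabled(ctx, srv.Client(), srv.URL, "acme", "device-1", "test-key")
	fmt.Println(enabled, err)

	_, err = FetchEnabled(ctx, srv.Client(), srv.URL, "acme", "device-1", "wrong-key")
	fmt.Println(err != nil)
}
```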

Failure modes covered by tests

| Failure | Behavior | Test |
| --- | --- | --- |
| Cache file missing | hot path runs as enabled | cache_test.go TestReadMissingFileReturnsDefault |
| Cache file corrupt | hot path runs as enabled | cache_test.go TestReadMalformedReturnsDefault |
| Cache says disabled | hot path short-circuits: no upload, no enrich, no error log | runtime_test.go TestRunHonorsDisabledStateCache |
| API 5xx / 401 / timeout | reconciler error, cache untouched | fetcher_test.go |
| Reconcile fetch error | cache untouched, no install/uninstall, error propagated | reconciler_test.go TestReconcileFetchErrorPreservesCache |
| Install/uninstall non-zero exit | cache still reflects desired state, next tick retries | reconciler_test.go TestReconcileInstallFailureSurfacesError |
| Missing feature key in API response | treated as disabled (matches server baseline) | fetcher_test.go TestFetcherMissingKeyMeansDisabled |
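
The two reconciler rows above follow from the ordering fetch → cache write → install/uninstall. A minimal sketch with function-valued seams makes that ordering easy to exercise; the `Reconciler` field names and `ErrUnreachable` here are hypothetical and do not mirror the actual types in reconciler.go.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUnreachable is a placeholder error used by the demo and tests.
var ErrUnreachable = errors.New("api unreachable")

// Reconciler wires the three steps together through injectable functions,
// which keeps the ordering testable without a network or a filesystem.
type Reconciler struct {
	Fetch      func() (enabled bool, err error)
	WriteCache func(enabled bool) error
	Install    func() error
	Uninstall  func() error
}

// Reconcile runs one tick. A fetch error returns before the cache is
// touched. An install or uninstall error is returned after the cache has
// already been updated, so the next tick retries against the same state.
func (r Reconciler) Reconcile() error {
	enabled, err := r.Fetch()
	if err != nil {
		return fmt.Errorf("fetch: %w", err)
	}
	if err := r.WriteCache(enabled); err != nil {
		return fmt.Errorf("cache write: %w", err)
	}
	if enabled {
		if err := r.Install(); err != nil {
			return fmt.Errorf("install: %w", err)
		}
		return nil
	}
	if err := r.Uninstall(); err != nil {
		return fmt.Errorf("uninstall: %w", err)
	}
	return nil
}

func main() {
	cached := true // pretend the prior cache said "enabled"
	r := Reconciler{
		Fetch:      func() (bool, error) { return false, ErrUnreachable },
		WriteCache: func(enabled bool) error { cached = enabled; return nil },
		Install:    func() error { return nil },
		Uninstall:  func() error { return nil },
	}
	err := r.Reconcile()
	fmt.Println(err, cached) // the prior cache value survives the failed fetch
}
```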

Manual verification

End-to-end tested on a Fedora 42 EC2 VM against the int environment:

  • stepsecurity-dev-machine-guard install triggered initial telemetry + reconcile.
  • Systemd timer stepsecurity-dev-machine-guard.timer fired on the configured interval and reconciled.
  • Flipping the UI toggle off → next tick wrote enabled: false, reconciler called RunUninstall, hook entries removed from ~/.claude/settings.json and ~/.codex/hooks.json with .dmg-<stamp>.bak siblings.
  • Flipping the UI toggle on → next tick wrote enabled: true, reconciler called RunInstall, entries reappeared.
  • Synthetic disabled-cache invocation of _hook confirmed the hot-path short-circuit emits the allow response without uploading.
  • No new entries in ~/.stepsecurity/ai-agent-hook-errors.jsonl across the test session.

Files

12 files changed, 884 insertions(+)
new   internal/aiagents/state/{doc,state,cache,fetcher,reconciler}.go
new   internal/aiagents/state/{state,cache,fetcher,reconciler}_test.go
edit  internal/aiagents/hook/runtime.go         (+11 — cache check + short-circuit)
edit  internal/aiagents/hook/runtime_test.go    (+44 — TestRunHonorsDisabledStateCache + helper)
edit  cmd/stepsecurity-dev-machine-guard/main.go (+52 — runHookStateReconcile + wire-up)

Out of scope (explicit)

  • WebSocket transport. Architecture supports it (cache file is the seam) but no client ships in this PR.
  • Per-agent toggles. Schema reserves room; v1 ships global on/off.
  • Telemetry-response piggyback. Saves a round-trip but couples two response shapes. Revisit if telemetry latency becomes a complaint.
  • hooks status / hooks reconcile CLI verbs. Useful for debugging; follow-up.
  • last_reconciled_at per-device on telemetry. Would let the UI render real "synced 8 min ago" badges per device instead of a generic disclaimer. Worth doing; not in this PR.

Type of change

  • Enhancement

Testing

  • Tested on Linux (Fedora 42)
  • go test ./... green across all 24 packages
  • go vet ./... clean
  • gofmt -l . clean
  • Manual end-to-end against int (toggle UI → DMG converges)
  • No secrets or credentials included

swarit-stepsecurity force-pushed the swarit/feat/wt/ai-hooks-integration branch from e4bee2d to 8013f56 on May 14, 2026 06:32
Comment thread internal/aiagents/state/cache.go Fixed
swarit-stepsecurity force-pushed the swarit/feat/wt/ai-hooks-integration branch from 16fb9fa to a9e3bb1 on May 16, 2026 11:41
Adds a new internal/aiagents/state package and wires it into the
scheduled telemetry tick so a UI toggle on the agent-api side
converges to local install/uninstall on the next run.

- state package owns the cache file ~/.stepsecurity/hooks-state.json,
  the HTTP fetcher against /developer-mdm-agent/features, and the
  Reconciler that ties fetch → cache write → idempotent install or
  uninstall together.
- _hook hot path reads the cache before any work and short-circuits
  to the allow response when disabled. Missing or unparseable cache
  reads as enabled, so first-run after install keeps working.
- main.go runs the reconciler after telemetry.Run in send-telemetry
  and install paths; community mode (no enterprise config) is a
  silent no-op.

No agent-api changes needed: the existing feature key
ai_agents_hooks_install and the GET /developer-mdm-agent/features
endpoint already serve the resolved state.
swarit-stepsecurity force-pushed the swarit/feat/wt/ai-hooks-integration branch from a9e3bb1 to 1ed02f2 on May 16, 2026 11:52
ashishkurmi merged commit 6b34f69 into step-security:main on May 16, 2026
10 checks passed

3 participants