coder · DevelopmentCats · Jun 22, 2026
@@ -8,8 +8,10 @@ Every 6 hours, the scheduled workflow in this repo:
 1. Enumerates every skill in `coder/registry` (both the in-tree
    `.agents/skills/` format and the future external-sources format).
 2. Shallow-clones each source repo.
-3. Runs [NVIDIA SkillSpector](https://github.com/NVIDIA/SkillSpector) in
-   `--no-llm` static mode over the upstream content.
+3. Runs [NVIDIA SkillSpector](https://github.com/NVIDIA/SkillSpector) over
+   the upstream content. The scheduled scan runs SkillSpector's LLM
+   semantic pass when the workflow's LLM credential secret is
+   configured, and falls back to `--no-llm` static-only mode otherwise.
 4. Builds a per-skill verdict (`clean`, `suspicious`, `malicious`,
    `unknown`) from `risk_score` plus the thresholds in `config.yaml`.
 5. Builds the React SPA in `site/` and ships it together with
@@ -60,6 +62,28 @@ Vite's dev proxy (see `site/vite.config.ts`) forwards `latest.json`,
 app sees real scanner output without CORS shenanigans. SPA routes such
 as `/skills/coder/setup` stay client-side.
 
+## One-time setup on the repo
+
+Three things have to be configured once on the GitHub repo before the
+scheduled scan publishes a useful result:
+
+1. **Settings > Pages**: set source to "GitHub Actions". The
+   `publish-pages` job in `scan.yaml` will fail until this is set.
+2. **Settings > Actions**: workflow permissions "Read and write" so
+   `publish-release` can create the rolling `latest` release.
+3. **Settings > Secrets and variables > Actions > Secrets**: add the
+   LLM credential matching the provider in `config.yaml`'s
+   `scanners.skillspector.llm.provider`. For the default `anthropic`
+   provider this is `ANTHROPIC_API_KEY`, created at
+   [console.anthropic.com](https://console.anthropic.com). It is a
+   separate billing line from Coder usage because SkillSpector cannot
+   be routed through aibridge today (see `docs/CALIBRATION.md`).
+   Without the secret, the scan still runs but SkillSpector falls
+   back to `--no-llm` static-only mode and precision drops. See
+   `docs/CALIBRATION.md` for the measured before/after numbers. The
+   optional `SLACK_WEBHOOK_URL` secret enables the
+   `notify-slack-on-failure` job; without it that job is a no-op.
+
 ## Repo layout
 
 ```text
@@ -97,7 +121,12 @@ This scanner is data-driven. To run it against a different registry:
    "GitHub Actions").
 4. Set Actions workflow permissions to "Read and write" so the
    publish-release job can create releases.
-5. Enable Actions.
+5. To enable the LLM semantic pass, add the credential secret per
+   "One-time setup on the repo" above, and confirm that
+   `.github/workflows/scan.yaml` exports the secret into the
+   SkillSpector step. Static-only mode (without the secret) is the
+   default and works out of the box.
+6. Enable Actions.
 
 No source changes required for catalogue changes.
 
@@ -115,7 +144,8 @@ SkillSpector's `risk_score` (0-100) is the only input. The thresholds
 are aligned to SkillSpector's own `HIGH` and `CRITICAL` bands;
 [`docs/CALIBRATION.md`](./docs/CALIBRATION.md) walks through the
 evidence (SkillSpector source, the ClawHub paper, our in-tree
-catalogue) behind the chosen numbers.
+catalogue) behind the chosen numbers and the measured LLM-on-vs-off
+impact on the five in-tree skills.
 
 The architecture keeps room for additional scanners (gitleaks, Semgrep,
 VirusTotal Premium, etc.); adding one is a new module under `scanner/`,

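The verdict step described in the README hunks above (a `risk_score` between 0 and 100 mapped to `clean`, `suspicious`, `malicious`, or `unknown`) can be sketched as follows. This is a minimal illustration, not the scanner's actual module: the 51 and 81 cutoffs come from `docs/CALIBRATION.md`, while the inclusive comparisons, the `malicious` constant name, and the choice to return `unknown` for a missing score are assumptions.

```python
from typing import Optional

# Cutoffs quoted in docs/CALIBRATION.md (HIGH and CRITICAL bands).
SUSPICIOUS_RISK_SCORE = 51
# Assumption: the name of the upper cutoff is hypothetical; only the value 81 is documented.
MALICIOUS_RISK_SCORE = 81


def verdict_for(risk_score: Optional[float]) -> str:
    """Map a SkillSpector risk_score (0-100) to a verdict label.

    Assumptions: cutoffs are inclusive, and a missing score
    (for example, a scan that produced no result) maps to "unknown".
    """
    if risk_score is None:
        return "unknown"
    if risk_score >= MALICIOUS_RISK_SCORE:
        return "malicious"
    if risk_score >= SUSPICIOUS_RISK_SCORE:
        return "suspicious"
    return "clean"
```

Under these assumptions the documented `coder/setup` scores behave as the calibration table reports: `verdict_for(100)` gives `malicious` and `verdict_for(26)` gives `clean`.
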
@@ -39,8 +39,12 @@ scanners:
     # so a bumper bot lives outside the loop until the upstream
     # publishes to PyPI and the pin can move into pyproject.toml.
     pin: "skillspector @ git+https://github.com/NVIDIA/SkillSpector.git@2eb844780ab163f01468ecf142c40a2ec0fcaec0"
-    flags:
-      - "--no-llm"
+    # Empty so .github/workflows/scan.yaml can append --no-llm
+    # dynamically based on whether the LLM credential secret is set.
+    flags: []
+    llm:
+      provider: anthropic
+      model: "claude-sonnet-4-6"
 
 # Per-skill verdict policy. v1 has one input (SkillSpector risk_score).
 # When more scanners join the pipeline we add new threshold fields here

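The `flags: []` comment in the config hunk above says the workflow appends `--no-llm` when the LLM credential secret is not set. The PR does this in `.github/workflows/scan.yaml`; the Python below is only a sketch of the same decision, useful for reasoning about it locally. The `PROVIDER_SECRETS` table and `effective_flags` helper are hypothetical names, and only the `anthropic` to `ANTHROPIC_API_KEY` pairing is taken from the README.

```python
import os
from typing import List, Mapping, Sequence

# Assumption: only this pairing is documented in the README; other providers
# would need their own entries.
PROVIDER_SECRETS = {"anthropic": "ANTHROPIC_API_KEY"}


def effective_flags(
    base_flags: Sequence[str],
    provider: str,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the SkillSpector flags, adding --no-llm when no credential is set.

    An unset GitHub Actions secret reaches the step as an empty string,
    so empty and whitespace-only values count as "not configured".
    """
    secret_name = PROVIDER_SECRETS.get(provider)
    has_credential = bool(secret_name and env.get(secret_name, "").strip())
    flags = list(base_flags)
    if not has_credential and "--no-llm" not in flags:
        flags.append("--no-llm")
    return flags
```

With an empty environment, `effective_flags([], "anthropic", {})` falls back to static-only mode; with `ANTHROPIC_API_KEY` present the list stays empty and the LLM pass runs.
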
@@ -64,18 +64,18 @@ The current `coder/registry` in-tree catalogue contains five skills:
 `coder/coder-modules`, `coder/coder-templates`, `coder/modules`,
 `coder/templates`, and `coder/setup`. Under the chosen thresholds:
 
-| Skill                  | SkillSpector score | Verdict     |
-|------------------------|-------------------:|-------------|
-| `coder/coder-modules`  | 0                  | `clean`     |
-| `coder/coder-templates`| 0                  | `clean`     |
-| `coder/modules`        | 0                  | `clean`     |
-| `coder/templates`      | 10                 | `clean`     |
-| `coder/setup`          | 100                | `malicious` |
+| Skill                  | static score | LLM-mode score | static verdict | LLM-mode verdict |
+|------------------------|-------------:|---------------:|----------------|------------------|
+| `coder/coder-modules`  | 10           | 0              | `clean`        | `clean`          |
+| `coder/coder-templates`| 10           | 0              | `clean`        | `clean`          |
+| `coder/modules`        | 0            | 0              | `clean`        | `clean`          |
+| `coder/templates`      | 0            | 0              | `clean`        | `clean`          |
+| `coder/setup`          | 100          | 26             | `malicious`    | `clean`          |
 
 The previous thresholds (40/75) produced the same outcome for these
-five inputs. The change does not silence any signal that was firing
-today; it raises the bar that future skills must clear before being
-called out.
+five inputs under static-only mode. The change does not silence any
+signal that was firing today; it raises the bar that future skills
+must clear before being called out.
 
 ## Threshold choices
 
@@ -99,6 +99,99 @@ verdict:
   This avoids broadcasting the ~half-of-catalogue base rate that
   ClawHub measured.
 
+## LLM semantic pass
+
+SkillSpector ships a two-stage analyser: fast static rules (the 64
+patterns SkillSpector documents) followed by an optional LLM semantic
+pass. The LLM pass reads each finding's surrounding context, classifies
+intent, filters context-aware false positives, and writes a
+human-readable explanation that ships in the per-finding output.
+
+### Measured impact on the five in-tree skills
+
+These numbers were measured against `gpt-4.1-mini` through Coder's AI
+Gateway during development, before the provider swap described below.
+Methodology: run `skillspector scan` twice on each upstream skill (once
+with `--no-llm`, once with LLM mode on) and aggregate the per-skill
+results. Total catalogue-wide findings dropped from 25 to 2:
+
+| Skill                  | findings (static) | findings (LLM) | Δ        |
+|------------------------|------------------:|---------------:|----------|
+| `coder/coder-modules`  | 1                 | 0              | -1       |
+| `coder/coder-templates`| 1                 | 0              | -1       |
+| `coder/modules`        | 0                 | 0              | 0        |
+| `coder/setup`          | 23                | 2              | -21      |
+| `coder/templates`      | 0                 | 0              | 0        |
+| **TOTAL**              | **25**            | **2**          | **-23**  |
+
+`coder/setup`'s verdict moves from `malicious` (100) to `clean` (26).
+The LLM filtered all 23 static-only findings as context-aware false
+positives (the EA2 hits on safeguard prose, the MP2 hits on PNG
+assets, the SC2 hits on `curl coder.com/install.sh`, the PE3 hits on
+the skill's own scratch files, etc.) and surfaced 2 new MEDIUM
+findings (`SQP-2`) the static pass missed: the GitHub device-flow
+scripts write the OAuth token and session config to disk without a
+user-visible notification. Those 2 findings are real and minor; the
+cleanest fix is a one-line `echo` before each write in the upstream
+skill repo rather than any change here.
+
+**Model swap caveat**: production runs against `claude-sonnet-4-6`
+via the Anthropic API (see "Provider choice" below), not against
+`gpt-4.1-mini`. The 25 → 2 delta above measures SkillSpector's LLM
+semantic pass *as a capability*; absolute counts may shift one or two
+either way under Claude because the two models filter false positives
+slightly differently. The verdict-band outcomes (`coder/setup` flips
+malicious → clean, every other in-tree skill stays clean) are robust
+to that drift: the static scores of the four other skills (10 at most)
+are well below the `suspicious_risk_score: 51` cutoff to begin with, so
+even an LLM that filters nothing leaves them clean. Recalibration against
+Claude is a 30-minute follow-up PR once the secret is wired in and
+the first production scan lands; this doc gets the real numbers then.
+
+### Provider choice and the workflow gap
+
+The scheduled scan runs LLM mode when the workflow's chosen credential
+secret is configured. The fallback to `--no-llm` is automatic when the
+secret is missing, so an unset secret on a fresh fork degrades the
+scan rather than breaking it.
+
+Provider is `anthropic` against `api.anthropic.com` directly, model
+`claude-sonnet-4-6`. The Anthropic API key is on a separate billing
+line from Coder usage because SkillSpector cannot be routed through
+Coder's AI Gateway today:
+
+- aibridge does proxy Claude under its `/anthropic` path, but only in
+  Anthropic's native `/v1/messages` shape.
+- SkillSpector pipes every provider through
+  `langchain_openai.ChatOpenAI`, which speaks OpenAI's
+  `/v1/chat/completions` shape.
+- aibridge does not mount `/v1/chat/completions` on its `/anthropic`
+  path (verified: `route not supported`).
+- SkillSpector's `anthropic` provider also hardcodes
+  `https://api.anthropic.com/v1/` in `providers/anthropic/provider.py`
+  and ignores `ANTHROPIC_BASE_URL`, so even if aibridge did expose the
+  OpenAI-compat route on its Anthropic path, an env-only swap would
+  not steer SkillSpector at it.
+
+Using `openai` against aibridge with `gpt-4.1-mini` is a viable
+alternative (and is what the calibration table above was measured
+against). The trade-off is real: aibridge routing keeps inference
+spend on Coder's existing billing line and avoids a second vendor,
+but commits the scanner to whichever OpenAI-class model aibridge
+exposes rather than Claude. If aibridge later adds a Claude
+OpenAI-compat route on `/anthropic`, or SkillSpector gains a
+native-Anthropic integration, the provider line in `config.yaml`
+flips back without any workflow change.
+
+### How the LLM pass interacts with the verdict math
+
+The LLM pass does not affect the threshold math. SkillSpector's
+`risk_score` is still a 0-100 weighted sum of rule hits, and the
+51/81 cutoffs above still map directly to `HIGH` and `CRITICAL` bands.
+What changes is which findings reach the verdict: false positives the
+LLM filters out no longer contribute to the score. On the five in-tree
+skills, verdicts moved down or stayed the same with LLM mode on, not up.
+
 ## What we did not change (and why)
 
 - We did not raise `suspicious_risk_score` above `51`. SkillSpector
@@ -127,6 +220,11 @@ Re-run this analysis when any of:
   that shifts where its bands sit. The pinned commit in `config.yaml`
   protects us from drifting silently; a deliberate bump should walk
   through this doc.
+- The LLM model or provider changes (e.g., moving from
+  `claude-sonnet-4-6` to Opus, Fable, or to a non-Anthropic
+  provider). Different models filter differently; spot-check the
+  five in-tree skills before merging the provider swap and refresh
+  the table above.
 - We observe a real-world skill that lands in an obviously wrong
   bucket (false positive or false negative). Open a tracking issue,
   link it from this doc, and adjust with evidence in the next PR.
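
When the findings table in "Measured impact on the five in-tree skills" is refreshed after a model or provider change, the per-skill deltas and the totals row can be recomputed rather than edited by hand. A minimal sketch using the counts from that table; the `FINDINGS` dict and `summarize` helper are hypothetical names for illustration and are not part of the scanner.

```python
from typing import Dict, Tuple

# (static findings, LLM-mode findings) per skill, copied from the
# "Measured impact" table in docs/CALIBRATION.md.
FINDINGS: Dict[str, Tuple[int, int]] = {
    "coder/coder-modules": (1, 0),
    "coder/coder-templates": (1, 0),
    "coder/modules": (0, 0),
    "coder/setup": (23, 2),
    "coder/templates": (0, 0),
}


def summarize(
    findings: Dict[str, Tuple[int, int]],
) -> Tuple[Dict[str, int], Tuple[int, int, int]]:
    """Return per-skill deltas and (total static, total LLM, total delta)."""
    deltas = {skill: llm - static for skill, (static, llm) in findings.items()}
    total_static = sum(static for static, _ in findings.values())
    total_llm = sum(llm for _, llm in findings.values())
    return deltas, (total_static, total_llm, total_llm - total_static)


if __name__ == "__main__":
    deltas, totals = summarize(FINDINGS)
    for skill, delta in deltas.items():
        print(f"{skill}: {delta:+d}")
    print("TOTAL static=%d llm=%d delta=%+d" % totals)
```

Swapping in the counts from a fresh Claude run keeps the delta column and the totals row consistent with the per-skill numbers.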