
feat: AI eval framework for benchmarking MCP servers #2414

Merged
kodiakhq[bot] merged 25 commits into main from brandon/ai-evals
Jun 9, 2026

Conversation

@brandon-pereira

Member

Summary

Adds packages/hdx-eval — an eval framework for benchmarking AI agents against observability MCP servers. The framework generates deterministic synthetic telemetry with planted anomalies, spawns Claude Code as an SRE agent, records full trajectories, and grades answers using programmatic checks + LLM-as-judge.

Key Features

  • MCP-agnostic — compare any combination of MCPs (HyperDX vs ClickHouse, feature-branch vs main, two HyperDX instances, or any N-way comparison)
  • 5 scenarios covering error root-cause, latency spikes, noisy signal triage, segmented regression, and service health checks
  • Deterministic seeding — mulberry32 PRNG produces byte-identical data for fair comparison
  • Blinded LLM judging — brand terms and tool names are redacted so the judge cannot tell which MCP produced the answer
  • Baseline + challengers reporting model with delta columns
  • Web viewer for browsing comparison dashboards, per-scenario breakdowns, and individual run trajectories
  • Dual-slot setup for A/B comparison of two HyperDX branches running simultaneously
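
To illustrate the deterministic seeding mentioned above, here is a minimal sketch of the widely circulated mulberry32 generator. It is not copied from packages/hdx-eval; the function name and the way the seed is passed are assumptions, and the package's actual generator code may differ.

```typescript
// mulberry32: a small 32-bit seeded PRNG (sketch, not the package's code).
// The same seed always yields the same sequence of floats in [0, 1).
function mulberry32(seed: number): () => number {
  let state = seed >>> 0; // normalize the seed to an unsigned 32-bit integer
  return () => {
    state = (state + 0x6d2b79f5) | 0; // advance the state with wraparound
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // scale to [0, 1)
  };
}

// Two generators built from the same seed stay in lockstep, which is what
// lets every MCP under test see identical synthetic telemetry.
const runA = mulberry32(1234);
const runB = mulberry32(1234);
console.log(runA() === runB()); // true
```

Because all randomness flows from one integer seed, re-seeding a scenario for a second MCP can reproduce the same rows without storing them.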

What is included

  • packages/hdx-eval/ — full eval package (CLI, harness, generators, grading, reports, viewer)
  • packages/hdx-eval/README.md — comprehensive docs covering setup, config, scenarios, CLI reference, and scoring
  • .opencode/commands/eval-summary.md — eval analysis skill for reviewing results
  • AGENTS.md — minor addition documenting common utility locations to check before writing new functions

Usage

yarn workspace @hyperdx/hdx-eval dev setup-hyperdx
yarn workspace @hyperdx/hdx-eval dev seed error-root-cause --volume-factor 0.1
yarn workspace @hyperdx/hdx-eval dev run error-root-cause
yarn workspace @hyperdx/hdx-eval viewer

See packages/hdx-eval/README.md for full documentation.

- leverage kv rollup mvs
- allow claude access to read only in temp dir
- tweaks to sysprompt
The runs/ gitignore pattern was matching src/runs/ (source code) in
addition to the intended /runs/ (data directory). Anchor the pattern
so only the top-level runs/ directory is ignored, and commit the
three missing source files: instrument.ts, path.ts, store.ts.
Remove files that are dev-specific or experimental scratch work:
- ablation/ directory (REPORT.md, manifest.tsv)
- scripts/ablation.sh, ablation-report.ts
- scripts/compare-prompt-variants.sh, fast-eval.sh
- MCP_IMPROVEMENTS.md

Also clean up README references and ablation .gitignore patterns.
- Extract spreadTimestamp() and normalizeSeverityText() shared helpers
- Collapse 14 identical phase functions into streamLogPhase() generic
- Compact ground-truth programmatic checks to tuple format [id, weight, pattern, neg?]
- Delete dead checkoutEventLog export, unexport internal logfmtBody/jsonEventBody
- Generalize from fixed HyperDX+ClickHouse pair to config-driven MCP registry
- Add dual-slot eval setup docs for A/B branch comparison
- Add baseline+challengers reporting model with delta columns
- Expand README with MCP config reference, field tables, and examples
- Improve viewer with comparison dashboard and drill-down
- Update blinding to handle arbitrary brand terms per MCP
- Add --baseline, --ch-url, --no-grade, --no-judge CLI flags
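
The blinding step referenced in this commit (redacting arbitrary brand terms per MCP before the judge sees an answer) could be sketched roughly as follows. The function names, the placeholder string, and the case-insensitive matching are all assumptions for illustration; the real implementation in packages/hdx-eval is not shown in this PR conversation.

```typescript
// Escape regex metacharacters so each term is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: replace every brand term or tool name with a neutral
// placeholder. Longer terms are tried first so "hyperdx_search" is redacted
// as a whole rather than leaving "_search" behind.
function blindText(
  text: string,
  terms: string[],
  placeholder = "[REDACTED]",
): string {
  const alternatives = terms
    .filter((t) => t.length > 0)
    .sort((a, b) => b.length - a.length)
    .map(escapeRegExp);
  if (alternatives.length === 0) return text;
  const pattern = new RegExp(alternatives.join("|"), "gi");
  // A replacer function avoids "$" in the placeholder being interpreted.
  return text.replace(pattern, () => placeholder);
}

const answer = "Used HyperDX tool hyperdx_search on ClickHouse";
console.log(blindText(answer, ["HyperDX", "hyperdx_search", "ClickHouse"]));
// prints "Used [REDACTED] tool [REDACTED] on [REDACTED]"
```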
@changeset-bot

changeset-bot Bot commented Jun 3, 2026


🦋 Changeset detected

Latest commit: bfc75a1

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
| Name | Type |
| --- | --- |
| @hyperdx/hdx-eval | Minor |


@vercel

vercel Bot commented Jun 3, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| hyperdx-oss | Ready | Preview, Comment | Jun 9, 2026 8:22pm |
| hyperdx-storybook | Ready | Preview, Comment | Jun 9, 2026 8:22pm |


@github-actions

github-actions Bot commented Jun 3, 2026

Contributor

E2E Test Results

All tests passed • 197 passed • 3 skipped • 1249s

| Status | Count |
| --- | --- |
| ✅ Passed | 197 |
| ❌ Failed | 0 |
| ⚠️ Flaky | 5 |
| ⏭️ Skipped | 3 |

Tests ran across 4 shards in parallel.


- Add packages/hdx-eval workspace to knip.json
- Remove unused @hyperdx/common-utils dependency
- Remove export from internal-only symbols (SOURCE_TRACES_TABLE,
  SOURCE_LOGS_TABLE, DENIED_BUILT_IN_TOOLS_BASE, CONFIG_FILENAME,
  loadGradedPairs, instrumentRun, HyperdxConnection, MeResponse)
- Remove dead code (getScenarioGroundTruth, ensureConfigDir)
- Add minor changeset for @hyperdx/hdx-eval
- Fix judge error silently penalizing combined score by 60% (grade.ts)
- Fix path traversal in viewer /api/batches/:batch route (server.js)
- Fix off-by-one in background operation selection (latency-spike)
- Fix worker pool crash-on-error losing all in-flight results (cli.ts)
- Fix claudeSpawn timer leaks on spawn error and SIGTERM escalation
- Fix listRunsInBatch including .grade.json/.timing.json sidecars
- Remove unused innerHTML attribute from viewer el() helper (XSS vector)
- Bind viewer server to 127.0.0.1 instead of all interfaces
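
As an illustration of the path traversal fix listed above, a containment check along these lines is one common approach. This is a sketch under assumptions: the helper name `resolveInside` and the idea of resolving a batch name against a runs directory are hypothetical, and the actual fix in server.js may be implemented differently.

```typescript
import * as path from "node:path";

// Hypothetical helper: resolve an untrusted path segment (for example the
// :batch route parameter) against a base directory and return null if the
// result would escape that directory.
function resolveInside(baseDir: string, untrusted: string): string | null {
  const base = path.resolve(baseDir);
  const target = path.resolve(base, untrusted);
  const rel = path.relative(base, target);
  const escapes =
    rel === "" || // the base directory itself is not a valid child
    rel === ".." ||
    rel.startsWith(".." + path.sep) ||
    path.isAbsolute(rel); // different drive on Windows
  return escapes ? null : target;
}

console.log(resolveInside("/tmp/runs", "batch-1") !== null); // true
console.log(resolveInside("/tmp/runs", "../../etc/passwd")); // null
```

Comparing the relative path rather than using a plain string prefix check avoids accepting sibling directories such as `/tmp/runs-other`.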
@vercel vercel Bot temporarily deployed to Preview – hyperdx-storybook June 4, 2026 16:14 Inactive
…idation, temp cleanup, compression, type safety

- Fix normalizeSeverityText case bug: 'FATAL' now correctly returns 'ERROR'
- Add AbortSignal.timeout to all HyperDX API and health check fetch calls
- Add identifier validation in scenarioSlug to reject unsafe characters
- Clean up temp directories after subprocess exits (leaked API keys)
- Enable ClickHouse request compression for batch inserts
- Replace per-row buildResourceAttrs with pre-built pool in noisy-signals
- Narrow groundTruth type from unknown to Record<string, unknown>
- Update AGENTS.md to list hdx-eval as sixth package
- Remove dead pickSeverity import and ScenarioOutput type alias
- Replace inline require('fs') with top-level import in cli.ts
- Add anchorTime field to EvalConfig, auto-generated and saved on first
  run so subsequent runs reuse the same anchor automatically
- Default to skip reseed (old --no-reseed behavior); add --reseed to
  opt in
- Add --live flag to opt out of saved anchor (wall-clock now, implies
  --reseed)
- --anchor-time <iso> now overrides and saves to eval.config.json
- Update README with new CLI flags and Anchor Time section
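
The anchor-time rules in this commit can be read as a small decision function. The sketch below is an interpretation, not the package's code: `EvalConfig` and `anchorTime` come from the commit message, while `AnchorFlags`, `AnchorDecision`, `resolveAnchor`, the precedence of `--live` over `--anchor-time`, and the choice not to save a `--live` anchor are assumptions.

```typescript
interface EvalConfig {
  anchorTime?: string; // ISO timestamp saved in eval.config.json
}

// Hypothetical shape for the parsed CLI flags.
interface AnchorFlags {
  live?: boolean; // --live
  anchorTime?: string; // --anchor-time <iso>
  reseed?: boolean; // --reseed
}

interface AnchorDecision {
  anchorTime: string;
  reseed: boolean;
  save: boolean; // whether to write anchorTime back to the config
}

function resolveAnchor(
  config: EvalConfig,
  flags: AnchorFlags,
  now: Date = new Date(),
): AnchorDecision {
  // --live: ignore the saved anchor, use wall-clock now, and reseed.
  if (flags.live) {
    return { anchorTime: now.toISOString(), reseed: true, save: false };
  }
  const reseed = flags.reseed === true; // reseeding is opt-in by default
  // --anchor-time: override the saved anchor and persist the new value.
  if (flags.anchorTime !== undefined) {
    const parsed = new Date(flags.anchorTime);
    if (Number.isNaN(parsed.getTime())) {
      throw new Error(`invalid --anchor-time: ${flags.anchorTime}`);
    }
    return { anchorTime: parsed.toISOString(), reseed, save: true };
  }
  // Reuse the saved anchor when one exists.
  if (config.anchorTime !== undefined) {
    return { anchorTime: config.anchorTime, reseed, save: false };
  }
  // First run: generate an anchor and save it for subsequent runs.
  return { anchorTime: now.toISOString(), reseed, save: true };
}
```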
@vercel vercel Bot temporarily deployed to Preview – hyperdx-storybook June 4, 2026 22:44 Inactive
- Add scenarioIsSeeded() check (queries traces table for any row)
- run command now checks for existing data before running; auto-seeds
  if the scenario tables are empty or missing
- Update README to document auto-seed behavior
@vercel vercel Bot temporarily deployed to Preview – hyperdx-storybook June 5, 2026 14:52 Inactive
@vercel vercel Bot temporarily deployed to Preview – hyperdx-storybook June 5, 2026 15:14 Inactive
Lead with export HDX_DEV_SLOT in the Quick Start so eval commands
connect to the correct ClickHouse instance regardless of which
worktree they are run from. Replace --ch-url examples in the
dual-slot seeding section with HDX_DEV_SLOT for consistency.

Also clarify that .env.local must be at the monorepo root.
@vercel vercel Bot temporarily deployed to Preview – hyperdx-storybook June 5, 2026 15:45 Inactive
@brandon-pereira brandon-pereira marked this pull request as ready for review June 5, 2026 15:46
@github-actions github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label Jun 5, 2026
@github-actions

github-actions Bot commented Jun 5, 2026

Contributor

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

  • Large diff: 12006 production lines changed (threshold: 1000)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats
  • Production files changed: 59
  • Production lines changed: 12006 (+ 2479 in test files, excluded from tier calculation)
  • Branch: brandon/ai-evals
  • Author: brandon-pereira

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

Comment thread packages/hdx-eval/src/harness/claudeSpawn.ts Outdated
Comment thread packages/hdx-eval/src/runs/instrument.ts
Comment thread packages/hdx-eval/src/harness/runRun.ts
Comment thread packages/hdx-eval/src/grading/judge.ts
Comment thread packages/hdx-eval/src/grading/grade.ts
- Escape underscores and percent signs in SQL LIKE patterns for
  query attribution to avoid false matches on table names
- Accumulate token counts from both judge attempts when retry
  succeeds, fixing understated cost reporting
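
To show why the LIKE escaping matters: in SQL LIKE patterns `_` matches any single character and `%` matches any run of characters, so an unescaped table name such as `otel_logs` can also match unrelated text. The sketch below assumes backslash is the LIKE escape character (as it is in ClickHouse); `escapeLike` and `likeToRegExp` are hypothetical names, and `likeToRegExp` exists only to demonstrate the matching behavior locally.

```typescript
// Escape LIKE metacharacters (and the escape character itself) so the
// input is matched literally.
function escapeLike(literal: string): string {
  return literal.replace(/[\\%_]/g, (ch) => "\\" + ch);
}

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Demo-only emulation of LIKE semantics, used to show the false match.
function likeToRegExp(pattern: string): RegExp {
  let out = "";
  for (let i = 0; i < pattern.length; i++) {
    const ch = pattern.charAt(i);
    if (ch === "\\" && i + 1 < pattern.length) {
      i += 1;
      out += escapeRegExp(pattern.charAt(i)); // escaped char is literal
    } else if (ch === "%") {
      out += "[\\s\\S]*";
    } else if (ch === "_") {
      out += "[\\s\\S]";
    } else {
      out += escapeRegExp(ch);
    }
  }
  return new RegExp("^" + out + "$");
}

const naive = likeToRegExp("%otel_logs%");
const safe = likeToRegExp("%" + escapeLike("otel_logs") + "%");
console.log(naive.test("FROM otelXlogs")); // true (false positive)
console.log(safe.test("FROM otelXlogs")); // false
console.log(safe.test("FROM otel_logs WHERE 1")); // true
```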
@vercel vercel Bot temporarily deployed to Preview – hyperdx-storybook June 5, 2026 20:19 Inactive
Matches the existing exclusion pattern in listRunsInBatch (store.ts).
Without this, timing sidecars are picked up as run records and produce
garbage grade files.
@vercel vercel Bot temporarily deployed to Preview – hyperdx-storybook June 5, 2026 20:22 Inactive
@brandon-pereira brandon-pereira requested review from a team and wrn14897 and removed request for a team June 5, 2026 20:38
Comment thread packages/hdx-eval/src/clickhouse/insert.ts Outdated
Comment thread packages/hdx-eval/src/harness/claudeSpawn.ts Outdated
Comment thread packages/hdx-eval/src/harness/claudeSpawn.ts
Comment thread packages/hdx-eval/README.md
- Wrap post-spawn block in try/finally so rmSync always runs,
  even if spawn() rejects (e.g. ENOENT). Prevents API keys from
  leaking in /tmp/hdx-eval-*/mcp-config.json.
- Change runner default model from claude-sonnet-4-6 to claude-opus-4-6.
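
The try/finally cleanup described in this commit might look roughly like the following. This is a sketch, not the code from claudeSpawn.ts: `withTempConfig` is a hypothetical wrapper, and only the `hdx-eval-` temp prefix and the `mcp-config.json` file name are taken from the commit message.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical wrapper: write a config that may contain API keys into a
// fresh temp directory, hand its path to `run`, and always remove the
// directory afterwards, even when `run` rejects (e.g. spawn ENOENT).
async function withTempConfig<T>(
  config: unknown,
  run: (configPath: string) => Promise<T>,
): Promise<T> {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "hdx-eval-"));
  try {
    const configPath = path.join(dir, "mcp-config.json");
    // Owner-only permissions, since the file can hold secrets.
    fs.writeFileSync(configPath, JSON.stringify(config), { mode: 0o600 });
    return await run(configPath);
  } finally {
    fs.rmSync(dir, { recursive: true, force: true });
  }
}
```

Placing the cleanup in `finally` rather than after the awaited call is what covers the rejection path mentioned in the commit.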
@vercel vercel Bot temporarily deployed to Preview – hyperdx-storybook June 9, 2026 17:51 Inactive
Generators already yield in ~10K-row batches, so 5K was needlessly
sub-chunking each batch into 2 inserts. 100K eliminates the split
with no memory/stability downside. Benchmarked locally: ~7% faster
(53.5s → 49.8s for 3.6M rows).
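
The effect of the chunk size can be checked with a small sketch. The `chunk` helper below is hypothetical (the actual insert code in insert.ts is not shown here); it only illustrates that a ~10K-row batch splits into two inserts at a 5K limit and into one at 100K. The quoted speedup is consistent with the timings: (53.5 − 49.8) / 53.5 ≈ 6.9%.

```typescript
// Hypothetical helper: split a batch of rows into insert-sized chunks.
function chunk<T>(rows: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("chunk size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < rows.length; i += size) {
    chunks.push(rows.slice(i, i + size));
  }
  return chunks;
}

const batch = new Array<number>(10_000).fill(0); // one generator batch
console.log(chunk(batch, 5_000).length); // 2 inserts per batch
console.log(chunk(batch, 100_000).length); // 1 insert per batch
```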
@vercel vercel Bot temporarily deployed to Preview – hyperdx-storybook June 9, 2026 18:07 Inactive
…rvers

Spawn claude with detached:true so it gets its own process group.
SIGTERM still targets claude only (it handles graceful MCP shutdown).
The SIGKILL escalation now uses process.kill(-pid) to kill the entire
process group, ensuring orphaned MCP server children are reaped.
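
A minimal sketch of the escalation described above, assuming a POSIX system (negative PIDs address process groups only there). The helper names `spawnInOwnGroup` and `terminateWithEscalation` and the `graceMs` parameter are hypothetical; the real logic lives in claudeSpawn.ts and may differ in detail.

```typescript
import { spawn, ChildProcess } from "node:child_process";

// detached: true makes the child the leader of a new process group, so its
// own children (e.g. MCP servers) can later be signalled as a group.
function spawnInOwnGroup(cmd: string, args: string[]): ChildProcess {
  return spawn(cmd, args, { detached: true, stdio: "ignore" });
}

// Send SIGTERM to the child only, then SIGKILL the whole group if the child
// has not exited within graceMs. Resolves once the child has exited.
function terminateWithEscalation(
  child: ChildProcess,
  graceMs: number,
): Promise<void> {
  return new Promise((resolve) => {
    const pid = child.pid;
    const alreadyDone =
      pid === undefined || child.exitCode !== null || child.signalCode !== null;
    if (alreadyDone) {
      resolve();
      return;
    }
    const timer = setTimeout(() => {
      try {
        process.kill(-pid, "SIGKILL"); // negative pid targets the group
      } catch {
        // The group may already be gone; nothing left to clean up.
      }
    }, graceMs);
    child.once("exit", () => {
      clearTimeout(timer); // avoid leaking the escalation timer
      resolve();
    });
    child.kill("SIGTERM"); // graceful request, child only
  });
}
```

Clearing the timer on exit matters for the timer-leak class of bug mentioned earlier in this PR: without it, a pending SIGKILL could fire after the PID has been recycled.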
@vercel vercel Bot temporarily deployed to Preview – hyperdx-storybook June 9, 2026 18:09 Inactive
@brandon-pereira

Member Author

@wrn14897 all feedback addressed, ready for another review :)

@wrn14897 wrn14897 left a comment

Member


lgtm

@kodiakhq kodiakhq Bot merged commit 5bd1c68 into main Jun 9, 2026
19 checks passed
@kodiakhq kodiakhq Bot deleted the brandon/ai-evals branch June 9, 2026 20:26

Labels

automerge review/tier-4 Critical — deep review + domain expert sign-off


2 participants