ci: Improve autoevals CI and fix failing Python CI by AbhiPrasad · Pull Request #195 · braintrustdata/autoevals

Abhijeet Prasad (AbhiPrasad) · 2026-06-08T16:54:40Z

This branch focuses on CI and Python dependency modernization:

- Migrates Python development/CI workflows from pip/setup-style installs to uv, adding pyproject.toml and uv.lock and updating Makefile/env.sh. Also adds Python 3.13 and 3.14 to the test matrix.
- Improves GitHub Actions coverage and maintenance:
  - Adds Node 24 to JS CI.
  - Updates checkout/setup action pins.
  - Improves lint, Python, publish, and version-sync workflows.
- Updates Python publishing documentation to use uv commands.
- Adjusts tests for updated Braintrust OpenAI API behavior.
- Adds/updates agent guidance via AGENTS.md and CLAUDE.md.
- Removes dev engine constraints from package.json to align JS tooling expectations.

Update the JS CI matrix to include Node 24 and simplify dependency caching by using setup-node's built-in pnpm cache. Also refresh checkout and setup-node action pins.

github-actions · 2026-06-08T17:04:12Z

Braintrust eval report

Autoevals (abhi-autoevals-ci-improvements-1780938254)

Score	Average	Improvements	Regressions
NumericDiff	79.7% (+1pp)	9 🟢	7 🔴
Time_to_first_token	12.12tok (+2.81tok)	27 🟢	92 🔴
Llm_calls	1.09 (+0)	-	-
Tool_calls	0 (+0)	-	-
Errors	0 (+0)	-	-
Llm_errors	0 (+0)	-	-
Tool_errors	0 (+0)	-	-
Prompt_tokens	317.7tok (+0tok)	-	-
Prompt_cached_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_5m_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_1h_tokens	0tok (+0tok)	-	-
Completion_tokens	239.83tok (-1.18tok)	51 🟢	46 🔴
Completion_reasoning_tokens	0tok (+0tok)	-	-
Total_tokens	557.53tok (-1.18tok)	51 🟢	46 🔴
Estimated_cost	0$ (0$)	49 🟢	39 🔴
Duration	12.48s (+3.05s)	39 🟢	180 🔴
Llm_duration	13.55s (+2.75s)	28 🟢	91 🔴

github-actions · 2026-06-08T17:08:33Z

Braintrust eval report

Autoevals (HEAD-1780938901)

Score	Average	Improvements	Regressions
NumericDiff	79.1% (-1pp)	3 🟢	7 🔴
Time_to_first_token	10.44tok (-1.69tok)	72 🟢	47 🔴
Llm_calls	1.09 (+0)	-	-
Tool_calls	0 (+0)	-	-
Errors	0 (+0)	-	-
Llm_errors	0 (+0)	-	-
Tool_errors	0 (+0)	-	-
Prompt_tokens	317.7tok (+0tok)	-	-
Prompt_cached_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_5m_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_1h_tokens	0tok (+0tok)	-	-
Completion_tokens	241.01tok (+1.18tok)	56 🟢	49 🔴
Completion_reasoning_tokens	0tok (+0tok)	-	-
Total_tokens	558.71tok (+1.18tok)	56 🟢	49 🔴
Estimated_cost	0$ (+0$)	51 🟢	46 🔴
Duration	10.41s (-2.07s)	154 🟢	65 🔴
Llm_duration	11.98s (-1.57s)	70 🟢	49 🔴

Andrew Kent (realark) · 2026-06-08T18:22:57Z

+  pull_request:
+    # Uncomment to run only when files in the 'evals' directory change
+    # paths:
+    #   - "evals/**"


I think this means fork PRs would fail, since they don't have access to our secrets. Not sure if that's a big deal, but I wanted to point it out just in case.
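One possible way to handle the fork case raised above is to have the eval entry point check for its credentials and skip cleanly when they are absent, since GitHub does not pass repository secrets to `pull_request` runs from forks. The following is only a minimal sketch: the variable names `BRAINTRUST_API_KEY` and `OPENAI_API_KEY` and the helper functions are assumptions for illustration, not taken from this PR's workflow.

```python
import os

# Assumption: the eval job reads its credentials from these environment
# variables; the names are illustrative and not taken from the workflow.
REQUIRED_SECRETS = ("BRAINTRUST_API_KEY", "OPENAI_API_KEY")


def missing_secrets(env=None, required=REQUIRED_SECRETS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def should_run_evals(env=None):
    """Evals run only when every required secret is available."""
    return not missing_secrets(env)


if __name__ == "__main__":
    missing = missing_secrets()
    if missing:
        # Fork PRs do not receive repository secrets, so report the skip
        # and exit successfully instead of failing the check.
        print("Skipping evals, missing: " + ", ".join(missing))
    else:
        print("All secrets present, running evals")
```

Whether skipping is preferable to failing depends on how much the team wants eval coverage on external contributions, so treat this as one option rather than the intended behavior.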

github-actions · 2026-06-08T19:00:30Z

Braintrust eval report

Autoevals (main-1780945233)

Score	Average	Improvements	Regressions
NumericDiff	78.6% (0pp)	8 🟢	9 🔴
Time_to_first_token	10.56tok (+0.12tok)	58 🟢	61 🔴
Llm_calls	1.09 (+0)	-	-
Tool_calls	0 (+0)	-	-
Errors	0 (+0)	-	-
Llm_errors	0 (+0)	-	-
Tool_errors	0 (+0)	-	-
Prompt_tokens	317.7tok (+0tok)	-	-
Prompt_cached_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_5m_tokens	0tok (+0tok)	-	-
Prompt_cache_creation_1h_tokens	0tok (+0tok)	-	-
Completion_tokens	248.4tok (+7.38tok)	47 🟢	58 🔴
Completion_reasoning_tokens	0tok (+0tok)	-	-
Total_tokens	566.1tok (+7.38tok)	47 🟢	58 🔴
Estimated_cost	0$ (+0$)	46 🟢	54 🔴
Duration	10.64s (+0.23s)	108 🟢	111 🔴
Llm_duration	11.98s (0s)	62 🟢	57 🔴
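
To compare the improvement and regression counts in a report like the ones above, a small script can parse each tab-separated row and compute the net change. This is a sketch that assumes the rows are tab-separated exactly as shown here; `parse_row` and `net_change` are hypothetical helpers, not part of Braintrust or autoevals.

```python
def parse_row(line):
    """Split one tab-separated report row into its four columns."""
    score, average, improvements, regressions = line.split("\t")

    def count(cell):
        # Cells look like "108 🟢", or "-" when nothing was compared.
        first = cell.split()[0]
        return int(first) if first.isdigit() else None

    return {
        "score": score,
        "average": average,
        "improvements": count(improvements),
        "regressions": count(regressions),
    }


def net_change(row):
    """Improvements minus regressions, or None when the row has no counts."""
    if row["improvements"] is None or row["regressions"] is None:
        return None
    return row["improvements"] - row["regressions"]


if __name__ == "__main__":
    # Two rows copied from the report for main-1780945233.
    for line in (
        "Duration\t10.64s (+0.23s)\t108 🟢\t111 🔴",
        "Llm_calls\t1.09 (+0)\t-\t-",
    ):
        row = parse_row(line)
        print(row["score"], net_change(row))
```

A net value near zero, as with Duration here (108 improvements against 111 regressions), suggests the per-example differences roughly cancel out rather than pointing in one direction.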

Abhijeet Prasad (AbhiPrasad) added 11 commits June 8, 2026 12:34

ci: add Node 24 to JS workflow

6994a34

Update the JS CI matrix to include Node 24 and simplify dependency caching by using setup-node's built-in pnpm cache. Also refresh checkout and setup-node action pins.

ci: improve lint workflow

284cc22

ci: improve Python workflow coverage

28b0a5d

chore(ci): update workflow checkout action

304f956

ci: use uv for Python installs

0ed2a93

test: support updated braintrust oai API

9fdaa9c

ci: migrate Python project to uv

b2d1a58

chore: AGENTS.md it up

da38557

dev engines remove to align with js

6b09c2e

docs: update Python publishing commands for uv

60af5b2

chore: remove redundant setup.py

cc86444

Abhijeet Prasad (AbhiPrasad) self-assigned this Jun 8, 2026

pin versions

f6ca1e3

fix when run

32890f1

fix: add missing pytest plugins

4ad3a13

Andrew Kent (realark) reviewed Jun 8, 2026


Andrew Kent (realark) approved these changes Jun 8, 2026


Abhijeet Prasad (AbhiPrasad) merged commit 0cf9817 into main Jun 8, 2026
12 checks passed

Abhijeet Prasad (AbhiPrasad) mentioned this pull request Jun 8, 2026

[OAI] Support braintrust >=0.13 wrapping (fix Python CI) #193

Closed

2 tasks

Abhijeet Prasad (AbhiPrasad) merged 14 commits into main from abhi-autoevals-ci-improvements
