
ci: Improve autoevals CI and fix failing Python CI#195

Merged
Abhijeet Prasad (AbhiPrasad) merged 14 commits into main from abhi-autoevals-ci-improvements on Jun 8, 2026
@AbhiPrasad

Member

This branch focuses on CI and Python dependency modernization:

  • Migrates Python development/CI workflows from pip/setup-style installs to uv, adding pyproject.toml, uv.lock, and updating Makefile/env.sh. Also adds Python 3.13 and 3.14 to the test matrix.
  • Improves GitHub Actions coverage and maintenance:
    • Adds Node 24 to JS CI.
    • Updates checkout/setup action pins.
    • Improves lint, Python, publish, and version-sync workflows.
  • Updates Python publishing documentation to use uv commands.
  • Adjusts tests for updated Braintrust OpenAI API behavior.
  • Adds/updates agent guidance via AGENTS.md and CLAUDE.md.
  • Removes dev engine constraints from package.json to align JS tooling expectations.
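As a rough illustration of the uv migration described above, a Python CI job might look something like the sketch below. This is not the workflow from this PR: the job name, the action version tags, and the `uv run pytest` test command are assumptions made for illustration, and the matrix lists only the two Python versions the description mentions adding.

```yaml
# Hypothetical sketch of a uv-based Python CI job; not the actual workflow in this PR.
name: python-ci
on:
  pull_request:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        # Only the versions named in the description; the real matrix likely has more.
        python-version: ["3.13", "3.14"]
    env:
      # uv reads UV_PYTHON to choose the interpreter for the project environment.
      UV_PYTHON: ${{ matrix.python-version }}
    steps:
      # Version tags are illustrative; the PR mentions updated pins for these actions.
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v6
      # --locked fails the job if uv.lock is out of date relative to pyproject.toml.
      - run: uv sync --locked
      # Assumes pytest is declared as a development dependency in pyproject.toml.
      - run: uv run pytest
```

Installing from `uv.lock` keeps every matrix entry on the same resolved dependency set, which is the main practical difference from an unpinned `pip install`.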

@github-actions

github-actions Bot commented Jun 8, 2026


Braintrust eval report

Autoevals (abhi-autoevals-ci-improvements-1780938254)

| Score | Average | Improvements | Regressions |
| --- | --- | --- | --- |
| NumericDiff | 79.7% (+1pp) | 9 🟢 | 7 🔴 |
| Time_to_first_token | 12.12tok (+2.81tok) | 27 🟢 | 92 🔴 |
| Llm_calls | 1.09 (+0) | - | - |
| Tool_calls | 0 (+0) | - | - |
| Errors | 0 (+0) | - | - |
| Llm_errors | 0 (+0) | - | - |
| Tool_errors | 0 (+0) | - | - |
| Prompt_tokens | 317.7tok (+0tok) | - | - |
| Prompt_cached_tokens | 0tok (+0tok) | - | - |
| Prompt_cache_creation_tokens | 0tok (+0tok) | - | - |
| Prompt_cache_creation_5m_tokens | 0tok (+0tok) | - | - |
| Prompt_cache_creation_1h_tokens | 0tok (+0tok) | - | - |
| Completion_tokens | 239.83tok (-1.18tok) | 51 🟢 | 46 🔴 |
| Completion_reasoning_tokens | 0tok (+0tok) | - | - |
| Total_tokens | 557.53tok (-1.18tok) | 51 🟢 | 46 🔴 |
| Estimated_cost | 0$ (0$) | 49 🟢 | 39 🔴 |
| Duration | 12.48s (+3.05s) | 39 🟢 | 180 🔴 |
| Llm_duration | 13.55s (+2.75s) | 28 🟢 | 91 🔴 |

@github-actions

github-actions Bot commented Jun 8, 2026


Braintrust eval report

Autoevals (HEAD-1780938901)

| Score | Average | Improvements | Regressions |
| --- | --- | --- | --- |
| NumericDiff | 79.1% (-1pp) | 3 🟢 | 7 🔴 |
| Time_to_first_token | 10.44tok (-1.69tok) | 72 🟢 | 47 🔴 |
| Llm_calls | 1.09 (+0) | - | - |
| Tool_calls | 0 (+0) | - | - |
| Errors | 0 (+0) | - | - |
| Llm_errors | 0 (+0) | - | - |
| Tool_errors | 0 (+0) | - | - |
| Prompt_tokens | 317.7tok (+0tok) | - | - |
| Prompt_cached_tokens | 0tok (+0tok) | - | - |
| Prompt_cache_creation_tokens | 0tok (+0tok) | - | - |
| Prompt_cache_creation_5m_tokens | 0tok (+0tok) | - | - |
| Prompt_cache_creation_1h_tokens | 0tok (+0tok) | - | - |
| Completion_tokens | 241.01tok (+1.18tok) | 56 🟢 | 49 🔴 |
| Completion_reasoning_tokens | 0tok (+0tok) | - | - |
| Total_tokens | 558.71tok (+1.18tok) | 56 🟢 | 49 🔴 |
| Estimated_cost | 0$ (+0$) | 51 🟢 | 46 🔴 |
| Duration | 10.41s (-2.07s) | 154 🟢 | 65 🔴 |
| Llm_duration | 11.98s (-1.57s) | 70 🟢 | 49 🔴 |

pull_request:
  # Uncomment to run only when files in the 'evals' directory change
  # paths:
  #   - "evals/**"

I think this means fork PRs would fail, since they don't have access to our secrets. Not sure if that's a big deal, but I wanted to point it out just in case.
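One common way to handle the fork concern raised above is to skip the secret-dependent job when the pull request comes from a fork. The sketch below is an assumption about how that could look, not something this PR implements: the job name `eval`, the placeholder run step, and the secret name `BRAINTRUST_API_KEY` are hypothetical.

```yaml
# Hypothetical sketch: skip the eval job for fork PRs, which do not receive repository secrets.
on:
  push:
    branches: [main]
  pull_request:

jobs:
  eval:
    # Run on non-PR events, and on PRs whose head branch lives in this repository.
    if: >-
      github.event_name != 'pull_request' ||
      github.event.pull_request.head.repo.full_name == github.repository
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run evals
        # Placeholder command; substitute the real eval step.
        run: echo "run evals here"
        env:
          # Assumed secret name, for illustration only.
          BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
```

With a guard like this, fork PRs show the job as skipped rather than failed, at the cost of not producing an eval report for outside contributions.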

@AbhiPrasad Abhijeet Prasad (AbhiPrasad) merged commit 0cf9817 into main Jun 8, 2026
12 checks passed
@github-actions

github-actions Bot commented Jun 8, 2026


Braintrust eval report

Autoevals (main-1780945233)

| Score | Average | Improvements | Regressions |
| --- | --- | --- | --- |
| NumericDiff | 78.6% (0pp) | 8 🟢 | 9 🔴 |
| Time_to_first_token | 10.56tok (+0.12tok) | 58 🟢 | 61 🔴 |
| Llm_calls | 1.09 (+0) | - | - |
| Tool_calls | 0 (+0) | - | - |
| Errors | 0 (+0) | - | - |
| Llm_errors | 0 (+0) | - | - |
| Tool_errors | 0 (+0) | - | - |
| Prompt_tokens | 317.7tok (+0tok) | - | - |
| Prompt_cached_tokens | 0tok (+0tok) | - | - |
| Prompt_cache_creation_tokens | 0tok (+0tok) | - | - |
| Prompt_cache_creation_5m_tokens | 0tok (+0tok) | - | - |
| Prompt_cache_creation_1h_tokens | 0tok (+0tok) | - | - |
| Completion_tokens | 248.4tok (+7.38tok) | 47 🟢 | 58 🔴 |
| Completion_reasoning_tokens | 0tok (+0tok) | - | - |
| Total_tokens | 566.1tok (+7.38tok) | 47 🟢 | 58 🔴 |
| Estimated_cost | 0$ (+0$) | 46 🟢 | 54 🔴 |
| Duration | 10.64s (+0.23s) | 108 🟢 | 111 🔴 |
| Llm_duration | 11.98s (0s) | 62 🟢 | 57 🔴 |
