[feat] Add `DaytonaRunner` for code `evaluators` by junaway · Pull Request #3258 · Agenta-AI/agenta

junaway · 2025-12-20T00:10:55Z

No description provided.

vercel · 2025-12-20T00:11:00Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Review	Updated (UTC)
agenta-documentation	Ready	Preview, Comment	Jan 7, 2026 9:34am

Copilot

Pull request overview

This PR implements and tests Daytona-based code evaluation functionality, transitioning from the legacy local sandbox to a new SDK-based approach. It includes improvements to code editor indentation handling for Python/code blocks and adds example evaluators for testing various dependencies and API endpoints.

Key Changes

Replaced legacy custom_code_run with new sdk_custom_code_run that uses the SDK's workflow-based evaluator system
Enhanced code editor to preserve exact indentation for Python/code (no transformations) while maintaining space-to-tab conversion for JSON/YAML
Added example evaluators for testing OpenAI, NumPy, and Agenta API endpoints in Daytona environments

Reviewed changes

Copilot reviewed 20 out of 25 changed files in this pull request and generated 10 comments.

File	Description
`api/oss/src/services/evaluators_service.py`	Implements new SDK-based custom code runner function that delegates to workflow system
`api/oss/src/resources/evaluators/evaluators.py`	Updates default code template with deprecation note for app_params
`sdk/agenta/sdk/workflows/runners/daytona.py`	Adds environment variables (OPENAI_API_KEY, AGENTA_HOST, AGENTA_CREDENTIALS) to sandbox
`sdk/agenta/sdk/workflows/runners/local.py`	Exposes built-in Python types (dict, list, str, etc.) to restricted environment
`sdk/agenta/sdk/decorators/running.py`	Adds fallback to request.credentials in credential resolution chain
`web/oss/src/components/Editor/plugins/code/utils/pasteUtils.ts`	Preserves exact indentation for Python/code, converts spaces to tabs for JSON/YAML
`web/oss/src/components/Editor/plugins/code/plugins/IndentationPlugin.tsx`	Uses 4 spaces for Python/code tab insertion, 2 spaces for JSON/YAML
`web/oss/src/components/Editor/plugins/code/plugins/AutoFormatAndValidateOnPastePlugin.tsx`	Skips indentation transformation for Python/code, maintains it for JSON/YAML
`examples/python/evaluators/openai/*.py`	Adds OpenAI SDK evaluators for testing API availability and exact match comparisons
`examples/python/evaluators/numpy/*.py`	Adds NumPy evaluators for testing library availability and character counting
`examples/python/evaluators/basic/*.py`	Adds basic evaluators using Python stdlib for string matching, length checks, JSON validation
`examples/python/evaluators/ag/*.py`	Adds Agenta API endpoint evaluators for health, secrets, and config endpoints
`examples/python/evaluators/*.md`	Provides comprehensive documentation (README, QUICKSTART, SUMMARY) for evaluators
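
The `local.py` row describes exposing built-in types to the restricted environment. A minimal sketch of that idea follows, assuming a plain `exec` with an allow-listed `__builtins__` mapping. `ALLOWED_BUILTINS` and `run_restricted` are hypothetical names, the actual runner's allow-list and mechanism are not shown on this page, and an allow-list like this is not a security boundary on its own.

```python
# Hypothetical allow-list; the real runner's list is not shown in this PR page.
ALLOWED_BUILTINS = {
    "dict": dict,
    "list": list,
    "str": str,
    "int": int,
    "float": float,
    "bool": bool,
    "len": len,
}


def run_restricted(code: str, entrypoint: str, *args):
    """Execute `code` with only allow-listed builtins, then call `entrypoint`."""
    env = {"__builtins__": ALLOWED_BUILTINS}
    exec(code, env)  # names outside the allow-list raise NameError when used
    return env[entrypoint](*args)


# Illustrative evaluator source, as a user might paste into the editor.
EVALUATOR = '''
def evaluate(output, expected):
    return float(str(output) == str(expected))
'''

score = run_restricted(EVALUATOR, "evaluate", "42", 42)
```

Without `str` and `float` in the mapping, the evaluator above would fail with a `NameError`, which is the kind of gap the row appears to address.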

Copilot

Pull request overview

Copilot reviewed 32 out of 37 changed files in this pull request and generated 8 comments.

Add standard provider keys from vault as env vars. Add templates. Fix credentials (and thus secrets and traces) in evaluator playground.

Copilot

Pull request overview

Copilot reviewed 49 out of 56 changed files in this pull request and generated 5 comments.

mmabrouk

Thanks for the PR @jp-agenta

Misc

I don't think we should remove RestrictedPython completely. I would still keep it as the default option and allow self-hosting users to use codex execution if they feel their environment is safe.
We should add documentation for the new environment variable.

QA Bugs:

Errors are not shown or propagated in the evaluator playground

AI review Issues

2. Sandbox not deleted on error in DaytonaRunner

File: sdk/agenta/sdk/workflows/runners/daytona.py:306-309
Issue: If exception occurs before sandbox.delete(), sandbox leaks

Current code:

except Exception as e:
    log.error(f"Error during Daytona code execution: {e}", exc_info=True)
    raise RuntimeError(...)  # sandbox.delete() not called!

Recommendation: Use try/finally pattern:

sandbox = self._create_sandbox(runtime=runtime)
try:
    # ... execution logic ...
finally:
    sandbox.delete()
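
The try/finally recommendation can also be packaged as a context manager so that every caller gets cleanup for free. The sketch below is an illustration under assumptions: `FakeSandbox`, `sandbox_context`, and `run_code` are hypothetical stand-ins and do not use the Daytona SDK API or reproduce the PR's actual code.

```python
from contextlib import contextmanager


class FakeSandbox:
    """Stand-in for a remote sandbox; not the Daytona SDK API."""

    def __init__(self):
        self.deleted = False

    def run(self, code: str) -> str:
        if "boom" in code:
            raise ValueError("execution failed")
        return "ok"

    def delete(self) -> None:
        self.deleted = True


@contextmanager
def sandbox_context(create_sandbox):
    """Yield a sandbox and always delete it, even if the body raises."""
    sandbox = create_sandbox()
    try:
        yield sandbox
    finally:
        sandbox.delete()


def run_code(code: str, create_sandbox=FakeSandbox) -> str:
    with sandbox_context(create_sandbox) as sandbox:
        try:
            return sandbox.run(code)
        except Exception as e:
            # Wrap the error as the original handler does; cleanup still runs.
            raise RuntimeError(f"Error during code execution: {e}") from e
```

Because the `finally` clause sits in the context manager, the `RuntimeError` raised inside the `with` block still triggers `delete()` before it propagates.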

4. 🔴 CRITICAL BUG: evaluators_service.py not updated for 3-tuple return

File: api/oss/src/services/evaluators_service.py lines 844 and 1028
Issue: SecretsManager.retrieve_secrets() now returns tuple[list, list, list], but these callers still expect a single list

Current (broken):

secrets = await SecretsManager.retrieve_secrets()
for secret in secrets:  # Iterates over tuple, not secrets!
    if secret.get("kind") == ...  # AttributeError: list has no .get()

Fix:

secrets, _, _ = await SecretsManager.retrieve_secrets()

Impact: Will break LLM-as-a-Judge (auto_ai_critique) and semantic similarity evaluators
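
The unpacking fix can be exercised in isolation. The sketch below assumes only the 3-tuple return shape described in the issue; `FakeSecretsManager`, the `"provider_key"` kind value, and the secret fields are hypothetical and stand in for the real `SecretsManager`, whose other two lists are not described on this page.

```python
import asyncio


class FakeSecretsManager:
    """Stand-in with the 3-tuple shape described above; not the real class."""

    @staticmethod
    async def retrieve_secrets():
        # Hypothetical contents; only the tuple-of-three-lists shape matters.
        first = [{"kind": "provider_key", "name": "OPENAI_API_KEY"}]
        return first, [], []


async def provider_key_names():
    # Unpack first; iterating the raw tuple would yield three lists instead.
    secrets, _, _ = await FakeSecretsManager.retrieve_secrets()
    return [s["name"] for s in secrets if s.get("kind") == "provider_key"]


names = asyncio.run(provider_key_names())
```

If the unpacking is omitted, the loop variable is a `list`, so the `.get("kind")` call fails as described in the issue.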

Copilot

Pull request overview

Copilot reviewed 53 out of 60 changed files in this pull request and generated 8 comments.

ardaerzin

no major frontend concerns, lgtm 👍 thank you @jp-agenta 🙏

Copilot

Pull request overview

Copilot reviewed 55 out of 63 changed files in this pull request and generated no new comments.

Copilot

Pull request overview

Copilot reviewed 56 out of 64 changed files in this pull request and generated 5 comments.

Copilot

Pull request overview

Copilot reviewed 53 out of 61 changed files in this pull request and generated no new comments.

Copilot

Pull request overview

Copilot reviewed 55 out of 63 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (6)

sdk/agenta/sdk/types.py:1
The `import re` statement appears after class definitions (line 498 shows a class ending). Move this import to the top of the file with other imports to follow Python conventions and improve code organization.

web/oss/src/components/pages/evaluations/autoEvaluation/EvaluatorsModal/ConfigureEvaluator/index.tsx:1
The use of the `any` type defeats TypeScript's type safety. Consider defining a more specific type or using `unknown` if the structure is truly dynamic, then narrow it with type guards where needed.

sdk/agenta/sdk/workflows/runners/local.py:1
Using `dict()` instead of `{}` for creating an empty dictionary is less idiomatic and slightly less efficient. Use `{}` instead for consistency with Python conventions.

web/oss/src/components/Editor/plugins/code/plugins/IndentationPlugin.tsx:1
The hardcoded space strings for indentation could be defined as named constants (e.g., `JSON_YAML_INDENT = "  "`, `CODE_INDENT = "    "`) to improve maintainability and make the indentation standards more explicit.

sdk/agenta/sdk/middlewares/running/vault.py:1
The comment `# pylint: disable=bare-except` is misleading since the code actually catches `Exception` rather than using a bare `except` clause. Remove this comment as it's no longer accurate.

sdk/agenta/sdk/middleware/vault.py:1
The comment `# pylint: disable=bare-except` is misleading since the code actually catches `Exception` rather than using a bare `except` clause. Remove this comment as it's no longer accurate.
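
As background for the two `bare-except` notes: `except Exception` does not catch exceptions that derive directly from `BaseException`, such as `SystemExit` and `KeyboardInterrupt`, whereas a bare `except:` would. The following is a small illustration with hypothetical helper names, unrelated to the actual vault middleware code.

```python
def call_and_handle(fn):
    """Catch ordinary errors only, as `except Exception` does."""
    try:
        fn()
    except Exception:
        return "handled"
    return "no error"


def raise_value_error():
    raise ValueError("bad input")


def raise_system_exit():
    raise SystemExit(1)


# ValueError derives from Exception, so it is handled.
handled = call_and_handle(raise_value_error)

# SystemExit derives from BaseException, so it escapes the handler.
try:
    call_and_handle(raise_system_exit)
    escaped = False
except SystemExit:
    escaped = True
```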

jp-agenta added 9 commits December 19, 2025 11:29

adding evaluators (WIP)

f75cb59

adding evaluators (WIP)

c2c553a

fixing evaluators

5a8dcd0

Merge branch 'release/v0.69.5' into chore/check-daytona-code-evaluator

91e69e8

testing numpy/openai/agenta

a602930

fix typos in init

59f4797

confirm works with localhost if public host

6c297c9

fix playground

a717366

fix presets

b3d90f2

Copilot AI review requested due to automatic review settings December 20, 2025 00:10

vercel Bot deployed to Preview December 20, 2025 00:11

Copilot started reviewing on behalf of junaway December 20, 2025 00:11

jp-agenta added 2 commits December 20, 2025 01:12

remove blaot

5bdc802

remove bloat

5304e0f

vercel Bot deployed to Preview December 20, 2025 00:13

fix daytona imports

00958cc

vercel Bot deployed to Preview December 20, 2025 00:15

remove openai key from daytona

4071a3d

vercel Bot deployed to Preview December 20, 2025 00:16

Copilot AI reviewed Dec 20, 2025

jp-agenta added 2 commits December 23, 2025 12:31

WIP add runtimes

7d3ac94

Merge branch 'fix/remove-autoevals-and-rag-evaluators' into chore/che…

84bbdaa

…ck-daytona-code-evaluator

vercel Bot deployed to Preview December 23, 2025 11:32

Copilot AI review requested due to automatic review settings December 23, 2025 11:39

Merge branch 'main' into chore/check-daytona-code-evaluator

a4ffa8c

Copilot started reviewing on behalf of junaway December 23, 2025 11:39

vercel Bot deployed to Preview December 23, 2025 11:40

Copilot AI reviewed Dec 23, 2025

WIP

93d7bb5

Add standard provider keys from vault as env vars. Add templates. Fix credentials (and thus secrets and traces) in evaluator playground.

fix blank form on reload

7cfa317

Copilot AI review requested due to automatic review settings January 4, 2026 22:21

Copilot started reviewing on behalf of junaway January 4, 2026 22:22

vercel Bot deployed to Preview January 4, 2026 22:22

Copilot AI reviewed Jan 4, 2026

Merge branch 'release/v0.74.0' into chore/check-daytona-code-evaluator

5c6729f

vercel Bot deployed to Preview January 5, 2026 14:59

mmabrouk changed the base branch from frontend-feat/new-testsets-integration to main January 5, 2026 16:42

mmabrouk requested changes Jan 5, 2026

mmabrouk requested a review from ardaerzin January 5, 2026 18:49

mmabrouk marked this pull request as ready for review January 5, 2026 18:49

dosubot Bot added the Evaluation label Jan 5, 2026

jp-agenta added 2 commits January 6, 2026 10:27

added _sandbox_context

b1e2664

fix 3-tuple in secrets

ccd6087

Copilot AI reviewed Jan 6, 2026

jp-agenta added 2 commits January 6, 2026 10:49

restricted python and dependencies cleanup

f0beb20

add missing locks

710bdf9

ardaerzin approved these changes Jan 6, 2026

Propagate custom code evaluator exception

840fcc1

Copilot AI reviewed Jan 6, 2026

jp-agenta added 2 commits January 6, 2026 14:57

Error copy fix

ce3ba20

added docs for env vars

93eee32

Copilot AI reviewed Jan 6, 2026

Comment thread sdk/agenta/sdk/middleware/vault.py

Comment thread sdk/agenta/sdk/middlewares/running/vault.py

Comment thread sdk/agenta/sdk/contexts/running.py

Comment thread sdk/agenta/sdk/types.py

Comment thread sdk/agenta/sdk/middlewares/running/vault.py

jp-agenta added 2 commits January 6, 2026 15:26

Merge branch 'main' into chore/check-daytona-code-evaluator

33ebca9

Merge branch 'release/v0.75.0' into chore/check-daytona-code-evaluator

9e7960f

Copilot AI reviewed Jan 7, 2026

jp-agenta added 3 commits January 7, 2026 10:15

Fix poetry.lock

d9feb15

fix var name in web

eef9477

fix fallback in js/ts evaluators

d8364c7

Copilot AI reviewed Jan 7, 2026
