[feat] Add DaytonaRunner for code evaluators #3258

Merged
junaway merged 63 commits into release/v0.75.0 from chore/check-daytona-code-evaluator on Jan 7, 2026

@junaway
Contributor

@junaway junaway commented Dec 20, 2025

No description provided.

Copilot AI review requested due to automatic review settings December 20, 2025 00:10
@vercel

vercel Bot commented Dec 20, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Review | Updated (UTC) |
|---|---|---|---|
| agenta-documentation | Ready | Preview, Comment | Jan 7, 2026 9:34am |

Contributor

Copilot AI left a comment

Pull request overview

This PR implements and tests Daytona-based code evaluation functionality, transitioning from the legacy local sandbox to a new SDK-based approach. It includes improvements to code editor indentation handling for Python/code blocks and adds example evaluators for testing various dependencies and API endpoints.

Key Changes

  • Replaced legacy custom_code_run with new sdk_custom_code_run that uses the SDK's workflow-based evaluator system
  • Enhanced code editor to preserve exact indentation for Python/code (no transformations) while maintaining space-to-tab conversion for JSON/YAML
  • Added example evaluators for testing OpenAI, NumPy, and Agenta API endpoints in Daytona environments
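
The indentation rule in the second bullet can be illustrated with a small sketch. This is a Python illustration of the rule only; the PR's actual implementation lives in the TypeScript editor plugins. The function name `normalize_pasted_indentation`, the language labels, and the two-spaces-per-tab ratio are assumptions made for the example.

```python
import re

# Assumption: languages whose pasted indentation is kept exactly as-is.
PRESERVE_LANGUAGES = {"python", "code"}
# Assumption: how many leading spaces collapse into one tab for JSON/YAML.
SPACES_PER_TAB = 2


def normalize_pasted_indentation(text: str, language: str) -> str:
    """Leave Python/code untouched; convert leading spaces to tabs otherwise."""
    if language in PRESERVE_LANGUAGES:
        return text

    def to_tabs(match: re.Match) -> str:
        spaces = len(match.group(0))
        # Whole groups become tabs; any leftover spaces are kept.
        return "\t" * (spaces // SPACES_PER_TAB) + " " * (spaces % SPACES_PER_TAB)

    # Only the run of spaces at the start of each line is rewritten.
    return re.sub(r"^ +", to_tabs, text, flags=re.MULTILINE)
```

Only leading whitespace is touched, so spaces inside JSON strings or between tokens are unaffected.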

Reviewed changes

Copilot reviewed 20 out of 25 changed files in this pull request and generated 10 comments.

| File | Description |
|---|---|
| api/oss/src/services/evaluators_service.py | Implements new SDK-based custom code runner function that delegates to workflow system |
| api/oss/src/resources/evaluators/evaluators.py | Updates default code template with deprecation note for app_params |
| sdk/agenta/sdk/workflows/runners/daytona.py | Adds environment variables (OPENAI_API_KEY, AGENTA_HOST, AGENTA_CREDENTIALS) to sandbox |
| sdk/agenta/sdk/workflows/runners/local.py | Exposes built-in Python types (dict, list, str, etc.) to restricted environment |
| sdk/agenta/sdk/decorators/running.py | Adds fallback to request.credentials in credential resolution chain |
| web/oss/src/components/Editor/plugins/code/utils/pasteUtils.ts | Preserves exact indentation for Python/code, converts spaces to tabs for JSON/YAML |
| web/oss/src/components/Editor/plugins/code/plugins/IndentationPlugin.tsx | Uses 4 spaces for Python/code tab insertion, 2 spaces for JSON/YAML |
| web/oss/src/components/Editor/plugins/code/plugins/AutoFormatAndValidateOnPastePlugin.tsx | Skips indentation transformation for Python/code, maintains it for JSON/YAML |
| examples/python/evaluators/openai/*.py | Adds OpenAI SDK evaluators for testing API availability and exact match comparisons |
| examples/python/evaluators/numpy/*.py | Adds NumPy evaluators for testing library availability and character counting |
| examples/python/evaluators/basic/*.py | Adds basic evaluators using Python stdlib for string matching, length checks, JSON validation |
| examples/python/evaluators/ag/*.py | Adds Agenta API endpoint evaluators for health, secrets, and config endpoints |
| examples/python/evaluators/*.md | Provides comprehensive documentation (README, QUICKSTART, SUMMARY) for evaluators |
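
To make the local-runner row more concrete, here is a minimal sketch of running a stdlib-only evaluator in a namespace that exposes only a few built-in types. It uses plain `exec` with a whitelisted `__builtins__` mapping, not RestrictedPython, and is not a security boundary. The `evaluate(app_params, inputs, output, correct_answer)` signature, `SAFE_BUILTINS`, and `load_evaluator` are assumptions for illustration, not the code in this PR.

```python
from typing import Any, Callable

# Assumption: a small whitelist in the spirit of "dict, list, str, etc.".
SAFE_BUILTINS: dict[str, Any] = {
    "dict": dict, "list": list, "str": str, "int": int,
    "float": float, "bool": bool, "len": len, "isinstance": isinstance,
}

# Hypothetical evaluator source, as a user might paste it into the editor.
EVALUATOR_SOURCE = '''
def evaluate(app_params, inputs, output, correct_answer):
    # Exact match after trimming whitespace; 1.0 on match, else 0.0.
    return 1.0 if str(output).strip() == str(correct_answer).strip() else 0.0
'''


def load_evaluator(source: str) -> Callable[..., float]:
    """Execute the source with only whitelisted builtins and return evaluate()."""
    namespace: dict[str, Any] = {"__builtins__": dict(SAFE_BUILTINS)}
    exec(source, namespace)
    return namespace["evaluate"]


evaluate = load_evaluator(EVALUATOR_SOURCE)
score = evaluate({}, {"country": "France"}, " Paris ", "Paris")
```

Names outside the whitelist, such as `open`, raise `NameError` when the evaluator calls them.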

Comment thread examples/python/evaluators/numpy/dependency_check.py
Comment thread examples/python/evaluators/ag/secrets_check.py
Comment thread examples/python/evaluators/ag/configs_check.py
Comment thread sdk/agenta/sdk/workflows/runners/daytona.py Outdated
Comment thread web/oss/src/components/Editor/plugins/code/utils/pasteUtils.ts Outdated
Comment thread examples/python/evaluators/openai/exact_match.py Outdated
Comment thread examples/python/evaluators/ag/health_check.py
Comment thread examples/python/evaluators/openai/dependency_check.py Outdated
Comment thread examples/python/evaluators/openai/dependency_check.py Outdated
Comment thread examples/python/evaluators/numpy/dependency_check.py Outdated
Copilot AI review requested due to automatic review settings December 23, 2025 11:39
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 32 out of 37 changed files in this pull request and generated 8 comments.


Comment thread sdk/agenta/sdk/workflows/handlers.py Outdated
Comment thread sdk/agenta/sdk/workflows/runners/daytona.py Outdated
Comment thread api/oss/src/services/evaluators_service.py
Comment thread examples/python/evaluators/openai/exact_match.py
Comment thread sdk/agenta/sdk/workflows/runners/local.py
Comment thread sdk/agenta/sdk/workflows/runners/daytona.py Outdated
Comment thread sdk/agenta/sdk/workflows/runners/daytona.py Outdated
Comment thread examples/python/evaluators/openai/dependency_check.py
Add standard provider keys from vault as env vars
Add templates
Fix credentials (and thus secrets and traces) in evaluator playground
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 49 out of 56 changed files in this pull request and generated 5 comments.


Comment thread sdk/agenta/sdk/workflows/runners/local.py Outdated
Comment thread sdk/agenta/sdk/workflows/runners/daytona.py
Comment thread examples/test_daytona_scripts.py
Comment thread examples/python/evaluators/openai/exact_match.py
@mmabrouk mmabrouk changed the base branch from frontend-feat/new-testsets-integration to main January 5, 2026 16:42
Member

@mmabrouk mmabrouk left a comment

Thanks for the PR @jp-agenta

Misc

  • I don't think we should remove RestrictedPython completely. I would still keep it as the default option and allow self-hosting users to use codex execution if they feel their environment is safe.
  • We should add documentation for the new environment variable

QA Bugs:

  • Errors are not shown or propagated in the evaluator playground

AI review Issues

  • 2. Sandbox not deleted on error in DaytonaRunner

    • File: sdk/agenta/sdk/workflows/runners/daytona.py:306-309
    • Issue: If exception occurs before sandbox.delete(), sandbox leaks
    • Current code:
      except Exception as e:
          log.error(f"Error during Daytona code execution: {e}", exc_info=True)
          raise RuntimeError(...)  # sandbox.delete() not called!
    • Recommendation: Use try/finally pattern:
      sandbox = self._create_sandbox(runtime=runtime)
      try:
          # ... execution logic ...
      finally:
          sandbox.delete()
  • 4. 🔴 CRITICAL BUG: evaluators_service.py not updated for 3-tuple return

    • File: api/oss/src/services/evaluators_service.py lines 844 and 1028
    • Issue: SecretsManager.retrieve_secrets() now returns tuple[list, list, list] but these callers still expect a single list
    • Current (broken):
      secrets = await SecretsManager.retrieve_secrets()
      for secret in secrets:  # Iterates over tuple, not secrets!
          if secret.get("kind") == ...  # AttributeError: list has no .get()
    • Fix:
      secrets, _, _ = await SecretsManager.retrieve_secrets()
    • Impact: Will break LLM-as-a-Judge (auto_ai_critique) and semantic similarity evaluators
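
The two fixes listed under issues 2 and 4 can be sketched together in a self-contained form. `FakeSandbox`, `run_in_sandbox`, the stubbed `retrieve_secrets`, and the `"provider_key"` value are hypothetical stand-ins for illustration; they do not reproduce the Daytona SDK or the real SecretsManager.

```python
import asyncio
from typing import Any


class FakeSandbox:
    """Hypothetical stand-in for a sandbox; it only tracks deletion."""

    def __init__(self) -> None:
        self.deleted = False

    def run(self, code: str) -> str:
        if "boom" in code:
            raise ValueError("execution failed")
        return "ok"

    def delete(self) -> None:
        self.deleted = True


def run_in_sandbox(sandbox: FakeSandbox, code: str) -> str:
    """Issue 2: delete the sandbox even when execution raises."""
    try:
        return sandbox.run(code)
    except Exception as e:
        raise RuntimeError(f"Error during code execution: {e}") from e
    finally:
        # Runs on success and on failure, so the sandbox cannot leak.
        sandbox.delete()


async def retrieve_secrets() -> tuple[list[dict[str, Any]], list[Any], list[Any]]:
    """Hypothetical stub with the 3-tuple shape described in issue 4."""
    return [{"kind": "provider_key"}], [], []


async def secret_kinds() -> list[str]:
    """Issue 4: unpack first; iterating the tuple would yield three lists."""
    secrets, _, _ = await retrieve_secrets()
    return [secret.get("kind") for secret in secrets]
```

With `finally`, cleanup happens before the `RuntimeError` propagates to the caller, so the error path and the success path share one deletion point.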

@mmabrouk mmabrouk requested a review from ardaerzin January 5, 2026 18:49
@mmabrouk mmabrouk marked this pull request as ready for review January 5, 2026 18:49
@dosubot dosubot Bot added the Evaluation label Jan 5, 2026
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 53 out of 60 changed files in this pull request and generated 8 comments.


Comment thread sdk/agenta/sdk/workflows/runners/daytona.py
Comment thread web/oss/src/components/Editor/plugins/code/utils/pasteUtils.ts
Comment thread sdk/agenta/sdk/middlewares/running/vault.py
Comment thread sdk/agenta/sdk/middleware/vault.py
Comment thread sdk/agenta/sdk/workflows/templates.py
Comment thread sdk/agenta/sdk/middleware/vault.py
Contributor

@ardaerzin ardaerzin left a comment

no major frontend concerns, lgtm 👍 thank you @jp-agenta 🙏

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 55 out of 63 changed files in this pull request and generated no new comments.


Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 56 out of 64 changed files in this pull request and generated 5 comments.


Comment thread sdk/agenta/sdk/middleware/vault.py
Comment thread sdk/agenta/sdk/middlewares/running/vault.py
Comment thread sdk/agenta/sdk/contexts/running.py
Comment thread sdk/agenta/sdk/types.py
Comment thread sdk/agenta/sdk/middlewares/running/vault.py
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 53 out of 61 changed files in this pull request and generated no new comments.


Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 55 out of 63 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (6)

sdk/agenta/sdk/types.py:1

  • The import re statement appears after class definitions (line 498 shows a class ending). Move this import to the top of the file with other imports to follow Python conventions and improve code organization.

web/oss/src/components/pages/evaluations/autoEvaluation/EvaluatorsModal/ConfigureEvaluator/index.tsx:1

  • The use of any type defeats TypeScript's type safety. Consider defining a more specific type or using unknown if the structure is truly dynamic, then narrow it with type guards where needed.

sdk/agenta/sdk/workflows/runners/local.py:1

  • Using dict() instead of {} for creating an empty dictionary is less idiomatic and slightly less efficient. Use {} instead for consistency with Python conventions.

web/oss/src/components/Editor/plugins/code/plugins/IndentationPlugin.tsx:1

  • The hardcoded space strings for indentation could be defined as named constants (e.g., JSON_YAML_INDENT = "  ", CODE_INDENT = "    ") to improve maintainability and make the indentation standards more explicit.

sdk/agenta/sdk/middlewares/running/vault.py:1

  • The comment # pylint: disable=bare-except is misleading since the code actually catches Exception rather than using a bare except clause. Remove this comment as it's no longer accurate.

sdk/agenta/sdk/middleware/vault.py:1

  • The comment # pylint: disable=bare-except is misleading since the code actually catches Exception rather than using a bare except clause. Remove this comment as it's no longer accurate.
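
The distinction behind the last two suppressed comments is that a bare `except:` catches everything derived from `BaseException`, while `except Exception` lets `KeyboardInterrupt` and `SystemExit` through. A minimal sketch follows; the helper name `classify` is made up for illustration.

```python
def classify(exc: BaseException) -> str:
    """Report which kind of handler would catch the given exception."""
    try:
        raise exc
    except Exception:
        # Ordinary errors (ValueError, RuntimeError, ...) land here.
        return "caught by except Exception"
    except BaseException:
        # KeyboardInterrupt and SystemExit only reach a broader handler.
        return "only caught by bare except"
```

Since the vault middleware already uses `except Exception`, the pylint bare-except suppression has nothing left to suppress.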

Labels

Evaluation, example, feature, SDK, size:XXL (This PR changes 1000+ lines, ignoring generated files.)

6 participants