[feat] Add DaytonaRunner for code evaluators #3258
Pull request overview
This PR implements and tests Daytona-based code evaluation functionality, transitioning from the legacy local sandbox to a new SDK-based approach. It includes improvements to code editor indentation handling for Python/code blocks and adds example evaluators for testing various dependencies and API endpoints.
Key Changes
- Replaced legacy `custom_code_run` with new `sdk_custom_code_run` that uses the SDK's workflow-based evaluator system
- Enhanced code editor to preserve exact indentation for Python/code (no transformations) while maintaining space-to-tab conversion for JSON/YAML
- Added example evaluators for testing OpenAI, NumPy, and Agenta API endpoints in Daytona environments
Reviewed changes
Copilot reviewed 20 out of 25 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `api/oss/src/services/evaluators_service.py` | Implements new SDK-based custom code runner function that delegates to workflow system |
| `api/oss/src/resources/evaluators/evaluators.py` | Updates default code template with deprecation note for `app_params` |
| `sdk/agenta/sdk/workflows/runners/daytona.py` | Adds environment variables (`OPENAI_API_KEY`, `AGENTA_HOST`, `AGENTA_CREDENTIALS`) to sandbox |
| `sdk/agenta/sdk/workflows/runners/local.py` | Exposes built-in Python types (`dict`, `list`, `str`, etc.) to restricted environment |
| `sdk/agenta/sdk/decorators/running.py` | Adds fallback to `request.credentials` in credential resolution chain |
| `web/oss/src/components/Editor/plugins/code/utils/pasteUtils.ts` | Preserves exact indentation for Python/code, converts spaces to tabs for JSON/YAML |
| `web/oss/src/components/Editor/plugins/code/plugins/IndentationPlugin.tsx` | Uses 4 spaces for Python/code tab insertion, 2 spaces for JSON/YAML |
| `web/oss/src/components/Editor/plugins/code/plugins/AutoFormatAndValidateOnPastePlugin.tsx` | Skips indentation transformation for Python/code, maintains it for JSON/YAML |
| `examples/python/evaluators/openai/*.py` | Adds OpenAI SDK evaluators for testing API availability and exact match comparisons |
| `examples/python/evaluators/numpy/*.py` | Adds NumPy evaluators for testing library availability and character counting |
| `examples/python/evaluators/basic/*.py` | Adds basic evaluators using Python stdlib for string matching, length checks, JSON validation |
| `examples/python/evaluators/ag/*.py` | Adds Agenta API endpoint evaluators for health, secrets, and config endpoints |
| `examples/python/evaluators/*.md` | Provides comprehensive documentation (README, QUICKSTART, SUMMARY) for evaluators |
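For readers unfamiliar with the basic evaluators listed above, here is a minimal sketch of what a stdlib-only JSON validation evaluator might look like. It is not taken from the PR's files: the `evaluate(app_params, inputs, output, correct_answer)` signature and the float score are assumptions based on the default template that the table says still mentions `app_params`.

```python
import json
from typing import Any, Dict, Union


def evaluate(
    app_params: Dict[str, Any],  # assumed parameter; the PR notes it as deprecated
    inputs: Dict[str, Any],
    output: Union[str, Dict[str, Any]],
    correct_answer: str,
) -> float:
    """Return 1.0 if the output is valid JSON (or already a dict), else 0.0."""
    if isinstance(output, dict):
        # A dict output is treated as already-parsed JSON.
        return 1.0
    try:
        json.loads(output)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a ValueError; non-string input raises TypeError.
        return 0.0
    return 1.0
```

Because it relies only on the standard library, a check like this should run the same way in either the local or the Daytona runner, which is presumably why the basic examples are useful as a smoke test.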
…ck-daytona-code-evaluator
Pull request overview
Copilot reviewed 32 out of 37 changed files in this pull request and generated 8 comments.
Pull request overview
Copilot reviewed 49 out of 56 changed files in this pull request and generated 5 comments.
mmabrouk
left a comment
Thanks for the PR @jp-agenta
Misc
- I don't think we should remove RestrictedPython completely. I would still keep it as the default option and allow self-hosting users to use codex execution if they feel their environment is safe.
- We should add documentation for the new environment variable
QA Bugs:
- Errors are not shown or propagated in the evaluator playground
AI review Issues
2. Sandbox not deleted on error in DaytonaRunner
   - File: `sdk/agenta/sdk/workflows/runners/daytona.py:306-309`
   - Issue: If exception occurs before `sandbox.delete()`, sandbox leaks
   - Current code:

         except Exception as e:
             log.error(f"Error during Daytona code execution: {e}", exc_info=True)
             raise RuntimeError(...)
             # sandbox.delete() not called!

   - Recommendation: Use `try/finally` pattern:

         sandbox = self._create_sandbox(runtime=runtime)
         try:
             # ... execution logic ...
         finally:
             sandbox.delete()
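To make the sandbox-leak recommendation concrete, here is a minimal runnable sketch of the `try/finally` pattern. `FakeSandbox` and `run_in_sandbox` are hypothetical stand-ins invented for illustration; they are not the Daytona SDK or the actual `DaytonaRunner` code.

```python
class FakeSandbox:
    """Hypothetical stand-in for a remote sandbox; not the real Daytona object."""

    def __init__(self) -> None:
        self.deleted = False

    def run(self, code: str) -> str:
        # Simulate an execution failure for one specific input.
        if "boom" in code:
            raise ValueError("simulated execution failure")
        return "ok"

    def delete(self) -> None:
        self.deleted = True


def run_in_sandbox(sandbox: FakeSandbox, code: str) -> str:
    """Run code and always delete the sandbox, even when execution fails."""
    try:
        return sandbox.run(code)
    except Exception as e:
        # Re-raise with context; cleanup still happens in the finally clause.
        raise RuntimeError(f"Error during sandboxed code execution: {e}") from e
    finally:
        sandbox.delete()
```

The `finally` clause runs on both the success path and the error path, so the cleanup no longer depends on where the exception is raised.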
4. 🔴 CRITICAL BUG: `evaluators_service.py` not updated for 3-tuple return
   - File: `api/oss/src/services/evaluators_service.py` lines 844 and 1028
   - Issue: `SecretsManager.retrieve_secrets()` now returns `tuple[list, list, list]` but these callers still expect a single list
   - Current (broken):

         secrets = await SecretsManager.retrieve_secrets()
         for secret in secrets:  # Iterates over tuple, not secrets!
             if secret.get("kind") == ...  # AttributeError: list has no .get()

   - Fix:

         secrets, _, _ = await SecretsManager.retrieve_secrets()

   - Impact: Will break LLM-as-a-Judge (`auto_ai_critique`) and semantic similarity evaluators
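The 3-tuple issue can be reproduced in isolation. The sketch below uses a hypothetical `retrieve_secrets` coroutine that only mimics the `tuple[list, list, list]` shape described in the review; the secret contents and the meaning of the second and third lists are assumptions, not the real `SecretsManager` behavior.

```python
import asyncio
from typing import Any, Dict, List, Tuple

Secret = Dict[str, Any]


async def retrieve_secrets() -> Tuple[List[Secret], List[Secret], List[Secret]]:
    """Hypothetical stand-in that returns three lists, like the new 3-tuple shape."""
    first = [{"kind": "provider_key", "data": {"key": "sk-example"}}]
    return first, [], []


async def collect_kinds_broken() -> List[Any]:
    secrets = await retrieve_secrets()
    # Iterates over the tuple, so each item is a list and has no .get().
    return [secret.get("kind") for secret in secrets]


async def collect_kinds_fixed() -> List[Any]:
    # Unpack the tuple and keep only the first list.
    secrets, _, _ = await retrieve_secrets()
    return [secret.get("kind") for secret in secrets]
```

Calling `asyncio.run(collect_kinds_broken())` raises `AttributeError`, while the unpacked version iterates over the secret dictionaries as the callers originally intended.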
Pull request overview
Copilot reviewed 53 out of 60 changed files in this pull request and generated 8 comments.
ardaerzin
left a comment
no major frontend concerns, lgtm 👍 thank you @jp-agenta 🙏
Pull request overview
Copilot reviewed 55 out of 63 changed files in this pull request and generated no new comments.
Pull request overview
Copilot reviewed 56 out of 64 changed files in this pull request and generated 5 comments.
Pull request overview
Copilot reviewed 53 out of 61 changed files in this pull request and generated no new comments.
Pull request overview
Copilot reviewed 55 out of 63 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (6)
`sdk/agenta/sdk/types.py:1`
- The `import re` statement appears after class definitions (line 498 shows a class ending). Move this import to the top of the file with other imports to follow Python conventions and improve code organization.

`web/oss/src/components/pages/evaluations/autoEvaluation/EvaluatorsModal/ConfigureEvaluator/index.tsx:1`
- The use of `any` type defeats TypeScript's type safety. Consider defining a more specific type or using `unknown` if the structure is truly dynamic, then narrow it with type guards where needed.

`sdk/agenta/sdk/workflows/runners/local.py:1`
- Using `dict()` instead of `{}` for creating an empty dictionary is less idiomatic and slightly less efficient. Use `{}` instead for consistency with Python conventions.

`web/oss/src/components/Editor/plugins/code/plugins/IndentationPlugin.tsx:1`
- The hardcoded space strings for indentation could be defined as named constants (e.g., `JSON_YAML_INDENT = "  "`, `CODE_INDENT = "    "`) to improve maintainability and make the indentation standards more explicit.

`sdk/agenta/sdk/middlewares/running/vault.py:1`
- The comment `# pylint: disable=bare-except` is misleading since the code actually catches `Exception` rather than using a bare except clause. Remove this comment as it's no longer accurate.

`sdk/agenta/sdk/middleware/vault.py:1`
- The comment `# pylint: disable=bare-except` is misleading since the code actually catches `Exception` rather than using a bare except clause. Remove this comment as it's no longer accurate.
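The distinction behind the last two suppressed comments can be shown with a small sketch. The helper names below are hypothetical and unrelated to the vault middleware; the point is only that `except Exception` and a bare `except:` do not catch the same things.

```python
from typing import Callable


def call_with_except_exception(func: Callable[[], None]) -> str:
    """Catch only Exception subclasses (what the review says the code does)."""
    try:
        func()
    except Exception:
        return "handled"
    return "no error"


def call_with_bare_except(func: Callable[[], None]) -> str:
    """Catch everything, including SystemExit and KeyboardInterrupt."""
    try:
        func()
    except:  # noqa: E722 - bare except kept on purpose for the comparison
        return "handled"
    return "no error"


def raise_value_error() -> None:
    raise ValueError("ordinary failure")


def raise_system_exit() -> None:
    raise SystemExit(1)
```

`SystemExit` derives from `BaseException` rather than `Exception`, so it passes through the first helper and is swallowed by the second. Since the code under review uses `except Exception`, the `bare-except` suppression has nothing to suppress.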
No description provided.