feat: annotation queue testset export#4261
Conversation
Adds PRD, RFC, and README to docs/designs/testset-annotation-queue/. The design covers: - Per-scenario "Add to Testset" button in the annotate tab - "Done with queue" screen supporting both trace and testcase queues - All-annotations tab with row selection and upgraded commit modal - New AddToTestsetModal component with EntityPicker integration - Controller actions for addScenariosToTestset (trace + testcase paths) - Default target testset heuristics and last-used persistence https://claude.ai/code/session_01B2uQKidAr1KJ4CR9sroY2H
Key corrections: - Replace new AddToTestsetModal with EntityCommitModal + renderModeContent - Replace useState for selected testset with pendingTestsetSelectionAtom in annotationSessionController — survives re-renders and can be read imperatively by addScenariosToTestset without closure capture - selectedScenarioIdsAtom for row selection (also atom, not useState) - openAddToTestsetModal seeds pendingTestsetSelectionAtom from default - addScenariosToTestset reads target testset from atom, not payload - Add state atom summary table documenting all atom lifecycles https://claude.ai/code/session_01B2uQKidAr1KJ4CR9sroY2H
Inputs (agData.inputs) spread into N columns — one per input key. Outputs (agData.outputs) always map to a single "output" column regardless of value shape, matching extractOutputs() behaviour in trace/utils/selectors.ts which treats outputs as a leaf. Annotation values add one column per evaluator slug.
…n reset 1. Trace annotation resolution: use scenarioAnnotationsAtomFamily(scenarioId) not a raw query by traceId. The atom handles step-based resolution to avoid cross-queue bleed (as documented in the controller). 2. Scope label when scope="all": addToTestsetScenarioIds() is empty for the "all" case — read actual count from scenarioIds() instead of falling back to the "all" string. 3. Row selection reset: addScenariosToTestset clears selectedScenarioIdsAtom after a successful export so stale selections don't persist.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds an "Add to Testset" feature: controller atoms/actions and types for modal/export jobs, background export implementation with row builders/remapping for traces/testcases, UI wiring and modal integration, annotation column child support, and related testset API adjustments. ChangesAdd to Testset Feature
sequenceDiagram Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly Related PRs
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
docs/designs/testset-annotation-queue/PRD.md (1)
42-48: 💤 Low valueAdd blank lines around tables for markdown compliance.
The tables in this document are missing surrounding blank lines, which causes markdownlint warnings. This affects tables at lines 42, 89, 107, and 118.
📝 Example fix for the competitive reference table
## Competitive Reference ### LangSmith + | Capability | Supported | |------------|-----------| | Create or extend a testset from both test cases and traces | Yes | | Add a single scenario to a testset | Yes | | Add to a testset from the focused/annotate tab | Yes | | Select table rows and add them to a testset | Yes | + LangSmith supports all four capabilities; Agenta currently supports none of them in full.docs/designs/testset-annotation-queue/RFC.md (1)
18-57: 💤 Low valueAdd language specifiers to fenced code blocks.
Several code blocks in this RFC lack language specifiers, which affects syntax highlighting and readability. The architecture diagram at line 18 and examples at lines 244, 361, 366, 385, 391, 485, 500, 517 should specify a language (use
textorplaintextfor ASCII diagrams).📝 Example fix for ASCII diagram
-``` +```text User action (button click) │ ▼
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 22636d88-f5f8-408b-a088-51bb814f81e5
📒 Files selected for processing (11)
docs/designs/testset-annotation-queue/PRD.mddocs/designs/testset-annotation-queue/README.mddocs/designs/testset-annotation-queue/RFC.mdweb/packages/agenta-annotation-ui/src/components/AnnotationSession/FocusView.tsxweb/packages/agenta-annotation-ui/src/components/AnnotationSession/ScenarioListView.tsxweb/packages/agenta-annotation-ui/src/components/AnnotationSession/SessionNavigation.tsxweb/packages/agenta-annotation-ui/src/components/AnnotationSession/index.tsxweb/packages/agenta-entity-ui/src/modals/commit/components/EntityCommitContent.tsxweb/packages/agenta-entity-ui/src/selection/components/UnifiedEntityPicker/types.tsweb/packages/agenta-entity-ui/src/selection/components/UnifiedEntityPicker/variants/CascadingVariant.tsxweb/packages/agenta-entity-ui/src/selection/hooks/modes/useCascadingMode.ts
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
web/packages/agenta-annotation-ui/src/components/AnnotationSession/ScenarioListView.tsx (1)
552-569:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winKeep the testcase fallback key separate from the metric path.
AnnotationOutputKeyCellpassesoutputKeyintouseAnnotationCellFallback(), and after Line 912 that value is often the full mapped path (attributes.ag.data.outputs.score, etc.). The fallback hook indexes the testcase container object by that same string, so grouped annotation cells now miss synced testcase values and render—until an annotation exists.Possible fix
const AnnotationOutputKeyCell = memo(function AnnotationOutputKeyCell({ scenarioId, def, outputKey, + fallbackOutputKey, fallbackDataKey, }: { scenarioId: string def: AnnotationColumnDef outputKey: string + fallbackOutputKey?: string fallbackDataKey?: string | null }) { const runId = useAtomValue(annotationSessionController.selectors.activeRunId()) ?? undefined const PopoverWrapper = useMetricPopoverWrapper() const {fallbackValue, isPending} = useAnnotationCellFallback( scenarioId, fallbackDataKey, - outputKey, + fallbackOutputKey ?? outputKey, )render: (_value: unknown, record: ScenarioTableRow) => ( <AnnotationOutputKeyCell scenarioId={record.scenarioId} def={outputColumn.annotationDef} outputKey={outputColumn.annotationDef.path ?? outputColumn.title} + fallbackOutputKey={outputColumn.title} fallbackDataKey={def.fallbackDataKey} /> ),Also applies to: 899-913
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: e691f741-e404-4527-b6c9-1dbc7e659952
📒 Files selected for processing (12)
web/packages/agenta-annotation-ui/src/components/AnnotationSession/ScenarioListView.tsxweb/packages/agenta-annotation/src/index.tsweb/packages/agenta-annotation/src/state/controllers/annotationSessionController.tsweb/packages/agenta-annotation/src/state/controllers/index.tsweb/packages/agenta-annotation/src/state/index.tsweb/packages/agenta-annotation/src/state/testsetSync.tsweb/packages/agenta-annotation/src/state/types.tsweb/packages/agenta-entities/src/testset/api/api.tsweb/packages/agenta-entities/src/testset/api/index.tsweb/packages/agenta-entities/src/testset/index.tsweb/packages/agenta-entity-ui/src/modals/commit/components/EntityCommitModal.tsxweb/packages/agenta-entity-ui/src/modals/commit/state.ts
💤 Files with no reviewable changes (2)
- web/packages/agenta-entity-ui/src/modals/commit/state.ts
- web/packages/agenta-entity-ui/src/modals/commit/components/EntityCommitModal.tsx
✅ Files skipped from review due to trivial changes (1)
- web/packages/agenta-annotation/src/state/index.ts
Railway Preview Environment
|
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
web/packages/agenta-entities/src/testset/api/mutations.ts (1)
47-73:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep
createTestset’s return shape consistent.Line 73 still returns raw
response.data, but the callers inweb/packages/agenta-entities/src/testset/state/testsetMolecule.ts:685-705andweb/packages/agenta-entities/src/testset/state/mutations.ts:489-515only understand{ testset, revisionId }. If the backend ever omitsdata.testset, those flows fall into the"No revision ID returned from API"path even though the request itself may have succeeded. Throw here or normalize the fallback into the same shape.
🧹 Nitpick comments (2)
web/packages/agenta-annotation-ui/src/components/AnnotationSession/EmptyQueueState.tsx (1)
39-47: 💤 Low valueConsider using theme variables for button colors.
The hardcoded hex colors (
#051729,#0a2540) bypass the theme system. If these are intentional brand colors for the observability feature, consider extracting them to CSS variables or a shared constant for consistency across the codebase.web/packages/agenta-annotation-ui/src/components/AnnotationSession/index.tsx (1)
206-224: 💤 Low valueInline array creation in render callback.
Line 213 creates
[pendingTestsetSelection]on each invocation. IfEntityPickerperforms shallow comparison oninitialSelections, this could cause unnecessary re-renders. Consider memoizing the array:♻️ Suggested optimization
+ const initialTestsetSelections = useMemo( + () => [pendingTestsetSelection], + [pendingTestsetSelection], + ) + const renderAddToTestsetModeContent = useCallback( ({mode}: {mode?: string}) => ( <div className="flex flex-col gap-3"> {mode !== "new" && ( <EntityPicker<AddToTestsetTargetSelection> variant="cascading" adapter={ADD_TO_TESTSET_TARGET_ADAPTER} - initialSelections={[pendingTestsetSelection]} + initialSelections={initialTestsetSelections} showLabels showAutoIndicator={false} placeholders={["Select testset"]} onSelect={handleTestsetSelect} onDeselect={handleTestsetDeselect} /> )} </div> ), - [handleTestsetSelect, handleTestsetDeselect, pendingTestsetSelection], + [handleTestsetSelect, handleTestsetDeselect, initialTestsetSelections], )
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 403b6771-e83c-477b-bbb2-86059983e95e
📒 Files selected for processing (8)
web/packages/agenta-annotation-ui/src/components/AnnotationSession/EmptyQueueState.tsxweb/packages/agenta-annotation-ui/src/components/AnnotationSession/FocusView.tsxweb/packages/agenta-annotation-ui/src/components/AnnotationSession/SessionHeaderRight.tsxweb/packages/agenta-annotation-ui/src/components/AnnotationSession/SessionTitle.tsxweb/packages/agenta-annotation-ui/src/components/AnnotationSession/constants.tsweb/packages/agenta-annotation-ui/src/components/AnnotationSession/index.tsxweb/packages/agenta-annotation-ui/src/components/AnnotationSession/type.tsweb/packages/agenta-entities/src/testset/api/mutations.ts
There was a problem hiding this comment.
Actionable comments posted: 1
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: f72c55ad-f996-4d48-b99d-45e5d8841ed3
📒 Files selected for processing (1)
web/packages/agenta-entities/src/testset/api/mutations.ts
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (3)
web/packages/agenta-annotation/src/state/controllers/annotationSessionController.ts (3)
2548-2562:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftSchema inference from a single testcase row may miss existing columns.
fetchLatestRevisionWithRowsrequeststestcaseLimit: 1, butextractExistingColumns(Line 2829) uses this single row to build the set of existing columns. If that sampled row is sparse or doesn't contain all leaf paths present in the target testset,remapRowsToExistingLeafColumnswon't recognize valid existing columns. This can causepatchRevisionto add duplicate or misaligned column paths.Consider fetching the testset's column schema directly (if available via API) or increasing the sample size to union columns across multiple rows:
Also applies to: 2823-2830
2633-2657:⚠️ Potential issue | 🟠 Major | ⚡ Quick winStep and annotation query waits don't validate returned state before building rows.
Unlike the trace query (Lines 2616-2621), the waits for
stepsQueryAtomandannotationsQueryAtom(Lines 2633-2644) don't check if the query succeeded or errored. When these queries are still pending at timeout or have errored,buildTraceAnnotationOutputs(Line 2648) reads fromscenarioAnnotationsAtomFamilywhich returns an empty array on failure. This silently produces export rows with missing annotation outputs.Add validation after each wait to fail the export or skip the scenario with a warning:
Proposed fix to validate query results
const stepsQueryAtom = scenarioStepsQueryStateAtomFamily(scenarioId) - await waitForStoreAtomValue<QueryStateLike | null | undefined>( + const stepsQuery = await waitForStoreAtomValue<QueryStateLike | null | undefined>( stepsQueryAtom, isQuerySettledOrNullForExport, ) + if (stepsQuery?.error) { + console.warn(`[export] Skipping scenario ${scenarioId}: steps query failed`) + processed += 1 + params.setProcessed(processed) + continue + } const annotationsQueryAtom = scenarioAnnotationsQueryStateAtomFamily(scenarioId) - await waitForStoreAtomValue<QueryStateLike | null | undefined>( + const annotationsQuery = await waitForStoreAtomValue<QueryStateLike | null | undefined>( annotationsQueryAtom, isQuerySettledOrNullForExport, 2500, ) + if (annotationsQuery?.error) { + console.warn(`[export] Skipping scenario ${scenarioId}: annotations query failed`) + processed += 1 + params.setProcessed(processed) + continue + }
2339-2365:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftTimeout resolution silently returns potentially invalid state.
When the timeout expires,
waitForStoreAtomValueresolves with the current atom value regardless of whetherisReady()passes. This allows callers to proceed with stale or pending data when queries take longer than expected.This relates to the past review concern: if annotation/step queries are still pending when the timeout fires, Lines 2646-2653 will build export rows from incomplete store state, potentially exporting rows with missing annotation outputs.
Consider rejecting on timeout or returning a discriminated result indicating whether the wait succeeded:
Proposed fix to reject on timeout
async function waitForStoreAtomValue<T>( atomToWatch: unknown, isReady: (value: T) => boolean, timeoutMs = 5000, ): Promise<T> { const store = getStore() const atomRef = atomToWatch as unknown as Parameters<typeof store.get>[0] const subRef = atomToWatch as unknown as Parameters<typeof store.sub>[0] const current = store.get(atomRef) as T if (isReady(current)) return current return await new Promise<T>((resolve, reject) => { const timeout = setTimeout(() => { unsubscribe() - resolve(store.get(atomRef) as T) + const final = store.get(atomRef) as T + if (isReady(final)) { + resolve(final) + } else { + reject(new Error("Timed out waiting for atom value")) + } }, timeoutMs) const unsubscribe = store.sub(subRef, () => { const next = store.get(atomRef) as T if (isReady(next)) { clearTimeout(timeout) unsubscribe() resolve(next) } }) }) }
🧹 Nitpick comments (1)
web/packages/agenta-annotation/src/state/controllers/annotationSessionController.ts (1)
1142-1219: 💤 Low valueRemove or gate debug logging in production code.
The
scenarioAnnotationTraceIdsAtomFamilyatom contains extensiveconsole.logstatements with[annot-debug]prefix (Lines 1142-1144, 1151-1153, 1180-1183, 1192-1218). These appear to be development debugging aids that would create noise in production console output.Consider removing these logs or gating them behind a debug flag:
Suggested cleanup
+const DEBUG_ANNOTATIONS = false // or process.env.NODE_ENV === 'development' + const scenarioAnnotationTraceIdsAtomFamily = atomFamily((scenarioId: string) => atom<string[]>((get) => { const runId = get(activeRunIdAtom) if (!runId || !scenarioId) { - console.log( - `[annot-debug] traceIds(${scenarioId?.slice(0, 8)}): no runId or scenarioId`, - ) + if (DEBUG_ANNOTATIONS) { + console.log( + `[annot-debug] traceIds(${scenarioId?.slice(0, 8)}): no runId or scenarioId`, + ) + } return [] } // ... similar changes for other debug logs
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: c15582c6-c412-43ae-80cd-a8c6c4f664cf
📒 Files selected for processing (2)
web/packages/agenta-annotation/src/state/controllers/annotationSessionController.tsweb/packages/agenta-annotation/src/state/testsetSync.ts
| newTestsetSlug?: string | ||
| } | ||
|
|
||
| const lastUsedTestsetByProjectAtom = atom<Record<string, string | null>>({}) |
…xt acceptance to provide entity link easily on the message for faster and better navigation
- refactor code - rendering json data in beautified mode by default
What's new
Fixes
QA scope (Annotation-queue)