UN-3266 [FIX] Preventing re-extraction in managing documents#1909
UN-3266 [FIX] Preventing re-extraction in managing documents#1909harini-venkataraman merged 155 commits intomainfrom
Conversation
Conflicts resolved: - docker-compose.yaml: Use main's dedicated dashboard_metric_events queue for worker-metrics - PromptCard.jsx: Keep tool_id matching condition from our async socket feature - PromptRun.jsx: Merge useEffect import from main with our branch - ToolIde.jsx: Keep fire-and-forget socket approach (spinner waits for socket event) - SocketMessages.js: Keep both session-store and socket-custom-tool imports + updateCusToolMessages dep - SocketContext.js: Keep simpler path-based socket connection approach - usePromptRun.js: Keep Celery fire-and-forget with socket delivery over polling - setupProxy.js: Accept main's deletion (migrated to Vite)
for more information, see https://pre-commit.ci
… into feat/execution-backend
for more information, see https://pre-commit.ci
… into feat/execution-backend
for more information, see https://pre-commit.ci
Collapse multi-line `<Typography.Text>null</Typography.Text>` JSX to a single line so biome's formatter passes in CI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a defensive guard in `UsageHelper.get_usage_by_model()` that drops `Usage` rows where `usage_type == "llm"` and `llm_usage_reason` is empty. Per the Usage model contract, an empty reason is only valid when `usage_type == "embedding"`; an empty reason combined with `usage_type == "llm"` is a producer-side bug (an LLM call site forgot to pass `llm_usage_reason` in `usage_kwargs`). Without this guard the row surfaces in API deployment responses as a malformed bare `"llm"` bucket with no token breakdown alongside the legitimate `"extraction_llm"` bucket. The guard logs a warning on every dropped row so future producer regressions are detectable. Adds three regression tests in `backend/usage_v2/tests/test_helper.py` that stub `account_usage.models` and `usage_v2.models` in `sys.modules` so the helper can be imported without Django being set up: - `test_unlabeled_llm_row_is_dropped` — bare "llm" bucket disappears - `test_embedding_row_is_preserved` — guard is scoped to LLM rows - `test_all_three_llm_reasons_coexist` — extraction/challenge/summarize Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
for more information, see https://pre-commit.ci
- legacy_executor: extract _run_pipeline_answer_step helper to drop _handle_structure_pipeline cognitive complexity from 18 to under 15 - legacy_executor: bundle 9 prompt-run scalars into a prompt_run_args dict so _run_line_item_extraction has 8 params (was 15, limit 13) - legacy_executor: merge implicitly concatenated log string - structure_tool_task: extract _write_pipeline_outputs helper used by both _execute_structure_tool_impl and _run_agentic_extraction to remove the duplicated INFILE / COPY_TO_FOLDER write block (fixes the 6.1% duplication on new code) - test_context_retrieval_metrics: use pytest.approx for float compare, drop unused executor local, drop always-true if is_single_pass Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
…ming Drop _inject_context_retrieval_metrics and its call site in _handle_single_pass_extraction. The helper was timing a second fs.read against a warm cache (the cloud plugin had already read the file to build its combined prompt) and reporting that under context_retrieval, which is a fabricated number, not a measurement. The cloud plugin is the source of the file read for single-pass and is responsible for populating context_retrieval in its returned metrics. Updated the docstring to spell out the contract. Also fix misleading "Completed prompt" streaming in the table and line-item extraction wrappers: the message was firing on both the success and failure branches, and on failure the user never saw the error (it only went to logger.error). Move the success-only message into the success branch and stream the error at LogLevel.ERROR on the failure branch. Fall back to "unknown error" when the plugin returns an empty result.error. Drop the now-orphan TestInjectContextRetrievalMetrics test class (six tests calling the deleted method) and update the module docstring. Surviving classes (TestSinglePassChunkSizeForcing, TestPipelineIndexUsageKwargsPropagation) cover unrelated invariants and are kept. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary by CodeRabbit
WalkthroughThe changes implement an extraction marker reuse optimization for IDE index operations. When a recent extraction has been completed, the helper pre-computes a config hash, checks the extraction status marker, and pre-populates extracted text into the index payload. The executor then conditionally skips re-extraction if the text is available, falling back to full extraction on marker misses or failures. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
|
| Filename | Overview |
|---|---|
| backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py | Moves x2text_config_hash computation earlier and adds marker-check + file-reuse logic before building the index payload; minor misleading log message when file read raises a non-FileNotFoundError. |
| workers/executor/executors/legacy_executor.py | Adds pre-extracted text short-circuit in _handle_ide_index: if index_params carries EXTRACTED_TEXT, the extract step is skipped entirely. Logic is clean and correct. |
| backend/prompt_studio/prompt_studio_core_v2/tests/test_build_index_payload.py | New regression test suite covering 4 marker-reuse paths via heavy sys.modules stubbing; tests are thorough for the documented scenarios. |
| workers/tests/test_sanity_phase5.py | Adds two new Celery eager-mode integration tests covering the marker-hit (extract skipped) and marker-miss (extract runs) paths in _handle_ide_index. |
| backend/prompt_studio/prompt_studio_core_v2/tests/init.py | New empty init.py to register the tests directory as a Python package. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[build_index_payload called] --> B[Compute x2text_config_hash]
B --> C{check_extraction_status\nmarker hit?}
C -- Exception --> D[Log warning, reused_extracted_text = None]
C -- False / miss --> E[reused_extracted_text = None]
C -- True / hit --> F[fs_instance.read extract file]
F -- FileNotFoundError --> G[Log warning, reused_extracted_text = None]
F -- Success --> H[reused_extracted_text = file content]
D --> I{reused_extracted_text truthy?}
E --> I
G --> I
H --> I
I -- No --> J[index_params has NO extracted_text]
I -- Yes --> K[index_params pre-populated with extracted_text]
J --> L[dispatch ide_index to executor]
K --> L
L --> M{executor _handle_ide_index:\nextracted_text in index_params?}
M -- Yes --> N[Skip _handle_extract\nUse pre-populated text]
M -- No --> O[Run _handle_extract normally]
N --> P[_handle_index with extracted text]
O --> P
P --> Q[Return ExecutionResult success]
Prompt To Fix All With AI
This is a comment left during a code review.
Path: backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py
Line: 534-553
Comment:
**Outer `except` catches file-read errors with a misleading message**
When `fs_instance.read()` raises anything other than `FileNotFoundError` (e.g. `PermissionError`, `OSError`, a storage-specific exception), the inner `except FileNotFoundError` doesn't catch it, so it bubbles up to the outer `except Exception`. That outer handler logs `"check_extraction_status raised"`, which is incorrect — `check_extraction_status` succeeded fine, it was the subsequent file read that failed. The fallback to full extraction is still correct, but the wrong attribution will make debugging significantly harder.
Consider either broadening the inner exception clause or fixing the outer log message:
```suggestion
if already_extracted:
try:
reused_extracted_text = fs_instance.read(
path=extract_file_path, mode="r"
)
logger.info(
"Manage Documents index: marker valid, reusing existing "
"extract file for document=%s",
document_id,
)
except (FileNotFoundError, OSError):
logger.warning(
"Marker says extracted but extract file missing/unreadable: %s. "
"Will re-extract.",
extract_file_path,
)
except Exception:
logger.warning(
"Extraction status check or file read raised; falling back to full extraction",
exc_info=True,
)
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..." | Re-trigger Greptile
Test ResultsSummary
Runner Tests - Full Report
SDK1 Tests - Full Report
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py`:
- Around line 596-599: The current guard drops empty-but-valid reads because it
only assigns reused_extracted_text into index_params when truthy; change the
conditional in the block that sets index_params[IKeys.EXTRACTED_TEXT] (where
reused_extracted_text is checked) to test for "is not None" instead of
truthiness so that empty string results from fs_instance.read() are preserved
and the executor's _handle_ide_index will correctly skip extraction.
In
`@backend/prompt_studio/prompt_studio_core_v2/tests/test_build_index_payload.py`:
- Around line 131-170: The test stubs replace
prompt_studio.prompt_studio_core_v2 with a fake package that has an empty
__path__, breaking resolution of the submodule prompt_studio_helper and causing
tests to be skipped; fix by either preserving the real parent package __path__
when calling _install_package/_install for "prompt_studio.prompt_studio_core_v2"
so that prompt_studio_helper can be imported, or explicitly inject a stub module
for "prompt_studio.prompt_studio_core_v2.prompt_studio_helper" into sys.modules
before the import attempt (refer to symbols _install_package, _install,
prompt_studio.prompt_studio_core_v2, prompt_studio_helper, and sys.modules to
locate where to make the change).
In `@workers/executor/executors/legacy_executor.py`:
- Around line 412-420: The code treats a present-but-empty extracted text as a
cache miss due to a truthiness check on pre_extracted_text; change the
conditional that uses index_params.get(IKeys.EXTRACTED_TEXT, "") so that you
check for presence with "is not None" (i.e., pre_extracted_text is not None)
instead of relying on truthiness, and if present assign extracted_text =
pre_extracted_text and log via logger.info as before so _handle_extract is not
re-run for legitimate empty-string extractions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e3216ddd-5038-4c53-a052-3957345961ea
📒 Files selected for processing (5)
backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.pybackend/prompt_studio/prompt_studio_core_v2/tests/__init__.pybackend/prompt_studio/prompt_studio_core_v2/tests/test_build_index_payload.pyworkers/executor/executors/legacy_executor.pyworkers/tests/test_sanity_phase5.py
backend/prompt_studio/prompt_studio_core_v2/tests/test_build_index_payload.py
Show resolved
Hide resolved
chandrasekharan-zipstack
left a comment
There was a problem hiding this comment.
LGTM, I hope the tests being written are useful and valid. I'd suggest wiring them with tox if that's the case. Otherwise, let's remove it and write it cleanly.
A test file named sanity_phase5 will not make sense after a while
Noted @chandrasekharan-zipstack. |



What
enable_highlightsetting have not changed.ide_indexdispatch added in the Phase 4 executor migration.Why
65b6b646) introducedPromptStudioHelper.build_index_payload, which ships a compoundoperation="ide_index"payload to the executor. The executor's_handle_ide_indexunconditionally calls_handle_extract, bypassing the marker check thatdynamic_extractor(still used by the Answer Prompt sibling flow) uses to gate extraction.10b24314wiredmark_extraction_status(extracted=True)into theide_index_completecallback, so the marker gets written after success, but nothing on the next click ever read it. Producer side was in place; consumer side was missing._handle_ide_indexis only dispatched frombuild_index_payload(sole production call site verified), so the fix is surgically scoped to a single entry path.How
Mirror exactly what the pre-async
index_documentdid: check the marker before dispatching, and if it's valid, read the existing extract file from disk and pre-populateindex_params[IKeys.EXTRACTED_TEXT]on the payload. The executor then sees the field is already set and skips its extract step. No new flags, operations, or state.backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py::build_index_payload:x2text_config_hashcomputation earlier (right afterfs_instanceis initialised) so both the marker check and the post-success callback consume the same value.dynamic_extractor: callPromptStudioIndexHelper.check_extraction_status(...); on a hit, callfs_instance.read(extract_file_path, mode="r")and stash the text in a localreused_extracted_text.FileNotFoundErrorand any other exception from the check fall through to full extraction with a warning log.index_paramsis built, pre-populateindex_params[IKeys.EXTRACTED_TEXT] = reused_extracted_textwhen the marker hit.workers/executor/executors/legacy_executor.py::_handle_ide_index:pre_extracted_text = index_params.get(IKeys.EXTRACTED_TEXT, "") or "".pre_extracted_texttruthy), logs and reuses the text directly;_handle_extractis not called.workers/ide_callback/tasks.py::ide_index_complete: unchanged. Already callsmark_extraction_status(extracted=True)on success — idempotent whether extraction ran or was skipped.Can this PR break any existing features. If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)
No regressions expected. Reasoning:
check_extraction_statusreturns False, the file is missing, or the check raises,reused_extracted_textstaysNone,index_params[IKeys.EXTRACTED_TEXT]is not pre-populated, and the executor runs the existing extract-then-index path. First-time index, X2Text config change, highlight toggle, deleted extract file, and unexpected internal errors all fall through to this unchanged path.dynamic_extractoralready uses for the Answer Prompt flow — samex2text_config_hash, sameIndexManager.extraction_statusrow, samefs_instance.read(extract_file_path). That flow is the one the user confirmed "is working fine". We are restoring the pre-async behaviour for the index path, not introducing a new mechanism._handle_ide_indexis only dispatched frombuild_index_payload(verified — no other production call sites). No other operation or flow is touched.ide_indexpayloads queued before deploy won't carryindex_params[IKeys.EXTRACTED_TEXT]; the executor'spre_extracted_textwill be empty and extract runs as before. No migration step required.ide_index_completecallsmark_extraction_status(extracted=True)regardless of whether extraction ran or was skipped, so on a marker hit it's a no-op refresh of an already-valid row.Database Migrations
IndexManager.extraction_statusJSON column and thePromptStudioIndexHelper.check_extraction_status/mark_extraction_statushelpers already present onmain.Env Config
Relevant Docs
Related Issues or PRs
65b6b646(Phase 4 async migration —build_index_payload).10b24314("Fixing re-indexing marker") which wired the producer side (mark_extraction_statusinide_index_complete). This PR completes the loop by adding the consumer side.PromptStudioHelper.dynamic_extractor(still present inprompt_studio_helper.py) — this PR inlines the same marker-check primitives intobuild_index_payload.Dependencies Versions
Notes on Testing
Automated
New tests added:
workers/tests/test_sanity_phase5.py::TestIdeIndexEagerChain:test_ide_index_reuses_pre_extracted_text— pre-populatesindex_params["extracted_text"], patches the X2Text adapter to raise if called, and asserts success +perform_indexingreceived the reused text.test_ide_index_without_pre_extracted_text_runs_extract— regression guard for the default (marker-miss) path.backend/prompt_studio/prompt_studio_core_v2/tests/test_build_index_payload.py(new file):test_marker_hit_prepopulates_extracted_text—check_extraction_status=True+fs.read="existing"→EXTRACTED_TEXT == "existing".test_marker_hit_missing_file_does_not_prepopulate—check_extraction_status=True+fs.readraisesFileNotFoundError→ field not set.test_marker_miss_does_not_prepopulate—check_extraction_status=False→ field not set,fs.readnot called.test_check_extraction_status_raises_is_swallowed—check_extraction_statusraises → warning logged, field not set, no exception propagates.The backend test uses
sys.modulesstubbing (mirroringusage_v2/tests/test_helper.py) so it runs under plainpytestwithoutpytest-django.Manual QA in Prompt Studio
_handle_extractran;IndexManager.extraction_statusgets an entry for the currentx2text_config_hash.Manage Documents index: marker valid, reusing existing extract file for document=<id>; worker logside_index: marker hit, skipping extract step; indexing completes;_handle_extractdid NOT run.enable_highlighton the tool → click Index → extraction runs (highlight mismatch →check_extraction_statusreturns False).Marker says extracted but extract file missing; executor runs extract.dynamic_extractordirectly viabuild_fetch_response_payload).Screenshots
N/A — no UI changes. Behaviour is verifiable via backend and worker logs during the manual QA steps above.
Checklist
I have read and understood the Contribution Guidelines.