UN-3266 [FIX] Preventing re-extraction in managing documents by harini-venkataraman · Pull Request #1909 · Zipstack/unstract

harini-venkataraman · 2026-04-09T09:51:33Z

What

Fix the Manage Documents → Index flow in Prompt Studio so it no longer re-extracts the document on every click when the X2Text config and enable_highlight setting have not changed.
Restore the marker-driven extract-reuse behaviour that existed in the pre-async sync path, but now inside the async ide_index dispatch added in the Phase 4 executor migration.

Why

QA bug: clicking Index in Prompt Studio → Manage Documents always ran full extraction, even on repeat clicks with an unchanged X2Text adapter and highlight toggle. Expected (and prior) behaviour: if the extraction marker is valid, skip the X2Text call and only run indexing.
Root cause: the Phase 4 async migration (commit 65b6b646) introduced PromptStudioHelper.build_index_payload, which ships a compound operation="ide_index" payload to the executor. The executor's _handle_ide_index unconditionally calls _handle_extract, bypassing the marker check that dynamic_extractor (still used by the Answer Prompt sibling flow) uses to gate extraction.
Commit 10b24314 wired mark_extraction_status(extracted=True) into the ide_index_complete callback, so the marker gets written after success, but nothing on the next click ever read it. Producer side was in place; consumer side was missing.
_handle_ide_index is only dispatched from build_index_payload (sole production call site verified), so the fix is surgically scoped to a single entry path.

How

Mirror exactly what the pre-async index_document did: check the marker before dispatching, and if it's valid, read the existing extract file from disk and pre-populate index_params[IKeys.EXTRACTED_TEXT] on the payload. The executor then sees the field is already set and skips its extract step. No new flags, operations, or state.

backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py::build_index_payload:

Moved the x2text_config_hash computation earlier (right after fs_instance is initialised) so both the marker check and the post-success callback consume the same value.
Added a marker-check block that mirrors dynamic_extractor: call PromptStudioIndexHelper.check_extraction_status(...); on a hit, call fs_instance.read(extract_file_path, mode="r") and stash the text in a local reused_extracted_text. FileNotFoundError and any other exception from the check fall through to full extraction with a warning log.
After index_params is built, pre-populate index_params[IKeys.EXTRACTED_TEXT] = reused_extracted_text when the marker hit.
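
The backend-side steps above can be sketched as follows. This is a minimal illustration, not the code in `build_index_payload`: `resolve_reusable_text`, `build_index_params`, and the `EXTRACTED_TEXT` constant are hypothetical stand-ins, and the marker check and file read are passed in as plain callables so the fallback behaviour can be seen in isolation.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)

# Hypothetical stand-in for IKeys.EXTRACTED_TEXT.
EXTRACTED_TEXT = "extracted_text"


def resolve_reusable_text(
    check_marker: Callable[[], bool],
    read_extract: Callable[[], str],
) -> Optional[str]:
    """Return the existing extract if the marker is valid and the file reads."""
    try:
        if not check_marker():
            return None  # marker miss: the executor will run extraction
        try:
            return read_extract()
        except FileNotFoundError:
            logger.warning("Marker says extracted but extract file missing")
            return None
    except Exception:
        # Any other failure falls through to full extraction.
        logger.warning("Falling back to full extraction", exc_info=True)
        return None


def build_index_params(base_params: dict, reused_text: Optional[str]) -> dict:
    """Copy the params and pre-populate the extracted text on a marker hit."""
    params = dict(base_params)
    if reused_text:
        params[EXTRACTED_TEXT] = reused_text
    return params
```

Every failure mode returns `None`, so the payload simply lacks the field and the executor takes its usual extract-then-index route.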

workers/executor/executors/legacy_executor.py::_handle_ide_index:

Reads pre_extracted_text = index_params.get(IKeys.EXTRACTED_TEXT, "") or "".
On a marker hit (pre_extracted_text truthy), logs and reuses the text directly; _handle_extract is not called.
On a miss, the existing extract → index flow runs unchanged.
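
A rough sketch of the executor-side branch, assuming the extract and index steps can be modelled as callables. `handle_ide_index` and its parameters are illustrative names, not the signature of `_handle_ide_index`.

```python
from typing import Callable

# Hypothetical stand-in for IKeys.EXTRACTED_TEXT.
EXTRACTED_TEXT = "extracted_text"


def handle_ide_index(
    index_params: dict,
    extract: Callable[[], str],
    index: Callable[[str], str],
) -> str:
    """Reuse pre-populated text when present, otherwise extract first."""
    pre_extracted_text = index_params.get(EXTRACTED_TEXT, "") or ""
    if pre_extracted_text:
        # Marker hit: the backend already supplied the text.
        extracted_text = pre_extracted_text
    else:
        # Marker miss, or a payload queued before the deploy.
        extracted_text = extract()
    return index(extracted_text)
```

A payload that lacks the key behaves exactly like a marker miss, which is why payloads queued before the deploy need no special handling.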

workers/ide_callback/tasks.py::ide_index_complete: unchanged. Already calls mark_extraction_status(extracted=True) on success — idempotent whether extraction ran or was skipped.

Can this PR break any existing features. If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)

No regressions expected. Reasoning:

Marker-miss path is byte-identical to current behaviour. When check_extraction_status returns False, the file is missing, or the check raises, reused_extracted_text stays None, index_params[IKeys.EXTRACTED_TEXT] is not pre-populated, and the executor runs the existing extract-then-index path. First-time index, X2Text config change, highlight toggle, deleted extract file, and unexpected internal errors all fall through to this unchanged path.
Marker-hit path uses the exact mechanism dynamic_extractor already uses for the Answer Prompt flow — same x2text_config_hash, same IndexManager.extraction_status row, same fs_instance.read(extract_file_path). That flow is the one the user confirmed "is working fine". We are restoring the pre-async behaviour for the index path, not introducing a new mechanism.
Single, narrow entry point. _handle_ide_index is only dispatched from build_index_payload (verified — no other production call sites). No other operation or flow is touched.
Rolling-deploy safe. Old in-flight ide_index payloads queued before deploy won't carry index_params[IKeys.EXTRACTED_TEXT]; the executor's pre_extracted_text will be empty and extract runs as before. No migration step required.
Callback remains idempotent. ide_index_complete calls mark_extraction_status(extracted=True) regardless of whether extraction ran or was skipped, so on a marker hit it's a no-op refresh of an already-valid row.
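
The marker semantics these points rely on can be illustrated with a toy model. The dictionary layout and the helper names `mark_extracted` and `is_extraction_valid` are assumptions made for this sketch; the real `IndexManager.extraction_status` column and the `PromptStudioIndexHelper` helpers may store and compare things differently.

```python
def mark_extracted(status: dict, config_hash: str, enable_highlight: bool) -> None:
    """Record a successful extraction for this config hash (assumed layout)."""
    status[config_hash] = {"extracted": True, "enable_highlight": enable_highlight}


def is_extraction_valid(status: dict, config_hash: str, enable_highlight: bool) -> bool:
    """A hit needs the same hash and the same highlight setting."""
    entry = status.get(config_hash)
    if entry is None:
        return False
    return entry["extracted"] and entry["enable_highlight"] == enable_highlight


status: dict = {}
mark_extracted(status, "hash-a", enable_highlight=False)
snapshot = dict(status)
# Marking again after a skipped extraction leaves the row unchanged.
mark_extracted(status, "hash-a", enable_highlight=False)
```

Under this model a changed X2Text config produces a new hash with no entry, and a toggled highlight setting fails the comparison, so both cases fall back to full extraction.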

Database Migrations

None. Reuses the existing IndexManager.extraction_status JSON column and the PromptStudioIndexHelper.check_extraction_status / mark_extraction_status helpers already present on main.

Env Config

None. No new environment variables, feature flags, or settings.

Relevant Docs

N/A. No user-facing behaviour change — this restores the long-standing "Index doesn't re-extract when nothing changed" behaviour that the Phase 4 async migration regressed.

Related Issues or PRs

Regression introduced by commit 65b6b646 (Phase 4 async migration — build_index_payload).
Partial follow-up by commit 10b24314 ("Fixing re-indexing marker") which wired the producer side (mark_extraction_status in ide_index_complete). This PR completes the loop by adding the consumer side.
Prior sync implementation: PromptStudioHelper.dynamic_extractor (still present in prompt_studio_helper.py) — this PR inlines the same marker-check primitives into build_index_payload.

Dependencies Versions

No dependency changes.

Notes on Testing

Automated

UV=/home/harini/Documents/Workspace/unstract-poc/clean/unstract/backend/venv/bin/uv

# Workers: phase4 + phase5 sanity (41 passed, up from 39 — includes 2 new marker tests)
cd workers && $UV run pytest -v tests/test_sanity_phase4.py tests/test_sanity_phase5.py

# Workers: full suite (542 passed; 5 pre-existing failures in
# test_answer_prompt.py and test_sanity_phase3.py unrelated to this change,
# confirmed by stash-and-retest)
cd workers && $UV run pytest

# Backend: prompt_studio + usage_v2 tests (7 passed — includes 4 new
# tests for build_index_payload marker branches)
cd backend && ./venv/bin/uv run --active pytest -v \
    prompt_studio/prompt_studio_core_v2/tests/ usage_v2/tests/

New tests added:

workers/tests/test_sanity_phase5.py::TestIdeIndexEagerChain:
- test_ide_index_reuses_pre_extracted_text — pre-populates index_params["extracted_text"], patches the X2Text adapter to raise if called, and asserts success + perform_indexing received the reused text.
- test_ide_index_without_pre_extracted_text_runs_extract — regression guard for the default (marker-miss) path.
backend/prompt_studio/prompt_studio_core_v2/tests/test_build_index_payload.py (new file):
- test_marker_hit_prepopulates_extracted_text — check_extraction_status=True + fs.read="existing" → EXTRACTED_TEXT == "existing".
- test_marker_hit_missing_file_does_not_prepopulate — check_extraction_status=True + fs.read raises FileNotFoundError → field not set.
- test_marker_miss_does_not_prepopulate — check_extraction_status=False → field not set, fs.read not called.
- test_check_extraction_status_raises_is_swallowed — check_extraction_status raises → warning logged, field not set, no exception propagates.

The backend test uses sys.modules stubbing (mirroring usage_v2/tests/test_helper.py) so it runs under plain pytest without pytest-django.
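
As a rough illustration of the `sys.modules` stubbing technique mentioned above (not the actual test code), the sketch below registers a fake package and submodule so that an import resolves without the real dependency. The module names and the `check_status` attribute are hypothetical.

```python
import importlib
import sys
import types
from unittest.mock import MagicMock


def install_stub(name: str, **attrs) -> types.ModuleType:
    """Register a stub module under `name` so later imports resolve to it."""
    module = types.ModuleType(name)
    for key, value in attrs.items():
        setattr(module, key, value)
    sys.modules[name] = module
    return module


# Hypothetical names, chosen so they cannot collide with real packages.
package = install_stub("fake_storage_pkg")
package.__path__ = []  # a module needs __path__ to be treated as a package
install_stub("fake_storage_pkg.helper", check_status=MagicMock(return_value=True))

# The import system consults sys.modules first, so this returns the stub.
helper = importlib.import_module("fake_storage_pkg.helper")
```

Because the stub is found in `sys.modules`, nothing is loaded from disk, which is what lets such a test run without the framework being configured.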

Manual QA in Prompt Studio

1. Upload a PDF under Manage Documents and click Index. Worker logs show _handle_extract ran; IndexManager.extraction_status gets an entry for the current x2text_config_hash.
2. Click Index again on the same document. Backend logs "Manage Documents index: marker valid, reusing existing extract file for document=<id>"; worker logs "ide_index: marker hit, skipping extract step"; indexing completes; _handle_extract did NOT run.
3. Change the X2Text adapter config (any metadata change) → click Index → extraction runs (new hash → marker miss).
4. Toggle enable_highlight on the tool → click Index → extraction runs (highlight mismatch → check_extraction_status returns False).
5. Manually delete the extract file → click Index → backend logs "Marker says extracted but extract file missing"; executor runs extract.
6. Answer Prompt flow continues to work unchanged (it still calls dynamic_extractor directly via build_fetch_response_payload).

Screenshots

N/A — no UI changes. Behaviour is verifiable via backend and worker logs during the manual QA steps above.

Checklist

I have read and understood the Contribution Guidelines.

Conflicts resolved:
- docker-compose.yaml: Use main's dedicated dashboard_metric_events queue for worker-metrics
- PromptCard.jsx: Keep tool_id matching condition from our async socket feature
- PromptRun.jsx: Merge useEffect import from main with our branch
- ToolIde.jsx: Keep fire-and-forget socket approach (spinner waits for socket event)
- SocketMessages.js: Keep both session-store and socket-custom-tool imports + updateCusToolMessages dep
- SocketContext.js: Keep simpler path-based socket connection approach
- usePromptRun.js: Keep Celery fire-and-forget with socket delivery over polling
- setupProxy.js: Accept main's deletion (migrated to Vite)

Collapse multi-line `<Typography.Text>null</Typography.Text>` JSX to a single line so biome's formatter passes in CI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add a defensive guard in `UsageHelper.get_usage_by_model()` that drops `Usage` rows where `usage_type == "llm"` and `llm_usage_reason` is empty. Per the Usage model contract, an empty reason is only valid when `usage_type == "embedding"`; an empty reason combined with `usage_type == "llm"` is a producer-side bug (an LLM call site forgot to pass `llm_usage_reason` in `usage_kwargs`). Without this guard the row surfaces in API deployment responses as a malformed bare `"llm"` bucket with no token breakdown alongside the legitimate `"extraction_llm"` bucket. The guard logs a warning on every dropped row so future producer regressions are detectable.

Adds three regression tests in `backend/usage_v2/tests/test_helper.py` that stub `account_usage.models` and `usage_v2.models` in `sys.modules` so the helper can be imported without Django being set up:
- `test_unlabeled_llm_row_is_dropped`: bare "llm" bucket disappears
- `test_embedding_row_is_preserved`: guard is scoped to LLM rows
- `test_all_three_llm_reasons_coexist`: extraction/challenge/summarize

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- legacy_executor: extract _run_pipeline_answer_step helper to drop _handle_structure_pipeline cognitive complexity from 18 to under 15
- legacy_executor: bundle 9 prompt-run scalars into a prompt_run_args dict so _run_line_item_extraction has 8 params (was 15, limit 13)
- legacy_executor: merge implicitly concatenated log string
- structure_tool_task: extract _write_pipeline_outputs helper used by both _execute_structure_tool_impl and _run_agentic_extraction to remove the duplicated INFILE / COPY_TO_FOLDER write block (fixes the 6.1% duplication on new code)
- test_context_retrieval_metrics: use pytest.approx for float compare, drop unused executor local, drop always-true if is_single_pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ming

Drop _inject_context_retrieval_metrics and its call site in _handle_single_pass_extraction. The helper was timing a second fs.read against a warm cache (the cloud plugin had already read the file to build its combined prompt) and reporting that under context_retrieval, which is a fabricated number, not a measurement. The cloud plugin is the source of the file read for single-pass and is responsible for populating context_retrieval in its returned metrics. Updated the docstring to spell out the contract.

Also fix misleading "Completed prompt" streaming in the table and line-item extraction wrappers: the message was firing on both the success and failure branches, and on failure the user never saw the error (it only went to logger.error). Move the success-only message into the success branch and stream the error at LogLevel.ERROR on the failure branch. Fall back to "unknown error" when the plugin returns an empty result.error.

Drop the now-orphan TestInjectContextRetrievalMetrics test class (six tests calling the deleted method) and update the module docstring. Surviving classes (TestSinglePassChunkSizeForcing, TestPipelineIndexUsageKwargsPropagation) cover unrelated invariants and are kept.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai · 2026-04-09T09:51:41Z

Summary by CodeRabbit

New Features
- Document indexing now optimizes by reusing pre-extracted text when available, reducing redundant processing.
- Graceful fallback to full extraction if pre-extracted content is unavailable or inaccessible.
Tests
- Added comprehensive test coverage for document indexing extraction optimization behavior.

Walkthrough

The changes implement an extraction marker reuse optimization for IDE index operations. When a recent extraction has been completed, the helper pre-computes a config hash, checks the extraction status marker, and pre-populates extracted text into the index payload. The executor then conditionally skips re-extraction if the text is available, falling back to full extraction on marker misses or failures.

Changes

Cohort / File(s)	Summary
Core Helper Logic `backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py`	Modified `build_index_payload()` to compute `x2text_config_hash` once, call `PromptStudioIndexHelper.check_extraction_status()` for marker checks, and pre-populate `index_params[IKeys.EXTRACTED_TEXT]` with cached extracted content when marker hits and file is readable; falls back to full extraction on any failures.
Core Executor Logic `workers/executor/executors/legacy_executor.py`	Updated `_handle_ide_index()` to conditionally skip extraction when `IKeys.EXTRACTED_TEXT` is pre-populated in `index_params`; otherwise performs standard extraction and summarization workflow.
Test Infrastructure `backend/prompt_studio/prompt_studio_core_v2/tests/__init__.py`	Added module initialization file with comment.
Build Index Payload Tests `backend/prompt_studio/prompt_studio_core_v2/tests/test_build_index_payload.py`	New regression test module validating extraction marker reuse behavior via `MagicMock`-stubbed dependencies; covers marker-hit with readable file, marker-hit with `FileNotFoundError`, marker-miss, and exception-swallowing fallback scenarios.
Executor Integration Tests `workers/tests/test_sanity_phase5.py`	Added two integration tests verifying IDE index eager-chain behavior: one confirming extraction is skipped when `extracted_text` is pre-populated, another confirming extraction runs normally when `extracted_text` is absent.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly and accurately summarizes the main change: fixing re-extraction in the Manage Documents Index flow, which is the core purpose of this PR.
Description check	✅ Passed	The description is comprehensive and complete, covering all template sections including What, Why, How, breaking changes analysis, database migrations, env config, testing notes, and related issues with detailed explanations.
Docstring Coverage	✅ Passed	Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.

sonarqubecloud · 2026-04-09T09:54:24Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

greptile-apps · 2026-04-09T09:56:09Z

Greptile Summary

This PR fixes the "extract runs every time" bug in the Manage Documents → Index flow by adding an extraction-marker check to build_index_payload: if the document was already extracted with the same x2text_config_hash + enable_highlight combination, the existing extract file is read and pre-populated into index_params[EXTRACTED_TEXT], letting the legacy executor's _handle_ide_index skip the extract step entirely. The executor-side skip logic and two new test suites (backend unit tests and worker integration tests) are included.

Confidence Score: 5/5

Safe to merge; the only finding is a misleading log message on an uncommon file-read error path, with no impact on correctness or user-facing behavior.

All remaining findings are P2. The core logic — marker check, file reuse, executor skip path — is correct and well-tested by the new unit and integration test suites. The fallback to full extraction on any failure is properly guarded.

backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py — outer except message could mislead when the file read (not check_extraction_status) raises.

Vulnerabilities

No security concerns identified. The new code reads local extract files and passes their contents as in-process strings; no user-controlled input is introduced into the extraction path, and all exceptions are swallowed rather than surfaced to callers.

Important Files Changed

Filename	Overview
backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py	Moves x2text_config_hash computation earlier and adds marker-check + file-reuse logic before building the index payload; minor misleading log message when file read raises a non-FileNotFoundError.
workers/executor/executors/legacy_executor.py	Adds pre-extracted text short-circuit in _handle_ide_index: if index_params carries EXTRACTED_TEXT, the extract step is skipped entirely. Logic is clean and correct.
backend/prompt_studio/prompt_studio_core_v2/tests/test_build_index_payload.py	New regression test suite covering 4 marker-reuse paths via heavy sys.modules stubbing; tests are thorough for the documented scenarios.
workers/tests/test_sanity_phase5.py	Adds two new Celery eager-mode integration tests covering the marker-hit (extract skipped) and marker-miss (extract runs) paths in _handle_ide_index.
backend/prompt_studio/prompt_studio_core_v2/tests/__init__.py	New empty __init__.py to register the tests directory as a Python package.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[build_index_payload called] --> B[Compute x2text_config_hash]
    B --> C{check_extraction_status\nmarker hit?}
    C -- Exception --> D[Log warning, reused_extracted_text = None]
    C -- False / miss --> E[reused_extracted_text = None]
    C -- True / hit --> F[fs_instance.read extract file]
    F -- FileNotFoundError --> G[Log warning, reused_extracted_text = None]
    F -- Success --> H[reused_extracted_text = file content]
    D --> I{reused_extracted_text truthy?}
    E --> I
    G --> I
    H --> I
    I -- No --> J[index_params has NO extracted_text]
    I -- Yes --> K[index_params pre-populated with extracted_text]
    J --> L[dispatch ide_index to executor]
    K --> L
    L --> M{executor _handle_ide_index:\nextracted_text in index_params?}
    M -- Yes --> N[Skip _handle_extract\nUse pre-populated text]
    M -- No --> O[Run _handle_extract normally]
    N --> P[_handle_index with extracted text]
    O --> P
    P --> Q[Return ExecutionResult success]

Prompt To Fix All With AI

This is a comment left during a code review.
Path: backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py
Line: 534-553

Comment:
**Outer `except` catches file-read errors with a misleading message**

When `fs_instance.read()` raises anything other than `FileNotFoundError` (e.g. `PermissionError`, `OSError`, a storage-specific exception), the inner `except FileNotFoundError` doesn't catch it, so it bubbles up to the outer `except Exception`. That outer handler logs `"check_extraction_status raised"`, which is incorrect — `check_extraction_status` succeeded fine, it was the subsequent file read that failed. The fallback to full extraction is still correct, but the wrong attribution will make debugging significantly harder.

Consider either broadening the inner exception clause or fixing the outer log message:

```suggestion
            if already_extracted:
                try:
                    reused_extracted_text = fs_instance.read(
                        path=extract_file_path, mode="r"
                    )
                    logger.info(
                        "Manage Documents index: marker valid, reusing existing "
                        "extract file for document=%s",
                        document_id,
                    )
                except (FileNotFoundError, OSError):
                    logger.warning(
                        "Marker says extracted but extract file missing/unreadable: %s. "
                        "Will re-extract.",
                        extract_file_path,
                    )
        except Exception:
            logger.warning(
                "Extraction status check or file read raised; falling back to full extraction",
                exc_info=True,
            )
```
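
The attribution problem described in this comment can be reproduced with a small stand-alone sketch. `attribute_failure` is a hypothetical helper that only reports which handler caught the error; it mirrors the nested `try` shape, not the actual helper code.

```python
def attribute_failure(read, inner_exceptions=(FileNotFoundError,)):
    """Report which handler catches an exception raised by `read`."""
    try:
        try:
            read()
        except inner_exceptions:
            return "inner"  # handled next to the file read
    except Exception:
        return "outer"  # handled by the generic fallback
    return "ok"


def read_denied():
    raise PermissionError("denied")


def read_missing():
    raise FileNotFoundError("gone")


print(attribute_failure(read_missing))            # caught by the inner handler
print(attribute_failure(read_denied))             # escapes to the outer handler
print(attribute_failure(read_denied, (OSError,)))  # inner again once broadened
```

Since `FileNotFoundError` and `PermissionError` are both subclasses of `OSError`, catching `OSError` alone in the inner clause would already cover the `FileNotFoundError` case.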

_{Reviews (1): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..." | Re-trigger Greptile}

github-actions · 2026-04-09T10:21:44Z

Test Results

Summary

✅ Runner Tests: 11 passed, 0 failed (11 total)
✅ SDK1 Tests: 178 passed, 0 failed (178 total)

Runner Tests - Full Report

filepath	function	passed	SUBTOTAL
runner/src/unstract/runner/clients/test_docker.py	test_logs	1	1
runner/src/unstract/runner/clients/test_docker.py	test_cleanup	1	1
runner/src/unstract/runner/clients/test_docker.py	test_cleanup_skip	1	1
runner/src/unstract/runner/clients/test_docker.py	test_client_init	1	1
runner/src/unstract/runner/clients/test_docker.py	test_get_image_exists	1	1
runner/src/unstract/runner/clients/test_docker.py	test_get_image	1	1
runner/src/unstract/runner/clients/test_docker.py	test_get_container_run_config	1	1
runner/src/unstract/runner/clients/test_docker.py	test_get_container_run_config_without_mount	1	1
runner/src/unstract/runner/clients/test_docker.py	test_run_container	1	1
runner/src/unstract/runner/clients/test_docker.py	test_get_image_for_sidecar	1	1
runner/src/unstract/runner/clients/test_docker.py	test_sidecar_container	1	1
TOTAL		11	11

SDK1 Tests - Full Report

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py`:
- Around line 596-599: The current guard drops empty-but-valid reads because it
only assigns reused_extracted_text into index_params when truthy; change the
conditional in the block that sets index_params[IKeys.EXTRACTED_TEXT] (where
reused_extracted_text is checked) to test for "is not None" instead of
truthiness so that empty string results from fs_instance.read() are preserved
and the executor's _handle_ide_index will correctly skip extraction.

In
`@backend/prompt_studio/prompt_studio_core_v2/tests/test_build_index_payload.py`:
- Around line 131-170: The test stubs replace
prompt_studio.prompt_studio_core_v2 with a fake package that has an empty
__path__, breaking resolution of the submodule prompt_studio_helper and causing
tests to be skipped; fix by either preserving the real parent package __path__
when calling _install_package/_install for "prompt_studio.prompt_studio_core_v2"
so that prompt_studio_helper can be imported, or explicitly inject a stub module
for "prompt_studio.prompt_studio_core_v2.prompt_studio_helper" into sys.modules
before the import attempt (refer to symbols _install_package, _install,
prompt_studio.prompt_studio_core_v2, prompt_studio_helper, and sys.modules to
locate where to make the change).

In `@workers/executor/executors/legacy_executor.py`:
- Around line 412-420: The code treats a present-but-empty extracted text as a
cache miss due to a truthiness check on pre_extracted_text; change the
conditional that uses index_params.get(IKeys.EXTRACTED_TEXT, "") so that you
check for presence with "is not None" (i.e., pre_extracted_text is not None)
instead of relying on truthiness, and if present assign extracted_text =
pre_extracted_text and log via logger.info as before so _handle_extract is not
re-run for legitimate empty-string extractions.
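
The difference between the two checks discussed in these comments can be seen in a short sketch. The function names are hypothetical, and whether an empty extract should count as a hit is a design decision this example does not settle.

```python
# Hypothetical stand-in for IKeys.EXTRACTED_TEXT.
EXTRACTED_TEXT = "extracted_text"


def skip_extract_truthy(index_params: dict) -> bool:
    """Truthiness check: an empty string is treated as a miss."""
    return bool(index_params.get(EXTRACTED_TEXT, "") or "")


def skip_extract_if_present(index_params: dict) -> bool:
    """Presence check: any non-None value, including "", is a hit."""
    return index_params.get(EXTRACTED_TEXT) is not None


for params in ({}, {EXTRACTED_TEXT: ""}, {EXTRACTED_TEXT: "page one"}):
    print(params, skip_extract_truthy(params), skip_extract_if_present(params))
```

The two checks only disagree when the field is present but empty, for example when the earlier extraction produced no text.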


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e3216ddd-5038-4c53-a052-3957345961ea

📥 Commits

Reviewing files that changed from the base of the PR and between 8dec64e and 7012ebf.

📒 Files selected for processing (5)

backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py
backend/prompt_studio/prompt_studio_core_v2/tests/__init__.py
backend/prompt_studio/prompt_studio_core_v2/tests/test_build_index_payload.py
workers/executor/executors/legacy_executor.py
workers/tests/test_sanity_phase5.py

chandrasekharan-zipstack

LGTM, I hope the tests being written are useful and valid. I'd suggest wiring them with tox if that's the case. Otherwise, let's remove it and write it cleanly.
A test file named sanity_phase5 will not make sense after a while

harini-venkataraman · 2026-04-09T11:02:24Z

> LGTM, I hope the tests being written are useful and valid. I'd suggest wiring them with tox if that's the case. Otherwise, let's remove it and write it cleanly. A test file named sanity_phase5 will not make sense after a while

Noted @chandrasekharan-zipstack.

harini-venkataraman and others added 30 commits February 19, 2026 20:39

Execution backend - revamp

2da4907

async flow

41eeef8

Streaming progress to FE

f66dfb2

Removing multi hop in Prompt studio ide and structure tool

95c6592

Merge remote-tracking branch 'origin/main' into feat/execution-backend

44a2b3f

UN-3234 [FIX] Add beta tag to agentic prompt studio navigation item

2f4f2dc

Added executors for agentic prompt studio

d041201

Merge branch 'main' of github.com:Zipstack/unstract into feat/executi…

0a0cfb1

…on-backend

Merge branch 'main' of github.com:Zipstack/unstract into feat/executi…

a4e1fd7

…on-backend

Added executors for agentic prompt studio

ae77d6a

Added executors for agentic prompt studio

5c22956

Removed redundant envs

3cc3213

Removed redundant envs

d0532f8

Removed redundant envs

6173df5

[pre-commit.ci] auto fixes from pre-commit.com hooks

bbe6f58

for more information, see https://pre-commit.ci

Removed redundant envs

a3dc912

Merge branch 'main' of github.com:Zipstack/unstract into feat/executi…

98c8071

…on-backend

Merge branch 'feat/execution-backend' of github.com:Zipstack/unstract…

21157ac

… into feat/execution-backend

Removed redundant envs

0216b59

Removed redundant envs

db81b9d

Removed redundant envs

e1da202

Removed redundant envs

d119797

Removed redundant envs

fbadbf8

Removed redundant envs

882296e

Removed redundant envs

6d3bbbf

[pre-commit.ci] auto fixes from pre-commit.com hooks

292460b

for more information, see https://pre-commit.ci

Removed redundant envs

f35c0e6

Merge branch 'feat/execution-backend' of github.com:Zipstack/unstract…

9bcb458

… into feat/execution-backend

adding worker for callbacks

0cbd10a

pre-commit-ci bot and others added 12 commits April 6, 2026 13:58

[pre-commit.ci] auto fixes from pre-commit.com hooks

6f2ce13

for more information, see https://pre-commit.ci

Fix biome formatting in DisplayPromptResult

0533ced

Collapse multi-line `<Typography.Text>null</Typography.Text>` JSX to a single line so biome's formatter passes in CI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

095c7d1

for more information, see https://pre-commit.ci

[pre-commit.ci] auto fixes from pre-commit.com hooks

7421f3b

for more information, see https://pre-commit.ci

Addressing greptile comments

1a79030

Addressing greptile comments

5c3b67c

[pre-commit.ci] auto fixes from pre-commit.com hooks

adda29e

for more information, see https://pre-commit.ci

Fixing re-indexing marker

10b2431

Fixing reindexing issue in Manage documents

2a381d0

harini-venkataraman and others added 2 commits April 9, 2026 15:21

Merge branch 'main' into fix/agentic-executor-queue

80b886e

[pre-commit.ci] auto fixes from pre-commit.com hooks

7012ebf

for more information, see https://pre-commit.ci

greptile-apps bot reviewed Apr 9, 2026

backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py (resolved)

harini-venkataraman changed the title ~~Fix/agentic executor queue~~ UN-3266 [FIX] Preventing re-extraction in managing documents Apr 9, 2026

harini-venkataraman marked this pull request as ready for review April 9, 2026 10:20

harini-venkataraman requested review from Deepak-Kesavan, chandrasekharan-zipstack and pk-zipstack April 9, 2026 10:20

Deepak-Kesavan approved these changes Apr 9, 2026

coderabbitai bot reviewed Apr 9, 2026

backend/prompt_studio/prompt_studio_core_v2/prompt_studio_helper.py (resolved)

backend/prompt_studio/prompt_studio_core_v2/tests/test_build_index_payload.py (resolved)

workers/executor/executors/legacy_executor.py (resolved)

chandrasekharan-zipstack approved these changes Apr 9, 2026

harini-venkataraman merged commit 345e920 into main Apr 9, 2026
10 checks passed

harini-venkataraman deleted the fix/agentic-executor-queue branch April 9, 2026 11:02

harini-venkataraman merged 155 commits into main from fix/agentic-executor-queue

5 participants
