Adding Agentic Retrieval as a new retrieval mode #2018

Open: mahikaw wants to merge 6 commits into main from dev/mahikaw/agentic_retrieval

mahikaw (Collaborator) commented May 12, 2026

Description

Agentic retrieval mode + BEIR / query-CSV evaluation

Summary

Adds an LLM-driven agentic retrieval strategy as an alternative to the single dense-retrieval pass, plus first-class evaluation for it (BEIR-style datasets and ad-hoc query CSVs). Additive — the standard retrieval path and outputs are unchanged; agentic mode reuses the existing Retriever/vector DB and is opt-in via --retrieval-mode agentic.

What's new

  • Agentic retrieval: ReActAgentOperator runs a per-query ReAct loop (issues retrieval sub-queries, accumulates candidates across steps, decides when to stop) → RRFAggregatorOperator fuses across steps (RRF, k=60) → SelectionAgentOperator does a final LLM selection, with a source-priority fallback chain (final_results → RRF → selection → candidate_ranking).
  • --evaluation-mode beir — score against a registered benchmark: vidore_hf (needs datasets) plus CSV/JSON loaders; recall@k / ndcg@k.
  • --evaluation-mode recall — score agentic retrieval against a query CSV (query + golden_answer), no dataset loader required (agentic-only; pdf_page/pdf_only).
  • CLI flags: --retrieval-mode, --agentic-llm-model, --agentic-invoke-url, --agentic-react-max-steps (50), --agentic-backend-top-k (20), --agentic-text-truncation (0 = none), --agentic-reasoning-effort (high), --agentic-num-concurrent (1), and --beir-loader/-dataset-name/-doc-id-field/-split/-query-language.
  • Docs & tests — README "Agentic retrieval evaluation" section; agentic/README.md; test_agentic_eval.py + test_agentic_operators.py.
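The RRF fusion step named in the list above can be sketched as follows. This is a minimal illustration of standard reciprocal rank fusion with k=60, not the `RRFAggregatorOperator` code from this PR; the function name `rrf_fuse` and the list-of-rankings input shape are assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked doc-id lists with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Hypothetical helper, for illustration only.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Three ReAct steps, each returning its own ranked candidates.
steps = [["a", "b", "c"], ["b", "a", "d"], ["b", "c"]]
fused = rrf_fuse(steps)
```

A document that appears near the top of several step rankings (here `b`) outranks one that appears once, which is why fusing across ReAct steps rewards candidates the agent kept retrieving.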

Results — ViDoRe v3

Benchmarked against the reference agentic pipeline (retrieval-bench) under an
identical, controlled setup so the comparison isolates the retrieval
framework: same page-level image+text index (llama-nemotron-embed-vl-1b-v2
embedder), same agent LLM (llama-3.3-nemotron-super-49b-v1.5), same agent
settings (reasoning_effort=high, retriever pool depth 20, target top-k 10,
max 50 ReAct steps), full query sets. The retrieval substrate is shared, so the
numbers reflect the agent framework only.

| Domain | recall@10 (ref) | recall@10 (this PR) | nDCG@10 (ref) | nDCG@10 (this PR) |
| --- | --- | --- | --- | --- |
| computer_science | 0.7431 | 0.7234 | 0.7396 | 0.7182 |
| energy | 0.6975 | 0.6612 | 0.6369 | 0.6274 |
| finance_en | 0.6406 | 0.6134 | 0.6109 | 0.5951 |
| finance_fr | 0.4750 | 0.4491 | 0.4182 | 0.4008 |
| hr | 0.5775 | 0.5583 | 0.5631 | 0.5523 |
| industrial | 0.4695 | 0.4636 | 0.4543 | 0.4615 |
| pharmaceuticals | 0.6724 | 0.6711 | 0.6449 | 0.6439 |
| physics | 0.4560 | 0.4353 | 0.4373 | 0.4133 |
| Macro avg | 0.5914 | 0.5719 | 0.5632 | 0.5516 |

> Both runs share the same index, embedder, agent LLM, reasoning_effort, and top-k; the only setting not pinned is agent-LLM sampling temperature (this PR uses greedy 0.0; the reference uses its endpoint default).

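The graph-operator implementation tracks the reference pipeline across all eight
domains on a shared substrate: macro-average recall@10 is 0.0195 lower and
macro-average nDCG@10 is 0.0116 lower. The largest per-domain recall@10 gap is
0.0363 (energy), and industrial nDCG@10 is slightly higher than the reference.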
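For readers unfamiliar with the metrics in the table, here is a minimal sketch of recall@k, binary-relevance nDCG@k, and the unweighted macro average. It is an assumption that the PR's metric code behaves this way (BEIR-style evaluators may handle graded relevance and ties differently); the function names are hypothetical. The `pr_recall` values are copied from the "this PR" recall@10 column.

```python
import math


def recall_at_k(ranked, relevant, k=10):
    """Fraction of relevant doc ids that appear in the top-k of `ranked`."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance nDCG@k: DCG of the ranking over the ideal DCG."""
    rel = set(relevant)
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in rel)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(rel), k)))
    return dcg / ideal if ideal else 0.0


def macro_average(per_domain):
    """Unweighted mean over domains, as in the 'Macro avg' row."""
    return sum(per_domain.values()) / len(per_domain)


# recall@10 for this PR, per domain, taken from the results table.
pr_recall = {
    "computer_science": 0.7234, "energy": 0.6612, "finance_en": 0.6134,
    "finance_fr": 0.4491, "hr": 0.5583, "industrial": 0.4636,
    "pharmaceuticals": 0.6711, "physics": 0.4353,
}
pr_macro = round(macro_average(pr_recall), 4)
```

Averaging the eight per-domain values this way reproduces the 0.5719 macro recall@10 reported for this PR.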

Scope

  • No changes to the standard retrieval path or shared modules; opt-in.
  • Metric/log format follows existing pipeline conventions.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mirrored in the Helm values.yaml file?

mahikaw changed the title from "Agentic Retrieval integration into retriever pipeline" to "Adding Agentic Retrieval as a new retrieval mode" on May 12, 2026
mahikaw force-pushed the dev/mahikaw/agentic_retrieval branch from 44daf00 to 4faa3c6 on June 9, 2026 16:34
mahikaw marked this pull request as ready for review on June 9, 2026 19:47
mahikaw requested review from a team as code owners on June 9, 2026 19:47
mahikaw requested a review from ChrisJar on June 9, 2026 19:47
mahikaw force-pushed the dev/mahikaw/agentic_retrieval branch from ed278c7 to 054256a on June 9, 2026 19:56
greptile-apps (Bot, Contributor) commented Jun 9, 2026

Greptile Summary

This PR introduces an opt-in agentic retrieval mode (--retrieval-mode agentic) that chains ReActAgentOperator -> RRFAggregatorOperator -> SelectionAgentOperator on top of the existing Retriever/VDB, plus two new evaluation modes (beir, recall) backed by the same graph. The standard retrieval path is fully unchanged.

  • New agentic pipeline (agentic/retrieval.py): AgenticRetriever wraps the existing Retriever behind a per-call lock (documented), assigns positional query IDs internally, then maps them back to caller IDs via _raw_hits_to_agentic_result; supports concurrent query processing via ThreadPoolExecutor.
  • Priority-based result selection (selection_agent_operator.py): a source-priority chain (ReAct final_results -> RRF ranking -> LLM selection -> candidate ranking) now governs final output; in the AgenticRetriever pipeline the RRF step always fires, so LLM selection acts only as a last-resort fallback.
  • CLI additions (pipeline/__main__.py): six new --agentic-* flags and two evaluation modes wired into _run_agentic_evaluation; input validation and env-var resolution follow the same patterns as existing flags.
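The source-priority chain described in the second bullet can be sketched as a first-non-empty lookup. This is an illustrative reading of the review summary, not the `SelectionAgentOperator` code; the function name `pick_result_source`, its arguments, and the source labels are hypothetical.

```python
def pick_result_source(final_results, rrf_ranking, llm_selection, candidate_ranking):
    """Return (source_name, doc_ids) from the first non-empty source.

    Priority order follows the review summary:
    ReAct final_results -> RRF ranking -> LLM selection -> candidate ranking.
    """
    chain = [
        ("final_results", final_results),
        ("rrf", rrf_ranking),
        ("llm_selection", llm_selection),
        ("candidate_ranking", candidate_ranking),
    ]
    for name, docs in chain:
        if docs:  # skip None and empty lists
            return name, list(docs)
    return "none", []


# The agent produced no valid final_results, but RRF has a ranking,
# so the LLM selection and candidate ranking are never consulted.
source, docs = pick_result_source([], ["p3", "p7"], ["p7"], ["p1", "p3", "p7"])
```

Under this reading, any pipeline that always produces an RRF ranking reaches the LLM selection only when both earlier sources are empty, which matches the "last-resort fallback" remark.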

Confidence Score: 5/5

Safe to merge; the change is fully additive, the standard retrieval path is untouched, and the new agentic path is well-tested with mocks.

All findings are non-blocking quality improvements: an unbounded log line in _run_agentic_evaluation, a missing num_concurrent guard in the config, a duplicated logging helper, and very verbose per-step INFO logging. No functional bugs were found in the retrieval logic, ID mapping, concurrency ordering, or fallback chain.

Worth a closer look: pipeline/__main__.py for the unbounded _qrels log and the missing --agentic-api-key flag; react_agent_operator.py for the per-step INFO verbosity.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_retriever/src/nemo_retriever/agentic/retrieval.py | New file implementing the core agentic retrieval pipeline (AgenticRetrievalConfig, AgenticRetriever, evaluation helpers). Solid overall; the lock-serialized _retrieve_for_agent is now documented. Minor: num_concurrent is not validated in `__post_init__`. |
| nemo_retriever/src/nemo_retriever/agentic/`__init__.py` | Public `__all__` now includes all six exported symbols including run_agentic_beir_evaluation; addresses the prior comment about missing exports. |
| nemo_retriever/src/nemo_retriever/graph/react_agent_operator.py | Adds reasoning_effort, backend_top_k cap, _validate_final_results_args, concurrent output ordering fix, and extensive INFO-level step logging. The INFO log for each ReAct loop step will produce up to 6 x max_steps log lines per query, which can be very noisy with the default 50-step limit. |
| nemo_retriever/src/nemo_retriever/graph/selection_agent_operator.py | Adds source-priority fallback chain (final_results -> RRF -> LLM selection -> candidate_ranking), reasoning_effort forwarding, and result_source tracking. LLM selection is now effectively bypassed when the rrf_score column is present (always the case in the AgenticRetriever pipeline), making it a last-resort-only path. |
| nemo_retriever/src/nemo_retriever/graph/rrf_aggregator_operator.py | Adds react_final_rank tracking and has_valid_final_results propagation to the RRF output schema; clean and well-scoped change. |
| nemo_retriever/src/nemo_retriever/pipeline/`__main__.py` | Adds --retrieval-mode, 6 agentic CLI flags, and _run_agentic_evaluation(). The agentic LLM always uses remote_api_key with no --agentic-api-key override flag. Unbounded _qrels logging at INFO could produce very large log lines on big BEIR datasets. |
| nemo_retriever/tests/test_agentic_eval.py | New test file with 9 tests covering config validation, BEIR/recall evaluation, CLI flag wiring, and error paths. All external services are mocked. |
| nemo_retriever/tests/test_agentic_operators.py | Adds 8 new operator-level tests covering backend_top_k cap, final_results validation, concurrent ordering fix, RRF priority bypass, and fallback behavior. Comprehensive and well-named. |
| nemo_retriever/tests/test_graph_pipeline_cli.py | Renames one test to reflect the now-valid recall evaluation mode. Minimal, correct update. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    CLI["--retrieval-mode agentic"] --> AE["_run_agentic_evaluation()"]
    AE --> ARC["AgenticRetrievalConfig"]
    ARC --> AR["AgenticRetriever.retrieve()"]
    AR --> AQI["AgenticQueryInputOperator"]
    AQI --> REACT["ReActAgentOperator (max_steps=50)"]
    REACT --> RETR["_retrieve_for_agent() via _lock"]
    RETR --> VDB[(VectorDB)]
    REACT --> RRF["RRFAggregatorOperator k=60"]
    RRF --> SAO["SelectionAgentOperator"]
    SAO --> P1["1 final_results"]
    SAO --> P2["2 RRF ranking"]
    SAO --> P3["3 LLM selection"]
    SAO --> P4["4 candidate_ranking"]
    SAO --> OUT["pd.DataFrame result"]
    OUT --> METRICS["compute_beir_metrics()"]
Prompt To Fix All With AI
Fix the following 5 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 5
nemo_retriever/src/nemo_retriever/agentic/retrieval.py:140-146
`num_concurrent` has no validation in `__post_init__`, but `react_max_steps` and `text_truncation` do. A programmatic caller who passes `num_concurrent=0` will not get an error from the config; the failure surfaces later as a `ValueError: max_workers must be greater than 0` from `ThreadPoolExecutor`, with no indication of which config field caused it.

```suggestion
    def __post_init__(self) -> None:
        if not str(self.llm_model).strip():
            raise ValueError("Agentic retrieval requires a non-empty llm_model.")
        if int(self.react_max_steps) < 1:
            raise ValueError("react_max_steps must be >= 1.")
        if int(self.text_truncation) < 0:
            raise ValueError("text_truncation must be >= 0.")
        if int(self.num_concurrent) < 1:
            raise ValueError("num_concurrent must be >= 1.")
```
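The late failure described in Issue 1 can be reproduced with the standard library alone. In the sketch below, `DemoConfig` is a hypothetical stand-in for `AgenticRetrievalConfig` that models only the one field; the point is to contrast where the error surfaces with and without the guard.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class DemoConfig:
    """Hypothetical stand-in for the real config; only one field is modeled."""

    num_concurrent: int = 1

    def __post_init__(self) -> None:
        if int(self.num_concurrent) < 1:
            raise ValueError("num_concurrent must be >= 1.")


def late_error() -> str:
    # Without a config guard, the bad value only fails when the pool is built,
    # and the message names max_workers rather than the config field.
    try:
        ThreadPoolExecutor(max_workers=0)
    except ValueError as exc:
        return str(exc)
    return ""


def early_error() -> str:
    # With the guard, construction fails immediately and names the field.
    try:
        DemoConfig(num_concurrent=0)
    except ValueError as exc:
        return str(exc)
    return ""
```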

### Issue 2 of 5
nemo_retriever/src/nemo_retriever/pipeline/__main__.py:641
**Unbounded qrels dict logged at INFO**

`_qrels` is logged in full without any size cap. On a ViDoRe domain with 300+ queries each with multiple relevant docs, this generates a single INFO line that can exceed log-aggregator limits and makes the log unreadable. The `_run` dict immediately below already applies `[:10]` per query; `_qrels` should do the same, or both should be dropped to DEBUG.
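One possible shape for the size cap is sketched below. It assumes `_qrels` is a BEIR-style `{query_id: {doc_id: relevance}}` mapping, which the review does not state explicitly; `preview_qrels`, `log_qrels`, and the default limits are hypothetical names and values.

```python
import logging

logger = logging.getLogger(__name__)


def preview_qrels(qrels, max_queries=5, max_docs=10):
    """Return a size-capped copy of a {query_id: {doc_id: relevance}} mapping."""
    preview = {}
    for qid in list(qrels)[:max_queries]:
        docs = qrels[qid]
        preview[qid] = dict(list(docs.items())[:max_docs])
    return preview


def log_qrels(qrels):
    # The full size goes in the message; only a bounded preview is serialized.
    logger.info("qrels: %d queries, preview=%s", len(qrels), preview_qrels(qrels))
```

Logging the total count alongside the preview keeps the line useful for sanity checks while bounding its length regardless of dataset size.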

### Issue 3 of 5
nemo_retriever/src/nemo_retriever/pipeline/__main__.py:849
**Agentic LLM always uses the general `remote_api_key`**

There is no `--agentic-api-key` CLI flag; the agentic LLM endpoint always receives `remote_api_key`. For deployments where the embedding endpoint and the agentic LLM endpoint are at different services that require different credentials, the wrong key will be sent to the LLM, resulting in an authentication error that gives no hint about which flag to set. Consider adding `--agentic-api-key` (defaulting to `remote_api_key` for backward compatibility) and resolving it the same way `agentic_invoke_url` is resolved from an env var.
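The suggested fallback order can be sketched with `argparse`. This is not the project's CLI code (which may use a different argument library): the `--remote-api-key` flag spelling and the `AGENTIC_API_KEY` environment variable are hypothetical names chosen for the example; only `--agentic-api-key` comes from the review.

```python
import argparse
import os


def resolve_agentic_api_key(argv, env=None):
    """Pick the agentic key: explicit flag, then env var, then the shared key."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--remote-api-key", default=None)  # hypothetical flag name
    parser.add_argument("--agentic-api-key", default=None)  # flag proposed in the review
    args = parser.parse_args(argv)
    return (
        args.agentic_api_key
        or env.get("AGENTIC_API_KEY")  # hypothetical env var name
        or args.remote_api_key
    )
```

Falling back to the shared key last keeps existing invocations working unchanged, which is the backward-compatibility property the review asks for.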

### Issue 4 of 5
nemo_retriever/src/nemo_retriever/graph/react_agent_operator.py:534-535
**Per-step INFO logging produces very high log volume**

Each ReAct iteration now emits at least 2-3 INFO lines (`begin seen_docs`, `finish_reason`, and `retrieve` or `final_results`). With `max_steps=50` (the default) and `num_concurrent=N` queries, a single evaluation call produces up to `50 x 3 x N` INFO lines. The step-level loop control logs (`begin`, `finish_reason`, `no tool call; requesting continuation`) are better suited to DEBUG since they carry no actionable information beyond the retrieval and `final_results` calls that already log at INFO.
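A minimal sketch of the proposed level split follows. The event kinds mirror the log names quoted in Issue 4, but the `run_steps` function, the `(kind, detail)` event tuples, and the logger name are hypothetical and do not reflect the operator's real control flow.

```python
import logging

logger = logging.getLogger("react_demo")  # hypothetical logger name


def run_steps(events):
    """Log loop-control events at DEBUG and retrieval/final events at INFO."""
    for step, (kind, detail) in enumerate(events, start=1):
        if kind in ("retrieve", "final_results"):
            logger.info("step %d %s: %s", step, kind, detail)
        else:
            # begin / finish_reason / continuation requests: loop bookkeeping only.
            logger.debug("step %d %s: %s", step, kind, detail)
```

With the logger at INFO, only the retrieval and `final_results` lines are emitted, so the per-query volume scales with the number of tool calls rather than with every loop iteration.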

### Issue 5 of 5
nemo_retriever/src/nemo_retriever/graph/react_agent_operator.py:13-21
**`_preview_text` / `_preview_doc_ids` duplicated in two modules**

Identical implementations of `_preview_text` (and near-identical `_preview_doc_ids`) are defined in both `react_agent_operator.py` and `selection_agent_operator.py` with the same module-level constants (`_LOG_PREVIEW_CHARS = 300`, `_LOG_DOC_ID_LIMIT = 20`). A shared `graph/_utils.py` would keep these in one place and prevent drift if the truncation limit ever needs adjusting.

Reviews (3): Last reviewed commit: "added review fixes"

Comment thread nemo_retriever/src/nemo_retriever/agentic/__init__.py
Comment thread nemo_retriever/src/nemo_retriever/agentic/__init__.py
Comment thread nemo_retriever/src/nemo_retriever/agentic/retrieval.py
Comment thread nemo_retriever/src/nemo_retriever/agentic/retrieval.py
Comment thread nemo_retriever/src/nemo_retriever/graph/react_agent_operator.py
mahikaw force-pushed the dev/mahikaw/agentic_retrieval branch from 054256a to ce71d17 on June 9, 2026 20:02
mahikaw added 6 commits June 9, 2026 20:26
Signed-off-by: Mahika Wason <mwason@nvidia.com> (repeated for all 6 commits)
mahikaw force-pushed the dev/mahikaw/agentic_retrieval branch from ce71d17 to 8c0af28 on June 9, 2026 20:29