Fix peer recovery activity tracking #20178

maxlepikhin · 2025-12-06T22:43:21Z

Description

Prevent peer/segment recovery timeouts from being reset by mere status polling. Previously, every call to ReplicationCollection#get() bumped the recovery’s lastAccessTime, so a hung recovery could sit in INITIALIZING indefinitely. This change removes that implicit update and explicitly touches the timestamp only when meaningful work happens (start, file transfer, clean‑files, translog, finalize, etc.), ensuring indices.recovery.recovery_activity_timeout actually fails stuck recoveries.

Related Issues

Resolves #20177

Check List

Functionality includes testing. (./gradlew :server:test --tests org.opensearch.recovery.ReplicationCollectionTests)
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

Summary by CodeRabbit

Bug Fixes
- Ensure recovery and replication targets record last-access times when actively used, and prevent frequent polling from inadvertently extending timeouts.
- Stop updating last-access time on simple collection reference creation to avoid misleading activity metrics.
Tests
- Added tests validating last-access updates and that polling does not reset recovery/replication timeouts.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

coderabbitai · 2025-12-06T22:43:43Z

Walkthrough

Centralized last-access-time updates: removed automatic updates from ReplicationCollection refs and added explicit setLastAccessTime() calls at recovery and replication checkpoints to record activity for timeout detection across recovery and replication flows.

Changes

Cohort / File(s)	Summary
Recovery handler access-time updates `server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java`	Added `recoveryTarget.setLastAccessTime()` calls at multiple recovery handler entry points (FilesInfo, FileChunk, CleanFiles, PrepareForTranslogOperations, FinalizeRecovery, HandoffPrimaryContext, TranslogOperations) and in the main doRecovery path.
Replication operation access-time updates `server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java`, `server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java`	Added `target.setLastAccessTime()` immediately after obtaining replication targets and at the start of `handleFileChunk` to mark activity when replication work begins.
Removed automatic access-time updates `server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java`	Removed the implicit `status.setLastAccessTime()` call from `ReplicationCollection.ReplicationRef` constructor; last-access updates are now explicit.
Tests: verify polling vs explicit updates `server/src/test/java/org/opensearch/recovery/ReplicationCollectionTests.java`	Modified `testLastAccessTimeUpdate` to assert that `get()` does not auto-update lastAccessTime and that explicit `setLastAccessTime()` does; added `testRecoveryTimeoutNotResetByPolling` to ensure polling status does not reset recovery timeout.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Multiple similar edits lower cognitive load, but review needs to ensure coverage consistency.
Pay special attention to:
- All recovery and replication entry points where activity should reset the timeout (PeerRecoveryTargetService handlers, SegmentReplicator, ReplicationTarget).
- Any code paths that previously relied on automatic updates from ReplicationRef — ensure no gaps introduced.
- Tests for timing (flaky/wait) and their reliability across CI environments.

Poem

🐰 I hop through logs and tenderly trace,
Timestamps at checkpoints to quicken the race.
No more hidden peeks that silently stay,
Now every hard hop marks time on the way.
Freed slots and fresh retries — a happier place! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 11.76% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Fix peer recovery activity tracking' directly addresses the main change: fixing how recovery activity (last access time) is tracked to prevent timeouts from being reset by mere polling.
Description check	✅ Passed	The description clearly explains what the change achieves, identifies the related issue (#20177), and confirms testing. All key template sections are present and adequately filled.
Linked Issues check	✅ Passed	The PR directly addresses the first objective from issue #20177: auto-failing hung recoveries by removing implicit lastAccessTime updates on polling and explicitly updating it only during meaningful work, enabling indices.recovery.recovery_activity_timeout to function correctly.
Out of Scope Changes check	✅ Passed	All changes are strictly scoped to implementing the peer recovery timeout fix: removing implicit updates in ReplicationCollection constructor, adding explicit setLastAccessTime() calls in recovery handlers, and adding relevant tests.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Signed-off-by: Max Lepikhin <[email protected]>

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (3)

server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java (1)

239-247: Explicit activity bump on each received file chunk

Updating replicationTarget.setLastAccessTime() at the top of handleFileChunk makes chunk traffic reliably count as recovery activity, independent of status polling. Note that FileChunkTransportRequestHandler also calls setLastAccessTime() before invoking this method, so you now have two updates per chunk; that’s harmless but could be consolidated in one place later if you want to avoid duplication.

server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java (1)

229-241: Good separation between “real recovery work” and status polling

The new recoveryTarget.setLastAccessTime() calls in doRecovery and all the recovery request handlers ensure that only meaningful events (start/restart, files info, file chunks, clean files, prepare/finalize, handoff primary context, and translog ops, including mapping‑retry runs) advance the activity timestamp. This matches the goal of letting idle/hung recoveries time out even while status is being polled via ReplicationCollection.get().

There is a bit of benign duplication (e.g., for translog ops and file chunks you now bump lastAccessTime in both the transport handler and the downstream helper), so if you later want to tighten things up you could centralize the update in a single layer per operation type.

Also applies to: 399-408, 414-425, 431-438, 447-457, 471-473, 535-552, 559-573, 585-595

server/src/test/java/org/opensearch/recovery/ReplicationCollectionTests.java (1)

96-127: Solid regression test; consider tightening latch assertion

testRecoveryTimeoutNotResetByPolling does a good job simulating frequent status polling and asserting that the recovery still times out (via the custom listener and failed flag), which directly protects against the original bug.

For extra clarity in failure cases, you might also assert that latch.await(30, TimeUnit.SECONDS) itself returns true before checking failed.get(), so a stuck test reports “recovery did not complete” rather than only “failed to timeout.”

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0449ce8 and afeebd6.

📒 Files selected for processing (5)

server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java (9 hunks)
server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java (1 hunks)
server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java (0 hunks)
server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java (1 hunks)
server/src/test/java/org/opensearch/recovery/ReplicationCollectionTests.java (2 hunks)

💤 Files with no reviewable changes (1)

server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: gradle-check

🔇 Additional comments (2)

server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java (1)

320-333: Aligns segment replication with explicit activity tracking

Calling target.setLastAccessTime() immediately after obtaining the ReplicationRef cleanly reintroduces the “replication just (re)started” activity bump that was previously hidden in ReplicationRef’s constructor. This keeps timeout behavior correct after the implicit update removal.

server/src/test/java/org/opensearch/recovery/ReplicationCollectionTests.java (1)

72-88: Tests now correctly enforce explicit lastAccessTime updates

The extra assertBusy before calling setLastAccessTime() is a nice regression guard: it proves collection.get(recoveryId) alone no longer mutates lastAccessTime, and that only an explicit setLastAccessTime() changes what subsequent observers see. This matches the new ReplicationRef behavior.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java (1)

399-402: Preparing for translog phase now properly refreshes activity timestamp

Updating lastAccessTime right after resolving the RecoveryTarget for PREPARE_TRANSLOG ties the timeout to an actual protocol step rather than to generic collection access, which matches the stated goal.

You repeat the getSafe(...) → recoveryRef.get() → setLastAccessTime() pattern in several handlers; consider a small helper to centralize this so future handlers can’t forget the activity update.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between afeebd6 and 4263de5.

📒 Files selected for processing (5)

server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java (9 hunks)
server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java (1 hunks)
server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java (0 hunks)
server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java (1 hunks)
server/src/test/java/org/opensearch/recovery/ReplicationCollectionTests.java (2 hunks)

💤 Files with no reviewable changes (1)

server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java

✅ Files skipped from review due to trivial changes (1)

server/src/test/java/org/opensearch/recovery/ReplicationCollectionTests.java

🚧 Files skipped from review as they are similar to previous changes (2)

server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java
server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: gradle-check

🔇 Additional comments (8)

server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java (8)

239-241: Explicit activity bump at start/re-establish of recovery looks correct

Calling recoveryTarget.setLastAccessTime() immediately after acquiring the RecoveryTarget ensures long-running local prep (e.g., pre-recovery, remote segment sync) is treated as activity and prevents premature timeout, while still decoupling from bare status polling.

416-419: Finalize phase correctly counted as recovery activity

Bumping lastAccessTime before finalizeRecovery(...) ensures that a recovery doing only finalize/refresh work is not timed out as idle, which is consistent with “meaningful work” semantics.

435-438: Primary-context handoff now tracked as active recovery work

Setting lastAccessTime before handoffPrimaryContext(...) brings this path in line with the other phases; the handoff is a critical step and should indeed reset the inactivity timer.

449-452: Translog-ops request correctly refreshes activity timestamp

Updating lastAccessTime upon receipt of a TRANSLOG_OPS request ensures that inbound translog batches keep the recovery alive and avoids timeouts being reset by unrelated collection lookups.

471-473: Extra activity bump in performTranslogOps is justified for retry path

Calling setLastAccessTime() again inside performTranslogOps covers internal retries triggered after mapping updates (which re-enter via performTranslogOps without a new transport message), so those retries won’t be misclassified as idle; the double bump for the initial call is harmless.

536-539: Files-info phase now participates in recovery timeout semantics

Marking activity when handling FILES_INFO aligns this early phase with the rest of the recovery pipeline and prevents a recovery from timing out while the source is legitimately sending file metadata.

560-563: Clean-files step correctly refreshes last access time

Treating CLEAN_FILES as activity makes sense, since it can be non-trivial work and may be the only thing happening for a period; this fits well with the intent of recovery_activity_timeout.

585-588: File-chunk handling updates lastAccessTime at the right place

Updating lastAccessTime right before handleFileChunk(...) ties the timer to actual segment-bytes transfer, which is central to the “no bytes transferred within the interval” heuristic and complements the additional updates in the replication layer.

github-actions · 2025-12-06T23:35:29Z

❌ Gradle check result for 4263de5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

maxlepikhin requested a review from a team as a code owner December 6, 2025 22:43

github-actions bot added _No response_ bug Something isn't working labels Dec 6, 2025

Fix peer recovery activity tracking

4263de5

Signed-off-by: Max Lepikhin <[email protected]>

maxlepikhin force-pushed the bugfix/peer-recovery-timeout branch from afeebd6 to 4263de5 Compare December 6, 2025 22:46

coderabbitai bot reviewed Dec 6, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix peer recovery activity tracking #20178

Fix peer recovery activity tracking #20178

maxlepikhin commented Dec 6, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Dec 6, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

github-actions bot commented Dec 6, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Fix peer recovery activity tracking #20178

Are you sure you want to change the base?

Fix peer recovery activity tracking #20178

Conversation

maxlepikhin commented Dec 6, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related Issues

Check List

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Dec 6, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

maxlepikhin commented Dec 6, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 6, 2025 •

edited

Loading