-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Fix peer recovery activity tracking #20178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix peer recovery activity tracking #20178
Conversation
WalkthroughCentralized last-access-time updates: removed automatic updates from ReplicationCollection refs and added explicit Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
Signed-off-by: Max Lepikhin <[email protected]>
afeebd6 to
4263de5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java (1)
239-247: Explicit activity bump on each received file chunkUpdating
replicationTarget.setLastAccessTime()at the top ofhandleFileChunkmakes chunk traffic reliably count as recovery activity, independent of status polling. Note thatFileChunkTransportRequestHandleralso callssetLastAccessTime()before invoking this method, so you now have two updates per chunk; that’s harmless but could be consolidated in one place later if you want to avoid duplication.server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java (1)
229-241: Good separation between “real recovery work” and status pollingThe new
recoveryTarget.setLastAccessTime()calls indoRecoveryand all the recovery request handlers ensure that only meaningful events (start/restart, files info, file chunks, clean files, prepare/finalize, handoff primary context, and translog ops, including mapping‑retry runs) advance the activity timestamp. This matches the goal of letting idle/hung recoveries time out even while status is being polled viaReplicationCollection.get().There is a bit of benign duplication (e.g., for translog ops and file chunks you now bump
lastAccessTimein both the transport handler and the downstream helper), so if you later want to tighten things up you could centralize the update in a single layer per operation type.Also applies to: 399-408, 414-425, 431-438, 447-457, 471-473, 535-552, 559-573, 585-595
server/src/test/java/org/opensearch/recovery/ReplicationCollectionTests.java (1)
96-127: Solid regression test; consider tightening latch assertion
testRecoveryTimeoutNotResetByPollingdoes a good job simulating frequent status polling and asserting that the recovery still times out (via the custom listener andfailedflag), which directly protects against the original bug.For extra clarity in failure cases, you might also assert that
latch.await(30, TimeUnit.SECONDS)itself returnstruebefore checkingfailed.get(), so a stuck test reports “recovery did not complete” rather than only “failed to timeout.”
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java(9 hunks)server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java(1 hunks)server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java(0 hunks)server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java(1 hunks)server/src/test/java/org/opensearch/recovery/ReplicationCollectionTests.java(2 hunks)
💤 Files with no reviewable changes (1)
- server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: gradle-check
🔇 Additional comments (2)
server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java (1)
320-333: Aligns segment replication with explicit activity trackingCalling
target.setLastAccessTime()immediately after obtaining theReplicationRefcleanly reintroduces the “replication just (re)started” activity bump that was previously hidden inReplicationRef’s constructor. This keeps timeout behavior correct after the implicit update removal.server/src/test/java/org/opensearch/recovery/ReplicationCollectionTests.java (1)
72-88: Tests now correctly enforce explicit lastAccessTime updatesThe extra
assertBusybefore callingsetLastAccessTime()is a nice regression guard: it provescollection.get(recoveryId)alone no longer mutateslastAccessTime, and that only an explicitsetLastAccessTime()changes what subsequent observers see. This matches the new ReplicationRef behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java (1)
399-402: Preparing for translog phase now properly refreshes activity timestampUpdating
lastAccessTimeright after resolving theRecoveryTargetforPREPARE_TRANSLOGties the timeout to an actual protocol step rather than to generic collection access, which matches the stated goal.You repeat the
getSafe(...) → recoveryRef.get() → setLastAccessTime()pattern in several handlers; consider a small helper to centralize this so future handlers can’t forget the activity update.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java(9 hunks)server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java(1 hunks)server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java(0 hunks)server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java(1 hunks)server/src/test/java/org/opensearch/recovery/ReplicationCollectionTests.java(2 hunks)
💤 Files with no reviewable changes (1)
- server/src/main/java/org/opensearch/indices/replication/common/ReplicationCollection.java
✅ Files skipped from review due to trivial changes (1)
- server/src/test/java/org/opensearch/recovery/ReplicationCollectionTests.java
🚧 Files skipped from review as they are similar to previous changes (2)
- server/src/main/java/org/opensearch/indices/replication/common/ReplicationTarget.java
- server/src/main/java/org/opensearch/indices/replication/SegmentReplicator.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: gradle-check
🔇 Additional comments (8)
server/src/main/java/org/opensearch/indices/recovery/PeerRecoveryTargetService.java (8)
239-241: Explicit activity bump at start/re-establish of recovery looks correctCalling
recoveryTarget.setLastAccessTime()immediately after acquiring theRecoveryTargetensures long-running local prep (e.g., pre-recovery, remote segment sync) is treated as activity and prevents premature timeout, while still decoupling from bare status polling.
416-419: Finalize phase correctly counted as recovery activityBumping
lastAccessTimebeforefinalizeRecovery(...)ensures that a recovery doing only finalize/refresh work is not timed out as idle, which is consistent with “meaningful work” semantics.
435-438: Primary-context handoff now tracked as active recovery workSetting
lastAccessTimebeforehandoffPrimaryContext(...)brings this path in line with the other phases; the handoff is a critical step and should indeed reset the inactivity timer.
449-452: Translog-ops request correctly refreshes activity timestampUpdating
lastAccessTimeupon receipt of aTRANSLOG_OPSrequest ensures that inbound translog batches keep the recovery alive and avoids timeouts being reset by unrelated collection lookups.
471-473: Extra activity bump inperformTranslogOpsis justified for retry pathCalling
setLastAccessTime()again insideperformTranslogOpscovers internal retries triggered after mapping updates (which re-enter viaperformTranslogOpswithout a new transport message), so those retries won’t be misclassified as idle; the double bump for the initial call is harmless.
536-539: Files-info phase now participates in recovery timeout semanticsMarking activity when handling
FILES_INFOaligns this early phase with the rest of the recovery pipeline and prevents a recovery from timing out while the source is legitimately sending file metadata.
560-563: Clean-files step correctly refreshes last access timeTreating
CLEAN_FILESas activity makes sense, since it can be non-trivial work and may be the only thing happening for a period; this fits well with the intent ofrecovery_activity_timeout.
585-588: File-chunk handling updates lastAccessTime at the right placeUpdating
lastAccessTimeright beforehandleFileChunk(...)ties the timer to actual segment-bytes transfer, which is central to the “no bytes transferred within the interval” heuristic and complements the additional updates in the replication layer.
|
❌ Gradle check result for 4263de5: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Description
Prevent peer/segment recovery timeouts from being reset by mere status polling. Previously, every call to ReplicationCollection#get() bumped the recovery’s lastAccessTime, so a hung recovery could sit in INITIALIZING indefinitely. This change removes that implicit update and explicitly touches the timestamp only when meaningful work happens (start, file transfer, clean‑files, translog, finalize, etc.), ensuring indices.recovery.recovery_activity_timeout actually fails stuck recoveries.
Related Issues
Resolves #20177
Check List
Summary by CodeRabbit
Bug Fixes
Tests
✏️ Tip: You can customize this high-level summary in your review settings.