downstreamadapter: harden table trigger takeover#5436
Conversation
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request refactors the bootstrap request handling in DispatcherOrchestrator by extracting helper functions to verify and initialize table trigger event and redo dispatchers, and updates operator retrieval to properly handle stale remove operators. It also adds extensive unit tests and updates an integration test to search logs across all CDC nodes. The review feedback highlights a critical mutex leak in handlePostBootstrapRequest due to a missing unlock when fenced, as well as a regression in the new helper functions which fail to gracefully handle write path closed errors during shutdown.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if m.fenced.Load() { | ||
| manager.MaintainerFenceMu.Unlock() | ||
| manager.LocalFence() | ||
| return nil | ||
| } |
There was a problem hiding this comment.
In handlePostBootstrapRequest, if m.fenced.Load() is true, the function returns early without unlocking manager.MaintainerFenceMu. This will cause a mutex leak and potential deadlocks during shutdown or subsequent requests. We should unlock manager.MaintainerFenceMu before calling manager.LocalFence() and returning.
| if m.fenced.Load() { | |
| manager.MaintainerFenceMu.Unlock() | |
| manager.LocalFence() | |
| return nil | |
| } | |
| if m.fenced.Load() { | |
| manager.MaintainerFenceMu.Unlock() | |
| manager.LocalFence() | |
| return nil | |
| } |
| if err := manager.NewTableTriggerEventDispatcher(id, startTs, false); err != nil { | ||
| log.Error("failed to create table trigger event dispatcher", | ||
| zap.Stringer("changefeedID", cfId), zap.Error(err)) | ||
| return err | ||
| } |
There was a problem hiding this comment.
If manager.NewTableTriggerEventDispatcher fails because the write path is closed, it is an expected state during shutdown or fencing. Logging it as an Error and returning the error to the maintainer is a regression from the original behavior (where it was logged as Info and returned nil). We should handle IsWritePathClosedError specially and return nil to allow clean shutdown.
if err := manager.NewTableTriggerEventDispatcher(id, startTs, false); err != nil {
if dispatchermanager.IsWritePathClosedError(err) {
log.Info("dispatcher manager write path closed while creating table trigger event dispatcher",
zap.Stringer("changefeedID", cfId), zap.Error(err))
return nil
}
log.Error("failed to create table trigger event dispatcher",
zap.Stringer("changefeedID", cfId), zap.Error(err))
return err
}| if err := manager.NewTableTriggerRedoDispatcher(id, startTs, false); err != nil { | ||
| log.Error("failed to create table trigger redo dispatcher", | ||
| zap.Stringer("changefeedID", cfId), zap.Error(err)) | ||
| return err | ||
| } |
There was a problem hiding this comment.
Similarly, if manager.NewTableTriggerRedoDispatcher fails because the write path is closed, we should handle IsWritePathClosedError specially and return nil instead of logging an Error and returning the error to the maintainer.
if err := manager.NewTableTriggerRedoDispatcher(id, startTs, false); err != nil {
if dispatchermanager.IsWritePathClosedError(err) {
log.Info("dispatcher manager write path closed while creating table trigger redo dispatcher",
zap.Stringer("changefeedID", cfId), zap.Error(err))
return nil
}
log.Error("failed to create table trigger redo dispatcher",
zap.Stringer("changefeedID", cfId), zap.Error(err))
return err
}…2-trigger-takeover
…2-trigger-takeover
…2-trigger-takeover
What problem does this PR solve?
Issue Number: close #5083
This is PR 3 of 3 split from #5182 and is stacked on PR 2.
Background:
PR 1 persists maintainer epochs before ownership changes. PR 2 fences stale
maintainer control messages at the maintainer and dispatcher-manager boundary.
This final PR hardens the table trigger takeover cases that still depend on
dispatcher-manager bootstrap recovery.
Motivation:
During a same-capture higher-epoch maintainer replacement, a reused dispatcher
manager can still have table trigger event or redo dispatchers from the previous
owner. Replacing a mismatched trigger in place can duplicate DDL ownership, while
dropping stale remove operators during bootstrap can lose the cleanup intent for
an already reported dispatcher.
What is changed and how it works?
This PR adds takeover-specific hardening:
dispatcher yet or already has the trigger dispatcher requested by the current
bootstrap owner.
rejects mismatched trigger IDs instead of replacing them in place.
bootstrap snapshot reports the dispatcher they remove.
close acknowledgements, closed-epoch tombstones, and table trigger ID mismatch
handling.
node logs because the merge task may run on any dispatcher-manager owner after
scheduling.
Stack:
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
No expected performance regression. The added checks run during maintainer
bootstrap and bootstrap response reconstruction, not on the event write path.
The behavior is compatible with rolling upgrades because it builds on the epoch
0 compatibility rules introduced in the previous PRs.
Do you need to update user documentation, design documentation or monitoring documentation?
No.
Release note
Validation
make fmtgo test ./downstreamadapter/dispatcherorchestrator ./downstreamadapter/dispatchermanager