downstreamadapter: harden table trigger takeover by hongyunyan · Pull Request #5436 · pingcap/ticdc

hongyunyan · 2026-06-18T01:10:58Z

What problem does this PR solve?

Issue Number: close #5083

This is PR 3 of 3 split from #5182 and is stacked on PR 2.

Background:

PR 1 persists maintainer epochs before ownership changes. PR 2 fences stale
maintainer control messages at the maintainer and dispatcher-manager boundary.
This final PR hardens the table trigger takeover cases that still depend on
dispatcher-manager bootstrap recovery.

Motivation:

During a same-capture higher-epoch maintainer replacement, a reused dispatcher
manager can still have table trigger event or redo dispatchers from the previous
owner. Replacing a mismatched trigger in place can duplicate DDL ownership, while
dropping stale remove operators during bootstrap can lose the cleanup intent for
an already reported dispatcher.

What is changed and how it works?

This PR adds takeover-specific hardening:

Verifies that a reused dispatcher manager either has no table trigger
dispatcher yet or already has the trigger dispatcher requested by the current
bootstrap owner.
Creates missing event and redo table trigger dispatchers during bootstrap, but
rejects mismatched trigger IDs instead of replacing them in place.
Restores stale remove operators in bootstrap responses only when the same
bootstrap snapshot reports the dispatcher they remove.
Keeps current-epoch bootstrap operators while filtering stale create operators.
Adds unit coverage for stale remove restoration, bootstrap snapshot filtering,
close acknowledgements, closed-epoch tombstones, and table trigger ID mismatch
handling.
Updates the in-flight syncpoint scheduling integration case to search all CDC
node logs because the merge task may run on any dispatcher-manager owner after
scheduling.

Stack:

#5434: coordinator epoch persistence and owner scheduling foundation.
#5435: maintainer and dispatcher-manager receiver-side epoch fence.
#5436: table trigger takeover hardening and integration coverage.

Check List

Tests

Unit test
Integration test

Questions

Will it cause performance regression or break compatibility?

No expected performance regression. The added checks run during maintainer
bootstrap and bootstrap response reconstruction, not on the event write path.

The behavior is compatible with rolling upgrades because it builds on the epoch
0 compatibility rules introduced in the previous PRs.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

Fix a bug where delayed stale maintainer requests could create duplicate dispatchers during maintainer failover.

Validation

make fmt
go test ./downstreamadapter/dispatcherorchestrator ./downstreamadapter/dispatchermanager

ti-chi-bot · 2026-06-18T01:11:01Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

ti-chi-bot · 2026-06-18T01:11:03Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kennytm for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-06-18T01:11:06Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 002e83db-f623-4fde-9530-8ce03d0504d4

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

gemini-code-assist

Code Review

This pull request refactors the bootstrap request handling in DispatcherOrchestrator by extracting helper functions to verify and initialize table trigger event and redo dispatchers, and updates operator retrieval to properly handle stale remove operators. It also adds extensive unit tests and updates an integration test to search logs across all CDC nodes. The review feedback highlights a critical mutex leak in handlePostBootstrapRequest due to a missing unlock when fenced, as well as a regression in the new helper functions which fail to gracefully handle write path closed errors during shutdown.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-18T01:12:47Z

 	if m.fenced.Load() {
-		manager.MaintainerFenceMu.Unlock()
 		manager.LocalFence()
 		return nil
 	}


In handlePostBootstrapRequest, if m.fenced.Load() is true, the function returns early without unlocking manager.MaintainerFenceMu. This will cause a mutex leak and potential deadlocks during shutdown or subsequent requests. We should unlock manager.MaintainerFenceMu before calling manager.LocalFence() and returning.

Suggested change

if m.fenced.Load() {

manager.MaintainerFenceMu.Unlock()

manager.LocalFence()

return nil

}

if m.fenced.Load() {

manager.MaintainerFenceMu.Unlock()

manager.LocalFence()

return nil

}

gemini-code-assist · 2026-06-18T01:12:47Z

+		if err := manager.NewTableTriggerEventDispatcher(id, startTs, false); err != nil {
+			log.Error("failed to create table trigger event dispatcher",
+				zap.Stringer("changefeedID", cfId), zap.Error(err))
+			return err
+		}


If manager.NewTableTriggerEventDispatcher fails because the write path is closed, it is an expected state during shutdown or fencing. Logging it as an Error and returning the error to the maintainer is a regression from the original behavior (where it was logged as Info and returned nil). We should handle IsWritePathClosedError specially and return nil to allow clean shutdown.

if err := manager.NewTableTriggerEventDispatcher(id, startTs, false); err != nil { if dispatchermanager.IsWritePathClosedError(err) { log.Info("dispatcher manager write path closed while creating table trigger event dispatcher", zap.Stringer("changefeedID", cfId), zap.Error(err)) return nil } log.Error("failed to create table trigger event dispatcher", zap.Stringer("changefeedID", cfId), zap.Error(err)) return err }

gemini-code-assist · 2026-06-18T01:12:47Z

+		if err := manager.NewTableTriggerRedoDispatcher(id, startTs, false); err != nil {
+			log.Error("failed to create table trigger redo dispatcher",
+				zap.Stringer("changefeedID", cfId), zap.Error(err))
+			return err
+		}


Similarly, if manager.NewTableTriggerRedoDispatcher fails because the write path is closed, we should handle IsWritePathClosedError specially and return nil instead of logging an Error and returning the error to the maintainer.

if err := manager.NewTableTriggerRedoDispatcher(id, startTs, false); err != nil { if dispatchermanager.IsWritePathClosedError(err) { log.Info("dispatcher manager write path closed while creating table trigger redo dispatcher", zap.Stringer("changefeedID", cfId), zap.Error(err)) return nil } log.Error("failed to create table trigger redo dispatcher", zap.Stringer("changefeedID", cfId), zap.Error(err)) return err }

…2-trigger-takeover

downstreamadapter: harden table trigger takeover

260e153

ti-chi-bot Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Jun 18, 2026

ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 18, 2026

This was referenced Jun 18, 2026

coordinator: persist maintainer epochs before ownership changes #5434

Merged

maintainer,dispatcher: fence stale maintainer epochs #5435

Draft

gemini-code-assist Bot reviewed Jun 18, 2026

View reviewed changes

hongyunyan added 3 commits June 22, 2026 15:00

Merge branch 'codex/pr-5182-stale-maintainer-fence' into codex/pr-518…

1953b27

…2-trigger-takeover

Merge branch 'codex/pr-5182-stale-maintainer-fence' into codex/pr-518…

22291bc

…2-trigger-takeover

Merge branch 'codex/pr-5182-stale-maintainer-fence' into codex/pr-518…

5b69301

…2-trigger-takeover

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

downstreamadapter: harden table trigger takeover#5436

downstreamadapter: harden table trigger takeover#5436
hongyunyan wants to merge 4 commits into
pingcap:codex/pr-5182-stale-maintainer-fencefrom
hongyunyan:codex/pr-5182-trigger-takeover

hongyunyan commented Jun 18, 2026 •

edited

Loading

Uh oh!

ti-chi-bot Bot commented Jun 18, 2026

Uh oh!

ti-chi-bot Bot commented Jun 18, 2026

Uh oh!

coderabbitai Bot commented Jun 18, 2026 •

edited

Loading

Review skipped

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 18, 2026

Uh oh!

gemini-code-assist Bot Jun 18, 2026

Uh oh!

gemini-code-assist Bot Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hongyunyan commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What is changed and how it works?

Check List

Tests

Questions

Will it cause performance regression or break compatibility?

Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Validation

Uh oh!

ti-chi-bot Bot commented Jun 18, 2026

Uh oh!

ti-chi-bot Bot commented Jun 18, 2026

Uh oh!

coderabbitai Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

hongyunyan commented Jun 18, 2026 •

edited

Loading

coderabbitai Bot commented Jun 18, 2026 •

edited

Loading