logpuller: refactor resolve lock rate limit logic & add more resolve lock metrics #4954
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Centralizes per-region resolve-lock throttling into a new rate limiter on subscribedSpan.
Changes: lock resolution, throttling, and metrics.
sequenceDiagram
participant Client as SubscriptionClient
participant Span as subscribedSpan
participant Queue as resolveLockTaskCh
participant Resolver as LockResolver
participant Metrics as Prometheus
Client->>Span: tryResolveLock(regionID, targetTs)
Span->>Span: check Initialized / ResolvedTs vs targetTs
Span->>Span: resolveLockLimiter.tryMark(regionID)
alt allowed
Span->>Queue: enqueue resolveLockTask
else denied
Span->>Span: (skip enqueue)
end
Note over Queue,Client: handler goroutine reads tasks
Queue->>Client: handleResolveLockTasks reads task
Client->>Resolver: Resolve(task.state, task.targetTs)
Resolver->>Metrics: increment processed_lock_count{status="found"/"resolved"}
Resolver-->>Client: (err / nil)
Client->>Metrics: increment resolve_lock_count{status="success"/"failure"}
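The diagram references resolveLockLimiter.tryMark(regionID) as the admission gate. A minimal sketch of such a per-region limiter, assuming a plain map-plus-mutex design (the struct layout and field names here are illustrative, not the PR's actual code):

```go
package logpuller

import (
	"sync"
	"time"
)

// resolveLockRateLimiter throttles resolve-lock attempts per region.
// Sketch only: the real limiter in this PR may differ in fields and behavior.
type resolveLockRateLimiter struct {
	mu          sync.Mutex
	minInterval time.Duration
	lastMark    map[uint64]time.Time // regionID -> last admitted time
}

func newResolveLockRateLimiter(minInterval time.Duration) *resolveLockRateLimiter {
	return &resolveLockRateLimiter{
		minInterval: minInterval,
		lastMark:    make(map[uint64]time.Time),
	}
}

// tryMark reports whether a resolve-lock task for regionID may be enqueued.
// It admits at most one task per region within minInterval.
func (r *resolveLockRateLimiter) tryMark(regionID uint64) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	now := time.Now()
	if last, ok := r.lastMark[regionID]; ok && now.Sub(last) < r.minInterval {
		return false
	}
	r.lastMark[regionID] = now
	return true
}
```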
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed (warnings)
Code Review
This pull request enhances the observability of the lock resolution process by introducing new Prometheus metrics and updating Grafana dashboards to track task statuses, durations, and scanned ranges. Key improvements include more granular tracking of skipped, throttled, and dropped tasks in the subscription client and transaction resolver. Feedback suggests correcting a unit mismatch in the LockResolveDuration metric, optimizing memory allocation in countLocksByTTL, and removing redundant metric increments for dropped tasks.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
logservice/logpuller/subscription_client.go (1)
1002-1064: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Drop the second, client-wide resolve-lock throttle.
tryResolveLock already throttles per subscribedSpan/region and records the enqueue time. The extra resolveLastRun[regionID] here is global to the whole client, so two subscriptions that share a TiKV region can suppress each other's valid resolve attempts for 10 seconds. That can skip real lock-resolution work and undercount the new execution metric.
Proposed change:
```diff
 func (s *subscriptionClient) handleResolveLockTasks(ctx context.Context) error {
-	resolveLastRun := make(map[uint64]time.Time)
-
 	doResolve := func(task resolveLockTask) {
 		keyspaceID := task.keyspaceID
 		regionID := task.regionID
 		state := task.state
@@
-		lastRun, ok := resolveLastRun[regionID]
-		if ok {
-			if time.Since(lastRun) < resolveLockMinInterval {
-				return
-			}
-		}
-
 		if s.lockResolver == nil {
 			metrics.SubscriptionClientResolveLockCounter.WithLabelValues(resolveLockMetricFailed).Inc()
 			log.Warn("subscription client lock resolver is nil",
@@
 		if err != nil {
 			log.Warn("subscription client resolve lock fail",
 				zap.Uint32("keyspaceID", keyspaceID),
 				zap.Uint64("regionID", regionID),
 				zap.Uint64("targetTs", targetTs),
-				zap.Time("lastRun", lastRun),
 				zap.Any("state", state),
 				zap.Error(err))
 		}
-		resolveLastRun[regionID] = time.Now()
 	}
-
-	gcTicker := time.NewTicker(resolveLockMinInterval * 3 / 2)
-	defer gcTicker.Stop()
 	for {
 		select {
 		case <-ctx.Done():
 			return ctx.Err()
-		case <-gcTicker.C:
-			resolveLastRun = gcResolveLastRunMap(resolveLastRun, time.Now())
 		case task := <-s.resolveLockTaskCh:
 			doResolve(task)
 		}
 	}
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@logservice/logpuller/subscription_client.go` around lines 1002 - 1064, The client currently applies a second, client-wide throttle via resolveLastRun inside handleResolveLockTasks which can incorrectly suppress valid per-region resolves; remove that global throttle by deleting the resolveLastRun map, the "lastRun" check (time.Since(lastRun) < resolveLockMinInterval), the resolveLastRun[regionID] = time.Now() assignment, the zap.Time("lastRun", lastRun) field in the error log, and the gcTicker + gcResolveLastRunMap cleanup logic; keep the rest of doResolve (including the s.lockResolver.Resolve call, metrics increments, and existing state checks) so per-subscription/region throttling from tryResolveLock remains the only throttle.
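If the diff above lands, the worker reduces to a plain dispatch loop, roughly as follows (a sketch assuming doResolve stays a closure holding the remaining logic; field names from the PR):

```go
func (s *subscriptionClient) handleResolveLockTasks(ctx context.Context) error {
	// doResolve keeps the nil-resolver check, the Resolve call, and the
	// metrics increments; only the second per-region cooldown is gone.
	doResolve := func(task resolveLockTask) { /* ... */ }

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case task := <-s.resolveLockTaskCh:
			// Throttling already happened in tryResolveLock before enqueue.
			doResolve(task)
		}
	}
}
```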
🧹 Nitpick comments (1)
logservice/txnutil/lock_resolver.go (1)
172-182: ⚡ Quick win
Snapshot the expired-lock lower bound before calling ResolveLocks.
countExpiredLocks runs after ResolveLocks returns, so a lock that expires while the RPC is in flight can be counted as "resolved" even though this invocation never handled it. That makes the "lower bound" in the comment inaccurate.
Proposed change:
```diff
-	msBeforeTxnExpired, err1 := kvStorage.GetLockResolver().ResolveLocks(bo, 0, locks)
+	expiredLockCount := countExpiredLocks(kvStorage, locks)
+	msBeforeTxnExpired, err1 := kvStorage.GetLockResolver().ResolveLocks(bo, 0, locks)
 	if err1 != nil {
 		return errors.Trace(err1)
 	}
 	resolvedLockCount := len(locks)
 	if msBeforeTxnExpired > 0 {
 		// ResolveLocks only reports remaining TTL, so use expired locks as the
 		// best available lower bound when a batch is only partially resolved.
-		resolvedLockCount = countExpiredLocks(kvStorage, locks)
+		resolvedLockCount = expiredLockCount
 	}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@logservice/txnutil/lock_resolver.go` around lines 172 - 182, The code computes countExpiredLocks(kvStorage, locks) after calling kvStorage.GetLockResolver().ResolveLocks(bo, 0, locks), which lets locks expire during the RPC and misattributes those to this resolver; to fix, call countExpiredLocks(kvStorage, locks) once before invoking ResolveLocks and store it (e.g. expiredLowerBound), then call ResolveLocks, and use that precomputed expiredLowerBound when msBeforeTxnExpired > 0 to compute resolvedLockCount before passing it to recordLockResolveLockCount(lockResolveMetricResolved, resolvedLockCount); keep the existing symbols ResolveLocks, countExpiredLocks, msBeforeTxnExpired, resolvedLockCount, recordLockResolveLockCount and kvStorage/locks to locate the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 471ff694-0242-46f4-a7b5-682dedb1a3c6
📒 Files selected for processing (7)
- logservice/logpuller/subscription_client.go
- logservice/logpuller/subscription_client_test.go
- logservice/txnutil/lock_resolver.go
- metrics/grafana/ticdc_new_arch.json
- metrics/nextgengrafana/ticdc_new_arch_next_gen.json
- pkg/metrics/log_puller.go
- pkg/metrics/puller.go
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
logservice/logpuller/subscription_client.go (1)
1016-1038: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Remove the second resolve-lock cooldown here.
tryResolveLock already throttles per region before enqueueing. Re-checking resolveLastRun in the worker resets the cooldown at execution time, so a task that sat in the channel or failed late can suppress the next retry for another full resolveLockMinInterval. In practice this stretches recovery to queue delay + resolve time + 10s instead of the configured interval.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@logservice/logpuller/subscription_client.go` around lines 1016 - 1038, The worker currently re-applies a per-region cooldown using resolveLastRun and resolveLockMinInterval and updates resolveLastRun after calling s.lockResolver.Resolve; remove that second cooldown so only tryResolveLock enforces throttling: delete the initial time.Since(lastRun) < resolveLockMinInterval check (and its early return), stop setting resolveLastRun[regionID] = time.Now() after Resolve, and clean up any logging that references lastRun; keep the Resolve call, metric increments, and error logging intact so tryResolveLock remains the single source of cooldown enforcement.
🧹 Nitpick comments (2)
logservice/logpuller/subscription_client_test.go (1)
149-161: 💤 Low value
Consider isolating the metric assertion to avoid cross-test interference.
The >= comparison at line 160 could pass incorrectly if other tests in the suite also increment SubscriptionClientResolveLockSuccessCounter concurrently. Since Prometheus counters are global singletons, parallel test execution could cause flakiness. Consider resetting/isolating the metric or using a more precise assertion that captures the exact delta.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@logservice/logpuller/subscription_client_test.go` around lines 149 - 161, The test currently uses a >= check against metrics.SubscriptionClientResolveLockSuccessCounter allowing other tests to affect the result; instead capture the initial value via testutil.ToFloat64(metrics.SubscriptionClientResolveLockSuccessCounter) and assert the exact delta after the operation (i.e., final == initial+1) or isolate the metric by resetting or using a private Prometheus registry for SubscriptionClientResolveLockSuccessCounter before sending to client.resolveLockTaskCh; update the require.Eventually assertion to compare equality (resolver.calls.Load() == 1 && testutil.ToFloat64(metrics.SubscriptionClientResolveLockSuccessCounter) == successBefore+1) or call a metric-reset helper targeted at SubscriptionClientResolveLockSuccessCounter to avoid cross-test interference.
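A delta-based assertion along the lines the comment suggests might look like this (a sketch, not the PR's test; the helper name is hypothetical and the metrics import path is assumed):

```go
import (
	"testing"
	"time"

	"github.com/pingcap/ticdc/pkg/metrics" // assumed import path
	"github.com/prometheus/client_golang/prometheus/testutil"
	"github.com/stretchr/testify/require"
)

// assertResolveLockSuccessDelta is a hypothetical helper: it captures the
// counter before enqueueing a task, then asserts the exact +1 delta so that
// concurrent increments from other tests cannot make a >= check pass spuriously.
func assertResolveLockSuccessDelta(t *testing.T, client *subscriptionClient, task resolveLockTask) {
	successBefore := testutil.ToFloat64(metrics.SubscriptionClientResolveLockSuccessCounter)

	client.resolveLockTaskCh <- task

	require.Eventually(t, func() bool {
		return testutil.ToFloat64(metrics.SubscriptionClientResolveLockSuccessCounter) == successBefore+1
	}, 5*time.Second, 10*time.Millisecond)
}
```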
metrics/grafana/ticdc_new_arch.json (1)
8292-8316: ⚡ Quick win
Consider separate Y-axis or panel split — targets A/B and C have different granularities.
Targets A and B count resolve-lock task executions (one increment per subscribedSpan attempt), while target C counts individual TiKV locks found/resolved (one increment per lock). In high-lock-rate scenarios, C can be orders of magnitude larger than A and B, which will compress the task-success/failure lines to near-invisible on a shared linear Y-axis. Two remediation options:
- Assign target C to a right Y-axis (yaxis: 2) so the two scales are independent.
- Split the panel into two: one for task-level success/failure rates, and one for lock counts.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@metrics/grafana/ticdc_new_arch.json` around lines 8292 - 8316, Targets A/B (refId "A","B" – resolve_lock success/failure expressions) are task-level counts and target C (refId "C" – ticdc_kvclient_lock_resolve_lock_count) is per-lock and can be orders of magnitude larger; fix by either assigning target C to the right Y-axis (set yaxis: 2 for the target object with refId "C") so A/B stay on the left axis, or split into two panels—one panel containing targets A and B and a second panel containing target C—so the scales don’t compress the task-level lines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@metrics/nextgengrafana/ticdc_new_arch_next_gen.json`:
- Around line 8295-8316: The PromQL metric names used by refId A, B, and C are
missing the OpenMetrics `_total` suffix and must be updated: change the
expression in refId A from ticdc_subscription_client_resolve_lock_success_count
to ticdc_subscription_client_resolve_lock_success_count_total, change refId B
from ticdc_subscription_client_resolve_lock_failure_count to
ticdc_subscription_client_resolve_lock_failure_count_total, and change refId C
from ticdc_kvclient_lock_resolve_lock_count to
ticdc_kvclient_lock_resolve_lock_count_total so the queries match the metrics
exposed by the Go/OpenMetrics client.
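For context on why the suffix matters here, a sketch of one plausible registration (the fully-qualified counter name comes from this comment; the namespace/subsystem split is an assumption, and the exposition behavior described is standard prometheus/client_golang + OpenMetrics semantics):

```go
import "github.com/prometheus/client_golang/prometheus"

// Scraped in the classic Prometheus text format, this counter appears as
// ticdc_subscription_client_resolve_lock_success_count; when the scraper
// negotiates the OpenMetrics format instead, the client appends the
// mandatory counter suffix and exposes it as
// ticdc_subscription_client_resolve_lock_success_count_total, which is the
// series name the dashboards then have to query.
var resolveLockSuccessCounter = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "ticdc",
	Subsystem: "subscription_client",
	Name:      "resolve_lock_success_count",
})
```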
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 626703dd-3000-4fad-89a4-59e319b268db
📒 Files selected for processing (7)
- logservice/logpuller/subscription_client.go
- logservice/logpuller/subscription_client_test.go
- logservice/txnutil/lock_resolver.go
- metrics/grafana/ticdc_new_arch.json
- metrics/nextgengrafana/ticdc_new_arch_next_gen.json
- pkg/metrics/log_puller.go
- tests/integration_tests/resolve_lock/main.go
…nge-metrics # Conflicts: # logservice/logpuller/subscription_client_test.go
Actionable comments posted: 1
🧹 Nitpick comments (1)
logservice/txnutil/lock_resolver.go (1)
98-109: ⚡ Quick win
Wrap LocateRegionByID errors inside reloadRegion.
reloadRegion now returns the raw LocateRegionByID error and both callers wrap it later. That loses the "wrap at source" convention and makes future callers easy to get wrong. Wrap the third-party error in the closure and let callers return it as-is.
♻️ Suggested change:
```diff
 reloadRegion := func() error {
 	var err error
 	loc, err = kvStorage.GetRegionCache().LocateRegionByID(bo, regionID)
 	if err != nil {
-		return err
+		return errors.Trace(err)
 	}
 	key = loc.StartKey
 	endKey = loc.EndKey
 	return nil
 }
 if err = reloadRegion(); err != nil {
-	return errors.Trace(err)
+	return err
 }
 ...
 if regionErr != nil {
 	err = bo.Backoff(tikv.BoRegionMiss(), errors.New(regionErr.String()))
 	if err != nil {
 		return errors.Trace(err)
 	}
 	if err := reloadRegion(); err != nil {
-		return errors.Trace(err)
+		return err
 	}
 	continue
 }
```
As per coding guidelines, "When an error comes from a third party/library call, wrap it immediately with errors.Trace(err) or errors.WrapError(...) to attach a stack trace; upstream callers should propagate wrapped errors without wrapping again".
Also applies to: 132-134
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@logservice/txnutil/lock_resolver.go` around lines 98 - 109, The closure reloadRegion currently returns the raw error from kvStorage.GetRegionCache().LocateRegionByID and callers wrap it later; change reloadRegion so it immediately wraps any error from LocateRegionByID (e.g., return errors.Trace(err) or errors.WrapError(err)) before returning, so callers can propagate it unchanged; apply the same fix for the other closure/usage around the later LocateRegionByID call (the block referenced at 132-134) so all third-party errors are wrapped at the source (refer to reloadRegion, LocateRegionByID, kvStorage.GetRegionCache()).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@logservice/logpuller/subscription_client.go`:
- Around line 138-139: The resolve-lock limiter is currently scoped per
subscribedSpan which allows multiple enqueues for the same TiKV region and
records time in tryMark too early; move the resolveLockRateLimiter out of
subscribedSpan and into subscriptionClient as a shared limiter keyed by
(keyspaceID, regionID) so throttling is per-region, not per-span; also separate
“pending/in-flight” dedupe from the “last execution time” (used by
resolveLockMinInterval) and update that timestamp when the actual resolve starts
or finishes (e.g., in handleResolveLockTasks) rather than in tryMark to prevent
admitting another task for the same region while one is still pending.
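One way to realize the suggested split between admission dedupe and execution timing is sketched below (names and layout are illustrative, not the PR's code):

```go
import (
	"sync"
	"time"
)

// regionKey identifies a TiKV region across keyspaces.
type regionKey struct {
	keyspaceID uint32
	regionID   uint64
}

// sharedResolveLimiter separates admission dedupe (pending) from execution
// timing (lastRun), as the comment above suggests.
type sharedResolveLimiter struct {
	mu          sync.Mutex
	minInterval time.Duration
	pending     map[regionKey]struct{}  // regions with a task enqueued or in flight
	lastRun     map[regionKey]time.Time // when a resolve last actually executed
}

// tryMark admits a task only if the region has no pending task and the last
// execution is older than minInterval. It deliberately records no timestamp.
func (l *sharedResolveLimiter) tryMark(k regionKey) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, inFlight := l.pending[k]; inFlight {
		return false
	}
	if last, ok := l.lastRun[k]; ok && time.Since(last) < l.minInterval {
		return false
	}
	l.pending[k] = struct{}{}
	return true
}

// markDone is called when the resolve finishes (e.g. from
// handleResolveLockTasks), recording the execution time and clearing pending.
func (l *sharedResolveLimiter) markDone(k regionKey) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.pending, k)
	l.lastRun[k] = time.Now()
}
```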
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b62055f8-d659-4fd9-a7b9-7d991cb5c1cd
📒 Files selected for processing (10)
- logservice/logpuller/resolve_lock_rate_limiter.go
- logservice/logpuller/resolve_lock_rate_limiter_test.go
- logservice/logpuller/subscription_client.go
- logservice/logpuller/subscription_client_test.go
- logservice/txnutil/lock_resolver.go
- metrics/grafana/ticdc_new_arch.json
- metrics/nextgengrafana/ticdc_new_arch_next_gen.json
- pkg/metrics/init.go
- pkg/metrics/log_puller.go
- pkg/metrics/puller.go
💤 Files with no reviewable changes (2)
- pkg/metrics/puller.go
- pkg/metrics/init.go
🚧 Files skipped from review as they are similar to previous changes (1)
- logservice/logpuller/subscription_client_test.go
/gemini summary
Summary of Changes
This pull request refactors the resolve-lock logic in the logpuller component to improve efficiency and observability. By centralizing the rate-limiting mechanism and introducing granular metrics, the system now better handles concurrent resolve-lock requests and provides clearer insights into the lock resolution process. These changes ensure that redundant tasks are avoided and that the performance of lock resolution can be effectively monitored via updated dashboards.
Activity
/test all
/retest
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: wk989898
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
[LGTM Timeline notifier]
Timeline:
What problem does this PR solve?
Issue Number: close #5016
What is changed and how it works?
This pull request refactors the resolve-lock logic in the logpuller component to improve efficiency and observability. By centralizing the rate-limiting mechanism and introducing granular metrics, the system now better handles concurrent resolve-lock requests and provides clearer insights into the lock resolution process. These changes ensure that redundant tasks are avoided and that the performance of lock resolution can be effectively monitored via updated dashboards.
Highlights
- Introduced a new resolveLockRateLimiter in subscriptionClient to deduplicate and throttle resolve-lock tasks per region, ensuring consistent behavior across multiple subscriptions.
- Refactored handleResolveLockTasks to directly invoke the lock resolver and improved task scheduling logic to prevent redundant execution.
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note
Summary by CodeRabbit
New Features
Improvements
Documentation
Tests