
logpuller: refactor resolve lock rate limit logic & add more resolve lock metrics #4954

Open

lidezhu wants to merge 24 commits into master from ldz/add-delete-range-metrics

Conversation

@lidezhu
Collaborator

@lidezhu lidezhu commented Apr 29, 2026

What problem does this PR solve?

Issue Number: close #5016

What is changed and how it works?

This pull request refactors the resolve-lock logic in the logpuller component to improve efficiency and observability. By centralizing the rate-limiting mechanism and introducing granular metrics, the system now better handles concurrent resolve-lock requests and provides clearer insights into the lock resolution process. These changes ensure that redundant tasks are avoided and that the performance of lock resolution can be effectively monitored via updated dashboards.

Highlights

  • Resolve Lock Throttling: Introduced a centralized resolveLockRateLimiter in subscriptionClient to deduplicate and throttle resolve-lock tasks per region, ensuring consistent behavior across multiple subscriptions (a sketch of the enqueue-side gating follows the highlights below).
  • Metrics Instrumentation: Added comprehensive metrics for resolve-lock success/failure and processed-lock counts, and updated Grafana dashboards to visualize these new metrics.
  • Task Handling Refactor: Refactored handleResolveLockTasks to directly invoke the lock resolver and improved task scheduling logic to prevent redundant execution.
  • Test Improvements: Added new unit tests for the rate limiter and integration test adjustments to ensure correct lock generation and throttling behavior.
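The enqueue-side gating described in the highlights can be pictured with the following Go sketch. It is a minimal reconstruction from the description above, not the merged code: the field names, the limiter interface, and the drop-counter hook are all assumptions.

```go
package logpuller

type resolveLockTask struct {
	regionID uint64
	targetTs uint64
}

// subscribedSpan is reduced to the few hypothetical fields this sketch needs.
type subscribedSpan struct {
	resolvedTs func() uint64 // current resolved ts of the span
	limiter    interface {
		tryMark(regionID uint64) bool
		unmark(regionID uint64)
	}
	resolveLockTaskCh chan<- resolveLockTask
	dropCounter       func() // stands in for the dropped-task metric
}

// tryResolveLock gates enqueueing: skip stale targets, ask the limiter for
// admission, and undo the mark if the task channel is full so the next
// attempt is not throttled.
func (s *subscribedSpan) tryResolveLock(regionID, targetTs uint64) {
	if s.resolvedTs() >= targetTs {
		return // the span already advanced past targetTs
	}
	if !s.limiter.tryMark(regionID) {
		return // throttled: a task for this region ran too recently
	}
	select {
	case s.resolveLockTaskCh <- resolveLockTask{regionID: regionID, targetTs: targetTs}:
	default:
		s.limiter.unmark(regionID) // channel full: allow a retry later
		s.dropCounter()            // count the dropped task
	}
}
```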

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

  • New Features

    • Added resolve-lock execution metrics: success/failure counters and processed-lock (found/resolved) counts.
  • Improvements

    • Per-region throttling moved into task gating to prevent duplicate work; task enqueueing is more strictly validated and drop counting refined; resolve logging is sampled and resolved-lock counting improved.
  • Documentation

    • Grafana dashboards updated to show resolve-lock success/failure and processed-locks.
  • Tests

    • Added/updated tests for throttling, metrics, and handler behavior.

@ti-chi-bot ti-chi-bot Bot added the do-not-merge/needs-linked-issue and release-note (Denotes a PR that will be considered when it comes time to generate release notes.) labels Apr 29, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 29, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Centralizes per-region resolve-lock throttling into a new rate limiter on subscribedSpan, simplifies the resolveLockTask payload, moves enqueue gating into tryResolveLock, refactors handleResolveLockTasks to call lockResolver.Resolve directly and emit success/failure metrics, instruments found/resolved lock counts in the lock resolver, and updates tests, dashboards, and an integration-test lock generator.
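The limiter's source is not quoted in this thread, so the following Go sketch reconstructs its likely shape from the tryMark/unmark/gc names in the walkthrough; the map layout and locking strategy are assumptions.

```go
package logpuller

import (
	"sync"
	"time"
)

// resolveLockRateLimiter admits at most one resolve-lock task per region
// within minInterval; gc drops stale entries so the map stays bounded.
type resolveLockRateLimiter struct {
	mu          sync.Mutex
	minInterval time.Duration
	lastMark    map[uint64]time.Time // regionID -> last admitted time
}

func newResolveLockRateLimiter(minInterval time.Duration) *resolveLockRateLimiter {
	return &resolveLockRateLimiter{
		minInterval: minInterval,
		lastMark:    make(map[uint64]time.Time),
	}
}

// tryMark reports whether a task for regionID may be enqueued now,
// recording the admission time on success.
func (r *resolveLockRateLimiter) tryMark(regionID uint64) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if last, ok := r.lastMark[regionID]; ok && time.Since(last) < r.minInterval {
		return false
	}
	r.lastMark[regionID] = time.Now()
	return true
}

// unmark clears the admission record, e.g. when enqueueing fails because
// the task channel is full, so the next attempt is not throttled.
func (r *resolveLockRateLimiter) unmark(regionID uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.lastMark, regionID)
}

// gc removes entries older than the throttle interval.
func (r *resolveLockRateLimiter) gc() {
	r.mu.Lock()
	defer r.mu.Unlock()
	for regionID, last := range r.lastMark {
		if time.Since(last) >= r.minInterval {
			delete(r.lastMark, regionID)
		}
	}
}
```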

Changes

Lock resolution, throttling, and metrics

  • Data Shape (logservice/logpuller/subscription_client.go): resolveLockTask removed its create timestamp (now: keyspaceID, regionID, targetTs, state).
  • Throttling Helper (logservice/logpuller/resolve_lock_rate_limiter.go, logservice/logpuller/resolve_lock_rate_limiter_test.go): new resolveLockRateLimiter with tryMark/unmark/gc; tests added for tryMark and GC behavior.
  • Enqueue / Wiring (logservice/logpuller/subscription_client.go): subscribedSpan gains resolveLockLimiter; tryResolveLock performs init/stale checks, uses resolveLockLimiter.tryMark, unmarks on cancel or channel-full, and increments a task-drop counter when enqueue fails.
  • Task Consumption / Control Flow (logservice/logpuller/subscription_client.go): handleResolveLockTasks refactored to call s.lockResolver.Resolve(...) per task, removed consumer-side last-run throttling/GC, and increments per-outcome metrics for success/failure.
  • Resolver Logic & Metrics (logservice/txnutil/lock_resolver.go): added package counters for locks "found" and "resolved"; limited sample logging of locks; refactored region reload logic; counts expired locks via countExpiredLocks and records processed-lock metrics.
  • Metric Declarations (pkg/metrics/log_puller.go, pkg/metrics/init.go): added SubscriptionClientResolveLockCounter and SubscriptionClientResolveLockProcessedLockCounter (both labeled by status) and registered them; removed the initPullerMetrics call from InitMetrics (a sketch of the declarations follows this list).
  • Tests (logservice/logpuller/subscription_client_test.go, logservice/logpuller/resolve_lock_rate_limiter_test.go): added mockLockResolver (atomic counter), updated TestGenerateResolveLockTask to assert no duplicate enqueues within the throttle interval and an empty channel before close, added TestHandleResolveLockTasksMetrics, and added tests for limiter behavior.
  • Dashboards (metrics/grafana/ticdc_new_arch.json, metrics/nextgengrafana/ticdc_new_arch_next_gen.json): replaced the single dropped-task graph with "Resolve Lock / s" showing status="success"/"failure" series plus a task-drop series; added a "Resolve Lock Processed Locks / s" panel for found/resolved.
  • Integration test adjustments (tests/integration_tests/resolve_lock/main.go): lock generation rewritten to deterministic full-table batching, added table splits, extended the wait, and made PrewriteResponse key errors fatal in lockBatch.
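To make the "Metric Declarations" row concrete, here is a sketch of how such status-labeled counters are typically declared with the Go Prometheus client. Only the variable names and the status label come from the summary above; the namespace, subsystem, metric names, and help strings are illustrative guesses.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Resolve-lock task executions, labeled status="success"/"failure".
	SubscriptionClientResolveLockCounter = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Namespace: "ticdc",
			Subsystem: "subscription_client",
			Name:      "resolve_lock_count",
			Help:      "Count of resolve-lock task executions by status.",
		}, []string{"status"})

	// Individual locks processed, labeled status="found"/"resolved".
	SubscriptionClientResolveLockProcessedLockCounter = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Namespace: "ticdc",
			Subsystem: "subscription_client",
			Name:      "resolve_lock_processed_lock_count",
			Help:      "Count of locks found/resolved during lock resolution.",
		}, []string{"status"})
)

func init() {
	prometheus.MustRegister(
		SubscriptionClientResolveLockCounter,
		SubscriptionClientResolveLockProcessedLockCounter,
	)
}
```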
sequenceDiagram
    participant Client as SubscriptionClient
    participant Span as subscribedSpan
    participant Queue as resolveLockTaskCh
    participant Resolver as LockResolver
    participant Metrics as Prometheus

    Client->>Span: tryResolveLock(regionID, targetTs)
    Span->>Span: check Initialized / ResolvedTs vs targetTs
    Span->>Span: resolveLockLimiter.tryMark(regionID)
    alt allowed
        Span->>Queue: enqueue resolveLockTask
    else denied
        Span->>Span: (skip enqueue)
    end
    Note over Queue,Client: handler goroutine reads tasks
    Queue->>Client: handleResolveLockTasks reads task
    Client->>Resolver: Resolve(task.state, task.targetTs)
    Resolver->>Metrics: increment processed_lock_count{status="found"/"resolved"}
    Resolver-->>Client: (err / nil)
    Client->>Metrics: increment resolve_lock_count{status="success"/"failure"}
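Read alongside the diagram, the consumer side can be sketched in Go as below. The struct fields, the Resolve signature, and the recordOutcome hook (standing in for the Prometheus counter) are assumptions drawn from the walkthrough, not the merged implementation.

```go
package logpuller

import "context"

type resolveLockTask struct {
	keyspaceID uint32
	regionID   uint64
	targetTs   uint64
	// state omitted for brevity
}

// lockResolver's Resolve signature here is a guess.
type lockResolver interface {
	Resolve(ctx context.Context, keyspaceID uint32, regionID, targetTs uint64) error
}

type subscriptionClient struct {
	resolveLockTaskCh chan resolveLockTask
	lockResolver      lockResolver
	recordOutcome     func(status string) // "success" or "failure"
}

// handleResolveLockTasks mirrors the diagram: read a task, call Resolve
// directly, and record a per-outcome metric.
func (s *subscriptionClient) handleResolveLockTasks(ctx context.Context) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case task := <-s.resolveLockTaskCh:
			if err := s.lockResolver.Resolve(ctx, task.keyspaceID, task.regionID, task.targetTs); err != nil {
				s.recordOutcome("failure")
				continue
			}
			s.recordOutcome("success")
		}
	}
}
```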

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • pingcap/ticdc#4187: Overlaps on resolve-lock handling and logging changes in subscription_client.go.
  • pingcap/ticdc#4240: Related per-region resolve-lock throttling and last-run tracking refactors.

Suggested labels

lgtm, approved

Suggested reviewers

  • asddongmen
  • hongyunyan
  • flowbehappy

Poem

🐰 I hop where stale locks quietly lie,
Marks per-region keep duplicate tries shy,
Counters whisper found and fixed in store,
Tasks queue, resolve runs, and metrics soar.
A carrot-cheer for every resolved chore.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 5.88%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check (⚠️ Warning): the PR description is incomplete. The Issue Number field contains a placeholder (#5016), but critical sections lack required details. Resolution: provide a concrete problem statement in the "What problem does this PR solve?" section, answer the questions on performance regression/compatibility and documentation updates, and fill the release-note section with an actual release note or "None".

✅ Passed checks (3 passed)

  • Linked Issues check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Title check (✅ Passed): the title directly describes the main changes (refactoring resolve-lock rate-limit logic and adding resolve-lock metrics in logpuller), which aligns with the modifications across subscription_client.go, the metrics files, and the new rate limiter implementation.




@ti-chi-bot ti-chi-bot Bot added the size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.) label Apr 29, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request enhances the observability of the lock resolution process by introducing new Prometheus metrics and updating Grafana dashboards to track task statuses, durations, and scanned ranges. Key improvements include more granular tracking of skipped, throttled, and dropped tasks in the subscription client and transaction resolver. Feedback suggests correcting a unit mismatch in the LockResolveDuration metric, optimizing memory allocation in countLocksByTTL, and removing redundant metric increments for dropped tasks.

Comment thread (outdated): logservice/txnutil/lock_resolver.go
Comment thread (outdated): logservice/txnutil/lock_resolver.go
Comment thread (outdated): logservice/logpuller/subscription_client.go
Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
logservice/logpuller/subscription_client.go (1)

1002-1064: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Drop the second, client-wide resolve-lock throttle.

tryResolveLock already throttles per subscribedSpan/region and records the enqueue time. The extra resolveLastRun[regionID] here is global to the whole client, so two subscriptions that share a TiKV region can suppress each other's valid resolve attempts for 10 seconds. That can skip real lock-resolution work and undercount the new execution metric.

Proposed change
 func (s *subscriptionClient) handleResolveLockTasks(ctx context.Context) error {
-	resolveLastRun := make(map[uint64]time.Time)
-
 	doResolve := func(task resolveLockTask) {
 		keyspaceID := task.keyspaceID
 		regionID := task.regionID
 		state := task.state
@@
-		lastRun, ok := resolveLastRun[regionID]
-		if ok {
-			if time.Since(lastRun) < resolveLockMinInterval {
-				return
-			}
-		}
-
 		if s.lockResolver == nil {
 			metrics.SubscriptionClientResolveLockCounter.WithLabelValues(resolveLockMetricFailed).Inc()
 			log.Warn("subscription client lock resolver is nil",
@@
 		if err != nil {
 			log.Warn("subscription client resolve lock fail",
 				zap.Uint32("keyspaceID", keyspaceID),
 				zap.Uint64("regionID", regionID),
 				zap.Uint64("targetTs", targetTs),
-				zap.Time("lastRun", lastRun),
 				zap.Any("state", state),
 				zap.Error(err))
 		}
-		resolveLastRun[regionID] = time.Now()
 	}
-
-	gcTicker := time.NewTicker(resolveLockMinInterval * 3 / 2)
-	defer gcTicker.Stop()
 	for {
 		select {
 		case <-ctx.Done():
 			return ctx.Err()
-		case <-gcTicker.C:
-			resolveLastRun = gcResolveLastRunMap(resolveLastRun, time.Now())
 		case task := <-s.resolveLockTaskCh:
 			doResolve(task)
 		}
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@logservice/logpuller/subscription_client.go` around lines 1002 - 1064, The
client currently applies a second, client-wide throttle via resolveLastRun
inside handleResolveLockTasks which can incorrectly suppress valid per-region
resolves; remove that global throttle by deleting the resolveLastRun map, the
"lastRun" check (time.Since(lastRun) < resolveLockMinInterval), the
resolveLastRun[regionID] = time.Now() assignment, the zap.Time("lastRun",
lastRun) field in the error log, and the gcTicker + gcResolveLastRunMap cleanup
logic; keep the rest of doResolve (including the s.lockResolver.Resolve call,
metrics increments, and existing state checks) so per-subscription/region
throttling from tryResolveLock remains the only throttle.
🧹 Nitpick comments (1)
logservice/txnutil/lock_resolver.go (1)

172-182: ⚡ Quick win

Snapshot the expired-lock lower bound before calling ResolveLocks.

countExpiredLocks runs after ResolveLocks returns, so a lock that expires while the RPC is in flight can be counted as "resolved" even though this invocation never handled it. That makes the "lower bound" in the comment inaccurate.

Proposed change
-		msBeforeTxnExpired, err1 := kvStorage.GetLockResolver().ResolveLocks(bo, 0, locks)
+		expiredLockCount := countExpiredLocks(kvStorage, locks)
+		msBeforeTxnExpired, err1 := kvStorage.GetLockResolver().ResolveLocks(bo, 0, locks)
 		if err1 != nil {
 			return errors.Trace(err1)
 		}
 		resolvedLockCount := len(locks)
 		if msBeforeTxnExpired > 0 {
 			// ResolveLocks only reports remaining TTL, so use expired locks as the
 			// best available lower bound when a batch is only partially resolved.
-			resolvedLockCount = countExpiredLocks(kvStorage, locks)
+			resolvedLockCount = expiredLockCount
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@logservice/txnutil/lock_resolver.go` around lines 172 - 182, The code
computes countExpiredLocks(kvStorage, locks) after calling
kvStorage.GetLockResolver().ResolveLocks(bo, 0, locks), which lets locks expire
during the RPC and misattributes those to this resolver; to fix, call
countExpiredLocks(kvStorage, locks) once before invoking ResolveLocks and store
it (e.g. expiredLowerBound), then call ResolveLocks, and use that precomputed
expiredLowerBound when msBeforeTxnExpired > 0 to compute resolvedLockCount
before passing it to recordLockResolveLockCount(lockResolveMetricResolved,
resolvedLockCount); keep the existing symbols ResolveLocks, countExpiredLocks,
msBeforeTxnExpired, resolvedLockCount, recordLockResolveLockCount and
kvStorage/locks to locate the change.
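Since countExpiredLocks is referenced but not shown in this thread, here is a self-contained Go sketch of what such a helper plausibly does: compare each lock's TTL against the physical time of a current TSO. The signature is a guess; per the suggestion above, the real helper takes kvStorage and may fetch the current ts itself.

```go
package txnutil

import (
	"github.com/tikv/client-go/v2/oracle"
	"github.com/tikv/client-go/v2/txnkv/txnlock"
)

// countExpiredLocks counts locks whose TTL has already elapsed relative to
// currentTs. A lock is considered expired once TTL milliseconds have passed
// since the physical time of its transaction's start timestamp.
func countExpiredLocks(currentTs uint64, locks []*txnlock.Lock) int {
	nowMs := oracle.ExtractPhysical(currentTs)
	expired := 0
	for _, l := range locks {
		startMs := oracle.ExtractPhysical(l.TxnID)
		if startMs+int64(l.TTL) <= nowMs {
			expired++
		}
	}
	return expired
}
```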

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 471ff694-0242-46f4-a7b5-682dedb1a3c6

📥 Commits

Reviewing files that changed from the base of the PR and between 5518eb2 and a1a1562.

📒 Files selected for processing (7)
  • logservice/logpuller/subscription_client.go
  • logservice/logpuller/subscription_client_test.go
  • logservice/txnutil/lock_resolver.go
  • metrics/grafana/ticdc_new_arch.json
  • metrics/nextgengrafana/ticdc_new_arch_next_gen.json
  • pkg/metrics/log_puller.go
  • pkg/metrics/puller.go

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
logservice/logpuller/subscription_client.go (1)

1016-1038: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Remove the second resolve-lock cooldown here.

tryResolveLock already throttles per region before enqueueing. Re-checking resolveLastRun in the worker resets the cooldown at execution time, so a task that sat in the channel or failed late can suppress the next retry for another full resolveLockMinInterval. In practice this stretches recovery to queue delay + resolve time + 10s instead of the configured interval.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@logservice/logpuller/subscription_client.go` around lines 1016 - 1038, The
worker currently re-applies a per-region cooldown using resolveLastRun and
resolveLockMinInterval and updates resolveLastRun after calling
s.lockResolver.Resolve; remove that second cooldown so only tryResolveLock
enforces throttling: delete the initial time.Since(lastRun) <
resolveLockMinInterval check (and its early return), stop setting
resolveLastRun[regionID] = time.Now() after Resolve, and clean up any logging
that references lastRun; keep the Resolve call, metric increments, and error
logging intact so tryResolveLock remains the single source of cooldown
enforcement.
🧹 Nitpick comments (2)
logservice/logpuller/subscription_client_test.go (1)

149-161: 💤 Low value

Consider isolating the metric assertion to avoid cross-test interference.

The >= comparison at line 160 could pass incorrectly if other tests in the suite also increment SubscriptionClientResolveLockSuccessCounter concurrently. Since Prometheus counters are global singletons, parallel test execution could cause flakiness.

Consider resetting/isolating the metric or using a more precise assertion that captures the exact delta (a sketch of such a helper follows this comment's AI prompt).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@logservice/logpuller/subscription_client_test.go` around lines 149 - 161, The
test currently uses a >= check against
metrics.SubscriptionClientResolveLockSuccessCounter allowing other tests to
affect the result; instead capture the initial value via
testutil.ToFloat64(metrics.SubscriptionClientResolveLockSuccessCounter) and
assert the exact delta after the operation (i.e., final == initial+1) or isolate
the metric by resetting or using a private Prometheus registry for
SubscriptionClientResolveLockSuccessCounter before sending to
client.resolveLockTaskCh; update the require.Eventually assertion to compare
equality (resolver.calls.Load() == 1 &&
testutil.ToFloat64(metrics.SubscriptionClientResolveLockSuccessCounter) ==
successBefore+1) or call a metric-reset helper targeted at
SubscriptionClientResolveLockSuccessCounter to avoid cross-test interference.
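The exact-delta pattern suggested above can be wrapped in a small test helper like the following sketch. testutil.ToFloat64 and require.Eventually are real APIs; the helper itself is hypothetical.

```go
package logpuller

import (
	"testing"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
	"github.com/stretchr/testify/require"
)

// assertExactlyOneIncrement snapshots a counter before the action and waits
// until it has advanced by exactly one, so a pre-existing nonzero value from
// earlier tests cannot mask a missing increment.
func assertExactlyOneIncrement(t *testing.T, counter prometheus.Counter, act func()) {
	before := testutil.ToFloat64(counter)
	act()
	require.Eventually(t, func() bool {
		return testutil.ToFloat64(counter) == before+1
	}, 5*time.Second, 50*time.Millisecond)
}
```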
metrics/grafana/ticdc_new_arch.json (1)

8292-8316: ⚡ Quick win

Consider separate Y-axis or panel split — targets A/B and C have different granularities

Targets A and B count resolve-lock task executions (one increment per subscribedSpan attempt), while target C counts individual TiKV locks found/resolved (one increment per lock). In high-lock-rate scenarios, C can be orders of magnitude larger than A and B, which will compress the task-success/failure lines to near-invisible on a shared linear Y-axis.

Two remediation options:

  • Assign target C to a right Y-axis (yaxis: 2) so the two scales are independent.
  • Split the panel into two: one for task-level success/failure rates, and one for lock counts.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@metrics/grafana/ticdc_new_arch.json` around lines 8292 - 8316, Targets A/B
(refId "A","B" – resolve_lock success/failure expressions) are task-level counts
and target C (refId "C" – ticdc_kvclient_lock_resolve_lock_count) is per-lock
and can be orders of magnitude larger; fix by either assigning target C to the
right Y-axis (set yaxis: 2 for the target object with refId "C") so A/B stay on
the left axis, or split into two panels—one panel containing targets A and B and
a second panel containing target C—so the scales don’t compress the task-level
lines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@metrics/nextgengrafana/ticdc_new_arch_next_gen.json`:
- Around line 8295-8316: The PromQL metric names used by refId A, B, and C are
missing the OpenMetrics `_total` suffix and must be updated: change the
expression in refId A from ticdc_subscription_client_resolve_lock_success_count
to ticdc_subscription_client_resolve_lock_success_count_total, change refId B
from ticdc_subscription_client_resolve_lock_failure_count to
ticdc_subscription_client_resolve_lock_failure_count_total, and change refId C
from ticdc_kvclient_lock_resolve_lock_count to
ticdc_kvclient_lock_resolve_lock_count_total so the queries match the metrics
exposed by the Go/OpenMetrics client.
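The mismatch arises because the Go client's OpenMetrics encoder appends _total to counter sample names. A minimal, runnable illustration follows; the counter name is taken from the comment above, and the port is arbitrary.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var resolveLockSuccess = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "ticdc_subscription_client_resolve_lock_success_count",
	Help: "Demo counter for the OpenMetrics naming difference.",
})

func main() {
	prometheus.MustRegister(resolveLockSuccess)
	resolveLockSuccess.Inc()
	// Classic text format exposes:  ticdc_subscription_client_resolve_lock_success_count 1
	// OpenMetrics format exposes:   ticdc_subscription_client_resolve_lock_success_count_total 1
	// so dashboards scraping OpenMetrics must query the _total name.
	http.Handle("/metrics", promhttp.HandlerFor(
		prometheus.DefaultGatherer,
		promhttp.HandlerOpts{EnableOpenMetrics: true},
	))
	_ = http.ListenAndServe(":9090", nil)
}
```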


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 626703dd-3000-4fad-89a4-59e319b268db

📥 Commits

Reviewing files that changed from the base of the PR and between a1a1562 and 8ed30c6.

📒 Files selected for processing (7)
  • logservice/logpuller/subscription_client.go
  • logservice/logpuller/subscription_client_test.go
  • logservice/txnutil/lock_resolver.go
  • metrics/grafana/ticdc_new_arch.json
  • metrics/nextgengrafana/ticdc_new_arch_next_gen.json
  • pkg/metrics/log_puller.go
  • tests/integration_tests/resolve_lock/main.go

Comment thread (outdated): metrics/nextgengrafana/ticdc_new_arch_next_gen.json
@ti-chi-bot ti-chi-bot Bot added the size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.) label and removed the size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files.) label May 3, 2026
@lidezhu lidezhu changed the title from "logpuller: add resolve lock metrics" to "logpuller: add more resolve lock metrics" May 3, 2026
@lidezhu lidezhu changed the title from "logpuller: add more resolve lock metrics" to "logpuller: refactor resolve lock logic & add more resolve lock metrics" May 3, 2026
…nge-metrics

# Conflicts:
#	logservice/logpuller/subscription_client_test.go
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
logservice/txnutil/lock_resolver.go (1)

98-109: ⚡ Quick win

Wrap LocateRegionByID errors inside reloadRegion.

reloadRegion now returns the raw LocateRegionByID error and both callers wrap it later. That loses the “wrap at source” convention and makes future callers easy to get wrong. Wrap the third-party error in the closure and let callers return it as-is.

♻️ Suggested change
 reloadRegion := func() error {
 	var err error
 	loc, err = kvStorage.GetRegionCache().LocateRegionByID(bo, regionID)
 	if err != nil {
-		return err
+		return errors.Trace(err)
 	}
 	key = loc.StartKey
 	endKey = loc.EndKey
 	return nil
 }
 if err = reloadRegion(); err != nil {
-	return errors.Trace(err)
+	return err
 }
 ...
 if regionErr != nil {
 	err = bo.Backoff(tikv.BoRegionMiss(), errors.New(regionErr.String()))
 	if err != nil {
 		return errors.Trace(err)
 	}
 	if err := reloadRegion(); err != nil {
-		return errors.Trace(err)
+		return err
 	}
 	continue
 }

As per coding guidelines, "When an error comes from a third party/library call, wrap it immediately with errors.Trace(err) or errors.WrapError(...) to attach a stack trace; upstream callers should propagate wrapped errors without wrapping again".

Also applies to: 132-134

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@logservice/txnutil/lock_resolver.go` around lines 98 - 109, The closure
reloadRegion currently returns the raw error from
kvStorage.GetRegionCache().LocateRegionByID and callers wrap it later; change
reloadRegion so it immediately wraps any error from LocateRegionByID (e.g.,
return errors.Trace(err) or errors.WrapError(err)) before returning, so callers
can propagate it unchanged; apply the same fix for the other closure/usage
around the later LocateRegionByID call (the block referenced at 132-134) so all
third-party errors are wrapped at the source (refer to reloadRegion,
LocateRegionByID, kvStorage.GetRegionCache()).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@logservice/logpuller/subscription_client.go`:
- Around line 138-139: The resolve-lock limiter is currently scoped per
subscribedSpan which allows multiple enqueues for the same TiKV region and
records time in tryMark too early; move the resolveLockRateLimiter out of
subscribedSpan and into subscriptionClient as a shared limiter keyed by
(keyspaceID, regionID) so throttling is per-region, not per-span; also separate
“pending/in-flight” dedupe from the “last execution time” (used by
resolveLockMinInterval) and update that timestamp when the actual resolve starts
or finishes (e.g., in handleResolveLockTasks) rather than in tryMark to prevent
admitting another task for the same region while one is still pending.
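A Go sketch of the limiter shape this comment proposes: shared at the client, keyed by (keyspaceID, regionID), with in-flight dedupe kept separate from the last-run timestamp that the worker records after Resolve. All names here are illustrative, not the merged code.

```go
package logpuller

import (
	"sync"
	"time"
)

// regionKey identifies a TiKV region across keyspaces.
type regionKey struct {
	keyspaceID uint32
	regionID   uint64
}

// clientResolveLockLimiter dedupes by in-flight state and throttles by the
// time a resolve actually ran, not by the time it was enqueued.
type clientResolveLockLimiter struct {
	mu          sync.Mutex
	minInterval time.Duration
	pending     map[regionKey]struct{}
	lastRun     map[regionKey]time.Time
}

// tryAdmit rejects a region that already has a task in flight or whose last
// resolve ran within minInterval; on success it marks the region pending.
func (l *clientResolveLockLimiter) tryAdmit(k regionKey) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, inFlight := l.pending[k]; inFlight {
		return false
	}
	if last, ok := l.lastRun[k]; ok && time.Since(last) < l.minInterval {
		return false
	}
	l.pending[k] = struct{}{}
	return true
}

// finish is called by the worker after Resolve completes: it clears the
// pending mark and records the execution time used for throttling.
func (l *clientResolveLockLimiter) finish(k regionKey) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.pending, k)
	l.lastRun[k] = time.Now()
}
```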


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b62055f8-d659-4fd9-a7b9-7d991cb5c1cd

📥 Commits

Reviewing files that changed from the base of the PR and between 0a0f215 and d0c17a4.

📒 Files selected for processing (10)
  • logservice/logpuller/resolve_lock_rate_limiter.go
  • logservice/logpuller/resolve_lock_rate_limiter_test.go
  • logservice/logpuller/subscription_client.go
  • logservice/logpuller/subscription_client_test.go
  • logservice/txnutil/lock_resolver.go
  • metrics/grafana/ticdc_new_arch.json
  • metrics/nextgengrafana/ticdc_new_arch_next_gen.json
  • pkg/metrics/init.go
  • pkg/metrics/log_puller.go
  • pkg/metrics/puller.go
💤 Files with no reviewable changes (2)
  • pkg/metrics/puller.go
  • pkg/metrics/init.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • logservice/logpuller/subscription_client_test.go

Comment thread (outdated): logservice/logpuller/subscription_client.go
@lidezhu
Collaborator Author

lidezhu commented May 9, 2026

/gemini summary

@gemini-code-assist

Summary of Changes

This pull request refactors the resolve-lock logic in the logpuller component to improve efficiency and observability. By centralizing the rate-limiting mechanism and introducing granular metrics, the system now better handles concurrent resolve-lock requests and provides clearer insights into the lock resolution process. These changes ensure that redundant tasks are avoided and that the performance of lock resolution can be effectively monitored via updated dashboards.

Highlights

  • Resolve Lock Throttling: Introduced a centralized resolveLockRateLimiter in subscriptionClient to deduplicate and throttle resolve-lock tasks per region, ensuring consistent behavior across multiple subscriptions.
  • Metrics Instrumentation: Added comprehensive metrics for resolve-lock success/failure and processed-lock counts, and updated Grafana dashboards to visualize these new metrics.
  • Task Handling Refactor: Refactored handleResolveLockTasks to directly invoke the lock resolver and improved task scheduling logic to prevent redundant execution.
  • Test Improvements: Added new unit tests for the rate limiter and integration test adjustments to ensure correct lock generation and throttling behavior.

Activity
  • Pull request created by lidezhu.
  • CodeRabbit automated review initiated and subsequently paused due to active development.
  • Code review comments provided regarding metric naming conventions and throttling logic.
  • Code changes implemented to address review feedback, including moving the rate limiter to the client scope and fixing metric naming.
  • Summary requested by the author.

@lidezhu lidezhu changed the title from "logpuller: refactor resolve lock logic & add more resolve lock metrics" to "logpuller: refactor resolve lock rate limit logic & add more resolve lock metrics" May 9, 2026
@lidezhu
Collaborator Author

lidezhu commented May 9, 2026

/test all

@lidezhu
Collaborator Author

lidezhu commented May 9, 2026

/retest

@ti-chi-bot ti-chi-bot Bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label May 9, 2026
@ti-chi-bot

ti-chi-bot Bot commented May 9, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wk989898

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot

ti-chi-bot Bot commented May 9, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-05-09 08:16:13.988350503 +0000 UTC m=+514846.861700475: ☑️ agreed by wk989898.

@ti-chi-bot ti-chi-bot Bot added the approved label May 9, 2026

Labels

  • approved
  • needs-1-more-lgtm: Indicates a PR needs 1 more LGTM.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.


Development

Successfully merging this pull request may close these issues.

Improve resolve lock observability and reduce duplicate attempts

2 participants