Skip to content

add OpenTelemetry metrics instrumentation#3110

Merged
PavelPashov merged 52 commits intoredis:masterfrom
PavelPashov:feat/add-opentelemetry-metrics
Mar 27, 2026
Merged

add OpenTelemetry metrics instrumentation#3110
PavelPashov merged 52 commits intoredis:masterfrom
PavelPashov:feat/add-opentelemetry-metrics

Conversation

@PavelPashov
Copy link
Copy Markdown
Contributor

@PavelPashov PavelPashov commented Oct 24, 2025

Description

TODO

  • Add docs
  • Add examples

Checklist

  • Does npm test pass with this change (including linting)?
  • Is the new or changed code fully tested?
  • Is a documentation update included (if this change modifies existing APIs, or introduces new ones)?

Note

Medium Risk
Touches core request/connection paths (RedisClient, RedisSocket, queues, cluster redirection) to emit metrics and maintain per-client identity/registry, so mistakes could add overhead or skew counts. Functional behavior should be unchanged, but instrumentation around errors/closures and cache invalidation needs careful review.

Overview
Adds first-class OpenTelemetry metrics support via a new OpenTelemetry.init() entrypoint and an internal metrics/registry layer that can be enabled before client creation.

Instruments core client operations to emit metrics for command durations (including MULTI/PIPELINE batches), connection lifecycle (create time, active count, close reasons, relaxed timeouts, handoff), resiliency/errors (including cluster redirections and maintenance notifications), pub/sub message in/out, stream lag from XREAD/XREADGROUP, and client-side caching hits/misses/evictions.

Introduces stable per-client identity (_clientId) and a ClientRegistry so observable gauges and metric attribution can track standalone clients, pools, and cluster nodes; updates pool/cluster/node creation to set roles/parent IDs and unregisters clients on close/destroy/quit. Documentation and an otel-metrics.js example are added, along with new OpenTelemetry dev dependencies and extensive tests for the metrics implementation.

Written by Cursor Bugbot for commit 6b04331. This will update automatically on new commits. Configure here.

@PavelPashov PavelPashov force-pushed the feat/add-opentelemetry-metrics branch 4 times, most recently from a11117e to 5fb1791 Compare October 31, 2025 11:39
@PavelPashov PavelPashov force-pushed the feat/add-opentelemetry-metrics branch 2 times, most recently from 1be4565 to 03db1a2 Compare November 5, 2025 11:03
Copy link
Copy Markdown

@jit-ci jit-ci bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ The following Jit checks failed to run:

  • license-compliance-checker
  • secret-detection-trufflehog
  • software-component-analysis-js
  • static-code-analysis-semgrep-pro

#jit_bypass_commit in this PR to bypass, Jit Admin privileges required.

More info in the Jit platform.

@PavelPashov PavelPashov force-pushed the feat/add-opentelemetry-metrics branch 3 times, most recently from 1106f8a to 242e98d Compare November 6, 2025 13:55
@elimelt
Copy link
Copy Markdown
Contributor

elimelt commented Dec 8, 2025

This is an exciting feature, I hope it makes it to main!

@jit-ci
Copy link
Copy Markdown

jit-ci bot commented Jan 19, 2026

❌ Security scan failed

Security scan failed: Branch feat/add-opentelemetry-metrics does not exist in the remote repository


💡 Need to bypass this check? Comment @sera bypass to override.

@PavelPashov PavelPashov force-pushed the feat/add-opentelemetry-metrics branch from 364ff86 to 30b0e8c Compare January 19, 2026 08:35
@jit-ci
Copy link
Copy Markdown

jit-ci bot commented Jan 19, 2026

❌ Security scan failed

Security scan failed: Branch feat/add-opentelemetry-metrics does not exist in the remote repository


💡 Need to bypass this check? Comment @sera bypass to override.

1 similar comment
@jit-ci
Copy link
Copy Markdown

jit-ci bot commented Jan 19, 2026

❌ Security scan failed

Security scan failed: Branch feat/add-opentelemetry-metrics does not exist in the remote repository


💡 Need to bypass this check? Comment @sera bypass to override.

@PavelPashov PavelPashov force-pushed the feat/add-opentelemetry-metrics branch 2 times, most recently from 0e91b34 to e2cc888 Compare February 20, 2026 15:32
@PavelPashov PavelPashov marked this pull request as ready for review February 25, 2026 11:28
TRANSFORM_LEGACY_REPLY?: boolean;
transformReply: TransformReply | Record<RespVersions, TransformReply>;
unstableResp3?: boolean;
onSuccess?: (args: ReadonlyArray<RedisArgument>, reply: unknown, clientId: string) => void;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nkaradzhov we might want to rename onSuccess to something more metrics/otel related

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets see if we can remove it altogether and wether its worth it

Comment on lines +1127 to +1128
if (command.onSuccess) {
command.onSuccess(parser.redisArgs, finalReply, this._self._clientId);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nkaradzhov there might be a better way to record specific command related metrics that require the server response

Comment on lines +65 to +73
// Build the appropriate function based on options
if (options.hasIncludeCommands || options.hasExcludeCommands) {
// Version with filtering
this.createRecordOperationDuration = this.#createWithFiltering.bind(this);
} else {
this.createRecordOperationDuration =
this.#createWithoutFiltering.bind(this);
}
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why have two fns?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nkaradzhov to make sure we don't do unnecessary checks after initiating the metrics singleton if the user hasn't provided any included or excluded commands

@PavelPashov PavelPashov force-pushed the feat/add-opentelemetry-metrics branch from bf97457 to adbe066 Compare March 5, 2026 14:57
@PavelPashov PavelPashov force-pushed the feat/add-opentelemetry-metrics branch from adbe066 to bc14732 Compare March 5, 2026 15:01
@PavelPashov
Copy link
Copy Markdown
Contributor Author

PavelPashov commented Mar 23, 2026

@nkaradzhov & @vladvildanov

Created a separated PR for conversion of ObservableGauges to up UpDownCounters.

After the current PR is released I'll need to create 3 issues for:

  1. redis.client.csc.network_saved - disabled, due to incorrect values computed from inside the metrics class
  2. db.client.connection.count - state is always used, this should be renamed and converted to ObservableGauge in order to be more accurate.
  3. db.client.connection.pending_requests - disabled, should be renamed and remain a ObservableGauge, as converting it to UpDownCounter will likely result in a performance hit when enabled.

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

logaretm added a commit to logaretm/node-redis that referenced this pull request Mar 26, 2026
No subscriber existed for this event. The original redis#3110 code only
recorded at wait end, not start. Remove the speculative event.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@nkaradzhov nkaradzhov dismissed vladvildanov’s stale review March 27, 2026 08:55

Already addressed

@nkaradzhov nkaradzhov self-requested a review March 27, 2026 08:56
TRANSFORM_LEGACY_REPLY?: boolean;
transformReply: TransformReply | Record<RespVersions, TransformReply>;
unstableResp3?: boolean;
onSuccess?: (args: ReadonlyArray<RedisArgument>, reply: unknown, clientId: string) => void;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets see if we can remove it altogether and wether its worth it

@PavelPashov PavelPashov merged commit f1a9b4a into redis:master Mar 27, 2026
26 of 28 checks passed
logaretm added a commit to logaretm/node-redis that referenced this pull request Mar 27, 2026
No subscriber existed for this event. The original redis#3110 code only
recorded at wait end, not start. Remove the speculative event.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PavelPashov added a commit that referenced this pull request Apr 2, 2026
* feat: add TracingChannel support with argument sanitization

Add diagnostics_channel TracingChannel instrumentation for commands
and connections. Wraps sendCommand, connect, pipeline, and MULTI
with traceCommand/traceConnect alongside existing OTel metrics.

- Sanitize args before emission using OTel redis-common rules
- Include clientId from identity system in trace context
- Add .catch(noop) for pipeline/MULTI to prevent unhandled rejections
- Gate on hasSubscribers for zero-cost when no APM subscribes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: add comprehensive unit tests for sanitizeArgs

Cover all serialization subsets (args=0, 1, 2, -1, default),
edge cases (empty args, AUTH, HELLO, unknown commands, prefix
matching like SETEX), case insensitivity, and non-string arg
stringification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: use ? placeholder for redacted args

Matches the SQL parameterization convention used by db.query.text
across APM tools (Sentry, Datadog, OTel).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: move OTel command metrics to TracingChannel subscribers

Replace inline createRecordOperationDuration/createRecordBatchOperationDuration
closures in sendCommand, _executePipeline, and _executeMulti with TracingChannel
subscriptions in OTelCommandMetrics.

- OTelCommandMetrics subscribes to node-redis:command and node-redis:batch channels
- Command filtering (include/exclude) handled in the start subscriber
- Batch metrics use new node-redis:batch channel wrapping pipeline/multi
- Zero inline OTel metric closures remain in core client code
- Noop classes no longer need closure methods
- Tests rewritten to verify filtering through TC events

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: remove NoopCommandMetrics class

No longer needed — when command metric group is disabled, no TC
subscriptions are created, so hasSubscribers is false and
traceCommand skips TracingChannel entirely. Zero cost without noops.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: move all inline OTel metrics to diagnostics_channel events

Replace every OTelMetrics.instance.* call in core code with
publish(CHANNELS.*, factory) point events. All observability data
now flows through diagnostics_channel:

- TracingChannel for async lifecycles (command, batch, connect)
- Plain dc.channel for point events (connection ready/closed,
  errors, maintenance, pubsub, cache, command replies)

Core code has zero OTelMetrics imports. The publish() function uses
factory callbacks + hasSubscribers checks for zero-cost when no
APM subscribes. Channel names are centralized in CHANNELS const map
with a type-safe ChannelEvents interface.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: consolidate all OTel metric classes into channel subscribers

Replace 6 individual metric classes (OTelConnectionBasicMetrics,
OTelConnectionAdvancedMetrics, OTelResiliencyMetrics,
OTelClientSideCacheMetrics, OTelPubSubMetrics, OTelStreamMetrics)
with a single OTelChannelSubscribers class that subscribes to
diagnostics_channel events.

- All metric recording now happens via channel subscriptions
- Noop classes eliminated — when a metric group is disabled, no
  subscription is created, zero overhead
- noop-metrics.ts reduced to a single NoopOTelMetrics shell
- IOTelMetrics interface simplified to just { commandMetrics }
- recordCommandReplyMetrics removed — pubsub out and stream lag
  metrics now handled by channel subscribers on node-redis:command:reply

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* test: remove no-op channel publish tests

These tests only verified that dc.channel().publish() doesn't crash,
which is a Node.js guarantee, not our responsibility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: use CHANNELS map consistently and tracingChannel API for subscriptions

- Add TRACE_COMMAND, TRACE_BATCH, TRACE_CONNECT to CHANNELS map
- Replace raw dc.subscribe() on sub-channel strings with
  dc.tracingChannel().subscribe() for proper TracingChannel usage
- Replace all remaining string literals in metrics.ts with CHANNELS.*

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove dead types and interfaces from opentelemetry module

Remove types that are no longer implemented by any class after the
channel subscriber refactor:

- RecordClientErrorContext, CommandReplyMetricHandler, ClientErrorOrigin,
  MetricErrorType, ConnectionCloseReason, CscResult, CscEvictionReason,
  ErrorCategory (type aliases)
- METRIC_INSTRUMENT_TYPE (unused const)
- RedisArgument import (no longer needed in types.ts)
- Dead re-exports from opentelemetry/index.ts

Const maps (CONNECTION_CLOSE_REASON, CSC_RESULT, etc.) are kept as
they're used by tests and may be useful for external consumers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: deduplicate diagnostics_channel loading

Export dc from tracing.ts and import it in metrics.ts instead of
loading it independently in both modules.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: rename BatchTraceContext to BatchOperationContext

Clarifies the distinction: BatchOperationContext is the batch
operation as a whole (MULTI/PIPELINE), BatchCommandTraceContext
is a single command within a batch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove unused CONNECTION_WAIT_START event

No subscriber existed for this event. The original #3110 code only
recorded at wait end, not start. Remove the speculative event.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: restore wasReady guard on connection count decrement

The original code only decremented db.client.connection.count when
the socket was previously ready. Our refactoring lost this gate in
destroySocket(), which could decrement below zero if destroy was
called before the socket became ready. Add wasReady to the
ConnectionClosedEvent so the subscriber can guard correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: export all channel names and event types for APM consumers

Export CHANNELS map, ChannelEvents type map, and all individual
event types so APM libraries can subscribe with proper typing:

  import { CHANNELS, type CommandTraceContext } from 'redis';
  dc.subscribe(CHANNELS.ERROR, (ctx: ClientErrorEvent) => { ... });

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: convert pool wait time to TracingChannel

Replace CONNECTION_WAIT_END point event with TRACE_CONNECTION_WAIT
TracingChannel. Pool wait is an async operation (start when no idle
client, end when one becomes available) — TracingChannel captures
the full lifecycle so APMs can create spans or measure duration.

Core code emits lifecycle events only. Timing is computed by
subscribers (OTel uses start/asyncEnd to measure duration).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: consistent serverPort typing across all event interfaces

Use number | undefined everywhere instead of mixing with number?.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: add getTracingChannel helper with auto-resolved context types

Eliminates repetitive ternary pattern. TracingChannelContextMap maps
channel names to their context types so getTracingChannel infers
the correct type without explicit generics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: stop exporting dc, use getTracingChannel and getChannel instead

metrics.ts no longer imports dc directly. All channel acquisition
goes through the exported helpers which handle caching and
undefined checks. dc remains private to tracing.ts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: unify trace functions into single generic trace()

Replace traceCommand, traceBatch, traceConnect, traceConnectionWait
with a single trace(CHANNELS.TRACE_*, fn, contextFactory) function.
Channel name resolves the context type via TracingChannelContextMap.

Also adds cache for getTracingChannel to avoid re-acquiring on
every call, matching getChannel's caching pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor: store unsubscriber closures instead of handler references

Replace separate #subscriptions and #tracingChannels arrays with a
single #unsubscribers list of cleanup closures. Simpler destroy(),
no need to store channel/handler pairs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: delete noop-metrics.ts

Inline the single remaining noop as an object literal in the
default #instance. No noop classes or files remain.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: deduplicate COMMAND_REPLY events and merge subscriber

Remove duplicate COMMAND_REPLY publish from sendCommand — typed
commands already publish via _executeCommand. Also merge the two
separate COMMAND_REPLY subscribers (pubsub out + streaming) into
a single #subscribeCommandReply method.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: reject tracing promise on pool acquire timeout

When the acquire timeout fires, rejectWait() is called so the
TracingChannel error channel fires — APMs see the timeout as an
error instead of an orphaned span that never closes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: remove noop-meter.ts

When OTel is not initialized or disabled, skip instrument registration
and subscriber creation entirely. No noop instruments needed — if
there are no subscribers, channels are never fired.

Removes 139-line noop-meter.ts and all noop fallback branches from
instrument creation helpers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: sanitize COMMAND_REPLY args and catch pool trace promise

- COMMAND_REPLY in _executeCommand now passes sanitizeArgs(parser.redisArgs)
  instead of raw args, preventing sensitive values from leaking to subscribers
- Pool connection wait trace promise gets .catch(noop) to prevent
  unhandled rejection if a tracing subscriber throws

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: reject pending wait traces on pool destroy

When pool.destroy() is called with tasks still queued, call
rejectWait() on each pending task so the TracingChannel error
channel fires and APM subscribers close their spans cleanly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: move const noop after imports in pool.ts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: E2E test failures from refactor

- Tracing test: check isOpen before destroying already-closed client
- OTel E2E tests: rewrite 5 tests that called removed methods
  (resiliencyMetrics.recordClientErrors, streamMetrics.recordStreamLag)
  to publish via channels instead
- Connection closed metric: split subscriber for basic (connection count)
  and advanced (close reason) so tests enabling only one group see
  correct metrics

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: record connection wait time when client is immediately available

The original code always recorded wait time, even for 0ms waits
when a client was available. The TracingChannel refactor only traced
the "no client, must wait" path. Add trace for immediate availability
so the metric is always emitted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: clean up comments, remove em dashes, be concise

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(perf): optimize channel acquisition

* refactor: deduplicate command error reporting, use TracingChannel error event

Remove redundant publish(CHANNELS.ERROR) from sendCommand — the
TracingChannel already emits error events for command failures.
Wire OTel resiliency subscriber to commandTC.error for command-level
error counts; CHANNELS.ERROR now only carries cluster/internal errors.

Extract subscribeTC() helper to reduce TracingChannel subscription
boilerplate and #recordError() to share error metric recording logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: filter redirection errors at call site, not in #recordError

The previous approach unconditionally dropped all MOVED/ASK errors in
#recordError, including cluster-origin ones that indicate slot migration.
Move the filter to each call site so client-origin redirections are
skipped while cluster-origin ones are still recorded.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use ctx.origin instead of ctx.internal for redirection deduplication

internal and origin are separate fields — an error can be non-internal
but still originate from the cluster. Use ctx.origin === 'client' as
Pavel suggested to correctly identify client-origin redirections.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: guard against disabled metrics in OTelMetrics constructor

Short-circuit with noop stubs when enabled is false, so metrics are not
recorded even when meterProvider and enabledMetricGroups are provided.
Update the disabled-metrics test to supply those options, verifying the
guard works rather than passing vacuously.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: treat MULTI as atomic operation, remove per-command tracing

Redis replies QUEUED for each command inside MULTI before EXEC runs,
so per-command traces resolve as successful even when EXEC fails. Only
the batch-level trace reflects the real outcome. Remove per-command
tracing from _executeMulti and keep only the TRACE_BATCH span.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* refactor(otel): publish pool connection wait as a point event

* chore: raise minimum supported Node.js version to 18.19.0

* fix(client): guard OTel metrics reset and gauge callbacks

* docs: add diagnostics_channel docs

* docs: clarify diagnostics tracing channel lifecycle

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Pavel Pashov <pavel.pashov@redis.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants