add OpenTelemetry metrics instrumentation#3110
Conversation
a11117e to
5fb1791
Compare
1be4565 to
03db1a2
Compare
There was a problem hiding this comment.
❌ The following Jit checks failed to run:
- license-compliance-checker
- secret-detection-trufflehog
- software-component-analysis-js
- static-code-analysis-semgrep-pro
#jit_bypass_commit in this PR to bypass, Jit Admin privileges required.
More info in the Jit platform.
1106f8a to
242e98d
Compare
|
This is an exciting feature, I hope it makes it to main! |
❌ Security scan failedSecurity scan failed: Branch feat/add-opentelemetry-metrics does not exist in the remote repository 💡 Need to bypass this check? Comment |
364ff86 to
30b0e8c
Compare
❌ Security scan failedSecurity scan failed: Branch feat/add-opentelemetry-metrics does not exist in the remote repository 💡 Need to bypass this check? Comment |
1 similar comment
❌ Security scan failedSecurity scan failed: Branch feat/add-opentelemetry-metrics does not exist in the remote repository 💡 Need to bypass this check? Comment |
0e91b34 to
e2cc888
Compare
packages/client/lib/RESP/types.ts
Outdated
| TRANSFORM_LEGACY_REPLY?: boolean; | ||
| transformReply: TransformReply | Record<RespVersions, TransformReply>; | ||
| unstableResp3?: boolean; | ||
| onSuccess?: (args: ReadonlyArray<RedisArgument>, reply: unknown, clientId: string) => void; |
There was a problem hiding this comment.
@nkaradzhov we might want to rename onSuccess to something more metrics/otel related
There was a problem hiding this comment.
lets see if we can remove it altogether and wether its worth it
packages/client/lib/client/index.ts
Outdated
| if (command.onSuccess) { | ||
| command.onSuccess(parser.redisArgs, finalReply, this._self._clientId); |
There was a problem hiding this comment.
@nkaradzhov there might be a better way to record specific command related metrics that require the server response
| // Build the appropriate function based on options | ||
| if (options.hasIncludeCommands || options.hasExcludeCommands) { | ||
| // Version with filtering | ||
| this.createRecordOperationDuration = this.#createWithFiltering.bind(this); | ||
| } else { | ||
| this.createRecordOperationDuration = | ||
| this.#createWithoutFiltering.bind(this); | ||
| } | ||
| } |
There was a problem hiding this comment.
@nkaradzhov to make sure we don't do unnecessary checks after initiating the metrics singleton if the user hasn't provided any included or excluded commands
… in command and CSC metrics
bf97457 to
adbe066
Compare
adbe066 to
bc14732
Compare
|
Created a separated PR for conversion of ObservableGauges to up UpDownCounters. After the current PR is released I'll need to create 3 issues for:
|
No subscriber existed for this event. The original redis#3110 code only recorded at wait end, not start. Remove the speculative event. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
packages/client/lib/RESP/types.ts
Outdated
| TRANSFORM_LEGACY_REPLY?: boolean; | ||
| transformReply: TransformReply | Record<RespVersions, TransformReply>; | ||
| unstableResp3?: boolean; | ||
| onSuccess?: (args: ReadonlyArray<RedisArgument>, reply: unknown, clientId: string) => void; |
There was a problem hiding this comment.
lets see if we can remove it altogether and wether its worth it
No subscriber existed for this event. The original redis#3110 code only recorded at wait end, not start. Remove the speculative event. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: add TracingChannel support with argument sanitization
Add diagnostics_channel TracingChannel instrumentation for commands
and connections. Wraps sendCommand, connect, pipeline, and MULTI
with traceCommand/traceConnect alongside existing OTel metrics.
- Sanitize args before emission using OTel redis-common rules
- Include clientId from identity system in trace context
- Add .catch(noop) for pipeline/MULTI to prevent unhandled rejections
- Gate on hasSubscribers for zero-cost when no APM subscribes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: add comprehensive unit tests for sanitizeArgs
Cover all serialization subsets (args=0, 1, 2, -1, default),
edge cases (empty args, AUTH, HELLO, unknown commands, prefix
matching like SETEX), case insensitivity, and non-string arg
stringification.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: use ? placeholder for redacted args
Matches the SQL parameterization convention used by db.query.text
across APM tools (Sentry, Datadog, OTel).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: move OTel command metrics to TracingChannel subscribers
Replace inline createRecordOperationDuration/createRecordBatchOperationDuration
closures in sendCommand, _executePipeline, and _executeMulti with TracingChannel
subscriptions in OTelCommandMetrics.
- OTelCommandMetrics subscribes to node-redis:command and node-redis:batch channels
- Command filtering (include/exclude) handled in the start subscriber
- Batch metrics use new node-redis:batch channel wrapping pipeline/multi
- Zero inline OTel metric closures remain in core client code
- Noop classes no longer need closure methods
- Tests rewritten to verify filtering through TC events
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: remove NoopCommandMetrics class
No longer needed — when command metric group is disabled, no TC
subscriptions are created, so hasSubscribers is false and
traceCommand skips TracingChannel entirely. Zero cost without noops.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: move all inline OTel metrics to diagnostics_channel events
Replace every OTelMetrics.instance.* call in core code with
publish(CHANNELS.*, factory) point events. All observability data
now flows through diagnostics_channel:
- TracingChannel for async lifecycles (command, batch, connect)
- Plain dc.channel for point events (connection ready/closed,
errors, maintenance, pubsub, cache, command replies)
Core code has zero OTelMetrics imports. The publish() function uses
factory callbacks + hasSubscribers checks for zero-cost when no
APM subscribes. Channel names are centralized in CHANNELS const map
with a type-safe ChannelEvents interface.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: consolidate all OTel metric classes into channel subscribers
Replace 6 individual metric classes (OTelConnectionBasicMetrics,
OTelConnectionAdvancedMetrics, OTelResiliencyMetrics,
OTelClientSideCacheMetrics, OTelPubSubMetrics, OTelStreamMetrics)
with a single OTelChannelSubscribers class that subscribes to
diagnostics_channel events.
- All metric recording now happens via channel subscriptions
- Noop classes eliminated — when a metric group is disabled, no
subscription is created, zero overhead
- noop-metrics.ts reduced to a single NoopOTelMetrics shell
- IOTelMetrics interface simplified to just { commandMetrics }
- recordCommandReplyMetrics removed — pubsub out and stream lag
metrics now handled by channel subscribers on node-redis:command:reply
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: remove no-op channel publish tests
These tests only verified that dc.channel().publish() doesn't crash,
which is a Node.js guarantee, not our responsibility.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: use CHANNELS map consistently and tracingChannel API for subscriptions
- Add TRACE_COMMAND, TRACE_BATCH, TRACE_CONNECT to CHANNELS map
- Replace raw dc.subscribe() on sub-channel strings with
dc.tracingChannel().subscribe() for proper TracingChannel usage
- Replace all remaining string literals in metrics.ts with CHANNELS.*
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: remove dead types and interfaces from opentelemetry module
Remove types that are no longer implemented by any class after the
channel subscriber refactor:
- RecordClientErrorContext, CommandReplyMetricHandler, ClientErrorOrigin,
MetricErrorType, ConnectionCloseReason, CscResult, CscEvictionReason,
ErrorCategory (type aliases)
- METRIC_INSTRUMENT_TYPE (unused const)
- RedisArgument import (no longer needed in types.ts)
- Dead re-exports from opentelemetry/index.ts
Const maps (CONNECTION_CLOSE_REASON, CSC_RESULT, etc.) are kept as
they're used by tests and may be useful for external consumers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: deduplicate diagnostics_channel loading
Export dc from tracing.ts and import it in metrics.ts instead of
loading it independently in both modules.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: rename BatchTraceContext to BatchOperationContext
Clarifies the distinction: BatchOperationContext is the batch
operation as a whole (MULTI/PIPELINE), BatchCommandTraceContext
is a single command within a batch.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: remove unused CONNECTION_WAIT_START event
No subscriber existed for this event. The original #3110 code only
recorded at wait end, not start. Remove the speculative event.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: restore wasReady guard on connection count decrement
The original code only decremented db.client.connection.count when
the socket was previously ready. Our refactoring lost this gate in
destroySocket(), which could decrement below zero if destroy was
called before the socket became ready. Add wasReady to the
ConnectionClosedEvent so the subscriber can guard correctly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: export all channel names and event types for APM consumers
Export CHANNELS map, ChannelEvents type map, and all individual
event types so APM libraries can subscribe with proper typing:
import { CHANNELS, type CommandTraceContext } from 'redis';
dc.subscribe(CHANNELS.ERROR, (ctx: ClientErrorEvent) => { ... });
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: convert pool wait time to TracingChannel
Replace CONNECTION_WAIT_END point event with TRACE_CONNECTION_WAIT
TracingChannel. Pool wait is an async operation (start when no idle
client, end when one becomes available) — TracingChannel captures
the full lifecycle so APMs can create spans or measure duration.
Core code emits lifecycle events only. Timing is computed by
subscribers (OTel uses start/asyncEnd to measure duration).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: consistent serverPort typing across all event interfaces
Use number | undefined everywhere instead of mixing with number?.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: add getTracingChannel helper with auto-resolved context types
Eliminates repetitive ternary pattern. TracingChannelContextMap maps
channel names to their context types so getTracingChannel infers
the correct type without explicit generics.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: stop exporting dc, use getTracingChannel and getChannel instead
metrics.ts no longer imports dc directly. All channel acquisition
goes through the exported helpers which handle caching and
undefined checks. dc remains private to tracing.ts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: unify trace functions into single generic trace()
Replace traceCommand, traceBatch, traceConnect, traceConnectionWait
with a single trace(CHANNELS.TRACE_*, fn, contextFactory) function.
Channel name resolves the context type via TracingChannelContextMap.
Also adds cache for getTracingChannel to avoid re-acquiring on
every call, matching getChannel's caching pattern.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: store unsubscriber closures instead of handler references
Replace separate #subscriptions and #tracingChannels arrays with a
single #unsubscribers list of cleanup closures. Simpler destroy(),
no need to store channel/handler pairs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: delete noop-metrics.ts
Inline the single remaining noop as an object literal in the
default #instance. No noop classes or files remain.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: deduplicate COMMAND_REPLY events and merge subscriber
Remove duplicate COMMAND_REPLY publish from sendCommand — typed
commands already publish via _executeCommand. Also merge the two
separate COMMAND_REPLY subscribers (pubsub out + streaming) into
a single #subscribeCommandReply method.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: reject tracing promise on pool acquire timeout
When the acquire timeout fires, rejectWait() is called so the
TracingChannel error channel fires — APMs see the timeout as an
error instead of an orphaned span that never closes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: remove noop-meter.ts
When OTel is not initialized or disabled, skip instrument registration
and subscriber creation entirely. No noop instruments needed — if
there are no subscribers, channels are never fired.
Removes 139-line noop-meter.ts and all noop fallback branches from
instrument creation helpers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: sanitize COMMAND_REPLY args and catch pool trace promise
- COMMAND_REPLY in _executeCommand now passes sanitizeArgs(parser.redisArgs)
instead of raw args, preventing sensitive values from leaking to subscribers
- Pool connection wait trace promise gets .catch(noop) to prevent
unhandled rejection if a tracing subscriber throws
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: reject pending wait traces on pool destroy
When pool.destroy() is called with tasks still queued, call
rejectWait() on each pending task so the TracingChannel error
channel fires and APM subscribers close their spans cleanly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: move const noop after imports in pool.ts
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: E2E test failures from refactor
- Tracing test: check isOpen before destroying already-closed client
- OTel E2E tests: rewrite 5 tests that called removed methods
(resiliencyMetrics.recordClientErrors, streamMetrics.recordStreamLag)
to publish via channels instead
- Connection closed metric: split subscriber for basic (connection count)
and advanced (close reason) so tests enabling only one group see
correct metrics
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: record connection wait time when client is immediately available
The original code always recorded wait time, even for 0ms waits
when a client was available. The TracingChannel refactor only traced
the "no client, must wait" path. Add trace for immediate availability
so the metric is always emitted.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* style: clean up comments, remove em dashes, be concise
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(perf): optimize channel acquisition
* refactor: deduplicate command error reporting, use TracingChannel error event
Remove redundant publish(CHANNELS.ERROR) from sendCommand — the
TracingChannel already emits error events for command failures.
Wire OTel resiliency subscriber to commandTC.error for command-level
error counts; CHANNELS.ERROR now only carries cluster/internal errors.
Extract subscribeTC() helper to reduce TracingChannel subscription
boilerplate and #recordError() to share error metric recording logic.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: filter redirection errors at call site, not in #recordError
The previous approach unconditionally dropped all MOVED/ASK errors in
#recordError, including cluster-origin ones that indicate slot migration.
Move the filter to each call site so client-origin redirections are
skipped while cluster-origin ones are still recorded.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: use ctx.origin instead of ctx.internal for redirection deduplication
internal and origin are separate fields — an error can be non-internal
but still originate from the cluster. Use ctx.origin === 'client' as
Pavel suggested to correctly identify client-origin redirections.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: guard against disabled metrics in OTelMetrics constructor
Short-circuit with noop stubs when enabled is false, so metrics are not
recorded even when meterProvider and enabledMetricGroups are provided.
Update the disabled-metrics test to supply those options, verifying the
guard works rather than passing vacuously.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: treat MULTI as atomic operation, remove per-command tracing
Redis replies QUEUED for each command inside MULTI before EXEC runs,
so per-command traces resolve as successful even when EXEC fails. Only
the batch-level trace reflects the real outcome. Remove per-command
tracing from _executeMulti and keep only the TRACE_BATCH span.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor(otel): publish pool connection wait as a point event
* chore: raise minimum supported Node.js version to 18.19.0
* fix(client): guard OTel metrics reset and gauge callbacks
* docs: add diagnostics_channel docs
* docs: clarify diagnostics tracing channel lifecycle
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Pavel Pashov <pavel.pashov@redis.com>

Description
TODO
Checklist
npm testpass with this change (including linting)?Note
Medium Risk
Touches core request/connection paths (
RedisClient,RedisSocket, queues, cluster redirection) to emit metrics and maintain per-client identity/registry, so mistakes could add overhead or skew counts. Functional behavior should be unchanged, but instrumentation around errors/closures and cache invalidation needs careful review.Overview
Adds first-class OpenTelemetry metrics support via a new
OpenTelemetry.init()entrypoint and an internal metrics/registry layer that can be enabled before client creation.Instruments core client operations to emit metrics for command durations (including
MULTI/PIPELINEbatches), connection lifecycle (create time, active count, close reasons, relaxed timeouts, handoff), resiliency/errors (including cluster redirections and maintenance notifications), pub/sub message in/out, stream lag fromXREAD/XREADGROUP, and client-side caching hits/misses/evictions.Introduces stable per-client identity (
_clientId) and aClientRegistryso observable gauges and metric attribution can track standalone clients, pools, and cluster nodes; updates pool/cluster/node creation to set roles/parent IDs and unregisters clients onclose/destroy/quit. Documentation and anotel-metrics.jsexample are added, along with new OpenTelemetry dev dependencies and extensive tests for the metrics implementation.Written by Cursor Bugbot for commit 6b04331. This will update automatically on new commits. Configure here.