Environment details
- OS type and version: macOS / Linux
- Python version: 3.13
- `google-cloud-spanner` version: 3.63.0 (current `main`)
Description
Every Spanner operation that goes through `trace_call()` produces orphan OpenTelemetry metric data points with incomplete resource labels (missing `project_id` and `instance_id`). These orphan data points persist for the process lifetime due to cumulative aggregation and are re-exported to Cloud Monitoring every 60 seconds, which rejects them with:
```
INVALID_ARGUMENT: One or more TimeSeries could not be written:
timeSeries[...]: the set of resource labels is incomplete, missing (instance_id)
```
Root cause
`trace_call()` in `_opentelemetry_tracing.py` wraps every operation with a bare `MetricsCapture()` (no `resource_info`). Meanwhile, every caller of `trace_call` already provides its own `MetricsCapture(self._resource_info)` with correct labels.
When Python evaluates `with trace_call(...) as span, MetricsCapture(self._resource_info):`, two separate `MetricsTracer` instances are created:
- `tracer_A` (from `trace_call`'s internal `MetricsCapture()`): has `instance_config`, `location`, `client_hash`, `client_uid`, `client_name` from the factory, but never receives `project_id` or `instance_id`
- `tracer_B` (from the caller's `MetricsCapture(resource_info)`): has correct labels and overwrites `tracer_A` in the context variable
On exit, `tracer_B` records correct metrics first, then `tracer_A` records metrics with incomplete labels. Since the `SpannerMetricsTracerFactory` never has `project_id`/`instance_id` in its `_client_attributes` (they are only set per-tracer via `resource_info` or the `MetricsInterceptor`), `tracer_A` always starts without them and is never populated, because the `MetricsInterceptor` only touches the current context-var tracer, which is `tracer_B`. The toy sketch below illustrates the mechanics.
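A self-contained toy model of the interaction (the class and variable names mirror the library's, but the bodies are deliberately simplified assumptions, not the real implementation):

```python
# Toy model of the bug: two nested MetricsCapture context managers create
# two tracers, and only the inner one is reachable via the context variable.
import contextvars
from contextlib import contextmanager

_current_tracer = contextvars.ContextVar("current_tracer", default=None)

class MetricsTracer:
    def __init__(self, resource_info):
        # Without resource_info, project_id/instance_id are simply absent.
        self.labels = dict(resource_info or {})

class MetricsCapture:
    def __init__(self, resource_info=None):
        self.tracer = MetricsTracer(resource_info)
    def __enter__(self):
        _current_tracer.set(self.tracer)  # a later entry overwrites an earlier one
        return self
    def __exit__(self, *exc):
        print("recording metrics with labels:", self.tracer.labels)

@contextmanager
def trace_call(name):
    with MetricsCapture():  # tracer_A: bare, no resource_info
        yield name

resource_info = {"project_id": "my-project", "instance_id": "my-instance"}
with trace_call("CloudSpanner.Op") as span, MetricsCapture(resource_info):
    pass  # tracer_B overwrote tracer_A in _current_tracer on entry

# Context managers exit inner-first, so the output is:
#   recording metrics with labels: {'project_id': 'my-project', 'instance_id': 'my-instance'}
#   recording metrics with labels: {}    <- tracer_A, incomplete labels
```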
With OpenTelemetry's cumulative aggregation, once these orphan aggregation buckets are created, they persist for the process lifetime and are re-exported every 60 seconds, as the sketch below shows in isolation.
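This re-export behavior is a property of the OpenTelemetry SDK itself, not of the Spanner client. A minimal sketch (the metric name and label here are illustrative assumptions) records a single data point and lets the default cumulative temporality re-export it on every collection interval:

```python
# Demonstrates cumulative temporality: a data point recorded once is
# included in every subsequent periodic export for the life of the process.
import time

from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

reader = PeriodicExportingMetricReader(
    ConsoleMetricExporter(), export_interval_millis=60_000
)
meter = MeterProvider(metric_readers=[reader]).get_meter("demo")
counter = meter.create_counter("attempt_count")

# One recording with an incomplete label set creates a cumulative
# aggregation bucket that is re-exported every 60 seconds from now on.
counter.add(1, {"instance_config": "regional-us-central1"})

time.sleep(180)  # keep the process alive to observe the repeated exports
```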
History
- PR feat: Add Attempt, Operation and GFE Metrics #1302 introduced the metrics system. All `MetricsCapture()` instances were bare, including the one in `trace_call`. The design relied on `MetricsInterceptor` to populate labels during gRPC calls.
- PR feat: implement native asyncio support via Cross-Sync #1509 added the `_resource_info` property and changed all caller sites from `MetricsCapture()` to `MetricsCapture(self._resource_info)` for eager label propagation. However, the bare `MetricsCapture()` inside `trace_call` was not removed, making it redundant and harmful.
Impact
- Affects every Spanner operation (~27 code paths) on every invocation
- Creates persistent orphan metric aggregation buckets
- Produces repeated `INVALID_ARGUMENT` error logs every 60 seconds
- Wastes CPU/network on exporting invalid TimeSeries
- Application functionality is unaffected; valid metrics from the caller's `MetricsCapture` still work
Steps to reproduce
- Create a `spanner.Client()` with metrics enabled (the default)
- Perform any Spanner operation (e.g., `session.create()`, `snapshot.execute_sql()`)
- Observe `INVALID_ARGUMENT` errors logged from the metrics exporter every 60 seconds (a runnable sketch follows)
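A minimal repro sketch; the instance and database names are placeholders, and any project with client-side metrics enabled should do:

```python
from google.cloud import spanner

client = spanner.Client()  # client-side metrics are enabled by default
instance = client.instance("my-instance")    # placeholder name
database = instance.database("my-database")  # placeholder name

# Any operation that goes through trace_call() triggers the bug.
with database.snapshot() as snapshot:
    list(snapshot.execute_sql("SELECT 1"))

# Within ~60 seconds the metrics exporter logs:
#   INVALID_ARGUMENT: One or more TimeSeries could not be written:
#   ... the set of resource labels is incomplete, missing (instance_id)
```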
Suggested fix
Remove the bare `MetricsCapture()` from `trace_call`; it is redundant, since every caller already provides its own `MetricsCapture(self._resource_info)`. See PR #1522. A simplified sketch of the resulting shape of `trace_call` follows.
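A minimal sketch of `trace_call` after the change; the signature is simplified as an assumption, since the real function takes more parameters and handles observability options:

```python
from contextlib import contextmanager

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@contextmanager
def trace_call(name, session=None, extra_attributes=None):
    # Before the fix, the `with` clause here also entered a bare
    # MetricsCapture(), creating tracer_A with incomplete labels.
    # After the fix, only the span is managed here; every caller wraps
    # its operation in MetricsCapture(self._resource_info) itself.
    with tracer.start_as_current_span(name) as span:
        yield span
```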