
feat(storage): implement App-centric Observability (ACO) OpenTelemetry tracing #17149

Draft
chandra-siri wants to merge 18 commits into googleapis:main from chandra-siri:feat/gcs-aco-tracing

@chandra-siri chandra-siri commented May 15, 2026

feat(storage): implement App-centric Observability (ACO) OpenTelemetry tracing

This PR adds App-centric Observability (ACO) tracing support to the GCS Python SDK (google-cloud-storage). OpenTelemetry trace spans produced by bucket and blob operations now carry the required destination resource attributes, gcp.resource.destination.id and gcp.resource.destination.location.


Core Architecture & Design

1. Centralized, DRY Telemetry Helper (_helpers.py)

  • Span creation, attribute injection, and exception handling are centralized in a module-level context manager, create_trace_span_helper, in _helpers.py.
  • The core tracing module is not modified: _opentelemetry_tracing.py is identical to main.
  • The main read/write operations in blob.py, bucket.py, and client.py are wrapped with the helper (e.g., download_as_bytes, upload_from_string, get_bucket, lookup_bucket).
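
The centralized-helper pattern described above can be sketched as follows. This is only an illustration, not the PR's create_trace_span_helper: `FakeSpan` and `trace_span_sketch` are hypothetical names, and `FakeSpan` stands in for a real OpenTelemetry span so the example runs without the SDK installed. Only the two attribute keys come from the description.

```python
from contextlib import contextmanager


class FakeSpan:
    """Hypothetical stand-in for an OpenTelemetry span (not the real SDK class)."""

    def __init__(self, name):
        self.name = name
        self.attributes = {}
        self.exception = None

    def set_attribute(self, key, value):
        self.attributes[key] = value

    def record_exception(self, exc):
        self.exception = exc


@contextmanager
def trace_span_sketch(name, destination_id=None, location=None):
    """Create a span, attach destination attributes, and record any exception."""
    span = FakeSpan(name)
    # Attribute keys are the ones named in the PR description.
    if destination_id is not None:
        span.set_attribute("gcp.resource.destination.id", destination_id)
    if location is not None:
        span.set_attribute("gcp.resource.destination.location", location)
    try:
        yield span
    except Exception as exc:
        # Record the failure on the span, then let the caller see it unchanged.
        span.record_exception(exc)
        raise


# Usage: wrap an operation; attributes are attached before the body runs.
with trace_span_sketch(
    "example.operation", "projects/_/buckets/my-bucket", "us-central1"
) as span:
    pass  # the wrapped read/write call would go here
```

Keeping attribute injection and error recording in one context manager means each wrapped method only needs a single `with` statement.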

2. Bounded LRU Metadata Cache (_lru_cache.py, _bucket_metadata_cache.py)

  • LRU Capacity Bounding: LRUCache is built on an OrderedDict, giving O(1) operations and a strict capacity bound so the cache cannot grow without limit in long-running applications.
  • Concurrent Singleflight Warming: BucketMetadataCache stores bucket locations and project numbers. On a cache miss, it spawns a background thread (_fetch_background) and tracks in-flight fetches (_inflight_fetches) so that concurrent misses trigger a single request, preventing a thundering herd against the server.
  • Fallback Annotations on 403: When GCS returns 403 Forbidden, the cache permanently registers fallback annotations (projects/_/buckets/{name}) so that subsequent API calls do not retry the metadata fetch.
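
A bounded LRU cache on top of OrderedDict might look like the sketch below. It is a minimal illustration under assumptions, not the contents of _lru_cache.py: the class name `LRUCacheSketch`, its method names, the use of a lock, and the `fallback_resource_id` helper are all hypothetical. Only the OrderedDict approach and the projects/_/buckets/{name} format come from the description.

```python
import threading
from collections import OrderedDict


class LRUCacheSketch:
    """Hypothetical bounded LRU cache; least recently used entries are dropped first."""

    def __init__(self, max_size):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self._max_size = max_size
        self._data = OrderedDict()
        self._lock = threading.Lock()  # assumption: the real cache may synchronize differently

    def get(self, key, default=None):
        with self._lock:
            if key not in self._data:
                return default
            self._data.move_to_end(key)  # mark as most recently used, O(1)
            return self._data[key]

    def put(self, key, value):
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            while len(self._data) > self._max_size:
                self._data.popitem(last=False)  # evict the oldest entry, O(1)

    def evict(self, key):
        with self._lock:
            self._data.pop(key, None)

    def __len__(self):
        with self._lock:
            return len(self._data)


def fallback_resource_id(bucket_name):
    """Fallback format quoted in the description, with `_` in the project position."""
    return f"projects/_/buckets/{bucket_name}"


cache = LRUCacheSketch(max_size=2)
cache.put("bucket-a", "us-central1")
cache.put("bucket-b", "europe-west1")
cache.get("bucket-a")                 # touch bucket-a so bucket-b becomes the oldest
cache.put("bucket-c", fallback_resource_id("bucket-c"))  # evicts bucket-b
```

Caching the fallback value like any other entry is what stops a permission error from triggering a new fetch on every call.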

3. Resilient 404 Existence Eviction (_http.py, _helpers.py, bucket.py)

  • Smart Out-of-band 404 Verification: When a 404 NotFound error occurs during a media transfer or REST call, a background thread (deduplicated via _inflight_checks) calls bucket.exists() to check whether the bucket was deleted out-of-band. If exists() returns False, the bucket is evicted from the cache.
  • Instant Synchronous Eviction: A direct Bucket.delete() call evicts the bucket name from the cache synchronously, so the cache reflects the deletion immediately.
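
The two eviction paths above could be sketched roughly as follows. This is an assumption-laden illustration: `EvictionSketch`, `on_not_found`, `on_delete`, and the `bucket_exists` callable are hypothetical, and a plain dict stands in for the metadata cache. Only the `_inflight_checks` name and the exists() check come from the description.

```python
import threading


class EvictionSketch:
    """Hypothetical model of the asynchronous 404 check and the synchronous delete path."""

    def __init__(self, cache, bucket_exists):
        self._cache = cache                  # plain dict standing in for the metadata cache
        self._bucket_exists = bucket_exists  # callable standing in for bucket.exists()
        self._inflight_checks = set()        # bucket names with a check already running
        self._lock = threading.Lock()

    def on_not_found(self, bucket_name):
        """Start at most one background existence check per bucket; return the thread."""
        with self._lock:
            if bucket_name in self._inflight_checks:
                return None                  # a check is already running; do nothing
            self._inflight_checks.add(bucket_name)
        thread = threading.Thread(target=self._check, args=(bucket_name,), daemon=True)
        thread.start()
        return thread

    def _check(self, bucket_name):
        try:
            # A 404 may refer to a missing object, so evict only if the bucket is gone.
            if not self._bucket_exists(bucket_name):
                with self._lock:
                    self._cache.pop(bucket_name, None)
        finally:
            with self._lock:
                self._inflight_checks.discard(bucket_name)

    def on_delete(self, bucket_name):
        """Synchronous path: evict immediately when the SDK itself deletes the bucket."""
        with self._lock:
            self._cache.pop(bucket_name, None)
```

The existence check matters because a 404 on an object request does not imply that the bucket itself was removed.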

Extensive Testing Suite

1. 100% Sleep-Free System Tests (test_aco_observability.py)

Added a system test suite, test_aco_observability.py, that runs against a live GCS backend:

  • Sequential Priming: Verifies cache miss return times, background priming, and subsequent span enrichment.
  • 403 Fallback: Verifies minimal fallback registration on Forbidden responses.
  • Cache Stampede Protection: Simulates 15 concurrent threads on a cache miss and asserts only 1 GCS call is fired.
  • Smart 404 Eviction: Deletes a bucket out-of-band and verifies async cache clean-up on 404.
  • Synchronous Delete Eviction: Asserts immediate cache eviction on SDK deletion.
  • LRU Capacity Bounding: Populates the cache beyond its limits and verifies proper LRU eviction.
  • Deterministic Synchronization: Uses threading.Event instead of static sleeps for thread coordination, which keeps the tests fast and removes timing-dependent flakiness.
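
The stampede test and the Event-based coordination can be illustrated with a small self-contained sketch. It does not reproduce test_aco_observability.py or BucketMetadataCache: `BackgroundWarmingCache`, `wait_until_idle`, `run_stampede_demo`, and the fake `slow_fetch` are hypothetical, and no GCS call is made. The thread count of 15 and the `_inflight_fetches` and `_fetch_background` names follow the description.

```python
import threading


class BackgroundWarmingCache:
    """Hypothetical singleflight cache: one background fetch per missing key."""

    def __init__(self, fetch):
        self._fetch = fetch               # callable doing the slow metadata lookup
        self._lock = threading.Lock()
        self._values = {}
        self._inflight_fetches = {}       # key -> Event set when its fetch ends

    def lookup(self, name):
        """Return the cached value, or None on a miss; never waits on the fetch."""
        with self._lock:
            if name in self._values:
                return self._values[name]
            if name not in self._inflight_fetches:  # first miss starts the only fetch
                self._inflight_fetches[name] = threading.Event()
                threading.Thread(
                    target=self._fetch_background, args=(name,), daemon=True
                ).start()
        return None

    def _fetch_background(self, name):
        value = None
        try:
            value = self._fetch(name)
        finally:
            with self._lock:
                if value is not None:
                    self._values[name] = value
                done = self._inflight_fetches.pop(name)
            done.set()

    def wait_until_idle(self, name, timeout=5.0):
        """Block until any in-flight fetch for `name` has finished."""
        with self._lock:
            done = self._inflight_fetches.get(name)
        return True if done is None else done.wait(timeout)


def run_stampede_demo(num_threads=15):
    calls = []
    started = threading.Event()   # set once the fetch has begun
    release = threading.Event()   # held closed so the fetch stays in flight

    def slow_fetch(name):
        calls.append(name)
        started.set()
        release.wait(timeout=5)
        return "us-central1"

    cache = BackgroundWarmingCache(slow_fetch)
    threads = [
        threading.Thread(target=cache.lookup, args=("my-bucket",))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    started.wait(timeout=5)
    calls_during_miss = len(calls)  # counted while the fetch is still blocked
    release.set()
    cache.wait_until_idle("my-bucket")
    return calls_during_miss, cache.lookup("my-bucket")
```

Because every wait is on an Event rather than a timed sleep, the assertion on the call count does not depend on thread scheduling.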

2. Robust Unit Tests

  • Added test__lru_cache.py (LRU correctness, bounding, eviction).
  • Added test__bucket_metadata_cache.py (concurrency, location resolution, 403 fallback, singleflight).
  • Added test_delete_hit_evicts_from_cache inside test_bucket.py.

Validation Results

All checks, unit tests, and live GCS system tests pass:

  • Unit Tests: 835 passed in 17.82s
  • System Tests: 8 passed in 26.94s
  • Format & Linter: 100% clean (black / flake8)

@chandra-siri chandra-siri requested a review from a team as a code owner May 15, 2026 09:02
@chandra-siri chandra-siri force-pushed the feat/gcs-aco-tracing branch from 56e225f to cab26e6 Compare May 15, 2026 09:05
@chandra-siri chandra-siri marked this pull request as draft May 15, 2026 09:11

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces an in-memory LRU cache for GCS bucket metadata to support App-centric Observability (ACO) within OpenTelemetry tracing. Key changes include the implementation of a thread-safe BucketMetadataCache with background fetching, updates to the Client class to support context management and cache lifecycle, and the re-enabling of system tests in Cloud Build. Feedback identifies a potential StopIteration error in the cache logic if max_size is zero and suggests pre-compiling regular expressions in the tracing module to improve performance.
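
The two points in this feedback can be illustrated generically. The sketch below does not reproduce the reviewed code: it assumes, for illustration only, that eviction reads the oldest key with `next(iter(...))`, and the regular expression, `bucket_from_path`, and `put_bounded` are hypothetical.

```python
import re
from collections import OrderedDict

# Compiled once at import time instead of on every call (hypothetical pattern).
_BUCKET_PATH = re.compile(r"/b/([^/?]+)")


def bucket_from_path(path):
    """Extract the bucket name from a JSON API style path, or return None."""
    match = _BUCKET_PATH.search(path)
    return match.group(1) if match else None


def oldest_key_unsafe(data):
    """Raises StopIteration when `data` is empty, e.g. if max_size is zero."""
    return next(iter(data))


def put_bounded(data, key, value, max_size):
    """Insert with a capacity bound; a non-positive max_size disables caching."""
    if max_size <= 0:
        return
    data[key] = value
    data.move_to_end(key)
    while len(data) > max_size:
        data.popitem(last=False)  # drop the oldest entry


entries = OrderedDict()
put_bounded(entries, "my-bucket", "us-central1", max_size=0)  # stays empty, no error
```

Guarding the zero-capacity case up front avoids ever asking an empty mapping for its oldest key.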

Comment thread packages/google-cloud-storage/google/cloud/storage/_bucket_metadata_cache.py Outdated
Comment thread packages/google-cloud-storage/google/cloud/storage/_opentelemetry_tracing.py Outdated
@chandra-siri chandra-siri force-pushed the feat/gcs-aco-tracing branch 3 times, most recently from 5544ee4 to d037656 Compare May 15, 2026 14:08
…y tracing and lockless bucket metadata caching
@chandra-siri chandra-siri force-pushed the feat/gcs-aco-tracing branch from d037656 to d0f4980 Compare May 15, 2026 14:21
…n behind OpenTelemetry enablement

Ensure that BucketMetadataCache lookups and asynchronous cache eviction threads only execute when OpenTelemetry tracing is installed and enabled. Also safely access _extra_headers on Client objects.

TAG=agy
CONV=c671fa00-7189-45b9-a5af-12f4c7a7c486
…oiding background metadata fetches on reload and exists
…ation, split eviction tests, fail on OTel missing
…, add dedicated synchronous cache warming tests
…ket.reload, add unit tests and pickle check script