Skip to content

Retry transient network errors in background token monitor#4281

Open
reyortiz3 wants to merge 17 commits intomainfrom
retry-transient-token-refresh-errors
Open

Retry transient network errors in background token monitor#4281
reyortiz3 wants to merge 17 commits intomainfrom
retry-transient-token-refresh-errors

Conversation

@reyortiz3
Copy link
Contributor

@reyortiz3 reyortiz3 commented Mar 19, 2026

Summary

  • OAuth token refresh fails when WARP VPN disconnects overnight — the refresh endpoint becomes temporarily unreachable from the home ISP IP, causing the workload to be immediately marked as unauthenticated and requiring manual intervention.
  • Adds isTransientNetworkError to distinguish DNS failures, TCP-level errors, and timeouts (transient) from OAuth 4xx / invalid_grant responses (non-transient). The background monitor now retries transient failures with exponential backoff (30 s initial → 5 min max, no deadline) until the network recovers. Context cancellation exits the loop cleanly without marking the workload as unauthenticated. Non-transient errors still fail immediately as before.

Type of change

  • Bug fix

Test plan

  • Unit tests (task test)

Changes

File Change
pkg/auth/monitored_token_source.go Add isTransientNetworkError, exponential backoff retry loop in onTick, injectable newRetryBackOff for tests
pkg/auth/monitored_token_source_test.go Tests for isTransientNetworkError and the three retry scenarios: recovery, context cancellation, transient→non-transient escalation
go.mod / go.sum Add github.com/cenkalti/backoff/v5 (direct dep), bump otel exporters and grpc-gateway (incidental go mod tidy updates)

Does this introduce a user-facing change?

No — the workload stays authenticated through VPN reconnects instead of going unauthenticated overnight. No CLI or config changes.

Special notes for reviewers

  • MaxElapsedTime is intentionally left at zero (no deadline). The retry loop runs until the context is cancelled or the network recovers. The motivation: we don't know how long a VPN disconnect will last, and a hard cutoff would reproduce the original problem.
  • The newRetryBackOff field is a test seam; it is nil in production and replaced with a fast backoff in tests to avoid multi-second sleeps.

Generated with Claude Code

When a VPN disconnects overnight the OAuth token refresh endpoint becomes
temporarily unreachable. Previously the background monitor treated any
refresh failure as fatal and immediately marked the workload as
unauthenticated, requiring manual intervention.

This change distinguishes transient network errors (DNS failures,
TCP-level errors, timeouts) from real auth failures (OAuth 4xx,
invalid_grant) and retries the former with exponential backoff (30 s
initial, 5 min max, no deadline) until the network recovers. Context
cancellation stops the loop cleanly without marking the workload as
unauthenticated. Non-transient errors still fail immediately.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the size/M Medium PR: 300-599 lines changed label Mar 19, 2026
@codecov
Copy link

codecov bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 56.70103% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.95%. Comparing base (04ae197) to head (2b4364d).

Files with missing lines Patch % Lines
pkg/auth/monitored_token_source.go 56.70% 34 Missing and 8 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4281      +/-   ##
==========================================
+ Coverage   68.77%   68.95%   +0.17%     
==========================================
  Files         473      473              
  Lines       47917    48001      +84     
==========================================
+ Hits        32956    33098     +142     
+ Misses      12300    12292       -8     
+ Partials     2661     2611      -50     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 19, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 19, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
@github-actions github-actions bot added the size/M Medium PR: 300-599 lines changed label Mar 20, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
jhrozek
jhrozek previously approved these changes Mar 20, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
jhrozek
jhrozek previously approved these changes Mar 20, 2026
@reyortiz3 reyortiz3 requested a review from amirejaz March 20, 2026 17:33

b := mts.getRetryBackOff()

tok, retryErr := backoff.Retry(mts.monitoringCtx, func() (*oauth2.Token, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This retry logic is fully synchronous and per-request, which has important implications during transient failures:

  • backoff.Retry(...) does not spawn a goroutine; it blocks the caller.
  • As written, every incoming request during a transient outage will enter its own retry loop, each with its own backoff instance.

Implications

  • Multiple concurrent callers will repeatedly call mts.tokenSource.Token() independently.
  • There is no coordination or deduplication of token refresh attempts.
  • When the network recovers, this can lead to a burst of simultaneous retries (“thundering herd”) rather than a single shared recovery.
  • All callers remain blocked until retry succeeds, context is canceled, or backoff stops.

Add WithMaxTries and WithMaxElapsedTime to the Token() retry loop so
retries are capped at 5 attempts and 5 minutes rather than running until
context cancellation or the implicit 15-minute Retry default.

Also expose both limits via environment variables and tighten the default
initial/max intervals (10s/2m instead of 30s/5m) to recover faster after
short network blips.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
Parse with strconv.IntSize instead of 64 so ParseUint rejects values
that would overflow uint on 32-bit platforms, resolving the CodeQL
high-severity finding.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Medium PR: 300-599 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants