Retry transient network errors in background token monitor#4281
Open
Retry transient network errors in background token monitor#4281
Conversation
When a VPN disconnects overnight the OAuth token refresh endpoint becomes temporarily unreachable. Previously the background monitor treated any refresh failure as fatal and immediately marked the workload as unauthenticated, requiring manual intervention. This change distinguishes transient network errors (DNS failures, TCP-level errors, timeouts) from real auth failures (OAuth 4xx, invalid_grant) and retries the former with exponential backoff (30 s initial, 5 min max, no deadline) until the network recovers. Context cancellation stops the loop cleanly without marking the workload as unauthenticated. Non-transient errors still fail immediately. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
reyortiz3
commented
Mar 19, 2026
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4281 +/- ##
==========================================
+ Coverage 68.77% 68.95% +0.17%
==========================================
Files 473 473
Lines 47917 48001 +84
==========================================
+ Hits 32956 33098 +142
+ Misses 12300 12292 -8
+ Partials 2661 2611 -50 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
jhrozek
previously approved these changes
Mar 20, 2026
amirejaz
requested changes
Mar 20, 2026
jhrozek
previously approved these changes
Mar 20, 2026
amirejaz
reviewed
Mar 20, 2026
|
|
||
| b := mts.getRetryBackOff() | ||
|
|
||
| tok, retryErr := backoff.Retry(mts.monitoringCtx, func() (*oauth2.Token, error) { |
Contributor
There was a problem hiding this comment.
This retry logic is fully synchronous and per-request, which has important implications during transient failures:
backoff.Retry(...)does not spawn a goroutine; it blocks the caller.- As written, every incoming request during a transient outage will enter its own retry loop, each with its own backoff instance.
Implications
- Multiple concurrent callers will repeatedly call
mts.tokenSource.Token()independently. - There is no coordination or deduplication of token refresh attempts.
- When the network recovers, this can lead to a burst of simultaneous retries (“thundering herd”) rather than a single shared recovery.
- All callers remain blocked until retry succeeds, context is canceled, or backoff stops.
Add WithMaxTries and WithMaxElapsedTime to the Token() retry loop so retries are capped at 5 attempts and 5 minutes rather than running until context cancellation or the implicit 15-minute Retry default. Also expose both limits via environment variables and tighten the default initial/max intervals (10s/2m instead of 30s/5m) to recover faster after short network blips. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parse with strconv.IntSize instead of 64 so ParseUint rejects values that would overflow uint on 32-bit platforms, resolving the CodeQL high-severity finding. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
isTransientNetworkErrorto distinguish DNS failures, TCP-level errors, and timeouts (transient) from OAuth 4xx /invalid_grantresponses (non-transient). The background monitor now retries transient failures with exponential backoff (30 s initial → 5 min max, no deadline) until the network recovers. Context cancellation exits the loop cleanly without marking the workload as unauthenticated. Non-transient errors still fail immediately as before.Type of change
Test plan
task test)Changes
pkg/auth/monitored_token_source.goisTransientNetworkError, exponential backoff retry loop inonTick, injectablenewRetryBackOfffor testspkg/auth/monitored_token_source_test.goisTransientNetworkErrorand the three retry scenarios: recovery, context cancellation, transient→non-transient escalationgo.mod/go.sumgithub.com/cenkalti/backoff/v5(direct dep), bump otel exporters and grpc-gateway (incidentalgo mod tidyupdates)Does this introduce a user-facing change?
No — the workload stays authenticated through VPN reconnects instead of going unauthenticated overnight. No CLI or config changes.
Special notes for reviewers
MaxElapsedTimeis intentionally left at zero (no deadline). The retry loop runs until the context is cancelled or the network recovers. The motivation: we don't know how long a VPN disconnect will last, and a hard cutoff would reproduce the original problem.newRetryBackOfffield is a test seam; it isnilin production and replaced with a fast backoff in tests to avoid multi-second sleeps.Generated with Claude Code