Skip to content

fix: allow app to start when read replica is unreachable#3136

Open
danoswaltCL wants to merge 4 commits into
release/6.4from
bugfix/read-replica-config
Open

fix: allow app to start when read replica is unreachable#3136
danoswaltCL wants to merge 4 commits into
release/6.4from
bugfix/read-replica-config

Conversation

@danoswaltCL
Copy link
Copy Markdown
Collaborator

when read-replica env var is not empty and has a bad value and can't connect, the app doesn't start. this has caused issues in staging with name changes; this should be a rare occurrence but it could still definitely happen in prod some day, so we can allow the app to start and log in background. any calls that would normally use read-replica would fail (instead of re-routing to main db, which may be undesirable).

- Replace Promise.all with Promise.allSettled in typeormLoader so a
  replica connection failure no longer prevents the app from booting.
  Main DB failure still throws (critical). Replica failure is logged
  and the app continues; affected analytics/export requests will error
  at query time rather than at startup.
- Resolve two TODO comments by wiring up UpgradeLogger (winstonLoader
  runs before typeormLoader so Winston is always ready).
- Add 22 unit tests covering all key scenarios: both succeed, replica
  only fails, main only fails (DB_UNREACHABLE), migration fails
  (MIGRATION_ERROR), both fail, and edge cases (undefined settings,
  synchronize mode, ECS mode).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the backend TypeORM bootstrap logic so a failure to connect to the read-replica/export DataSource no longer prevents the application from starting, and adds unit tests to validate the new startup behavior.

Changes:

  • Make replica/export DataSource initialization non-fatal by logging failures instead of aborting boot.
  • Add structured error logging in typeormLoader for DB startup failures.
  • Add unit tests covering main DB failures, replica failures, and migration failure classification.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
packages/backend/src/loaders/typeormLoader.ts Switches replica init handling to be optional (log on failure) and adds loader-level logging/error classification.
packages/backend/test/unit/loaders/typeormLoader.test.ts Adds focused unit tests for startup behavior across main DB vs replica failures and migration errors.
Comments suppressed due to low confidence (1)

packages/backend/src/loaders/typeormLoader.ts:115

  • The catch block logs { message: 'Database connection failed' } for all failures, including migration errors (code === '42P07') and any other runtime errors thrown in the loader. This makes the log message misleading and harder to triage. Consider logging a message that matches the classified error (e.g., "Database migration failed" for 42P07) or a more general "Database initialization failed".
    const error = err as any;
    log.error({ message: 'Database connection failed', error });
    if (error.code === 'ECONNREFUSED') {
      error.type = SERVER_ERROR.DB_UNREACHABLE;
      throw error;
    } else if (error.code === '42P07') {
      error.type = SERVER_ERROR.MIGRATION_ERROR;
      throw error;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/backend/src/loaders/typeormLoader.ts
Comment on lines +77 to +93
const [mainResult, replicaResult] = await Promise.allSettled([
appDataSourceInstance.initialize(),
exportDataSourceInstance.initialize(),
]);

// Main DB is required — rethrow so the outer catch can classify and re-throw.
if (mainResult.status === 'rejected') {
throw mainResult.reason;
}

// Read replica is optional — log the failure but let the app start.
// Requests that attempt to use the replica connection will receive an error
// at query time instead of preventing the whole app from booting.
if (replicaResult.status === 'rejected') {
const replicaErr = replicaResult.reason as any;
log.error({ message: 'Read replica connection failed — continuing without replica', error: replicaErr });
}
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bcb37 i think this suggestion makes sense, but I'm not sure how to actually test it, or if it is worth fussing over to try to test. worst case scenario is a slight delay in startup. what do you think.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing this locally it seems to wait 75 seconds before initialize() gives up, but research turned up that that's likely an OS level default, which could be very different in deployed environments. That said, it's probably unlikely that the discrepancy between the main and replica connections being established will be very long, given that they are being created in parallel. So I think it makes sense to follow the suggestion and not await the replica connection. Avoiding potentially minutes of downtime is probably worth the risk of a possible slight lag in the availability of the read replica whenever the connections are initially established.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more note: we can configure the db connection timeout in the typeorm DataSource options, but I'd want to determine what a reasonable value might be somehow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants