Skip to content

FEAT: token-expiry capture and identity-aware pool key#660

Open
jahnvi480 wants to merge 11 commits into
mainfrom
jahnvi/identity-aware-pooling
Open

FEAT: token-expiry capture and identity-aware pool key#660
jahnvi480 wants to merge 11 commits into
mainfrom
jahnvi/identity-aware-pooling

Conversation

@jahnvi480

@jahnvi480 jahnvi480 commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Work Item / Issue Reference

AB#46159
AB#46160

GitHub Issue: #651
GitHub Issue: #659


Summary

Connection pooling currently keys only on the sanitized connection string. When two callers target the same server with different Entra (Azure AD) identities, they collide onto the same pool and can be handed a connection authenticated as someone else. This PR makes pooling identity-aware so connections are isolated per identity, and adds lazy (deferred) token acquisition so a token is minted only when a physical connection is actually opened — never on a pool hit.

It also handles the token lifecycle inside the pool: near-expiry connections are refreshed on checkout, and pools belonging to identities that are no longer active are reclaimed.

The problem

  • Identity mix-up (security): pool key = connection string ⇒ different Entra users on the same server share one pool. A pooled connection opened by user A could be handed to user B.
  • Wasted token acquisition (perf): a token was acquired on every connect, even when the connect was going to be satisfied by an existing pooled connection.
  • Stale tokens in the pool: a long-lived pooled connection could be reused after its access token had effectively expired.

The approach

1. Identity-aware pool key. The pool key becomes connStr for non-token auth (unchanged), or connStr + "\x00" + <security-context> when a token is involved. compute_identity_key() (auth.py) derives the discriminator per auth type:

Auth type Pool key discriminator Notes
SQL / Trusted / Service Principal / Windows Interactive none → bare connStr Already isolated by the connection string (SP is handled natively by msodbcsql; UID/PWD stay in the connection string). Fully backward compatible.
Managed Identity (MSI) msi:<client_id> / msi:system Derived from connection params without acquiring a token, so a pool hit skips token acquisition entirely.
Interactive / Device-code acct:<home_account_id> Keyed on the stable signed-in account, so a silent refresh reuses the pool while a different account gets its own. Falls back to the token hash if the account id is unavailable.
DefaultAzureCredential / raw token tok:<sha256(token_struct)> Fail-safe hash when identity isn't derivable from params.

Design invariant: when a token is present, the pool key is never the bare connection string. The fail-safe is always the token hash.

2. Lazy token acquisition (#659). connection.py builds a _token_factory closure and passes it to the native layer. The factory is invoked only when a physical connection is opened (pool miss / non-pooled). Same-identity pool hits acquire no token.

3. Silent-first interactive auth. For interactive / device-code, _acquire_token attempts a silent get_token() from the cached account first and only falls back to an interactive authenticate() prompt when the SDK raises AuthenticationRequiredError. The resolved home_account_id is cached so subsequent acquisitions stay silent — pooled reconnects refresh without re-prompting.

4. Expiry-aware checkout. Token expiry is captured (TokenInfo, setTokenExpiry) and checked on checkout. When a pooled connection is within the threshold (<= 5 min), the factory mints a fresh token and compares it to the one the connection holds: same token ⇒ refresh recorded expiry and reuse; rotated token ⇒ discard and reopen with the freshly minted token (no second factory call). Sibling connections holding the now-stale token are drained in the same pass.

5. Lazy eviction of idle identity pools. Single-use identities would otherwise leave their pools alive forever. canEvict() now evaluates the idle timeout, and acquireConnection sweeps aged-out pools. The sweep is throttled to once per idle-timeout window to keep the hot connect path cheap under many-identity workloads, and evicted pools are closed outside the manager mutex to avoid a mutex/GIL deadlock.

Backward compatibility

  • Non-token auth (SQL, trusted, Service Principal, Windows Interactive) keeps the bare connection-string key — no behavior change.
  • The native token factory accepts both the new (attrs, expires_on) tuple contract and the legacy bare-dict contract.
  • This is internal pooling behavior; no public API changes.

Documented limitations (README)

  • Interactive / device-code in multi-user processes share one signed-in account per process (the first to authenticate). Multi-user apps must supply a per-user token provider.
  • Bring-your-own token (token_provider / raw token): token lifetime is the application's responsibility; the driver's expiry-aware checkout only applies to tokens it mints itself.
  • DefaultAzureCredential is not recommended for multi-user pooling (it resolves to a single ambient identity); a one-time warning is emitted.

Testing

  • tests/test_008_auth.py (unit, mocked azure-identity): identity-key computation for every auth type, silent-first regression/branch coverage (silent success never prompts; AuthenticationRequiredError triggers authenticate() exactly once; subsequent acquisitions stay silent), token-expiry capture, lazy token-factory edge cases, cross-identity isolation.
  • tests/test_009_pooling.py (integration, live DB — skips without one): native lazy token factory, and a regression test proving an idle identity pool is evicted by a later acquire on a different key.

Out of scope / follow-ups

  • MSI client-id vs. object-id alignment (msodbcsql uses object id; most drivers use client id) is a tracked breaking-change discussion, not addressed here.

)

Foundation for identity-aware connection pooling. No behavior change; all additions are dormant until the connection/native layers consume them.

- Capture token expiry: _acquire_token returns (struct, raw, expires_on); add TokenInfo, AADAuth.get_token_info, module-level get_auth_token_info. Public get_token/get_raw_token/get_token_struct signatures unchanged.

- compute_identity_key(): per-identity pool-key discriminator. None for non-token auth (bare-connStr key, backward compatible); msi:<client_id>/msi:system derived without a token (enables #659 skip-on-hit); tok:<sha256> fail-safe hash otherwise.

- is_token_near_expiry(): 5-min single-threshold check; None expiry treated as near (fail safe).

- Tests: TestTokenExpiryCapture, TestComputeIdentityKey, TestIsTokenNearExpiry (110 auth tests pass).
@github-actions github-actions Bot added the pr-size: medium Moderate update size label Jul 3, 2026
@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown

📊 Code Coverage Report

🔥 Diff Coverage

75%


🎯 Overall Coverage

80%


📈 Total Lines Covered: 6949 out of 8582
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/auth.py (100%)
  • mssql_python/connection.py (100%)
  • mssql_python/pybind/connection/connection.cpp (97.7%): Missing lines 574
  • mssql_python/pybind/connection/connection_pool.cpp (52.1%): Missing lines 24-32,102-103,110,127-128,143-156,267-268,272-275,277-279,287-296,349-358,370-375,407-408,438-441

Summary

  • Total: 284 lines
  • Missing: 70 lines
  • Coverage: 75%

mssql_python/pybind/connection/connection.cpp

Lines 570-578

  570             _conn->connect(connect_attrs);
  571             _conn->setTokenExpiry(expiry);
  572         } else {
  573             _conn->connect(attrsBefore);
! 574         }
  575     }
  576 }
  577 
  578 ConnectionHandle::~ConnectionHandle() {

mssql_python/pybind/connection/connection_pool.cpp

Lines 20-36

  20 // hold the GIL. Keys in the dict are attribute ids (ints). SQL_COPT_SS_ACCESS_TOKEN
  21 // is defined once in connection.h (shared with connection.cpp).
  22 static std::string extractAccessToken(const py::dict& attrs) {
  23     for (auto item : attrs) {
! 24         if (py::isinstance<py::int_>(item.first) &&
! 25             item.first.cast<long>() == SQL_COPT_SS_ACCESS_TOKEN) {
! 26             try {
! 27                 return item.second.cast<std::string>();
! 28             } catch (const py::cast_error&) {
! 29                 return std::string();
! 30             }
! 31         }
! 32     }
  33     return std::string();
  34 }
  35 
  36 ConnectionPool::ConnectionPool(size_t max_size, int idle_timeout_secs)

Lines 98-107

   98                 // Pool is full — throw immediately. Another thread may be
   99                 // validating a popped candidate outside the mutex right now, so
  100                 // a transient "pool full" is an acceptable trade-off that
  101                 // callers can retry.
! 102                 throw std::runtime_error(
! 103                     "ConnectionPool::acquire: pool size limit reached");
  104             }
  105             candidate = _pool.front();
  106             _pool.pop_front();
  107         }

Lines 106-114

  106             _pool.pop_front();
  107         }
  108 
  109         // Validate the candidate outside the mutex.
! 110         bool reuse_candidate = false;
  111         try {
  112             if (token_factory && !token_factory.is_none() &&
  113                 candidate->isTokenNearExpiry(TOKEN_EXPIRY_THRESHOLD_SECS)) {
  114                 // Expiry-aware checkout with token compare: the pooled token is

Lines 123-132

  123                 py::dict fresh_attrs =
  124                     Connection::invokeTokenFactory(token_factory, fresh_expiry);
  125                 std::string fresh_token = extractAccessToken(fresh_attrs);
  126                 if (!fresh_token.empty() && fresh_token == candidate->currentAccessToken()) {
! 127                     candidate->setTokenExpiry(fresh_expiry);
! 128                     reuse_candidate = candidate->isAlive() && candidate->reset();
  129                 } else {
  130                     // Token rotated — remember the fresh token to reopen with,
  131                     // and eagerly drain the sibling idle connections that still
  132                     // hold the now-stale token. They were all minted from the

Lines 139-160

  139                     pending_expiry = fresh_expiry;
  140                     have_pending_token = true;
  141                     const std::string stale_token = candidate->currentAccessToken();
  142                     if (!stale_token.empty()) {
! 143                         std::lock_guard<std::mutex> lock(_mutex);
! 144                         _pool.erase(
! 145                             std::remove_if(
! 146                                 _pool.begin(), _pool.end(),
! 147                                 [&](const std::shared_ptr<Connection>& sibling) {
! 148                                     if (sibling->currentAccessToken() == stale_token) {
! 149                                         to_disconnect.push_back(sibling);
! 150                                         if (_current_size > 0) --_current_size;
! 151                                         return true;
! 152                                     }
! 153                                     return false;
! 154                                 }),
! 155                             _pool.end());
! 156                     }
  157                 }
  158             } else {
  159                 reuse_candidate = candidate->isAlive() && candidate->reset();
  160             }

Lines 263-283

  263             --_current_size;
  264     }
  265 }
  266 
! 267 bool ConnectionPool::canEvict() {
! 268     std::lock_guard<std::mutex> lock(_mutex);
  269     // Never evict while any connection is checked out or in-flight. Reserved
  270     // capacity (_current_size) beyond what is sitting idle in _pool means a
  271     // caller still holds one, so the pool must stay.
! 272     size_t checked_out = (_current_size > _pool.size()) ? (_current_size - _pool.size()) : 0;
! 273     if (checked_out > 0) {
! 274         return false;
! 275     }
  276     // Nothing checked out and the pool is empty: safe to drop immediately.
! 277     if (_pool.empty()) {
! 278         return true;
! 279     }
  280     // Nothing checked out but idle connections remain. Evict the whole pool
  281     // only once EVERY pooled connection has been idle longer than the idle
  282     // timeout. This is what reclaims pools for rotating / single-use identities
  283     // (e.g. per-request Entra users keyed by token hash): such a pool is never

Lines 283-300

  283     // (e.g. per-request Entra users keyed by token hash): such a pool is never
  284     // acquired again, so its idle connection is never pruned by acquire() and
  285     // _current_size would otherwise stay > 0 forever. Evaluating the idle
  286     // timeout here lets the next acquireConnection() on any key sweep it away.
! 287     auto now = std::chrono::steady_clock::now();
! 288     for (const auto& conn : _pool) {
! 289         auto idle_time =
! 290             std::chrono::duration_cast<std::chrono::seconds>(now - conn->lastUsed()).count();
! 291         if (idle_time <= _idle_timeout_secs) {
! 292             return false;
! 293         }
! 294     }
! 295     return true;
! 296 }
  297 
  298 void ConnectionPool::close() {
  299     std::vector<std::shared_ptr<Connection>> to_close;
  300     {

Lines 345-362

  345         // lookup, keeping the hot path cheap under a many-identity connect load.
  346         auto now = std::chrono::steady_clock::now();
  347         auto sweep_interval = std::chrono::seconds(std::max(1, _default_idle_secs));
  348         if (now - _last_sweep >= sweep_interval) {
! 349             _last_sweep = now;
! 350             for (auto it = _pools.begin(); it != _pools.end();) {
! 351                 if (it->first != key && it->second && it->second->canEvict()) {
! 352                     evicted.push_back(it->second);
! 353                     it = _pools.erase(it);
! 354                 } else {
! 355                     ++it;
! 356                 }
! 357             }
! 358         }
  359         auto& pool_ref = _pools[key];
  360         if (!pool_ref) {
  361             LOG("Creating new connection pool");
  362             pool_ref = std::make_shared<ConnectionPool>(_default_max_size, _default_idle_secs);

Lines 366-379

  366     // Close evicted pools outside _manager_mutex: close() disconnects ODBC
  367     // handles (releasing the GIL), which must never run while holding
  368     // _manager_mutex or we risk a mutex/GIL lock-ordering deadlock.
  369     for (auto& evicted_pool : evicted) {
! 370         try {
! 371             evicted_pool->close();
! 372         } catch (const std::exception& ex) {
! 373             LOG("ConnectionPoolManager: closing evicted pool failed: %s", ex.what());
! 374         }
! 375     }
  376     // Call acquire() outside _manager_mutex.  acquire() may release the GIL
  377     // during the ODBC connect call; holding _manager_mutex across that would
  378     // create a mutex/GIL lock-ordering deadlock. connStr (not key) is used to
  379     // establish new physical connections.

Lines 403-412

  403         if (conn) {
  404             try {
  405                 conn->disconnect();
  406             } catch (const std::exception& ex) {
! 407                 LOG("ConnectionPoolManager::returnConnection: disconnect of orphaned "
! 408                     "connection failed: %s",
  409                     ex.what());
  410             }
  411         }
  412     }

Lines 434-444

  434     _pools.clear();
  435     // Nothing left to sweep; reset the throttle so a fresh pool set after this
  436     // is swept on its next acquireConnection().
  437     _last_sweep = std::chrono::steady_clock::time_point{};
! 438 }
! 439 
! 440 size_t ConnectionPoolManager::poolCount() {
! 441     std::lock_guard<std::mutex> lock(_manager_mutex);
  442     return _pools.size();
  443 }


📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 59.9%
mssql_python.pybind.connection.connection_pool.cpp: 68.8%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 76.4%
mssql_python.__init__.py: 77.3%
mssql_python.row.py: 77.6%
mssql_python.pybind.connection.connection.cpp: 79.1%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.logging.py: 85.5%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

jahnvi480 added 2 commits July 3, 2026 11:47
Pooled Entra connections were keyed only on the sanitized connection string, so two callers hitting the same server with different Entra identities could reuse each other's authenticated connection (cross-identity reuse, #651).

connection.py now computes an identity-aware pool key (connStr + NUL + security-context) whenever an access token is present: system/user-assigned MSI keyed on client_id, other token auth keyed on a SHA-256 of the token struct. Non-token auth (SQL, trusted, native ServicePrincipal, Windows Interactive) keeps the legacy bare-connStr key, so behavior is unchanged.

Native pool threads an optional pool_key through ConnectionPoolManager::acquireConnection/returnConnection and the pybind Connection init (defaulted, backward compatible); the physical connect still uses the connection string. Adds 6 TestConnectionPoolKey unit tests.
…#659)

For MSI auth the identity (client id) is known without a token, so the pool key can be computed up front. Acquisition of the access token is deferred to an internal token_factory callback that the native layer invokes only when it opens a physical connection. A same-identity pool hit therefore reuses the pooled connection and never pays for a token.

Token-dependent identities (DefaultAzureCredential / raw token / interactive / device-code) still acquire eagerly because the pool key is the token hash.

The internal token_factory is intentionally distinct from the public token_provider= credential API proposed in PR #603 (issue #577).
@github-actions github-actions Bot added pr-size: large Substantial code update and removed pr-size: medium Moderate update size labels Jul 3, 2026
jahnvi480 added 7 commits July 3, 2026 13:25
Add unit tests for auth.get_auth_token_info Windows/interactive and error paths, and is_token_near_expiry wall-clock default. Add native integration tests driving ddbc_bindings.Connection with a token_factory to exercise the C++ lazy-token branches (pool miss vs reuse, non-pooled with/without factory).
… isolation test

- Correct SP claims: ServicePrincipal is ODBC-native (creds retained in connStr), so it is already isolated and needs no dedicated key; fix design doc s3.1/s4/s12.
- Add an 'Implementation status (PR #660)' section clarifying wired vs deferred (home_account_id, expiry-aware checkout, returnConnection-on-missing, eviction).
- Clarify get_token_info docstring: expiry capture is foundation, not yet consumed on checkout.
- Add native integration test proving distinct pool keys do not share connections (and that the \x00 key separator survives str->u16string).
…re pooling

Remove unreachable acquire-timeout/condition_variable path from the native connection pool, drop the unused is_token_near_expiry helper and orphaned get_auth_token, and delete their now-redundant tests. Clean up code comments.
…pools

auth: try get_token() silently first for interactive/device-code credentials and only call authenticate() when the SDK raises AuthenticationRequiredError; cache the resolved home_account_id so subsequent acquisitions stay silent.

pool: compare a freshly minted token against a near-expiry connection's current token so a matching token reuses the connection; make canEvict() evaluate the idle timeout so pools holding only idle connections are reclaimed, and close evicted pools outside the manager mutex.

tests: add silent-first regression/branch tests (auth) and a real-DB idle identity-pool eviction test (pooling).
pool: throttle the lazy-eviction sweep to at most once per idle-timeout window instead of running on every acquireConnection, cutting global-lock contention under many-identity workloads; eagerly drain sibling idle connections holding a rotated (stale) token in one pass; consolidate the SQL_COPT_SS_ACCESS_TOKEN attribute id into connection.h as a single source of truth.

docs: document in README that bring-your-own token lifetime is the application's responsibility, that DefaultAzureCredential is not recommended for multi-user pooling, and that interactive/device-code auth shares one signed-in account per process.
Clear _last_sweep in configure() and closePools() so the first acquireConnection() after a pooling reconfigure or shutdown performs a fresh sweep instead of honoring a stale throttle interval.
@jahnvi480 jahnvi480 marked this pull request as ready for review July 3, 2026 12:56
Copilot AI review requested due to automatic review settings July 3, 2026 12:56

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens and improves connection pooling for Entra ID (Azure AD) authentication by making the native pool key identity-aware (preventing cross-identity reuse of authenticated connections), deferring token acquisition until a physical connection is actually opened, and adding token-expiry awareness plus lazy eviction of idle identity pools.

Changes:

  • Add TokenInfo (token struct + expiry + optional account id), get_auth_token_info(), and compute_identity_key() to support identity-aware pooling and expiry-aware behavior.
  • Extend the Python Connection layer to compute an identity-aware pool_key and optionally pass a lazy token_factory callback down to the native layer.
  • Update native pooling to key by pool_key, invoke token_factory only on pool misses, refresh/reopen near-expiry token connections, and lazily evict idle identity pools; add integration/diagnostic tests and document limitations.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/test_009_pooling.py Adds integration/regression tests for idle identity-pool eviction and native lazy token-factory behavior.
tests/test_008_auth.py Adds unit coverage for silent-first interactive/device-code flow, token-expiry capture, identity key computation, and pool-key wiring.
README.md Documents multi-user caveats for interactive/device-code, BYO token lifetime responsibility, and DefaultAzureCredential pooling guidance.
mssql_python/pybind/ddbc_bindings.cpp Extends the pybind Connection constructor to accept pool_key and token_factory, and exposes _pool_count() for tests.
mssql_python/pybind/connection/connection.h Introduces token expiry tracking APIs and a shared SQL_COPT_SS_ACCESS_TOKEN constant used by pooling/connection code.
mssql_python/pybind/connection/connection.cpp Implements token-factory invocation, token-expiry recording, and routes pooling acquire/return via identity-aware keys.
mssql_python/pybind/connection/connection_pool.h Extends pool acquire/manager APIs to take pool_key and token_factory; adds canEvict()/poolCount().
mssql_python/pybind/connection/connection_pool.cpp Implements lazy token acquisition, expiry-aware checkout (token compare + reopen), and throttled idle identity-pool eviction.
mssql_python/connection.py Builds identity-aware pool keys and defers token acquisition via _token_factory where possible; forwards both to native.
mssql_python/auth.py Adds TokenInfo, silent-first interactive/device-code acquisition with cached home_account_id, DefaultAzureCredential warning, and identity key computation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +13 to +16
// Refresh threshold for expiry-aware checkout: a pooled connection whose access
// token expires within this many seconds is discarded and reopened with a fresh
// token rather than handed out (<=10 min).
static constexpr int TOKEN_EXPIRY_THRESHOLD_SECS = 300;
Comment thread tests/test_008_auth.py
Comment on lines +1707 to +1717
def test_same_token_produces_stable_key(self):
k1 = compute_identity_key("default", token_struct=b"same")
k2 = compute_identity_key("default", token_struct=b"same")
assert k1 == k2

def test_none_when_token_required_but_absent(self):
# default/interactive/devicecode need the token to form the key;
# without it, signal the caller to acquire then recompute.
assert compute_identity_key("default") is None
assert compute_identity_key("interactive") is None
assert compute_identity_key("devicecode") is None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-size: large Substantial code update

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants