FEAT: Add ActiveDirectoryMSI support for bulk copy#573
Conversation
Adds Authentication=ActiveDirectoryMSI to the auth pipeline: - Zero-arg ManagedIdentityCredential() for system-assigned MSI. - ManagedIdentityCredential(client_id=UID) for user-assigned MSI, matching ODBC's convention where UID carries the identity's client_id under MSI. - Threads optional credential_kwargs through get_auth_token / get_raw_token / _acquire_token so future auth methods that need constructor args (e.g. ClientSecretCredential) can plug in via the same channel. - Cache key remains a plain string for zero-arg auth types and becomes a tuple when kwargs are present, so different client_ids get separate cached credentials. Partial fix for #534. ServicePrincipal and Password to follow as separate PRs.
There was a problem hiding this comment.
Pull request overview
Adds support for Authentication=ActiveDirectoryMSI (managed identity) token acquisition for the bulk copy (mssql-py-core) path, including user-assigned MSI via UID=<client_id>.
Changes:
- Add
AuthType.MSI(activedirectorymsi) and map it toManagedIdentityCredential. - Thread optional
credential_kwargsthrough token acquisition and update the credential-instance cache key to incorporate kwargs. - Attempt to extract MSI
client_idfrom the connection string and pass it into bulk copy token acquisition; add tests and a changelog entry.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
mssql_python/constants.py |
Adds AuthType.MSI enum value. |
mssql_python/auth.py |
Adds MSI credential support, credential kwargs plumbing, and cache key changes; adds extract_credential_kwargs. |
mssql_python/cursor.py |
Bulk copy now tries to extract MSI credential kwargs and pass them to get_raw_token. |
tests/test_008_auth.py |
Adds tests for MSI auth type parsing, credential construction, cache isolation, and connection-string processing. |
CHANGELOG.md |
Documents MSI support for bulk copy. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Connection.__init__ overwrites self.connection_str with the sanitized (UID-stripped) string returned by process_connection_string. The original implementation re-parsed self.connection_str at bulkcopy time via extract_credential_kwargs, which silently dropped the user-assigned MSI client_id and degraded to system-assigned a wrong-identity bug.MSI Changes: - process_connection_string now returns a 4-tuple including the captured credential_kwargs so callers can persist them. - Connection.__init__ stores _credential_kwargs alongside _auth_type. - cursor.bulkcopy() reads self.connection._credential_kwargs instead of re-parsing self.connection_str. - The public extract_credential_kwargs helper is removed (it only existed to support the broken re-parse path; nothing else needs it). - black --line-length=100 reformats (CI was red). Tests: - test_bulkcopy_path_preserves_user_assigned_msi_client_id: invokes cursor.bulkcopy() with a mocked mssql_py_core, patches AADAuth.get_raw_token to capture the args it receives, and asserts the captured credential_kwargs match Connection._credential_kwargs. Fails if cursor reverts to re-parsing self.connection.connection_str. - test_credential_kwargs_persisted_for_user_assigned_msi: asserts Connection.__init__ stores _credential_kwargs from the 4-tuple. - test_credential_kwargs_none_for_system_assigned_msi. - test_credential_kwargs_none_for_non_msi_auth. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bbda2a2 to
3ca2165
Compare
| identity's client_id. Returns None for system-assigned MSI. | ||
| """ | ||
| for param in parameters: | ||
| key, _, value = param.strip().partition("=") |
There was a problem hiding this comment.
Is it possible to get this from the conn str parser map? Assuming we need the conn str UID
There was a problem hiding this comment.
I am not sure what will happen if someone provides a UID={hello=world}
There was a problem hiding this comment.
thanks - missed this, parser should be used as a standard across
that handles {hello=world} correctly
Per @saurabh500's review: braced ODBC values like UID={hello=world} need the canonical parser, not naive partition('='). Without this, the helper returns '{hello=world}' verbatim and ManagedIdentityCredential rejects it. Worse, a UID containing a literal ';' would be truncated. _extract_msi_client_id now delegates to _ConnectionStringParser, which handles braces, escaped '}}' inside braces, and '=' inside braced values correctly. validate_keywords=False so the helper never raises on keys the auth flow doesn't care about. Tests: - test_msi_braced_uid_value_is_unwrapped: UID={hello=world} -> 'hello=world' - test_msi_braced_uid_with_semicolon_is_preserved: UID={abc;def;ghi} Note: process_connection_string and extract_auth_type still use naive split(';') for Authentication= detection across all Entra ID auth types. That's pre-existing and tracked separately for a parser-wide refactor. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…microsoft/mssql-python into bewithgaurav/gh534-bulkcopy-msi
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| except Exception: # noqa: BLE001 — parser raises ConnectionStringParseError on malformed input; | ||
| # absence of UID is the safe answer for credential extraction. | ||
| return None | ||
| uid = (parsed.get("uid") or "").strip() |
There was a problem hiding this comment.
I dont know if this is the right place, but we are kind of downgrading to SAMI if UAMI uid is not found.
Since we are falling back, do you think it makes sense to add a logging statement to help debug if needed?
There was a problem hiding this comment.
yes you're correct - the fallback to SAMI is expected and as per the ODBC behaviour
added logger statements for better debugging help, thanks for the suggestion!
| try: | ||
| with _credential_cache_lock: | ||
| if auth_type not in _credential_cache: | ||
| if cache_key not in _credential_cache: |
There was a problem hiding this comment.
There is no pruning of credential cache. I don't think its needed right now, but I would encourage adding a comment about it.
This case is not usually possible but in some crazy scenario where the same app is serving requests for Multi-tenants (I don't think ODBC supports multi tenant Managed Identity yet).
There was a problem hiding this comment.
added a comment to reference this just above the _credential_cache declaration
- Debug log distinguishing user-assigned vs system-assigned MSI when the user passes Authentication=ActiveDirectoryMSI. Helps diagnose which branch was taken when token acquisition fails. Logs client_id length, not value (still identity material). - Comment above _credential_cache explains the cache key shape so the unbounded growth is understood as a deliberate choice rather than an oversight. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
📊 Code Coverage Report
Diff CoverageDiff: main...HEAD, staged and unstaged changes
Summary
📋 Files Needing Attention📉 Files with overall lowest coverage (click to expand)mssql_python.pybind.build._deps.simdutf-src.src.haswell.implementation.cpp: 0.4%
mssql_python.pybind.build._deps.simdutf-src.src.implementation.cpp: 6.7%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.implementation.h: 10.4%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16_to_utf8.utf16_to_utf8.h: 25.3%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 59.7%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.internal.isadetection.h: 65.3%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 74.2%🔗 Quick Links
|
Connection.__init__ already parses the same connection string through _ConnectionStringParser via _construct_connection_string (connection.py line 253) before process_connection_string is ever called. By the time _extract_msi_client_id runs, the input is guaranteed parseable. The try/except was dead code. A real parse failure here would indicate an upstream bug and should propagate, not silently degrade user-assigned MSI to system-assigned (which is the wrong-identity failure mode this PR exists to prevent). Brings diff coverage to 100%. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| # time we get here the input is guaranteed parseable. No defensive | ||
| # try/except: a parse failure now means a real bug upstream and should | ||
| # propagate, not silently degrade user-assigned MSI to system-assigned. | ||
| parsed = _ConnectionStringParser(validate_keywords=False)._parse(connection_string) |
There was a problem hiding this comment.
Functionally correct but a tiny perf hit due to double parsing of connections string.
Perfect solution would require parsing done earlier in the connect api and passing around the map to this function. That's a bigger change. I recommend a follow up PR and tracking with a GH issue.
An adhoc fix is to maintain a hashmap of connection string and uid. But that's prone to other problems esp concurrency.
Work Item / Issue Reference
Summary
Adds
Authentication=ActiveDirectoryMSIsupport to bulk copy. Partial fix for #534.ManagedIdentityCredential()for system-assigned MSI.ManagedIdentityCredential(client_id=UID)for user-assigned MSI. Matches ODBC convention whereUIDcarries the identity'sclient_idunder MSI.credential_kwargsthroughget_auth_token/get_raw_token/_acquire_tokenso future auth methods that need constructor args plug in via the same channel.(auth_type, sorted_kwargs)when kwargs are present, so differentclient_ids get separate cached credentials.mssql-py-core(the Rust path used by bulk copy) doesn't itself acquire Entra tokens. Python pre-acquires a JWT and passes it asaccess_token. Today this works forDefault,DeviceCode, andInteractive. MSI was missing, the most common Azure-hosted-service auth method.ServicePrincipal and Password are explicitly out of scope for this PR. They need different design work and ship separately.
How user-assigned MSI survives the bulkcopy fresh-token path
Connection.__init__callsprocess_connection_string, which sanitizesself.connection_str(stripsUID=,Authentication=,PWD=). Bulkcopy needs a fresh token at copy time. Re-parsingself.connection_strat that point would lose UID and silently degrade user-assigned MSI to system-assigned (wrong-identity bug).Fix shape:
process_connection_stringreturns a 4-tuple including the capturedcredential_kwargs.Connection.__init__persists_credential_kwargsalongside_auth_type.cursor.bulkcopy()readsself.connection._credential_kwargsdirectly. No re-parse, no chance for the sanitized string to drop theclient_id.Connection string examples