PERF: Handle utf16 strings using simdutf and std::u16string#526
PERF: Handle utf16 strings using simdutf and std::u16string#526ffelixg wants to merge 21 commits into
Conversation
|
I have been evaluating the external library simdutf as a high performance replacement for utf16 -> utf8 conversions, i.e. the functions SQLWCHARToWString, WideToUTF8 and Utf8ToWString. Rather than only using it for the arrow fetch path, I have been trying to make the switch for every location where one of these three functions is used, as the applications follow similar patterns. I didn't use simdutf in every case, another way to eliminate these function calls was to use std::u16string instead of std::wstring when passing strings to/from python. I think this avoids the whole issue where wchars are defined as 32 bit on some OSes but SQLWCHARs are always 16 bit. This brings arrow performance on linux for nvarchars in line with what it should be. If the std::u16string type works as well as I hope it does (will have to see what CI says about mac), there are some more spots where it could be used, for example to replace |
There was a problem hiding this comment.
Pull request overview
This PR introduces a higher-performance and more platform-consistent UTF-16LE → UTF-8 conversion path in the pybind ODBC layer by adopting simdutf and shifting several wide-string interfaces from std::wstring to std::u16string (to avoid wchar_t width differences across OSes).
Changes:
- Add
simdutf(viafind_packageorFetchContent) and use it for UTF-16LE → UTF-8 conversions in diagnostics and data fetch paths. - Replace a number of
std::wstringusages withstd::u16stringfor connection strings, queries, and SQL_C_WCHAR parameter buffers. - Remove legacy SQLWCHAR→wstring conversion utilities that are no longer used.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
mssql_python/pybind/CMakeLists.txt |
Adds simdutf dependency resolution and links it into the ddbc_bindings module. |
mssql_python/pybind/ddbc_bindings.h |
Introduces UTF-16 helpers (dupeSqlWCharAsUtf16Le, utf16LeToUtf8Alloc) and changes ErrorInfo to store UTF-8 std::string. |
mssql_python/pybind/ddbc_bindings.cpp |
Switches parameter binding, diagnostics, query execution, and fetch conversions to UTF-16 + simdutf. |
mssql_python/pybind/connection/connection.h |
Updates connection APIs/state to store connection strings as std::u16string. |
mssql_python/pybind/connection/connection.cpp |
Uses UTF-16 connection string/query handling and returns UTF-8 error messages directly. |
mssql_python/pybind/connection/connection_pool.h |
Updates pooling key types and APIs to use std::u16string. |
mssql_python/pybind/connection/connection_pool.cpp |
Implements pooling with std::u16string connection string keys. |
mssql_python/pybind/unix_utils.h |
Removes the SQLWCHARToWString declaration. |
mssql_python/pybind/unix_utils.cpp |
Removes the SQLWCHARToWString implementation. |
Comments suppressed due to low confidence (1)
mssql_python/pybind/ddbc_bindings.cpp:583
- In the SQL_C_WCHAR non-DAE path, the bound buffer is a std::u16string but the bind call uses
data()+SQL_NTSand setsbufferLengthtosize() * sizeof(SQLWCHAR). ForSQL_NTS, BufferLength should include the null terminator, and the pointer should be a null-terminated buffer (preferc_str()). As written, drivers that validate BufferLength may treat this as truncated or read past the provided length.
Suggested fix: use sqlwcharBuffer->c_str() and set bufferLength to (sqlwcharBuffer->size() + 1) * sizeof(SQLWCHAR), or alternatively set *strLenOrIndPtr to the explicit byte length (excluding terminator) and keep BufferLength consistent.
std::u16string* sqlwcharBuffer = AllocateParamBuffer<std::u16string>(
paramBuffers, param.cast<std::u16string>());
LOG("BindParameters: param[%d] SQL_C_WCHAR - String "
"length=%zu characters, buffer=%zu bytes",
paramIndex, sqlwcharBuffer->size(), sqlwcharBuffer->size() * sizeof(SQLWCHAR));
dataPtr = sqlwcharBuffer->data();
bufferLength = sqlwcharBuffer->size() * sizeof(SQLWCHAR);
strLenOrIndPtr = AllocateParamBuffer<SQLLEN>(paramBuffers);
*strLenOrIndPtr = SQL_NTS;
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
📊 Code Coverage Report
Diff CoverageDiff: main...HEAD, staged and unstaged changes
Summary
mssql_python/pybind/ddbc_bindings.cppLines 1545-1553 1545 LOG("SQLCheckError: Checking ODBC errors - handleType=%d, retcode=%d", handleType, retcode);
1546 ErrorInfo errorInfo;
1547 if (retcode == SQL_INVALID_HANDLE) {
1548 LOG("SQLCheckError: SQL_INVALID_HANDLE detected - handle is invalid");
! 1549 errorInfo.ddbcErrorMsg = "Invalid handle!";
1550 return errorInfo;
1551 }
1552 assert(handle != 0);
1553 SQLHANDLE rawHandle = handle->get();Lines 1624-1632 1624 return records;
1625 }
1626
1627 // Wrap SQLExecDirect
! 1628 SQLRETURN SQLExecDirect_wrap(SqlHandlePtr StatementHandle, const std::u16string& Query) {
1629 LOG("SQLExecDirect: Executing query directly - statement_handle=%p, "
1630 "query_length=%zu chars",
1631 (void*)StatementHandle->get(), Query.length());
1632 if (!SQLExecDirect_ptr) {Lines 1641-1649 1641 SQLSetStmtAttr_ptr(StatementHandle->get(), SQL_ATTR_CONCURRENCY,
1642 (SQLPOINTER)SQL_CONCUR_READ_ONLY, 0);
1643 }
1644
! 1645 SQLWCHAR* queryPtr = reinterpretU16stringAsSqlWChar(Query);
1646 SQLRETURN ret;
1647 {
1648 // Release the GIL during the blocking ODBC call so that other Python
1649 // threads (e.g. asyncio event loop, heartbeat threads) can run whileLines 3759-3767 3759 sizeof(DateTimeOffset) * fetchSize,
3760 buffers.indicators[col - 1].data());
3761 break;
3762 default:
! 3763 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
3764 std::ostringstream errorString;
3765 errorString << "Unsupported data type for column - " << columnName.c_str()
3766 << ", Type - " << dataType << ", column ID - " << col;
3767 LOG("SQLBindColums: %s", errorString.str().c_str());Lines 3768-3776 3768 ThrowStdException(errorString.str());
3769 break;
3770 }
3771 if (!SQL_SUCCEEDED(ret)) {
! 3772 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
3773 std::ostringstream errorString;
3774 errorString << "Failed to bind column - " << columnName.c_str() << ", Type - "
3775 << dataType << ", column ID - " << col;
3776 LOG("SQLBindColums: %s", errorString.str().c_str());Lines 4101-4109 4101 break;
4102 }
4103 default: {
4104 const auto& columnMeta = columnNames[col - 1].cast<py::dict>();
! 4105 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
4106 std::ostringstream errorString;
4107 errorString << "Unsupported data type for column - " << columnName.c_str()
4108 << ", Type - " << dataType << ", column ID - " << col;
4109 LOG("FetchBatchData: %s", errorString.str().c_str());Lines 4203-4211 4203 case SQL_SS_TIMESTAMPOFFSET:
4204 rowSize += sizeof(DateTimeOffset);
4205 break;
4206 default:
! 4207 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
4208 std::ostringstream errorString;
4209 errorString << "Unsupported data type for column - " << columnName.c_str()
4210 << ", Type - " << dataType << ", column ID - " << col;
4211 LOG("calculateRowSize: %s", errorString.str().c_str());mssql_python/pybind/utf_utils.h📋 Files Needing Attention📉 Files with overall lowest coverage (click to expand)mssql_python.pybind.build._deps.simdutf-src.src.haswell.implementation.cpp: 0.4%
mssql_python.pybind.build._deps.simdutf-src.src.implementation.cpp: 6.7%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.implementation.h: 10.4%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16_to_utf8.utf16_to_utf8.h: 28.7%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 61.1%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.internal.isadetection.h: 65.3%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 74.5%🔗 Quick Links
|
|
Only issue seems to be that some Linux CI containers don't have git. I'm trying to fetch simdutf via url instead. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@gargsaumya The second CI error was due to the ubuntu image using an older version of cmake. Should be fine now I hope. I have also replaced the remaining std::wstring occurences with std::u16string and eliminated helper functions as well as platform specific code accordingly. Let me know if you are happy with this holistic update to utf16 string handling or if you would prefer to keep it contained to the arrow fetch path. |
|
Thanks for the update @ffelixg. I actually prefer this and it LGTM overall. ButI’ll still take a closer look at the changes and get back to you. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
I think that this latest CI failure was due to a flaky test. The same test / platform passed in a past CI run and I don't think my changes affected that test. Could you rerun that one @gargsaumya? |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
gargsaumya
left a comment
There was a problem hiding this comment.
Really appreciate the effort here, this is a very meaningful contribution @ffelixg. I've reviewed the PR and left 2 inline comments, please take a look!
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
I have a few generic comments before I delve into core review:
|
| result.push_back(static_cast<wchar_t>(UNICODE_REPLACEMENT_CHAR)); | ||
| } | ||
| } | ||
| inline std::string utf16LeToUtf8Alloc(const std::u16string& utf16) { |
There was a problem hiding this comment.
Would recommend to move utf16LeToUtf8Alloc, dupeSqlWCharAsUtf16Le, and reinterpretU16stringAsSqlWChar from ddbc_bindings.h into a small utf_utils.h so they can be unit tested and the changes don't force recompilation of the whole module.
I do not see dedicated unit tests for utf16LeToUtf8Alloc, dupeSqlWCharAsUtf16Le, or reinterpretU16stringAsSqlWChar. Edge cases (empty strings, surrogate pairs, lone surrogates) are only tested indirectly through integration tests. Once moved to utf_utils.h I would suggest to add targeted unit tests.
There was a problem hiding this comment.
By targeted unit tests, am I correct in assuming that you mean pure c++ unit tests? Testing them via python wrappers and pytest feels wrong. It would also involve changes to the build process as well as CI pipelines, since there is currently no infrastructure for c++ tests. The predecessors of these functions were also only tested via integration or even fully inlined as was the case for some reinterpretU16stringAsSqlWChar applications.
There was a problem hiding this comment.
Good point! I agree that setting up a full C++ unit test framework and CI pipeline changes is out of scope for this PR. Let's not block this PR on it, but I'd like to open a tracking issue for it in our internal backlog for future enhancements.
I see you've already added utf_utils.h, thank you! Let's keep moving with this PR. We are expecting one more review on this before we merge it.
There was a problem hiding this comment.
@gargsaumya please track this: "but I'd like to open a tracking issue for it in our internal backlog for future enhancements" in ADO.
Co-authored-by: Copilot <copilot@github.com>
|
Thanks for the review @sumitmsft. Regarding the first point, There's actually one more method that's currently being used to convert unicode aside from simdutf and I hope the added dependency is not too much of an inconvenience. I looked into native iconv as well, but didn't find it appealing in comparison. The simdutf releases look to me like they are mostly about adding new features, which we likely won't care about. Since unicode conversion itself is a solved problem, I was hoping that we would technically be fine using the same version forever, assuming no crazy bugs. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
bewithgaurav
left a comment
There was a problem hiding this comment.
one real bug (diag message overflow, stack memory leaking into exception text on long error messages, repro inline), one accidental README typo, one perf nit on the new helper. only the diag bug actually blocks, the other two can ride along or land separately.
| errorInfo.ddbcErrorMsg = SQLWCHARToWString(message, messageLen); | ||
| #endif | ||
| std::u16string sqlStateUtf16 = dupeSqlWCharAsUtf16Le(sqlState, 5); | ||
| std::u16string messageUtf16 = dupeSqlWCharAsUtf16Le(message, static_cast<size_t>(messageLen)); |
There was a problem hiding this comment.
heads up, this is a pre-existing bug on the unix path that surfaced while reviewing this PR. flagging it here since the PR is touching the same code and the fix is essentially free now.
any time SQL Server returns an error message longer than 512 chars, this branch leaks driver stack memory into the user-visible exception text. trigger is common: a stored proc RAISERROR with row context interpolated in, a syntax error where the server echoes back a long ORM-generated query, a linked-server failure with prefixed context. T-SQL's own RAISERROR ceiling is 2047 chars so the surface area is real.
minimal repro, just one stored proc call wrapped in try/except + logging.exception:
proc = """
DECLARE @msg NVARCHAR(2048) = REPLICATE(N'Order validation failed for line items in batch #4218: ', 25);
RAISERROR(@msg, 16, 1);
"""
cur.execute(proc)what lands in the log:
mssql_python.exceptions.ProgrammingError: Driver Error: ...; DDBC Error:
[Microsoft][SQL Server]Order validation failed for line items in batch #4218:
Order validation failed for line items in batch #4218: ... Order validation
# 42000
...
first ~511 chars are the real error. then 42000 is the SQLSTATE leaking through, then ~1500 chars of stack memory rendered as text. the recurring high-codepoint chars (, , ``) are aarch64 heap pointer high bits being read 16 bits at a time. user impact: log noise, anything pattern-matching on error strings breaks on long errors, and wherever exceptions get shipped (sentry / datadog / support tickets) now ingests our process memory.
the why: SQLGetDiagRec's contract is that on a too-small buffer it truncates + NULs the buffer but sets the messageLen out-param to the full untruncated length so callers can re-call with a bigger buffer. dupeSqlWCharAsUtf16Le(message, messageLen) trusts messageLen and memcpys that many chars out of the (smaller) buffer, walking off the stack. cap messageLen at the buffer size in the helper and both this site and SQLGetAllDiagRecords at L1605 are covered.
checked main: the unix path on main had this exact bug (SQLWCHARToWString(message, messageLen) reads the same way), windows path was safe (std::wstring(message), NTS). this PR collapses the platform split into one path that always uses messageLen, so windows regresses, macos/linux carry it forward. since the helper is new in this PR and both sites already touch it, makes sense to clamp here rather than spin a separate fix later.
same shape (different buffer + length pair) exists in SQLDescribeCol_wrap at L2783 with ColumnName[256] + NameLength. SQL Server's sysname=128 keeps it from firing today, but a one-line clamp inside the helper covers it for free as well.
There was a problem hiding this comment.
Oh yeah, that's not great at all! I agree that this PR is a decent place to sneak in a fix for it.
Apparently the PRINT function is the way to produce the longest possible message with 8000 characters and THROW is somewhere between RAISERROR and PRINT. Ontop of that we get the driver name in front of the message, which does not count to the SQL Server limit of 8000 characters.
Since the SQL_MAX_MESSAGE_LENGTH constant defined in sql.h appears to be wrong, I instead defined a custom SQL_MAX_MESSAGE_LENGTH_SQLSERVER which I set at 10000 characters. That means we allocate 20kb on the stack, but we're not targeting embedded or doing any recursion, so I think it's fine. An alternative would be to call SQLGetDiagRec twice to get the length first and heap allocate accordingly, but that doesn't seem necessary.
With that, there should be enough space to retrieve arbitrary errors/messages. I also added test_long_raiserror and test_long_print_message to cover the extremes.
Clamping the lengths for SQLDescribeCol / SQLGetDiagRec of course makes sense as well just in case.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Login failures and other connection-time errors raised by the C++ pybind layer were surfacing as plain RuntimeError instead of a mssql_python exception, making it impossible for callers to catch them via the DB-API 2.0 exception hierarchy. Connection::checkError() now embeds the SQLSTATE code in the thrown message (SQLSTATE:XXXXX:<msg>) so the Python layer can map it to the correct exception class via sqlstate_to_exception() -- consistent with how cursor-level errors are already handled in helpers.py. The new _raise_connection_error() helper is applied to all four connection operations that go through checkError(): connect, commit, rollback, and set_autocommit. Note: when PR #526 (simdutf) merges, the two WideToUTF8() calls in connection.cpp::checkError() will need updating to utf16LeToUtf8Alloc(). Fixes #532
|
@ffelixg - thanks for your responses - requesting you to please resolve conflicts and we'll do a last set of review |
Work Item / Issue Reference
Summary