refactor: cache hot-path lookups + fix MySQL connection-pool footguns #8524
Open
chrisburr wants to merge 8 commits into DIRACGrid:integration
fstagni
reviewed
May 4, 2026
5535f89 to
9595934
Compare
Contributor
As agreed, setting it as draft until we test the two critical commits in LHCb tomorrow morning.
Member
Author
Everything seems fine with them.
Contributor
Can you please just add the "singleton `__new__` magic" explanation we mentioned yesterday? I'll merge and release tomorrow.
importlib_metadata.entry_points() walks every installed distribution and opens each entry_points.txt. The set of installed dirac entry points cannot change during a process's lifetime, so memoise both lookups.
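The memoisation described above can be sketched as follows. This is a minimal illustration, not the code from the PR: it uses the standard-library `importlib.metadata` rather than the `importlib_metadata` backport, and both the helper name `cached_entry_points` and the group name `"dirac"` are assumptions made for the example.

```python
import functools
from importlib import metadata  # stdlib counterpart of the importlib_metadata backport


@functools.cache
def cached_entry_points(group):
    """Return the entry points registered under `group`, scanning only once.

    Hypothetical helper: the first call walks the installed distributions,
    later calls with the same group return the stored tuple.
    """
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Newer Python versions expose select() for filtering by group.
        selected = eps.select(group=group)
    else:
        # Python 3.9 returns a plain dict keyed by group name.
        selected = eps.get(group, ())
    # Freeze into a tuple so every caller shares one immutable result.
    return tuple(selected)
```

Because the result is cached for the lifetime of the process, a package installed while the process is running would not be noticed; the commit message argues that this cannot matter for the installed dirac entry points.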
traceback.format_stack() invokes linecache.checkcache for every frame on every call, which stat()s each source file on the stack. Read each file into linecache once and reuse the cached contents on subsequent calls; output is byte-identical for any file not edited mid-process.
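One way to get this behaviour, shown here only as a sketch, is to walk the frames by hand and fetch source lines with `linecache.getline`, which fills the cache on first access and does not call `linecache.checkcache` afterwards. The function name `format_stack_cached` is hypothetical, `sys._getframe` is CPython-specific, and the output format only approximates that of `traceback.format_stack()`.

```python
import linecache
import sys


def format_stack_cached(skip=0):
    """Format the caller's stack without revalidating cached source files.

    Hypothetical stand-in for traceback.format_stack(): source lines come
    from linecache.getline(), so a file already in the cache is not
    stat()ed again on later calls.
    """
    frame = sys._getframe(skip + 1)  # start at the caller of this function
    entries = []
    while frame is not None:
        code = frame.f_code
        entry = f'  File "{code.co_filename}", line {frame.f_lineno}, in {code.co_name}\n'
        source = linecache.getline(code.co_filename, frame.f_lineno, frame.f_globals).strip()
        if source:
            entry += f"    {source}\n"
        entries.append(entry)
        frame = frame.f_back
    entries.reverse()  # oldest frame first, like traceback.format_stack()
    return entries
```

The trade-off is the one the commit message states: a source file edited while the process is running keeps showing its old contents.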
loadProxyFromFile re-parses the same proxy on every DISET RPC, with SAX/M2Crypto cert parsing dominating the cost. Cache the parsed cert list and key object keyed by the file's stat metadata so that repeat loads of an unchanged proxy are free.
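A stat-keyed cache of this kind might look like the sketch below. The names `load_cached` and `_PARSED` and the caller-supplied `parse` callable are assumptions for illustration; the real change caches the parsed certificate list and key object inside `loadProxyFromFile`.

```python
import os

# Hypothetical module-level cache: path -> (fingerprint, parsed value)
_PARSED = {}


def load_cached(path, parse):
    """Parse `path` with `parse(bytes)`, reusing the result while the file is unchanged."""
    st = os.stat(path)
    # A changed mtime, size or inode means the file was rewritten or replaced.
    fingerprint = (st.st_mtime_ns, st.st_size, st.st_ino)
    cached = _PARSED.get(path)
    if cached is not None and cached[0] == fingerprint:
        return cached[1]
    with open(path, "rb") as fh:
        parsed = parse(fh.read())
    _PARSED[path] = (fingerprint, parsed)
    return parsed
```

A repeat load still costs one `os.stat()` call, which is far cheaper than re-parsing the certificate chain.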
The site/SE mapping and related lookups are derived from the configuration system, but agents construct a fresh DMSHelpers per task and re-run the heavy getSiteSEMapping / getTiers / getShortSiteNames lookups every time. Make DMSHelpers a per-VO singleton via __new__ and re-derive only when gConfigurationData.getVersion() has changed since the last initialisation so callers still see CS refreshes.
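The `__new__`-based singleton can be sketched as below. The subtle point is that Python calls `__init__` on every `Cls(...)` call, even when `__new__` returns an existing instance, so `__init__` must itself decide whether to re-derive. The class `SiteHelpers` and the `CONFIG` dict (standing in for `gConfigurationData.getVersion()`) are hypothetical, and the re-derivation step is not made thread-safe here.

```python
import threading

# Stand-in for the configuration system; the version is bumped on each refresh.
CONFIG = {"version": 0}


class SiteHelpers:
    """Hypothetical per-VO singleton that re-derives its data when CONFIG changes."""

    _instances = {}
    _lock = threading.Lock()
    derive_calls = 0  # counts how often the expensive derivation runs

    def __new__(cls, vo="default"):
        with cls._lock:
            instance = cls._instances.get(vo)
            if instance is None:
                instance = super().__new__(cls)
                instance._loaded_version = None  # marks "never initialised"
                cls._instances[vo] = instance
        return instance

    def __init__(self, vo="default"):
        # Runs on every SiteHelpers(vo) call, so skip the work if nothing changed.
        if self._loaded_version == CONFIG["version"]:
            return
        self.vo = vo
        self.mapping = self._derive()
        self._loaded_version = CONFIG["version"]

    def _derive(self):
        type(self).derive_calls += 1
        return {"vo": self.vo, "version": CONFIG["version"]}
```

Callers keep writing `SiteHelpers("lhcb")` as before; they get the shared instance, and the derivation reruns only after the configuration version changes.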
ConnectionPool issued conn.ping() — a network round-trip — on every checkout, even when the same thread had used the same connection a millisecond earlier. Track each connection's last-use timestamp and only ping after PING_IDLE_THRESHOLD seconds of idle. Spare-pool connections are still pinged on first checkout; freshly-opened connections skip ping entirely.
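The idle-threshold check can be illustrated with a small wrapper. `PING_IDLE_THRESHOLD = 5.0` is the value quoted in the PR description; the class `TrackedConnection`, its injectable `clock`, and the strict `>` comparison are assumptions for this sketch.

```python
import time

PING_IDLE_THRESHOLD = 5.0  # seconds of idle time after which a ping is issued


class TrackedConnection:
    """Hypothetical wrapper that remembers when a connection was last used."""

    def __init__(self, conn, clock=time.monotonic):
        self.conn = conn
        self._clock = clock
        # A freshly opened connection counts as just used, so it skips the ping.
        self.last_used = clock()

    def checkout(self):
        now = self._clock()
        if now - self.last_used > PING_IDLE_THRESHOLD:
            # Only pay for the network round-trip after a period of idleness.
            self.conn.ping()
        self.last_used = now
        return self.conn
```

Using a monotonic clock keeps the idle calculation immune to wall-clock adjustments.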
Skipping the per-checkout ping leaves a gap: if MySQL drops a connection inside the idle window, the bad connection stays in the per-thread cache and every subsequent query fails with the same 2006 until something evicts it. Evict in _except when the underlying MySQLdb error code indicates the link is dead (2006, 2013, 2055, 4031); the next call opens a fresh connection through the existing retry path.
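The eviction rule can be sketched independently of MySQLdb. The example assumes the error number is the first element of the exception's `args`, as it is for MySQLdb errors; `is_dead_link`, `run_with_eviction` and the plain-dict cache are hypothetical names, not the actual `_except` implementation.

```python
# Error numbers listed in the commit message as "the link is dead".
DEAD_LINK_ERRNOS = frozenset({2006, 2013, 2055, 4031})


def is_dead_link(exc):
    """Return True if the exception's first argument is a dead-link error number."""
    return (
        bool(exc.args)
        and isinstance(exc.args[0], int)
        and exc.args[0] in DEAD_LINK_ERRNOS
    )


def run_with_eviction(cache, key, operation):
    """Run operation(conn) on the cached connection; evict it if the link died.

    `cache` is a hypothetical per-thread mapping of key -> connection.
    """
    conn = cache[key]
    try:
        return operation(conn)
    except Exception as exc:
        if is_dead_link(exc):
            # Drop the dead connection so the next call opens a fresh one.
            cache.pop(key, None)
        raise
```

Other errors, such as a duplicate-key violation, leave the cached connection in place because the link itself is still healthy.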
AccountingDB.insertRecordDirectly, DirectoryTreeBase.getDirectorySize and PilotStatusAgent.execute all called .close() on a connection obtained from the per-thread pool, destroying the pooled socket while the pool's __assigned dict still cached the reference. Stock code masked this with the per-call ping that reconnected on the next checkout; once the warm ping skip was deployed the dead reference was reused inside the idle window and the next query failed with (2006, '').
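The failure mode is easy to reproduce with stand-in classes. `FakeConnection` and `PerThreadPool` below are toy models written for this illustration, not DIRAC code; they only show why closing a pool-owned connection leaves a dead reference behind once nothing pings it on checkout.

```python
class FakeConnection:
    """Toy connection that fails once closed, mimicking a destroyed socket."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def query(self, sql):
        if self.closed:
            raise RuntimeError("(2006, '')")
        return "ok"


class PerThreadPool:
    """Toy pool that caches one connection per thread id and never pings."""

    def __init__(self):
        self._assigned = {}

    def get(self, thread_id):
        if thread_id not in self._assigned:
            self._assigned[thread_id] = FakeConnection()
        return self._assigned[thread_id]


pool = PerThreadPool()
conn = pool.get("thread-1")
conn.close()  # the caller "cleans up", but the pool still holds the reference
assert pool.get("thread-1") is conn  # the next checkout reuses the dead object
```

In this model the fix is simply for callers to leave the connection alone and let the pool own its lifetime.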
9595934 to
eae701f
Compare
Two related series motivated by py-spy profiles of LHCb DIRAC services and the
incidents that surfaced after deploying the first round in production.
Hot-path caches (`refactor:` commits):
- `extensionsByPriority` / `getExtensionMetadata` get `@functools.cache`, since `importlib_metadata.entry_points()` is a per-call filesystem walk.
- `S_ERROR` stack capture replaces `traceback.format_stack()` (which `stat()`s every source file on the stack via `linecache.checkcache`) with a cached equivalent that reads each file once.
- `loadProxyFromFile` caches the parsed cert chain + key keyed by `(path, mtime_ns, size, inode)` so repeat loads of an unchanged proxy are free.
- `DMSHelpers` becomes a per-VO singleton, re-derived only when `gConfigurationData.getVersion()` has changed since last init (so callers still pick up CS refreshes).
- `ConnectionPool.get` skips `conn.ping()` for warm thread-local connections (kept under `wait_timeout` via `PING_IDLE_THRESHOLD = 5.0` s).

MySQL connection-pool fixes (`fix:` commits, prompted by the cascades the ping-skip exposed in production):
- `_except` evicts the cached connection on MySQLdb errnos that mean the link is dead (`2006`/`2013`/`2055`/`4031`), so a single dead conn cannot poison the per-thread cache for the rest of the idle window.
- `AccountingDB.insertRecordDirectly`, `DirectoryTreeBase.getDirectorySize` and `PilotStatusAgent.execute` no longer call `.close()` on the pool-owned connection. The manual close left a dead reference inside the pool that the next checkout on the same thread happily reused. Stock code masked this with the per-call ping; the warm-skip optimisation made it visible.
BEGINRELEASENOTES
*Core
CHANGE: (#8524) cache extension entry-point lookups
CHANGE: (#8524) avoid per-frame stat() in S_ERROR stack capture
CHANGE: (#8524) cache parsed proxy file by (path, mtime, size, inode)
CHANGE: (#8524) skip MySQL ping on warm thread-local connections
FIX: (#8524) drop dead MySQL connection on connection-loss errors
*DataManagementSystem
CHANGE: (#8524) share DMSHelpers instance per VO
FIX: (#8524) stop closing pooled MySQL connection in getDirectorySize
*AccountingSystem
FIX: (#8524) stop closing pooled MySQL connection in bucket insert
*WorkloadManagementSystem
FIX: (#8524) stop closing pooled MySQL connection in PilotStatusAgent
ENDRELEASENOTES