Skip to content

refactor: cache hot-path lookups + fix MySQL connection-pool footguns#8524

Open
chrisburr wants to merge 8 commits intoDIRACGrid:integrationfrom
chrisburr:cache-extensions-entry-points
Open

refactor: cache hot-path lookups + fix MySQL connection-pool footguns#8524
chrisburr wants to merge 8 commits intoDIRACGrid:integrationfrom
chrisburr:cache-extensions-entry-points

Conversation

@chrisburr
Copy link
Copy Markdown
Member

@chrisburr chrisburr commented Apr 29, 2026

Two related series motivated by py-spy profiles of LHCb DIRAC services and the
incidents that surfaced after deploying the first round in production.

Hot-path caches (refactor: commits):

  • extensionsByPriority / getExtensionMetadata get @functools.cache
    importlib_metadata.entry_points() is a per-call filesystem walk.
  • S_ERROR stack capture replaces traceback.format_stack() (which stat()s
    every source file on the stack via linecache.checkcache) with a cached
    equivalent that reads each file once.
  • loadProxyFromFile caches the parsed cert chain + key keyed by
    (path, mtime_ns, size, inode) so repeat loads of an unchanged proxy are
    free.
  • DMSHelpers becomes a per-VO singleton, re-derived only when
    gConfigurationData.getVersion() has changed since last init (so callers
    still pick up CS refreshes).
  • ConnectionPool.get skips conn.ping() for warm thread-local connections
    (kept under wait_timeout via PING_IDLE_THRESHOLD = 5.0s).

MySQL connection-pool fixes (fix: commits, prompted by the cascades the
ping-skip exposed in production):

  • _except evicts the cached connection on MySQLdb errnos that mean the link
    is dead (2006/2013/2055/4031), so a single dead conn cannot poison the
    per-thread cache for the rest of the idle window.
  • AccountingDB.insertRecordDirectly, DirectoryTreeBase.getDirectorySize
    and PilotStatusAgent.execute no longer call .close() on the pool-owned
    connection. The manual close left a dead reference inside the pool that
    the next checkout on the same thread happily reused. Stock code masked
    this with the per-call ping; the warm-skip optimisation made it visible.

BEGINRELEASENOTES

*Core

CHANGE: (#8524) cache extension entry-point lookups
CHANGE: (#8524) avoid per-frame stat() in S_ERROR stack capture
CHANGE: (#8524) cache parsed proxy file by (path, mtime, size, inode)
CHANGE: (#8524) skip MySQL ping on warm thread-local connections
FIX: (#8524) drop dead MySQL connection on connection-loss errors

*DataManagementSystem

CHANGE: (#8524) share DMSHelpers instance per VO
FIX: (#8524) stop closing pooled MySQL connection in getDirectorySize

*AccountingSystem

FIX: (#8524) stop closing pooled MySQL connection in bucket insert

*WorkloadManagementSystem

FIX: (#8524) stop closing pooled MySQL connection in PilotStatusAgent

ENDRELEASENOTES

Comment thread src/DIRAC/WorkloadManagementSystem/Agent/PilotStatusAgent.py Outdated
Comment thread src/DIRAC/DataManagementSystem/Utilities/DMSHelpers.py Outdated
@chrisburr chrisburr force-pushed the cache-extensions-entry-points branch from 5535f89 to 9595934 Compare May 4, 2026 14:43
@chrisburr chrisburr changed the title perf: cache extension entry-point lookups refactor: cache hot-path lookups + fix MySQL connection-pool footguns May 4, 2026
@chrisburr chrisburr marked this pull request as ready for review May 4, 2026 15:00
@chaen
Copy link
Copy Markdown
Contributor

chaen commented May 4, 2026

As agreed, setting it as draft until we test the 2 critical commit in LHCb tomorrow morning

@chaen chaen marked this pull request as draft May 4, 2026 21:21
@chrisburr chrisburr marked this pull request as ready for review May 5, 2026 18:57
@chrisburr
Copy link
Copy Markdown
Member Author

Everything seems fine with them

@chaen
Copy link
Copy Markdown
Contributor

chaen commented May 5, 2026

Can you please just add the "singleton new magic" explanation we mentioned yesterday. I'll merge and release tomorrow

chrisburr added 7 commits May 6, 2026 06:47
importlib_metadata.entry_points() walks every installed distribution and
opens each entry_points.txt. The set of installed dirac entry points
cannot change during a process's lifetime, so memoise both lookups.
traceback.format_stack() invokes linecache.checkcache for every frame on
every call, which stat()s each source file on the stack. Read each file
into linecache once and reuse the cached contents on subsequent calls;
output is byte-identical for any file not edited mid-process.
loadProxyFromFile re-parses the same proxy on every DISET RPC, with
SAX/M2Crypto cert parsing dominating the cost. Cache the parsed cert
list and key object keyed by the file's stat metadata so that repeat
loads of an unchanged proxy are free.
The site/SE mapping and related lookups are derived from the configuration
system, but agents construct a fresh DMSHelpers per task and re-run the
heavy getSiteSEMapping / getTiers / getShortSiteNames lookups every time.
Make DMSHelpers a per-VO singleton via __new__ and re-derive only when
gConfigurationData.getVersion() has changed since the last initialisation
so callers still see CS refreshes.
ConnectionPool issued conn.ping() — a network round-trip — on every
checkout, even when the same thread had used the same connection a
millisecond earlier. Track each connection's last-use timestamp and
only ping after PING_IDLE_THRESHOLD seconds of idle. Spare-pool
connections are still pinged on first checkout; freshly-opened
connections skip ping entirely.
Skipping the per-checkout ping leaves a gap: if MySQL drops a connection
inside the idle window, the bad connection stays in the per-thread cache
and every subsequent query fails with the same 2006 until something
evicts it. Evict in _except when the underlying MySQLdb error code
indicates the link is dead (2006, 2013, 2055, 4031); the next call
opens a fresh connection through the existing retry path.
AccountingDB.insertRecordDirectly, DirectoryTreeBase.getDirectorySize and
PilotStatusAgent.execute all called .close() on a connection obtained
from the per-thread pool, destroying the pooled socket while the pool's
__assigned dict still cached the reference. Stock code masked this with
the per-call ping that reconnected on the next checkout; once the warm
ping skip was deployed the dead reference was reused inside the idle
window and the next query failed with (2006, '').
@chrisburr chrisburr force-pushed the cache-extensions-entry-points branch from 9595934 to eae701f Compare May 6, 2026 04:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants