Skip to content

Disconnect non-consuming IPC subscribers to fix publisher memory leak (#68114)#69436

Open
dwoz wants to merge 2 commits into
saltstack:3006.xfrom
dwoz:fix/issue-68114
Open

Disconnect non-consuming IPC subscribers to fix publisher memory leak (#68114)#69436
dwoz wants to merge 2 commits into
saltstack:3006.xfrom
dwoz:fix/issue-68114

Conversation

@dwoz

@dwoz dwoz commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

IPCMessagePublisher now disconnects a subscriber that does not consume
messages within a configurable timeout (ipc_write_timeout, default 30s).
This stops the master event publisher from accumulating pending
stream.write() coroutines (and their referenced payloads) on a stalled
IPC client, which was a real source of unbounded RSS growth.

What issues does this PR fix or reference?

Fixes #68114

Previous Behavior

IPCMessagePublisher.publish() did
self.io_loop.spawn_callback(self._write, stream, pack) for every
subscriber on every published message. _write then awaited
stream.write(pack) with no timeout. A subscriber that stopped reading
caused every following publish to queue another _write coroutine that
awaited a write that never completed; the pending writes and their
payloads piled up in the ioloop and the publisher's RSS grew without
bound.

New Behavior

_write wraps stream.write(pack) in
salt.ext.tornado.gen.with_timeout(...) using the new ipc_write_timeout
opt (seconds; default 30, 0 disables the timeout for legacy behavior).
When a single write does not complete in time, the publisher logs a
warning, closes the offending stream, and removes it from self.streams.
publish() also skips streams that still have a pending write buffer so
we no longer pile up more _write coroutines on a slow client while the
existing one is waiting to time out.

The opt is documented on the master and minion configuration pages and a
changelog fragment was added.

Merge requirements satisfied?

  • Used a Forge-PR Label (test:full)
  • Tests written/updated (tests/pytests/unit/transport/test_ipc.py::test_ipc_publisher_drops_non_consuming_client_68114)
  • Documentation updated (master and minion config refs)
  • Changelog updated (changelog/68114.fixed.md)

Commits signed with GPG?

Yes

IPCMessagePublisher.publish() spawned an unbounded number of
stream.write() coroutines per subscriber per message. If a subscriber
stopped consuming, every following publish queued another _write
coroutine that awaited a stream.write() that would never complete; the
pending writes (and their referenced payloads) accumulated in the
publisher's ioloop and inflated the master EventPublisher's RSS without
bound -- the "non consuming IPC clients" leak.

Wrap _write() with the new ipc_write_timeout opt (default 30s, 0
disables) via salt.ext.tornado.gen.with_timeout. On timeout, close the
stream and discard it from self.streams so the publisher reclaims
memory and stops spawning writers for that client. publish() now also
skips streams that still have a pending write buffer, which prevents
piling up more coroutines on a slow subscriber before the timeout
fires.

Documented on master and minion config pages and added a changelog
fragment.

Fixes saltstack#68114
@dwoz dwoz requested a review from a team as a code owner June 12, 2026 00:53
@dwoz dwoz added this to the Sulphur v3006.26 milestone Jun 12, 2026
@dwoz dwoz added the test:full Run the full test suite label Jun 12, 2026
@dwoz

dwoz commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

Context on prior work in this area (none of it surfaced in PR-list / timeline searches because it landed as direct-to-branch commits rather than as PRs):

  • 4c0e5529f5d Ipc leak fix (May 24 2025) shipped the same ipc_write_timeout/with_timeout approach this PR uses, on 3006.x / 3007.x / 3008.x / master.
  • 088b33bc635 Add regression test for ipc leak issue (same day) added tests/pytests/functional/master/test_event_publisher.py — a memory-growth regression test against EventPublisher.
  • 17257ebe1f3 Revert "Ipc leak fix" (Aug 3 2025) reverted only the fix commit, not the regression test. The reason is visible in 4c0e552 itself: it left literal merge-conflict markers in salt/config/__init__.py (<<<<<<< HEAD ... >>>>>>> f1de671598e (Ipc leak fix)), so the release branch was broken until the revert. The revert appears to have been tactical ("this commit is broken, get the branch building"), not a rejection of the approach.
  • 07a6a0788c4 Remove legacy salt.transport.ipc module removed salt/transport/ipc.py from 3008.x / master. That is why this PR targets 3006.x only — the merge-forward branches don't need the same patch because the module is gone.
  • No clean re-attempt has landed on 3006.x / 3007.x in the ~10 months since the revert. No open or closed PR references [BUG] Non consuming ipc clients #68114 or names ipc_write_timeout / IPCMessagePublisher / non-consuming.

This PR is a clean re-implementation of 4c0e552, without the merge-conflict damage, plus:

  • adds the opt to DEFAULT_MINION_OPTS as well as DEFAULT_MASTER_OPTS (the original only added it to the master),
  • short-circuits publish() for streams that still have a pending write so we don't pile up spawn_callback(self._write, ...) coroutines before the timeout fires,
  • raises the log level from trace to warning so operators can see when a non-consuming subscriber is being dropped,
  • documents the new opt on the master and minion config pages,
  • adds a small unit-level regression test (test_ipc_publisher_drops_non_consuming_client_68114) that fails on unmodified 3006.x and passes after this patch. The original test_event_publisher.py is preserved as the functional/RSS-based regression test.

Default timeout bumped from 15s (original) to 30s — more headroom for legitimately slow consumers under load — but is configurable and 0 disables (legacy behavior).

A stale `.. conf_minion:: ipc_write_timeout` block (Default: 15,
versionadded:: 3006.11) was inherited from origin/3006.x as a carryover
from the reverted 4c0e552 attempt. It collides with this PR's new
`.. conf_minion:: ipc_write_timeout` entry in minion.rst, causing
Sphinx -W to abort with a duplicate description warning and cascading
failures across the docs build and 28+ dependent CI jobs.

The correct master-side entry already exists at master.rst:1077
(`.. conf_master:: ipc_write_timeout`, Default: 30) matching
salt.defaults.IPC_WRITE_TIMEOUT. Remove only the stale minion directive
block; no other changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:full Run the full test suite

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant