Disconnect non-consuming IPC subscribers to fix publisher memory leak (#68114)#69436
Open
dwoz wants to merge 2 commits into
Open
Disconnect non-consuming IPC subscribers to fix publisher memory leak (#68114)#69436dwoz wants to merge 2 commits into
dwoz wants to merge 2 commits into
Conversation
IPCMessagePublisher.publish() spawned an unbounded number of stream.write() coroutines per subscriber per message. If a subscriber stopped consuming, every following publish queued another _write coroutine that awaited a stream.write() that would never complete; the pending writes (and their referenced payloads) accumulated in the publisher's ioloop and inflated the master EventPublisher's RSS without bound -- the "non consuming IPC clients" leak. Wrap _write() with the new ipc_write_timeout opt (default 30s, 0 disables) via salt.ext.tornado.gen.with_timeout. On timeout, close the stream and discard it from self.streams so the publisher reclaims memory and stops spawning writers for that client. publish() now also skips streams that still have a pending write buffer, which prevents piling up more coroutines on a slow subscriber before the timeout fires. Documented on master and minion config pages and added a changelog fragment. Fixes saltstack#68114
Contributor
Author
|
Context on prior work in this area (none of it surfaced in PR-list / timeline searches because it landed as direct-to-branch commits rather than as PRs):
This PR is a clean re-implementation of 4c0e552, without the merge-conflict damage, plus:
Default timeout bumped from 15s (original) to 30s — more headroom for legitimately slow consumers under load — but is configurable and |
A stale `.. conf_minion:: ipc_write_timeout` block (Default: 15, versionadded:: 3006.11) was inherited from origin/3006.x as a carryover from the reverted 4c0e552 attempt. It collides with this PR's new `.. conf_minion:: ipc_write_timeout` entry in minion.rst, causing Sphinx -W to abort with a duplicate description warning and cascading failures across the docs build and 28+ dependent CI jobs. The correct master-side entry already exists at master.rst:1077 (`.. conf_master:: ipc_write_timeout`, Default: 30) matching salt.defaults.IPC_WRITE_TIMEOUT. Remove only the stale minion directive block; no other changes.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
IPCMessagePublishernow disconnects a subscriber that does not consumemessages within a configurable timeout (
ipc_write_timeout, default 30s).This stops the master event publisher from accumulating pending
stream.write()coroutines (and their referenced payloads) on a stalledIPC client, which was a real source of unbounded RSS growth.
What issues does this PR fix or reference?
Fixes #68114
Previous Behavior
IPCMessagePublisher.publish()didself.io_loop.spawn_callback(self._write, stream, pack)for everysubscriber on every published message.
_writethen awaitedstream.write(pack)with no timeout. A subscriber that stopped readingcaused every following publish to queue another
_writecoroutine thatawaited a write that never completed; the pending writes and their
payloads piled up in the ioloop and the publisher's RSS grew without
bound.
New Behavior
_writewrapsstream.write(pack)insalt.ext.tornado.gen.with_timeout(...)using the newipc_write_timeoutopt (seconds; default 30,
0disables the timeout for legacy behavior).When a single write does not complete in time, the publisher logs a
warning, closes the offending stream, and removes it from
self.streams.publish()also skips streams that still have a pending write buffer sowe no longer pile up more
_writecoroutines on a slow client while theexisting one is waiting to time out.
The opt is documented on the master and minion configuration pages and a
changelog fragment was added.
Merge requirements satisfied?
test:full)tests/pytests/unit/transport/test_ipc.py::test_ipc_publisher_drops_non_consuming_client_68114)changelog/68114.fixed.md)Commits signed with GPG?
Yes