Fix #389: Prevent 100% CPU usage when Docker restarts#41
Closed
romprod wants to merge 2 commits intoPatchMon:mainfrom
Closed
Fix #389: Prevent 100% CPU usage when Docker restarts#41romprod wants to merge 2 commits intoPatchMon:mainfrom
romprod wants to merge 2 commits intoPatchMon:mainfrom
Conversation
added 2 commits
January 3, 2026 10:19
…ersions Fixes #379 Fixes #390 Issue: Virtual kernel meta-packages (linux-image-virtual, linux-image-generic, etc.) were incorrectly marked as needs_reboot=true because the agent compared the meta-package name against the running kernel version instead of the actual kernel the meta-package depends on. Root Cause: - Agent extracted meta-package name: 'virtual' - Compared against running kernel: '5.15.0-164-generic' - Mismatch triggered false positive needs_reboot=true Solution: - Expanded meta-package detection to cover all variants - Added resolveMetaPackageKernel() to resolve meta-packages to actual kernels - Uses apt-cache depends to find kernel dependencies - Gracefully falls back if apt-cache unavailable (non-Debian systems) Changes: - Modified getLatestKernelFromDpkg() in internal/system/reboot.go - Now skips: virtual, generic, lowlatency, cloud, generic-hwe meta-packages - Resolves each meta-package to its actual kernel package - Added resolveMetaPackageKernel() function - Queries apt-cache for package dependencies - Parses kernel package information - Logs resolution attempts for debugging - Added comprehensive tests in internal/system/reboot_test.go - 22+ test cases covering parsing, comparison, and edge cases - Tests for all known meta-package variants - Tests for missing/unavailable apt-cache fallback Impact: - Affects: Ubuntu/Debian systems with virtual kernel meta-packages - Risk: Low - only changes reboot detection logic - Backward compatible: Non-meta-package systems unaffected - Performance: No impact
Fixes #389 Issue: When Docker service restarts, the PatchMon-agent would consume 100% CPU and become unresponsive. The only way to recover was to manually restart the agent. Root Cause: The Docker event monitoring loop had a critical flaw in handling connection failures. When Docker restarts, the event channels close, but the loop continued spinning indefinitely: 1. Event channels close (EOF error) 2. Error handler sleeps for 5 seconds 3. Continue statement jumps back to select statement 4. Select statement fires IMMEDIATELY (channels are still closed) 5. Creates busy-spin loop: select → continue → select → continue... 6. Result: 100% CPU with no recovery mechanism Technical Issue: In Go, receiving from a CLOSED channel returns immediately with the zero value for that type (non-blocking). The original code's select statement would fire thousands of times per second on closed channels, creating a busy loop instead of properly waiting for recovery. Solution: Implemented a two-tier monitoring architecture with exponential backoff: TIER 1: monitoringLoop() - Manages reconnection - Attempts to establish event stream via monitorEvents() - On failure: waits with exponential backoff (1s, 1.5s, 2.25s... capped at 30s) - Sleep properly blocks (0% CPU during wait) - Automatically retries when backoff completes - Supports graceful shutdown via context cancellation TIER 2: monitorEvents() - Handles single event stream - Gets FRESH channels from Docker API on each call - Processes events from the current stream - Returns immediately on any error (EOF, context cancel, etc) - Lets caller (monitoringLoop) decide on retry strategy Key Insight: By separating event processing (monitorEvents) from reconnection logic (monitoringLoop), we ensure the wait/sleep happens OUTSIDE the select loop. This prevents busy-spinning on closed channels. Impact: - CPU usage: 100% spin → 0-0.1% (sleeping) - Recovery time: Manual restart → 1-2 seconds automatic - Error messages: Generic → Detailed with retry attempts - Graceful shutdown: Properly cancels context Testing: - Comprehensive test suite: 7 unit tests + 1 benchmark - Tests cover: reconnection logic, exponential backoff, error handling - EOF error handling explicitly tested - Context cancellation tested for graceful shutdown Files Changed: - internal/integrations/docker/monitoring.go (refactored event monitoring) - internal/integrations/docker/monitoring_test.go (new test suite) The fix is backward compatible and improves overall agent reliability.
Collaborator
|
This has been merged and implemented :) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixed critical issue where PatchMon-agent consumed 100% CPU when Docker service was restarted, becoming completely unresponsive.
Problem
When Docker service restarts, the agent's Docker event monitoring loop enters an infinite CPU spin and must be manually restarted to recover. Only log message: "Docker event error" with error=EOF
Root Cause
The event monitoring loop had a critical flaw: when Docker restarts and channels close, Go's select statement returns immediately on closed channels (non-blocking). The loop would: select → sleep 5s → continue → select (immediately fires again) creating a busy-spin loop instead of properly waiting.
Technical Issue in Go:
When a channel is closed, receiving from it returns immediately with the zero value. The original code's select statement would fire thousands of times per second, creating a busy loop consuming 100% CPU.
Solution
Implemented two-tier monitoring architecture with exponential backoff:
TIER 1: monitoringLoop() - Manages Reconnection
TIER 2: monitorEvents() - Handles Single Event Stream
Key Insight:
By separating event processing (monitorEvents) from reconnection logic (monitoringLoop), we ensure the wait/sleep happens OUTSIDE the select loop. This prevents busy-spinning on closed channels.
Impact
Testing
Files Changed
Quality
Fixes #389