Describe the bug
I have a Cacti system with 4000 hosts. This system was isolated by a network outage, and all hosts were reported DOWN by thold. When the network recovered, thold started to report the hosts as up, but because there were so many (and also because I run a local script via Status Change Command for each notification, which increases processing time), it took more than 5 minutes to go through the down hosts and generate the emails. The next poller run kicked in, and the thold process never got to finish thold_debug('Down device checks finished.');. This caused the cleanup to be missed (e.g. thold_debug('Thold Log Cleanup finished.');). On the next polling cycle thold starts sending the recovered-from-DOWN emails all over again, and cannot recover by itself, because it never has time to finish.
To break the loop I had to truncate the table plugin_thold_host_failed.
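The workaround amounts to a single statement against the Cacti database (the database name and client invocation here are assumptions about a typical install):

```shell
# Workaround only, not a fix: wipe thold's record of failed hosts so it
# stops re-sending the recovery emails. Assumes the default "cacti" schema.
mysql cacti -e "TRUNCATE TABLE plugin_thold_host_failed;"
```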
To Reproduce
Steps to reproduce the behavior:
- Configure a bunch of hosts (e.g. 10) for monitoring
- Enable Status Change Command and set it to a script which (for instance) does sleep 60, to artificially slow down the plugin's execution so that it runs longer than the polling cycle
- Enable email notifications for down/up hosts
- Isolate all 10 hosts from the network, to force them down.
- Wait for the hosts to be down, then recover the network to force them up.
- Observe that thold doesn't finish within a polling cycle and doesn't clear plugin_thold_host_failed, causing an alarm loop on every polling cycle.
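The delaying script from the steps above can be as small as this (a sketch; the handler name and the idea of taking the delay as an argument are illustrative, and it ignores whatever notification details thold passes):

```shell
# Hypothetical Status Change Command handler: thold would invoke this for
# every notification. It ignores the notification details and just sleeps,
# so each notification adds the given delay (60s in this report) to
# thold's total run time, pushing it past the polling cycle.
handle_status_change() {
  delay="${1:-60}"
  sleep "$delay"
  echo "status change handled after ${delay}s"
}
```

Pointing Status Change Command at a wrapper that calls handle_status_change is enough to reproduce the slowdown with only a handful of hosts.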
Expected behavior
Hosts for which a down/up notification has already been sent should be marked in the respective tables and skipped on the next polling cycle, to prevent such a loop. Also, if thold doesn't finish and is killed off, it should report this in the logs (similarly to how the main poller does).
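The skip-on-next-cycle behavior requested above can be sketched outside the plugin (illustrative only: the real fix belongs in thold's PHP, and the notified.list file here stands in for a marker in the plugin_thold_host_failed table):

```shell
# Sketch of the requested behavior: remember which hosts already got a
# recovery email and skip them on the next cycle, so an interrupted run
# cannot re-send the same notifications over and over.
notify_recovered() {
  host="$1"
  if grep -qx "$host" notified.list 2>/dev/null; then
    echo "skip $host (already notified)"
  else
    echo "email recovery notice for $host"
    echo "$host" >> notified.list
  fi
}
```

With this shape, a run that is killed mid-way resumes where it left off instead of starting the notifications from scratch.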
Plugin (please complete the following information):
- Version: 1.8.2
- Source: github
- Identifier: official release