Describe the bug
I have a Cacti system with 4000 hosts. This system was isolated by a network outage, and all hosts were reported DOWN by thold. When the network recovered, thold started to report the hosts as up, but because there were so many (and also because I run a local script via Status Change Command for each notification, which increases processing time), it took more than 5 minutes to go through the down hosts and generate the emails. The next poller run kicked in, and the thold process never got to finish thold_debug('Down device checks finished.');. This caused the cleanup to be missed (e.g. thold_debug('Thold Log Cleanup finished.');). On the next polling cycle thold starts sending the recovered-from-DOWN emails all over again, and cannot recover by itself, because it never has time to finish.
To break the loop I had to truncate the table plugin_thold_host_failed.
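The workaround amounts to a single statement against the Cacti database (the database name and client invocation here are assumptions about a typical install):

```shell
# Workaround only, not a fix: wipe thold's record of failed hosts so it
# stops re-sending the recovery emails. Assumes the default "cacti" schema.
mysql cacti -e "TRUNCATE TABLE plugin_thold_host_failed;"
```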
To Reproduce
Steps to reproduce the behavior:
- Configure a bunch of hosts (e.g. 10) for monitoring
- Enable Status Change Command and set it to a script which (for instance) does sleep 60, to artificially slow down the plugin's execution so that it runs longer than the polling cycle
- Enable email notifications for down/up hosts
- Isolate all 10 hosts from the network, to force them down.
- Wait for the hosts to be down, then recover the network to force them up.
- Observe that thold doesn't finish within a polling cycle and doesn't clear plugin_thold_host_failed, causing an alarm loop on every polling cycle.
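The delaying script from the steps above can be as small as this (a sketch; the handler name and the idea of taking the delay as an argument are illustrative, and it ignores whatever notification details thold passes):

```shell
# Hypothetical Status Change Command handler: thold would invoke this for
# every notification. It ignores the notification details and just sleeps,
# so each notification adds the given delay (60s in this report) to
# thold's total run time, pushing it past the polling cycle.
handle_status_change() {
  delay="${1:-60}"
  sleep "$delay"
  echo "status change handled after ${delay}s"
}
```

Pointing Status Change Command at a wrapper that calls handle_status_change is enough to reproduce the slowdown with only a handful of hosts.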
Expected behavior
Hosts for which a down/up notification has already been sent should be marked in the respective tables and skipped on the next polling cycle, to prevent such a loop. Also, if thold doesn't finish and is killed off, it should report this in the logs (similarly to how the main poller does).
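The skip-on-next-cycle behavior requested above can be sketched outside the plugin (illustrative only: the real fix belongs in thold's PHP, and the notified.list file here stands in for a marker in the plugin_thold_host_failed table):

```shell
# Sketch of the requested behavior: remember which hosts already got a
# recovery email and skip them on the next cycle, so an interrupted run
# cannot re-send the same notifications over and over.
notify_recovered() {
  host="$1"
  if grep -qx "$host" notified.list 2>/dev/null; then
    echo "skip $host (already notified)"
  else
    echo "email recovery notice for $host"
    echo "$host" >> notified.list
  fi
}
```

With this shape, a run that is killed mid-way resumes where it left off instead of starting the notifications from scratch.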
Plugin (please complete the following information):
- Version: 1.8.2
- Source: github
- Identifier: official release