On my network, my MDFs are all on battery backup and emergency generators. My IDFs are not. When the power goes out I can get hundreds of email notifications, especially if the outage affects multiple buildings. I have dependencies set up on the link to the MDF, but that doesn't help because the MDF never goes down. I've dealt with this for a long time and have come back here searching for solutions, and eventually decided to write my own.
Here is the python script: https://pastebin.com/qjxpXHkE
Basically, you specify the list of action IDs you want to monitor (get the ID by looking at the URL of the link to the action in the GUI), and the script counts the alerts generated per action over a specified amount of time. I have it configured so that if it sees 5 "problem" alerts generated in the last 5 minutes, it disables the action. This prevents any further notifications. There is another parameter for the recovery time, which I have set to 15 minutes. After 15 minutes the script re-enables the action and new notifications start coming through again. You set it up in cron so that it checks every 30 seconds. Earlier today we had a power outage in two of our buildings and I received only 15 alerts before the script caught the flood and disabled the action. Normally this would have been over 100 alerts.
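To illustrate the idea (this is not the code from the pastebin link, just a minimal sketch of the same logic), the threshold and recovery checks could look something like the following. The function names, parameter names, and the use of Unix timestamps are my assumptions for the example; the defaults mirror the 5 alerts / 5 minutes / 15 minutes settings described above.

```python
def should_disable(alert_times, now, window_s=300, threshold=5):
    """Return True if at least `threshold` alerts fall inside the last `window_s` seconds.

    alert_times: iterable of Unix timestamps, one per "problem" alert for one action.
    now: current Unix timestamp.
    """
    recent = [t for t in alert_times if now - window_s <= t <= now]
    return len(recent) >= threshold


def should_reenable(disabled_at, now, recovery_s=900):
    """Return True once `recovery_s` seconds have passed since the action was disabled."""
    return now - disabled_at >= recovery_s


if __name__ == "__main__":
    # Five alerts within the last five minutes: the action would be disabled.
    print(should_disable([100, 150, 200, 250, 290], now=300))
    # Fifteen minutes after disabling: the action would be re-enabled.
    print(should_reenable(disabled_at=0, now=900))
```

Because cron starts a fresh process on every run, a real script would also need to remember when it disabled each action between runs, for example in a small state file.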
The API user needs super admin rights in order to modify the actions. The script can send an email when it disables an action and again when it re-enables it. You can also filter certain alerts out so they aren't counted in the tally. For example, I have "Resolved:" filtered out, as that text is on all of my resolution emails. This way the script only counts "problem" alerts.
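As a rough sketch of how the filtering and the API side might fit together (again, not the pastebin script itself): the helpers below count alerts whose subject does not contain an ignored string, and build JSON-RPC request bodies for the Zabbix API methods `alert.get` and `action.update`. The helper names are made up for this example. I'm assuming each alert is a dict with a `subject` field as returned by `alert.get`, and that an action's `status` is 0 for enabled and 1 for disabled. Authentication is left out on purpose because it differs between Zabbix versions.

```python
import json


def count_problem_alerts(alerts, ignore_substrings=("Resolved:",)):
    """Count alerts whose subject contains none of the ignored substrings."""
    return sum(
        1
        for alert in alerts
        if not any(s in alert.get("subject", "") for s in ignore_substrings)
    )


def alert_get_request(action_ids, time_from, request_id=1):
    """Build an alert.get request body for alerts generated since `time_from`."""
    return {
        "jsonrpc": "2.0",
        "method": "alert.get",
        "params": {
            "output": ["alertid", "actionid", "subject", "clock"],
            "actionids": [str(a) for a in action_ids],
            "time_from": int(time_from),
        },
        "id": request_id,
    }


def action_update_request(action_id, enabled, request_id=1):
    """Build an action.update request body that enables or disables one action."""
    return {
        "jsonrpc": "2.0",
        "method": "action.update",
        # Assumed mapping: status 0 = enabled, 1 = disabled.
        "params": {"actionid": str(action_id), "status": 0 if enabled else 1},
        "id": request_id,
    }


if __name__ == "__main__":
    sample = [
        {"subject": "Problem: Switch unreachable"},
        {"subject": "Resolved: Switch unreachable"},
    ]
    print(count_problem_alerts(sample))
    print(json.dumps(action_update_request(7, enabled=False)))
```

The request bodies would be POSTed to the frontend's `api_jsonrpc.php` endpoint with whatever authentication your Zabbix version expects.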
Example crontab:
The crontab has two entries for the same command, and the second one is prefixed with "sleep 30;" so that the script runs every 30 seconds. Cron can't schedule anything more often than once per minute, so both entries start at the top of each minute and the second one waits 30 seconds before running the script.
Code:
* * * * * python /home/normal/zabbix_stuff/zabbix_alert_monitor.py
* * * * * (sleep 30; python /home/normal/zabbix_stuff/zabbix_alert_monitor.py)