Throttling / Limiting number of notification alerts using Python + API

  • syntax53
    Member
    • Mar 2018
    • 40

    #1

    On my network, the MDFs are all on battery backup and emergency generators; the IDFs are not. When power goes out I can get hundreds of email notifications, especially if the outage affects multiple buildings. I have dependencies set up on the link to the MDF, but that doesn't help because the MDF never goes down. I've dealt with this for a long time, kept coming back here searching for solutions, and eventually decided to write my own.

    Here is the Python script: https://pastebin.com/qjxpXHkE

    Basically, you specify the list of action IDs you want to monitor (get the ID from the URL of the link to the action in the GUI), and the script monitors the number of alerts generated per action over a specified amount of time. I have it configured so that if it sees 5 "problem" alerts generated in the last 5 minutes, it disables the action, which prevents any further notifications. Another parameter sets the recovery time; I have it at 15 minutes, so after 15 minutes the script re-enables the action and new notifications start coming through again. You run it from cron so that it checks every 30 seconds. Earlier today we had a power outage in two of our buildings and I only got 15 alerts before the script caught it and disabled the action. Normally this would have been over 100 alerts.

    The API user needs super admin rights in order to modify the actions. Emails can be generated when an action is disabled or enabled. You can filter certain alerts out so they aren't counted in the tally; for example, I have "Resolved:" filtered out because that text is on all of my resolution emails. This way the script only counts "problem" alerts.
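
For readers who want to see the decision logic in isolation, here is a minimal sketch of the counting and enable/disable rules described above. It is not the code from the pastebin: the names `Alert`, `count_problem_alerts`, and `decide` are hypothetical, the defaults simply mirror the 5 alerts / 5 minutes / 15 minutes settings mentioned in the post, and fetching alerts or updating the action through the Zabbix API is deliberately left out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass
class Alert:
    """Hypothetical minimal view of one alert: when it fired and its subject."""
    clock: int      # Unix timestamp of the alert
    subject: str    # message subject line


def count_problem_alerts(
    alerts: Iterable[Alert],
    now: int,
    window_seconds: int = 300,
    ignore_substrings: Sequence[str] = ("Resolved:",),
) -> int:
    """Count alerts inside the time window, skipping filtered subjects."""
    count = 0
    for alert in alerts:
        age = now - alert.clock
        if age < 0 or age > window_seconds:
            continue  # outside the monitoring window
        if any(text in alert.subject for text in ignore_substrings):
            continue  # e.g. recovery emails do not count toward the tally
        count += 1
    return count


def decide(
    enabled: bool,
    disabled_at: Optional[int],
    problem_count: int,
    now: int,
    threshold: int = 5,
    recovery_seconds: int = 900,
) -> str:
    """Return 'disable', 'enable', or 'none' for one monitored action."""
    if enabled:
        return "disable" if problem_count >= threshold else "none"
    # Action is currently disabled: re-enable once the recovery time has passed.
    if disabled_at is not None and now - disabled_at >= recovery_seconds:
        return "enable"
    return "none"
```

Keeping the decision separate from the API calls makes it easy to try different thresholds against recorded alert data before letting a script touch real actions.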

    Example crontab:
    Code:
    * * * * * python /home/normal/zabbix_stuff/zabbix_alert_monitor.py
    * * * * * (sleep 30;python /home/normal/zabbix_stuff/zabbix_alert_monitor.py)
    The second line repeats the command prefixed with "sleep 30;" so that the script runs every 30 seconds. Cron's finest granularity is one minute, so the first line runs at the top of each minute and the second line runs 30 seconds later.
    Last edited by syntax53; 21-03-2019, 20:56.
  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    Too many notifications are almost always a symptom of missing dependencies between triggers.

    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates

    • syntax53
      Member
      • Mar 2018
      • 40

      #3
      Originally posted by kloczek
      Too many notifications are almost always a symptom of missing dependencies between triggers.
      I mentioned in my post that I already have dependencies where I can. There is no dependency I can add that will stop these from happening. Consider the picture below: the switches depend on the MDFs, and the MDFs depend on the router. None of these ever go down, though. There is no dependency I can add that will stop notifications when all of the switches go down.

      [Image: zd.png]

      • kloczek
        Senior Member
        • Jun 2006
        • 1771

        #4
        Originally posted by syntax53
        I mentioned in my post that I already have dependencies where I can. There is no dependency I can add that will stop these from happening. Consider the picture below: the switches depend on the MDFs, and the MDFs depend on the router. None of these ever go down, though. There is no dependency I can add that will stop notifications when all of the switches go down.
        Currently Zabbix is GoodEnough(tm), if not perfect, at defining and handling host dependencies.
        Anything outside that area requires defining metrics on top of multiple per-host triggers.
        Sometimes that is necessary.
        A typical case is a horizontally scaled farm of hosts, or a few switches providing redundant paths using (fast) spanning tree protocol.
        In such cases you may want to know that some number of backbone switches are already dead, but as long as some number of alternative routes remain, it should not be a problem.
        With that it would be possible to hook under such a master trigger, so that as long as the number of paths stays between N and N+M the critical-problem trigger does not fire, and even some host (switch) critical alarms could be hidden.

        Nevertheless, theoretically you can define a dummy host metric that depends on the per-switch availability metrics combined with the "and" operator, so that an alarm is displayed only when "switch_A_is_down and switch_B_is_down and switch_C_ ...". Then it should theoretically be possible to create a dependency that hides some of the per-switch alarms. That is only theory, because so far it is not possible to create inter-host trigger dependencies.
        Sometimes it is good .. sometimes not.
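
As an illustration only, the "and"-combined dummy metric could be modelled outside Zabbix like this. The function `visible_alarms` and the alarm names are hypothetical, and this is plain Python rather than a Zabbix trigger expression: when every switch in the group is down, a single aggregate alarm replaces the per-switch ones.

```python
from typing import Dict, List


def visible_alarms(switch_down: Dict[str, bool]) -> List[str]:
    """Return the alarms to show for a group of switches.

    switch_down maps a switch name to True when that switch is down.
    """
    down = sorted(name for name, is_down in switch_down.items() if is_down)
    # "switch_A_is_down and switch_B_is_down and ...": all members are down,
    # so show one aggregate alarm and hide the per-switch alarms.
    if switch_down and len(down) == len(switch_down):
        return ["all_switches_down"]
    # Otherwise report each down switch individually.
    return [f"{name}_is_down" for name in down]
```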

        I would suggest using some queuing software, because it may be much more effective and deterministic.
        Messages already in the queue can also be managed, for example reordered so that those with the highest severity are delivered first.
        In the worst case, such a queue could easily be blocked on input, or redirected to null and flushed manually, if the number of messages still waiting to be delivered is high.
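
A rough sketch of what such a queue could look like, assuming an in-process Python queue rather than any particular queuing product. The class `SeverityQueue` and its methods are hypothetical names; the sketch shows severity-first delivery, blocking the input, and a manual flush.

```python
import heapq
import itertools
from typing import List, Tuple


class SeverityQueue:
    """Deliver the highest-severity message first; ties keep arrival order."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._order = itertools.count()  # tie-breaker preserving arrival order
        self.blocked = False             # when True, new messages are refused

    def put(self, severity: int, message: str) -> bool:
        """Queue a message; return False if the input is blocked."""
        if self.blocked:
            return False
        # heapq is a min-heap, so negate severity to pop the highest first.
        heapq.heappush(self._heap, (-severity, next(self._order), message))
        return True

    def get(self) -> str:
        """Remove and return the next message to deliver."""
        _, _, message = heapq.heappop(self._heap)
        return message

    def flush(self) -> int:
        """Drop everything still queued; return how many messages were dropped."""
        dropped = len(self._heap)
        self._heap.clear()
        return dropped

    def __len__(self) -> int:
        return len(self._heap)
```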

        • syntax53
          Member
          • Mar 2018
          • 40

          #5
          You do realize that I posted this in the "cookbook" section and I am simply sharing my solution to this problem...

          • tim.mooney
            Senior Member
            • Dec 2012
            • 1427

            #6
            Hey syntax53, thank you for sharing your solution for this problem.

            My site uses dependencies wherever we can, but like you we have an annoying situation where there's literally no dependency we can use to prevent alert storms for a specific group of systems.

            I plan to use your script to help throttle alerts for that group. I have a couple of questions about it.
            1. Are you still using it? Have you made any changes since the version you posted here?
            2. I'm stronger in other scripting languages than I currently am in Python, so I'm wondering whether you've needed to make any changes to the script to work with recent versions of Python 3?
            I think I'm actually going to modify the script slightly, so that rather than running as a one-shot frequently from cron, the main body is a permanent loop with a 30 second sleep at the end. I'll write a systemd service file for it, so that systemd starts it and is responsible for restarting it if it ever exits, but it's basically running continuously (but sleeping for most of the time).
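
A minimal sketch of that kind of loop, under stated assumptions: `run_forever` and `check_once` are hypothetical names, with `check_once` standing in for the main body of the monitoring script. The `sleep` and `max_iterations` parameters exist only so the loop can be exercised without actually waiting.

```python
import logging
import time
from typing import Callable, Optional


def run_forever(
    check_once: Callable[[], None],
    interval_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    max_iterations: Optional[int] = None,
) -> None:
    """Run check_once repeatedly, sleeping at the end of each pass."""
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        try:
            check_once()
        except Exception:
            # Log and carry on so one failed check does not end the loop.
            logging.exception("alert check failed")
        iterations += 1
        sleep(interval_seconds)
```

Catching exceptions inside the loop is one design choice; the alternative is to let the process exit on errors and rely on the service manager to restart it.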

            Thanks again for sharing. Not only does it likely address the issue we're seeing, but since it's Python it's a good example for me to help improve my Python scripting.

            • syntax53
              Member
              • Mar 2018
              • 40

              #7
              Hello Tim. Yes, we still use it. There have been some minor tweaks since last posting. I have updated the pastebin.

              Regards,
              Matt
