Mass event management; reduce hundreds of alerts to one alert?

  • Maxburn
    Member
    • Sep 2019
    • 48

    #1

    Is there an easy way to suppress many alerts when a condition is met and send one different alert instead? Put another way: at the moment every alert I have everywhere is "High" or lower. Is it as simple as defining a "Disaster" severity alert somewhere and suppressing all lower-severity alerts?

    ------background------
    I have Zabbix running on a server at my office that monitors some VMs in our data center over a VPN. Occasionally the VPN goes down and I get ICMP ping fail alerts from hundreds of servers.

    At one point I tried to make this happen by adding a ping check on the remote network gateway, and in every server's ICMP action I put a dependency on that remote gateway alert.
    #1 This doesn't appear to work (no doubt configured wrong), and
    #2 it's a complete pain to set up in every single server.

    All servers are currently using "Template Module ICMP Ping" and "Template OS Windows by Zabbix agent"

    Zabbix 7.0.18
  • tim.mooney
    Senior Member
    • Dec 2012
    • 1427

    #2
    Dependencies, especially network topology dependencies, are the best way to handle that, but they're not foolproof.

    As for #2, pre-canned templates will never have the correct dependencies built in, so what you could do is:
    1. Full clone the Template Module ICMP Ping template and give it a local name. Don't modify the pre-built templates; always clone and modify your locally tailored templates instead.
    2. Add your network gateway trigger as a dependency for any triggers in your cloned template. Making the dependency part of the template makes it easy to have it included for all hosts on the other side of that link/segment.
    3. Unlink the old template from a few hosts and link the new template, preferably in the same update. Test, and expand to more hosts when the tests look good.
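Step 2 can also be done through the Zabbix API instead of the frontend. The following is a minimal sketch, not a tested procedure: the URL, the API token, and the trigger IDs are placeholders, and it assumes that `trigger.update` replaces the whole `dependencies` list, so existing dependencies are passed back in alongside the gateway trigger.

```python
import json
import urllib.request

# Assumption: placeholder values, not taken from this thread.
ZABBIX_URL = "https://zabbix.example.com/api_jsonrpc.php"
API_TOKEN = "replace-with-an-api-token"


def build_dependency_update(trigger_id, existing_dep_ids, gateway_trigger_id, request_id=1):
    """Build a JSON-RPC trigger.update body that adds the gateway trigger
    to a trigger's dependency list while keeping its existing dependencies."""
    dep_ids = list(existing_dep_ids)
    if gateway_trigger_id not in dep_ids:
        dep_ids.append(gateway_trigger_id)
    return {
        "jsonrpc": "2.0",
        "method": "trigger.update",
        "params": {
            "triggerid": trigger_id,
            "dependencies": [{"triggerid": d} for d in dep_ids],
        },
        "id": request_id,
    }


def send(payload):
    """POST the payload to the Zabbix API using a Bearer token.
    Defined for completeness; nothing below calls it."""
    req = urllib.request.Request(
        ZABBIX_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json-rpc",
            "Authorization": f"Bearer {API_TOKEN}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Hypothetical IDs: "1001" is the ICMP trigger in the cloned template,
# "3003" is the gateway ping trigger.
payload = build_dependency_update("1001", [], "3003")
print(json.dumps(payload, indent=2))
```

Printing the payload first makes it easy to check what would be sent before calling `send(payload)` against a real server.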

    The timing of when the dependency is checked vs. when the host checks are run can still cause alerts before Zabbix detects that the VPN is down. It helps to have your topology dependencies checked more frequently than the leaf nodes.

    If setting up dependencies doesn't do enough for you, there are other options, such as https://www.zabbix.com/forum/zabbix-...ing-python-api

    • Linwood
      Senior Member
      • Dec 2013
      • 398

      #3
      I'll echo that question, but also offer one fix I've implemented before, though it is tedious.

      You can make some triggers depend on others. For example if I have triggers that may fire from lack of data or timeouts, I will often make them dependent on a trigger that checks pings to see if a host is up. If the host is down, there's no point in telling me (for example) that SSH to the host is down, and that HTTP to the host is down, etc. These can mostly be automated.

      What's more difficult, and the tedious part I have done, is this: let's say you have a site with 50 hosts and one edge device connecting it to your primary site. If that edge device is unreachable (network down, device down, whatever), then every device behind it will alert as unreachable as well. I once set up a SQL statement where I could label these edge devices, and then within a site (defined as a host group "location") I added dependencies from each ping trigger at the site to the edge device's ping trigger. If the edge device was down, the others wouldn't alert (I also increased the ping frequency on the edge device so it always alerted first). This does work, but I never found a practical way to maintain it other than SQL modifications to the database, which I stopped doing some years ago because of all the schema changes.

      But I think someone good at API calls could probably build something that would remove and add such dependencies based on some kind of host marking (macro, host group, etc.)
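A rough sketch of the planning half of such a tool follows. It assumes the ping triggers for a site have already been fetched with `trigger.get` using `selectDependencies` (so each entry carries a `dependencies` list), and it only computes the `trigger.update` parameters that would be needed. The function name, the sample IDs, and the idea of treating one trigger as the "sentinel" for a host group are illustrative assumptions, not an existing Zabbix feature.

```python
def plan_dependency_changes(triggers, sentinel_trigger_id):
    """Return trigger.update params for every trigger that does not yet
    depend on the sentinel (edge device) trigger.

    `triggers` is assumed to have the shape of trigger.get output with
    selectDependencies, e.g.
    [{"triggerid": "11", "dependencies": [{"triggerid": "99"}]}, ...]
    """
    updates = []
    for trig in triggers:
        if trig["triggerid"] == sentinel_trigger_id:
            continue  # the sentinel must not depend on itself
        current = [d["triggerid"] for d in trig.get("dependencies", [])]
        if sentinel_trigger_id in current:
            continue  # dependency already in place
        updates.append({
            "triggerid": trig["triggerid"],
            # Existing dependencies are kept and the sentinel is appended.
            "dependencies": [{"triggerid": t} for t in current + [sentinel_trigger_id]],
        })
    return updates


# Hypothetical data: trigger "99" is the edge device ping trigger.
site_triggers = [
    {"triggerid": "99", "dependencies": []},
    {"triggerid": "11", "dependencies": []},
    {"triggerid": "12", "dependencies": [{"triggerid": "99"}]},
    {"triggerid": "13", "dependencies": [{"triggerid": "50"}]},
]

for params in plan_dependency_changes(site_triggers, "99"):
    print(params)
```

Keeping the planning step separate from the API calls means the script can be run in a dry-run mode first, which matters when it may touch hundreds of triggers.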

      Maybe something is built in now; I have not searched in some years, so I'm anxiously waiting for someone to have a good answer for you. Mine is not a good answer. Ah... Tim types faster than I do.

      There were some issues with cross template dependencies when I tried this ages ago. I may need to try again, maybe it's easier now. And I do know there's now a LOT more parameterization available in places that wasn't back then (e.g. macro expansion).

      • Maxburn
        Member
        • Sep 2019
        • 48

        #4
        Originally posted by tim.mooney
        1. full clone the Template
        2. unlink the old template from a few hosts and link the new template,
        That does not sound horrible to do, and it's certainly better than modifying every host's config.

        I already have trigger action step delays and notifications start going out on step 2. I THINK that might take care of the timing concern you bring up?

        I'll start reading up on how to do this.

        • Linwood
          Senior Member
          • Dec 2013
          • 398

          #5
          Originally posted by Maxburn
          I already have trigger action step delays and notifications start going out on step 2. I THINK that might take care of the timing concern you bring up?
          The issue is that the sentinel device (the one you want to alarm first, and by doing so block other alarms) has to alarm more quickly.

          For example, we use pings for up/down, and need to get three consecutive failed pings before a node triggers an alarm.

          What we did is use two different ping frequencies: a 30s interval for the sentinel device and a 60s interval for all the others. So when the link to the sentinel device goes down, the sentinel will alert in about 90s, while the rest won't alert until about 180s.

          Your timing can vary, but that gives you the idea.
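The arithmetic behind those numbers can be sketched as below. It is only an approximation: it assumes the trigger fires after a fixed number of consecutive failed pings and ignores where in the polling cycle the outage starts. The function name is made up for illustration.

```python
def seconds_to_alarm(interval_s, failures_needed=3):
    """Approximate time until a trigger that needs `failures_needed`
    consecutive failed pings fires, for a given ping interval."""
    return interval_s * failures_needed


sentinel = seconds_to_alarm(30)   # sentinel device pinged every 30s
leaf = seconds_to_alarm(60)       # everything else pinged every 60s
head_start = leaf - sentinel      # window in which the sentinel alarms first

print(sentinel, leaf, head_start)  # prints: 90 180 90
```

The larger the head start, the less likely a leaf host alarms before the sentinel trigger is already in a problem state.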

          • Maxburn
            Maxburn commented
            If my notification step is ten minutes out and I've got the default 1-minute polls, won't all that be sorted out by the time ten minutes are up?

          • Linwood
            Linwood commented
            If I understand correctly, you are having the trigger fire and then waiting before notification. Sounds right. In our case we want the trigger firing (and its appearance on the problems display) to correspond to notification, so that approach didn't work for us.

          • Maxburn
            Maxburn commented
            Yes, that's what I meant. In our situation, if the NOC sees the issue and makes a quick fix or acknowledges the alert in Zabbix, we don't need to notify engineers. If they drop the ball, we send the alerts.