Alert suppression for groups of hosts

  • mschlegel
    Member
    • Oct 2008
    • 40

    #1

    Alert suppression for groups of hosts

    In our environment, one of the things Zabbix is used for is to monitor several compute clusters. Each of these compute clusters includes a 'master' host which has a zabbix proxy running on it, and all of the hosts in the cluster send their data through the proxy, including the master node.

    Each host has two main 'alive' triggers:
    * SSH check every 5 minutes, trigger host down if the last 4 checks show the host down
    * Zabbix agent down - trigger if no data is received for a CPU metric, which refreshes every 5 minutes, for the past 20 minutes.

    Each of these clusters also has an aggregate host, which provides summary statistics on a couple of key items. This aggregate host is also intended to throttle the notifications that would be generated if more than a couple of the hosts go down at the same time.

    Currently I have an aggregate item with the following key:
    grpsum["{$CLUSTER}","ssh","last","0"]

    This gives me the number of hosts that responded to their most recent SSH check, which is one of the ways we determine whether a host is alive. A trigger on this item can then be used as a dependency for the individual host triggers, which provides alert suppression when more than a specific number of hosts are down.

    The problem, however, is that the 'ssh' items don't get updated when the host running the proxy is offline, so the aggregate trigger never goes true. The trigger based on the last data received for each host still works, and an alert gets generated for every host in the cluster.


    I tried adding an aggregate item to count the number of results that were seen for the CPU metric, to use as a second method of enabling the suppression trigger. The item collects data as expected as long as the proxy is up, but if the proxy is disabled to simulate the host being down, the item becomes invalid shortly after. This behavior blocks the aggregate trigger from working properly.

    How are others managing alert suppression for groups of hosts? Basically we need to be able to get alerts that individual hosts are down, unless there are more than 5 or 10 hosts down. Once there are 5 or 10 hosts down, a single 'X group of hosts is down' style alert is needed. The trick is that this needs to work to suppress alerts when the proxy is down as well.
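To make the requirement concrete, here is a minimal Python sketch of the decision being asked for. It is not Zabbix configuration; the function name `plan_alerts`, the message wording, and the choice to treat "at or above the threshold" as the cut-over point are all assumptions made for illustration.

```python
def plan_alerts(down_hosts, group_name, threshold):
    """Return the alert messages to send for one cluster.

    Hypothetical helper: below the threshold every down host gets its
    own alert; at or above it, a single group-level alert replaces them.
    """
    down = sorted(set(down_hosts))  # de-duplicate and keep output stable
    if len(down) >= threshold:
        return [f"{group_name}: {len(down)} hosts are down"]
    return [f"{host} is down" for host in down]


if __name__ == "__main__":
    # Two hosts down with a threshold of 5: individual alerts.
    print(plan_alerts(["node07", "node03"], "cluster-a", 5))
    # Six hosts down: one group alert only.
    print(plan_alerts([f"node{i:02d}" for i in range(6)], "cluster-a", 5))
```

The key point is that `down_hosts` has to include hosts that are silent because the proxy is offline, which is exactly the input the aggregate item fails to provide.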

    Thank you,
  • danrog
    Senior Member
    • Sep 2009
    • 164

    #2
    I was thinking of this too. We are starting to deploy proxies all over our global network. However, our environment is a little different in that we don't have a lot of grpavg triggers, just host based ones (we don't have any compute clusters).

    What I am going to test (hopefully in the next week or two) in our dev environment is adding a trigger dependency on all hosts behind a proxy for the proxy itself. I think this will suppress certain triggers if a proxy is down but I won't know until I try it.

    Now if this does work, it certainly is not scalable to go in and manually update all the triggers (in our environment it will be 1000+ after we finish our global rollout), but this is where the API comes in handy. I am doing a lot with the API today, so adding a script to update trigger dependencies on a regular basis won't be difficult to implement.

    If you feel like this could work for your environment and try it out soon, I'd be interested to know if it worked.
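As a rough sketch of what such a script might build, the Python below constructs a JSON-RPC request body for the `trigger.update` method with a `dependencies` list, which is how the current Zabbix API documentation describes setting trigger dependencies. The function name and the trigger IDs are made up for illustration, and authentication is deliberately left out because how the token is passed differs between Zabbix versions. Nothing is sent over the network here.

```python
import json


def build_dependency_update(trigger_id, proxy_trigger_id, request_id=1):
    """Build a JSON-RPC body that makes one host trigger depend on the
    proxy's 'down' trigger. Hypothetical helper; IDs are placeholders."""
    return {
        "jsonrpc": "2.0",
        "method": "trigger.update",
        "params": {
            "triggerid": str(trigger_id),
            # Assumption: this list is the full set of dependencies the
            # trigger should have after the update, so merge in any
            # existing ones before sending.
            "dependencies": [{"triggerid": str(proxy_trigger_id)}],
        },
        "id": request_id,
    }


if __name__ == "__main__":
    # Example IDs only; real ones would come from trigger.get.
    for n, host_trigger in enumerate([13501, 13502, 13503], start=1):
        body = build_dependency_update(host_trigger, 13000, request_id=n)
        print(json.dumps(body))
```

Run on a schedule, a loop like this would cover new hosts as they are added behind a proxy, provided the list of host triggers is refreshed from the API each time.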


    • mschlegel
      Member
      • Oct 2008
      • 40

      #3
      That method would solve one of the problems, which is if the proxy goes down. It doesn't help to suppress alerting over a specific threshold though. Even with the dependency on the proxy, we still need to be able to have some sort of logic to say '6 of the hosts in this group are down, trigger that the group is down and don't send individual alerts until back under that threshold.'

      One of our 'host down' triggers is done with the following:
      system.cpu.load[,avg1].nodata(1200)

      Currently the suppression trigger looks like this:
      {Template_ClusterAggregate:grpsum["{$CLUSTER}","ssh","last","0"].last(0)}<{$NODESMIN}

      If I had a way to get the number of hosts where '.nodata(600)' on an item was true, or something similar, then I could add a clause to the suppression trigger to make use of it. I tried using one I thought meant that, but the aggregate item dropped as soon as the host items stopped reporting.
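For illustration, the count being described could be computed outside the aggregate item, for example by a script on the server that reads each host's last-seen timestamp and pushes the result into a trapper item. That workflow is an assumption, not something tested in this thread, and the names below (`count_stale_hosts`, `last_seen`) are hypothetical. The sketch only shows the counting step:

```python
import time


def count_stale_hosts(last_seen, now=None, window=600):
    """Count hosts whose newest value is older than `window` seconds.

    `last_seen` maps host name -> Unix timestamp of the latest value,
    or None if the host has never reported. Mirrors the idea of
    '.nodata(600)' being true for that host's item.
    """
    if now is None:
        now = time.time()
    return sum(
        1 for ts in last_seen.values()
        if ts is None or now - ts > window
    )


if __name__ == "__main__":
    now = 1_000_000
    sample = {
        "node01": now - 30,    # fresh
        "node02": now - 900,   # silent for 15 minutes
        "node03": None,        # never reported
    }
    print(count_stale_hosts(sample, now=now))
```

Because the count is derived from timestamps rather than from fresh values, it keeps working when the proxy stops forwarding data, which is the case where the aggregate item dropped out.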


      • mschlegel
        Member
        • Oct 2008
        • 40

        #4
        Any further ideas on how this might be possible?

        Thank you
