In our environment, we use Zabbix to monitor several compute clusters, among other things. Each cluster includes a 'master' host running a Zabbix proxy, and every host in the cluster, including the master node, sends its data through that proxy.
Each host has two main 'alive' triggers:
* SSH check every 5 minutes, trigger host down if the last 4 checks show the host down
* Zabbix agent down - trigger if no data is received for a CPU metric, which refreshes every 5 minutes, for the past 20 minutes.
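To make the two conditions concrete, here is a minimal Python sketch of the logic they describe. It is not Zabbix trigger syntax; the function names are hypothetical, and it assumes the SSH check stores 1 for "up" and 0 for "down".

```python
from typing import Optional, Sequence

NODATA_WINDOW_S = 20 * 60  # agent trigger: no CPU data for the past 20 minutes


def ssh_trigger_fires(ssh_results: Sequence[int]) -> bool:
    """True when the last 4 SSH checks all report the host down.

    Assumption: each result is 1 (up) or 0 (down), oldest first.
    """
    last_four = list(ssh_results)[-4:]
    return len(last_four) == 4 and all(r == 0 for r in last_four)


def agent_trigger_fires(last_cpu_timestamp: Optional[float], now: float) -> bool:
    """True when no CPU value has arrived within the 20-minute window."""
    if last_cpu_timestamp is None:
        return True  # never received any data at all
    return now - last_cpu_timestamp > NODATA_WINDOW_S
```

Note that the first condition needs fresh SSH results to change state, while the second fires precisely because nothing arrives.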
Each of these clusters also has an aggregate host, which provides summary statistics on a couple of key items. This aggregate host is also intended to throttle the notifications that would be generated if more than a couple of hosts go down at the same time.
Currently I have an aggregate item with the following key:
grpsum["{$CLUSTER}","ssh","last","0"]
This gives me the number of hosts whose most recent ssh check succeeded, which is one of the ways we determine whether a host is alive. A trigger on this item can then serve as a dependency for the individual host triggers, suppressing their alerts when more than a specific number of hosts are down.
The problem, however, is that the 'ssh' items don't get updated when the host running the proxy is offline. As a result, the aggregate trigger never goes true. The trigger based on the last data received for each host still works, and as a result, an alert gets generated for every host in the cluster.
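The failure mode can be illustrated with a small Python sketch. This is only a model of the behavior described above, not how Zabbix is implemented: the function names and the `max_down` threshold are hypothetical, and it assumes each host's last ssh value is 1 (up) or 0 (down).

```python
from typing import Dict


def grpsum_last(last_values: Dict[str, int]) -> int:
    """Model of the aggregate item: sum of each host's last ssh value."""
    return sum(last_values.values())


def aggregate_trigger_fires(last_values: Dict[str, int], max_down: int) -> bool:
    """True when more than max_down hosts report down in their last value."""
    hosts_down = len(last_values) - grpsum_last(last_values)
    return hosts_down > max_down


# 20 hosts, all up at the moment the proxy host goes offline.
last_values = {f"node{i:02d}": 1 for i in range(20)}

# With the proxy offline no new ssh values arrive, so last_values is frozen
# at its final state and the aggregate still sees every host as up.
proxy_offline_result = aggregate_trigger_fires(last_values, max_down=5)
print(proxy_offline_result)
```

Because the sum is computed over stale values, the suppression trigger stays false while the per-host no-data triggers fire one by one.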
I tried adding a second aggregate item that counts the number of values received for the CPU metric, as a second way of firing the suppression trigger. The item collects data as expected while the proxy is up, but if the proxy is disabled to simulate the host being down, the item becomes invalid shortly afterward, which prevents the aggregate trigger from working properly.
How are others managing alert suppression for groups of hosts? Basically we need to be able to get alerts that individual hosts are down, unless there are more than 5 or 10 hosts down. Once there are 5 or 10 hosts down, a single 'X group of hosts is down' style alert is needed. The trick is that this needs to work to suppress alerts when the proxy is down as well.
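For clarity, the behavior we are after looks roughly like the following Python sketch. The function name, message wording, and default threshold are hypothetical, and the per-host "down" flag is assumed to be true if either of the host's alive triggers has fired, so a host that only stopped sending data (for example because the proxy is down) still counts.

```python
from typing import Dict, List


def alerts_to_send(host_down: Dict[str, bool], group: str, threshold: int = 5) -> List[str]:
    """Return per-host alerts, or one group alert above the threshold.

    Assumption: host_down[name] is True if either alive trigger fired.
    """
    down = sorted(name for name, is_down in host_down.items() if is_down)
    if len(down) > threshold:
        # Too many hosts down at once: collapse into a single group alert.
        return [f"{group}: {len(down)} hosts are down"]
    return [f"{name} is down" for name in down]
```

For example, with two hosts down this returns two individual alerts, while with six hosts down and a threshold of 5 it returns one group alert.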
Thank you,