First off, we've been using Zabbix for internal monitoring since about 2006, so we love the tool and have a good bit of experience with it.
We seem to be having an issue with our trigger for the agent.ping item we are monitoring. We have several hosts with very large databases on them, and periodically queries generated by our application and/or ad-hoc reporting queries from the customer consume all kinds of resources on the machine, causing it to grind to a halt for a minute or so. After these situations, I've often seen the load average of the system in the area of 100. While we are trying to get our devs and customers to get rid of these bad queries, in the meantime I'd like to not get alerted every time the system "goes out to lunch" for a minute or two.
Setup in question is:
Item: agent.ping, fetch value every 60s
Trigger: {Template App Zabbix Agent:agent.ping.nodata(180)}=1
Action: Send email immediately, escalate to pagers 3 minutes later
From the documentation, nodata() returns: "1 - if no data received during period of time in seconds. The period should not be less than 30 seconds.
0 - otherwise"
Every time this trigger fires, our action sends a problem email and then a recovery email almost immediately. I can't remember the last time I saw those two messages come in at different times (i.e. down for a period of time, then a recovery message some minutes later). Our actions etc. are working great for all other cases except this agent.ping trigger. What we want is an alert only if we haven't received any data at all for the whole previous 3 minutes. But it <seems> like it is firing when we haven't received data at some point in the previous 3 minutes.
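To make the expectation concrete, here is a minimal Python sketch of how I read the documented behavior. It only models the quoted wording and is not Zabbix's actual implementation; the function name `nodata_expected` and the sample timestamps are made up for illustration.

```python
def nodata_expected(sample_times, now, period):
    """Model of the documented nodata() wording (an assumption, not Zabbix code).

    Returns 1 if no sample arrived in the window (now - period, now],
    otherwise 0. All arguments are in seconds.
    """
    window_start = now - period
    got_data = any(window_start < t <= now for t in sample_times)
    return 0 if got_data else 1


# agent.ping is polled every 60s. Hypothetical stall: the polls at 60s and
# 120s are missed, and data resumes at 180s.
short_stall = [0, 180, 240]

# Hypothetical longer outage: nothing arrives between 0s and 240s.
long_outage = [0, 240]

# With a 180s window, the short stall should never fire under this reading,
# because the sample at 0s is still inside the window until data resumes.
print([nodata_expected(short_stall, now, 180) for now in (90, 150, 179, 180)])

# The longer outage should fire once a full 180s has passed with no data,
# and clear as soon as the sample at 240s arrives.
print([nodata_expected(long_outage, now, 180) for now in (179, 180, 239, 240)])
```

Under this reading, a stall of a minute or two should not fire the trigger at all, which is why the problem and recovery messages arriving back to back look wrong to me.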
Has anyone else had this same sort of problem with nodata() and/or agent.ping?
Thanks.