Do "UNKNOWN" triggers and "stale items" make Zabbix-based monitoring unreliable?
OK, the introduction is a bit harsh, but I think the subject really matters: I have two issues that have already been discussed here, but without a clear solution (none that I could find here or in the documentation, at least).
- The fact that UNKNOWN triggers are ignored by the alerting components of Zabbix hides real problems in some common cases.
- If for some reason the agents stop sending item values (for example because the monitored server is down), the "my_item.last(0)" value is stale but is still treated as valid. Therefore all calculated items or triggers depending on this stale value say everything is "OK" when the service is in fact down.
Unfortunately it is easy to have a trigger go UNKNOWN or an item become stale.
I have a calculated item C which depends on 2 items A coming from 2 servers. The 2 servers run the same application for High Availability. The item value is 0 if both applications are down and > 0 if at least one application is working.
Therefore I created a trigger that is true if the calculated item's value is 0.
If item A takes too long to calculate, the Zabbix agent sends a "ZBX_NOTSUPPORTED" response to the server. Therefore item C cannot be calculated anymore and the trigger goes "UNKNOWN". Because "UNKNOWN" is not a problem and there is no way to catch unknown triggers (they are even invisible in the dashboard except for a global counter), Zabbix cannot alert users that something may be wrong.
If both servers go down, their Zabbix agents do not run anymore. Items A still have a "last(0)" value available, which says "everything's fine". Therefore item C will be calculated and have a value different from 0. The trigger remains "OK" whereas nothing works.
How can I have a working trigger that correctly informs me that a server is down (or not responding correctly) if all the trigger does is to switch to "UNKNOWN" state?
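To make the two failure modes concrete, the desired behaviour can be modelled in a few lines of Python. This is only a sketch of the logic, not Zabbix code or a Zabbix API: the names `Sample`, `calc_c`, `trigger_state`, `should_alert` and the 180-second `MAX_AGE` are hypothetical, and using `None` to stand in for ZBX_NOTSUPPORTED is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_AGE = 180  # seconds; assumed freshness limit, not a Zabbix default


@dataclass
class Sample:
    value: Optional[int]  # None stands in for ZBX_NOTSUPPORTED
    timestamp: float      # when the agent reported it, in epoch seconds


def calc_c(samples: List[Sample], now: float) -> Optional[int]:
    """Hypothetical item C: sum of items A, counting stale samples as 0."""
    total = 0
    for s in samples:
        if now - s.timestamp > MAX_AGE:
            continue  # stale: the server is presumed down, contributes 0
        if s.value is None:
            return None  # a fresh but unsupported input makes C unsupported
        total += s.value
    return total


def trigger_state(c: Optional[int]) -> str:
    """Trigger 'C = 0': PROBLEM, OK, or UNKNOWN when C cannot be evaluated."""
    if c is None:
        return "UNKNOWN"
    return "PROBLEM" if c == 0 else "OK"


def should_alert(state: str) -> bool:
    """Treat UNKNOWN as alert-worthy instead of silently ignoring it."""
    return state in ("PROBLEM", "UNKNOWN")
```

In this model, two servers whose last samples are both older than `MAX_AGE` give C = 0 and a "PROBLEM" state instead of a misleading "OK", and a fresh unsupported input gives "UNKNOWN", which `should_alert` still reports.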
I know that "agent.ping(180)" can catch stale data, but this is a low-level item. I do not see how I could fit it into my calculated item, which checks the status of some complicated software on multiple servers. Furthermore, it cannot help solve the "Trigger UNKNOWN" issue.
I have the feeling that this is a very global problem with the UNKNOWN state of triggers, the ZBX_NOTSUPPORTED state and the staleness of items: how can I build a reliable monitoring solution if some common situations cannot be treated as "problems" by design?