ZABBIX Forums  

  #1  
13-04-2012, 14:11
Farzad FARID
Member
 
Join Date: Apr 2007
Location: Paris, France
Posts: 71
How can we create alerts or actions on "unknown triggers" & "stale items"?

Hi,

Do "UNKNOWN" triggers & "stale items" make Zabbix-based monitoring unreliable?

OK, the introduction is a bit harsh, but I think the subject really matters: I have two issues that have been discussed here already, but without a clear solution (none that I could find here or in the documentation, at least).
  • The fact that UNKNOWN triggers are ignored by the alerting components of Zabbix hides real problems in some common cases.
  • If for some reason the agents stop sending item values (for example, the monitored server is down), the "my_item.last(0)" value is stale but is still treated as valid. Therefore all calculated items or triggers depending on this stale value say everything is "OK" when the service is in fact down.

Unfortunately, it is easy to have a trigger go UNKNOWN or an item become stale.

Example:

I have a calculated item C which depends on 2 items, A & B, coming from 2 servers. The 2 servers run the same application for High Availability. The item value is 0 if both applications are down and > 0 if at least one application is working.

Therefore I created a trigger that is true if the calculated item's value is 0.

Scenario 1:

If item A takes too long to calculate, the Zabbix agent sends a "ZBX_NOTSUPPORTED" response to the server. Therefore item C can no longer be calculated and the trigger goes "UNKNOWN".

Since "UNKNOWN" is not a problem state and there is no way to catch unknown triggers (they are even invisible in the dashboard, except for a global counter), Zabbix cannot alert users that something may be wrong.
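
The failure chain in Scenario 1 can be sketched in Python. This is only an illustrative model, not Zabbix code: the names `calc_c`, `evaluate_trigger` and `should_alert` are hypothetical, `None` stands in for a ZBX_NOTSUPPORTED item, and C = A + B is an assumed formula (the post only says that C is 0 when both applications are down).

```python
from enum import Enum
from typing import Optional


class TriggerState(Enum):
    OK = "OK"
    PROBLEM = "PROBLEM"
    UNKNOWN = "UNKNOWN"


def calc_c(a: Optional[float], b: Optional[float]) -> Optional[float]:
    """Assumed calculated item C = A + B; None models a not-supported input."""
    if a is None or b is None:
        return None  # C cannot be calculated if either input is unsupported
    return a + b


def evaluate_trigger(c: Optional[float]) -> TriggerState:
    """Trigger from the post: PROBLEM when C is 0, UNKNOWN when C has no value."""
    if c is None:
        return TriggerState.UNKNOWN
    return TriggerState.PROBLEM if c == 0 else TriggerState.OK


def should_alert(state: TriggerState) -> bool:
    """Models the behaviour described above: only PROBLEM leads to an alert."""
    return state is TriggerState.PROBLEM


# Item A timed out (not supported), item B reports one working application.
state = evaluate_trigger(calc_c(None, 1))
print(state.value, should_alert(state))
```

In this model the trigger ends up UNKNOWN and no alert is raised, even though nothing is known about application A.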

Scenario 2:

If both servers go down, their Zabbix agents no longer run. Items A & B still have a "last(0)" value available, which says "everything's fine". Therefore item C will be calculated and have a value different from 0. The trigger remains "OK" whereas nothing works.
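
Scenario 2 comes down to reading the last stored value without looking at its age. The sketch below is a hypothetical Python model of that difference, not Zabbix internals: `Sample`, `last` and `fresh_last` are invented names, and the 180-second limit is just an example threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    value: float
    clock: float  # Unix timestamp at which the value was received


def last(sample: Sample) -> float:
    """Age-blind read: returns the stored value however old it is."""
    return sample.value


def fresh_last(sample: Sample, now: float, max_age: float) -> Optional[float]:
    """Age-aware read: returns None when the value is older than max_age seconds."""
    if now - sample.clock > max_age:
        return None
    return sample.value


# Both agents last reported "1 application running" at t=1000, then went down.
a = Sample(value=1, clock=1000)
b = Sample(value=1, clock=1000)
now = 2000

c_blind = last(a) + last(b)  # still > 0, so the trigger stays OK
stale = fresh_last(a, now, 180) is None and fresh_last(b, now, 180) is None
print(c_blind, stale)
```

The age-blind sum stays above 0 while the age-aware read reports that both inputs are stale, which is the information the trigger never gets.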

Question:

How can I have a working trigger that correctly informs me that a server is down (or not responding correctly) if all the trigger does is to switch to "UNKNOWN" state?

I know that "agent.ping.nodata(180)" can catch stale data, but this is a low-level item. I do not see how I could fit it into my calculated items that check the status of some complicated software on multiple servers. Furthermore, it cannot help solve the "trigger UNKNOWN" issue.

I have the feeling that this is a very global problem with the UNKNOWN state of triggers, the ZBX_NOTSUPPORTED state and the staleness of items: how can I build a reliable monitoring solution if some common situations cannot be treated as "problems" by design?

Regards

Last edited by Farzad FARID; 16-04-2012 at 11:06. Reason: Reformulate title
  #2  
16-04-2012, 16:46
Farzad FARID

I finally found out that there is already a Feature Request describing the very same problems I wrote about: https://support.zabbix.com/browse/ZBXNEXT-341.

But the ticket was opened two years ago and still has no fix, although it's the third most popular feature request on Zabbix (https://support.zabbix.com/browse/ZB...arissues-panel).

Are there any Zabbix users with large systems hit by this issue who wish to comment on it or provide some hints on how they worked around it?

Regards
  #3  
26-04-2012, 15:03
bob.todd
Junior Member
 
Join Date: Apr 2012
Posts: 1
Testimony

Hi there,
I'm a system engineer working for a big finance corporation. I'm trying to deploy a full-fledged Zabbix supervision platform as part of a global effort to raise awareness of open-source solutions.

I have been working on this project since the end of last year, and I must admit I'm a bit surprised by this "lack of feature"/"bug"/"design".
So much so that I didn't even verify that this part was working as expected.
My bad. I should have.

One of the means we found to compensate for this situation was to develop a kind of wrapper which ensures that, whatever value is collected, the value transmitted to the server complies with the expected format, so as not to freeze the item and its subsequent triggers.

It is ugly, and doesn't handle all situations well...
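
A wrapper of the kind described above might look roughly like the following Python sketch. It is a guess at the idea, not the actual wrapper: `safe_value`, the `-1` sentinel and the 5-second timeout are all hypothetical choices, and the trigger on the server side would have to treat the sentinel as a problem value.

```python
import re
import subprocess
import sys

SENTINEL = "-1"  # hypothetical "check failed" value, chosen outside the normal range
NUMERIC = re.compile(r"-?\d+(\.\d+)?")


def safe_value(cmd, timeout=5.0, sentinel=SENTINEL):
    """Run a check command and always return a numeric string.

    Any failure (timeout, missing command, non-zero exit code, non-numeric
    output) is mapped to the sentinel instead of an invalid value.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, OSError):
        return sentinel
    if result.returncode != 0:
        return sentinel
    text = result.stdout.strip()
    if NUMERIC.fullmatch(text) is None:
        return sentinel
    return text


if __name__ == "__main__":
    # Usage: wrapper.py <check command> [args...]
    print(safe_value(sys.argv[1:]) if len(sys.argv) > 1 else SENTINEL)
```

The trade-off is the one already mentioned: the item never becomes unsupported, but the real cause of the failure is flattened into a single sentinel value.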

To the Zabbix development team: please implement a correct way to handle this "unknown data" situation.

Best regards,
Bob Todd
  #4  
27-05-2012, 18:47
Pax
Junior Member
 
Join Date: May 2012
Posts: 1

Yes, it looks like Zabbix is not designed to be reliable in large environments. But you can always create nodata() triggers for each item. The trouble is that you get many notifications if a host goes down.
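
To illustrate the per-item nodata() approach, here is a small Python sketch that builds one trigger expression per item. It assumes the `{host:key.nodata(sec)}=1` expression syntax of the Zabbix versions discussed in this thread; the host name `web01` and the key `custom.app.status` are made up for the example.

```python
def nodata_expression(host: str, key: str, seconds: int) -> str:
    """Build a trigger expression that fires when `key` received no data for `seconds`."""
    return "{%s:%s.nodata(%d)}=1" % (host, key, seconds)


def nodata_triggers(host: str, keys, seconds: int = 180):
    """One nodata trigger expression per item key."""
    return [nodata_expression(host, key, seconds) for key in keys]


# Hypothetical host with three items.
keys = ["agent.ping", "system.cpu.load", "custom.app.status"]
for expression in nodata_triggers("web01", keys):
    print(expression)
```

With one trigger per item, a host that goes down turns every one of its items into a separate problem, which is where the flood of notifications comes from.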
  #5  
29-05-2012, 17:17
Farzad FARID

Quote:
Originally Posted by Pax
Yes, it looks like Zabbix is not designed to be reliable in large environments. But you can always create nodata() triggers for each item. The trouble is that you get many notifications if a host goes down.
Sure, and I think that creating all those triggers just to monitor Zabbix itself (or counterbalance its weaknesses) is counterproductive...

Zabbix 2.0 is now ready (congrats to the Zabbix team!) but there is still no progress on ticket ZBXNEXT-341. What's more, some recent patches like ZBXNEXT-522 go even further in hiding "unknown" triggers where the right choice should be to treat them as potentially real problems.

Are we still only a minority in believing that the mishandling of unsupported items (caused by timeouts or script errors, for example) and unknown triggers by Zabbix can make the whole supervision platform unreliable?

Regards
  #6  
31-05-2012, 14:04
MarkusL
Member
 
Join Date: Nov 2008
Posts: 41

Hi all!

We dug into this a while ago. From my experience, I can tell you: it's complicated and hard to manage! Zabbix does not have a built-in "I am consistent with all my monitored hosts and items" function. You have to manage this (for now) on your own.

Our situation is quite complex, as we work with many proxies (one per customer). Every proxy automatically starts an autossh session to our Zabbix server, through which all Zabbix traffic goes. WAN connections on our side and the customers' side can be ADSL / SDSL, sometimes X.21. Mostly it is ADSL...
Now, to see EXACTLY where a root cause comes from, we have to see the whole picture, starting from our server to a customer proxy and the network behind the proxy (which it is monitoring). In our example, in abstract form:
server - ups & firewall - wan(up) - customer-wan(up) - proxy-autossh ok - proxy-services ok - ups ok - backoffice-switches ok - host with basic services ok.

We do a great deal of "baseline monitoring" with a LOT of nodata checks, just to be sure of this point: everything I see in my Zabbix frontend is 100% what is going on in the real systems out there; I miss NOTHING.

As soon as we have checked all the single points between us (the Zabbix server that generates the triggers) and the customer hosts to be monitored, we start the "real" monitoring. All parts of the "real monitoring" templates depend on our baseline monitoring. If a baseline check goes down (e.g. the SNMP service on a Windows machine), the host gets a corresponding trigger and our Zabbix server sends ONLY ONE message, not 200x "SNMP item nodata"...
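
The "only one message" effect of making the real templates depend on baseline triggers can be modelled in a few lines of Python. This is a simplified sketch of dependency-based suppression, not the actual Zabbix logic; `Trigger`, `notifications` and the trigger names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trigger:
    name: str
    problem: bool
    depends_on: List[str] = field(default_factory=list)


def notifications(triggers: List[Trigger]) -> List[str]:
    """Names of triggers that notify: in PROBLEM and with no dependency in PROBLEM."""
    by_name = {t.name: t for t in triggers}
    return [
        t.name
        for t in triggers
        if t.problem and not any(by_name[d].problem for d in t.depends_on)
    ]


# One baseline trigger plus 200 SNMP nodata triggers that depend on it.
baseline = Trigger("snmp-service-down", problem=True)
snmp_items = [
    Trigger("snmp-item-%d-nodata" % i, problem=True, depends_on=[baseline.name])
    for i in range(200)
]
print(notifications([baseline] + snmp_items))
```

All 201 triggers are in a problem state here, but only the baseline trigger is reported.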

Our baseline-monitoring triggers all have the lowest severity and are not visible to our agents on the dashboard (that severity is not shown). Our real templates (nodata for every item) depend on these triggers. If a baseline-monitoring trigger stays active longer than, e.g., 60 seconds, a second trigger that IS visible to the agent is generated, because we say: if a host is not pingable or a service is not started for 60 seconds, this can't be a transient problem inside Zabbix (a communication timeout or anything else); it must be a real problem we have to look into.
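
The 60-second rule amounts to a hold time before a hidden baseline problem becomes a visible one. A minimal Python sketch of that rule follows, with a hypothetical function name and the 60 seconds taken from the example figure above:

```python
from typing import Optional

HOLD_SECONDS = 60.0  # the "e.g. 60 seconds" from the post


def visible_problem(
    baseline_problem_since: Optional[float],
    now: float,
    hold: float = HOLD_SECONDS,
) -> bool:
    """True once a baseline problem has lasted longer than `hold` seconds.

    `baseline_problem_since` is the Unix timestamp at which the hidden,
    lowest-severity baseline trigger fired, or None if it is not active.
    """
    if baseline_problem_since is None:
        return False
    return now - baseline_problem_since > hold


print(visible_problem(100.0, 130.0))  # 30 s: still treated as possibly transient
print(visible_problem(100.0, 161.0))  # 61 s: escalated to a visible problem
```

Short glitches such as a single communication timeout stay hidden, while anything that persists past the hold time is surfaced.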

The whole concept took around 3 weeks of heavy brainwork...
Because of this, all I wrote is just a "simple description"; it would take too long to discuss every detail here.


I would really love to discuss this with Alexei or someone from Zabbix who hopefully :-) has an interest in solving this very serious problem. In my eyes, this is Zabbix's biggest drawback.


Best regards,

Markus.

Last edited by MarkusL; 31-05-2012 at 14:08.
Tags
items, trigger, unknown
