Detecting down servers? Surely missing something simple...

  • cheezus
    Member
    • Nov 2011
    • 35

    #1

    Surely we must be missing something simple here...

    If a server goes down, all our triggers go into an "UNKNOWN" state, which is a state we cannot alert on.

    For example, we have an item "Is Zabbix agent running", proc.num[zabbix_agentd].

    And a trigger: {MM_Template_Zabbix_with_zagent:proc.num[zabbix_agentd].last(0)}<1

    But if the service is stopped, the trigger comes back UNKNOWN and doesn't fire an alert.

    Likewise, we ping the host IP, but if the host is down that trigger also goes into an UNKNOWN state and again doesn't fire an alert.

    So how do you actually detect that a host is down when the items/triggers go into "UNKNOWN" status and you cannot alert on that?
    Last edited by cheezus; 02-02-2012, 15:54.
  • QwErTy_LoGiC
    Member
    • Feb 2010
    • 66

    #2
    Check something else...

    From what I see in the item on which you are basing your trigger, you will never have a correct result.

    Your item is:
    proc.num[zabbix_agentd]

    Which roughly translates to asking the Zabbix agent to count the number of running zabbix_agentd processes...

    But you want to trigger if the value returned is less than one...

    ...but how can the zabbix agent return anything if there is less than one zabbix_agentd process active?

    A common method of checking the "heartbeat", if you will, is the nodata function combined with the agent.ping item or some trapper item.
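To illustrate the difference (a simplified model, not Zabbix's actual implementation): a value check such as last(0)<1 can only react to values that actually arrive, while nodata() reacts to their absence. The sketch keeps the last stale value instead of modelling the UNKNOWN state; `last` and `nodata` here are hypothetical Python helpers named after the trigger functions, and the timings are made up.

```python
from typing import List, Optional, Tuple

# (unix timestamp in seconds, value) pairs for one item
Sample = Tuple[float, float]

def last(samples: List[Sample]) -> Optional[float]:
    """Most recent value, or None if the item never reported anything."""
    return samples[-1][1] if samples else None

def nodata(samples: List[Sample], period: float, now: float) -> int:
    """1 if no sample arrived within the last `period` seconds, else 0."""
    recent = [ts for ts, _ in samples if now - ts <= period]
    return 0 if recent else 1

# Hypothetical history: the agent reports every 60 s, then dies after t=300.
samples = [(float(t), 1.0) for t in range(0, 301, 60)]
now = 600.0

# proc.num[zabbix_agentd].last(0)<1: the last known value is still 1, so no alert.
print(last(samples) < 1)               # False
# agent.ping.nodata(180)=1: nothing received for 300 s, so the alert fires.
print(nodata(samples, 180, now) == 1)  # True
```

The design point is that a heartbeat check has to be phrased in terms of missing data, because a dead agent cannot report a value that says it is dead.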


    • cheezus
      Member
      • Nov 2011
      • 35

      #3
      Right, I get that you can't ask the Zabbix agent to report back < 1 processes when it isn't running. The issue, which seems really stupid, is that when the Zabbix agent is down the trigger goes into an "UNKNOWN" state. That by itself is fine. The problem is that you can't create an action that alerts on "UNKNOWN", so currently there is no way to fire an alert if the Zabbix agent isn't running.

      We'll give a nodata trigger on the ping a shot. But this "UNKNOWN" trigger status thing seems crazy. I found in the nxt tracker that it has been requested for 2 years now, with upvotes and plenty of comments, yet no action has been taken.

      What's the point of a monitoring system that can't alert you when the monitoring method fails?
      Last edited by cheezus; 02-02-2012, 19:12.


      • cheezus
        Member
        • Nov 2011
        • 35

        #4
        So it looks like:

        Code:
        {MM_Template_Linux_with_zagent:agent.ping.nodata(180)}=1
        is working for us after lots of fiddling. And by fiddling I mean:

        - nodata(0) did not and does not work for us.

        - actions with the condition Trigger severity >= "Warning" never fired, even though the trigger is a "Disaster". Changing the condition to Trigger severity = "Disaster" works.

        - we had to unlink our master template (with linked templates) from all the hosts and relink them.
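For reference, Zabbix orders trigger severities numerically from Not classified (0) up to Disaster (5), so a "Trigger severity >= Warning" condition should in principle match a Disaster trigger. The hypothetical helper below models that comparison; the operator set is an assumption for illustration, not a copy of Zabbix's action-condition code.

```python
import operator
from enum import IntEnum

class Severity(IntEnum):
    """Zabbix trigger severities and their numeric codes."""
    NOT_CLASSIFIED = 0
    INFORMATION = 1
    WARNING = 2
    AVERAGE = 3
    HIGH = 4
    DISASTER = 5

# Assumed comparison operators for a "Trigger severity" action condition.
OPS = {"=": operator.eq, "<>": operator.ne, ">=": operator.ge, "<=": operator.le}

def severity_condition_matches(trigger: Severity, op: str, threshold: Severity) -> bool:
    """Hypothetical check: does `trigger <op> threshold` hold?"""
    return OPS[op](trigger, threshold)

print(severity_condition_matches(Severity.DISASTER, ">=", Severity.WARNING))  # True
print(severity_condition_matches(Severity.DISASTER, "=", Severity.DISASTER))  # True
```

Under this ordering both conditions should match a Disaster trigger, which is why the ">= Warning" behaviour described in the list above is surprising.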

        I still think it's important to be able to fire actions on "UNKNOWN". People who say that's going to cause a lot of alerts don't have a case if each person can choose whether to be alerted or not. Plus, in an action you can refine that to be much more specific.

        What I can't understand is why none of the templates we found in our search had this trigger in them. There must be a lot of people out there not getting alerts when a host or agent goes down and taking a beating.
