Nodata false alarms when server down

  • Linwood
    Senior Member
    • Dec 2013
    • 398

    #1

    I feel like I'm missing something obvious here. We have a number of devices with odd (or seemingly odd) alarm structures: rather than a fixed OID to poll for an alarm, they have tables of alarm conditions where the OID simply doesn't exist if the alarm is not active. To find these we use LLD to create item prototypes for these temporary OIDs, trigger alarms when they appear, and clear the alarms when they no longer exist.

    To detect the no-longer-exists case I use nodata(). Here's an example:

    Code:
    ( {TRIGGER.VALUE} = 1 and {Template SNMP NEC iPaso Device:severity[{#INDEX}].nodata(180)}=0 ) 
      or 
    ( {TRIGGER.VALUE} = 0 and {Template SNMP NEC iPaso Device:severity[{#INDEX}].last(#1)}=1 and 
      {Template SNMP NEC iPaso Device:time[{#INDEX}].nodata(180)}>=-1
    )
    In English, it fires when the severity is 1 and there is recent data. If the SNMP table item vanishes for a while, the nodata() clears the alarm. (OK, I don't remember why I did >= -1 in one case, but I think that is not an issue.)
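
    The expression above can be read as a small state machine. Here is a minimal Python sketch of that reading, not Zabbix code: the function name and argument names are made up for illustration, and each argument stands for the already-evaluated result of the corresponding trigger function.

```python
def next_trigger_value(current, severity_last, severity_nodata, time_nodata):
    """Toy model of the trigger expression; returns 1 (PROBLEM) or 0 (OK).

    current         -- {TRIGGER.VALUE}, the state before this evaluation
    severity_last   -- severity[...].last(#1)
    severity_nodata -- severity[...].nodata(180): 1 if no data arrived, else 0
    time_nodata     -- time[...].nodata(180): 1 if no data arrived, else 0
    """
    # First branch: already in PROBLEM, stay there while severity data keeps arriving.
    keep_problem = current == 1 and severity_nodata == 0
    # Second branch: currently OK, raise when the last severity is 1.
    # nodata() yields 0 or 1, so ">= -1" holds whenever it can be evaluated.
    raise_problem = current == 0 and severity_last == 1 and time_nodata >= -1
    return 1 if (keep_problem or raise_problem) else 0


if __name__ == "__main__":
    print(next_trigger_value(0, 1, 0, 0))  # alarm row appears with severity 1
    print(next_trigger_value(1, 1, 1, 1))  # row vanished, no data for 180 s
```

    In this model the first branch does not look at the severity value at all, so once raised the trigger only clears through nodata().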

    The manual says the nodata() expression gives an error if the server was restarted within that interval, so I thought I was good: it would not be processed and would not clear. But at least sometimes, when the server comes back while an alert was present, it sends an "OK" (presumably from lack of data due to lack of polling), then immediately sends the alert out again.

    My guess is there are some race issues here, e.g. (making this up) the server was up 170 seconds ago, but the poll interval or queue or some such was such that no new data has been posted for the item, even though it runs on (if I recall) a 120-second polling interval.
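
    To make that guess concrete, here is a toy timeline in Python. It is only a sketch of the suspected race, not how Zabbix implements nodata(): the function is a simplified stand-in, and all timestamps (polls at 0 and 120 s, an outage, a first poll after the outage at 330 s) are hypothetical.

```python
def nodata(sample_times, now, window):
    """Simplified stand-in: 1 if no sample landed in (now - window, now], else 0."""
    return 0 if any(now - window < t <= now for t in sample_times) else 1


if __name__ == "__main__":
    # Hypothetical timeline: polls at 0 s and 120 s, then the server is down
    # and the first poll after it comes back lands at 330 s.
    before_first_poll = nodata([0, 120], now=310, window=180)
    after_first_poll = nodata([0, 120, 330], now=330, window=180)
    print(before_first_poll, after_first_poll)
```

    With these made-up numbers, an evaluation between the server coming back and the first fresh poll sees an empty 180-second window, which would match an "OK" followed right away by a new alert.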

    Hard to set up controlled tests of this as it is intermittent (again making me think it is a race condition).

    But... I can work on the race condition if that's it.

    My question is this: Is there a better approach to monitoring such "disappearing" entities, where disappearance itself is the clear condition? I hate to rewrite them (there are quite a few templates), but the number of bogus alarms has been annoying lately (doing lots of upgrades and changes, so lots of brief outages).

    My one thought was to turn them into external checks, where I can return a value for "no such OID" instead of a failure and no data. Painful. I don't see that Item Preprocessing allows you to manage poll failures.
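
    A rough sketch of what the core of such an external check might look like, in Python. It assumes the script shells out to Net-SNMP's snmpget with -Oqv over SNMPv2c and only has to interpret the text that comes back; the function name, the marker list, and the choice of "0" for "alarm absent" are all assumptions for illustration.

```python
# Phrases Net-SNMP prints when an SNMPv2c agent reports that the OID is absent.
ABSENT_MARKERS = ("No Such Instance", "No Such Object")


def severity_or_default(snmpget_output, absent_value="0"):
    """Map 'OID does not exist' to a real value; pass any other value through."""
    text = snmpget_output.strip()
    if any(marker in text for marker in ABSENT_MARKERS):
        # The device answered and the alarm row is gone: report "no alarm".
        return absent_value
    if not text:
        # No answer at all (timeout, device down): keep this a poll failure
        # so an outage is not mistaken for a cleared alarm.
        raise ValueError("empty snmpget output; treat as a failed poll")
    return text


if __name__ == "__main__":
    print(severity_or_default("No Such Instance currently exists at this OID\n"))
    print(severity_or_default("1\n"))
```

    The point of the split is that "device answered, OID missing" becomes data the trigger can compare against, while a poll that got no answer still produces no data.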

    Any other thoughts?
  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    What should that trigger expression express? What kind of state?
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates

    • Linwood
      Senior Member
      • Dec 2013
      • 398

      #3
      I'm not sure I quite understand but let me elaborate.

      The severity item is either absent (OID not present, alarm is clear), or a value like 1-4 to indicate severity of an alarm.

      The time is the time the alarm was created. The reason the time item is present in the trigger is to make it visible in the macros for reporting inside the email; it is not really there to control the alarm state.

      So in English, the trigger should be active if severity[{#INDEX}] is 1, and should clear when it is either some other value or the OID is not present.
