I feel like I'm missing something obvious here. We have a number of devices with odd alarm structures (or so they seem): rather than a fixed OID to poll for an alarm, they have tables of alarm conditions where the OID simply doesn't exist if the alarm is not active. To find these, we use LLD to create item prototypes for these temporary OIDs, fire triggers when they appear, and clear the alarms when they no longer exist.
To handle the no-longer-exists part I use nodata(). Here's an example:
Code:
( {TRIGGER.VALUE} = 1 and {Template SNMP NEC iPaso Device:severity[{#INDEX}].nodata(180)}=0 )
or
( {TRIGGER.VALUE} = 0 and {Template SNMP NEC iPaso Device:severity[{#INDEX}].last(#1)}=1 and
{Template SNMP NEC iPaso Device:time[{#INDEX}].nodata(180)}>=-1
)
In English: it fires when the severity is 1 and the item has had data recently; if the SNMP table item vanishes for a while, the nodata() clears the alarm. (OK, I don't remember why I used >= -1 in one case, but I don't think that's an issue.)
The manual says the nodata() expression gives an error if the server was restarted within that interval, so I thought I was safe: the trigger would not be processed and would not clear. But at least sometimes the server comes back while an alert is present, sends an "OK" (presumably from lack of data due to lack of polling), then immediately sends the alert out again.
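One mitigation I'm considering (a sketch only: the 600-second window is my guess at "longer than a restart plus a couple of missed 120-second polls", and I've dropped the always-true >= -1 clause):
Code:
( {TRIGGER.VALUE} = 1 and {Template SNMP NEC iPaso Device:severity[{#INDEX}].nodata(600)}=0 )
or
( {TRIGGER.VALUE} = 0 and {Template SNMP NEC iPaso Device:severity[{#INDEX}].last(#1)}=1 )
That doesn't remove the race, it just makes the window wider than any restart I expect to see, at the cost of slower clears.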
My guess is there are some race issues here, e.g. (making this up) the server came back up 170 seconds ago, but the poll interval or queue or some such meant no new data had been posted for the item, even though it runs on (if I recall) a 120-second polling interval.
It's hard to set up controlled tests of this since it's intermittent (again making me think it's a race condition).
But... I can work on the race condition if that's it.
My question is this: is there a better approach to monitoring such "disappearing" entities, where the disappearance itself is the clear condition? I hate to rewrite them (there are quite a few templates), but the number of bogus alarms has been annoying lately (we're doing lots of upgrades and changes, so lots of brief outages).
My one thought was to turn them into external checks, where I can return a value for "no such OID" instead of a failure and no data. Painful. I don't see that item preprocessing lets you handle poll failures?
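Roughly what I have in mind (a sketch only; the script name and argument order are made up, and it leans on Net-SNMP's snmpget printing "No Such Instance"/"No Such Object" for missing v2c OIDs):
Code:
#!/usr/bin/env python3
# Hypothetical external check (name and arguments are mine, not an existing
# script): prints the item's value, or 0 when the OID does not exist, so the
# item always returns data and the trigger never needs nodata().
# Usage from Zabbix: snmp_or_zero.py[{HOST.CONN},{$SNMP_COMMUNITY},<oid>]
import subprocess
import sys

def main() -> int:
    host, community, oid = sys.argv[1:4]
    try:
        out = subprocess.run(
            ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid],
            capture_output=True, text=True, timeout=10,
        )
    except subprocess.TimeoutExpired:
        return 1  # agent unreachable: no output, so the item goes unsupported

    text = (out.stdout + out.stderr).strip()
    # Net-SNMP reports a missing v2c varbind as "No Such Instance ..." or
    # "No Such Object ..."; map that to an explicit 0 (alarm not active).
    if "No Such Instance" in text or "No Such Object" in text:
        print(0)
        return 0
    if out.returncode != 0:
        return 1  # genuine SNMP failure (wrong community, network error, ...)
    print(text)  # alarm row exists: emit its value (e.g. the severity)
    return 0

if __name__ == "__main__":
    sys.exit(main())
The trigger then becomes a plain last() comparison (fires on 1, clears on 0), and a restart can't produce a bogus clear because there's no nodata() to misfire; the downside is maintaining the script and rewriting all the item keys.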
Any other thoughts?