**Linwood** · 19-07-2016, 16:22

I'm not completely following what you mean by "first check". Do you mean that the first data item collected raises an alert with the second trigger you showed?

The second expression says "if the service has been up at all for 5 minutes". If you want "the service has been up for all of the 5 minutes", use min. Bear in mind that a period with no polling leaves the interval empty, so the very next single item will be all that the function sees in the 5 minute period. If you want to make sure the host was polled throughout the 5 minutes, you'll also need to add something like a count to allow for cases where it was not polled (Zabbix down, host down, first discovery poll).
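To make the difference concrete, here is a small Python sketch (not Zabbix code) of what max, min and count each say about the same window of polls. The function names, the sample values and the 60 second polling interval are all made up for illustration.

```python
# Illustrative only: mimics the meaning of max(5m), min(5m) and count(5m)
# over a list of polled values (1 = service up, 0 = service down).

def up_at_all(values):
    """Like max(5m)=1: at least one poll in the window saw the service up."""
    return bool(values) and max(values) == 1


def up_whole_window(values):
    """Like min(5m)=1: every poll in the window saw the service up."""
    return bool(values) and min(values) == 1


def fully_polled(values, expected_polls):
    """Like a count check: the window holds as many polls as we expect."""
    return len(values) >= expected_polls


# Hypothetical windows, assuming 60 second polling (5 polls per 5 minutes).
flapping = [1, 0, 1, 1, 0]
after_gap = [1]  # polling just resumed, so only one value is in the window

print(up_at_all(flapping), up_whole_window(flapping))
print(up_whole_window(after_gap), fully_polled(after_gap, 5))
```

The single value in after_gap passes the min test on its own, which is why the extra count check matters after a polling gap.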

Finally, the normal way to prevent flapping is to build in a more restrictive test for the trigger to return to OK. For example, if you want to alert when a service is down at all, you might build in a second check so that it does not return to OK until the service has been up steadily for 5-10 minutes, using {TRIGGER.VALUE}, e.g.

({TRIGGER.VALUE}=0 and stuff.last(0)=0) or
({TRIGGER.VALUE}=1 and stuff.min(10m)=0)

This sort of structure would alert as soon as a service shows down, and not mark it OK until it has been up for 10 minutes.
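If it helps to see the hysteresis play out poll by poll, here is a rough Python simulation of the behaviour just described: go to PROBLEM on the first down sample, and return to OK only once every sample in the recovery window is up. It is a sketch, not how Zabbix evaluates triggers internally; the function names are invented, and the window is counted in polls rather than minutes to keep it short.

```python
# Illustrative simulation of a hysteresis trigger.
# Trigger value: 0 = OK, 1 = PROBLEM. Sample value: 1 = up, 0 = down.

def evaluate(trigger_value, history, recovery_window):
    """Return the new trigger value given the samples seen so far."""
    last = history[-1]
    # Stand-in for min() over the recovery period, counted in polls.
    window_min = min(history[-recovery_window:])
    if trigger_value == 0:
        # Currently OK: fire as soon as the latest sample is down.
        return 1 if last == 0 else 0
    # Currently PROBLEM: stay there while any recent sample is down.
    return 1 if window_min == 0 else 0


def run(polls, recovery_window):
    """Feed polls one at a time and record the trigger value after each."""
    state = 0
    history = []
    states = []
    for value in polls:
        history.append(value)
        state = evaluate(state, history, recovery_window)
        states.append(state)
    return states


# One down sample, then steady recovery; window of 3 polls.
print(run([1, 1, 0, 1, 1, 1], recovery_window=3))
```

With a recovery window of 3 polls, the trigger stays in PROBLEM for the two polls after the service comes back and only clears on the third.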

**bildz** · 19-07-2016, 16:37

Linwood,

Thanks for the response. When I refer to the first check, I'm referring to the first poll performed. Say I initiated a stop to the service and the service check polls every 60 seconds. After that first poll, I received an alert that the service was down.

I appreciate the hysteresis example below, but I'm not sure that it completely resolves the issue I've presented. Please keep in mind that I'm coming from over 18 years of experience using Netsaint/Nagios.

Here is the situation I would like to address:

The service stops at 9:00 AM, but is restarted automatically to refresh configuration changes. I'm receiving a notification that the service has stopped and then another that it has started again. What I'd like is this: if the service has been down for 5 minutes, send an alert; otherwise treat the service as OK. This will ensure that routine restarts of the service do not generate alerts, as there are many instances in the environment where a configuration is refreshed and requires a service restart.

It's been challenging to understand the true power of triggers. I work best with examples, and I've had little luck locating any.

Thanks,

bildz

**Linwood** · 19-07-2016, 17:12

Originally posted by bildz

The service stops at 9:00 AM, but is restarted automatically to refresh configuration changes. I'm receiving a notification that the service has stopped and then another that it has started again. What I'd like is this: if the service has been down for 5 minutes, send an alert; otherwise treat the service as OK. This will ensure that routine restarts of the service do not generate alerts, as there are many instances in the environment where a configuration is refreshed and requires a service restart.

There are actually two different approaches worth mentioning. I'll start with the less common one, which I rather like:

You can let the trigger fire even for brief outages, but have the alert itself wait for a period to see whether the problem clears. This logs an event in the system that is then easy to see (i.e. there was downtime), but no alert is sent. This is done by skipping a step in the alert action: you set it up so that the 2nd step, the one that sends the alert, occurs (say) 5 minutes into the event. This is like an escalation step, but with no initial alert. The advantage is that this step never fires if the event clears to OK before that time. I do this for host outages over bad WAN circuits, since I want to log outages, but I really do not want a tech to jump on the issue if it comes back in 1 minute or so.
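The timing of that delayed step can be sketched in a few lines of Python. This is only a model of the idea, not Zabbix's escalation code: it assumes every step has the same duration, so step N starts (N - 1) step durations into the event, and it treats an outage that lasts exactly up to the step start as alerting, which is a simplification. The function names are made up.

```python
# Illustrative model of an alert action whose first real step is delayed.

def step_start_offset(step, step_duration):
    """Seconds into the event at which the given step starts.

    Assumes uniform step durations, so step 1 starts at 0 seconds.
    """
    return (step - 1) * step_duration


def alert_sent(outage_seconds, alert_step, step_duration):
    """True if the outage lasts long enough to reach the alerting step."""
    return outage_seconds >= step_start_offset(alert_step, step_duration)


# Alert on step 2 with 300 second steps, i.e. 5 minutes into the event.
print(step_start_offset(2, 300))
print(alert_sent(60, alert_step=2, step_duration=300))   # quick restart
print(alert_sent(600, alert_step=2, step_duration=300))  # real outage
```

In both cases the event itself is still logged; only the notification is held back.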

The other way is in the trigger itself. I think you were pretty much on the right track; you just need to reverse the sense.

{Template Windows Service Discovery:service_state[{#SERVICENAME}].max(5m)}=0

This will only fire if the service is down for a full 5 minutes. I think that's what you are describing: you don't want to get an alert for brief outages, right?

This has one edge condition: the host is unreachable for 5 minutes, then becomes reachable with the service down (as on a reboot after an outage). If the service is not up on the very first poll, and there are no other polls in that interval, the trigger alerts; then, if the service has come up by the next poll, it immediately goes OK. To work around that, you probably want something like:

{Template Windows Service Discovery:service_state[{#SERVICENAME}].max(5m)}=0 and
{Template Windows Service Discovery:service_state[{#SERVICENAME}].count(5m)}>2

Use whatever number ensures that the polling period is long enough to be representative and allows for the server to reboot fully.
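Here is a rough Python model of that combined check, in case it helps to see the edge case. It is only a sketch of the logic, not Zabbix's evaluation: samples are hypothetical (timestamp, value) pairs, the function name and parameters are invented, and an empty window is simply treated as "do not fire".

```python
# Illustrative model of "max(5m)=0 and count(5m)>2".
# Sample value: 1 = service up, 0 = service down. Times are in seconds.

def should_fire(samples, now, window=300, min_count=2):
    """True if every poll in the window is down and there are enough polls.

    samples: list of (timestamp, value) pairs.
    min_count: the window must hold MORE than this many polls.
    """
    recent = [value for ts, value in samples if now - window < ts <= now]
    if not recent:
        return False  # nothing polled in the window, nothing to judge
    return max(recent) == 0 and len(recent) > min_count


# Down for five consecutive 60 second polls: fires.
down_5m = [(0, 0), (60, 0), (120, 0), (180, 0), (240, 0)]
print(should_fire(down_5m, now=240))

# Host just became reachable, one poll with the service still down.
after_outage = [(1000, 0)]
print(should_fire(after_outage, now=1000))               # guarded: no alert
print(should_fire(after_outage, now=1000, min_count=0))  # max() alone fires
```

The last call shows what the count guard buys you: with max alone, a single down poll after an unreachable period is enough to fire.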

The downside of this is that if you have services that restart on crash and they start crashing, you will not notice, because they may always be back up and running well inside the 5 minute period.

P.S. I should add that with regard to the first case, you are probably recording the item value over time, so logging the event is to some degree unnecessary: you can just review the item data looking for outages. The problem is that there's no great way in Zabbix to "look for any 0 in a few weeks of 1"; on a graph it often cannot be seen unless you zoom in. The events, on the other hand, have easy reporting, even a top 100.

**bildz** · 19-07-2016, 21:12

I'm intrigued by the first suggestion:

"You can let the trigger fire for even brief outages, but you can then have the alert itself wait a period to see if the problem is restored. This logs an event in the system that is then easy to see (i.e. there was downtime) but no alert is sent. This is done by skipping a step in the alert action; you set so the 2nd step occurs at (say) 5 minutes into the event to send the alert. This is like an escalation step, but with no initial alert. The advantage is that this step never fires if the event clears to OK before the time. I do this for host outages over bad WAN circuits, since I want to log outages, but I really do not want a tech to jump on the issue if it comes back in 1 minute or so."

I'm attempting to configure the action, for a specific application defined as Service, and I am unsure where you set the delayed time to kick off the email. Could you offer some advice on how to best set this up? Thanks!

Attached Files

**Linwood** · 19-07-2016, 21:36

Something like this (attached screenshot). Note that it says step 2, with the default step duration being 120 seconds, so it starts at 2 minutes. I'm not on the system where I had this working; this was a 2.4 system I just had handy, but I think this should work.

I don't think I had to have a real do-nothing first step, but if you do need one, just make it send mail to no one or some such.

Essentially this is just the escalation feature of Zabbix, abused a bit.

Attached Files

**bildz** · 19-07-2016, 21:41

Thanks! I've got this configured and I'll give it a shot. Appreciate your assistance!

**bildz** · 19-07-2016, 22:48

The solution works perfectly! I can see the event come in and it recovers, without sending an email. If I leave the service stopped for longer than 5 minutes, the alert is generated. Appreciate the assistance!

**Linwood** · 19-07-2016, 23:10

No problem, glad it worked out.

Windows Service Flapping