Windows Service Flapping

  • bildz
    Junior Member
    • Aug 2015
    • 6

    #1

    Windows Service Flapping

    I've been struggling with an issue where my Windows services are flapping. Using the Windows Discovery Template (LLD), I've been able to detect new services, which I love from an integrity perspective, but I loathe the notifications when they stop and start. I've put in a means to restart the service, but what I'd like is to avoid being notified when it stops and starts again.

    Alert if service stops
    {Template Windows Service Discovery:service_state[{#SERVICENAME}].last(0)}>0

    What I'd like and have gleaned from the Zabbix docs is

    If the service state stays above 0 for 5 minutes, alert.
    {Template Windows Service Discovery:service_state[{#SERVICENAME}].max(5m)}>0

    I'm still getting emails as soon as the service stops on the first check, so I know it's not working. Any assistance is much appreciated.

    Thanks,

    bildz
  • Linwood
    Senior Member
    • Dec 2013
    • 398

    #2
    I'm not completely following when you say "first check", do you mean the first data item collected alerts with the second trigger you showed?

    The second expression says "the service has been in a non-running state at some point in the last 5 minutes", because service_state returns 0 while the service is running and max() only needs one non-zero value in the window. If you want "the service has been down for all of the 5 minutes", use min instead. Bear in mind that a period with no polling leaves the interval empty, so the very next single value will be all that the function sees in the 5 minute period. If you want to make sure the host was polled throughout the 5 minutes, you'll also need to add something like a count to allow for cases where the host was not polled (Zabbix down, host down, first discovery poll).
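    To make the difference between the two window functions concrete, here is a minimal Python sketch (not Zabbix code). It assumes non-negative integer samples; the helper names are made up for illustration, and treating an empty window as "no alert" is an assumption of the sketch, not a statement about how Zabbix evaluates it.

```python
from typing import Sequence


def any_nonzero(window: Sequence[int]) -> bool:
    """Mirrors max(window) > 0: true if at least one sample is non-zero."""
    return len(window) > 0 and max(window) > 0


def all_nonzero(window: Sequence[int]) -> bool:
    """Mirrors min(window) > 0: true only if every sample is non-zero."""
    return len(window) > 0 and min(window) > 0


def all_nonzero_with_enough_polls(window: Sequence[int], min_polls: int) -> bool:
    """Same as all_nonzero, but also requires a minimum number of samples."""
    return len(window) >= min_polls and all_nonzero(window)


# One non-zero sample among zeros: the max-style test is true, the min-style one is not.
full_window = [0, 0, 3, 0, 0]
# After a polling gap the window may hold a single sample, which decides everything.
after_gap = [3]
```

    With `after_gap`, `all_nonzero` is true on a single sample, while `all_nonzero_with_enough_polls(after_gap, 3)` is false. That is the case the extra count check is meant to cover.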

    Finally, the normal way to prevent flapping is to build in a more restrictive test for a trigger to return to OK. For example, if you want to alert if a service is down at all, you might build in a second check so it does not return to OK until it has been up steadily for 5-10 minutes, by using {TRIGGER.VALUE}, e.g.

    ({TRIGGER.VALUE}=0 and stuff.last(0)=0 ) or
    ({TRIGGER.VALUE}=1 and stuff.min(10m)=0 )

    This sort of structure would alert as soon as a service shows down, and not mark it OK until it has been up for 10 minutes.
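    The behaviour described above can be simulated in a few lines of Python. This is only a sketch of the idea, not Zabbix code: the function name is hypothetical, samples use 1 for up and 0 for down as in the "stuff" example, and the 10 minute window is replaced by a number of polls. Requiring a full window of samples before recovering is an extra assumption of the sketch.

```python
from typing import Iterable, List


def hysteresis_states(samples: Iterable[int], recovery_polls: int) -> List[int]:
    """Return the trigger value after each poll (1 = PROBLEM, 0 = OK).

    samples: 1 means the service is up, 0 means it is down.
    recovery_polls: consecutive "up" polls required before returning to OK.
    """
    state = 0  # start in OK
    history: List[int] = []
    states: List[int] = []
    for sample in samples:
        history.append(sample)
        recent = history[-recovery_polls:]
        if state == 0:
            # OK -> PROBLEM as soon as a single poll sees the service down.
            if sample == 0:
                state = 1
        else:
            # PROBLEM -> OK only after a full window of "up" polls.
            if len(recent) == recovery_polls and min(recent) == 1:
                state = 0
        states.append(state)
    return states


polls = [1, 0, 1, 1, 0, 1, 1, 1, 1]
print(hysteresis_states(polls, recovery_polls=3))
# prints [0, 1, 1, 1, 1, 1, 1, 0, 0]
```

    The two short outages produce a single PROBLEM period instead of two separate problem/recovery pairs.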


    • bildz
      Junior Member
      • Aug 2015
      • 6

      #3
      Linwood,

      Thanks for the response. When I refer to the first check, I'm referring to the first poll performed. Say I initiated a stop to the service and the service check polls every 60 seconds. After that first poll, I received an alert that the service was down.

      I appreciate the hysteresis example below, but am not sure that it completely resolves the issue I've presented. Please keep in mind that I'm coming from over 18 years experience using Netsaint/Nagios.

      Here is the situation I would like to address:

      The service stops at 9:00 AM, but is restarted automatically to refresh configuration changes. I'm seeing issues where I'll receive a notification that the service has stopped and then started again. What I'd like is this: if the service has been down for 5 minutes, send an alert; otherwise treat the service as OK. This will ensure that general restarts of the service do not get alerted, as there are many instances in the environment where a configuration is refreshed and requires a service restart.

      It's been challenging to understand the true power of triggers. I work best with examples, which I've had little luck finding.

      Thanks,

      bildz



      Last edited by bildz; 19-07-2016, 17:08.


      • Linwood
        Senior Member
        • Dec 2013
        • 398

        #4
        Originally posted by bildz
        Service stops at 9:00 AM, but is restarted automatically to refresh configuration changes. I'm seeing issues where I'll receive a notification that the service has stopped and then started again. What I'd like to remediate is this.. If the service has been down for 5 minutes, send an alert, otherwise the service is ok. This will ensure that general restarts of the service do not get alerted, as there are many instances in the environment where a configuration is refreshed and requires a service restart.
        There are actually two different approaches to this worth mentioning. The less common one (but one I kind of like) first:

        You can let the trigger fire for even brief outages, but then have the alert itself wait a period to see if the problem clears. This logs an event in the system that is then easy to see (i.e. there was downtime), but no alert is sent. This is done by skipping a step in the alert action; you set it so the 2nd step occurs at (say) 5 minutes into the event to send the alert. This is like an escalation step, but with no initial alert. The advantage is that this step never fires if the event clears to OK before the time. I do this for host outages over bad WAN circuits, since I want to log outages, but I really do not want a tech to jump on the issue if it comes back in 1 minute or so.
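        As a rough model of that delayed step (plain Python, not anything Zabbix runs), the decision boils down to comparing the recovery time with the time of the first step that actually sends something. The function name and the use of seconds since the problem started are assumptions made for the sketch; what happens when recovery lands exactly on the step time is also just a choice made here.

```python
from typing import Optional


def notification_time(problem_start: int,
                      recovery_time: Optional[int],
                      delay: int) -> Optional[int]:
    """Return when the alert would be sent, or None if no alert goes out.

    problem_start: time the trigger went to PROBLEM (seconds).
    recovery_time: time the trigger returned to OK, or None if still down.
    delay: how far into the event the first sending step is scheduled.
    """
    send_at = problem_start + delay
    # The event is logged either way; only the notification is suppressed
    # when the problem clears before the delayed step is reached.
    if recovery_time is not None and recovery_time <= send_at:
        return None
    return send_at


# Service restarted after 60 seconds: event recorded, nobody is notified.
print(notification_time(0, 60, 300))    # None
# Service still down after 5 minutes: the alert goes out at 300 seconds.
print(notification_time(0, None, 300))  # 300
```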

        The other way is in the trigger itself, and I think you were pretty much on the right track; you just need min instead of max. Since service_state returns 0 while the service is running (which is why your original trigger tests for >0), max(5m)>0 is true as soon as a single poll in the window sees the service not running, whereas min(5m)>0 is true only if every poll in the window does.

        {Template Windows Service Discovery:service_state[{#SERVICENAME}].min(5m)}>0

        This will only fire if the service is down for a full 5 minutes. I think that's what you are describing: you don't want to get an alert for brief outages, right?

        This has one edge condition: the host is unreachable for 5 minutes, then becomes reachable, and the service is down at that point (like on a reboot after an outage). If the service is not up on the very first poll (and there are no other polls in that interval) it alerts, then if the service has come up by the next poll, it immediately goes OK. To work around that, you probably want something like:

        {Template Windows Service Discovery:service_state[{#SERVICENAME}].min(5m)}>0 and
        {Template Windows Service Discovery:service_state[{#SERVICENAME}].count(5m)}>2

        Or whatever number you want to ensure you had enough of a polling period to be representative and allow for the server to reboot fully.
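        The combined condition can be sketched in Python as "every poll in the window saw the service down, and there were enough polls to trust that". This is an illustration only: `should_alert` and `is_down` are hypothetical names, and which raw value counts as "down" is passed in as a predicate because it depends on the item. The usage below assumes that 0 means running and anything else means not running.

```python
from typing import Callable, Sequence


def should_alert(window: Sequence[int],
                 is_down: Callable[[int], bool],
                 min_polls: int = 3) -> bool:
    """True if the service was down on every poll and the window has enough polls.

    window: item values collected during the evaluation period.
    is_down: predicate deciding whether a raw value means "down".
    min_polls: minimum number of samples (count > 2 is the same as >= 3).
    """
    if len(window) < min_polls:
        return False
    return all(is_down(value) for value in window)


def not_running(value: int) -> bool:
    # Assumption for this example: 0 = running, any other value = not running.
    return value != 0


print(should_alert([6, 6, 6, 6, 6], not_running))  # True: down for the whole window
print(should_alert([6], not_running))              # False: only one poll after a gap
print(should_alert([6, 0, 6, 6, 6], not_running))  # False: it was running at one poll
```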

        The downside of this is if you have services that restart on crash, and they start crashing, you will not notice because they may always be back up and running well inside the 5 minute period.

        P.S. I should add, with regard to the first case, that you are probably recording the item value over time, so logging the event is to some degree unnecessary, as you can just review the item data looking for outages. The problem is that there's no great way in Zabbix to "look for any 0 in a few weeks of 1"; on a graph it often cannot be seen unless zoomed in. The events, on the other hand, have easy reporting, even a top 100.
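        If you do end up with the raw item history outside Zabbix (for example as a list of timestamp/value pairs from an export), finding the brief outages is a short scan. The sketch below is plain Python under that assumption; `find_outages` is a hypothetical helper, and the sample data uses 1 for up and 0 for down as in the sentence above.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def find_outages(samples: Sequence[Tuple[int, int]],
                 is_down: Callable[[int], bool]
                 ) -> List[Tuple[int, Optional[int]]]:
    """Return (start, end) pairs for each run of "down" samples.

    samples: (timestamp, value) pairs sorted by timestamp.
    end is the timestamp of the first good sample after the outage,
    or None if the history ends while the service is still down.
    """
    outages: List[Tuple[int, Optional[int]]] = []
    start: Optional[int] = None
    for timestamp, value in samples:
        if is_down(value):
            if start is None:
                start = timestamp  # first down sample of a new outage
        elif start is not None:
            outages.append((start, timestamp))
            start = None
    if start is not None:
        outages.append((start, None))
    return outages


history = [(0, 1), (60, 1), (120, 0), (180, 1), (240, 0), (300, 0)]
print(find_outages(history, lambda value: value == 0))
# prints [(120, 180), (240, None)]
```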
        Last edited by Linwood; 19-07-2016, 17:14.


        • bildz
          Junior Member
          • Aug 2015
          • 6

          #5
          I'm intrigued by the first suggestion:

          "You can let the trigger fire for even brief outages, but you can then have the alert itself wait a period to see if the problem is restored. This logs an event in the system that is then easy to see (i.e. there was downtime) but no alert is sent. This is done by skipping a step in the alert action; you set so the 2nd step occurs at (say) 5 minutes into the event to send the alert. This is like an escalation step, but with no initial alert. The advantage is that this step never fires if the event clears to OK before the time. I do this for host outages over bad WAN circuits, since I want to log outages, but I really do not want a tech to jump on the issue if it comes back in 1 minute or so."

          I'm attempting to configure the action, for a specific application defined as Service, and I am unsure where you set the delayed time to kick off the email. Could you offer some advice on how to best set this up? Thanks!
          Attached Files
          Last edited by bildz; 19-07-2016, 21:37.


          • Linwood
            Senior Member
            • Dec 2013
            • 398

            #6
            Something like this (attached screen shot). Note that it says step 2 with the default step being 120 seconds so it starts at 2 minutes. I'm not on the system where I had this working, this was a 2.4 system I just had handy, but I think this should work.

            I don't think I had to have a do-nothing real first step, but if you did just make it send mail to no one or some such.

            Essentially this is just the escalation feature of Zabbix, abused a bit.
            Attached Files


            • bildz
              Junior Member
              • Aug 2015
              • 6

              #7
              Thanks! I've got this configured and I'll give it a shot. Appreciate your assistance!


              • bildz
                Junior Member
                • Aug 2015
                • 6

                #8
                The solution works perfectly! I can see the event come in and it recovers, without sending an email. If I leave the service stopped for longer than 5 minutes, the alert is generated. Appreciate the assistance!


                • Linwood
                  Senior Member
                  • Dec 2013
                  • 398

                  #9
                  No problem, glad it worked out.
