re-check an item immediately to avoid false alarms?

  • eggcanada
    Junior Member
    • Sep 2014
    • 5

    #1

    Hi guys, for the past 5 years I have used HP SiteScope for system monitoring.

    Now I'm switching to Zabbix, and I'm looking for a feature that SiteScope has.

    For example, I want to monitor disk space on /var and get an alarm email when free space is below 10%.

    So I create a trigger that checks disk space every 20 minutes.

    Once the trigger fires, I want it to check again in 20 seconds. If free space is still below 10%, send the email. If the second check is above 10%, ignore the alarm.

    Furthermore, I can choose to do a 3rd check after another 20 seconds to double-confirm.

    Based on my experience, this greatly decreases the number of false alarms.
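
The recheck behaviour described above can be sketched in plain Python. This is only an illustration of the logic, not SiteScope or Zabbix code; `confirmed_alarm`, `check`, and `wait` are hypothetical names, and the readings are made-up free-space percentages.

```python
from typing import Callable


def confirmed_alarm(
    check: Callable[[], float],
    threshold: float,
    rechecks: int = 2,
    wait: Callable[[], None] = lambda: None,
) -> bool:
    """Return True only if the first reading and every recheck are below threshold."""
    if check() >= threshold:
        return False  # the regular check is healthy, nothing to confirm
    for _ in range(rechecks):
        wait()  # a real poller would pause here, e.g. time.sleep(20)
        if check() >= threshold:
            return False  # value recovered, treat the first hit as a false alarm
    return True


# Hypothetical free-space readings in percent: a transient dip and a persistent one.
transient_readings = iter([8.0, 12.0])
persistent_readings = iter([8.0, 7.5, 7.0])

transient = confirmed_alarm(lambda: next(transient_readings), threshold=10.0)
persistent = confirmed_alarm(lambda: next(persistent_readings), threshold=10.0)
print(transient, persistent)
```

Only the persistent case would lead to an email; the transient dip is dropped after the first recheck.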

    I checked the Zabbix documentation, but so far I haven't got a clue how to implement this.
  • ingus.vilnis
    Senior Member
    Zabbix Certified Trainer
    Zabbix Certified Specialist
    Zabbix Certified Professional
    • Mar 2014
    • 908

    #2
    Hello and welcome!

    Please have a look at the documentation here: https://www.zabbix.com/documentation...on/escalations

    That section with appropriate timing setups will help you to achieve what you have planned.

    Best Regards,
    Ingus

    • eggcanada
      Junior Member
      • Sep 2014
      • 5

      #3
      Originally posted by ingus.vilnis
      Hello and welcome!

      Please have a look at the documentation here: https://www.zabbix.com/documentation...on/escalations

      That section with appropriate timing setups will help you to achieve what you have planned.

      Best Regards,
      Ingus
      Hi ingus.vilnis, I don't think escalation can do what I want.

      I.e., an item/trigger's normal interval is 20 minutes. Once the trigger fires, I want the item to be re-examined right away once or twice to make sure the trigger condition still holds, and only then send out the alarm.

      Escalation cannot re-examine the item.

      • eggcanada
        Junior Member
        • Sep 2014
        • 5

        #4
        Hi ingus.vilnis, I checked escalation, and it doesn't seem to be what I want.

        I want the trigger to be rechecked (the item re-examined) in 10 or 20 seconds; escalation operations cannot do this.

        • ingus.vilnis
          Senior Member
          Zabbix Certified Trainer
          Zabbix Certified Specialist
          Zabbix Certified Professional
          • Mar 2014
          • 908

          #5
          Hi,

          I got the idea now. I somehow overlooked the 20 seconds recheck in your initial post.

          Well, in that case I can't think of a way to achieve exactly that. Items are checked at the update interval. As a value comes in, the trigger is evaluated, and there is no way to force a sudden recheck every 20 seconds after the trigger condition is met.

          As a kind of workaround, I can suggest decreasing the update interval and making the trigger fire only if the last, let's say, three values have met the condition.
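
The "last three values" workaround can be illustrated with a small Python sketch. It only models the idea of requiring N consecutive bad values; `LastNTrigger` and `feed` are hypothetical names, not part of Zabbix, and the sample values are invented.

```python
from collections import deque


class LastNTrigger:
    """Fires only when the last n values are all below the threshold."""

    def __init__(self, threshold: float, n: int = 3) -> None:
        self.threshold = threshold
        self.values = deque(maxlen=n)  # keeps only the n most recent values

    def feed(self, value: float) -> bool:
        self.values.append(value)
        window_full = len(self.values) == self.values.maxlen
        # If even the largest recent value is below the threshold, all of them are.
        return window_full and max(self.values) < self.threshold


trigger = LastNTrigger(threshold=10.0, n=3)
# Hypothetical free-space percentages arriving at each update interval.
states = [trigger.feed(v) for v in [9.0, 12.0, 8.0, 7.0, 6.0]]
print(states)
```

With these sample values the trigger stays quiet until three low readings in a row have been seen, which is the trade-off of the workaround: fewer false alarms, but more frequent polling.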


          Best Regards,
          Ingus

          • eggcanada
            Junior Member
            • Sep 2014
            • 5

            #6
            SiteScope has the feature I mentioned, and this can avoid most false alarms.

            Let's say we monitor CPU load on a server every 10 minutes. If the CPU check hits the server during a load spike, in Zabbix we get an alarm. But by the time the system admin logs on, the load is gone.

            In SiteScope, this alarm can be reconfirmed by a 2nd and 3rd check in the next 10, 20 or 30 seconds.

            In this case, we don't want to change the normal CPU check interval to 30 seconds.

            Same situation applies to disk space checks, etc.

            Personally, I feel this would be a nice feature for Zabbix's future release.

            • Colttt
              Senior Member
              Zabbix Certified Specialist
              • Mar 2009
              • 878

              #7
              You can use a smaller interval to check this (checking the load only every 10 minutes, really?),
              and you can create a trigger like this:

              CPULOAD.max(#2)>4
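
A note on the semantics of that expression, with a small Python sketch: max(#2) takes the largest of the last two collected values, so a single spike is enough for max(#2)>4 to be true, while min(#2)>4 requires both values to exceed 4. The helpers `max_last` and `min_last` below are hypothetical stand-ins for those trigger functions, and the load values are invented.

```python
from typing import Sequence


def max_last(values: Sequence[float], n: int) -> float:
    """Largest of the last n values, mimicking max(#n)."""
    return max(values[-n:])


def min_last(values: Sequence[float], n: int) -> float:
    """Smallest of the last n values, mimicking min(#n)."""
    return min(values[-n:])


spike = [1.2, 6.5]      # one calm reading followed by one spike
sustained = [5.0, 6.5]  # two high readings in a row

spike_fires_with_max = max_last(spike, 2) > 4
spike_fires_with_min = min_last(spike, 2) > 4
sustained_fires_with_min = min_last(sustained, 2) > 4
print(spike_fires_with_max, spike_fires_with_min, sustained_fires_with_min)
```

If the goal is to ignore a single spike, the min-based form is the one that waits for two high readings in a row.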

              If you don't find a solution, you can create a feature request, and maybe pay for it.
              Debian-User

              Sorry for my bad english

              • jan.garaj
                Senior Member
                Zabbix Certified Specialist
                • Jan 2010
                • 506

                #8
                Escalation can be your solution. POC:
                1. trigger - PROBLEM if value < threshold -> decrease the checking period
                Action for the 1st trigger - the first escalation step is a custom script that decreases the check period to 20 seconds.

                2. trigger - PROBLEM if the last 3 values < threshold -> notification
                Action for the 2nd trigger - standard notification.

                3. trigger - PROBLEM if the min of the last 3 values < threshold -> increase the checking period
                Action for the 3rd trigger - the first escalation step is a custom script that increases the check period back to 20 minutes.

                Problems:
                - it doesn't work correctly for active agents
                - you need to develop custom scripts (API)
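
As a rough idea of what such a custom script might send, here is a Python sketch that only builds the JSON-RPC request body for changing an item's update interval. It assumes the Zabbix API method `item.update` with the `itemid` and `delay` parameters; the item ID and token are placeholders, the helper name `item_update_request` is hypothetical, and how the token is passed (an `auth` field here) differs between Zabbix versions, so check the API documentation for yours. Nothing is sent over the network.

```python
import json


def item_update_request(item_id: str, delay_seconds: int, auth_token: str,
                        request_id: int = 1) -> str:
    """Build a JSON-RPC body asking Zabbix to change an item's update interval."""
    payload = {
        "jsonrpc": "2.0",
        "method": "item.update",
        "params": {"itemid": item_id, "delay": delay_seconds},
        "auth": auth_token,  # assumption: older-style token field in the body
        "id": request_id,
    }
    return json.dumps(payload)


# Placeholder item ID and token, for illustration only.
speed_up = item_update_request("23456", 20, "HYPOTHETICAL_TOKEN")       # 1st trigger's action
slow_down = item_update_request("23456", 20 * 60, "HYPOTHETICAL_TOKEN")  # 3rd trigger's action
print(speed_up)
print(slow_down)
```

A real script would POST this body to the frontend's API endpoint and handle errors, which is part of why this approach needs custom development.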

                It's not a very nice solution, but it can work. Anyway, I recommend using a standard trigger expression (https://www.zabbix.com/documentation...ers/expression) if you want to avoid false alarms, e.g. fire only when the last 3 checks fail.

                IMHO the standard Zabbix concept is faster:
                check period 1 minute, event if the last 3 checks failed -> response time ~3 minutes
                than HP SiteScope:
                check period 10 minutes, if problem recheck at 10, 20, 30 seconds -> response time ~11 minutes

                But you will need to store more values in Zabbix.
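
The response-time comparison above can be worked through as simple arithmetic. The sketch assumes the worst case, where the problem starts right after a regular check, and reads "10, 20, 30 seconds" as three rechecks spaced 10 seconds apart; `worst_case_seconds` is a hypothetical helper.

```python
def worst_case_seconds(interval_s: int, extra_checks: int, recheck_gap_s: int) -> int:
    """Worst-case time from problem start to a confirmed alarm.

    One full regular interval passes before the first failing check,
    then each confirming check adds its own gap.
    """
    return interval_s + extra_checks * recheck_gap_s


# Zabbix style: 1-minute interval, alarm on the 3rd consecutive failure.
zabbix_s = worst_case_seconds(interval_s=60, extra_checks=2, recheck_gap_s=60)

# SiteScope style: 10-minute interval, then rechecks at 10, 20 and 30 seconds.
sitescope_s = worst_case_seconds(interval_s=600, extra_checks=3, recheck_gap_s=10)

print(zabbix_s / 60, sitescope_s / 60)
```

Under these assumptions the figures come out at 3 minutes and 10.5 minutes, in line with the rough ~3 and ~11 minute estimates.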
                Devops Monitoring Expert advice: Dockerize/automate/monitor all the things.
                My DevOps stack: Docker / Kubernetes / Mesos / ECS / Terraform / Elasticsearch / Zabbix / Grafana / Puppet / Ansible / Vagrant

                • eggcanada
                  Junior Member
                  • Sep 2014
                  • 5

                  #9
                  Hi jan.garaj, I used 10 minutes as an example. Yes, the interval can be changed to 1 minute to get a quicker response, but that's not feasible for some heavy checks.
                  If there is no easy solution in the current Zabbix version, then I'll have to accept that for now.

                  Thank you guys!
