Zabbix false alerts

  • monismonther
    Junior Member
    • Nov 2010
    • 14

    #1

    Zabbix false alerts

    Hi, I have Zabbix configured to monitor some 20-25 servers, and they all have the CPU Load Average trigger. The action is to send an email when the CPU load is over 5 for more than 5 minutes.

    What happens is that every few days I get an alert that the CPU load is high. When I check the graphs and the latest events and values, the load has not stayed over the threshold for more than half a minute, and the email is sent at a time when the CPU load was low.

    Note: This is the Zabbix machine itself; this does not happen with any other host.

    Any ideas on how I can troubleshoot this?

    Thanks
  • EnigmA-X
    Senior Member
    Zabbix Certified Specialist
    • Oct 2010
    • 116

    #2
    Can you give us the triggers and items you're using to monitor this?


    • monismonther
      Junior Member
      • Nov 2010
      • 14

      #3
      My trigger is

      {mymachine:system.cpu.load[,avg1].last(0)}>5

      The item is CPU Load

      key system.cpu.load[,avg1]
      Update interval (in sec) 5 sec
      Flexible intervals (sec) no flexible intervals
      Delay 50
      Period 1-7,00:00-23:59

      This same item and trigger are set up on all my servers for monitoring CPU load, but the problem only happens with this host.

      Thanks


      • EnigmA-X
        Senior Member
        Zabbix Certified Specialist
        • Oct 2010
        • 116

        #4
        It looks like you are measuring your 1-minute average CPU load (avg1) every 5 seconds. The trigger you're using only looks at the last received value (last(0)).

        I suggest that you:
        • Change your item interval to (at least) 60s
        • Change your trigger to: {mymachine:system.cpu.load[,avg1].avg(300)}>5


        The trigger will then check your 1-minute average CPU load and go to the PROBLEM state once the average over the last 5 minutes (300 seconds) is over 5.
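
To make the difference between last(0) and avg(300) concrete, here is a minimal Python sketch. It is not Zabbix code: the sample values and the helper names `last_value` and `window_avg` are made up for illustration, and it only mimics how the two functions would treat the same history.

```python
THRESHOLD = 5

def last_value(samples):
    """Mimic last(0): only the most recent sample is considered."""
    return samples[-1][1]

def window_avg(samples, now, window_sec):
    """Mimic avg(300): mean of the samples received in the last window_sec seconds."""
    recent = [value for ts, value in samples if now - ts < window_sec]
    return sum(recent) / len(recent)

# Hypothetical (timestamp_sec, load) history at a 60 s interval:
# a calm period followed by a single short spike.
samples = [(0, 1.0), (60, 1.2), (120, 0.9), (180, 1.1), (240, 7.5)]
now = 240

fires_on_last = last_value(samples) > THRESHOLD
fires_on_avg = window_avg(samples, now, 300) > THRESHOLD

print(fires_on_last)  # True: the single spike alone trips the trigger
print(fires_on_avg)   # False: the 5-minute average is only about 2.34
```

With last(0), one high sample is enough to put the trigger into the PROBLEM state; with a 5-minute average the same spike is smoothed out.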

        Let me know if this works for you


        • monismonther
          Junior Member
          • Nov 2010
          • 14

          #5
          But I have already done this, only in my action and operation config.

          I have set up escalations so that the trigger must remain in the PROBLEM state for 5 minutes before the email is sent.

          Enable escalations (check box ticked)
          Period (seconds) 300 [min 60]

          And in my operations I have set the action to execute at step 2, and the operation window shows 5 minutes under the delay field.
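
The timing this setup is expected to produce can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Zabbix code: the function names are hypothetical, and it assumes that step 1 runs immediately and that the escalation stops as soon as the trigger returns to OK.

```python
def step_delay(step, period_sec):
    """Seconds after the PROBLEM event at which an escalation step runs.

    Step 1 runs immediately; each later step waits one more period.
    """
    return (step - 1) * period_sec

def email_expected(problem_duration_sec, step=2, period_sec=300):
    """Assumption: the escalation is cancelled once the trigger goes back to OK,
    so the step only runs if the problem lasts at least as long as its delay."""
    return problem_duration_sec >= step_delay(step, period_sec)

print(step_delay(2, 300))   # 300, matching the 5 minutes shown in the delay field
print(email_expected(20))   # False: a 20-second spike should not reach step 2
print(email_expected(360))  # True: a 6-minute problem should
```

Under these assumptions a short spike should never send an email, which is why the occasional alert is unexpected.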

          Isn't this the same?

          Thanks


          • EnigmA-X
            Senior Member
            Zabbix Certified Specialist
            • Oct 2010
            • 116

            #6
            I would expect that your option would work as well. Which step number (in the escalation actions) sends the e-mail and what version of Zabbix are you using?


            • monismonther
              Junior Member
              • Nov 2010
              • 14

              #7
              Hi, sorry for the late reply.


              My Zabbix version is 1.8.

              I have configured the action to send an email at step 2.

              Note: I got a false alarm today. The load spiked for 20 seconds, an alarm was sent, and then it went back to OK. Also, looking at the event view, it spikes like that all day long with no alarm (which is normal, because the spikes last less than 5 minutes).

              The strange thing is that sometimes, for unknown reasons, a simple spike can cause the action to send an email, even though the configuration says it should do so only when the load stays high for more than 5 minutes.

              Any help, please? These false messages sometimes come in the middle of the night, and we need to take them seriously.


              • EnigmA-X
                Senior Member
                Zabbix Certified Specialist
                • Oct 2010
                • 116

                #8
                Have you changed the trigger as I suggested in my earlier post?


                • monismonther
                  Junior Member
                  • Nov 2010
                  • 14

                  #9
                  No, actually, because the way I have set it up is the same, and you agreed about that.

                  I would also like to mention that it works very well, except that I get these false alarms about once every 3 to 5 days, so I don't think it's a configuration problem.

                  Also, I would like to add that all 20 other servers have exactly the same configuration with no false alarms; only this particular server is affected.

                  Why should I change the configuration?

                  Please understand that I am not trying to be stubborn here; I am trying to root out the actual cause. I highly appreciate your time and effort in helping me.

                  Thanks


                  • subba5678
                    Senior Member
                    • May 2010
                    • 132

                    #10
                    Hi EnigmA-X,
                    I am getting multiple recovery alerts for the "Zabbix agent is down" trigger: I get only 1 problem alert but 5 recovery alerts for it, and only from 5 servers. Please help me out. I have only one action created for the trigger.

                    Thanks,
                    Subbu


                    • monismonther
                      Junior Member
                      • Nov 2010
                      • 14

                      #11
                      Hi, any updates please? I got a third false alarm today.


                      • MrKen
                        Senior Member
                        • Oct 2008
                        • 652

                        #12
                        Any updates?

                        I would suggest changing your Trigger as suggested below. Get that working, then play around with Escalations.

                        Originally posted by EnigmA-X
                        It looks like you are measuring your 1-minute average CPU load (avg1) every 5 seconds. The trigger you're using only looks at the last received value (last(0)).

                        I suggest that you:
                        • Change your item interval to (at least) 60s
                        • Change your trigger to: {mymachine:system.cpu.load[,avg1].avg(300)}>5


                        The trigger will then check your 1-minute average CPU load and go to the PROBLEM state once the average over the last 5 minutes (300 seconds) is over 5.

                        Let me know if this works for you

                        If, for example, you want coffee with no sugar, you don't put in sugar and then try to remove it.

                        MrKen
                        Disclaimer: All of the above is pure speculation.


                        • monismonther
                          Junior Member
                          • Nov 2010
                          • 14

                          #13
                          * Change your trigger to: {mymachine:system.cpu.load[,avg1].avg(300)}>5


                          Can you please guide me on how to do this from the web interface? I am really new to this, so bear with me a little. Thanks.

                          I am already in the trigger configuration page and I have this in the expression field

                          Expression {myserver:system.cpu.load[,avg1].last(0)}>5

                          And I want it to be like this

                          Expression {myserver:system.cpu.load[,avg1].last(300)}>5

                          There is a select button next to the value on the right, but I don't know what to choose from it.


                          • subba5678
                            Senior Member
                            • May 2010
                            • 132

                            #14
                            Hi MrKen,
                            I am monitoring 50 servers in total in Zabbix. On about 10 of them I am getting multiple recovery alerts. I am using the same trigger for all the servers. Please find my trigger details below. Please help me out; I am using this in production.

                            Template_Windows:svs- Rosse TAL765 Zabbix Agent is down {svs - Rosse TAL765:status.min(900)}=2


                            • MrKen
                              Senior Member
                              • Oct 2008
                              • 652

                              #15
                              @monismonther,

                              Don't use '.last(300)}>5'. According to the docs, the 300 seconds will be ignored if you use last.
                              With (300) it would be better to use min, max, or avg. I use min, which means the trigger fires only when the minimum CPU load over the last 5 minutes is greater than 5, i.e. the load stayed above 5 for the whole period.
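
A short Python sketch of why min is the strictest of the three choices. The sample values are invented for illustration, and the `fires` helper is hypothetical; it only mimics comparing min(300), max(300), or avg(300) against the threshold.

```python
from statistics import mean

THRESHOLD = 5

# Hypothetical load samples from the last 300 seconds, one per minute.
brief_spike = [1.0, 1.2, 30.0, 1.1, 0.9]   # one very high sample
sustained = [5.5, 6.0, 7.2, 5.8, 6.4]      # high for the whole window

def fires(values, func):
    """Mimic a trigger such as min(300)>5, max(300)>5 or avg(300)>5."""
    return func(values) > THRESHOLD

print(fires(brief_spike, max))   # True: any single sample over 5 is enough
print(fires(brief_spike, mean))  # True: one large spike drags the average to about 6.84
print(fires(brief_spike, min))   # False: the load was not above 5 the whole time
print(fires(sustained, min))     # True: every sample in the window is above 5
```

So min(300)>5 matches "load above 5 for 5 minutes" most closely, while avg can still be tripped by one sufficiently large spike.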

                              To change the trigger expression, don't worry about the select button, just change the value in the Expression box. If your trigger is attached to a template, you will need to change it on the template, not on the host.


                              @subba5678,

                              I explained to you about 'status' in your other thread http://www.zabbix.com/forum/showthread.php?t=14333

                              MrKen
                              Disclaimer: All of the above is pure speculation.
