Notification storms

  • mschlegel
    Member
    • Oct 2008
    • 40

    #1

    Notification storms

    Has anyone else experienced notification storms in their zabbix installations?

    While I've only seen the issue show up in relation to one of our triggers, the volume of notifications that Zabbix generated on the most recent event was staggering. Over 12,000 notifications were sent out over the course of about an hour and a half before I disabled the associated action. More disturbing, however, was the storm of notifications that continued after the action was disabled, each with the following first line added to the message:

    NOTE: Escalation cancelled: Action '....' disabled.

    This is a recent upgrade to 1.8.3 from 1.8.0.

    In the past, disabling the action has stopped all notifications generated by the appropriate trigger. Now, even with all actions disabled, I'm still seeing notifications sent out.

    Any ideas where I might look to figure out what is going on?

    Thank you
  • Dmitry Musatov
    Junior Member
    • Mar 2011
    • 1

    #2
    > Has anyone else experienced notification storms in their zabbix installations?
    This happened to me last night.

    I'm running Zabbix 1.8.4 on 64-bit Debian Lenny. The usual Zabbix status:

    Parameter                                                   Value   Details
    Number of hosts (monitored/not monitored/templates)         48      20 / 0 / 28
    Number of items (monitored/disabled/not supported)          2195    2163 / 28 / 4
    Number of triggers (enabled/disabled)[problem/unknown/ok]   969     856 / 113 [14 / 11 / 831]
    Number of users (online)                                    20      2
    Required server performance, new values per second          69.12   -

    Last evening MySQL (which runs on the same host) consumed all available memory and the OS started swapping, so almost everything hung on IO and a lot of triggers fired. I disabled some actions to prevent over 9000 SMS/mail messages. However, it did not help: I got a great many messages with the first line NOTE: Escalation cancelled: Action '....' disabled.
    I have no escalations enabled on any of my actions.

    Here is a cut from the logs:

    monitoring:/var/log/zabbix-server# grep scalation zabbix_server.log
    7472:20110321:191723.160 [Z3005] Query failed: [1213] Deadlock found when trying to get lock; try restarting
    transaction [insert into escalations (escalationid,actionid,triggerid,eventid,status) values (1845,4,13863,295535,0)]
    7479:20110321:191754.868 Escalation cancelled: Event [295535] deleted.
    7479:20110321:192142.337 Escalation cancelled: Event [295535] deleted.
    7479:20110321:192302.152 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192303.952 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192304.711 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192305.468 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192306.010 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192307.950 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192308.587 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192309.382 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192310.023 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192311.728 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192312.526 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192313.211 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192313.612 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192313.851 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192314.758 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192314.910 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192316.607 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192317.318 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192317.338 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192317.551 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192318.090 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192318.117 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192318.242 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192318.578 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192319.015 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192319.168 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192319.873 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192324.570 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192324.694 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192325.110 Escalation cancelled: Action 'SMS Magic Team 2' disabled.
    7479:20110321:192325.133 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192325.250 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192326.940 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192327.061 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192327.193 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192327.459 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192328.136 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192328.138 Escalation cancelled: Action 'SMS Magic Team 2' disabled.
    7479:20110321:192328.494 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192329.226 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192330.632 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192331.223 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192331.770 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192331.834 Escalation cancelled: Action 'SMS Magic Team 2' disabled.
    7479:20110321:192332.290 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192332.862 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192333.486 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192334.050 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192334.214 Escalation cancelled: Action 'SMS Magic Team 2' disabled.
    7479:20110321:192335.023 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192336.158 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192337.468 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192337.916 Escalation cancelled: Action 'SMS Magic Team 2' disabled.
    7479:20110321:192339.337 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192340.081 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192340.630 Escalation cancelled: Action 'SMS Magic Team 2' disabled.
    7479:20110321:192341.074 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192341.483 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192342.015 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192342.286 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192342.727 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192343.483 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192346.055 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192346.528 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192346.870 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192347.383 Escalation cancelled: Action 'SMS notifications' disabled.
    7479:20110321:192348.056 Escalation cancelled: Action 'SMS notifications' disabled.
    10968:20110321:194047.569 [Z3005] Query failed: [1213] Deadlock found when trying to get lock; try restarting
    transaction [insert into escalations (escalationid,actionid,triggerid,eventid,status) values (9,4,16535,297143,0)]
    10973:20110321:194051.175 Escalation cancelled: Event [297143] deleted.
    10967:20110321:194202.151 [Z3005] Query failed: [1213] Deadlock found when trying to get lock; try restarting
    transaction [insert into escalations (escalationid,actionid,triggerid,eventid,status) values (74,4,16343,297498,0)]
    10973:20110321:194203.653 Escalation cancelled: Event [297498] deleted.
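To see which actions account for the flood in an excerpt like the one above, the cancelled escalations can be tallied per action. This is only a minimal sketch: the regular expression is inferred from the log lines shown in this post (PID, date, time, message), and the function name `count_cancelled` is made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from lines such as:
#   7479:20110321:192302.152 Escalation cancelled: Action 'SMS notifications' disabled.
CANCELLED = re.compile(
    r"^\s*\d+:\d{8}:\d{6}\.\d{3} Escalation cancelled: "
    r"Action '(?P<action>[^']*)' disabled\."
)

def count_cancelled(lines):
    """Count 'Escalation cancelled: Action ... disabled' lines per action name."""
    counts = Counter()
    for line in lines:
        match = CANCELLED.match(line)
        if match:
            counts[match.group("action")] += 1
    return counts

# A few lines copied from the excerpt above; the first one is not an
# "Action ... disabled" line, so it is ignored.
sample = [
    "7479:20110321:191754.868 Escalation cancelled: Event [295535] deleted.",
    "7479:20110321:192302.152 Escalation cancelled: Action 'SMS notifications' disabled.",
    "7479:20110321:192303.952 Escalation cancelled: Action 'SMS notifications' disabled.",
    "7479:20110321:192325.110 Escalation cancelled: Action 'SMS Magic Team 2' disabled.",
]
for action, n in count_cancelled(sample).most_common():
    print(action, n)
```

To run it against a real log, pass an open file object instead of `sample`.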


    Any ideas what the hell it was?


    • untergeek
      Senior Member
      Zabbix Certified Specialist
      • Jun 2009
      • 512

      #3
      An escalation is simply an action. If you take even 1 action on a trigger, it's considered an escalation to the Zabbix DB (at least, that's been my understanding).

      We've dealt with pager storms before when the db got backed up or network communications failed. It's certainly no picnic.

      We created a script, based on what we found in another thread here, that goes and marks all escalations as "failed". It kind of clears out the cruft if that happens.


      • MrKen
        Senior Member
        • Oct 2008
        • 652

        #4
        Actually, when you get flooded with alerts, just disabling actions is not enough, because numerous alerts are already in the database waiting to be retried again and again.

        What you need to do is to remove those alerts from the database, as described here: http://www.zabbix.com/forum/showthread.php?t=13326

        But even after disabling all actions and running the SQL commands in that link, I have found that I need to execute the SQL commands 2 or 3 more times to finally get all of them. Even then, there may still be some messages cached in your SMSC that will continue to arrive for several hours.
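The "run it again until nothing is left" routine can be sketched as a loop. This is not the SQL from the linked thread: the snippet below uses an in-memory SQLite table as a stand-in, and the table layout, the `fail_pending` and `drain` names, and the status codes (0 = not sent, 2 = failed) are assumptions made for illustration. Check your own schema before running anything similar against a live database.

```python
import sqlite3

# Assumed status codes for this sketch only; verify against your schema.
NOT_SENT, FAILED = 0, 2

def fail_pending(conn):
    """Mark every not-yet-sent alert as failed; return how many rows changed."""
    cur = conn.execute(
        "UPDATE alerts SET status = ? WHERE status = ?", (FAILED, NOT_SENT)
    )
    conn.commit()
    return cur.rowcount

def drain(conn, max_passes=5):
    """Repeat the update until a pass changes nothing or max_passes is reached."""
    changed = []
    for _ in range(max_passes):
        n = fail_pending(conn)
        changed.append(n)
        if n == 0:
            break
    return changed

# In-memory stand-in with a deliberately tiny, hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alerts (alertid INTEGER PRIMARY KEY, status INTEGER)")
conn.executemany("INSERT INTO alerts (status) VALUES (?)", [(0,), (0,), (1,), (0,)])
print(drain(conn))
```

In this toy database nothing inserts new rows, so the second pass finds nothing. On a live server new pending rows can presumably appear between passes, which would be consistent with having to repeat the commands.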

        I'm thinking of updating my 'Server X is unreachable' alert so that it fires only if Server X has been unreachable for, say, 5 minutes and the Zabbix queue is below maybe 10% of the total items. This way I would get only the 5 or 6 alerts for hosts that really are unreachable, rather than one for every host when there is a major network outage.
        I haven't done this yet, but I think it might be workable.
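The proposed condition can be written out as plain logic before trying to express it as a trigger. A minimal sketch, where the function name `should_alert` and its parameters are hypothetical and the defaults simply mirror the "5 minutes" and "10%" figures above:

```python
def should_alert(unreachable_seconds, queue_length, total_items,
                 min_unreachable=300, max_queue_fraction=0.10):
    """Alert only if the host has been down long enough AND the queue is small.

    A large queue suggests the monitoring server itself is behind, in which
    case "unreachable" results are not trustworthy.
    """
    if total_items <= 0:
        return False  # nothing to compare the queue against
    queue_ok = queue_length < max_queue_fraction * total_items
    return unreachable_seconds >= min_unreachable and queue_ok

# Host down for 10 minutes, queue small: alert.
print(should_alert(600, 50, 2195))
# Host "down" for 10 minutes, but the queue holds most of the items: stay quiet.
print(should_alert(600, 1500, 2195))
# Host down for only 1 minute: stay quiet.
print(should_alert(60, 50, 2195))
```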

        MrKen
        Disclaimer: All of the above is pure speculation.


        • fletch00
          Junior Member
          • Apr 2011
          • 13

          #5
          page storm mitigation

          We are seeing hundreds of notifications sent out with "Server XYZ is unreachable" when in fact the Zabbix queue is exploding and the notifications are bogus.

          Did you end up testing your new trigger logic (the queue must be < 10% of the total)?

          thanks


          • untergeek
            Senior Member
            Zabbix Certified Specialist
            • Jun 2009
            • 512

            #6
            The tedium and difficulty of adding that logic to hundreds or even thousands of template triggers is mind-boggling. You'd be far better off putting in a single action for a growing queue. That action could run scripts to disable the sending of notifications, put the entire system into maintenance mode (note: much manual scripting is necessary to achieve this), make an API call to disable existing actions, or anything else you can dream up.
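A single queue-driven switch of this kind could be sketched as below. Everything here is hypothetical: the class name, the thresholds, and the idea of using two thresholds (hysteresis, so the switch does not flap while the queue hovers around one value) are illustrative additions, and the sketch only decides whether notifications should be suppressed. Actually disabling actions or entering maintenance is left to whatever script or API call you use.

```python
from dataclasses import dataclass

@dataclass
class QueueWatchdog:
    high: int = 2000          # suppress at or above this queue length (assumed value)
    low: int = 500            # resume at or below this queue length (assumed value)
    suppressed: bool = False  # current state of the switch

    def update(self, queue_length: int) -> bool:
        """Feed in the latest queue length; return True while suppressing."""
        if not self.suppressed and queue_length >= self.high:
            self.suppressed = True
        elif self.suppressed and queue_length <= self.low:
            self.suppressed = False
        return self.suppressed

dog = QueueWatchdog()
states = [dog.update(q) for q in [100, 2500, 1500, 400, 100]]
print(states)  # [False, True, True, False, False]
```

Note that at a queue length of 1500 suppression stays on, because the queue has not yet dropped back to the lower threshold.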

            I shudder to think of the database load and the tedium of trying to add that much extra work to each trigger when there are potentially easier responses.

            The next thing to consider, however, is why you're getting a queue backup. If this is a condition you're experiencing with any degree of frequency, the solution is not to bypass the warning signs that your system can't handle something but to find out why and fix it.

            We had stuff like this happen in the past, and only occasionally have it happen now. We have found that our queue starts to back up when the database is unable to write changes quickly enough. Our backend is a massive Oracle database, but it's shared with several other databases. This has occasionally led to the database slowing down when one or more of those others is slammed itself. We watch the Zabbix internal item "write cache percent free." When this dips below a certain percentage, we know that something is slowing down the database. This is how we catch problems before they sink our Zabbix install and cause us to have a "pager storm" event.

            Chances are that you have something slowing down your database writes as well, though it could be something else still. Do the work to troubleshoot it and fix it though instead of just trying to silence the alarm.


            • fletch00
              Junior Member
              • Apr 2011
              • 13

              #7
              Thanks for taking the time to write your detailed reply.
              We run our 1.8.4 install on a dedicated Oracle instance on a dedicated set of disks, so we're puzzled why, on an idle weekend day, we'd see this queue issue (which, as you say, is usually tied to an IO write bottleneck).

              The fact that we (and other posters here) are experiencing this and talking about it shows it's a common issue.
              The trigger modification I am looking at was proposed by another member in another thread and would be made in _two_ places - the windows template and Linux template.

              Writing _robust_ code means a deterministic degradation/failure mode.
              Zabbix failing in this way (releasing hundreds of false pages when there are no real issues) points to a FUNDAMENTAL DESIGN FLAW: a monitoring system should have checks in place to prevent this, so we are attempting to make it more robust.

              To the other poster - yes I tried adding &{zabbix[queue].last(0)} and it gives this error on save:
              Unexpected end of element: Check expression part starting from ' {zabbix[queue].last(0)} '


              thanks


              • fletch00
                Junior Member
                • Apr 2011
                • 13

                #8
                I found what I wanted in 1.8.5 by browsing the zabbix items:

                &{Zabbix server:zabbix[queue].last(0)}<2000

                This does not exist/work in 1.8.4; it gives this error on save:

                Host does not exist. Check expression part starting from ' {Zabbix server:zabbix[queue].last(0)}<2000 '


                • untergeek
                  Senior Member
                  Zabbix Certified Specialist
                  • Jun 2009
                  • 512

                  #9
                  Is your Zabbix server's name "Zabbix server" or is it something else?

                  That's what it's looking for. Again, you need to add an item which is collecting the data (it's not automatic, even though it would seem so with a type of "zabbix internal").
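Both errors in this thread come down to the host part of the expression: one term omitted it, the other named a host the server did not know. A small sketch of how one term is assembled makes the required pieces explicit. The helper `trigger_expression` is hypothetical, and it only formats the string; it cannot check that the host or the item actually exists on your server.

```python
def trigger_expression(host, key, function, param, operator, constant):
    """Format one trigger term as {host:key.function(param)}<operator><constant>."""
    if not host:
        # Leaving the host out is what produced the first error quoted above.
        raise ValueError("host name is required inside the braces")
    return "{%s:%s.%s(%s)}%s%s" % (host, key, function, param, operator, constant)

# "Zabbix server" must match the name of a host that really has this item.
print(trigger_expression("Zabbix server", "zabbix[queue]", "last", 0, "<", 2000))
```

Replace "Zabbix server" with whatever your monitoring host is called, and make sure an item with the key zabbix[queue] has been added to it first.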

