Maintenance periods for hosts not working

  • mizaring
    Junior Member
    • Jul 2007
    • 26

    #1

    Maintenance periods for hosts not working

    Hi all,

    We installed Zabbix 1.8.1 (1.8.0 had the same problem). Since we have systems that reboot every night, we set up maintenance periods for those hosts so that we do NOT get called (or mailed) when they reboot. However, we're receiving mails for some, but not all, hosts when they reboot during the maintenance period. It seems the "maintenance for host" feature is really broken! I opened a bug report (ZBX-2181), but maybe someone here can help with this issue. We will be pushing the new Zabbix server into production next week, and this bug is a serious blocker for us that could put our use of Zabbix in jeopardy (we have over 150 servers and systems monitored).

    I attached screenshots of our config plus the event log, where you can see that some actions fired and others didn't during the maintenance period. It doesn't matter whether it's a daily or weekly maintenance; we still get mails when we shouldn't.

    Thanks!
  • mizaring
    Junior Member
    • Jul 2007
    • 26

    #2
    Still lots of problems with maintenance periods

    Either I'm the only one using maintenance periods for hosts or I'm getting very unlucky... One-time maintenance periods fail most of the time, and daily and weekly periods still send me random emails while the hosts are in their maintenance period. If maintenance for hosts is still experimental, someone should tag it as such in the documentation. I'm having a LOT of problems with this feature, and I'm thinking about changing my approach to move away from it.

    • danrog
      Senior Member
      • Sep 2009
      • 164

      #3
      It could be related to timezone mismatches. We use it all the time and it works well for us (so far). All our systems are in GMT so I don't have to worry about that.

      It could also be related to the timezone of your local system vs. PHP's timezone. I thought I saw a bug re: PHP and browser timezones not behaving properly.
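
To illustrate the kind of mismatch meant here, a minimal Python sketch. It assumes, purely for illustration, that the server clock runs at UTC-5 while the frontend treats the same wall-clock window as UTC. The offsets, the date and the `window_in_utc` and `covers` helpers are hypothetical and are not taken from Zabbix.

```python
from datetime import datetime, timedelta, timezone

def window_in_utc(start_local, minutes, utc_offset_hours):
    """Return the (start, end) of a wall-clock window as UTC datetimes."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    start = start_local.replace(tzinfo=tz).astimezone(timezone.utc)
    return start, start + timedelta(minutes=minutes)

def covers(window, moment):
    """True if the moment falls inside the half-open window [start, end)."""
    start, end = window
    return start <= moment < end

# Hypothetical 04:00-04:30 window, entered as a naive wall-clock time.
wall_clock = datetime(2010, 4, 12, 4, 0)

server_view = window_in_utc(wall_clock, 30, -5)   # server assumed at UTC-5
frontend_view = window_in_utc(wall_clock, 30, 0)  # frontend assumed at UTC

# A reboot at 04:05 server time is 09:05 UTC.
reboot = datetime(2010, 4, 12, 9, 5, tzinfo=timezone.utc)

print(covers(server_view, reboot))    # window as the server understands it
print(covers(frontend_view, reboot))  # window if it was stored as UTC
```

If the two views disagree like this, a reboot that looks covered on the server side would fall outside the stored window, which is one way alerts could slip through.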

      • mizaring
        Junior Member
        • Jul 2007
        • 26

        #4
        Timezones are set correctly

        I verified the timezone settings on the server to make sure that they were correct at the beginning :

        php.ini : date.timezone = America/Montreal
        /etc/localtime : /usr/share/zoneinfo/America/Montreal

        So I guess it's something in the code... but as you say it could be related to the timezones... anyway I'm gonna look at the code this weekend and try to find out what's happening. I'm gonna post a patch if I can pinpoint the problem.

        • btriem
          Member
          • Apr 2006
          • 30

          #5
          same problem

          I appear to be having the same problem. I created a maintenance period for a single host, with [no data collection], yet during the maintenance period it's still collecting data. Or am I misinterpreting what [no data collection] means? I am assuming that, during the maintenance period, I am asking Zabbix to stop attempting to collect data from that device.
          Last edited by btriem; 12-04-2010, 21:37.

          • mizaring
            Junior Member
            • Jul 2007
            • 26

            #6
            Update

            I looked at the code in timer.c and added some custom logging to see whether some of the hosts failed to go into maintenance mode. From what I saw, Zabbix correctly puts the hosts in maintenance mode... however, sometimes the actions still trigger and send an email. I attached a screenshot showing the problem:

            - This host is in maintenance from 04:00 to 04:30 for its reboot.
            - At 04:03:00, 04:03:20, 04:04:19 and 04:04:30 you can see some triggers kicking in, but because of the maintenance Zabbix doesn't execute the action (see the dash (-) in the last column).
            - At 04:06:00 and 04:06:50 two triggers kick in, but this time Zabbix decides to execute the action (Ok in the last column)...
            - At 04:07:30 another trigger fires, but guess what... no action executed!! (as it's supposed to be).
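
The timeline above can be checked with a short Python sketch. The event times and outcomes are copied from the list; the `in_maintenance` helper, and the choice to treat the window end as exclusive, are assumptions made for illustration.

```python
from datetime import time

MAINT_START = time(4, 0)
MAINT_END = time(4, 30)

# (event time, action executed?) as described in the list above
events = [
    (time(4, 3, 0), False),
    (time(4, 3, 20), False),
    (time(4, 4, 19), False),
    (time(4, 4, 30), False),
    (time(4, 6, 0), True),
    (time(4, 6, 50), True),
    (time(4, 7, 30), False),
]

def in_maintenance(t):
    """True if t falls inside the maintenance window (end assumed exclusive)."""
    return MAINT_START <= t < MAINT_END

# Actions that ran even though the event happened inside the window.
unexpected = [t for t, executed in events if in_maintenance(t) and executed]
for t in unexpected:
    print(t.isoformat())
```

All seven events fall inside the window, so the two executed actions at 04:06:00 and 04:06:50 are the ones that should not have happened.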

            I will continue my search, but this bug is killing me... Now I have to wake up in the middle of the night almost every day, only to find that it's another false alarm.
            • mizaring
              Junior Member
              • Jul 2007
              • 26

              #7
              Worsened with fix for bug ZBX-2305

              Hi all,

              Yesterday I installed a nightly build from svn (revision 11536) from the 1.8 branch. I saw that bug ZBX-2305 had been resolved and committed, so I thought it could "maybe" help with this problem... I was so wrong! In fact, on my busy system some actions inside the maintenance period were probably not being sent because of bug ZBX-2305, but now I get them all... which is really worse.

              Is somebody working on this problem, or at least looking into it? Right now my whole system is unusable and sends tens of false alarms on a daily basis because maintenance periods are not working.

              Just to be clear, the system doesn't send every alert during the maintenance period, maybe just 5% of them. My thinking is that there's some sort of race condition in the code between the action and the maintenance status.

              If nobody considers this an important issue, just say so; we can't afford to run like this for long. (We have 300 hosts, 6500 items and 3750 triggers, so you can imagine the mess with 5% false alarms...!)

              Thanks!

              • mizaring
                Junior Member
                • Jul 2007
                • 26

                #8
                Update

                Here's the latest development. I patched timer.c to not only put the host in maintenance but also to automatically put it in a host group called "Maintenance" for the period it is supposed to be in maintenance. I also updated to the latest nightly build (>11595), since bug #ZBX-2305 was reopened and "corrected" again.

                My new patch is working as it's supposed to: I can see hosts going in and out of the "Maintenance" group following the maintenance periods. I also added a condition to my actions to make sure no action is executed while the host is in the "Maintenance" group, and kept the "maintenance status" condition as well. My guess was that if the problem was with the maintenance status, it would go away, since the "Maintenance" host group would prevent the action from firing... I was wrong again!

                Since all my hosts are correctly put in maintenance and also in the "Maintenance" group, my conclusion is that there's a BIG problem with the execution of actions in a busy environment. When it's not too busy things seem to work OK, but when a couple of servers reboot at the same time (generating about one hundred events), actions that shouldn't be executed start sending emails...

                If nobody responds to this thread I will stop posting... my time will be better invested in trying to figure out the code and find where the race condition behind this problem comes from.

                Thanks!

                • untergeek
                  Senior Member
                  Zabbix Certified Specialist
                  • Jun 2009
                  • 512

                  #9
                  I saw a thread a while ago where trigger dependencies were not being evaluated properly in cases where the first step of the Action was immediate. It may help if you make it start with notifications at step 2 (after a 60 second delay) rather than immediate. It seems that if heavy DB traffic could prevent instantaneous evaluation of trigger dependencies it could also have the same effect on evaluation of maintenance mode.

                  • mizaring
                    Junior Member
                    • Jul 2007
                    • 26

                    #10
                    Escalation

                    Thanks untergeek for the answer,

                    I am not using escalations right now but I will try to see if it makes any difference. However since our servers are critical (they provide information to doctors in hospitals) I wouldn't want to extend the delay before we know that we have a problem. I wouldn't want this to be a permanent solution either.

                    I have also seen an improvement in the number of false emails received since bug #ZBX-2305 was corrected again. But what still puzzles me is why the first action condition is evaluated correctly ("Host group = PAR") and the two other conditions are not ("Host group <> Maintenance" & "Maintenance status not in maintenance"). It seems the event correctly identifies the host group for the action but ignores every other condition... weird.
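
For reference, here is how the three conditions are expected to combine, as a minimal Python sketch. The `Host` class, the host names and `action_should_fire` are hypothetical and only model the intended AND logic; this is not how Zabbix evaluates actions internally.

```python
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    groups: set = field(default_factory=set)
    in_maintenance: bool = False

def action_should_fire(host):
    """All three conditions must hold for the action to run."""
    conditions = [
        "PAR" in host.groups,              # Host group = PAR
        "Maintenance" not in host.groups,  # Host group <> Maintenance
        not host.in_maintenance,           # Maintenance status not in maintenance
    ]
    return all(conditions)

normal = Host("srv-a", {"PAR"})
rebooting = Host("srv-b", {"PAR", "Maintenance"}, in_maintenance=True)

print(action_should_fire(normal))
print(action_should_fire(rebooting))
```

Under this model a host that is both in the "Maintenance" group and flagged as in maintenance fails two of the three conditions, so an email for it means either the conditions were skipped or they were evaluated against stale data.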

                    I'll continue to see how things go for the next few days.

                    Thanks

                    • mizaring
                      Junior Member
                      • Jul 2007
                      • 26

                      #11
                      Action not sending mail to everyone

                      Got another false positive this morning... however, I noticed that the action sent the mail to only 3 out of 5 mail addresses; it didn't even try to send it to the others. I still think there's a serious synchronization problem in the way actions are evaluated and executed.

                      Alexei, richlv, are you reading this? Do you have any idea what the problem could be?

                      Could someone send me the design documents describing how Zabbix works? I could then look into the code myself and try to help.

                      Thanks

                      • cperera
                        Junior Member
                        • May 2007
                        • 19

                        #12
                        Just for the record, we're hitting the same issue: we receive notifications for servers that are in a maintenance period.

                        • Jun.Liu
                          Member
                          • Apr 2007
                          • 91

                          #13
                          I have the same problem as all of you, on version 1.8.2.
                          I hope this issue can be resolved soon!

                          • Jun.Liu
                            Member
                            • Apr 2007
                            • 91

                            #14
                            Today I did some digging on the forum and found a workaround for this issue.

                            Add one more condition, "maintenance status in maintenance", to the action configuration. Then the trigger won't send the alert message anymore when it fires during the maintenance period.

                            • sandy211087
                              Junior Member
                              • Nov 2009
                              • 1

                              #15
                              Maintenance problem for host

                              I think you need to define the maintenance period (period type),
                              and for that particular period it won't be collecting any data.
