Maintenance & Notification Architecture Re-think

  • IT_Architect
    Member
    • Feb 2006
    • 31

    #1

    Maintenance & Notification Architecture Re-think

    Scenarios that illustrate the problem:
    - I read here that someone is managing 1000 hosts. He must see a lot of notifications. Somebody is working on at least one of them all the time.
    - I get e-mail notifications of maintenance windows from e-commerce data source vendors and from DCs.
    - We have offices in different time zones. We decided they would monitor the notifications at night. We will have Zabbix text our cell phones so we won't waste time being glued to our e-mail.

    The reality of the situation.
    I'm attracted to technological marvels like Zabbix. My excuse is that monitoring is financially important. However, I more often learn of a problem from a phone call, by happening to notice it hours later, or after a half hour of Outlook error messages, when I realize the problem is one of our servers rather than the usual suspects, Comcast or HotMail.

    How did I get to this point?
    False positives. I can't imagine texting our cell phones with these messages unless they were almost always real. I don't remember when the maintenance windows are going to occur and which servers they are going to impact. I cause a lot of them myself while working on servers. The other office may call me in the middle of the night, and I may finally figure out it was a maintenance window. They learn from that, and then think it's a maintenance window or I'm just working on the server during off hours (or during hours) when there really is a problem that I'm not aware of, like a night last week when a server instantly ran out of space and crashed during a cross-server backup. I don't pay any attention to up-time numbers, but even if I did, they wouldn't be accurate enough to use.

    Leveraging the power of Zabbix:
    Critical: To leverage the power Zabbix already has, there is an essential piece missing: the ability to schedule Zabbix to deactivate monitoring of a certain host, or group, during a pre-determined time period. Most people will set these up as one-time events, but a sizable number will also need recurring ones. However, one-time events can cover both by simply adding them in multiple places or, if it's easy, by providing a way to copy them into the future. When I mess with a server on and off throughout the day, I would simply deactivate monitoring for the day so I don't have to rely on memory to turn monitoring back on. In my opinion, this human-factors engineering improvement would do more to enhance Zabbix's effectiveness in the real world than anything else.

    Useful: Be able to schedule where the notifications are sent based on time periods. Keep it simple. Those with complex requirements would be better served by an application outside of Zabbix to accomplish that task. I would guess that most of your installed base are guys like me who try not to work 24 X 7, can't be glued to their e-mail all day, and whose main job is not to watch a wall of monitor errors all day long.

    Useful: Have the scheduled down times not be included in the availability calculations. This is not important to me, but would be for those who need it to back up the worth of their services. It also avoids the scorn of the press and elevates Zabbix's stature when being evaluated.

    Summary: None of this requires granularity below the host level. If you have a simple, effective way I can accomplish the down time automation and/or notification switching now with some functions I can call from a script, I would be glad to hear it. Perhaps something like using a graphical scheduler, such as the Task Scheduler on one of the Windows servers, where I can easily visualize and manage the schedules, and have plink.exe tunnel in over SSH and run scripts on the Zabbix server at the appointed time. I already use this method as my centrally managed, visual cron. This might even be the best way to accomplish my requests, if I knew where to look for the hooks.

    Thank you for listening!

    PS: I thought about the trigger-level suggestion I saw here, but I don't know ahead of time everything I will be doing to a server when I work on it, so I can't see myself using it. I may decide to do port upgrades at the same time, or reboot the server. Moreover, when I am working intimately with it, I am checking to make sure I didn't mess up any of the other services, and fixing them when I do. I believe that scenario would be typical for most people.
    Last edited by IT_Architect; 09-10-2010, 06:32. Reason: Make it more understandable for non-native English speakers
  • MrKen
    Senior Member
    • Oct 2008
    • 652

    #2
    Originally posted by IT_Architect
    . . . .there is an essential piece missing, the need to be able to schedule Zabbix to deactivate monitoring of a certain host, or group, during a pre-determined time period.
    Have you not heard about Maintenance mode? http://www.zabbix.com/documentation/1.8/manual/maintenance_mode_for_gui

    Either that or just do it manually, if it's just one host or host group.

    Originally posted by IT_Architect
    Useful: Be able to schedule where the notifications are sent based on time periods. Keep it simple. Those with complex requirements would be better served by an application outside of Zabbix to accomplish that task. I would guess that most of your installed base are guys like me who try not to work 24 X 7, can't be glued to their e-mail all day, and whose main job is not to watch a wall of monitor errors all day long.
    You can already do this too! In the User's media set-up you can define periods (days, hours) in which each media is to be used. For example, emails during working hours, sms after-hours.
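To make the media-period idea concrete: Zabbix writes such periods as `d-d,hh:mm-hh:mm` strings (day 1 is Monday), with several periods separated by semicolons. The Python below is only a rough sketch of how a check like that could work, not Zabbix's own code. The `MEDIA` table, the period values, and the function names are made up for illustration.

```python
from datetime import datetime


def in_period(period, when):
    """Return True if `when` falls inside one 'd-d,hh:mm-hh:mm' period."""
    days, hours = period.split(",")
    first_day, _, last_day = days.partition("-")
    last_day = last_day or first_day  # allow a single day such as "6"
    start, end = hours.split("-")
    start_h, start_m = map(int, start.split(":"))
    end_h, end_m = map(int, end.split(":"))
    minutes = when.hour * 60 + when.minute
    day_ok = int(first_day) <= when.isoweekday() <= int(last_day)
    # The end of the period is treated as exclusive in this sketch.
    time_ok = start_h * 60 + start_m <= minutes < end_h * 60 + end_m
    return day_ok and time_ok


def active_media(media, when):
    """List the media whose semicolon-separated periods cover `when`."""
    return [name for name, periods in media.items()
            if any(in_period(p, when) for p in periods.split(";"))]


# Hypothetical set-up: e-mail during working hours, SMS after hours.
MEDIA = {
    "email": "1-5,09:00-18:00",
    "sms": "1-5,00:00-09:00;1-5,18:00-24:00;6-7,00:00-24:00",
}
```

With this table, a Monday at 10:00 selects only `email`, while 02:00 on the same day or any time on a Sunday selects only `sms`.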

    MrKen
    Disclaimer: All of the above is pure speculation.

    • IT_Architect
      Member
      • Feb 2006
      • 31

      #3
      Thank you for your reply

      Originally posted by MrKen
      The link you posted is not related to deactivating monitoring of a host or group for a predefined period. The link refers to disabling the user interface so people cannot make changes while the database is being maintained. "Zabbix GUI can be temporarily disabled in order to prohibit access to the front-end. This can be useful for protection of Zabbix database from any changes initiated by users, thus protecting integrity of database."

      Originally posted by MrKen
      Either that or just do it manually, if it's just one host or host group.
      That also doesn't address the functionality of deactivating monitoring of a host or group for a predefined time period. For example, to do it manually for a data center maintenance period would entail setting an alarm clock for 2 AM, getting up, going to a computer, turning host monitoring off for a host or group, and setting your alarm again for 3 AM to get up and turn it back on. Nobody is going to do that. However, if you don't do that, Zabbix will send you false alarms at 2 AM when they start maintaining the servers. You will attempt to figure out what the problem is and discover it was a false alarm triggered by a maintenance period in the data center. After that, you're not likely to ever respond to a Zabbix alarm at night.

      During the day, you can turn it off manually when you maintain a server, but sometimes you will forget to turn it back on again. It may be quite a while before you notice you forgot and turn it back on. So you learn from that and just don't turn it off anymore, to avoid the risk of forgetting to turn it back on.

      The net result of both of these scenarios is that more than 90% of the messages from Zabbix will be false alarms. You surely wouldn't want to be texted with all of these false alarms, so you send them to e-mail. When you get around to processing your e-mail, you will process the Zabbix messages last, because the odds are heavily in favor of all of the Zabbix messages being false alarms. What has happened is that Zabbix has become a source of self-inflicted spam for you, and you will be exactly where I am now, where you're more likely to learn about a problem hours later from a phone call, or when you discover during your own use that something isn't working.

      Thus, neither of these two responses addresses the critical need to be able to schedule the deactivation of monitoring for a host, or group of hosts, during periods of scheduled maintenance for the purpose of eliminating false notifications, nor do I see a real-life-usable work-around.

      Originally posted by MrKen
      You can already do this too! In the User's media set-up you can define periods (days, hours) in which each media is to be used. For example, emails during working hours, sms after-hours. MrKen
      WOW! I was initially confused by what you wrote, but learned your response was exactly correct. I have multiple installations of Zabbix, but all but one are on 1.6. The one that I upgraded to 1.8 a few days ago does indeed have EXACTLY what I need. That's perfect! Thank you for pointing that out.

      Summary:
      The only remaining issue I have is the most critical one, and that is being able to schedule the deactivation of monitoring for a host, or group of hosts, during maintenance windows to prevent false alarms. I would be happy to accept a solid work-around, such as how I could write a script to deactivate and reactivate monitoring for a host or group of hosts. I don't need to have Zabbix schedule and run the script. I can manage that outside of Zabbix.

      Thanks!
      Last edited by IT_Architect; 09-10-2010, 19:33.

      • MrKen
        Senior Member
        • Oct 2008
        • 652

        #4
        Originally posted by IT_Architect

        I have multiple installations of Zabbix, but all but one are on 1.6. The one that I upgraded to 1.8 a few days ago does indeed have EXACTLY what I need. That's perfect! Thank you for pointing that out.
        This functionality is available in 1.6, and even in 1.4. And judging by the image in the 1.4 manual, it was available in 1.1

        MrKen
        Disclaimer: All of the above is pure speculation.

        • IT_Architect
          Member
          • Feb 2006
          • 31

          #5
          Originally posted by MrKen
          This functionality is available in 1.6, and even in 1.4. And judging by the image in the 1.4 manual, it was available in 1.1 MrKen ... In the User's media set-up you can define periods
          Hi MrKen,

          I'm going to have to say you are wrong on this one too. None of the 1.6 User setup windows even have the word Media on them. I don't know where you're seeing it, but I'm guessing you don't have a version 1.6 to look at.

          Other: Having been a programmer and DBA for a long time, I looked through the data structures and wrote a PHP script that will activate and deactivate hosts or groups of hosts. It works perfectly, and I just finished putting in all of the error checking. The Maintenance Calendar they have in Zabbix is perfect, but I don't see that it does anything useful. Even if it disables changes from the GUI, the database would be changing many times a second from monitor data. I couldn't believe that it didn't also disable monitoring, so I tried it. The manual is right. It does nothing to stop monitoring. The Task Scheduler on the Windows servers that I use for everything else, and hoped to use here, won't work because it doesn't understand end times. Soooo I'm going to need to come up with a scheduler that does. One option is to use some of the scheduling code from one of the ERP packages I've written. I'd have to modify it extensively because it has far too much functionality for this application. Another option is to find a simple system on the web that understands beginning times, end times, and durations.
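A scheduler that understands beginning times, end times, and durations does not need much. The Python below is a minimal sketch of that idea only; it is not the PHP script described above and it does not touch the Zabbix database. The `Window` class, `targets_to_disable`, and the target names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Window:
    """One scheduled maintenance window for a host or group name."""
    target: str
    start: datetime
    end: datetime

    @classmethod
    def from_duration(cls, target, start, duration):
        # A window can be given as start + duration instead of start + end.
        return cls(target, start, start + duration)

    def covers(self, now):
        # The end time is exclusive: monitoring resumes exactly at `end`.
        return self.start <= now < self.end


def targets_to_disable(windows, now):
    """Names whose monitoring should be switched off at time `now`."""
    return {w.target for w in windows if w.covers(now)}
```

A job run every minute from any plain scheduler could call `targets_to_disable` and hand the result to whatever script activates and deactivates hosts, so the outer scheduler never needs to know about end times.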
          Last edited by IT_Architect; 11-10-2010, 03:36.

          • MrKen
            Senior Member
            • Oct 2008
            • 652

            #6
            Looks like 1.6.5 to me!
            Attached Files
            Disclaimer: All of the above is pure speculation.

            • IT_Architect
              Member
              • Feb 2006
              • 31

              #7
              Oh no! I'm going to have to eat crow on this one. I looked all over that screen for the word Media before, and before I posted. They hid it in plain sight on me. The only thing different between 1.6 and 1.8 is where they put it. Crawling back under my rock.

              What remains is the glaring lack of a way within Zabbix to discontinue monitoring of hosts during maintenance periods to prevent the many false alarms that I, and it must be everyone else, are getting. Incorporating this functionality would be a huge boost to Zabbix's real-world usability as a monitoring solution.

              I have a php script I can post if there is interest that can be used as a work-around. The problem with it being outside of Zabbix is if you change the name of a host, group, password, etc., it will break, and you will need to provide your own means of scheduling it.

              Thanks!
              Last edited by IT_Architect; 11-10-2010, 15:43.

              • jpriceit
                Junior Member
                • Feb 2008
                • 12

                #8
                Originally posted by IT_Architect
                What remains is the glaring lack of a way within Zabbix to discontinue monitoring of hosts during maintenance periods to prevent the many false alarms that I, and it must be everyone else, are getting. Incorporating this functionality would be a huge boost to Zabbix's real-world usability as a monitoring solution.
                I think this option solves that problem. I am just now trying this for the first time, but it would appear to do so. Note: Using v1.8.3 release.

                Edit: I would also like to point out that this entire maintenance feature is either not documented or is difficult to find in the manual.
                Attached Files
                Last edited by jpriceit; 17-11-2010, 21:17.

                • IT_Architect
                  Member
                  • Feb 2006
                  • 31

                  #9
                  Originally posted by jpriceit
                  Does this option not achieve that goal? I am just now trying this for the first time, but it would appear to do so. Note: Using v1.8.3 release.
                  All I can say is try it. Since I never got anything useful out of it, I wrote my own, during which my expectations changed. I wrote a PHP script that accepts inputs from the command line or other scripts. It allows groups inside of groups. Example:
                  - I put all hosts that a Zabbix instance is servicing in one group.
                  - You need to have two Zabbix instances in a data center in case a Zabbix machine goes down. Example Dallas1-Z1, Dallas1-Z2.
                  - I have a Data Center Group that includes both of those groups, Example Dallas1, so that when the DC is under maintenance, I can simply schedule Dallas1 for maintenance, and both groups and anything outside of the DC that is monitoring Dallas1 do not monitor anything at Dallas1 during that period.
                  - I also have Global Groups. For instance, in the case where you have a data provider that supplies data to web apps scattered across DCs, I schedule that group, and it will automatically make sure those application checks are not made. This is useful in a hosting situation where you want to monitor the server, but not the web applications of certain domains.
                  - This notification system for the Zabbix servers has been wonderful because in my case, I've been virtual for years. When I need to work on a physical machine, you guessed it, all of the virtual machines on that server are in a group, and whatever is monitoring gets the message not to during the scheduled maintenance period. There is no more matrix in my head of who's watching what. I can move virtual machines across servers with very few changes.
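The groups-inside-groups idea from the list above can be sketched as a small recursive lookup. This is an illustration in Python, not the PHP script itself. The group names come from the example, while the host names (`web01`, `db01`, `web02`) and the `expand` function are invented.

```python
def expand(group, groups, _seen=None):
    """Return every host reachable from `group`, following nested groups."""
    seen = set() if _seen is None else _seen
    if group in seen:
        return set()  # already visited: guards against circular nesting
    seen.add(group)
    hosts = set()
    for member in groups.get(group, ()):
        if member in groups:
            hosts |= expand(member, groups, seen)  # member is itself a group
        else:
            hosts.add(member)  # member is a plain host
    return hosts


# Hypothetical layout: one data center group holding two per-instance groups.
GROUPS = {
    "Dallas1": ["Dallas1-Z1", "Dallas1-Z2"],
    "Dallas1-Z1": ["web01", "db01"],
    "Dallas1-Z2": ["web02"],
}
```

Scheduling `Dallas1` then amounts to scheduling every host in `expand("Dallas1", GROUPS)`, which is why moving a machine between groups needs no change to the schedule.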

                  Summary:
                  It's been a dream. When I get maintenance notices, I just put them on the schedule for 5 minutes before the scheduled down time, and until 30 minutes after the scheduled down time. I can easily see at any time when something will be down. After the expiration period, the checks kick in. If the application server data is messed up, the web application checks fail, and I'll know before morning that I need to get on the phone with the data vendor so come morning, I don't start the day off losing money. I can take expired schedules, change the times, and re-use them. I now have Zabbix text me for disaster-level events when there is a problem, because I know if I get a text, I am losing money, no maybe about it.
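The padding rule (5 minutes before, 30 minutes after) is easy to apply mechanically. A small Python sketch follows, assuming the vendor's notice gives a start and an end time; the function name `padded` and its defaults are only illustrative.

```python
from datetime import datetime, timedelta


def padded(start, end,
           before=timedelta(minutes=5), after=timedelta(minutes=30)):
    """Widen an announced maintenance window by safety margins."""
    if end <= start:
        raise ValueError("window must end after it starts")
    return start - before, end + after
```

An announced 02:00 to 03:00 window would be put on the schedule as 01:55 to 03:30.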

                  With Zabbix's capability to watch services and applications, and this to cut out all the false alarms, it has freed my mind to where I don't worry about the servers anymore. The only thing I see now is my daily report that tells me how the backups went, what needs to be updated, and server messages from the previous day that show me server load and disk space problems. If I want to analyze a problem, I can go into Zabbix and call up a graph. I live in a much calmer environment now. If the Zabbix scheduler doesn't work for you, maybe you will want to make up something like this.

                  • jpriceit
                    Junior Member
                    • Feb 2008
                    • 12

                    #10
                    I got a chance to test this today. One of our hosts that was scheduled to have updates installed was rebooted multiple times (it had a lot of Windows updates to be installed).

                    I didn't receive a single alert for that host during this time.

                    • IT_Architect
                      Member
                      • Feb 2006
                      • 31

                      #11
                      Originally posted by jpriceit
                      I would also like to point out that this entire maintenance feature is either not documented or is difficult to find in the manual. ...I got a chance to test this today. One of our hosts that was scheduled to have updates installed was rebooted multiple times (it had a lot of Windows updates to be installed). I didn't receive a single alert for that host during this time.
                      Perfect! That's something that couldn't be answered satisfactorily before, and you must have done something different than I when you did it. That might work for most people. What I have is better now, but if I could have gotten it to work, I perhaps would have taken it.

                      Thanks for the feedback!

                      • danrog
                        Senior Member
                        • Sep 2009
                        • 164

                        #12
                        We use maintenance mode and have over 1000 hosts (a lot set up with only SNMP traps), and we don't receive a single alarm during maintenance. The key is to set up (as another poster mentioned) maintenance with no data collection AND add to the action the condition Maintenance status = not in maintenance. We also don't get many, if any, false alarms. I spent about a week tweaking triggers and actions when we first switched to Zabbix. Taking the time to plan it out upfront definitely helped our deployment.
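The two halves of that recipe do different jobs: the maintenance type decides whether data is gathered, and the action condition decides whether anyone is notified. The Python below is a toy model of that logic, not Zabbix code. The `Host` class and both functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    in_maintenance: bool = False
    # False models "maintenance with no data collection".
    collect_during_maintenance: bool = False


def should_poll(host):
    """Data is gathered unless the host is in a no-data-collection window."""
    return not host.in_maintenance or host.collect_during_maintenance


def should_alert(host, problem):
    """Models the action condition 'Maintenance status = not in maintenance'."""
    return problem and not host.in_maintenance
```

In this model a host in maintenance never produces an alert, whether or not data is still being collected for it.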

                        • IT_Architect
                          Member
                          • Feb 2006
                          • 31

                          #13
                          Originally posted by danrog
                          The key is to set up (as another poster mentioned) maintenance with no data collection AND add to the action Maintenance status = not in maintenance.
                          There we go. That's a clearly spelled out key piece of information. I wouldn't go back to this after what I have now since I've come to rely on nested groups, cross-Zabbix-server groups, and global Zabbix servers notifications.
                          Last edited by IT_Architect; 18-11-2010, 14:12.

                          • untergeek
                            Senior Member
                            Zabbix Certified Specialist
                            • Jun 2009
                            • 512

                            #14
                            Maintenance mode is so critical to our operations that I wrote shell scripts to directly access the database with the same commands as the UI.

                            We now are able to enter maintenance for a host or a group within moments with the same precision or with intervals by hours and/or minutes.

                            Granted, this bypasses security constraints, but only my team has access to the server with the scripts.


                            Code:
                            $ maint.sh 
                            Usage: maint.sh [OPTIONS]
                                      -i (run in interactive mode)
                                      -m (run in manual mode)
                                      -e [Maintenance ID] (end maintenance now)
                                      -s (show scheduled maintenance for next 24 hours)
                                      -x (silence all alerts and delete all escalations)
                                      -z (put all groups in maintenance (-x will be set also))
                                    Manual options 
                                      -H [hours] -M [minutes] -S [seconds] (duration calculations)
                                      -C [CR Number] -I [IN Number] 
                                      -g [Group search term]
                                      -h [Host search term]
                                      -n ["Maintenance Name/Title (enclose in quotes)"]
                                            Defaults to "Added by $FULLNAME on $DATE by script"
                                      -d ["Maintenance Description (enclose in quotes)"] (Only if no CR/IN)
                                            Defaults to "Quick maintenance window added by script without CR or IN"
                                      -T [Comma separated list of recipients - in addition to [email protected]]
                                      -? (Display this help)
                            We have our unix logins mapped to the same as our zabbix logins, so that's how we know who created a given maintenance. The other functionality which was so enjoyable was the ability to quickly end a maintenance window when a server was done being maintained.

                            This is by no means a complete implementation. It does not allow for creation of repetitive maintenance (e.g., weekly or daily). We still use the UI for that. This tool is for quick maintenance window creation and for showing servers/groups currently in maintenance, etc.
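For readers who want to build something similar, the duration arithmetic behind the `-H`, `-M`, and `-S` options could look roughly like the Python below. This is a guess at the approach, not the actual `maint.sh`: it only parses a few of the flags from the usage text and computes a start and end as Unix timestamps, and it does not access the database. The program name `maint.py` and both function names are made up.

```python
import argparse
from datetime import datetime, timedelta


def build_parser():
    # add_help=False frees "-h" so it can mean "host search term" as in maint.sh.
    parser = argparse.ArgumentParser(prog="maint.py", add_help=False)
    parser.add_argument("-H", dest="hours", type=int, default=0)
    parser.add_argument("-M", dest="minutes", type=int, default=0)
    parser.add_argument("-S", dest="seconds", type=int, default=0)
    parser.add_argument("-g", dest="group")
    parser.add_argument("-h", dest="host")
    return parser


def window(args, now):
    """Return (start, end) Unix timestamps for a window starting at `now`."""
    duration = timedelta(hours=args.hours, minutes=args.minutes,
                         seconds=args.seconds)
    if duration <= timedelta(0):
        raise ValueError("duration must be positive")
    start = int(now.timestamp())
    return start, start + int(duration.total_seconds())
```

For example, `build_parser().parse_args(["-H", "1", "-M", "30", "-g", "Dallas1"])` followed by `window(args, datetime.now())` gives a window of 5400 seconds starting immediately.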

                            • fmrapid
                              Member
                              • Aug 2010
                              • 43

                              #15
                              Maintenance script

                              Would you care to share the maintenance management script you have created? This is certainly something that is of much interest to all here.

                              I can also see interest in converting the script to use the API for a more consistent approach, if possible.
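As a starting point for an API-based version, the request body for the JSON-RPC method `maintenance.create` could be assembled as below. This Python sketch only builds the payload and sends nothing. The parameter names follow the API documentation of later Zabbix releases as I understand it and differ between versions (for example `groupids` versus a `groups` list, and how the auth token is passed), so treat every field as an assumption to check against your own version. The function name is hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone


def maintenance_request(name, group_ids, start, duration,
                        auth=None, request_id=1):
    """Build (but do not send) a JSON-RPC body for maintenance.create."""
    since = int(start.timestamp())
    seconds = int(duration.total_seconds())
    request = {
        "jsonrpc": "2.0",
        "method": "maintenance.create",
        "params": {
            "name": name,
            "active_since": since,
            "active_till": since + seconds,
            "maintenance_type": 1,  # assumed: 1 = without data collection
            "groupids": list(group_ids),  # assumed field name, version-dependent
            "timeperiods": [{
                "timeperiod_type": 0,  # assumed: 0 = one-time period
                "start_date": since,
                "period": seconds,
            }],
        },
        "id": request_id,
    }
    if auth is not None:
        # Assumed: older releases take the token in the body; newer ones
        # may expect it in an HTTP header instead.
        request["auth"] = auth
    return request
```

`json.dumps(maintenance_request(...))` yields the text that would be posted to the API endpoint of the Zabbix front-end.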

                              You can put it up on the wiki or link it somewhere else, taking care to strip out any passwords.

                              Thank you,

                              fmrapid
