Maintenance & Notification Architecture Re-think


IT_Architect
08-10-2010, 22:16
Scenarios that illustrate the problem:
- I read here where someone is managing 1000 hosts. He must see a lot of notifications. Somebody is working on at least one of them all the time.
- I get e-mail notifications of maintenance windows from e-commerce data source vendors, and DCs about maintenance times.
- We have offices in different time zones. We decided they would monitor the notifications at night. We will have Zabbix text our cell phones so we won't waste time being glued to our e-mail.

The reality of the situation.
I'm attracted to technological marvels like Zabbix. My excuse is that monitoring is financially important. However, I more often learn of a problem from a phone call, from happening to notice it hours later, or from realizing after a half hour of Outlook error messages that the problem is one of our servers rather than the usual suspects, Comcast or HotMail. :o

How did I get to this point?
False positives. I can't imagine texting our cell phones with these messages unless they were almost always real. I don't remember when the maintenance windows are going to occur and which servers they are going to impact. I cause a lot of them myself while working on servers. The other office may call me in the middle of the night, and I may finally figure out it was a maintenance window. They learn from that, and then think it's a maintenance window or I'm just working on the server during off hours (or during hours) when there really is a problem that I'm not aware of, like a night last week when a server instantly ran out of space and crashed during a cross-server backup. I don't pay any attention to up-time numbers, but even if I did, they wouldn't be accurate enough to use. :(

Leveraging the power of Zabbix:
Critical: To leverage the power Zabbix already has, there is an essential piece missing: the ability to schedule Zabbix to deactivate monitoring of a certain host, or group, during a pre-determined time period. Most people will set these up as one-time events, but a sizable number will also need recurring ones. However, one-time events can cover both by simply adding them in multiple places or, if it's easy, by providing a way to copy them into the future. When I mess with a server on and off throughout the day, I would simply deactivate monitoring for the day so I don't have to rely on memory to turn monitoring back on. In my opinion, this human-factors engineering improvement would do more to enhance Zabbix's effectiveness in the real world than anything else.

Useful: Be able to schedule where the notifications are sent based on time periods. Keep it simple. Those with complex requirements would be better served by an application outside of Zabbix to accomplish that task. I would guess that most of your installed base are guys like me who try not to work 24 X 7, can't be glued to their e-mail all day, and whose main job is not to watch a wall of monitor errors all day long.

Useful: Have the scheduled down times not be included in the availability calculations. This is not important to me, but would be for those who need it to back up the worth of their services. It also avoids the scorn of the press and elevates Zabbix's stature when being evaluated.

Summary: None of this requires granularity below the host level. If you have a simple, effective way I can accomplish the down time automation and/or notification switching now with some functions I can call from a script, I would be glad to hear it. Perhaps something like using a graphical scheduler, like the Task Scheduler on one of the Windows servers, where I can easily visualize and manage the schedules, and have plink.exe ssh tunnel in and run scripts on the Zabbix server at the appointed time. I already use this method as my centrally managed, visual cron. :D This might even be the best way to accomplish my requests, if I knew where to look for the hooks.

Thank you for listening!

PS: I thought about the trigger-level suggestion I saw here, but I don't know ahead of time everything I will be doing to a server when I work on it, so I can't see myself using it. I may decide to do port upgrades at the same time, or reboot the server. Moreover, when I am working intimately with it, I am checking to make sure I didn't mess up any of the other services, and fixing them when I do. I believe that scenario would be typical for most people.

MrKen
09-10-2010, 07:28
. . . .there is an essential piece missing, the need to be able to schedule Zabbix to deactivate monitoring of a certain host, or group, during a pre-determined time period.


Have you not heard about Maintenance mode? http://www.zabbix.com/documentation/1.8/manual/maintenance_mode_for_gui

Either that or just do it manually, if it's just one host or host group.


Useful: Be able to schedule where the notifications are sent based on time periods. Keep it simple. Those with complex requirements would be better served by an application outside of Zabbix to accomplish that task. I would guess that most of your installed base are guys like me who try not to work 24 X 7, can't be glued to their e-mail all day, and whose main job is not to watch a wall of monitor errors all day long.


You can already do this too! In the User's media set-up you can define periods (days, hours) in which each media is to be used. For example, emails during working hours, sms after-hours.
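To make the "emails during working hours, sms after-hours" idea concrete, here is a minimal sketch of how such periods could be evaluated. It assumes the `d-d,hh:mm-hh:mm` period notation (days 1 to 7, Monday first, several periods separated by `;`); the function names and the two example period strings are made up for illustration and are not taken from anyone's actual setup.

```python
from datetime import datetime


def in_period(periods: str, when: datetime) -> bool:
    """Return True if `when` falls inside any 'd-d,hh:mm-hh:mm' period."""
    day = when.isoweekday()                  # 1 = Monday ... 7 = Sunday
    minute = when.hour * 60 + when.minute    # minutes since midnight
    for period in periods.split(";"):
        days, hours = period.strip().split(",")
        first_day, _, last_day = days.partition("-")
        last_day = last_day or first_day     # allow a single day such as "6"
        start, end = hours.split("-")
        start_h, start_m = map(int, start.split(":"))
        end_h, end_m = map(int, end.split(":"))
        in_days = int(first_day) <= day <= int(last_day)
        in_hours = start_h * 60 + start_m <= minute < end_h * 60 + end_m
        if in_days and in_hours:
            return True
    return False


# Hypothetical media schedule: e-mail in office hours, SMS the rest of the time.
MEDIA_PERIODS = {
    "email": "1-5,09:00-18:00",
    "sms": "1-5,00:00-09:00;1-5,18:00-24:00;6-7,00:00-24:00",
}


def media_for(when: datetime) -> list:
    """List the media whose period covers `when`."""
    return [name for name, periods in MEDIA_PERIODS.items() if in_period(periods, when)]
```

With these example periods, a Monday morning alert would go to e-mail only and a Sunday 3 AM alert to SMS only.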

MrKen

IT_Architect
09-10-2010, 17:29
Thank you for your reply

Have you not heard about Maintenance mode? http://www.zabbix.com/documentation/1.8/manual/maintenance_mode_for_gui

The link you posted is not related to deactivating monitoring of a host or group for a predefined period. The link refers to disabling the user interface so people cannot make changes while the database is being maintained. "Zabbix GUI can be temporarily disabled in order to prohibit access to the front-end. This can be useful for protection of Zabbix database from any changes initiated by users, thus protecting integrity of database."

Either that or just do it manually, if it's just one host or host group.

That also doesn't address the functionality of deactivating monitoring of a host or group for a predefined time period. For example, to do it manually for a data center maintenance period it would entail setting an alarm clock for 2 AM, getting up, going to a computer, turning host monitoring off for a host or group, and setting your alarm again for 3 AM, to get up and turn it back on. Nobody is going to do that. However, if you don't do that, Zabbix will send you false alarms at 2 AM when they start maintaining the servers. You will attempt to figure out what the problem is and discover it was a false alarm triggered by a maintenance period in the Data Center. After that, you're not likely to ever respond to a Zabbix alarm at night. During the day, you can turn it off manually when you maintain a server, but sometimes you will forget to turn it back on again. It may be quite a while before you notice you forgot and turn it back on. So you learn from that and just don't turn it off anymore to avoid the risk of forgetting to turn it back on. The net result of both of these scenarios is more than 90% of the messages from Zabbix will be false alarms. You surely wouldn't want to be texted with all of these false alarms, so you send them to e-mail. When you get around to processing your e-mail, you will process the Zabbix messages last, because the odds are heavily in favor that all of the Zabbix messages will be false alarms. What has happened is Zabbix has become a source of self-inflicted spam for you, and you will be exactly where I am now, where you're more likely to learn about a problem hours later from a phone call, or when you discover during your own use that something isn't working.

Thus, neither of these two responses addresses the critical need to be able to schedule the deactivation of monitoring for a host, or group of hosts, during periods of scheduled maintenance for the purpose of eliminating false notifications, nor do I see a real-life-usable work-around. :(

You can already do this too! In the User's media set-up you can define periods (days, hours) in which each media is to be used. For example, emails during working hours, sms after-hours. MrKen

WOW! I was initially confused by what you wrote, but learned your response was exactly correct. I have multiple installations of Zabbix, but all but one are on 1.6. The one that I upgraded to 1.8 a few days ago does indeed have EXACTLY what I need. That's perfect! Thank you for pointing that out. :o

Summary:
The only remaining issue I have is the most critical one, and that is being able to schedule the deactivation of monitoring for a host, and group of hosts, during maintenance windows to prevent false alarms. I would be happy to accept a solid work-around such as how I could write a script to deactivate and reactivate monitoring for a host and group of hosts. I don't need to have Zabbix schedule and run the script. I can manage that outside of Zabbix.

Thanks!

MrKen
10-10-2010, 04:34
I have multiple installations of Zabbix, but all but one are on 1.6. The one that I upgraded to 1.8 a few days ago does indeed have EXACTLY what I need. That's perfect! Thank you for pointing that out. :o



This functionality is available in 1.6, and even in 1.4. And judging by the image in the 1.4 manual, it was available in 1.1 ;)

MrKen

IT_Architect
10-10-2010, 14:20
This functionality is available in 1.6, and even in 1.4. And judging by the image in the 1.4 manual, it was available in 1.1 ;) MrKen
...In the User's media set-up you can define periods

Hi MrKen,

I'm going to have to say you are wrong on this one too. :D None of the 1.6 User setup windows even have the word Media on them. I don't know where you're seeing it, but I'm guessing you don't have a version 1.6 to look at.

Other: Having been a programmer and dba for a long time, I looked through the data structures and wrote a php script that will activate and deactivate hosts or groups of hosts. It works perfectly, and I just finished putting in all of the error checking. The Maintenance Calendar they have in Zabbix is perfect, but I don't see that it does anything useful. Even if it disables changes from the GUI, the database would be changing many times a second from monitor data. I couldn't believe that it didn't also disable monitoring, so I tried it. The manual is right. It does nothing to stop monitoring. The Task Scheduler on the Windows servers that I use for everything else and hoped to use here won't work because it doesn't understand end times. Soooo I'm going to need to come up with a scheduler that does. One option is to use some of the scheduling code from one of the ERP packages I've written. I'd have to modify it extensively because it has far too much functionality for this application. Another option is to find a simple system on the web that understands beginning times, end times, and durations.

MrKen
11-10-2010, 03:40
Looks like 1.6.5 to me!

IT_Architect
11-10-2010, 04:53
Oh no! I'm going to have to eat crow on this one. :D I looked all over that screen for the word Media before, and again before I posted. They hid it in plain sight on me. The only thing different between 1.6 and 1.8 is where they put it. Crawling back under my rock. :o

What remains is the glaring lack of a way within Zabbix to discontinue monitoring of hosts during maintenance periods to prevent the many false alarms that I, and it must be everyone else, are getting. Incorporating this functionality would be a huge boost to Zabbix's real-world usability as a monitoring solution.

I have a php script I can post if there is interest that can be used as a work-around. The problem with it being outside of Zabbix is if you change the name of a host, group, password, etc., it will break, and you will need to provide your own means of scheduling it.

Thanks!

jpriceit
17-11-2010, 20:10
What remains is the glaring lack of a way within Zabbix to discontinue monitoring of hosts during maintenance periods to prevent the many false alarms that I, and it must be everyone else, are getting. Incorporating this functionality would be a huge boost to Zabbix's real-world usability as a monitoring solution.
I think this option solves that problem. I am just now trying this for the first time, but it would appear to do so. Note: Using v1.8.3 release.

Edit: I would also like to point out that this entire maintenance feature is either not documented or is difficult to find in the manual.

IT_Architect
17-11-2010, 22:04
Does this option not achieve that goal? I am just now trying this for the first time, but it would appear to do so. Note: Using v1.8.3 release.

All I can say is try it. Since I never got anything useful out of it, I wrote my own, during which my expectations changed. I wrote a PHP script that accepts inputs from the command line or other scripts. It allows groups inside of groups. Example:
- I have a group containing all the hosts that a Zabbix instance is servicing.
- You need to have two Zabbix instances in a data center in case a Zabbix machine goes down. Example: Dallas1-Z1, Dallas1-Z2.
- I have a Data Center Group that includes both of those groups, for example Dallas1, so that when the DC is under maintenance, I can simply schedule Dallas1 for maintenance, and both groups and anything outside of the DC that is monitoring Dallas1 do not monitor anything at Dallas1 during that period.
- I also have Global Groups. For instance, in the case where you have a data provider that supplies data to web apps scattered across DCs, I schedule that group, and it will automatically make sure those application checks are not made. This is useful in a hosting situation where you want to monitor the server, but not the web applications of certain domains.
- This notification system for the Zabbix servers has been wonderful because in my case, I've been virtual for years. When I need to work on a physical machine, you guessed it, all of the virtual machines on that server are in a group, and whatever is monitoring gets the message not to during the scheduled maintenance period. There is no more matrix in my head of who's watching what. I can move virtual machines across servers with very few changes.
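The groups-inside-groups idea in the list above can be sketched as a recursive expansion. This is only an illustration of the concept, not the PHP script itself; the dictionary layout and the host names (web01, db01, web02) are invented for the example, while the Dallas1 group names come from the post.

```python
def resolve_hosts(name, groups, _seen=None):
    """Expand a group name into the set of hosts it ultimately contains.

    `groups` maps a group name to a list of members; a member that is not
    itself a key in `groups` is treated as a host.
    """
    seen = _seen if _seen is not None else set()
    if name not in groups:
        return {name}          # a plain host
    if name in seen:
        return set()           # group already expanded: guards against cycles
    seen.add(name)
    hosts = set()
    for member in groups[name]:
        hosts |= resolve_hosts(member, groups, seen)
    return hosts


# Hypothetical layout: a data center group holding one group per Zabbix instance.
GROUPS = {
    "Dallas1": ["Dallas1-Z1", "Dallas1-Z2"],
    "Dallas1-Z1": ["web01", "db01"],
    "Dallas1-Z2": ["web02"],
}
```

Scheduling `Dallas1` would then cover every host returned by `resolve_hosts("Dallas1", GROUPS)`, without having to remember which instance watches which machine.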

Summary:
It's been a dream. When I get maintenance notices, I just put them on the schedule for 5 minutes before the scheduled down time, and until 30 minutes after the scheduled down time. I can easily see at any time when something will be down. After the expiration period, the checks kick in. If the application server data is messed up, the web application checks fail, and I'll know before morning that I need to get on the phone with the data vendor so come morning, I don't start the day off losing money. I can take expired schedules, change the times, and re-use them. I now have Zabbix text me for disaster-level events when there is a problem, because I know if I get a text, I am losing money, no maybe about it.
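The padding described above (5 minutes before, 30 minutes after the announced down time) is simple to express in code. A minimal sketch, with function names and the sample dates chosen purely for illustration:

```python
from datetime import datetime, timedelta


def padded_window(start, end,
                  lead=timedelta(minutes=5),
                  tail=timedelta(minutes=30)):
    """Widen an announced down time by a lead-in and a tail."""
    return start - lead, end + tail


def suppressed(now, window):
    """True while checks should stay off for the padded window."""
    begin, finish = window
    return begin <= now < finish


# Example: a data center announces maintenance from 02:00 to 03:00.
window = padded_window(datetime(2010, 11, 18, 2, 0), datetime(2010, 11, 18, 3, 0))
```

Once `suppressed()` turns false the normal checks resume, which is when a genuinely broken service would show up.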

With Zabbix's capability to watch services and applications, and this to cut out all the false alarms, it has freed my mind to where I don't worry about the servers anymore. The only thing I see now is my daily report that tells me how the backups went, what needs to be updated, and server messages from the previous day that show me server load and disk space problems. If I want to analyze a problem, I can go into Zabbix and call up a graph. I live in a much calmer environment now. If the Zabbix scheduler doesn't work for you, maybe you will want to make up something like this.

jpriceit
17-11-2010, 22:10
I got a chance to test this today. One of our hosts that was scheduled to have updates installed was rebooted multiple times (it had a lot of Windows updates to be installed).

I didn't receive a single alert for that host during this time.

IT_Architect
17-11-2010, 23:02
I would also like to point out that this entire maintenance feature is either not documented or is difficult to find in the manual. ...I got a chance to test this today. One of our hosts that was scheduled to have updates installed was rebooted multiple times (it had a lot of windows updates to be installed). I didn't receive a single alert for that host during this time.

Perfect! That's something that couldn't be answered satisfactorily before, and you must have done something different than I did when you set it up. That might work for most people. What I have is better now, but if I could have gotten it to work, I perhaps would have taken it.

Thanks for the feedback!

danrog
18-11-2010, 03:48
We use maintenance mode and have over 1000 hosts (a lot set up with only SNMP traps) and we don't receive a single alarm during maintenance. The key is to set up (as another poster mentioned) maintenance with no data collection AND add to the action the condition Maintenance status = not in maintenance. We also don't get many, if any, false alarms. I spent about a week tweaking triggers and actions when we first switched to Zabbix. Taking the time up front to plan it out definitely helped our deployment.
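The effect of that action condition can be pictured as a simple filter in front of the notification. This is a toy model of the behaviour described here, not Zabbix's own code; the function and parameter names are invented.

```python
def should_notify(trigger_fired, host_in_maintenance,
                  require_not_in_maintenance=True):
    """Decide whether an action would send a notification.

    `require_not_in_maintenance` stands in for the action condition
    "Maintenance status = not in maintenance".
    """
    if not trigger_fired:
        return False
    if require_not_in_maintenance and host_in_maintenance:
        return False   # the condition filters out hosts under maintenance
    return True
```

Without the condition (`require_not_in_maintenance=False`) a trigger that fires during maintenance still produces a notification, which matches the false alarms reported earlier in the thread.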

IT_Architect
18-11-2010, 13:01
The key is to set up (as another poster mentioned) maintenance with no data collection AND add to the action Maintenance status = not in maintenance.

There we go. That's a clearly spelled out key piece of information. I wouldn't go back to this after what I have now since I've come to rely on nested groups, cross-Zabbix-server groups, and global Zabbix server notifications.

untergeek
22-11-2010, 18:09
Maintenance mode is so critical to our operations that I wrote shell scripts to directly access the database with the same commands as the UI.

We now are able to enter maintenance for a host or a group within moments with the same precision or with intervals by hours and/or minutes.

Granted, this bypasses security constraints, but only my team has access to the server with the scripts.


$ maint.sh
Usage: maint.sh [OPTIONS]
-i (run in interactive mode)
-m (run in manual mode)
-e [Maintenance ID] (end maintenance now)
-s (show scheduled maintenance for next 24 hours)
-x (silence all alerts and delete all escalations)
-z (put all groups in maintenance (-x will be set also))
Manual options
-H [hours] -M [minutes] -S [seconds] (duration calculations)
-C [CR Number] -I [IN Number]
-g [Group search term]
-h [Host search term]
-n ["Maintenance Name/Title (enclose in quotes)"]
Defaults to "Added by $FULLNAME on $DATE by script"
-d ["Maintenance Description (enclose in quotes)"] (Only if no CR/IN)
Defaults to "Quick maintenance window added by script without CR or IN"
-T [Comma separated list of recipients - in addition to hs_team@REDACTED.com]
-? (Display this help)

We have our unix logins mapped to the same as our zabbix logins, so that's how we know who created a given maintenance. The other functionality which was so enjoyable was the ability to quickly end a maintenance window when a server was done being maintained.

This is by no means a complete implementation. It does not allow for creation of repetitive maintenance (e.g., weekly or daily). We still use the UI for that. This tool is for quick maintenance window creation and for showing servers/groups currently in maintenance, etc.

fmrapid
22-11-2010, 19:27
Would you care to share the maintenance management script you have created? This is certainly something that is of much interest to all here.

I can also see interest in converting the script to use the API for a more consistent approach, if possible.
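As a rough idea of what the API route might look like, here is a sketch that only builds a JSON-RPC request body for a one-time maintenance window and does not send it anywhere. The field names follow the `maintenance.create` method as documented for later Zabbix releases; that is an assumption on my part, the API in 1.8 may well differ, and newer releases changed how groups and the auth token are passed. The function name and its arguments are hypothetical.

```python
import json


def maintenance_request(name, group_ids, start_epoch, duration_seconds,
                        auth_token, collect_data=True):
    """Build (not send) a JSON-RPC body for a one-time maintenance window."""
    return {
        "jsonrpc": "2.0",
        "method": "maintenance.create",      # assumed method name, see note above
        "params": {
            "name": name,
            "active_since": start_epoch,
            "active_till": start_epoch + duration_seconds,
            # 0 = with data collection, 1 = without data collection
            "maintenance_type": 0 if collect_data else 1,
            "groupids": list(group_ids),
            "timeperiods": [{
                "timeperiod_type": 0,        # one-time period
                "start_date": start_epoch,
                "period": duration_seconds,
            }],
        },
        "auth": auth_token,
        "id": 1,
    }


# Example body for a 95-minute window on a hypothetical group id "2".
body = json.dumps(
    maintenance_request("DC maintenance", ["2"], 1290045300, 5700, "TOKEN"),
    indent=2,
)
```

Going through the API rather than straight to the database would keep the permission checks in play, at the cost of depending on whatever the installed version actually supports.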

You can put it up on the wiki or link it somewhere else, taking care to strip out any passwords.

Thank you,

fmrapid

untergeek
22-11-2010, 20:01
Here is my current script. There may be some inconsistencies and poor choices for implementation, but it works.

Please note that we are an Oracle shop and all queries are formulated with this in mind. This would have to be adapted for MySQL or any other supported DB.

I have removed usernames and passwords and changed URLs and email addresses to use example.com

We force the usage of Change Requests (CR) or Incidents (IN) to create a maintenance window, or notification is sent of an infraction. This should be avoidable/editable per your setup.

I would be grateful for any changes, improvements, or points you'd like to make. I'm a UNIX sysadmin and Java application admin by trade and not a coder. I'm merely handy with shell scripting. Some of the commented out bits reflect a recent change from Solaris as our Zabbix Server host OS to RHEL 5. Again, adapt as needed.

lukaswu
04-05-2011, 12:01
The key is to set up (as another poster mentioned) maintenance with no data collection AND add to the action Maintenance status = not in maintenance.

Could you shed some light on this? In my case I do not use "Actions" at all.

Well, in 1.8.4 Maintenance is broken again, at least it does not work for me. I found an already-submitted bug on this (in the commercially supported version). Soon I will be testing 1.8.5, though I did not notice a patch for this problem on the change list.

--
luk

untergeek
04-05-2011, 12:41
I am well and truly puzzled. We've used maintenance mode since 1.8.0 and never had it not work, once we figured out that we needed "maintenance status = not in maintenance" in the action conditions. We even still set up our maintenance windows with data collection so we're collecting data if it can be collected during the maintenance period. We've upgraded at each level, from 1.8.0 to .1, .2, etc. right up through 1.8.5. It has always worked. We had 2 Zabbix servers in separate environments, one for staging and one for production. We've added a third for our failover site. Maintenance mode has worked as expected in each case. I don't know why people have been saying that a given release has "broken" for them. It hasn't for us.

lukaswu
05-05-2011, 10:51
I am well and truly puzzled. We've used maintenance mode since 1.8.0 and never had it not work, once we figured out that we needed "maintenance status = not in maintenance" in the action conditions. We even still set up our maintenance windows with data collection so we're collecting data if it can be collected during the maintenance period. We've upgraded at each level, from 1.8.0 to .1, .2, etc. right up through 1.8.5. It has always worked. We had 2 Zabbix servers in separate environments, one for staging and one for production. We've added a third for our failover site. Maintenance mode has worked as expected in each case. I don't know why people have been saying that a given release has "broken" for them. It hasn't for us.
All right, maybe I was wrong and this feature has never worked. I needed it, in fact, in 1.8.4.

Again, I do not use "Actions" (I hope you are referring to Configuration/Actions) and maintenance mode simply does not work in 1.8.4. Period.

When I set a machine into maintenance mode, I expect that no alarm would show up regardless of the status of data collection. Otherwise it does not make any sense. In my case, when I set a machine in the "no data collection" state, no alarms show up unless I reboot the machine or shut down the Zabbix agent. We have an item working with "no data" in case we lose the network, and it always shows up. Again, maintenance mode should put the server on a hook, and the Zabbix server should ignore ANY errors in the described scope. I presume that if you do not use "no data" in your item(s), you would in fact be unaware when the Zabbix server loses communication with the monitored server, and in that case maintenance mode may appear to work (in fact the items are set into the Unsupported state).

To confirm other people have problems too, see:

https://support.zabbix.com/browse/ZBX-3692
http://www.zabbix.com/forum/archive/index.php/t-21496.html
http://www.zabbix.com/forum/archive/index.php/t-16442.html

Kind regards.

--
luk

untergeek
05-05-2011, 12:59
I apologize. I misunderstood you. Let me try again:

1. You are only using visual cues in the Dashboard to determine whether a host or item is up/down (No actions).
2. Maintenance mode with no data collection (which effectively disables the entire host) is the method you're employing.
3. You're still seeing errors in the Dashboard.

Do I understand correctly? If so, then I have some additional follow-up questions.

1. Were the items already alerting before the maintenance?
2. Do you see the items which are in an alerted state in the "Last 20 issues" section of the Dashboard? If so, are the host names orange (instead of blue)?

If the answer to #1 is yes, then according to my understanding you should still see the items even after the maintenance has begun. Escalations already in place exist outside of maintenance.

If the answer to #2 is orange, then maintenance is at least properly happening, whether before the items alert or not. I find that even hosts already alerting will show up orange in the Dashboard. Maintenance will only prevent notifications from going out for new alerts, not for pre-existing ones. I believe that will be the case even when "No data collection" is selected. In my case, because we always continue with data collection throughout maintenance, we see alerts in the Dashboard but no notifications come through. In your case I think that "no data collection" should prevent new alerts from showing up, even in the dashboard. However, it will not prevent existing alerts from continuing to appear.
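My understanding of the cases above can be written down as a small lookup. It encodes only what is described in this post, assumes the "not in maintenance" action condition is in place, and is not a statement of how Zabbix is implemented; the names are invented.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    shown_in_dashboard: bool
    notification_sent: bool


def expected_outcome(alert_predates_maintenance, data_collection):
    """Expected result for an alert on a host that is under maintenance."""
    if alert_predates_maintenance:
        # Existing alerts and their escalations carry on regardless.
        return Outcome(shown_in_dashboard=True, notification_sent=True)
    if data_collection:
        # New alerts are still visible, but no notification goes out.
        return Outcome(shown_in_dashboard=True, notification_sent=False)
    # No data is collected, so nothing new should appear at all.
    return Outcome(shown_in_dashboard=False, notification_sent=False)
```

If what you observe differs from the last case, that is the part I would look at more closely.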

At my location, we only care about whether the notifications come through or not. In fact, we depend upon Zabbix continuing to show host & item status (with the orange links to show maintenance) while we work through issues. We have visual confirmation that they have cleared up or persist. I am not sure what to say about the other side of the coin, what it should look like and whether or not it's a bug that it does not behave the way it is expected. If this worked previously, it may either be a bug now, or it was "broken" then (i.e. you had the desired functionality but that itself was not intended by the Zabbix team).