Scenarios that illustrate the problem:
- I read here that someone is managing 1,000 hosts. He must see a lot of notifications, since somebody is working on at least one of those hosts all the time.
- I get e-mail notices of upcoming maintenance windows from e-commerce data source vendors and from DCs.
- We have offices in different time zones. We decided the other office would monitor the notifications at night. We will have Zabbix text our cell phones so we won't waste time glued to our e-mail.
The reality of the situation:
I'm attracted to technological marvels like Zabbix. My excuse is that monitoring is financially important. In practice, however, I more often learn about a problem from a phone call, by happening to notice it hours later, or by realizing after half an hour of Outlook error messages that the culprit is one of our servers rather than the usual suspects, Comcast or Hotmail.
How did I get to this point?
False positives. I can't imagine texting our cell phones with these messages unless they were almost always real. I don't remember when the maintenance windows are going to occur or which servers they will affect. I also cause a lot of false alarms myself while working on servers. The other office may call me in the middle of the night, and I may eventually figure out that it was a maintenance window. They learn from that, and the next time they assume it's a maintenance window, or that I'm just working on the server (during off hours or business hours), when there really is a problem that I'm not aware of. One night last week, for example, a server suddenly ran out of space and crashed during a cross-server backup. I don't pay any attention to up-time numbers, but even if I did, they wouldn't be accurate enough to use.
Leveraging the power of Zabbix:
Critical: To leverage the power Zabbix already has, one essential piece is missing: the ability to schedule Zabbix to deactivate monitoring of a certain host, or group, during a predetermined time period. Most people will set these up as one-time events, but a sizable number will also need recurring ones. One-time events can cover both cases, either by adding the same event in multiple places or, if it's easy to implement, by providing a way to copy an event into the future. When I work on a server on and off throughout the day, I would simply deactivate monitoring for the day so I don't have to rely on memory to turn monitoring back on. In my opinion, this human-factors engineering improvement would do more to enhance Zabbix's effectiveness in the real world than anything else.
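To make the request concrete, here is a minimal Python sketch of the behavior I have in mind. It is not Zabbix code and does not use any Zabbix function; `MaintenanceWindow`, `copy_into_future`, and `is_suppressed` are hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class MaintenanceWindow:
    """Hypothetical record: suppress monitoring for one host or group."""
    target: str       # a host name or a group name (illustrative)
    start: datetime   # inclusive
    end: datetime     # exclusive

    def covers(self, target: str, when: datetime) -> bool:
        return target == self.target and self.start <= when < self.end


def copy_into_future(window, every, count):
    """Turn a one-time window into `count` windows spaced `every` apart."""
    return [
        MaintenanceWindow(window.target,
                          window.start + every * i,
                          window.end + every * i)
        for i in range(count)
    ]


def is_suppressed(windows, targets, when):
    """True if any window covers the host, or one of its groups, at `when`.

    `targets` is the host name plus the names of the groups it belongs to.
    """
    return any(w.covers(t, when) for w in windows for t in targets)
```

A one-time window copied forward with `copy_into_future(window, timedelta(days=7), 4)` would give four weekly windows, which is all the "recurring" support I would need.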
Useful: Be able to schedule where the notifications are sent based on time periods. Keep it simple. Those with complex requirements would be better served by an application outside of Zabbix. I would guess that most of your installed base are guys like me who try not to work 24x7, can't be glued to their e-mail all day, and whose main job is not to watch a wall of monitors for errors all day long.
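As a rough illustration of "keep it simple", the routing could be nothing more than a small table of time periods. This is only a sketch under my own assumptions: the `ROUTES` table, the recipient labels, and the function names are hypothetical and do not correspond to anything in Zabbix.

```python
from datetime import time

# Hypothetical routing table: (period start, period end, recipient label).
ROUTES = [
    (time(8, 0), time(18, 0), "local-office-email"),
    (time(18, 0), time(8, 0), "remote-office-sms"),  # wraps past midnight
]


def in_period(start, end, now):
    """True if `now` is in [start, end), allowing periods that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def recipient_for(now, routes=ROUTES, default="local-office-email"):
    """Return the first recipient whose period contains `now`."""
    for start, end, recipient in routes:
        if in_period(start, end, now):
            return recipient
    return default
```

With this table, a notification raised at 03:00 would go to the night office by text message, and one raised at 09:00 would go to the day office by e-mail.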
Useful: Have the scheduled down times excluded from the availability calculations. This is not important to me, but it would be for those who need the numbers to back up the worth of their services. It also avoids the scorn of the press and raises Zabbix's stature when it is being evaluated.
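Here is one way the arithmetic could work, as a sketch only. It assumes outages and maintenance windows are given as `(start, end)` pairs and that maintenance windows do not overlap one another; the function names are hypothetical, and I am not claiming this is how Zabbix computes availability.

```python
from datetime import datetime


def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap of two intervals (0 if none)."""
    start, end = max(a_start, b_start), min(a_end, b_end)
    return max((end - start).total_seconds(), 0.0)


def availability(period_start, period_end, outages, maintenance):
    """Fraction of non-maintenance time during which the host was up.

    Assumes the maintenance windows do not overlap each other.
    """
    total = (period_end - period_start).total_seconds()
    scheduled = sum(overlap_seconds(period_start, period_end, s, e)
                    for s, e in maintenance)

    unplanned = 0.0
    for o_start, o_end in outages:
        # Clip the outage to the reporting period.
        o_start, o_end = max(o_start, period_start), min(o_end, period_end)
        if o_end <= o_start:
            continue
        down = (o_end - o_start).total_seconds()
        # Downtime inside a maintenance window is not counted.
        excused = sum(overlap_seconds(o_start, o_end, s, e)
                      for s, e in maintenance)
        unplanned += down - excused

    measured = total - scheduled
    return 1.0 if measured <= 0 else 1.0 - unplanned / measured
```

The scheduled time is removed from both the downtime and the measured period, so an outage that falls entirely inside a maintenance window has no effect on the result.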
Summary: None of this requires granularity below the host level. If there is already a simple, effective way to accomplish the down-time automation and/or notification switching with some functions I can call from a script, I would be glad to hear it. Perhaps I could use a graphical scheduler, such as the Task Scheduler on one of our Windows servers, where I can easily visualize and manage the schedules, and have plink.exe connect over SSH and run scripts on the Zabbix server at the appointed time. I already use this method as my centrally managed, visual cron.
This might even be the best way to accomplish my requests, if I knew where to look for the hooks.
Thank you for listening!
PS: I thought about the trigger-level suggestion I saw here, but I don't know ahead of time everything I will be doing to a server when I work on it, so I can't see myself using it. I may decide to do port upgrades at the same time, or reboot the server. Moreover, when I am working closely with a server, I am checking that I didn't break any of its other services, and fixing them when I do. I believe that scenario would be typical for most people.