My site's current Zabbix server environment is Zabbix 4.4.7 on RHEL 7.8 x86_64. We've been using Zabbix since the 2.0.x days, so this question is about a change or improvement to an existing environment.
We're looking to adjust our existing alerting in the following way:
- easily configure a custom alerting period on a per-host basis
- continue to detect problems for our hosts on a 7x24 basis, so the issue still appears in the web interface and a problem event is still registered
- if the problem happens outside of the alerting period, no alert action is generated until the alerting period is reached, at which point alerting should start.
For most of our enterprise systems, we are configured to alert 7x24 if there is a problem, and that's working fine.
My workgroup has taken on monitoring a lot of client systems that have limited operational hours, though, and we're trying to adjust our actions to allow for flexible, per-host alerting periods. For example, we've begun monitoring a client that is only used from 7:00 AM until 5:00 PM on weekdays. We would like Zabbix to continue detecting problems for this client on a 7x24 basis, but only alert our staff during business hours during the week. We don't need to be woken up in the middle of the night or paged on a weekend if there is a problem on this client, hence the desire for a per-host custom alerting period. We have other clients that would have slightly different alerting periods, so we're looking for a method that allows us to easily customize it per-host.
I thought that adding "Event time" to the list of action conditions and then having a user macro define the alerting period would be the perfect solution for this problem.
I modified one of our existing error actions to include Event time as part of the conditions, using the user macro {$ERROR_ACTION_PERIOD} as the time period.
I defined {$ERROR_ACTION_PERIOD} globally as 1-7,00:00-24:00, so that the default for any host is 7x24 alerting.
I then modified the first host to have {$ERROR_ACTION_PERIOD} => 1-5,06:00-17:00.
Experienced Zabbix users can tell where this is going. It works as long as the original problem event was generated within the time period defined by the macro, but if the problem was first detected outside of that period and problem event generation mode is set to the default of "Single", then the conditions will never match because the event started outside the time period. That's as designed, but it makes it unusable for our desired alerting goals.
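To make the failure mode concrete, here is a minimal Python sketch of the check as I understand it. It is not Zabbix code: parse_period and in_period are hypothetical helpers, the dates are made up, and the sketch assumes the condition is only ever evaluated against the time the problem event started.

```python
from datetime import datetime


def parse_period(spec):
    """Parse one 'd-d,hh:mm-hh:mm' period into (first_day, last_day, start_min, end_min)."""
    days, hours = spec.split(",")
    first_day, _, last_day = days.partition("-")
    last_day = last_day or first_day  # allow a single day such as '5,09:00-17:00'
    start, end = hours.split("-")

    def to_minutes(hhmm):
        hour, minute = hhmm.split(":")
        return int(hour) * 60 + int(minute)

    return int(first_day), int(last_day), to_minutes(start), to_minutes(end)


def in_period(spec, when):
    """Return True if 'when' falls inside any ';'-separated period in 'spec'."""
    minute_of_day = when.hour * 60 + when.minute
    for part in spec.split(";"):
        first_day, last_day, start_min, end_min = parse_period(part)
        # isoweekday() uses 1 = Monday ... 7 = Sunday, the same numbering as the macro.
        if first_day <= when.isoweekday() <= last_day and start_min <= minute_of_day < end_min:
            return True
    return False


GLOBAL_PERIOD = "1-7,00:00-24:00"
HOST_PERIOD = "1-5,06:00-17:00"

event_start = datetime(2020, 6, 6, 2, 30)  # Saturday 02:30, problem first detected
monday_morning = datetime(2020, 6, 8, 9, 0)  # Monday 09:00, problem still open

print(in_period(GLOBAL_PERIOD, event_start))   # True: default hosts alert 7x24
print(in_period(HOST_PERIOD, event_start))     # False: event started outside the period
print(in_period(HOST_PERIOD, monday_morning))  # True, but no new event exists to test at this time
```

With "Single" event generation there is only the Saturday event, so the only timestamp ever tested is the one that fails.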
We know that it's possible to enable "Multiple problem event generation", but it's not at all clear from the docs what the downsides are to doing that. Since our triggers for most things come from templates, we would need to enable multiple problem event generation for a large number of our hosts, whether they need it or not.
We know that triggers support time period evaluation too, but we've rejected that method here because setting the time period at the trigger level means that the problem is not registered at all until the host's specific "in use" period is reached.
We know about recurring maintenance periods as a way to suppress alerts, and right now that seems like it might be the closest match for our requirements. I avoided a per-host recurring maintenance period in my first attempt because we'll have to periodically extend the maintenance period and we'll potentially need a dozen or more of these to handle these hosts with different alerting requirements.
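If we did go the maintenance route, the maintenance windows would be the complement of each host's alerting period. Here is a rough Python sketch of that calculation, only to show how many windows a single period turns into. quiet_windows and the other helpers are hypothetical names, the sketch handles a single 'd-d,hh:mm-hh:mm' period, and it does not talk to Zabbix at all.

```python
MINUTES_PER_DAY = 24 * 60
DAY_NAMES = {1: "Mon", 2: "Tue", 3: "Wed", 4: "Thu", 5: "Fri", 6: "Sat", 7: "Sun"}


def parse_period(spec):
    """Parse one 'd-d,hh:mm-hh:mm' period into (first_day, last_day, start_min, end_min)."""
    days, hours = spec.split(",")
    first_day, _, last_day = days.partition("-")
    last_day = last_day or first_day
    start, end = hours.split("-")

    def to_minutes(hhmm):
        hour, minute = hhmm.split(":")
        return int(hour) * 60 + int(minute)

    return int(first_day), int(last_day), to_minutes(start), to_minutes(end)


def quiet_windows(spec):
    """Return {day: [(start_min, end_min), ...]} covering the time outside the alerting period."""
    first_day, last_day, start_min, end_min = parse_period(spec)
    windows = {}
    for day in range(1, 8):
        if first_day <= day <= last_day:
            # Alerting day: quiet before the period starts and after it ends.
            gaps = [(0, start_min), (end_min, MINUTES_PER_DAY)]
            windows[day] = [(s, e) for s, e in gaps if s < e]
        else:
            # Non-alerting day: quiet for the whole day.
            windows[day] = [(0, MINUTES_PER_DAY)]
    return windows


def fmt(minutes):
    """Format minutes since midnight as hh:mm."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


for day, spans in quiet_windows("1-5,06:00-17:00").items():
    text = ", ".join(f"{fmt(s)}-{fmt(e)}" for s, e in spans) or "none"
    print(DAY_NAMES[day], text)
```

For the 1-5,06:00-17:00 host this gives two quiet spans on each weekday plus the whole of Saturday and Sunday, which is part of why a dozen hosts with different periods looks like a lot of maintenance definitions to keep up to date.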
I'm looking for any advice from other Zabbix admins that have had to do something like this and what methods you've used to accomplish it.
I would also be interested to hear from sites that are using "Multiple problem" event generation on a widespread basis (not just for log items, but for basically everything), to understand more about the downsides of using it widely.
Thanks,
Tim