
**lamont** · 26-07-2008, 01:05

I've got a set of production boxes which do administrative tasks on a daily basis, but never at a consistent time every day. Most of the servers just bounce themselves and are out of service for a minute or three. I've got these servers monitored with web checks on the Zabbix server, and these checks go off every night.

I've thought about simply increasing the trigger threshold so that it requires 5 failures in a row, or 3 failures out of 5, or some similar trigger math. However, I've been getting good data on transient single failures caused by Java garbage collection, which suspends all the threads for several seconds while the servers are still taking traffic from the load balancer, and I'd like to remain sensitive to those kinds of transient issues.

I also have a set of servers which don't simply bounce the service every night for a couple of minutes, but are down for over half an hour as they do a bunch of pre-processing (operationally this is terrible for numerous reasons, but that is the way they behave -- at least most of the time they're not all bounced at the same time). So even if I increase the trigger threshold to require more sequential failures, these boxes will still alarm every night.

What I would like to do is disable most of the triggers on the host from the client side whenever Tomcat is administratively shut down. Is there a better way to do this than simply disabling the zabbix-agent as part of the Tomcat shutdown?

And...

While I was writing this... If I do a TCP check of the port that the software is listening on and that check fails, then I have a pretty good indication that the software is administratively down. I could make that an informational trigger and still display it as an IT service, but not send out e-mail announcements, and set up dependencies so that when this check fails, other triggers are suppressed... That way, at the dashboard level you could still see that the host was not taking traffic, but the more serious error condition (the host listening for traffic while the software is not working) would not be raised. It's possible that the software could have crashed before then, but typically a crashing server will do things (log FATALs, run out of memory, grind at 100% CPU, etc.) that I would not suppress based on the software not being up.
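To make the TCP-check idea concrete, here is a minimal Python sketch. It is not anything Zabbix ships: `port_is_open` and `classify` are hypothetical names, and the state labels are invented for illustration. In practice the suppression would presumably be done with trigger dependencies inside Zabbix rather than in a script.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, unreachable, etc.
        return False


def classify(port_open: bool, web_check_ok: bool) -> str:
    """Combine the TCP check and the web check into one state (labels are illustrative)."""
    if not port_open:
        # Nothing is listening: probably administratively down, informational only.
        return "not_listening"
    if not web_check_ok:
        # Listening but not answering correctly: the case worth alarming on.
        return "listening_but_broken"
    return "ok"
```

Only `listening_but_broken` would send e-mail under this scheme; `not_listening` would show on the dashboard but stay quiet.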

There might be some other clever way of using a checkpoint file that the startup/shutdown scripts would modify to record whether the software is administratively down. I might be able to pull that information into Zabbix and use it to determine whether a trigger failure was due to a software failure or to the software being administratively down.
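A rough sketch of the checkpoint-file idea in Python, assuming the shutdown script calls `mark_admin_down` and the startup script calls `mark_admin_up`. All three function names and whatever flag path is used are hypothetical. The 0/1 value could presumably be exposed to Zabbix through a custom agent item (for example a `UserParameter`), but that wiring is not shown here.

```python
import os
import time


def mark_admin_down(flag_path: str) -> None:
    """Called from the shutdown script: create the flag file with a timestamp."""
    with open(flag_path, "w") as f:
        f.write(time.strftime("%Y-%m-%dT%H:%M:%S") + "\n")


def mark_admin_up(flag_path: str) -> None:
    """Called from the startup script: remove the flag file if it exists."""
    try:
        os.remove(flag_path)
    except FileNotFoundError:
        pass  # Already up; nothing to clear.


def admin_down(flag_path: str) -> int:
    """Return 1 if the software is flagged as administratively down, else 0."""
    return 1 if os.path.exists(flag_path) else 0
```

One weakness of this approach: if the box crashes between shutdown and startup, the flag stays set, so the timestamp in the file could be used to expire stale flags.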

Has anyone already got a working, implemented solution to this problem?


Disabling Triggers from Client-Side
