We are monitoring over 500 hosts spread across several geographically separate sites (using a Zabbix proxy at each site).
The Zabbix alerter process often spikes to 100% busy for a minute or two, then calms back down. This has progressively gotten worse, to the point where the process stayed at 100% busy for a couple of days at a time, causing Zabbix to fail to connect to monitored hosts and throw many false alarms. PostgreSQL also struggles badly when this happens.
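For reference, the alerter's busy percentage can be graphed with a Zabbix internal item along these lines (a sketch, assuming the zabbix[process,...] internal checks are available in the server version in use):

Code:
# Item on the Zabbix server host, item type "Zabbix internal":
# average % of time the alerter process was busy over the last minute
zabbix[process,alerter,avg,busy]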
I have reviewed the logs and don't see any correlation with the spikes. Is there a way to start multiple alerter processes to share the load? What do we need to do to prevent this from happening?

When I attach strace to the alerter PID, I see a lot of:
Code:
select(0, NULL, NULL, NULL, {0, 100000}) = 0 (Timeout)
read(4, 0x289a9e3, 5) = -1 EAGAIN (Resource temporarily unavailable)
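Regarding the multiple-alerter question: newer Zabbix releases expose a StartAlerters parameter in zabbix_server.conf, so part of the answer may simply depend on the server version in use. A minimal sketch, assuming a version that supports it:

Code:
# zabbix_server.conf (hypothetical value; requires a Zabbix version
# that supports StartAlerters, and a server restart to take effect)
StartAlerters=3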