Dear Zabbix community,
I'm looking for some help in tuning our configuration to deal with the following problem. Most of our data is collected by 9 Zabbix proxies (the Zabbix server itself only monitors the proxy nodes and a handful of other nodes), and today 3 of them went offline due to data centre issues.

I created a maintenance entry turning off data collection for the hosts behind those proxies and for the proxy nodes themselves. Still, after a while the "utilization of poller data collector processes" crept up and went above 75%. Normally this utilization is below 1%. I then set all the hosts behind these proxies to disabled, raised the Zabbix server config to StartPollers=50 and StartPollersUnreachable=10, and restarted the Zabbix server. But the high poller utilization kept coming back.

Eventually the pollers on the other 6 Zabbix proxies (which are otherwise fine) also reached a high utilization! So I raised these proxies to StartPollers=10 and StartPollersUnreachable=5 (they had been fine with default settings before). This kept things okay for about one hour, though notably the "utilization of poller data collector processes" was still elevated (5-10%) and the "utilization of proxy poller data collector processes" sat at a roughly constant 30%. I had been running with StartProxyPollers=10, so it looks like each unavailable proxy is keeping one proxy poller busy?

Then the utilization suddenly shot up again, to 70% for the proxy pollers and to 50% and rising for the pollers, so I raised StartProxyPollers to 25 and restarted the server. That's where I am now: proxy poller utilization is about 12% (3/25), and poller utilization fluctuates around a few percent (still higher than normal) with occasional spikes above 10%. But I fear the performance will degrade again, as I don't understand the reason for the observed behaviour. I can't see anything suspicious in the log files.
Zabbix server is 4.0.23 and the database is PostgreSQL 9.6. Proxies and agents run in passive mode. The number of processed values per second is around 50 when everything is up.
Any suggestions on what I can tune? When all proxies are up, the load is very low on all collector processes, and I don't understand why offline proxies wreak havoc like this. Let me know what other data I can provide to figure out this issue. Many thanks!
System information:
Code:
Parameter                                           Value   Details
Number of hosts (enabled/disabled/templates)        351     143 / 114 / 94
Number of items (enabled/disabled/not supported)    11532   6781 / 4617 / 134
Number of triggers (enabled/disabled [problem/ok])  5574    3152 / 2422 [5 / 3147]
Relevant settings in zabbix_server.conf:
Code:
StartPollers=50
StartIPMIPollers=3
StartPollersUnreachable=10
CacheSize=1G
CacheUpdateFrequency=30
HistoryCacheSize=256M
HistoryIndexCacheSize=64M
TrendCacheSize=64M
ValueCacheSize=1G
Timeout=20
UnreachablePeriod=120
UnreachableDelay=120
LogSlowQueries=3000
StartProxyPollers=25
ProxyConfigFrequency=60
ProxyDataFrequency=10
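To sanity-check my guess that each unreachable proxy ties up one proxy poller, here is the arithmetic as a small Python sketch. It assumes exactly one proxy poller is permanently busy per offline proxy, which is only my hypothesis and not confirmed Zabbix behaviour. The function name busy_fraction is made up for illustration.

```python
def busy_fraction(busy_processes: int, total_processes: int) -> float:
    """Return the share of busy processes as a percentage."""
    if total_processes <= 0:
        raise ValueError("total_processes must be positive")
    return 100.0 * busy_processes / total_processes


# From the post: 3 of the 9 proxies are offline.
OFFLINE_PROXIES = 3

# Assumption (unconfirmed): one proxy poller stays busy per offline proxy.
# Compare the two StartProxyPollers values I have used so far.
for start_proxy_pollers in (10, 25):
    pct = busy_fraction(OFFLINE_PROXIES, start_proxy_pollers)
    print(f"StartProxyPollers={start_proxy_pollers}: {pct:.0f}% busy")
```

This reproduces the steady readings I saw (30% with 10 proxy pollers, 12% with 25), but it does not account for the later jump to 70%, so something else must be going on as well.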