Hello all,
We are facing issues with our Zabbix server / proxies. The main ZBX server is an EC2 instance in AWS (Ubuntu 20.04 LTS, instance type t3.large, Zabbix 4.4.8) and we have 6 active proxies (3 in AWS and 3 hosted in data centers = MH).
The main problem is with 2 of the MH proxies - they handle most of the hosts in our environment (VA3 proxy: 83 hosts / 10264 items, RUH proxy: 23 hosts / 2467 items).
The AWS <-> MH connection goes through an AWS Load Balancer with a URL alias (acting as the ZBX server access point on ports 10050 and 10051) and a Citrix LB mapping public IPs to the private IPs of the MH proxies (again with ports 10050 and 10051 open). Both sides have access rules that allow only the specified connections, so the problem is not caused from outside (DDoS attack etc.).
The Zabbix server became unstable when we switched to active clients (we need this for auto-registration of clients, IIS log monitoring etc.; the relevant agent-side settings are sketched below the proxy config). I did some performance tuning, which helped with the cache issues and the poller warnings, but the main issue persists:
Two of our MH proxies randomly disconnect and cause monitoring to go crazy (all clients show as unreachable). When this happens, the proxy "Last seen (age)" time starts growing, but only for one of the two affected proxies at a time: the age grows for one (the other seems to have higher latency but stays visible), then they 'switch', and so on. While this is happening I can still telnet from the proxy to the server without issues (see the check below) - at first I suspected an underlying network issue, but our network team confirmed there are no connection drops.
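For reference, this is roughly the connectivity check I run from the affected proxy (the hostname is just a placeholder for our LB alias, not the real one):
telnet zbx-server-alias.example.com 10051   # server trapper port used by the active proxy
The TCP connection opens without any problem even while the proxy shows as not seen.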

There are recurring patterns in the performance graphs - when a proxy drops, the Zabbix queue goes up and the number of processed values per second drops rapidly:

Another recurring anomaly is that the utilization of the trapper data collector processes goes up:

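(For reference, that graph comes from the standard internal check - assuming the default template, the item key is:
zabbix[process,trapper,avg,busy]
Since our proxies are active, their data arrives through these trapper processes on the server.)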
The rest of the server performance seems fine:

The last 4 similar patterns occurred after the performance tuning - the server survives the first spike of history data sync and the load stays constant... then 'something' happens and the connection goes down.
The only fix we have found (apart from disabling all proxy clients and re-enabling them slowly) is to restart the server and the proxies.
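(Roughly, assuming the standard Ubuntu packages and service names:
systemctl restart zabbix-server   # on the AWS server
systemctl restart zabbix-proxy    # on each affected MH proxy
After that everything recovers until the next occurrence.)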
Cache usage graph:

Server + proxy config files: https://drive.google.com/open?id=1U9...jxU9MX6uz4d3o1
Changes in server config:
StartPollers=10
StartPollersUnreachable=15
StartDiscoverers=3
StartHTTPPollers=3
CacheSize=1000M
StartDBSyncers=5
HistoryCacheSize=800M
ValueCacheSize=1G
LogSlowQueries=3000
VA3 proxy config:
StartPollers=10
StartPreprocessors=5
StartPollersUnreachable=15
Timeout=5
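Agent-side active-check settings (mentioned above) on the monitored hosts look roughly like this - the hostname and metadata below are placeholders, not our real values:
ServerActive=va3-proxy.example.local
HostnameItem=system.hostname
HostMetadata=windows-iis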
I will be grateful for any suggestions.
Regards,
Adam