We have on the order of 30 active proxies monitoring 725K items; nearly all of the hosts are network switches polled over SNMP. The server is a single box running Zabbix 7.0.5 on PostgreSQL with TimescaleDB, handling about 4,700 new values per second (NVPS),
and I have to add a hundred or so more switches to monitor. I'm worried that will push the box over the edge to where it can never catch up.
My queue spikes every 2 minutes, from a low of ~4K to a high of ~15K items. One of my proxies always seems to have data queued by 5 or 10 seconds, and I can't tell whether the server's queue graph looks that way because this specific proxy is slow/broken or because of the server itself.
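One way that could be narrowed down is by pulling the internal queue items over the API, since internal checks on proxy-monitored hosts are evaluated by the proxy itself and so reflect that proxy's own backlog. A rough sketch, assuming pyzabbix and that zabbix[queue,...] internal items exist on the server and on a host behind each proxy (URL and credentials are placeholders):

```python
# rough sketch: list every internal queue item and the host it lives on,
# so a consistently backed-up proxy stands out; assumes pyzabbix and that
# zabbix[queue,...] internal items exist on the server and behind each proxy
from pyzabbix import ZabbixAPI

zapi = ZabbixAPI("https://zabbix.example.com")   # placeholder URL
zapi.login("api_user", "api_password")           # placeholder credentials

items = zapi.item.get(
    search={"key_": "zabbix[queue"},             # matches zabbix[queue,...]
    output=["name", "key_", "lastvalue"],
    selectHosts=["host"],
)
for it in items:
    host = it["hosts"][0]["host"] if it.get("hosts") else "?"
    print(f"{host:30} {it['key_']:25} {it['lastvalue']}")
```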
I need to migrate to newer hardware, but what configuration changes should I be making?
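On the config side, the stock defaults for the caches and DB syncers are sized for much smaller installs than 4,700 NVPS, so those are the first parameters worth auditing (StartDBSyncers, HistoryCacheSize, HistoryIndexCacheSize, TrendCacheSize, ValueCacheSize, CacheSize). A quick helper sketch to dump whatever is currently set; the parameter names are real zabbix_server.conf ones, the path is an assumption:

```python
# print the tuning-relevant parameters from zabbix_server.conf; the
# parameter names are real Zabbix ones, the config path is an assumption
PARAMS = {
    "StartDBSyncers", "HistoryCacheSize", "HistoryIndexCacheSize",
    "TrendCacheSize", "ValueCacheSize", "CacheSize", "StartPollers",
}

with open("/etc/zabbix/zabbix_server.conf") as f:
    for line in f:
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() in PARAMS:
            print(f"{key.strip()} = {value.strip()}")
```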
A server restart takes on the order of 30 minutes, since the history sync takes a while and all of the data backlogged on the proxies has to be processed.
The biggest performance issue I see is disk writes. Zabbix reports disk utilization as low, but iotop shows 30-100 MB/s of writes constantly.
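For scale, a back-of-the-envelope comparison of the logical ingest rate against those iotop numbers (the per-row size below is an assumption, not something measured on this box):

```python
# back-of-the-envelope: logical history ingest vs. observed disk writes
nvps = 4700            # values per second, from the numbers above
bytes_per_row = 90     # assumed size of one numeric history row
logical_mb_s = nvps * bytes_per_row / 1e6
print(f"logical ingest = {logical_mb_s:.2f} MB/s")   # about 0.42 MB/s
# 30-100 MB/s observed vs. ~0.4 MB/s of raw data suggests most of the
# write volume is WAL, index maintenance, checkpoints and autovacuum
```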
Throwing more CPU and memory at it doesn't seem to make a bit of difference.
I have 8 cores, and even during those stressful restarts the load average doesn't get above 6. The box has 48 GB of memory with 34 GB sitting in buffer/cache, so neither Zabbix nor PostgreSQL is asking for more.
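Given that, the box looks disk-bound rather than CPU- or memory-bound, so it's worth checking whether the writes are checkpoint-driven. A minimal sketch with psycopg2, assuming PostgreSQL 16 or older (newer releases move some of these counters into pg_stat_checkpointer):

```python
# check whether buffer writes are mostly checkpoint-driven; the DSN is a
# placeholder, the pg_stat_bgwriter columns exist through PostgreSQL 16
import psycopg2

conn = psycopg2.connect(dbname="zabbix", user="zabbix")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT checkpoints_timed, checkpoints_req,
               buffers_checkpoint, buffers_backend
        FROM pg_stat_bgwriter
    """)
    timed, req, by_ckpt, by_backend = cur.fetchone()
    print(f"checkpoints: timed={timed} requested={req}")
    print(f"buffers written: checkpointer={by_ckpt} backends={by_backend}")
```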
(Screenshots: disk load, server queue during restart, normal queue)
