I have been using Zabbix 2.4.8 to monitor a large ISP network for almost 2 years. It works very well!
My Zabbix server is in the cloud on AWS. I have two Zabbix proxies in my network's core datacenter that split the load of monitoring my ~2000 hosts.
The load is split evenly: each proxy monitors about 1000 hosts and runs at 700-850 VPS.
This works very well when there is no disruption of communication between the proxies and the server.
However, we sometimes suffer upstream connectivity outages at our datacenter, which cut the proxies off from the server for anywhere between 15 minutes and a few hours.
I've found the proxies and server have a lot of trouble "catching up" after a disruption like that.
What I see when connectivity is restored is that all hosts monitored by the proxies show delayed data/graphs. Even though Administration → Proxies shows a "last seen" of just a few seconds ago, graphs for proxy-monitored hosts lag by 45 minutes to 1.5 hours for a while. It looks like no data is being collected, but in fact data is collecting and just filling in with a delay.
It seems to be because I'm close to the 1000-records-per-send limit imposed on proxies by ZBX_MAX_HRECORDS. I think that after an outage, all the buffered history doesn't fit through the "pipe" alongside the realtime data, and data is sent sequentially.
I wish I could tell the proxy to prioritize the live data over the old data, but I don't think I can do that.
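Some back-of-the-envelope math seems to support this. Assuming the data sender ships at most ZBX_MAX_HRECORDS (1000) values per second while new values keep arriving at the proxy's normal rate (an assumption on my part; the actual sender cadence may differ):

```python
# Rough catch-up estimate for one proxy after an outage.
# Assumptions: sender drains at most BATCH values/sec; new values
# keep arriving at the proxy's normal VPS during catch-up.
VPS = 800            # this proxy's normal new-values-per-second
BATCH = 1000         # ZBX_MAX_HRECORDS: max values per sender pass
OUTAGE_S = 15 * 60   # a 15-minute outage

backlog = VPS * OUTAGE_S       # values queued up during the outage
drain_rate = BATCH - VPS       # net values cleared per second
catch_up_s = backlog / drain_rate

print(f"backlog={backlog} values, catch-up ~ {catch_up_s / 60:.0f} minutes")
```

Under those assumptions a 15-minute outage at 800 VPS takes about an hour to drain, which roughly matches the 45-minute to 1.5-hour delays I'm seeing.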
--reduce the amount of history data stored. I'd actually rather have a gap during the outage and just drop the data the proxy couldn't send, if it meant avoiding this delay problem. Would I do this by reducing HistoryCacheSize to 0 on the proxies, or is there another setting for this?
--recompile zabbix_proxy with ZBX_MAX_HRECORDS set higher than 1000 (https://zabbix.com/forum/showthread.php?t=56509). This would probably really help, but it's labor intensive and makes future upgrades harder.
--add a third proxy. This would be the easiest fix for me right now because it's relatively quick.
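For the first option, my understanding (worth double-checking against the 2.4 docs) is that HistoryCacheSize only sizes the in-memory cache, and the retention knobs in zabbix_proxy.conf are actually these:

```
# zabbix_proxy.conf -- sketch only, please verify for 2.4.x

# Hours to keep collected data when the server is unreachable.
# Lowering this should make the proxy discard older backlog sooner.
ProxyOfflineBuffer=1

# Hours to keep data locally even after it has been sent (0 = don't keep).
ProxyLocalBuffer=0
```

If that's right, a small ProxyOfflineBuffer would trade the delayed backfill for a flat gap in the graphs, which is what I'd prefer.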
I may be imagining things, but I feel like if I reboot the proxies and server after the outage, they "catch up" more quickly and get back to realtime data.
Any help with this is greatly appreciated!