I've been running into an interesting issue lately whenever there is a network outage between our Zabbix proxies and the Zabbix server. When an outage occurs, the proxy caches the historical data like it is supposed to and, once the connection is back, starts sending that data to the server as intended. The trouble starts during that catch-up phase.
We have many triggers that use the "nodata" function. For instance, because proxies do not support the internal check to see whether a Zabbix agent is up, we check whether no data has been received from a Zabbix agent for over 5 minutes. If none has been received, a trigger is set off. Here's where the problem comes in.
Let's say that there is a network outage between the server and proxy for 20+ minutes. When the connection is re-established, the proxy starts sending data to the server. As best I've been able to ascertain, it can only send a limited amount of data every second (I'm not sure which daemon process controls sending data to the server). As a result, it takes a long while for the proxy to "catch up" on the data it is sending. Now, remember those triggers that use the "nodata" function? It appears that the proxy sends the OLDEST data first, so it takes a long while (read: 10+ minutes) for the proxy to catch up enough for recent values to appear on the server. During that entire time, the various "nodata" triggers all remain in the PROBLEM state because the old data comes in first.
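To put rough numbers on that catch-up delay, here is a minimal sketch in Python. It assumes the proxy drains its backlog strictly oldest-first at a constant rate while still collecting new values at a constant rate. The function name and the rates (200 values/s collected, 600 values/s sent) are made up for illustration; they are not measured from my setup or taken from any Zabbix setting.

```python
def catch_up_seconds(outage_s, collect_rate, send_rate):
    """Seconds after reconnect until an oldest-first backlog is fully drained.

    outage_s     -- length of the outage in seconds
    collect_rate -- values/s the proxy keeps collecting (hypothetical)
    send_rate    -- values/s the proxy can push to the server (hypothetical)
    """
    if send_rate <= collect_rate:
        # New values arrive at least as fast as old ones leave.
        raise ValueError("backlog never drains: send_rate must exceed collect_rate")
    backlog = collect_rate * outage_s           # values queued during the outage
    return backlog / (send_rate - collect_rate) # net drain rate while still collecting


# A 20-minute outage with the made-up rates above:
print(catch_up_seconds(20 * 60, 200, 600) / 60, "minutes")
```

With these made-up rates the backlog takes 10 minutes to drain, and with oldest-first ordering no recent value reaches the server until the very end of that window, which fits the behaviour I'm seeing with the "nodata" triggers.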
I hope that makes sense to someone. My real questions are: how can I increase the rate at which the proxy sends data to the server, and is there any way to make the proxy send the NEWEST data first and simply catch up on the older data in the background?
Please keep in mind that the Zabbix Proxy in question is never overloaded. CPU remains low and so does disk I/O. The proxy uses SQLite for its database (and it works totally fine except in the instance mentioned above).