I have Googled the heck out of this problem and I am at my wits' end, so I'm hoping to get some advice here.
I have just built and prepared a Zabbix 4 system using Postgres on a brand new Dell server (not a VM). It is running Ubuntu 18.04.01 with ZFS root, 16 GB RAM and 2 TB of RAID10 storage. The performance of the machine is fantastic: network, disk and DB utilisation are all well below 10%, with plenty of headroom. I built this system with the intention of replacing our ageing Zabbix 2.2 server.
To facilitate the cutover from Zabbix 2.2 to Zabbix 4, I currently have two proxy servers (one for each version of Zabbix), and 10 agents connect through these proxies. The agents are version 2.2.14 and are configured to send their data to both proxies. Here is the problem: the agents that feed the Zabbix 2 proxy/server work perfectly and in real time, but the same agents feeding the Zabbix 4 proxy/server are always delayed (sometimes by up to 4 or 5 hours!). As a result, some data is occasionally missed and dropped (i.e. holes in graphs).
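To put numbers on the delay, a rough sketch like the one below could help. It assumes you have already collected the most recent `clock` value (epoch seconds, as stored in the Zabbix history tables) for each item of interest; the function names and the sample item keys are made up for illustration and are not part of Zabbix.

```python
import time


def lag_report(latest_clocks, now=None):
    """Return (item, lag_in_seconds) pairs, worst lag first.

    latest_clocks maps an item name to the epoch timestamp of its newest
    stored value. How that mapping is obtained is left open here.
    """
    if now is None:
        now = int(time.time())
    # Clamp at zero so a slightly fast clock does not yield negative lag.
    lags = [(item, max(0, now - clock)) for item, clock in latest_clocks.items()]
    return sorted(lags, key=lambda pair: pair[1], reverse=True)


def format_lag(seconds):
    """Render a lag in seconds as e.g. '4h30m00s'."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h{minutes:02d}m{secs:02d}s"


if __name__ == "__main__":
    # Hypothetical sample data: two items, one far behind the other.
    sample = {"system.cpu.load": 1_000_000, "vfs.fs.size": 1_000_000 - 16_000}
    for item, lag in lag_report(sample, now=1_000_200):
        print(item, format_lag(lag))
```

Running the same report against both the old and the new server at the same moment would show whether every item behind the Zabbix 4 proxy lags by a similar amount or only some of them do.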
I have checked the following:
- Disk I/O, there is next to no disk activity
- Network I/O, also very low
- DB, monitoring with pgAdmin shows only around 50 transactions per second, with occasional spikes to 600 or so.
- Have checked the time is synced across the agent, proxy and server (although the proxy and agents use a different NTP source to the server, this is unavoidable as the agents are on customer premises)
- There are no Unsupported Items in the list of monitored items.
- The zabbix server is set for 50 pollers, 10 trappers, 20 pingers, 30 http pollers
- The zabbix data gathering processes show only about 15% busy, see attached image. (we have around 12000 items all up), similar graphs for the proxy show no discernible problems.
Some troubleshooting I have done:
- Tried the proxy in active and passive mode, the problem persists in either mode.
- Turned the number of pollers and trappers up on both the server and proxy, no difference.
- Increased the interval for many of the checks.
- Removed all but one agent from the proxy; however, even with only a small number of items, they still get delayed.
So the question is: what is the difference between the working proxy and the non-working one? Obviously there is a Zabbix version difference, and I do not see the same delayed data on Zabbix 2. In addition, the data from the Zabbix 4 proxy comes via a VPN, whereas the Zabbix 2 proxy is directly linked. However, I also have other Zabbix 4 proxies working through a VPN just fine.
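One way to compare the VPN path with the direct path is to time a plain TCP connect to each endpoint. The sketch below is only an illustration: the hostnames are placeholders, the function name is made up, and it assumes the endpoint listens on 10051, the default Zabbix trapper port. It measures connection setup only, so it says nothing about how Zabbix processes the data afterwards.

```python
import socket
import time


def connect_time(host, port=10051, timeout=5.0):
    """Return the seconds taken to open a TCP connection to host:port."""
    start = time.monotonic()
    # create_connection raises OSError on timeout or refusal.
    with socket.create_connection((host, port), timeout=timeout):
        return time.monotonic() - start


if __name__ == "__main__":
    # Placeholder hostnames; substitute the real server or proxy addresses.
    for host in ("zabbix4-server.example", "zabbix2-server.example"):
        try:
            print(host, f"{connect_time(host):.3f}s")
        except OSError as exc:
            print(host, "connect failed:", exc)
```

Connect times that are consistently similar on both paths would point away from the VPN and towards the proxy or server side.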
Some other important things you will need to know: the Zabbix 4 server is currently 4.0.0rc3 on Ubuntu 18.04.01, the affected Zabbix proxy is 4.0.0 running on Solaris 11.3, and the agents are Zabbix 2.2.14, also running on Solaris 11.3.
I'm starting to think I have struck some sort of bug, so any assistance is greatly appreciated.