I have a medium-sized Zabbix setup: one central Zabbix server and multiple Zabbix proxies, one at each site I'm monitoring. All of them are set up with the official Docker containers. The main server uses:
* postgres:11-alpine
* zabbix/zabbix-web-nginx-pgsql:alpine-4.0-latest
* zabbix/zabbix-snmptraps:alpine-4.0-latest
* zabbix/zabbix-server-pgsql:alpine-4.0-latest
The proxies each run just a single Docker image:
* zabbix/zabbix-proxy-sqlite3:ubuntu-4.0-latest
The proxies mostly monitor other VMs in the same VMware vCenter.
The problem is that the proxy logs show a very high number of network errors, all of which look roughly like this:
Zabbix agent item "some.item" on host "SOME HOST" failed: first network error, wait for 15 seconds
As a result, there is a high number of false-positive problems in Zabbix: mostly "Zabbix agent on SOME HOST is unreachable for 5 minutes", but sometimes also other problems that are triggered by .nodata().
There is also a large amount of missing item data, since hosts with network errors are considered "offline" for a while and none of their items are checked during that time.
I've also investigated a bit and found the source code that produces this error: https://github.com/zabbix/zabbix/blo.../poller.c#L302
Unfortunately, the same message seems to be triggered in three different failure cases: https://github.com/zabbix/zabbix/blo.../poller.c#L749
So I couldn't really find out anything that way. Of course, I also looked at CPU, RAM, disk and network usage on the proxies and couldn't find anything that looked out of the ordinary to me.
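To at least see whether the errors cluster on particular hosts, something like the following could tally the log lines per host. This is only a rough sketch: the regex matches just the message shape quoted above and ignores whatever timestamp or PID prefix the proxy puts in front, and the function name and the way the log is piped in are my own assumptions, not anything Zabbix provides.

```python
import re
import sys
from collections import Counter

# Matches only the message shape quoted in this post; any prefix
# (timestamp, PID) in front of it is ignored by using search().
NETWORK_ERROR = re.compile(
    r'item "(?P<item>[^"]*)" on host "(?P<host>[^"]*)" failed: first network error'
)


def count_first_network_errors(lines):
    """Return a Counter mapping host name to its number of 'first network error' lines."""
    counts = Counter()
    for line in lines:
        match = NETWORK_ERROR.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts


if __name__ == "__main__":
    # Reads the proxy log from stdin and prints hosts, most affected first.
    for host, n in count_first_network_errors(sys.stdin).most_common():
        print(f"{n:6d}  {host}")
```

It could be fed with the container log, e.g. `docker logs <proxy-container> 2>&1 | python count_errors.py`, where the container name and script name are placeholders.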
How should I proceed to find out the cause of these errors? Has anyone else had this happen to them?