Zabbix 6.0.14. © 2001–2023, Zabbix SIA
Linux (ubuntu 20.04)
Standard E4as v4 (4 vcpus, 32 GiB memory)
Zabbix agent version running on the monitored hosts (Windows Server 2019 Datacenter):
zabbix_agent2-6.0.23-windows-amd64-openssl.msi
Monitored hosts: 68
Required server performance, new values per second: 56.16
We have had a system running for a long time, and it has worked great.
We have frequently tuned it when adding new hosts, and we have monitored housekeeping and other processes to keep the environment running smoothly.
But during the last month some monitored hosts started to behave differently.
We noticed it in the frontend first:
Connection to Zabbix server "localhost" timed out: Possible reasons:
1. Incorrect server IP/DNS in the "zabbix.conf.php".
2. Firewall is blocking TCP connection.
- Connection timed out
After checking the inbound flows to the Zabbix server, we saw that the graph was sky high. Further investigation showed that 2-3 monitored hosts were sending too much data to the Zabbix server.
(example)
sudo tail -f zabbix_server.log
1357:20240130:133326.485 failed to accept an incoming connection: connection rejected, getpeername() failed: [107] Transport endpoint is not connected.
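To see whether these rejections come in bursts, the server log can be bucketed per minute. This is only a rough sketch: the function name `failures_per_minute` and the sample text are made up for illustration, and it assumes the `PID:YYYYMMDD:HHMMSS.mmm message` line layout seen in the log line above.

```python
import re
from collections import Counter

# Assumed layout: PID:YYYYMMDD:HHMMSS.mmm message
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{4})\d{2}\.\d{3}\s+(.*)$")


def failures_per_minute(log_text, needle="failed to accept an incoming connection"):
    """Count log lines containing `needle`, grouped by 'YYYYMMDD HH:MM'."""
    buckets = Counter()
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m and needle in m.group(4):
            hhmm = m.group(3)
            buckets[f"{m.group(2)} {hhmm[:2]}:{hhmm[2:]}"] += 1
    return buckets


# Illustrative sample only, not real log data.
SAMPLE_LOG = """\
1357:20240130:133326.485 failed to accept an incoming connection: connection rejected
1357:20240130:133341.002 failed to accept an incoming connection: connection rejected
1360:20240130:133402.110 some unrelated message
1357:20240130:133405.730 failed to accept an incoming connection: connection rejected
"""

if __name__ == "__main__":
    for minute, count in sorted(failures_per_minute(SAMPLE_LOG).items()):
        print(minute, count)
```

In practice the text would come from reading zabbix_server.log instead of `SAMPLE_LOG`.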
The fix was to use network tools to find which monitored hosts were sending too much data, and then stop Agent 2 on those hosts.
In some cases that worked; in other cases we had to stop the Zabbix server, then the Zabbix agent on the server, and start them up again.
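One way to spot the noisy hosts from the Zabbix server side is to count connections per remote IP on port 10051. The sketch below is an illustration under assumptions: the function name `count_peers` and the sample addresses are invented, and it expects text in the default column layout of `ss -tn` on Linux.

```python
from collections import Counter


def count_peers(ss_output, port=10051):
    """Count TCP connections per remote IP where the local port equals `port`.

    Assumed columns: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
    """
    peers = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 5 or fields[0] == "State":
            continue
        local, peer = fields[3], fields[4]
        if local.rsplit(":", 1)[-1] != str(port):
            continue
        peers[peer.rsplit(":", 1)[0]] += 1
    return peers


# Illustrative sample only, not real data.
SAMPLE_SS = """\
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 10.0.0.4:10051 10.0.0.21:51234
ESTAB 0 0 10.0.0.4:10051 10.0.0.21:51240
TIME-WAIT 0 0 10.0.0.4:10051 10.0.0.22:49800
ESTAB 0 0 10.0.0.4:22 10.0.0.9:60000
"""

if __name__ == "__main__":
    for ip, count in count_peers(SAMPLE_SS).most_common():
        print(ip, count)
```

Feeding it the real output of `ss -tn` (for example via `subprocess.run`) would list the hosts with the most connections first.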
Agent 2 logs (example)
# Host logs: entries were piling up and the agent was doing too much.
2022/10/13 09:23:13.119956 [101] cannot connect to [ZABBIX-IP:10051]: dial tcp :0->ZABBIX-IP:10051: i/o timeout
2022/10/13 09:23:13.119956 [101] active check configuration update from host [MONITORED-HOST] started to fail
[ ..the same log entries kept repeating every second ]
It looked, and still looks, almost as if the ports were exhausted.
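The port exhaustion suspicion could be checked on an affected Windows host by counting TCP states for connections towards port 10051. This is a sketch under assumptions: `count_states` is an invented name, the sample rows are made up, and the parser assumes the usual `netstat -ano -p tcp` columns (Proto, Local Address, Foreign Address, State, PID). A large number of TIME_WAIT or SYN_SENT rows would support the theory; a handful would not.

```python
from collections import Counter


def count_states(netstat_output, remote_port=10051):
    """Count TCP states for connections whose foreign port equals `remote_port`."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Only rows that start with the protocol name are connection rows.
        if len(fields) < 4 or fields[0] != "TCP":
            continue
        foreign, state = fields[2], fields[3]
        if foreign.rsplit(":", 1)[-1] == str(remote_port):
            states[state] += 1
    return states


# Illustrative sample only, not real data.
SAMPLE_NETSTAT = """\
Active Connections

  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:10050          0.0.0.0:0              LISTENING       2140
  TCP    10.0.0.21:51234        10.0.0.4:10051         TIME_WAIT       0
  TCP    10.0.0.21:51235        10.0.0.4:10051         TIME_WAIT       0
  TCP    10.0.0.21:51236        10.0.0.4:10051         SYN_SENT        2140
  TCP    10.0.0.21:51300        10.0.0.9:443           ESTABLISHED     988
"""

if __name__ == "__main__":
    print(dict(count_states(SAMPLE_NETSTAT)))
```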
# Telnet check against localhost
telnet localhost 10050 always succeeded.
telnet localhost 10051 was very slow and sometimes did not respond.
The next step was to upgrade to a newer agent version.
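The telnet check can be made repeatable by timing the TCP connect itself. A minimal sketch, assuming nothing beyond the Python standard library; the function name `connect_time` and the 3 second timeout are arbitrary choices for illustration.

```python
import socket
import time
from typing import Optional


def connect_time(host: str, port: int, timeout: float = 3.0) -> Optional[float]:
    """Return seconds taken to open a TCP connection, or None on failure/timeout."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

Calling `connect_time("localhost", 10051)` in a loop while the problem occurs, and again when things are calm, would show whether the server is slow to accept connections or refuses them outright.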
zabbix_agent2-6.0.26-windows-amd64-openssl.msi
But we are still seeing the issue from time to time.
The Zabbix dashboards (System performance, Zabbix server health, and Zabbix server processes) are normal when this happens.
Housekeeping is also normal.
We also tuned:
StartPollers
Timeout
We also searched through many links about this error and behavior to find a fix, but without success so far.
I hope someone can point me in the right direction.
Maybe this is not enough information about the configuration, environment, and so on.
Regards