Zabbix agent item on host failed: first network error, wait for 15 seconds

jeffm

Junior Member

Joined: Apr 2020

Posts: 2
#1


30-04-2020, 02:37

Many messages like these are appearing in the log:
3297:20200429:235705.672 Zabbix agent item "system.cpu.util[,user]" on host "hostname-deleted" failed: first network error, wait for 15 seconds
3301:20200429:235720.692 resuming Zabbix agent checks on host "hostname-deleted": connection restored
This is generating a lot of flapping false alerts. Yet zabbix_get retrieves the item without trouble, and quickly:
$ time zabbix_get -s hostname-deleted -k 'system.cpu.util[,user]'
0.133534

real 0m0.012s
user 0m0.003s
sys 0m0.002s
Running this test ten times in a row using a bash for-loop doesn't show any problems.
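For anyone wanting to repeat that test, here is a minimal sketch of such a loop. It is written as a POSIX while loop rather than a bash for-loop so it runs under any sh. The function name probe_repeat is made up for illustration, and the sketch assumes the command being run exits non-zero when it fails to reach the agent.

```shell
# probe_repeat COUNT COMMAND [ARGS...]
# Runs COMMAND COUNT times, discarding its output, and reports how many
# runs exited with a non-zero status. (Hypothetical helper, not part of Zabbix.)
probe_repeat() {
    count=$1
    shift
    failures=0
    i=1
    while [ "$i" -le "$count" ]; do
        # Assumption: a non-zero exit status means the check failed.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "failures=${failures} of ${count}"
}

# Example, using the placeholder hostname from this post:
# probe_repeat 10 zabbix_get -s hostname-deleted -k 'system.cpu.util[,user]'
```

A test like this only exercises whichever address zabbix_get happens to connect to, so a clean run does not prove that the server's own checks are taking the same path.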
This is running on a GCP g1-small VM (1 vCPU, 1.7 GB memory).
The database is PostgreSQL 9.6 on the Cloud SQL managed service, on an N1 machine type with 2 vCPUs, 7.5 GB of RAM, and SSD storage.
So far I've eliminated network issues such as firewalls, and also SELinux.
More background: this is part of moving our infrastructure from on-premises to GCP. As part of this move, Zabbix is being upgraded from version 3 to version 4. We have three environments: Dev, Test, and Prod. The image is created via Packer and customised at run time using startup scripts. This has worked for Dev and Test. It is the final environment, Prod, that is experiencing the issue. The only difference of note is that Dev and Test both use database servers that are shared with other databases, whereas in Prod a new database instance was created specifically for Zabbix.

Zabbix version: 4.4.8
Number of hosts being monitored: 158
Number of Items: 5853
Number of triggers: 4119
Required server performance, new values per second: 58.35
load average: 1.10, 0.67, 0.53
CPU is about 20% busy.

Does anyone have any suggestions on what to check next?
What other information would help pinpoint the source of the problem?

Thanks in advance.
jeffm

Junior Member

Joined: Apr 2020

Posts: 2
#2

01-05-2020, 02:58

This turned out to be related to IPv6 versus IPv4 addressing. In the on-premises data centre there were both IPv4 and IPv6 addresses in use. In GCP there can only be IPv4. This means that it can't use IPv6 to reach the old server in the on-premises data centre. I had selected the IPv4 addresses as the default in "Agent interfaces", yet it seems that Zabbix fell back to IPv6 for some reason. The log message doesn't indicate how the server was trying to contact the agent. I discovered the use of IPv6 from a red error message box at the top of the BUI while trying various things to rectify the problem. Removing all references to IPv6 addresses in the config seems to have fixed the "first network error" messages.
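In hindsight, testing each configured address separately would probably have exposed this sooner. Below is a rough sketch of that idea, not something I ran at the time. The function names addr_family and check_addrs and the ZABBIX_GET override variable are made up for illustration; the sketch assumes zabbix_get exits non-zero when it cannot reach the agent, and uses the agent.ping key only as an example.

```shell
# Classify an address as ipv6 if it contains a colon, otherwise ipv4.
addr_family() {
    case "$1" in
        *:*) echo ipv6 ;;
        *)   echo ipv4 ;;
    esac
}

# check_addrs KEY ADDR...
# Queries the agent at each address in turn and prints one line per address:
#   <family> <address> <ok|FAILED>
# ZABBIX_GET is a hypothetical override for the command to run; it defaults
# to the real zabbix_get binary.
check_addrs() {
    key=$1
    shift
    for addr in "$@"; do
        if ${ZABBIX_GET:-zabbix_get} -s "$addr" -k "$key" >/dev/null 2>&1; then
            result=ok
        else
            result=FAILED
        fi
        echo "$(addr_family "$addr") $addr $result"
    done
}

# Example with documentation-range placeholder addresses:
# check_addrs agent.ping 192.0.2.10 2001:db8::10
```

Pass in every address listed under the host's agent interfaces. Any line ending in FAILED points at an address the server cannot use.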