Zabbix agent item on host failed: first network error, wait for 15 seconds

  • jeffm
    Junior Member
    • Apr 2020
    • 2

    #1


    Many messages like these are appearing in the log:
    3297:20200429:235705.672 Zabbix agent item "system.cpu.util[,user]" on host "hostname-deleted" failed: first network error, wait for 15 seconds
    3301:20200429:235720.692 resuming Zabbix agent checks on host "hostname-deleted": connection restored
    This generates a lot of flapping false alerts. Yet zabbix_get retrieves the same item without trouble, and very quickly:
    $ time zabbix_get -s hostname-deleted -k 'system.cpu.util[,user]'
    0.133534

    real 0m0.012s
    user 0m0.003s
    sys 0m0.002s
    Running this test ten times in a row using a bash for-loop doesn't show any problems.
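For reference, a repeated test of this kind might be sketched as below. `repeat_check` is a hypothetical helper, not part of Zabbix, and the sketch assumes that `zabbix_get` exits with a non-zero status when it cannot reach the agent. It only counts failed runs, so it would not catch a check that succeeds slowly.

```shell
#!/bin/sh
# repeat_check is a hypothetical helper, not part of Zabbix.
# It runs the given command COUNT times and prints how many runs failed.
repeat_check() {
    count=$1
    shift
    failures=0
    i=1
    while [ "$i" -le "$count" ]; do
        # Discard output; only the exit status matters here.
        # Assumption: the command exits non-zero on a network failure.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# Example (host name and item key taken from the post above):
# repeat_check 10 zabbix_get -s hostname-deleted -k 'system.cpu.util[,user]'
```

A result of `0` only shows that ten manual checks succeeded from the shell. It says nothing about which address the Zabbix server itself uses when it polls the agent.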
    The Zabbix server runs on a GCP g1-small VM (1 vCPU, 1.7 GB memory).
    The database is PostgreSQL 9.6 on the Cloud SQL managed service, with an N1 type machine (2 vCPU, 7.5 GB of RAM) and SSD storage.
    So far, I've eliminated network issues such as firewalls, and also SELinux.
    More background: this is part of moving our infrastructure from on-premises to GCP. As part of this move, Zabbix is being upgraded from version 3 to version 4. We have three environments: Dev, Test, and Prod. The image is created via Packer and customised at run time using startup scripts. This has worked for Dev and Test. It is the final environment, Prod, that is experiencing the issue. The only difference of note is that both Dev and Test have database servers which are shared with other databases. In Prod, a new database instance was created specifically for Zabbix.

    Zabbix version: 4.4.8
    Number of hosts being monitored: 158
    Number of Items: 5853
    Number of triggers: 4119
    Required server performance, new values per second: 58.35
    load average: 1.10, 0.67, 0.53
    CPU is about 20% busy.

    Does anyone have any suggestions on what to check next?
    What other information would help pinpoint the source of the problem?

    Thanks in advance.
  • jeffm
    Junior Member
    • Apr 2020
    • 2

    #2
    This turned out to be related to IPv6 versus IPv4 addressing. In the on-premises data centre there were both IPv4 and IPv6 addresses in use. In GCP there can only be IPv4. This means that the server in GCP can't use IPv6 to reach the old server in the on-premises data centre. I had selected the IPv4 addresses as the default in "Agent interfaces", yet it seems that Zabbix fell back to IPv6 for some reason. The log message doesn't indicate how the server was trying to contact the agent. I discovered the use of IPv6 from a red error message box at the top of the BUI while trying various things to rectify the problem. Removing all references to IPv6 addresses in the config seems to have fixed the "first network error" messages.
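To audit a configuration for leftover IPv6 entries, something like the following sketch could help. It assumes the interface addresses have already been exported to a plain list, one per line. How to export them is not covered here, and both function names are hypothetical. The classification is by shape only (a colon means IPv6), so it is a rough filter rather than a validator.

```shell
#!/bin/sh
# addr_family is a hypothetical helper: it classifies an address string
# by shape only (a colon means IPv6, dotted digits mean IPv4).
addr_family() {
    case "$1" in
        *:*) echo ipv6 ;;
        *[!0-9.]*|'') echo other ;;   # host names, empty lines
        *.*.*.*) echo ipv4 ;;
        *) echo other ;;
    esac
}

# list_ipv6 reads one address per line on stdin and prints only the
# entries that look like IPv6 addresses.
list_ipv6() {
    while IFS= read -r addr; do
        if [ "$(addr_family "$addr")" = ipv6 ]; then
            echo "$addr"
        fi
    done
    return 0
}

# Example with made-up addresses:
# printf '%s\n' 10.0.0.5 2001:db8::1 | list_ipv6
```

Any address this prints is a candidate for removal or replacement when the server can only reach agents over IPv4.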
