near-constant "first network error, wait for 15 seconds"

  • ddrucker
    Member
    • Feb 2019
    • 35

    #1

    After updating from 6.4 to 7.0.2, my logs are constantly scrolling (5-10 per second) with things like:

    27041:20240805:101152.822 Zabbix agent item "vfs.fs.size[/nisaba/micvna,used]" on host "x5backup" failed: first network error, wait for 15 seconds
    27041:20240805:101152.831 Zabbix agent item "system.cpu.util[,interrupt]" on host "qc" failed: first network error, wait for 15 seconds
    27041:20240805:101153.115 Zabbix agent item "system.cpu.util[,iowait]" on host "mic-dicom-router-mercure" failed: first network error, wait for 15 seconds
    27041:20240805:101154.826 Zabbix agent item "system.cpu.util[,guest_nice]" on host "pluto" failed: first network error, wait for 15 seconds
    27041:20240805:101158.196 resuming Zabbix agent checks on host "proxmox01": connection restored
    27041:20240805:101158.251 resuming Zabbix agent checks on host "dell-scg": connection restored
    27041:20240805:101200.023 Zabbix agent item "vfs.fs.size[/,free]" on host "micvna" failed: another network error, wait for 15 seconds
    27041:20240805:101200.254 resuming Zabbix agent checks on host "mclean4t": connection restored
    27041:20240805:101200.862 Zabbix agent item "net.if.in[tap801i0]" on host "proxalone.mlean.harvard.edu" failed: first network error, wait for 15 seconds
    27041:20240805:101204.031 Zabbix agent item "system.cpu.util[,user]" on host "proxmox01" failed: first network error, wait for 15 seconds
    27041:20240805:101204.956 Zabbix agent item "system.localtime" on host "94t2" failed: another network error, wait for 15 seconds
    27041:20240805:101206.030 Zabbix agent item "system.cpu.util[,nice]" on host "mickey" failed: first network error, wait for 15 seconds
    27041:20240805:101207.821 resuming Zabbix agent checks on host "qc": connection restored

    Is this normal? Using zabbix_get on any of these keys/hosts always succeeds (even if tried within a second of seeing the failure in the log). None of the hosts involved (either the Zabbix server or the host being monitored) are heavily loaded (typically less than 1 load average).
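To see whether the failures cluster on particular hosts, the error lines can be tallied per host. The following is a minimal sketch, assuming the log format shown in the excerpt above; `FAILURE_RE` and `count_failures` are hypothetical names, not part of Zabbix, and the sample lines are copied from the post.

```python
import re
from collections import Counter

# Matches failure lines in the format quoted in the post, for example:
# 27041:20240805:101152.822 Zabbix agent item "..." on host "x5backup" failed: first network error, ...
FAILURE_RE = re.compile(
    r'^\s*\d+:\d{8}:\d{6}\.\d{3} Zabbix agent item "(?P<key>.+)" '
    r'on host "(?P<host>[^"]+)" failed: (?P<kind>first|another) network error'
)


def count_failures(lines):
    """Return a Counter mapping host name to number of network-error lines."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts


# Sample lines copied from the log excerpt above.
SAMPLE = [
    '27041:20240805:101152.822 Zabbix agent item "vfs.fs.size[/nisaba/micvna,used]" on host "x5backup" failed: first network error, wait for 15 seconds',
    '27041:20240805:101152.831 Zabbix agent item "system.cpu.util[,interrupt]" on host "qc" failed: first network error, wait for 15 seconds',
    '27041:20240805:101158.196 resuming Zabbix agent checks on host "proxmox01": connection restored',
    '27041:20240805:101200.023 Zabbix agent item "vfs.fs.size[/,free]" on host "micvna" failed: another network error, wait for 15 seconds',
    '27041:20240805:101204.031 Zabbix agent item "system.cpu.util[,user]" on host "proxmox01" failed: first network error, wait for 15 seconds',
]

for host, count in sorted(count_failures(SAMPLE).items()):
    print(host, count)
```

For a real run, the function can be fed an open file object for the server log (the posted configuration further down gives its path as /var/log/zabbix/zabbix_server.log).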

  • kamil1
    Member
    • Aug 2024
    • 40

    #2
    Hey there,
    I've got a few ideas that might help with the debugging:

    1. Capture traffic to check for packet loss and other irregularities by running the following command:
    tcpdump -npi any -s 0 -w /your_path/to/save/file.pcap host <agent_ip>

    Then check whether there is any packet loss (for example, using Wireshark).

    You can also execute the following to find out where packets are delayed or dropped:
    LC_ALL=C ping -s 1400 -c 100 -i 0.01 <agent_ip>

    2. Also, please increase DebugLevel to 4 in zabbix_agentd.conf and share any more informative log entries that appear.
    3. Share the parts of your `zabbix_server.conf` and `zabbix_agentd.conf` that relate to timeouts and proxies.

    Thanks,
    Kamil
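The summary line that `ping` prints at the end of a run can be checked automatically rather than by eye. The sketch below assumes the common English summary format, which is what `LC_ALL=C` in the command above forces; `SUMMARY_RE` and `parse_ping_summary` are hypothetical names, and the example string is illustrative only, not output from this thread.

```python
import re

# Summary line as printed by common ping implementations under LC_ALL=C.
# The optional "packets " covers variants that print "N packets received".
SUMMARY_RE = re.compile(
    r"(?P<sent>\d+) packets transmitted, (?P<received>\d+) (?:packets )?received"
    r".*?(?P<loss>\d+(?:\.\d+)?)% packet loss"
)


def parse_ping_summary(output):
    """Return (sent, received, loss_percent) from ping output, or None if absent."""
    match = SUMMARY_RE.search(output)
    if match is None:
        return None
    return int(match["sent"]), int(match["received"]), float(match["loss"])


# Illustrative summary text only; not a measurement from the thread.
EXAMPLE = "100 packets transmitted, 98 received, 2% packet loss, time 1089ms"
print(parse_ping_summary(EXAMPLE))
```

Returning `None` when no summary is found keeps a failed or interrupted ping run distinguishable from a run with 0% loss.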

    • ddrucker
      Member
      • Feb 2019
      • 35

      #3
      Code:
      LogFile=/var/log/zabbix/zabbix_server.log
      LogFileSize=0
      PidFile=/run/zabbix/zabbix_server.pid
      SocketDir=/run/zabbix
      DBName=zabbix
      DBUser=zabbix
      DBPassword=mypasswordhere
      SNMPTrapperFile=/var/log/snmptrap/snmptrap.log
      ValueCacheSize=32M
      Timeout=15
      FpingLocation=/usr/bin/fping
      Fping6Location=/usr/bin/fping6
      LogSlowQueries=3000
      StartProxyPollers=4
      StatsAllowedIP=127.0.0.1
      The above is my entire zabbix_server.conf.

      I've tested extensively between the Zabbix server and several monitored machines: no loss, or minimal loss (1 or 2 packets a minute), even under quite heavy traffic.

      I'm not sure where you want me to increase DebugLevel. I assume you mean on the agent being monitored, but that doesn't change anything in the logs on the server, and on the agent it had no visible effect: nothing was logged at the times the server was reporting a network error.

      • Markku
        Senior Member
        Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
        • Sep 2018
        • 1784

        #4
        I agree with kamil1 that if the application (Zabbix) says "network error", one of the best ways to proceed is to capture and analyze the actual application traffic at the time of the error message. Did you do that?

        Markku

        • ddrucker
          Member
          • Feb 2019
          • 35

          #5
          I don't see anything obviously wrong, but I don't really know how to read it. I started capturing, and stopped the capture immediately after I saw this in the logs:

          27041:20240809:092900.022 Zabbix agent item "system.cpu.util[,guest_nice]" on host "micvna" failed: first network error, wait for 15 seconds

          Can you help?
          BTW, a data point: the load average on my zabbix server is 0.01 over the last 15 minutes, and even if I disable all monitored hosts except one, that one still gives the above error every minute or two. So this isn't a "server is overloaded" thing!
          Attached Files
          Last edited by ddrucker; 09-08-2024, 15:43.

          • ddrucker
            Member
            • Feb 2019
            • 35

            #6
            Oh, I should make a longer capture - the request for system.cpu.util[,guest_nice] doesn't even appear in that pcap.

            • ddrucker
              Member
              • Feb 2019
              • 35

              #7
              Here's another capture, during which I observed:

              root@weathertop:/home/ddrucker# tail -f /var/log/zabbix/zabbix_server.log|grep micvna
              27041:20240809:101202.382 resuming Zabbix agent checks on host "micvna": connection restored
              27041:20240809:101220.103 Zabbix agent item "system.cpu.util[,system]" on host "micvna" failed: first network error, wait for 15 seconds
              27041:20240809:101235.055 resuming Zabbix agent checks on host "micvna": connection restored
              27041:20240809:101357.834 Zabbix agent item "system.cpu.load[percpu,avg1]" on host "micvna" failed: first network error, wait for 15 seconds
              Attached Files

              • Markku
                Senior Member
                Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                • Sep 2018
                • 1784

                #8
                OK, let's look at the second capture. This first:

                27041:20240809:101220.103 Zabbix agent item "system.cpu.util[,system]" on host "micvna" failed: first network error, wait for 15 seconds
                There is no "system.cpu.util[,system]" request visible in the capture until 10:12:35.

                27041:20240809:101235.055 resuming Zabbix agent checks on host "micvna": connection restored
                So that is correct in that sense.

                And this:

                27041:20240809:101357.834 Zabbix agent item "system.cpu.load[percpu,avg1]" on host "micvna" failed: first network error, wait for 15 seconds
                "system.cpu.load[percpu,avg1]" is requested at:
                - 10:12:17
                - 10:12:51
                so again the error above is kind of correct: no request for that item appears in the capture around the time of the error.

                The capture has been taken on the server side, right? What kind of network configuration do you have on the server, does it have several NICs? What does the routing table say ("ip route" usually)?

                Markku
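The comparison made in this post, checking whether a request for the failing item appears in the capture around the time of the logged error, can be sketched as follows. This is only an illustration: `parse_log_time` and `request_seen_near` are hypothetical helpers, the request times are the ones read from the capture above, and the 15-second window is an assumption chosen to match `Timeout=15` in the posted configuration. It also assumes the log and the capture use the same clock and time zone.

```python
from datetime import datetime, timedelta


def parse_log_time(line):
    """Parse the 'PID:YYYYMMDD:HHMMSS.mmm' prefix of a Zabbix server log line."""
    _pid, date_part, rest = line.strip().split(":", 2)
    time_part = rest.split(" ", 1)[0]
    return datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S.%f")


def request_seen_near(error_time, request_times, window_seconds=15):
    """True if any captured request falls within window_seconds before error_time.

    The default window is an assumption, not documented poller behaviour.
    """
    window = timedelta(seconds=window_seconds)
    return any(timedelta(0) <= error_time - t <= window for t in request_times)


error_line = (
    '27041:20240809:101357.834 Zabbix agent item '
    '"system.cpu.load[percpu,avg1]" on host "micvna" failed: '
    'first network error, wait for 15 seconds'
)
# Times at which the capture shows a request for this item key (from the post above).
requests = [datetime(2024, 8, 9, 10, 12, 17), datetime(2024, 8, 9, 10, 12, 51)]

print(request_seen_near(parse_log_time(error_line), requests))  # prints False
```

A `False` result here matches the observation in the post: the error was logged without a corresponding request appearing on the wire.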

                • ddrucker
                  Member
                  • Feb 2019
                  • 35

                  #9
                  Yes, this capture was taken on the zabbix server. It has a single NIC.

                  root@weathertop:/home/ddrucker# ip route
                  default via 172.29.158.1 dev ens18 onlink
                  172.29.158.0/24 dev ens18 proto kernel scope link src 172.29.158.193
                  root@weathertop:/home/ddrucker#

                  • Markku
                    Senior Member
                    Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                    • Sep 2018
                    • 1784

                    #10
                    Ok, so there is no apparent reason why single requests would be invisible in the supplied capture, so it looks like the Zabbix poller (the new async agent poller in the case of Zabbix 7.0) didn't actually send the requests out. (Unless you find some specific problem in the server platform itself.)

                    From that perspective there is nothing to look for on the agent side, but you could increase the Zabbix server DebugLevel to 4 to see if that gives more information. Or maybe increasing just the agent poller log level is enough (like sudo zabbix_server -R log_level_increase="agent poller"); you'll see.

                    Markku

                    • ddrucker
                      Member
                      • Feb 2019
                      • 35

                      #11
                      Aha! After increasing log level, I discovered there's a

                      27041:20240809:131016.823 cannot resolve DNS name: nodename nor servname provided, or not known

                      line for every failure. On the subsequent attempt, it contacts the host just fine. No idea why - especially since I supposedly have a locally-caching nameserver (systemd-resolved). But switching every host from DNS to IP fixed the problem.
                      But that shouldn't be needed. Is Zabbix not waiting long enough for DNS resolution? I would have thought gethostbyname is blocking.
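One way to check whether name resolution itself fails intermittently on the server is to resolve a host name repeatedly and record each attempt. The sketch below is a hedged illustration: `probe_resolution` is a hypothetical helper, and it uses Python's `socket.getaddrinfo`, which goes through the system resolver. That is not necessarily the same code path as the resolver inside the Zabbix 7.0 agent poller, so a clean run here would not rule out the problem described above. Port 10050 is the default Zabbix agent port.

```python
import socket
import time


def probe_resolution(hostname, attempts=20, delay=0.5,
                     resolve=socket.getaddrinfo, sleep=time.sleep):
    """Resolve hostname repeatedly.

    Returns a list of (elapsed_seconds, error_message_or_None), one per attempt.
    The resolve and sleep parameters exist so the loop can be exercised
    without touching the network.
    """
    results = []
    for attempt in range(attempts):
        start = time.monotonic()
        try:
            resolve(hostname, 10050)
            error = None
        except socket.gaierror as exc:
            error = str(exc)
        results.append((time.monotonic() - start, error))
        if attempt < attempts - 1:
            sleep(delay)
    return results


def summarize(results):
    """Return (failure_count, slowest_attempt_seconds) for a list of probe results."""
    failures = sum(1 for _elapsed, error in results if error is not None)
    slowest = max((elapsed for elapsed, _error in results), default=0.0)
    return failures, slowest
```

For example, `summarize(probe_resolution("micvna"))` would report how many of 20 lookups failed and how long the slowest one took; the host name is taken from the logs above and should be replaced with whatever name is configured on the host interface.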

                      • Markku
                        Senior Member
                        Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                        • Sep 2018
                        • 1784

                        #12
                        Related (new in 7.0.0): https://support.zabbix.com/browse/ZBXNEXT-8620 (Add async DNS resolver for HTTP, SNMP and Zabbix agent)

                        Markku

                        • ddrucker
                          Member
                          • Feb 2019
                          • 35

                          #13
                          Hmm. So what's the solution here? I'd rather not have everything by IP - it's not that things change so often, it just seems like a hack.
                          Yes, my upstream DNS servers are sometimes not instant - but that shouldn't matter, given that I'm using systemd-resolved and the cache is active!

                          • Markku
                            Senior Member
                            Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                            • Sep 2018
                            • 1784

                            #14
                            What does the systemd-resolved cache show? Are the relevant records cached appropriately?

                            Unfortunately I don't know how the libevent DNS resolution and the DNS resolution provided by systemd-resolved interact with each other. Can you try without systemd-resolved (even though it provides the caching for you) and see how that changes the issue?

                            Markku

                            • ddrucker
                              Member
                              • Feb 2019
                              • 35

                              #15
                              Yes, the relevant records were cached.

                              OK, so - first, I switched a bunch of my hosts back from IP to DNS. They immediately started to give the original error.
                              Then I disabled caching in systemd-resolved. No change in behavior.
                              Then I uninstalled systemd-resolved entirely, changing resolv.conf to point directly at my upstream nameservers. No change in behavior.
                              Finally I switched hosts back to IP, which resolved the errors.
