first network error and connection restored

  • RZ_CloudComm
    Junior Member
    • Jul 2016
    • 2

    #1

    first network error and connection restored

    Hi

    We are using Zabbix 3.0.2 and have a problem with graphs: some data is missing, so the graphs have gaps. I found the following errors in /var/log/zabbix/zabbix_server.log:
    Code:
     20057:20160704:140732.221 Zabbix agent item "vfs.fs.size[/boot,free]" on host "be-srv-asterisk-ssw0" failed: first network error, wait for 30 seconds
     19943:20160704:140733.493 Zabbix agent item "system.cpu.util[,user]" on host "be-srv-asterisk-ssw9" failed: first network error, wait for 30 seconds
     20023:20160704:140741.284 resuming Zabbix agent checks on host "be-srv-asterisk-ssw4": connection restored
     20061:20160704:140743.265 resuming Zabbix agent checks on host "be-srv-asterisk-ssw3": connection restored
    This happens on all 17 hosts. I'm sure there is no network issue between the Zabbix server and the Zabbix agents (ping and traceroute show no problems).
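    To see which hosts and items are affected most, the log can be tallied with a short script. This is a minimal sketch in Python, assuming every line follows the "PID:YYYYMMDD:HHMMSS.mmm message" layout seen in the excerpt above; summarize() and the two regular expressions are illustrative names, not part of Zabbix.

```python
import re
from collections import Counter

# Message shapes copied from the log excerpt; anything else is ignored.
FAILED = re.compile(
    r'^\s*\d+:\d{8}:\d{6}\.\d{3} (?:Zabbix|SNMP) agent item "(?P<item>.+)" '
    r'on host "(?P<host>[^"]+)" failed: first network error'
)
RESTORED = re.compile(
    r'^\s*\d+:\d{8}:\d{6}\.\d{3} resuming (?:Zabbix|SNMP) agent checks '
    r'on host "(?P<host>[^"]+)": connection restored'
)


def summarize(lines):
    """Return (failures per host, failures per item key, restores per host)."""
    failed_hosts, failed_items, restored_hosts = Counter(), Counter(), Counter()
    for line in lines:
        m = FAILED.match(line)
        if m:
            failed_hosts[m["host"]] += 1
            failed_items[m["item"]] += 1
            continue
        m = RESTORED.match(line)
        if m:
            restored_hosts[m["host"]] += 1
    return failed_hosts, failed_items, restored_hosts


# The four lines quoted in this post, used as sample input.
SAMPLE = [
    ' 20057:20160704:140732.221 Zabbix agent item "vfs.fs.size[/boot,free]" on host "be-srv-asterisk-ssw0" failed: first network error, wait for 30 seconds',
    ' 19943:20160704:140733.493 Zabbix agent item "system.cpu.util[,user]" on host "be-srv-asterisk-ssw9" failed: first network error, wait for 30 seconds',
    ' 20023:20160704:140741.284 resuming Zabbix agent checks on host "be-srv-asterisk-ssw4": connection restored',
    ' 20061:20160704:140743.265 resuming Zabbix agent checks on host "be-srv-asterisk-ssw3": connection restored',
]

hosts, items, restored = summarize(SAMPLE)
for host, count in hosts.most_common():
    print(host, count)
```

    To run it on the real file, pass an open file handle for /var/log/zabbix/zabbix_server.log instead of SAMPLE.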

    We changed the advanced settings of the Zabbix server to the following, hoping it would help, but there was no improvement:
    Code:
    StartPollers=80
    StartPollersUnreachable=50
    StartTrappers=20
    StartPingers=10
    StartVMwareCollectors=1
    SNMPTrapperFile=/var/log/snmptt/snmptt.log
    StartDBSyncers=8
    UnreachablePeriod=55
    UnreachableDelay=30
    LogSlowQueries=1000
    Other advanced settings are unchanged (=default).

    The other zabbix server information:
    Code:
    Required server performance, new values per second	10.69
    
     free -m
                 total       used       free     shared    buffers     cached
    Mem:          2815       2664        151          1        175       1672
    -/+ buffers/cache:        816       1998
    Swap:         2047         14       2033
    This is a VM with 3 GB of RAM, 2 vCPUs (Xeon E5620, 2.4 GHz) and a 32 GB NFS hdd. The OS is CentOS.

    Do you have any idea how to fix this issue?

    Best regards,
    RZ
  • mortuletti
    Member
    • May 2016
    • 76

    #2
    Hi RZ!
    The problem may be related to the item check timeout.
    By default the timeout for Zabbix agent (passive) checks is 4 seconds on the server side and 3 seconds on the agent side.
    For some reason the hosts do not reply within that time.

    What types of checks do you have?
    Increase the log level and check the log file again (use the "zabbix_server -R log_level_increase" and "zabbix_server -R log_level_decrease" commands).

    To avoid gaps in the graphs, you can try switching the checks to Zabbix agent (active).

    If you do not plan to add more servers in the near future, then for 17 hosts and 10 NVPS I would recommend reducing the number of started processes. Recommended config (current value > suggested value):

    StartPollers=80 > 5-10
    StartPollersUnreachable=50 > 5-10
    StartTrappers=20 > 5-10
    StartPingers=10 > 2-3
    StartVMwareCollectors=1
    SNMPTrapperFile=/var/log/snmptt/snmptt.log
    StartDBSyncers=8 > 2 (each can manage up to 1000 NVPS)
    UnreachablePeriod=55
    UnreachableDelay=30
    LogSlowQueries=1000 > 3000
    Timeout=4

    So do not start too many processes. Each process checks the configuration for "something to do" and locks the caches for some time, which can cause delays as well.
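    The suggestion above can be checked against a config file with a few lines of Python. This is only a rough sketch: the limits are simply the upper ends of the ranges in this post (rules of thumb for about 17 hosts and 10 NVPS, not official Zabbix limits), and parse_conf() and oversized() are made-up helper names.

```python
# Upper ends of the ranges suggested in this post; not official limits.
SUGGESTED_MAX = {
    "StartPollers": 10,
    "StartPollersUnreachable": 10,
    "StartTrappers": 10,
    "StartPingers": 3,
    "StartDBSyncers": 2,
}


def parse_conf(text):
    """Parse 'Key=Value' lines, ignoring blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def oversized(settings):
    """Return {key: (configured, suggested_max)} for values above the suggestion."""
    result = {}
    for key, limit in SUGGESTED_MAX.items():
        if key in settings and int(settings[key]) > limit:
            result[key] = (int(settings[key]), limit)
    return result


# Settings as posted in #1.
CONF = """
StartPollers=80
StartPollersUnreachable=50
StartTrappers=20
StartPingers=10
StartVMwareCollectors=1
StartDBSyncers=8
"""

flagged = oversized(parse_conf(CONF))
for key, (current, limit) in flagged.items():
    print(f"{key}: {current} configured, {limit} or fewer suggested")
```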

    Server parameters look very good!

    Regards,
    Alexander

    • RZ_CloudComm
      Junior Member
      • Jul 2016
      • 2

      #3
      Thanks for your answer!

      I have increased the log level and I see the following:
      Code:
        1935:20160707:160850.003 End of get_value_agent():TIMEOUT_ERROR
        1935:20160707:160850.003 Item [be-srv-asterisk-ssw13:system.users.num] error: Get value from agent failed: ZBX_TCP_READ() timed out
        1935:20160707:160850.003 End of get_value():TIMEOUT_ERROR
        1935:20160707:160850.003 In deactivate_host() hostid:10141 itemid:26867 type:0
        1935:20160707:160850.003 query [txnlev:1] [begin;]
        1935:20160707:160850.004 query [txnlev:1] [update hosts set errors_from=1467900530,disable_until=1467900560 where hostid=10141]
        1935:20160707:160850.004 query [txnlev:1] [commit;]
        1929:20160707:160850.005 __zbx_zbx_setproctitle() title:'poller #19 [got 0 values in 0.003444 sec, getting values]'
        1929:20160707:160850.006 In get_values()
        1924:20160707:160850.006 __zbx_zbx_setproctitle() title:'poller #15 [got 0 values in 0.005806 sec, getting values]'
        1971:20160707:160850.006 __zbx_zbx_setproctitle() title:'poller #58 [got 0 values in 0.003543 sec, getting values]'
        1919:20160707:160850.006 __zbx_zbx_setproctitle() title:'poller #11 [got 0 values in 0.003499 sec, getting values]'
        1956:20160707:160850.006 __zbx_zbx_setproctitle() title:'poller #43 [got 0 values in 0.006320 sec, getting values]'
        1929:20160707:160850.006 In DCconfig_get_poller_items() poller_type:0
        1924:20160707:160850.006 In get_values()
        1971:20160707:160850.007 In get_values()
        1919:20160707:160850.007 In get_values()
        1956:20160707:160850.007 In get_values()
        1929:20160707:160850.007 End of DCconfig_get_poller_items():1
        1933:20160707:160850.007 __zbx_zbx_setproctitle() title:'poller #23 [got 1 values in 0.006334 sec, getting values]'
        1924:20160707:160850.007 In DCconfig_get_poller_items() poller_type:0
        1971:20160707:160850.007 In DCconfig_get_poller_items() poller_type:0
        1919:20160707:160850.007 In DCconfig_get_poller_items() poller_type:0
        1956:20160707:160850.007 In DCconfig_get_poller_items() poller_type:0
        1935:20160707:160850.007 Zabbix agent item "system.users.num" on host "be-srv-asterisk-ssw13" failed: first network error, wait for 30 seconds
      I changed one passive item check to active. We will see if it helps...
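      The UPDATE statement in the log shows how long the host is paused after the timeout. A small worked example in Python, using only the two epoch values from that line:

```python
from datetime import datetime, timezone

# Values copied from the "update hosts set ..." line in the debug log.
errors_from = 1467900530
disable_until = 1467900560

pause = disable_until - errors_from
failed_at = datetime.fromtimestamp(errors_from, tz=timezone.utc)

print(pause)                  # 30
print(failed_at.isoformat())  # 2016-07-07T14:08:50+00:00
```

      The 30-second pause matches UnreachableDelay=30 and the "wait for 30 seconds" message. The UTC time is two hours behind the 16:08:50 log timestamp, which would fit a server running on UTC+2 (an assumption; the thread does not state the time zone).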

      • DmitryL
        Senior Member
        Zabbix Certified Specialist, Zabbix Certified Professional
        • May 2016
        • 278

        #4
        Hello RZ_CloudComm,

        Increase the Timeout value in the zabbix_server and agent config files.
        Just play with the numbers... maybe start with 15 s.

        After that, restart the server and agent and check the server log for timeout errors.

        Best regards,
        Dmitry

        • mortuletti
          Member
          • May 2016
          • 76

          #5
          Hi RZ_CloudComm!
          Did the timeout change solve your case?
          Please keep in mind that the Timeout parameter on the zabbix_server side should be larger than on the zabbix_agent side.
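          That relationship is easy to verify once both config files are at hand. A minimal sketch in Python, assuming plain "Timeout=N" lines; read_timeout() and server_timeout_is_larger() are hypothetical helpers, and the fallback of 3 seconds is my understanding of the built-in default (a packaged config file may set a different value).

```python
def read_timeout(conf_text, default=3):
    """Return the last uncommented Timeout value, or `default` if none is set.

    The default of 3 is an assumption about the built-in value.
    """
    timeout = default
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "Timeout":
            timeout = int(value.strip())
    return timeout


def server_timeout_is_larger(server_conf, agent_conf):
    """True if the server waits longer than the agent is allowed to take."""
    return read_timeout(server_conf) > read_timeout(agent_conf)


# Example contents; in practice, read the server and agent config files.
server_conf = "LogSlowQueries=3000\nTimeout=15\n"
agent_conf = "# Timeout=3\nTimeout=10\n"

print(server_timeout_is_larger(server_conf, agent_conf))  # True
```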
          Br, Alexander

          • maaathieu59
            Junior Member
            • Jul 2016
            • 5

            #6
            Hi

            I have a strange issue that looks like this one, but with SNMP checks. After a short outage (~5 minutes), some of my items are not checked for an hour. It looks like this in the logs:

            1474:20160711:173540.239 SNMP agent item "traffic_in_eth0" on host "serveur" failed: first network error, wait for 15 seconds
            1541:20160711:183547.013 resuming SNMP agent checks on host "serveur": connection restored


            The config looks like this:

            ### Option: UnreachablePeriod
            # After how many seconds of unreachability treat a host as unavailable.
            #
            # Mandatory: no
            # Range: 1-3600
            # Default:
            UnreachablePeriod=900

            ### Option: UnavailableDelay
            # How often host is checked for availability during the unavailability period, in seconds.
            #
            # Mandatory: no
            # Range: 1-3600
            # Default:
            UnavailableDelay=60

            ### Option: UnreachableDelay
            # How often host is checked for availability during the unreachability period, in seconds.
            #
            # Mandatory: no
            # Range: 1-3600
            # Default:
            UnreachableDelay=15


            I don't understand why it takes so long to resume the checks. Any clue?
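            For reference, the exact gap between the two log lines above can be computed from their timestamps. A short Python sketch, assuming the "PID:YYYYMMDD:HHMMSS.mmm message" layout; log_time() is an illustrative helper name.

```python
from datetime import datetime


def log_time(line):
    """Extract the timestamp from a 'PID:YYYYMMDD:HHMMSS.mmm message' line."""
    _pid, rest = line.strip().split(":", 1)
    stamp = rest.split(" ", 1)[0]
    return datetime.strptime(stamp, "%Y%m%d:%H%M%S.%f")


failed = ('1474:20160711:173540.239 SNMP agent item "traffic_in_eth0" '
          'on host "serveur" failed: first network error, wait for 15 seconds')
restored = ('1541:20160711:183547.013 resuming SNMP agent checks '
            'on host "serveur": connection restored')

gap = log_time(restored) - log_time(failed)
print(gap.total_seconds())  # 3606.774
```

            So the checks resumed just over an hour after the first error, far longer than either UnreachableDelay=15 or UnavailableDelay=60.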

            • mortuletti
              Member
              • May 2016
              • 76

              #7
              Hi!
              Quote:
              I don't understand why it's so long to resume the checks
              The defaults are:
              UnreachablePeriod=45
              UnavailableDelay=60
              UnreachableDelay=15
              The idea is to offload the Zabbix server when devices are unavailable for some time or have been removed.
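              How the three settings interact can be illustrated with a small simulation. This is a simplified model in Python, not the server's actual code: it assumes every recheck fails instantly (real checks also wait for the timeout), and the "<=" comparison at the UnreachablePeriod boundary is an assumption. retry_schedule() is a made-up name.

```python
def retry_schedule(unreachable_period, unreachable_delay, unavailable_delay, horizon):
    """Seconds after the first failure at which the host is rechecked.

    Simplified model: rechecks every `unreachable_delay` seconds while the
    host has been failing for at most `unreachable_period` seconds, then
    every `unavailable_delay` seconds. Stops at `horizon` seconds.
    """
    times = []
    t = 0
    while True:
        # Boundary handling (<=) is an assumption of this sketch.
        delay = unreachable_delay if t <= unreachable_period else unavailable_delay
        t += delay
        if t > horizon:
            break
        times.append(t)
    return times


# Default values quoted in this post.
print(retry_schedule(45, 15, 60, 200))  # [15, 30, 45, 60, 120, 180]
```

              With UnreachablePeriod=900 and UnreachableDelay=15, the same model gives 60 fast rechecks in the first 15 minutes before the slower UnavailableDelay interval takes over.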

              For SNMP devices, it is a good idea to review the items and extend the check interval for metrics like device name, alias, location, etc. to 24 hours or more.

              Also, try switching off "Use bulk requests" on the SNMP interfaces of the problematic hosts.

              Br, Alexander

              • filipi_saci
                Junior Member
                • Apr 2021
                • 5

                #8
                Increasing the timeout worked for me!

                Thanks a lot!

                • Fahad
                  Junior Member
                  • Dec 2025
                  • 1

                  #9

                  Hi All,

                  I have the same issue. I have increased the timeout to 30 but still get the same errors. My logs look like the following. If anyone can help, please do.

                  [root@prod-zabbix-server ~]# tail -f /var/log/zabbix/zabbix_server.log
                  258638:20251207:170702.478 Zabbix agent item "system.run[grep -Ei '{$RESOLVERS}' /etc/resolv.conf | wc -l]" on host "440300LVAPP001" failed: first network error, wait for 15 seconds
                  258638:20251207:170705.478 Zabbix agent item "agent.ping" on host "1927407LVWJA0018" failed: first network error, wait for 15 seconds
                  258638:20251207:170705.489 resuming Zabbix agent checks on host "1110161WVSQL004": connection restored
                  258638:20251207:170709.480 resuming Zabbix agent checks on host "1927407LVGlobal001": connection restored
                  258638:20251207:170716.481 resuming Zabbix agent checks on host "254220WVSQL006": connection restored
                  258638:20251207:170717.552 resuming Zabbix agent checks on host "440300LVAPP001": connection restored
                  258638:20251207:170720.479 resuming Zabbix agent checks on host "1927407LVWJA0018": connection restored
                  258638:20251207:170726.478 Zabbix agent item "service.info[EVault InfoStage BUAgent,state]" on host "835951WPAPP003" failed: first network error, wait for 15 seconds
                  258638:20251207:170741.478 Zabbix agent item "net.if.in[eth1]" on host "1927407LVWJA0094" failed: first network error, wait for 15 seconds
                  258638:20251207:170741.541 resuming Zabbix agent checks on host "835951WPAPP003": connection restored
                  258638:20251207:170742.477 Zabbix agent item "service.info[Schedule,state]" on host "1110161WVSQL003" failed: first network error, wait for 15 seconds
                  258638:20251207:170744.477 Zabbix agent item "net.if.in[int2]" on host "182323LPWEB001" failed: first network error, wait for 15 seconds
                  258638:20251207:170747.478 Zabbix agent item "agent.ping" on host "182323LPAPP001" failed: first network error, wait for 15 seconds
                  258638:20251207:170751.477 Zabbix agent item "agent.ping" on host "1927407LVWJA0012" failed: first network error, wait for 15 seconds
                  258638:20251207:170753.477 Zabbix agent item "mh.lsi.raid.discovery.controllers" on host "1927407LPGATE001" failed: first network error, wait for 15 seconds
                  258638:20251207:170754.478 Zabbix agent item "net.if.out[eth0]" on host "19274074VGLOBAL005" failed: first network error, wait for 15 seconds
                  258638:20251207:170755.477 Zabbix agent item "evault[Discovery]" on host "1927407LVWJA0046" failed: first network error, wait for 15 seconds
                  258638:20251207:170756.479 resuming Zabbix agent checks on host "1927407LVWJA0094": connection restored
                  258638:20251207:170757.477 Zabbix agent item "evault[Discovery]" on host "1927407LVWJA0087" failed: first network error, wait for 15 seconds
                  258638:20251207:170757.490 resuming Zabbix agent checks on host "1110161WVSQL003": connection restored
                  258638:20251207:170758.478 Zabbix agent item "service.info[MSSQLSERVER,startup]" on host "219163WPSQL007" failed: first network error, wait for 15 seconds
                  258638:20251207:170759.480 resuming Zabbix agent checks on host "182323LPWEB001": connection restored
                  258638:20251207:170802.478 resuming Zabbix agent checks on host "182323LPAPP001": connection restored
                  258638:20251207:170803.477 Zabbix agent item "net.if.out[eth1]" on host "1927407LVWJA0027" failed: first network error, wait for 15 seconds
                  258638:20251207:170806.478 Zabbix agent item "agent.ping" on host "1927407LVWJA0040" failed: first network error, wait for 15 seconds
