Zabbix clients become unreachable

  • cbidwell
    Senior Member
    • Aug 2006
    • 127

    #1

    Hi all,

    I've got two clients (v1.4.4) that are on separate networks from the server (v1.4.4) and behind firewalls, which have tcp/10050 and tcp/10051 open accordingly. After a period of time, these two machines just lose communication. I don't think it's a firewall issue: once I restart the zabbix_agentd service, communication is restored.

    The server can still telnet to tcp/10050 on the client side. The client still has zabbix_agentd running:

    zabbix 24885 1 0 10:32 ? 00:00:00 /usr/local/bin/zabbix_agentd
    zabbix 24887 24885 0 10:32 ? 00:00:00 /usr/local/bin/zabbix_agentd
    zabbix 24888 24885 0 10:32 ? 00:00:00 /usr/local/bin/zabbix_agentd
    zabbix 24889 24885 0 10:32 ? 00:00:00 /usr/local/bin/zabbix_agentd
    zabbix 24890 24885 0 10:32 ? 00:00:00 /usr/local/bin/zabbix_agentd
    zabbix 24891 24885 0 10:32 ? 00:00:00 /usr/local/bin/zabbix_agentd
    root 26167 24843 0 11:02 pts/0 00:00:00 grep zabbix

    This is what's in my zabbix_server.log:

    17154:20071228:180739 Parameter [proc.num[zabbix_server]] will be checked after 240 seconds on host [client1]
    17156:20071228:180758 Timeout while answering request
    17155:20071228:180800 Timeout while answering request
    17156:20071228:180810 Get value from agent failed. Error: ZBX_TCP_READ() failed [Connection reset by peer]
    17153:20071228:180835 Timeout while answering request
    17157:20071228:180846 Get value from agent failed. Error: ZBX_TCP_READ() failed [Connection reset by peer]
    17157:20071228:180846 Host [client2]: first network error, wait for 15 seconds
    17157:20071228:180846 Parameter [vfs.fs.inode[/tmp,pfree]] will be checked after 120 seconds on host [client2]
    17156:20071228:180900 Timeout while answering request
    17154:20071228:180902 Timeout while answering request
    17155:20071228:180904 Timeout while answering request
    17155:20071228:180904 Get value from agent failed. Error: ZBX_TCP_READ() failed [Interrupted system call]
    17155:20071228:180904 Host [client2]: first network error, wait for 15 seconds
    17155:20071228:180904 Parameter [net.if.in[eth0,bytes]] will be checked after 20 seconds on host [client2]
    17154:20071228:180916 Timeout while answering request
    17154:20071228:180926 Timeout while answering request
    17153:20071228:180942 Timeout while answering request
    17155:20071228:180954 Timeout while answering request
    17154:20071228:181020 Timeout while answering request
    17154:20071228:181020 Get value from agent failed. Error: ZBX_TCP_READ() failed [Interrupted system call]
    17154:20071228:181020 Host [client2]: first network error, wait for 15 seconds
    17154:20071228:181020 Parameter [vfs.fs.inode[/opt,pfree]] will be checked after 120 seconds on host [client2]
    17156:20071228:181024 Timeout while answering request
    17156:20071228:181024 Get value from agent failed. Error: ZBX_TCP_READ() failed [Interrupted system call]
    17156:20071228:181024 Host [client1]: first network error, wait for 15 seconds
    17156:20071228:181024 Parameter [vfs.fs.size[/,pused]] will be checked after 120 seconds on host [client1]
    17157:20071228:181058 Timeout while answering request
    17157:20071228:181107 Timeout while answering request


    I was thinking of disabling all of my hosts except these two clients for a period of time, raising the debug level to 4, and seeing what is produced.

    Any recommendations?

    Thanks,
    Chris
    Last edited by cbidwell; 28-12-2007, 22:38.
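
Before raising the debug level, it may help to tally which hosts the server is actually flagging. The following is a minimal sketch in Python, assuming the server log keeps the layout shown in the excerpt above (`PID:YYYYMMDD:HHMMSS message`). The name `count_network_errors` is hypothetical and is not part of Zabbix.

```python
import re
from collections import Counter

# Matches server log lines such as:
#   17157:20071228:180846 Host [client2]: first network error, wait for 15 seconds
# The layout is assumed from the log excerpt in this thread; other versions may differ.
NETWORK_ERROR = re.compile(
    r"^\s*\d+:\d{8}:\d{6} Host \[(?P<host>[^\]]+)\]: \w+ network error"
)


def count_network_errors(lines):
    """Return a Counter mapping host name to its number of 'network error' lines."""
    counts = Counter()
    for line in lines:
        match = NETWORK_ERROR.match(line)
        if match:
            counts[match.group("host")] += 1
    return counts


if __name__ == "__main__":
    # Sample lines copied from the log excerpt above.
    sample = [
        "17157:20071228:180846 Host [client2]: first network error, wait for 15 seconds",
        "17155:20071228:180904 Host [client2]: first network error, wait for 15 seconds",
        "17156:20071228:181024 Host [client1]: first network error, wait for 15 seconds",
    ]
    for host, total in count_network_errors(sample).most_common():
        print(host, total)
```

To run it against a real log, pass an open file object instead of the sample list; the function only needs an iterable of lines.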
  • radamand
    Member
    • Aug 2008
    • 89

    #2
    I am having the exact same problem. I have about 60 hosts defined, but about 7 or 8 of them will occasionally fail with "ZBX_TCP_READ() failed [Interrupted system call]". When I log into the host, I find that agentd is running just fine, just not responding. The host's agent log file only shows:

    26444:20080901:155528 Timeout while answering request
    26444:20080901:155528 Getting list of active checks failed. Will retry after 60 seconds
    26444:20080901:215150 Timeout while answering request
    26444:20080901:215150 Getting list of active checks failed. Will retry after 60 seconds
    26444:20080902:161438 Getting list of active checks failed. Will retry after 60 seconds
    26444:20080902:161538 Getting list of active checks failed. Will retry after 60 seconds

    The server log file shows:

    15539:20080904:195204 Timeout while answering request
    15539:20080904:195204 Get value from agent failed. Error: ZBX_TCP_READ() failed [Interrupted system call]
    15539:20080904:195204 Host [pltncavsm08] will be checked after 60 seconds


    If I restart the host's agentd, it starts communicating again for a few hours or maybe days before it fails again...
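
One way to tell "port is open" apart from "agent actually answers" is to send a single passive check by hand. The following is a minimal sketch in Python, under several assumptions: the agent accepts a plain-text item key terminated by a newline, it supports the `agent.ping` key, and it replies with the `ZBXD\x01` header followed by an 8-byte little-endian length. `probe_agent` and `parse_response` are hypothetical helper names. The agent only answers hosts listed in its `Server=` setting, so this would need to run from the Zabbix server.

```python
import socket
import struct

HEADER = b"ZBXD\x01"  # protocol signature plus version byte (assumed reply format)


def parse_response(raw: bytes) -> str:
    """Decode an agent reply, with or without the ZBXD header."""
    if raw.startswith(HEADER) and len(raw) >= 13:
        # An 8-byte little-endian payload length follows the 5-byte header.
        (length,) = struct.unpack("<Q", raw[5:13])
        raw = raw[13:13 + length]
    return raw.decode("utf-8", errors="replace").strip()


def probe_agent(host: str, key: str = "agent.ping",
                port: int = 10050, timeout: float = 5.0) -> str:
    """Send one passive-check request and return the agent's answer.

    Raises socket.timeout (or another OSError) when the agent accepts the
    connection but never replies, which is the symptom described above.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(key.encode("utf-8") + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the agent closes the connection after replying
                break
            chunks.append(data)
    return parse_response(b"".join(chunks))
```

Calling `probe_agent("client1")` in a loop and logging the first timeout would give a timestamp to compare against the agent log.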

    • Antras
      Junior Member
      • Oct 2007
      • 12

      #3
      Did you change the StartAgents parameter in zabbix_agentd.conf?
      I had the same problem when StartAgents was 2. When I set it to 5 (default), everything began to work.

      • radamand
        Member
        • Aug 2008
        • 89

        #4
        Since 10 of my ~60 hosts have this problem now, I have lots of room to experiment...

        I have tried one with only 5 agents, one with 15: no difference.

        One with a 10-second timeout, one with a 30-second timeout: no difference.

        One with lots of data points, one with only 2: no difference.

        One with active agent, one with passive: no difference.

        • Daniel Carnevalli
          Junior Member
          • Nov 2008
          • 14

          #5
          Hi guys,

          I have the same problem, and I don't know how to resolve it. Has anybody resolved it?
          If so, can you let me know how?

          Thanks, regards

          • jsosic
            Member
            • Apr 2008
            • 47

            #6
            same problem here too...

            • Fabio_BR
              Junior Member
              • Oct 2009
              • 1

              #7
              I solved my problem by editing my zabbix agentd.conf and server.conf.

              AGENTD.CONF
              Uncomment

              # IP address to bind agent
              # If missing, bind to all available IPs
              ListenIP=127.0.0.1

              SERVER.CONF
              Uncomment

              # Number of pre-forked instances of pollers
              # Default value is 5
              # This parameter must be between 0 and 255
              StartPollers=5

              # Number of pre-forked instances of trappers
              # Default value is 5
              # This parameter must be between 0 and 255
              StartTrappers=5

              # Location of fping. Default is /usr/sbin/fping
              # Make sure that fping binary has root permissions and SUID flag set
              FpingLocation=/usr/sbin/fping

              Also verify the file host.conf:

              order bind, hosts

              Best regards
              Fabio
              Last edited by Fabio_BR; 02-10-2009, 12:11.

              • windsurf51
                Junior Member
                • Sep 2009
                • 20

                #8
                I don't understand your solution (for the agentd part).

                I have the same problem. Some checks are OK, but others produce this kind of error, and the server becomes unreachable for 10 minutes.

                In Zabbix Server logs:

                991258:20091031:002544 Item [XXXX:check[YYYYY]] error: Get value from agent failed: ZBX_TCP_READ() failed [Interrupted system call]
                991258:20091031:002544 Host [XXXX]: first network error, wait for 15 seconds
                991258:20091031:002544 Parameter [check[YYYYY]] will be checked after 300 seconds on host [XXXX]

                It has a serious impact on the main queue and other servers.

                • bashman
                  Senior Member
                  • Dec 2009
                  • 432

                  #9
                  I have the same problem:

                  Code:
                  first network error, wait for 15 seconds
                  another network error, wait for 15 seconds
                  error: Get value from agent failed: Cannot connect to ... [Interrupted system call]
                  My zabbix_agentd.conf timeout:

                  Code:
                  Timeout=3
                  My zabbix_server.conf:

                  Code:
                  ############ ADVANCED PARAMETERS ################
                  
                  ### Option: StartPollers
                  #       Number of pre-forked instances of pollers.
                  #       You shouldn't run more than 30 pollers normally.
                  #
                  # Mandatory: no
                  # Range: 0-255
                  # Default:
                  StartPollers=25
                  
                  ### Option: StartIPMIPollers
                  #       Number of pre-forked instances of IPMI pollers.
                  #
                  # Mandatory: no
                  # Range: 0-255
                  # Default:
                  StartIPMIPollers=5
                  
                  ### Option: StartPollersUnreachable
                  #       Number of pre-forked instances of pollers for unreachable hosts.
                  #
                  # Mandatory: no
                  # Range: 0-255
                  # Default:
                  StartPollersUnreachable=1
                  
                  ### Option: StartTrappers
                  #       Number of pre-forked instances of trappers
                  #
                  # Mandatory: no
                  # Range: 0-255
                  # Default:
                  StartTrappers=5
                  
                  ### Option: StartPingers
                  #       Number of pre-forked instances of ICMP pingers.
                  #
                  # Mandatory: no
                  # Range: 0-255
                  # Default:
                  StartPingers=1
                  
                  ### Option: StartDiscoverers
                  #       Number of pre-forked instances of discoverers.
                  #
                  # Mandatory: no
                  # Range: 0-255
                  # Default:
                  StartDiscoverers=1
                  
                  ### Option: StartHTTPPollers
                  #       Number of pre-forked instances of HTTP pollers.
                  #
                  # Mandatory: no
                  # Range: 0-255
                  # Default:
                  StartHTTPPollers=1
                  
                  ### Option: Timeout
                  #       Specifies how long we wait for agent, SNMP device or external check (in seconds).
                  #
                  # Mandatory: no
                  # Range: 1-30
                  # Default:
                  Timeout=3
                  
                  ### Option: TrapperTimeout
                  #       Specifies how many seconds trapper may spend processing new data.
                  #
                  # Mandatory: no
                  # Range: 1-300
                  # Default:
                  TrapperTimeout=300
                  Would increasing Timeout solve this problem?
                  978 Hosts / 16.901 Items / 8.703 Triggers / 44 usr / 90,59 nvps / v1.8.15

                  • bashman
                    Senior Member
                    • Dec 2009
                    • 432

                    #10
                    I increased Timeout in zabbix_server.conf to 5:

                    Code:
                    ### Option: Timeout
                    #       Specifies how long we wait for agent, SNMP device or external check (in seconds).
                    #
                    # Mandatory: no
                    # Range: 1-30
                    # Default:
                    Timeout=5
                    But the problem still remains.

                    • bashman
                      Senior Member
                      • Dec 2009
                      • 432

                      #11
                      I tried a 30-second timeout and the problem was resolved.
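
When changing Timeout, it can be worth confirming that the value the server will read is the one intended and that it stays inside the 1-30 range quoted in the config comments above. The following is a minimal sketch in Python, assuming the plain `Key=Value` layout with `#` comments shown in this thread. `read_settings` and `check_timeout` are hypothetical helper names and are not part of Zabbix.

```python
def read_settings(text):
    """Parse 'Key=Value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without '='
            settings[key.strip()] = value.strip()
    return settings


def check_timeout(settings, low=1, high=30):
    """Return Timeout as an int, or raise ValueError if it is missing or out of range."""
    if "Timeout" not in settings:
        raise ValueError("Timeout is not set explicitly")
    value = int(settings["Timeout"])
    if not low <= value <= high:
        raise ValueError(f"Timeout={value} is outside {low}-{high}")
    return value


if __name__ == "__main__":
    # Fragment modelled on the zabbix_server.conf excerpt quoted earlier.
    sample = """
    ### Option: Timeout
    # Range: 1-30
    Timeout=30
    """
    print(check_timeout(read_settings(sample)))
```

The agent has its own Timeout setting in its own config file, so the same check could be run against both files.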
