Ad Widget

Collapse

agents going into "Not Available" state, returning "ZBX_TCP_READ() failed"

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • tolland
    Junior Member
    • Apr 2008
    • 20

    #1

    agents going into "Not Available" state, returning "ZBX_TCP_READ() failed"

    Hi,

    A few times a week some of my zabbix agents (1.4.4 on RHEL4) got into "not available" mode, the agent is still running and the server has the following entries in the log;

    5597:20080819:223237 Timeout while answering request
    5609:20080819:223244 Timeout while answering request
    5609:20080819:223244 Get value from agent failed. Error: ZBX_TCP_READ() failed [Interrupted system call]
    5609:20080819:223244 Host [i-b3ad76da] will be checked after 60 seconds

    Restarting the agent seems to correct the problem. So whats the problem with the agent thats causing it to go into this state?

    Thanks,

    Toll
    Last edited by tolland; 21-08-2008, 17:22.
  • tolland
    Junior Member
    • Apr 2008
    • 20

    #2
    From reading the other posts on this message, I think this is due to me setting the data send timeout too low. (Was 3, now 10)

    Comment

    • tolland
      Junior Member
      • Apr 2008
      • 20

      #3
      So I have updated the "Timeout=30" parameter on the agent and restarted, and they come back up for a while, but they are still going back in the "ZBX_TCP_READ() failed [Interrupted system call] " state after a day or 2.

      The server is logging;
      1822:20080821:145428 Host [i-1fdd0576] will be checked after 60 seconds
      1822:20080821:145453 Timeout while answering request
      1822:20080821:145453 Get value from agent failed. Error: ZBX_TCP_READ() failed [Interrupted system call]

      and the agent logs with DebugLevel=4 look like;

      ]
      23337:20080821:143347 Before read
      23337:20080821:143348 In parse_list_of_checks() [ZBX_EOF
      ]
      23337:20080821:143348 In disable_all_metrics()
      23337:20080821:143348 Parsed [ZBX_EOF]
      23337:20080821:143348 In process_active_checks('194.149.xx.xx',10051)
      23337:20080821:143348 In get_min_nextcheck()
      23337:20080821:143348 Sleeping for 60 seconds
      23337:20080821:143448 In process_active_checks('194.149.xx.xx',10051)
      23337:20080821:143448 In get_min_nextcheck()
      23337:20080821:143448 Sleeping for 60 seconds
      23337:20080821:143548 In refresh_metrics('194.149.xx.xx',10051)
      23337:20080821:143548 get_active_checks('194.149.xx.xx',10051)
      23337:20080821:143548 Sending [ZBX_GET_ACTIVE_CHECKS
      ec2-67-202-35-146.compute-1.amazonaws.com
      ]
      23337:20080821:143548 Before read
      23337:20080821:143548 In parse_list_of_checks() [ZBX_EOF
      ]

      Comment

      • tolland
        Junior Member
        • Apr 2008
        • 20

        #4
        additionally I did a ethreal dump of the traffic between the server and agent, and on the connection from the server to the agent ie;

        server:xxxx -> agent:10050

        the server is sending ;
        "agent.ping"

        every minute or so, and the server sends a FIN, ACK and closes the connection after 10 seconds. The agent stats in the unavailable state.

        I should add to all of this, that the server is connecting to the agents over a WAN link.
        Last edited by tolland; 21-08-2008, 18:52.

        Comment

        • tolland
          Junior Member
          • Apr 2008
          • 20

          #5
          I looked at the traffic between an agent in the "ZBX_TCP_READ() failed" state and the server and its sending this;

          ZBXD.A.......ZBX_GET_ACTIVE_CHECKS
          ec2-xx-101-xxx-xxx.compute-1.amazonaws.com
          ZBX_EOF

          Comment

          • tolland
            Junior Member
            • Apr 2008
            • 20

            #6
            Originally posted by tolland
            additionally I did a ethreal dump of the traffic between the server and agent, and on the connection from the server to the agent ie;

            server:xxxx -> agent:10050

            the agent is sending ;
            "agent.ping"

            every minute or so, and the server closes the connection without replying. The agent stays in the unavailable state.
            Anyway, so I have compared this to the traffic on a working and available agent, and the traffic there , the server responds to the "agent.ping" sting with "ZBXD.........1"

            agent.ping
            ZBXD.........1

            So it looks like the problem is with the agent not responding to the "agent.ping" passive check?
            Last edited by tolland; 21-08-2008, 18:54.

            Comment

            • tolland
              Junior Member
              • Apr 2008
              • 20

              #7
              Any ideas on this one?

              Comment

              • Alexei
                Founder, CEO
                Zabbix Certified Trainer
                Zabbix Certified SpecialistZabbix Certified Professional
                • Sep 2004
                • 5654

                #8
                It looks very much like a timeout because of communication problems. Do a simple "telnet hostname 10050" to check if agent is accepting connections. There is something wrong here and it does not look like a ZABBIX fault.

                You may also check Server variable in agent configuration file and agent log file.
                Alexei Vladishev
                Creator of Zabbix, Product manager
                New York | Tokyo | Riga
                My Twitter

                Comment

                • tolland
                  Junior Member
                  • Apr 2008
                  • 20

                  #9
                  Originally posted by Alexei
                  It looks very much like a timeout because of communication problems. Do a simple "telnet hostname 10050" to check if agent is accepting connections. There is something wrong here and it does not look like a ZABBIX fault.

                  You may also check Server variable in agent configuration file and agent log file.
                  Ok, so here is my line in zabbix configuration -> hosts;

                  i-31d10958 ec2-67-202-35-160.compute-1.amazonaws.com 67.202.35.160 10050 Template_Linux, Template_App_Apache2.2 Monitored Not available ZBX_TCP_READ() failed [Interrupted system call]

                  So on the server where zabbix server is running

                  Code:
                  [root@dcdev01 ~]# ps -ef | grep zabbix
                  zabbix    1807     1  0 Aug21 ?        00:00:01 /usr/local/sbin/zabbix_server
                  zabbix    1808  1807  0 Aug21 ?        00:12:26 /usr/local/sbin/zabbix_server
                  ...
                  zabbix    1824  1807  0 Aug21 ?        00:00:00 /usr/local/sbin/zabbix_server
                  root      5736  5699  0 09:34 pts/0    00:00:00 grep zabbix
                  So from the zabbix server, I can connect to the zabbix agent;
                  Code:
                  [root@dcdev01 ~]# telnet ec2-67-202-35-160.compute-1.amazonaws.com 10050
                  Trying 67.202.35.160...
                  [COLOR="Red"]Connected to ec2-67-202-35-160.compute-1.amazonaws.com [/COLOR](67.202.35.160).
                  Escape character is '^]'.
                  
                  agent.ping
                  
                  test
                  
                  ^]
                  telnet>
                  On the client with the zabbix agent running;

                  Code:
                  $ ssh -i xxxxxxxxx [email protected]
                  CentOS release 5.2 (Final)
                  Kernel \r on an \m
                  
                  Last login: Tue Sep  2 09:49:29 2008 from xxx.xxx.xxx.xxx
                  [root@domU-12-31-38-00-6D-54 ~]# ps -ef | grep zabb
                  zabbix    2530     1  0 Aug15 ?        00:00:00 /usr/local/sbin/zabbix_agentd
                  zabbix    2536  2530  0 Aug15 ?        00:00:03 /usr/local/sbin/zabbix_agentd
                  zabbix    2537  2530  0 Aug15 ?        00:00:13 /usr/local/sbin/zabbix_agentd
                  zabbix    2538  2530  0 Aug15 ?        00:00:13 /usr/local/sbin/zabbix_agentd
                  zabbix    2539  2530  0 Aug15 ?        00:00:14 /usr/local/sbin/zabbix_agentd
                  zabbix    2540  2530  0 Aug15 ?        00:00:00 /usr/local/sbin/zabbix_agentd
                  root      5882  5851  0 09:43 ttyp0    00:00:00 grep zabb
                  When looking for the incoming connection I can see that there are loads of connections in the CLOSE_WAIT like this;


                  Code:
                  tcp       12      0 10.252.114.162:10050        xxx.xxx.xxx.xxx:52316         CLOSE_WAIT  
                  tcp       12      0 10.252.114.162:10050        xxx.xxx.xxx.xxx:55650         CLOSE_WAIT  
                  tcp       12      0 10.252.114.162:10050        xxx.xxx.xxx.xxx:33122         CLOSE_WAIT  
                  tcp       12      0 10.252.114.162:10050        xxx.xxx.xxx.xxx:49514         CLOSE_WAIT  
                  tcp       12      0 10.252.114.162:10050        xxx.xxx.xxx.xxx:53355         
                  ...
                  tcp       12      0 10.252.114.162:10050        xxx.xxx.xxx.xxx:54384         CLOSE_WAIT  
                  tcp       12      0 10.252.114.162:10050        xxx.xxx.xxx.xxx:41848         CLOSE_WAIT  
                  tcp       12      0 10.252.114.162:10050        xxx.xxx.xxx.xxx:50553         CLOSE_WAIT  
                  tcp       12      0 10.252.114.162:10050        xxx.xxx.xxx.xxx:39033         CLOSE_WAIT
                  from googling there are unresolved forum threads with CLOSE_WAIT connection status problems;




                  So from looking at the docs on CLOSE wait connections;
                  CLOSE_WAIT means that the local end of the connection has received
                  a FIN from the other end, but the OS is waiting for the program at the
                  local end to actually close its connection.
                  So the OS is waiting on zabbix to close the connection.

                  This still doesn't mean its not a local networking issues, but I would be interested on how I should debug this.

                  The zabbix_agentd.log file has this;
                  Code:
                   seconds
                    2540:20080830:175154 Timeout while answering request
                    2540:20080830:175154 Getting list of active checks failed. Will retry after 60
                   seconds
                    2540:20080831:165355 Timeout while answering request
                    2540:20080831:165355 Getting list of active checks failed. Will retry after 60
                   seconds
                    2540:20080831:175707 Timeout while answering request
                    2540:20080831:175707 Getting list of active checks failed. Will retry after 60
                   seconds
                    2540:20080831:185017 Timeout while answering request
                    2540:20080831:185017 Getting list of active checks failed. Will retry after 60
                   seconds
                    2540:20080902:003319 Timeout while answering request
                    2540:20080902:003319 Getting list of active checks failed. Will retry after 60
                   seconds
                  [root@domU-12-31-38-00-6D-54 zabbix]#
                  and the server has this

                  Code:
                  ZBX_TCP_READ() failed [Interrupted system call]
                    1822:20080902:100246 Host [i-31d10958] will be checked after 60 seconds
                    1822:20080902:100311 Timeout while answering request
                    1822:20080902:100311 Get value from agent failed. Error: ZBX_TCP_READ() failed [Interrupted system call]
                  additionally, the zabbix agentd.conf references the server by IP address eg;

                  Server=xxx.xxx.xxx.xxx

                  Comment

                  Working...