Agents going unavailable

  • satchelp
    Junior Member
    • Jul 2008
    • 13

    #1

    Agents going unavailable

    I have an interesting problem on Zabbix server and agent 1.4.6.

    I have two servers that randomly set themselves to Unreachable and stop recording new values.

    If I restart the agent, it collects data for a few hours and then goes unreachable again. I can see the zabbix_agentd process running, but I cannot telnet to the agent's default port from the Zabbix server.
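For anyone who wants to script the telnet test described above, here is a minimal Python sketch. It assumes the agent listens on TCP port 10050 (the Zabbix agent default); the function name `agent_port_open` and the timeout value are illustrative choices, not anything from the thread.

```python
import socket


def agent_port_open(host, port=10050, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    This only checks that something accepts the connection, much like a
    bare telnet test; it does not speak the Zabbix protocol.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run from the Zabbix server against each affected host, this should separate hosts where the listener no longer accepts connections from hosts where it accepts them but the checks still fail.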

    I have another server with similar symptoms, but there I am able to telnet to the agent and pull values. These servers are on the same subnet as the Zabbix server, so there are no firewalls in place that would prevent it from collecting data or communicating over the network. The server log shows that a generic "network error" occurred, but I am unsure what this means or why the host does not recover.

    Doing a netstat on the servers being monitored, I can see an extremely high number of connections back to the zabbix server in the CLOSE_WAIT state. One of them has 3 ESTABLISHED connections and 195 connections in the CLOSE_WAIT state to the Zabbix server.
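The per-state tally described above can be automated. The following is a rough Python sketch that counts TCP states for connections to one remote address. It assumes plain `netstat -an` output, where the last two columns are the foreign address and the state; the function name and the sample addresses are hypothetical, except 192.168.1.105, which is the server address that appears in the agent log later in the thread.

```python
from collections import Counter


def count_states(netstat_text, remote_ip):
    """Count TCP states for connections whose foreign address is remote_ip.

    Assumes `netstat -an` style lines: the foreign address is the
    second-to-last column and the state is the last column.
    """
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and non-TCP rows.
        if len(fields) < 4 or not fields[0].upper().startswith("TCP"):
            continue
        foreign, state = fields[-2], fields[-1]
        # Strip the port; rsplit keeps IPv6 addresses intact.
        if foreign.rsplit(":", 1)[0] == remote_ip:
            counts[state] += 1
    return counts


# Made-up sample input for illustration only.
SAMPLE = """\
  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:10050          0.0.0.0:0              LISTENING
  TCP    192.168.1.20:10050     192.168.1.105:40001    ESTABLISHED
  TCP    192.168.1.20:10050     192.168.1.105:40002    CLOSE_WAIT
"""
print(count_states(SAMPLE, "192.168.1.105"))
```

A CLOSE_WAIT count that keeps growing generally means the remote side has closed the connection while the local process has not yet closed its own socket, so it is worth tracking over time on the affected hosts.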

    I am monitoring roughly 50 servers that are nearly identical (hardware and software) to the ones having problems. Only these 3 are affected.

    Log:
    29015:20090515:003521 Host [Server1]: first network error, wait for 15 seconds
    29015:20090515:003521 Parameter [perf_counter["\Memory\Page Faults/sec"]] will be checked after 300 seconds on host [Server1]
    29026:20090515:005706 Host [Server1]: first network error, wait for 15 seconds
    29026:20090515:005706 Parameter [perf_counter[\System\File Write Bytes/sec]] will be checked after 300 seconds on host [Server1]
    29015:20090515:040009 Host [Server1]: first network error, wait for 15 seconds
    29057:20090515:040036 Host [Server1]: another network error, wait for 15 seconds
    29057:20090515:040101 Host [Server1] will be checked after 60 seconds

    Any ideas?
  • MrKen
    Senior Member
    • Oct 2008
    • 652

    #2
    Hi satchelp,

    In your zabbix_server.conf try increasing the timeout for UnreachableDelay and UnreachablePeriod, and maybe the UnavailableDelay.

    MrKen
    Disclaimer: All of the above is pure speculation.

    • satchelp
      Junior Member
      • Jul 2008
      • 13

      #3
      Agent still goes unavailable

      I tried increasing the timeout for UnreachableDelay and UnreachablePeriod, with no effect. I have also tried the experimental NoTimeWait option, as I saw a lot of sessions opened in the TIME_WAIT state between this host and the Zabbix server.

      I have tried different versions of the agent (from 1.6.4 all the way back to 1.4.1).


      The Zabbix agent on the server had not been responding since Friday, June 5th, and then today it randomly started working again.

      But there were two interesting errors in the logs. Specifically, the zabbix_agentd.log had the following:
      5104:20090608:121428 In disable_all_metrics()
      5104:20090608:121428 Parsed [ZBX_EOF]
      5104:20090608:121428 In process_active_checks('192.168.1.105',10051)
      5104:20090608:121428 In get_min_nextcheck()
      5104:20090608:121428 Sleeping for 60 seconds
      5104:20090608:121528 In process_active_checks('192.168.1.105',10051)
      5104:20090608:121528 In get_min_nextcheck()
      5104:20090608:121528 Sleeping for 60 seconds
      4536:20090608:121609 Process listener error: ZBX_TCP_READ() failed [An existing connection was forcibly closed by the remote host.


      And then the next check sort of worked and threw another error (I added some spaces so the items wouldn't have smiley faces in them...):
      4536:20090608:121609 Processing request.
      4536:20090608:121609 In check_security()
      4536:20090608:121609 Requested [perf_counter["\PhysicalDisk(12 M: )\Disk Write Bytes/sec"]]
      4536:20090608:121610 Sending back [434674.700343]
      4536:20090608:121610 Processing request.
      4536:20090608:121610 In check_security()
      4536:20090608:121610 Requested [perf_counter["\PhysicalDisk(12 M: )\Disk Read Bytes/sec"]]
      4536:20090608:121611 Sending back [456087.093847]
      4536:20090608:121611 Process listener error: ZBX_TCP_WRITE() failed [An established connection was aborted by the software in your host machine.


      Since these two errors the agent has been reporting correctly, but based on past experience it will likely stop sometime tonight. Has anyone seen anything like this?

      • tchjts1
        Senior Member
        • May 2008
        • 1605

        #4
        Do a Google search on those 2 errors. I think you will see it is not related to Zabbix... but is related to other apps on those hosts.

        One of those errors points to the use of McAfee virus scan closing down the ports.

        You happen to be running McAfee?
