Many thousands of TIME_WAIT connections on the agent side

  • mbrijun
    Member
    • Mar 2006
    • 63

    #1

    Many thousands of TIME_WAIT connections on the agent side

    Hello,

    Several of my Windows 2008 servers started failing the "agent.ping" check. Running "netstat -an" on them shows literally thousands of TCP connections in the "TIME_WAIT" state.

    Since my templates use passive checks, I would expect to see only a small number of connections, but in my case there are thousands. My problem is that the servers cannot easily be restarted: downtime needs to be arranged, etc.

    Are there any short-term solutions? The agent version is 1.6.6. I have tried stopping the agent, but the connections seem to linger indefinitely. Would upgrading the agent and starting it up remove these stale connections?

    Thank you.
    Last edited by mbrijun; 14-08-2012, 13:15.
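
To track how bad the situation is before and after any fix, the "netstat -an" output can be tallied by connection state. The following is only a minimal sketch: the function name `count_tcp_states` and the sample text are made up for illustration, and it assumes the state is the last column of each TCP row, as in the usual Windows and Linux layouts.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP connections per state in the text printed by `netstat -an`."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows start with "TCP" (Windows) or "tcp"/"tcp6" (Linux);
        # the connection state is assumed to be the last column.
        if len(fields) >= 4 and fields[0].lower().startswith("tcp"):
            states[fields[-1]] += 1
    return states


# Hypothetical sample output, invented for illustration only.
SAMPLE = """\
  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:10050          0.0.0.0:0              LISTENING
  TCP    192.168.1.10:10050     192.168.1.1:40001      TIME_WAIT
  TCP    192.168.1.10:10050     192.168.1.1:40002      TIME_WAIT
  UDP    0.0.0.0:123            *:*
"""

if __name__ == "__main__":
    print(count_tcp_states(SAMPLE))
```

On a real host, the text would come from saving the "netstat -an" output to a file and passing its contents to the function.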
  • mbrijun
    Member
    • Mar 2006
    • 63

    #2
    RESOLVED: Many thousands of TIME_WAIT connections on the agent side

    I have managed to resolve the problem by running the following script from the Zabbix server. It takes 3 parameters: the IP address of the remote server that is suffering from the large number of TIME_WAIT connections, the start of the source port range, and the end of the source port range. For example:

    ./time_wait_cure.py 192.168.1.10 32768 61000

    You will have to run the script several times; it will run faster each time. Please restart the Zabbix agent before each run.

    Code:
    #!/usr/bin/env python
    
    import socket
    import sys
    
    # Default port the Zabbix agent listens on.
    REMOTE_PORT = 10050
    
    def main():
        remote_host = sys.argv[1]
        # Try each local source port in turn; range() excludes the end port itself.
        for local_port in range(int(sys.argv[2]), int(sys.argv[3])):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.settimeout(0.05)
            print('localhost:%s  --> %s:%s' % (local_port, remote_host, REMOTE_PORT))
            try:
                try:
                    s.bind(('', local_port))
                except socket.error:
                    print('    could not bind to local address')
                    continue
                try:
                    s.connect((remote_host, REMOTE_PORT))
                except socket.error:
                    print('    could not connect')
                else:
                    print('    connected OK, closing')
            finally:
                # Always release the socket, even when bind or connect fails.
                s.close()
    
    
    if __name__ == '__main__':
        main()
    Last edited by mbrijun; 14-08-2012, 13:27.

    • arnvid
      Junior Member
      • Aug 2012
      • 2

      #3
      We have a similar issue with some of our Win2003 and Win2008 servers: several hundred TIME_WAIT socket pairs are listed.

      It is not a huge problem in our case, but upgrading to the 2.0.0 client brought the number of TIME_WAITs down to around 20-25, so it might be worth looking into.

      Our previous version was 1.8.8

      • chojin
        Member
        Zabbix Certified Specialist
        • Jul 2011
        • 64

        #4
        We have a similar problem on some Win 2008 servers.
        Thousands of TIME_WAIT connections consume all of the roughly 65000 available ports on the foreign host. As a result, the Zabbix agent no longer responds to queries.

        We are using Zabbix agent 2.0.0
        Last edited by chojin; 21-09-2012, 16:30.

        • mbrijun
          Member
          • Mar 2006
          • 63

          #5
          I am starting to think the only real solution is to switch from passive checks to active checks. That way the Zabbix server will not "bombard" the client with multiple requests. Instead, the client will call home and submit the data.
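
To put the "bombard" remark in rough numbers: if every passive check opens its own short-lived TCP connection, the number of sockets sitting in TIME_WAIT at steady state is about the connection rate multiplied by how long a socket stays in TIME_WAIT. The sketch below rests on assumptions that are not from this thread: one connection per check, the agent side holding the TIME_WAIT, and purely hypothetical figures for the item count, polling interval and TIME_WAIT period. The function name `expected_time_wait` is also made up.

```python
def expected_time_wait(items, interval_seconds, time_wait_seconds):
    """Estimate sockets lingering in TIME_WAIT at steady state.

    Assumes one short-lived TCP connection per passive check and that
    each closed connection stays in TIME_WAIT for time_wait_seconds.
    """
    connections_per_second = items / float(interval_seconds)
    return connections_per_second * time_wait_seconds


if __name__ == "__main__":
    # Hypothetical load: 60 passive items, each polled every 60 s,
    # with an assumed 240 s TIME_WAIT period.
    print(expected_time_wait(60, 60, 240))
```

An estimate like this only covers connections that expire normally. Connections that never leave TIME_WAIT keep accumulating regardless of the polling rate.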

          • chojin
            Member
            Zabbix Certified Specialist
            • Jul 2011
            • 64

            #6
            This looks like a common problem on Windows 2008 or Windows 7 once the system has been running for longer than 497 days.
            The problem is described here: http://support.microsoft.com/kb/2553549
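
As a side note on the 497-day figure: it is close to the time a 32-bit counter needs to wrap around when it advances once every 10 ms. Whether that is the actual mechanism behind this particular issue is an assumption here, and the linked article is the authoritative source. The arithmetic can be sketched as follows, with the tick length and counter width labeled as assumptions:

```python
TICK_SECONDS = 0.01              # assumed tick length (10 ms); not stated in the thread
COUNTER_BITS = 32                # assumed counter width; not stated in the thread
SECONDS_PER_DAY = 24 * 60 * 60


def days_until_wrap(bits=COUNTER_BITS, tick_seconds=TICK_SECONDS):
    """Return how many days pass before a counter of this width wraps around."""
    return (2 ** bits) * tick_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    print("%.1f days" % days_until_wrap())
```

Comparing a server's uptime with this threshold is a quick way to judge whether it could be affected.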
