Zabbix Agents problem with windows 2008

  • sstitdepartment
    Junior Member
    • Mar 2013
    • 2

    #1

    Zabbix Agents problem with windows 2008

    Hello,

    We have a brand new Zabbix server 2.0.5 running on CentOS 6.3, and we currently are only monitoring 2 servers.

    The Zabbix agent has version 2.0.4.31980 and both hosts are up and operational.

    The Zabbix server is reporting that both machines are unreachable and when I do a netstat on port 10050 I am seeing that the connections are not closing.

    Zabbix is the only service configured to use port 10050.

    Not sure what the problem is and not sure if this is a known problem?

    Thanks,
    Vinnie
  • tchjts1
    Senior Member
    • May 2008
    • 1605

    #2
    Is the Zabbix service up and running on those servers?

    What is your zabbix_agentd.log on those servers telling you?


    • clahti
      Senior Member
      • Jan 2007
      • 126

      #3
      More information

      We originally had 88 hosts monitored, a mix of Windows and Linux hosts. After Zabbix ran, monitoring 5700 items from the generic Windows and Linux templates, our firewall fell to its knees and our entire network was f*'d. It turns out our firewall's TCP session table was getting completely filled up with sessions to our Zabbix server from Windows hosts that were one subnet hop away. Upon investigating our Windows servers, we found via netstat that the agent connections were not dying, essentially consuming more and more memory on the Windows hosts until they ceased to perform.

      We have now turned off Zabbix for all hosts but two Windows 2008 R2 servers for troubleshooting. I have cranked up the logging on the agent to debug and restarted the service. Within seconds I run netstat -an | find "10050" and more than 10 entries are there. Wait 2 seconds, rerun the same command, and there are more than 30 entries, then 50. At some point the connections are closing, but not at the rate they are being created. Here is the zabbix_agentd.log at debug level for this whole time (I don't see any problems):

      Code:
        4632:20130327:104718.946 Starting Zabbix Agent [SSI-BOSTON-01]. Zabbix 2.0.4 (revision 31980).
        4632:20130327:104718.948 In init_collector_data()
        4632:20130327:104718.948 End of init_collector_data()
        4632:20130327:104718.949 In init_perf_collector()
        4632:20130327:104718.949 End of init_perf_collector()
         424:20130327:104718.949 agent #0 started [collector]
         424:20130327:104718.949 In init_cpu_collector()
         424:20130327:104718.950 In get_counter_name() pdhIndex:238
        5276:20130327:104719.028 agent #1 started[listener]
       38080:20130327:104719.028 agent #2 started[listener]
       38868:20130327:104719.028 agent #3 started[listener]
       36100:20130327:104719.029 agent #4 started [active checks]
       36100:20130327:104719.029 In init_active_metrics()
       36100:20130327:104719.029 Buffer: first allocation for 100 elements
       36100:20130327:104719.029 In send_buffer() host:'127.0.0.1' port:10051 values:0/100
       36100:20130327:104719.029 End of send_buffer():SUCCEED
       36100:20130327:104719.029 refresh_active_checks('127.0.0.1',10051)
         424:20130327:104719.333 End of get_counter_name():SUCCEED
         424:20130327:104719.333 In get_counter_name() pdhIndex:6
         424:20130327:104719.333 End of get_counter_name():SUCCEED
         424:20130327:104719.333 In add_perf_counter() counter:'\Processor(_Total)\% Processor Time' interval:900
         424:20130327:104719.334 add_perf_counter(): PerfCounter '\Processor(_Total)\% Processor Time' successfully added
         424:20130327:104719.334 In add_perf_counter() counter:'\Processor(0)\% Processor Time' interval:900
         424:20130327:104719.334 add_perf_counter(): PerfCounter '\Processor(0)\% Processor Time' successfully added
         424:20130327:104719.334 In get_counter_name() pdhIndex:2
         424:20130327:104719.335 End of get_counter_name():SUCCEED
         424:20130327:104719.335 In get_counter_name() pdhIndex:44
         424:20130327:104719.335 End of get_counter_name():SUCCEED
         424:20130327:104719.335 In add_perf_counter() counter:'\System\Processor Queue Length' interval:900
         424:20130327:104719.338 add_perf_counter(): PerfCounter '\System\Processor Queue Length' successfully added
         424:20130327:104719.338 End of init_cpu_collector():SUCCEED
         424:20130327:104719.339 In collect_perfstat()
       36100:20130327:104720.032 Get active checks error: cannot connect to [[127.0.0.1]:10051]: [0x0000274D] No connection could be made because the target machine actively refused it.
       36100:20130327:104720.032 In process_active_checks('127.0.0.1',10051)
       36100:20130327:104720.033 End of process_active_checks()
       36100:20130327:104720.033 In get_min_nextcheck()
       36100:20130327:104720.033 In send_buffer() host:'127.0.0.1' port:10051 values:0/100
       36100:20130327:104720.033 End of send_buffer():SUCCEED
       36100:20130327:104720.033 Sleeping for 1 second(s)
      
      <snip many similar>
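
      The one error in the excerpt above is the active-check connection failure at 10:47:20. For a long debug log, a minimal sketch like the following can pull such lines out, assuming the `pid:date:time message` layout shown above. The function name and the keyword list are illustrative and not part of Zabbix.

```python
import re

# Matches the "pid:YYYYMMDD:HHMMSS.mmm message" layout seen in the excerpt.
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6}\.\d{3})\s+(.*)$")

# Illustrative keywords only (an assumption); adjust as needed.
KEYWORDS = ("error", "cannot connect", "failed")


def find_problem_lines(lines):
    """Return (pid, date, time, message) tuples for lines that look like errors."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that do not follow the expected layout
        pid, date, time_, msg = m.groups()
        if any(k in msg.lower() for k in KEYWORDS):
            hits.append((int(pid), date, time_, msg))
    return hits


# Three lines copied from the log excerpt in this post.
sample = [
    "36100:20130327:104719.029 End of send_buffer():SUCCEED",
    "36100:20130327:104720.032 Get active checks error: cannot connect to "
    "[[127.0.0.1]:10051]: [0x0000274D] No connection could be made because "
    "the target machine actively refused it.",
    "36100:20130327:104720.033 Sleeping for 1 second(s)",
]

for pid, date, time_, msg in find_problem_lines(sample):
    print(pid, date, time_, msg)
```

      In a real run the list would come from reading zabbix_agentd.log line by line instead of the hard-coded sample.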


      • tchjts1
        Senior Member
        • May 2008
        • 1605

        #4
        Good response. We noticed on our Windows servers the other day that there were a lot of TIME_WAIT sessions open as well, usually between 50 and 90.

        In fact, I opened up a Zabbix ticket regarding it. They pointed me to an issue that had already been opened. You can see it here: https://support.zabbix.com/browse/ZBX-3948

        But to summarize, Richlv has written this:
        that's a tcp stack feature of the operating system. when a connection is closed, it remains in TIME_WAIT state for some time to ensure that packets are dealt with properly in case of connection closing ACKs not properly reaching both sides or in case of Mysterious Packets from the Past arriving.
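
        To tell ordinary TIME_WAIT entries apart from connections that are still open, the netstat output can be tallied by state. This is a rough sketch that assumes the four-column TCP rows printed by Windows `netstat -an` (protocol, local address, foreign address, state). The function name is made up and the addresses in the sample are invented.

```python
from collections import Counter


def count_states(netstat_output, port=10050):
    """Tally TCP states for rows whose local or foreign address ends in :port."""
    suffix = ":" + str(port)
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected row: Proto, Local Address, Foreign Address, State
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        if local.endswith(suffix) or remote.endswith(suffix):
            counts[state] += 1
    return counts


# Invented sample output, not taken from the hosts in this thread.
sample = """
  TCP    0.0.0.0:10050      0.0.0.0:0          LISTENING
  TCP    10.0.0.5:10050     10.0.0.2:51234     TIME_WAIT
  TCP    10.0.0.5:10050     10.0.0.2:51240     TIME_WAIT
  TCP    10.0.0.5:10050     10.0.0.2:51251     ESTABLISHED
  TCP    10.0.0.5:3389      10.0.0.9:60000     ESTABLISHED
"""

for state, n in sorted(count_states(sample).items()):
    print(state, n)
```

        Running the tally a few seconds apart shows which state is growing. A pile of TIME_WAIT rows matches the behaviour described above, while a steadily climbing ESTABLISHED count would suggest something else is going on.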

        I know that doesn't answer your current question. So, what is it saying on Zabbix server in zabbix_server.log regarding either of those 2 servers?

        "Target server actively refused connection" almost sounds like the entry for either Server= or ServerActive= isn't set correctly.

        You mention recently upgrading. Do the zabbix_agentd.conf files you are using have the ServerActive= parameter in them? That is what determines where the host connects for its active checks. It should be the Zabbix server IP or DNS name.
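
        A quick way to check this across several agents is to read the two parameters out of each zabbix_agentd.conf and flag any that point back at the agent's own machine. The sketch below assumes plain `key=value` lines with `#` comments and treats the value as a comma-separated list of IPv4 addresses or host names. The helper names and the sample values are invented.

```python
def read_agent_params(conf_text, names=("Server", "ServerActive")):
    """Return {name: value} for the requested parameters in key=value text."""
    found = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments and lines without a value
        key, value = line.split("=", 1)
        key = key.strip()
        if key in names:
            found[key] = value.strip()
    return found


def points_at_loopback(value):
    """True if any comma-separated entry is localhost or a 127.x.x.x address.

    IPv4 addresses and host names only; IPv6 is not handled in this sketch.
    """
    hosts = [h.strip().split(":")[0] for h in value.split(",")]
    return any(h == "localhost" or h.startswith("127.") for h in hosts)


# Invented example values, not taken from the hosts in this thread.
sample = """
# Passive checks: who may query this agent
Server=192.0.2.10
# Active checks: where this agent connects to
ServerActive=127.0.0.1
"""

for name, value in read_agent_params(sample).items():
    if points_at_loopback(value):
        print(name + " points at the agent's own machine: " + value)
```

        With ServerActive left at a loopback address, the agent tries to reach port 10051 on itself, which would match the "actively refused" line in the log above.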


        • clahti
          Senior Member
          • Jan 2007
          • 126

          #5
          We are not yet using active checks, so this value is set to 127.0.0.1 per the default Windows installer. I am going to try active checks to see if this helps as well.


          • beli
            Junior Member
            • Mar 2013
            • 12

            #6
            Did you check that you have the connection allowed in local firewall on the Windows servers?
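
            One way to test this from the Zabbix server side is a plain TCP connect to the agent port. This is only a sketch: the function name is made up, and a successful connect shows that the port is reachable, not that the agent answers item requests.

```python
import socket


def port_reachable(host, port=10050, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

            For example, `port_reachable("192.0.2.20")` (a placeholder address) would return False if the Windows firewall drops or rejects traffic to port 10050.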
