Agents can't find server

  • SSD
    Junior Member
    • Oct 2012
    • 3

    #1

    Hi

    We've been using Zabbix for over a year now in an MSP-style setup, with remote agents doing active checks on Windows servers. Over the weekend, for an as-yet-unknown reason, almost half of our agents suddenly stopped communicating with our server. Judging by the last time each one contacted the Zabbix server, the failures seem to coincide (mostly) with a reboot of the monitored machine.

    Last Tuesday (7 days ago) we migrated to a new public IP address. On the surface this doesn't seem to be the issue, as the monitored servers can /all/ perform successful nslookup, tracert, ping and telnet hostname:10051 tests against our new IP address and the hostname of our Zabbix server. I've also flushed DNS and ARP caches, double-checked agent configs and port-forward rules, rebooted servers, and restarted the Zabbix server, our router and our firewall.
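
The nslookup and telnet checks described above can be scripted so they are easy to repeat on many hosts. This is a minimal Python sketch, assuming IPv4 only; `check_server` is a hypothetical helper name, not a Zabbix tool. A successful connect only shows that the TCP handshake completes, not that larger responses get through.

```python
import socket


def check_server(host, port=10051, timeout=5.0):
    """Resolve host and attempt a TCP connection (roughly nslookup plus telnet).

    Returns (ip, connected); ip is None when name resolution fails.
    """
    try:
        ip = socket.gethostbyname(host)  # IPv4 lookup only
    except socket.gaierror:
        return None, False
    try:
        # Open and immediately close a connection to the resolved address.
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return ip, False
```

Calling `check_server("zabbix.mydomain.com")` on a monitored machine returns the resolved address and whether port 10051 accepted the connection.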

    The Zabbix server log shows nothing relevant (as you'd expect), and the only relevant entries in the agent logs (debug level 4) are these lines:

    18660:20121030:103752.280 Get active checks error: ZBX_TCP_READ() failed [A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.]
    18660:20121030:103752.280 In process_active_checks('zabbix.mydomain.com',10051)
    18660:20121030:103752.280 End of process_active_checks()
    18660:20121030:103752.280 In get_min_nextcheck()
    18660:20121030:103752.280 In send_buffer() host:'zabbix.mydomain.com' port:10051 values:0/100
    18660:20121030:103752.280 End of send_buffer():SUCCEED
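
To find when each agent started failing, lines like these can be filtered mechanically. This is a rough Python sketch, assuming the `PID:YYYYMMDD:HHMMSS.mmm message` layout seen in the excerpt above; `parse_line` and `active_check_errors` are hypothetical names, not part of Zabbix.

```python
import re
from datetime import datetime

# Assumed layout: PID:YYYYMMDD:HHMMSS.mmm message
LINE_RE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6})\.(\d{3})\s+(.*)$")


def parse_line(line):
    """Split one agent log line into (pid, timestamp, message), or None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    pid, date, time, millis, message = m.groups()
    stamp = datetime.strptime(date + time, "%Y%m%d%H%M%S")
    stamp = stamp.replace(microsecond=int(millis) * 1000)
    return int(pid), stamp, message


def active_check_errors(lines):
    """Yield (timestamp, message) for failed active-check fetches."""
    for line in lines:
        parsed = parse_line(line)
        if parsed and parsed[2].startswith("Get active checks error"):
            yield parsed[1], parsed[2]
```

Passing an open log file to `active_check_errors` gives the timestamps of every failed fetch, which can then be compared against reboot times.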


    Our Zabbix server is v1.8.4 on CentOS 5, behind a firewall, with TCP 10051 forwarded to it.
    Our agents are v1.8.5.
    All our remote agents use a DNS hostname to contact the server (and all resolve to the correct IP).
    All our remote agents use active checks.
    This is only affecting around half our remote agents.
    There is no obvious common factor between the non-communicating agents: different firewalls/routers at each site, different Windows server versions, and not all agents on a given site are affected.

    I realise this isn't a networking help forum, but if anyone can offer some advice on how to troubleshoot the Zabbix agent in more detail, or knows of a recent Windows update which interferes with agent v1.8.5, or has anything else relevant, it would be much appreciated.

    Thanks in advance
    Mark
  • jerrylenk
    Member
    Zabbix Certified Specialist
    • May 2010
    • 62

    #2
    Have you double-checked the agentd.conf files on the agents?
    If anything in there has been changed, for whatever reason, the agent will not notice until it is restarted.

    Also, the host config in Zabbix may have an IP address from which to accept active-check requests from that host. Could these addresses have changed?


    • mbsit
      Senior Member
      • Sep 2012
      • 130

      #3
      Hi
      In my opinion, after your servers (where the agent runs) restarted, the "hostname" changed (maybe the domain changed, etc.).
      So the agent connects successfully, but no host with that (different) name is defined on the server.

      Best,
      Grzegorz

      --
      Deployments, IT services - Warsaw
      Regards
      Grzegorz Grabowski
      ____
      Deployments, training, service contracts
      Warsaw - Poland


      • SSD
        Junior Member
        • Oct 2012
        • 3

        #4
        Hi

        Thanks for the replies.

        The agent configs are all as they should be: the host name, server name and other parameters are all correct. The hosts' communications are not restricted to a particular IP (and in any case it's our IP that changed, not the host servers').
        The hostnames haven't changed on the host machine, in the agent config or on the Zabbix server.

        OK, now I'm baffled. I've just restarted an agent that had been fully functional until now, and it has stopped communicating even though the host machine tests successfully against our server name.

        Mark


        • SSD
          Junior Member
          • Oct 2012
          • 3

          #5
          Well, I found the problem. A bit of packet capturing and testing showed that the default MTU on our router was set too high; it had nothing to do with Zabbix.

          What is curious is that agents that were running during our IP migration were not affected until they were restarted. I'm not an IP expert, but I would have thought the TCP sessions would have been renegotiated before now, at least when our systems were restarted. Anyway, problem solved.
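
For anyone hitting the same symptom, one common way to check for an MTU problem is to ping with the Don't Fragment bit set and a payload sized to the suspected MTU. Below is a minimal Python sketch that only builds the command. The function names are hypothetical, the 20-byte IPv4 and 8-byte ICMP header sizes assume no IP options, and 1500 is just the standard Ethernet MTU, not a value reported in this thread.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def ping_command(host, mtu, windows=True):
    """Build a single-probe ping command with the Don't Fragment bit set."""
    size = str(icmp_payload_for_mtu(mtu))
    if windows:
        # Windows ping: -f = Don't Fragment, -l = payload size, -n = count
        return ["ping", "-f", "-l", size, "-n", "1", host]
    # Linux iputils ping: -M do = prohibit fragmentation, -s = size, -c = count
    return ["ping", "-M", "do", "-s", size, "-c", "1", host]
```

If the probe for a given MTU fails while a smaller one succeeds, something on the path cannot carry packets of the larger size; the resulting list can be passed to `subprocess.run` on the monitored machine.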

          Thanks for your help!

          Mark
