Active agent randomly stops communicating until restart

  • gherbstman
    Junior Member
    • May 2019
    • 17

    #1

    Active agent randomly stops communicating until restart

    We have an ongoing issue where the Zabbix agent randomly stops communicating with our server. Once it stops, it does not resume until we restart the agent service on the monitored computer.

    We are seeing this with both agent 1.0 and agent 2.0 and multiple versions of the agents including the most current.

    We are also seeing this on multiple versions of Windows Server, ranging from 2012 to 2016. It may also affect newer versions such as 2019 and 2022, but we are not sure of that.

    The service stays running and the logs show terse information about not being able to communicate.

    We are using both a DNS name and a backup IP address for the active server configuration on the agent.

    The event logs show either the name or the IP address failing, seemingly at random. At any given time this affects only a couple of agents out of a couple hundred.

    Has anyone else been seeing this and does anyone, more importantly, have a solution?
    Last edited by gherbstman; 24-11-2022, 15:45.
  • Markku
    Senior Member
    Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
    • Sep 2018
    • 1782

    #2
    Originally posted by gherbstman
    the logs show terse information about not being able to communicate.
    Can you show us the actual logs?

    Markku


    • gherbstman
      Junior Member
      • May 2019
      • 17

      #3
      I will have to get you those later. I turned up logging on an agent to see if I can collect more details.

      One of the messages we see is: Active check data upload started to fail


      • Markku
        Senior Member
        Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
        • Sep 2018
        • 1782

        #4
        That error basically means that there is a connectivity problem from the agent to the server. What can you tell us about the network topology between the agent and server?
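To line the failures up with firewall or server logs, it helps to know exactly when the message first appears on an affected agent. A minimal sketch, assuming the agent log is a plain text file and that the message quoted above always contains the phrase "started to fail"; the function names are hypothetical, and the log path is whatever LogFile points to in your agent configuration:

```python
from pathlib import Path

# Substring of the message quoted in this thread; matched case-insensitively.
NEEDLE = "started to fail"


def find_failures(lines):
    """Return (line_number, text) for every line that contains NEEDLE."""
    return [
        (number, line.rstrip("\r\n"))
        for number, line in enumerate(lines, start=1)
        if NEEDLE in line.lower()
    ]


def scan_log(path):
    """Scan one agent log file and return the matching lines."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return find_failures(handle)
```

Calling `scan_log(...)` with the agent's log path returns the matching lines together with their line numbers, so the surrounding entries can be read in context.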

        Markku


        • gherbstman
          Junior Member
          • May 2019
          • 17

          #5
          Zabbix server with a public IP and just basic firewall rules. No fancy IP or security services.

          The agent (monitored) server sits behind a SonicWALL along with about 20 other monitored servers on the same LAN. The other servers maintain connectivity, but every so often a random one stops reporting in. Restarting the Zabbix agent on that monitored server fixes the issue.

          Out of about 300 monitored servers, we are seeing 1 or 2 per week do this.

          Some clarification on agent V2: I had one go offline today, and it seems to be a different issue. In this case the service was stopped. We have replaced most of our V2 agents with V1 because of this different issue.


          • Markku
            Senior Member
            Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
            • Sep 2018
            • 1782

            #6
            Install and run Wireshark (or some other packet capturing software) on the Windows host when the problem occurs; that will give you data about the connection attempts. If possible, start the capture earlier, before the problem begins, so that you can see what happens at the moment it starts. If your agents show up with unique IP addresses on the Zabbix server, you can also run tcpdump on the server side to capture the connections. Ideally you would have simultaneous captures from both the client side and the server side.

            If you don't manage the firewalls yourself, also consult your firewall admins about the case, check the session logs on the firewall for the affected agents, and so on. Every agent connection (the interval depends on your item configurations) is a separate TCP connection in the firewall logs.

            If you need help interpreting the resulting capture files (from Wireshark or tcpdump), let me know.
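Alongside the captures, a small probe run on an affected host can record whether a plain TCP connection to the server succeeds by name and by IP address. A minimal sketch, assuming the server listens on the default trapper port 10051; the host name and address below are placeholders, and the function names are hypothetical:

```python
import socket
import time
from datetime import datetime, timezone

# Placeholders: substitute the DNS name and IP address the agent is configured with.
TARGETS = ["zabbix.example.com", "203.0.113.10"]
PORT = 10051  # default Zabbix server trapper port; adjust if yours differs


def probe(host, port, timeout=5.0):
    """Attempt one TCP connection and return (success, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:  # covers DNS failures, refusals, and timeouts
        return False, f"{type(exc).__name__}: {exc}"


def probe_all(targets, port, timeout=5.0):
    """Probe each target once and return one timestamped line per target."""
    lines = []
    for host in targets:
        ok, detail = probe(host, port, timeout)
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        status = "OK" if ok else "FAIL"
        lines.append(f"{stamp} {host}:{port} {status} {detail}")
    return lines


def monitor(targets, port, interval=60.0, rounds=10):
    """Print probe results every `interval` seconds for `rounds` rounds."""
    for _ in range(rounds):
        for line in probe_all(targets, port):
            print(line, flush=True)
        time.sleep(interval)
```

Running `monitor(TARGETS, PORT)` on the affected host while the agent is failing shows whether the host itself can still reach the server. If the probe connects while the agent cannot, the problem is more likely inside the agent than on the network path.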

            Markku


            • gherbstman
              Junior Member
              • May 2019
              • 17

              #7
              Wireshark is a good idea. It will take a while as this only occurs once in a while on a random agent.
