Zabbix 7.0: Random "Zabbix Agent/active checks not available"

  • muelli
    Member
    • Jun 2021
    • 68

    #1

    Hello forum,

    once again I have a weird problem on one of my freshly installed 7.0 servers (not upgraded servers!).
    I randomly get "Zabbix Agent is not available" or "Active checks are not available" alerts across different servers at different times.
    So far I have found that it has nothing to do with network or firewall issues: I was on the Dashboard when one alert popped up, and I logged in to the server that seemed unavailable.
    Everything was up and running and reachable and I could connect to port 10051 on the zabbix server with telnet. Still, it took more time for the alerts to go away.
    When checking last data it seems the agent is collecting data in the background as it all seems there, so it is not a complete agent stall....
    I am using agent2 btw.

    Does anybody have any idea how I could debug this problem further? I am quite clueless here, sorry.

    Thanks!

    edit: while the alert was shown, I had an ICMP simple check running that showed the server was actually up and responding as well.
    Last edited by muelli; 29-06-2024, 10:51.
  • muelli
    Member
    • Jun 2021
    • 68

    #2
    I forgot to attach the relevant agent2 log for that time:

    2024/06/29 03:29:02.113439 detected 11 time difference between queue checks, rescheduling tasks
    2024/06/29 03:29:06.215221 plugin 'Cpu': time spent in collector task 2.641208 s exceeds collecting interval 1 s
    2024/06/29 03:29:06.921491 plugin 'VFSDev': time spent in collector task 3.374806 s exceeds collecting interval 1 s
    2024/06/29 03:29:07.746369 [101] cannot connect to [zabbix_server:10051]: write tcp 192.168.123.138:54551->zabbix_server:10051: i/o timeout
    2024/06/29 03:29:07.746526 [101] active check configuration update from host [ssh] started to fail
    2024/06/29 03:29:07.826312 [101] cannot connect to [zabbix_server:10051]: write tcp 192.168.123.138:57449->zabbix_server:10051: i/o timeout
    2024/06/29 03:29:07.826408 [101] history upload to [zabbix_server:10051] [ssh] started to fail
    2024/06/29 03:55:45.084712 [101] history upload to [zabbix_server:10051] [ssh] is working again
    2024/06/29 03:55:45.970458 [101] active check configuration update from [zabbix_server:10051] is working again

    As mentioned before, the i/o timeout cannot be replicated, as I was logged in and checking network connection at that time.
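
To put a number on the gap between "started to fail" and "is working again" in an agent2 log like the one above, a small parser can pair those lines up. This is only a sketch: it assumes the line layout matches the excerpt in this post, and `outage_windows` and `SAMPLE` are illustrative names, not part of Zabbix.

```python
import re
from datetime import datetime

# Assumed layout: "YYYY/MM/DD HH:MM:SS.ffffff [id] <check> ... <state>"
LINE_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) "
    r"\[\d+\] (history upload|active check configuration update)"
    r".*?(started to fail|is working again)$"
)

# Lines copied from the log excerpt in this post.
SAMPLE = """\
2024/06/29 03:29:07.746369 [101] cannot connect to [zabbix_server:10051]: write tcp 192.168.123.138:54551->zabbix_server:10051: i/o timeout
2024/06/29 03:29:07.746526 [101] active check configuration update from host [ssh] started to fail
2024/06/29 03:29:07.826408 [101] history upload to [zabbix_server:10051] [ssh] started to fail
2024/06/29 03:55:45.084712 [101] history upload to [zabbix_server:10051] [ssh] is working again
2024/06/29 03:55:45.970458 [101] active check configuration update from [zabbix_server:10051] is working again
"""


def outage_windows(lines):
    """Return (check, start, end, duration) for each completed outage."""
    open_since = {}  # check name -> time it started to fail
    windows = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a state-change line
        ts = datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S.%f")
        check, state = m.group(2), m.group(3)
        if state == "started to fail":
            open_since.setdefault(check, ts)
        elif check in open_since:
            start = open_since.pop(check)
            windows.append((check, start, ts, ts - start))
    return windows


for check, start, end, duration in outage_windows(SAMPLE.splitlines()):
    print(f"{check}: {start} -> {end} ({duration})")
```

On the excerpt above this gives two windows of roughly 26.6 minutes each, one per check type.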

    • Markku
      Senior Member
      Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
      • Sep 2018
      • 1781

      #3
      Interesting details: collectors are running longer than the 1s interval that is configured (very short interval btw!), and TCP connectivity from the agent to the server is failing at the same time. Are these possibly related to each other? Just thinking aloud. Also the "time difference" is interesting.

      And, why does it take almost 30 minutes to restore the TCP connectivity?

      You checked the firewall you said. What did you see in the logs there? I mean, the agent will keep trying after the connection failure. Did you or did you not see the agent trying?

      Did you mean that while the agent said that it is still unable to connect to the server, that you were able to manually connect from the agent to server:10051?

      What did the server log say at the same time?

      Increasing the logging level (maybe first on the agent side) should give you more information about what the agent is doing. Also, as a networking person, I encourage you to take packet captures on both sides of the conversation at the same time to see what the components are actually doing.

      "Different servers and different times", sounds like a "centralized" problem, either in the middleboxes (firewalls etc) or in the server-side.

      Markku

      • muelli
        Member
        • Jun 2021
        • 68

        #4
        >>collectors are running longer than the 1s interval that is configured (very short interval btw!)

        I left the plugin timeout at default setting (which is not mentioned in the logfile) but I can raise it....

        About the firewall: there is no outbound firewall. The connection is completely unrestricted, and since the subnet is connected to some random ISP router, there are no logfiles on that device either...

        >Did you mean that while the agent said that it is still unable to connect to the server, that you were able to manually connect from the agent to server:10051?
        Yes exactly:
        "cat /dev/null |telnet zabbix_server 10051" was successful, i.e. able to connect.
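
A one-off telnet only shows that the TCP handshake worked at that moment. A repeated, timestamped probe makes it easier to line connection attempts up with the alerts. The following is a minimal sketch under that assumption; `probe` and `watch` are made-up helper names, and like telnet it only tests the TCP connect, not the TLS handshake or the Zabbix protocol itself.

```python
import socket
import time
from datetime import datetime


def probe(host, port, timeout=3.0):
    """Try one TCP connect; return (ok, seconds_taken, error_text)."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - started, ""
    except OSError as exc:  # covers refused, timeout and DNS errors
        return False, time.monotonic() - started, str(exc)


def watch(host, port, interval=5.0, rounds=None):
    """Probe repeatedly and print one timestamped line per attempt.

    rounds=None means run until interrupted with Ctrl+C.
    """
    done = 0
    while rounds is None or done < rounds:
        ok, secs, err = probe(host, port)
        stamp = datetime.now().strftime("%Y/%m/%d %H:%M:%S")
        status = "ok" if ok else "FAIL"
        print(f"{stamp} {host}:{port} {status} {secs:.3f}s {err}")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)


# Example (runs until interrupted):
# watch("zabbix_server", 10051)
```

If the probe keeps reporting "ok" while the agent logs i/o timeouts, the problem is more likely after the TCP handshake (TLS or the server's handling of the connection) than in basic reachability.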

        However I found something interesting on the zabbix server around that same time:

        1245259:20240629:032907.427 failed to accept an incoming connection: from X.X.X.X: reading first byte from connection failed: [11] Resource temporarily unavailable
        1245260:20240629:032907.754 failed to accept an incoming connection: from X.X.X.X: SSL_accept() timed out

        Later on there is nothing related in the log.....

        I have a feeling it might have something to do with the zabbix server.... but I am unsure what it could be
        I would raise the loglevel if I knew which agent would fail next......


        a shot in the dark:
        echo 5000000 >/proc/sys/net/core/somaxconn
        echo 5000000 >/proc/sys/net/ipv4/tcp_max_syn_backlog
        Maybe my TCP connections ran out, but since this Zabbix instance is only monitoring 20 servers, that is unlikely.
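
One way to check the backlog guess before or after touching the sysctls: on Linux the kernel keeps `ListenOverflows` and `ListenDrops` counters in the `TcpExt` section of `/proc/net/netstat`. If they do not grow during an alert, a full listen queue is probably not the cause. The sketch below parses that file format; the helper names are invented for illustration and `SAMPLE` is a shortened, made-up excerpt, not real data from this server.

```python
def parse_netstat(text):
    """Parse /proc/net/netstat-style text into {section: {counter: value}}."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    stats = {}
    # The file comes in pairs: a line of counter names, then a line of values.
    for names, values in zip(lines[0::2], lines[1::2]):
        section = names[0].rstrip(":")
        stats[section] = dict(zip(names[1:], map(int, values[1:])))
    return stats


def listen_pressure(text):
    """Return the two counters that grow when a listen queue overflows."""
    tcp_ext = parse_netstat(text).get("TcpExt", {})
    return {k: tcp_ext.get(k, 0) for k in ("ListenOverflows", "ListenDrops")}


# Shortened, made-up excerpt in the same layout as /proc/net/netstat.
SAMPLE = """\
TcpExt: SyncookiesSent ListenOverflows ListenDrops
TcpExt: 0 3 5
IpExt: InNoRoutes InOctets
IpExt: 0 123456
"""

print(listen_pressure(SAMPLE))

# On the Zabbix server itself (Linux only):
# with open("/proc/net/netstat") as f:
#     print(listen_pressure(f.read()))
```

Running it once before and once after an alert and comparing the two results shows whether the counters moved.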

        edit: raising somaxconn did nothing to solve the problem.
        the error on the same host appeared again today for around 2 minutes.
        I raised loglevel on this host/agent and will check back tomorrow.
        Last edited by muelli; 30-06-2024, 17:31.

        • aliberry
          Junior Member
          • Jul 2024
          • 1

          #5
          Hey muelli - did you have any luck in solving this? We're seeing similar things with our 7.0 proxies. We've filed a Zabbix bug https://support.zabbix.com/browse/ZBX-24658

          • Markku
            Senior Member
            Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
            • Sep 2018
            • 1781

            #6
            In order to understand the situation, I would take packet captures between all the components: between server and proxies (to find out what the server says to the proxies, who monitors who), and proxies and the agent (to find out what the agent gets). Then I would use Wireshark 4.3.0rc1 to see the captures and understand what the components were doing.

            It is easier without TLS, but decrypting TLS is also possible:

            One of the built-in security features in Zabbix is TLS (Transport Layer Security) support for external connections. This means that when your distributed Zabbix proxies or Zabbix agents connect to …


            Post the captures (or links) here if you need help with them. (Please don't show any text exports, nobody has time to try to understand them, capture files only so that they are workable with Wireshark. Just make sure you don't publish anything you don't want to publish.)

            Also add the relevant server, proxy and agent logs from the same time, to be able to correlate the log events with the capture contents.
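
For the correlation step, the server log and the agent2 log use different timestamp layouts, so it can help to normalize both and sort them into one timeline. This is a rough sketch that assumes the two layouts seen earlier in this thread and that both hosts log in the same timezone with reasonably synced clocks; `parse_line` and `merge_logs` are hypothetical helper names.

```python
import re
from datetime import datetime

# agent2 layout seen in this thread: "2024/06/29 03:29:07.746369 message"
AGENT2_RE = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+) (.*)$")
# server layout seen in this thread: "PID:20240629:032907.427 message"
SERVER_RE = re.compile(r"^\d+:(\d{8}:\d{6}\.\d{3}) (.*)$")


def parse_line(line, source):
    """Return (timestamp, source, message), or None if no timestamp matched."""
    line = line.strip()
    m = AGENT2_RE.match(line)
    if m:
        ts = datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S.%f")
        return ts, source, m.group(2)
    m = SERVER_RE.match(line)
    if m:
        ts = datetime.strptime(m.group(1), "%Y%m%d:%H%M%S.%f")
        return ts, source, m.group(2)
    return None


def merge_logs(**logs):
    """Merge logs given as name=list_of_lines into one time-ordered list."""
    events = []
    for source, lines in logs.items():
        for line in lines:
            parsed = parse_line(line, source)
            if parsed:
                events.append(parsed)
    return sorted(events, key=lambda event: event[0])


# Lines copied from the excerpts posted earlier in this thread.
agent_lines = [
    "2024/06/29 03:29:07.746369 [101] cannot connect to [zabbix_server:10051]: write tcp 192.168.123.138:54551->zabbix_server:10051: i/o timeout",
]
server_lines = [
    "1245259:20240629:032907.427 failed to accept an incoming connection: from X.X.X.X: reading first byte from connection failed: [11] Resource temporarily unavailable",
    "1245260:20240629:032907.754 failed to accept an incoming connection: from X.X.X.X: SSL_accept() timed out",
]

for ts, source, message in merge_logs(agent=agent_lines, server=server_lines):
    print(ts.isoformat(timespec="milliseconds"), source, message[:60])
```

The merged timestamps can then be compared against packet times in the capture files.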

            Markku

            • muelli
              Member
              • Jun 2021
              • 68

              #7
              After enabling debug log on both server/agents, the problem somehow went away for me and has not yet re-appeared.
              I captured data for 24h but nothing went unavailable again, so I tossed it.
              I am still puzzled about what happened. Maybe it was a network problem.....

              • EHRETic
                Member
                • Jan 2021
                • 45

                #8
                Hi,

                Since updating Zabbix server to 7.0.1 yesterday, I also have Active Checks randomly not working.
                My setup had been working fine with 7.0.0 for weeks (I updated almost straight away).

                Rebooting the server seems to solve the issue for some hosts, but then others start failing. Since the last reboot, the server itself has not been able to run Active Checks:

                [Attached image: image.png]
                Updating the agent to 7.0.1 does not solve the issue (I tried on a couple of VMs, and the Zabbix host is already updated).

                It's my own home lab, so I don't bother too much, but this version surely has problems.
                Most of my VMs have been connected to the Zabbix server for at least a year, and I never had such connectivity issues prior to yesterday (most VMs are on the same subnet).

                Kind regards
                Franck

                • Markku
                  Senior Member
                  Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                  • Sep 2018
                  • 1781

                  #9
                  Two main options (or maybe combined) to analyze the situation:
                  1. DebugLevel=4, maybe in the agent (unless your agent log already tells you what happened)
                  2. Packet capture (maybe like "sudo tcpdump -v -W 1000 -C 1 port 10051 -w /var/tmp/activeagent.pcap" on the active agent, to get max 1000 x 1 MB captures)

                  (I take it that you don't have any proxies so https://support.zabbix.com/browse/ZBX-24658 is not a concern here)

                  Markku
