Issues with active agents after proxy update to 7.0.18

  • troffasky
    Senior Member
    • Jul 2008
    • 567

    #1

    Issues with active agents after proxy update to 7.0.18


    Updated zabbix-proxy from 7.0.15 to 7.0.18.
    Since then, the proxy logs have been spammed with:

    Code:
    16382:20250829:102418.404 failed to accept an incoming connection: from 172.31.252.16: reading first byte from connection failed: [11] Resource temporarily unavailable
    16380:20250829:102418.663 failed to accept an incoming connection: from 172.31.252.81: reading first byte from connection failed: [11] Resource temporarily unavailable
    16379:20250829:102418.665 failed to accept an incoming connection: from 172.31.252.12: reading first byte from connection failed: [11] Resource temporarily unavailable
    16381:20250829:102418.771 failed to accept an incoming connection: from 172.31.252.38: reading first byte from connection failed: [11] Resource temporarily unavailable
    16383:20250829:102418.993 failed to accept an incoming connection: from 172.31.252.70: reading first byte from connection failed: [11] Resource temporarily unavailable
    16382:20250829:102422.439 failed to accept an incoming connection: from 172.31.252.49: reading first byte from connection failed: [11] Resource temporarily unavailable
    16380:20250829:102422.666 failed to accept an incoming connection: from 172.31.252.82: reading first byte from connection failed: [11] Resource temporarily unavailable
    16379:20250829:102422.669 failed to accept an incoming connection: from 172.31.252.64: reading first byte from connection failed: [11] Resource temporarily unavailable
    16381:20250829:102422.776 failed to accept an incoming connection: from 172.31.252.58: reading first byte from connection failed: [11] Resource temporarily unavailable
    16383:20250829:102422.997 failed to accept an incoming connection: from 172.31.252.32: reading first byte from connection failed: [11] Resource temporarily unavailable

    Metrics are making it through, but with many gaps, i.e. the active agent availability item is flapping.

    It does not seem to be a resource issue on the proxy: load average is 0.7 and 0 B of swap is in use.
    The proxy config file was not changed as part of the upgrade; its timestamp is 10 months old.

    If I google this, almost every post is about agent encryption. That does not apply here; no encryption is in use.
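
    For what it's worth, errno [11] is EAGAIN, so the accepted connection apparently produced no data before the read gave up. A rough sketch for seeing how bursty these errors are (the log path is an assumption; adjust to your LogFile setting):

    Code:
    # Count the "failed to accept" errors per minute straight from the proxy log
    grep 'failed to accept an incoming connection' /var/log/zabbix/zabbix_proxy.log \
        | awk -F: '{print $2 ":" substr($3,1,4)}' | sort | uniq -c | tail -n 20
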
  • troffasky
    Senior Member
    • Jul 2008
    • 567

    #2
    The proxy is also struggling to talk to the server; proxy poller utilisation is stuck at >98%.
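
    If the bottleneck is the server-side "proxy poller" processes (the ones that poll a passive proxy), the usual knobs are the internal busy-rate item and the process count in zabbix_server.conf; a sketch only, with an example value:

    Code:
    # Internal item on the Zabbix server showing proxy poller busyness
    zabbix[process,proxy poller,avg,busy]

    # Number of proxy poller processes in zabbix_server.conf (example value)
    StartProxyPollers=5
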

    • markfree
      Senior Member
      • Apr 2019
      • 868

      #3
      Have you checked your configuration file?
      Some updates push new configuration files with minor changes. Depending on the package manager, the application may start using the new file instead of the old one.
      It's also possible that your proxy configuration needs some process tuning to better handle the load.
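
      One quick way to check whether the package manager left a newer config variant next to the old one (paths and suffixes are the usual RPM/DEB conventions, not anything specific to this setup):

      Code:
      # Look for config variants dropped by the package manager during the upgrade
      ls -l /etc/zabbix/zabbix_proxy.conf*
      find /etc/zabbix -name '*.rpmnew' -o -name '*.dpkg-dist'

      # If one exists, diff it against the running config
      diff -u /etc/zabbix/zabbix_proxy.conf /etc/zabbix/zabbix_proxy.conf.rpmnew
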

      • troffasky
        Senior Member
        • Jul 2008
        • 567

        #4
        The proxy config file was not changed as part of the upgrade; its timestamp is 10 months old.

        • troffasky
          Senior Member
          • Jul 2008
          • 567

          #5
          This just "fixed itself" 18h later. It *definitely* started right after upgrading from 7.0.15 to 7.0.18.

          • Semicolon
            Junior Member
            • Sep 2025
            • 3

            #6
            I also see this behavior, and it also "fixed itself." But then it went back to being busted again; this time just about all agents are reporting disconnected and down for over 30 hours.

            I went from 7.0.17 to 7.0.18.

            • troffasky
              Senior Member
              • Jul 2008
              • 567

              #7
              This is pretty weird. I know you're all probably thinking "networking issue, 100%", but I am convinced it's the application. The issue did not come back after restarting the service, though.
              I have Smokeping running on the same host, and there was no packet loss to any of the agents, nor any increase in ping time or jitter.
              The agents connect over a WireGuard tunnel that terminates on the Zabbix proxy host.
              The Zabbix server connects to the Zabbix proxy through a more conventional L3 route through the datacentre firewall.
              I cannot replicate the issue by restarting the service, so I'm not sure what useful information I can add for troubleshooting.

              • Markku
                Senior Member
                Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                • Sep 2018
                • 1781

                #8
                My guess is some kind of TCP resource starvation: too many TCP connections coming to the proxy or something like that.

                Look at the various tools that give you some idea of the TCP socket situation; I'd start with the plain "ss" command. And/or try increasing the various kernel-level TCP socket settings.

                From the above (the use of a WireGuard tunnel) I understand that your proxy is reachable from the internet, so it may just be noise from the internet causing this. Or it could be something else; it's hard to say without knowing all the surrounding details.

                If your skills allow, use Wireshark+sshdump or tcpdump to see the traffic and figure out what's happening.

                Markku
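
                A minimal sketch of the checks suggested above, assuming the proxy listens on the default port 10051 (the sysctl line only reads the current values; tune with care if you change anything):

                Code:
                # Listening socket and accept-queue state for the proxy's trapper port
                ss -ltn 'sport = :10051'
                ss -s                                # overall socket summary
                netstat -s | grep -i listen          # listen-queue overflow / drop counters

                # Kernel-level TCP socket settings mentioned above (read-only here)
                sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog
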

                • troffasky
                  Senior Member
                  • Jul 2008
                  • 567

                  #9
                  Interesting, I've never used the ss command before. It would be good as a Zabbix agent metric.
                  I don't recall anything in journalctl about running out of TCP sockets, and the issue didn't go away after a reboot.
                  Like I said, I can't reproduce it, so I can't troubleshoot this any further, but I will look at that ss command if it happens again.
                  FWIW, I had no issues SSHing in or browsing the web interface.
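
                  If you do want ss exposed as an agent metric, a minimal UserParameter sketch (the key name is made up, and the "ss -s" output format can vary between versions):

                  Code:
                  # e.g. /etc/zabbix/zabbix_agentd.d/tcp_sockets.conf (usual include directory)
                  # Total TCP sockets as reported by "ss -s"
                  UserParameter=custom.tcp.total,ss -s | awk '/^TCP:/ {print $2}'
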

                  • Markku
                    Senior Member
                    Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                    • Sep 2018
                    • 1781

                    #10
                    If the issue recurs, you may also want to check the proxy metrics and see whether you need to increase the number of trappers on the proxy (unless a lot of them are already configured).

                    Markku
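
                    For reference, the trapper count and the internal busy-rate item are both standard names; the number below is only an example:

                    Code:
                    # zabbix_proxy.conf - number of pre-forked trapper processes (example value)
                    StartTrappers=10

                    # Internal item to graph trapper busyness on the proxy
                    zabbix[process,trapper,avg,busy]
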

                    • troffasky
                      Senior Member
                      • Jul 2008
                      • 567

                      #11
                      This metric is collected and you cannot even tell where the issue begins (1) and ends (2):
                      [Attached graph: image.png]

                      • Semicolon
                        Junior Member
                        • Sep 2025
                        • 3

                        #12
                        It came back for me, and lasted for nearly a week. Then it self-resolved three days ago.
                        It just came back an hour ago.

                        I can confirm that it is not a TCP ephemeral port exhaustion issue or some other TCP resource issue.
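
                        For anyone wanting to rule that out the same way, a rough sketch comparing ports in use against the configured ephemeral range:

                        Code:
                        # Configured ephemeral port range
                        cat /proc/sys/net/ipv4/ip_local_port_range

                        # Sockets currently holding ports
                        ss -tan state established | wc -l
                        ss -tan state time-wait | wc -l
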

                        • Markku
                          Senior Member
                          Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                          • Sep 2018
                          • 1781

                          #13
                          Originally posted by Semicolon
                          It came back for me, and lasted for nearly a week. Then it self-resolved three days ago.
                          It just came back an hour ago.

                          I can confirm that it is not a TCP ephemeral port exhaustion issue or some other TCP resource issue.
                          How do the TCP sessions look in a packet capture?

                          Markku
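
                          A simple capture along those lines, assuming the default trapper port 10051; the resulting file can be opened in Wireshark afterwards:

                          Code:
                          # Capture Zabbix traffic on the proxy into a file for later analysis
                          tcpdump -i any -w /tmp/zabbix_trapper.pcap 'tcp port 10051'
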

                          • Semicolon
                            Junior Member
                            • Sep 2025
                            • 3

                            #14
                            Originally posted by Markku

                            How do the TCP sessions look in a packet capture?

                            Markku
                            It repaired itself this morning; I will capture the next time it starts to fail.
