Zabbix server 6.0.40 Recv-Q is full tcp 10051

  • AspenKle
    Junior Member
    • Feb 2024
    • 12

    #1

    Zabbix server 6.0.40 Recv-Q is full tcp 10051

    Any tips on this are much appreciated.
    Environment:

    Zabbix server:
    Zabbix 6.0.40. © 2001–2025, Zabbix SIA
    Standard E4as v4 (4 vcpus, 32 GiB memory)
    OS was upgraded from Linux (Ubuntu 20.04)
    to Linux (Ubuntu 24.04) in January 2025.

    Database server:
    Azure Database for MySQL Flexible Server
    General Purpose, D4ds_v4, 4 vCores, 16 GiB RAM, 100 GiB storage, 600 IOPS

    DB parameters:
    innodb_io_capacity=600
    innodb_io_capacity_max=4000

    Zabbix agents run on Windows (Windows Server 2019 Datacenter) on the monitored hosts.
    Some agents are still on
    zabbix_agent2-6.0.26-windows-amd64-openssl.msi
    while others have been updated to
    zabbix_agent2-6.0.40-windows-amd64-openssl.msi

    Monitored hosts: 68
    Required server performance, new values per second: 56.16

    We have checked the diagnostics:
    zabbix_server -c /etc/zabbix/zabbix_server.conf -R diaginfo=valuecache

    Analysis of the diaginfo=valuecache output:
    1. Cache Usage
    Total Size: 266,518,696 bytes (~254MB)
    Used: 1,689,944 bytes (~1.6MB)
    Free: 266,518,696 bytes (~254MB)
    Utilization: ~0.63% (extremely low)
    2. Items vs. Values
    Items: 3,295
    Values: 57,279
    Ratio: ~17 values per item (normal for active monitoring)
    3. Performance
    Time: 0.000965s (very fast response)
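    For reference, the runtime diagnostics are written into the server log rather than to stdout, so we read them like this (a minimal sketch; the log path is the Ubuntu package default and may differ):
    Code:
    # Ask the running server to dump value cache diagnostics (written to the server log)
    sudo zabbix_server -c /etc/zabbix/zabbix_server.conf -R diaginfo=valuecache
    # Read the freshly written section from the log (path is an assumption)
    sudo tail -n 60 /var/log/zabbix/zabbix_server.log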


    We are seeing an issue like this:

    ss -ltn output:

    Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
    tcp 0 0 127.0.0.1:12563 0.0.0.0:* LISTEN -
    tcp 4097 4096 0.0.0.0:10051 0.0.0.0:* LISTEN -
    tcp 0 0 0.0.0.0:10050 0.0.0.0:* LISTEN -
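    To catch when the queue starts building, we also log the 10051 listen queue over time (a minimal sketch; the interval and log path are arbitrary choices):
    Code:
    # Append a timestamped snapshot of the 10051 listen socket every 10 seconds
    while true; do
        echo "$(date '+%F %T') $(ss -ltn '( sport = :10051 )' | tail -n +2)" >> /tmp/recvq-10051.log
        sleep 10
    done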

    [ZBX-7933] zabbix generate TCP queue overflow - ZABBIX SUPPORT
    From time to time Zabbix generates a TCP queue overflow.
    Then no traffic from/to Zabbix is possible any more; only a Zabbix restart helps here.

    Example agent logs
    2025/06/11 00:49:27.024499 [101] cannot receive data from [ZABBIX-IP:10051]: Cannot read message: 'read tcp HOST-NAME:64868->ZABBIX-IP21:10051: i/o timeout'
    2025/06/11 00:49:27.024500 [101] active check configuration update from host [HOST-NAME] started to fail

    Example zabbix server log
    sudo tail -f zabbix_server.log
    # 1357:20240130:133326.485 failed to accept an incoming connection: connection rejected, getpeername() failed: [107] Transport endpoint is not connected.

    Zabbix trapper process utilization is 0 when it happens; the trappers have nothing to do.
    It would almost be nicer to see "Zabbix trapper processes more than 75% busy", but utilization stays at 0.
    We do not see any Zabbix alerts when Recv-Q is full on tcp 10051 on the Zabbix server, and it is instantly fixed if we stop/start the zabbix-server service.

    ### Option: StartTrappers
    StartTrappers=20
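    While waiting for it to happen again, this is a rough sketch of how we compare the connection pile-up against how busy the trappers really are (the internal item key is what we graph in the frontend):
    Code:
    # Connections currently established against the trapper port
    ss -Htn state established '( sport = :10051 )' | wc -l
    # Trapper busy % as the server itself reports it (Zabbix internal item):
    #   zabbix[process,trapper,avg,busy]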

    Hoping someone can share some tips here; Zabbix 6.0 is LTS, so hoping we don't need to upgrade yet.
    Last edited by AspenKle; 13-06-2025, 20:50.
  • BradKnowles
    Junior Member
    • May 2025
    • 24

    #2
    I know we have some systems where `net.netfilter.nf_conntrack_max` has to be set to the maximum allowed value (524288 or larger), otherwise we experience some problems.

    I don't know if this helps you, but at least it's something you can look at.
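    A rough way to check whether conntrack is actually the limiting factor (just a sketch; your numbers will differ):
    Code:
    # Current entries vs. the configured ceiling
    cat /proc/sys/net/netfilter/nf_conntrack_count
    cat /proc/sys/net/netfilter/nf_conntrack_max
    # If the table ever filled up, the kernel will have logged it
    sudo dmesg | grep -i 'nf_conntrack: table full'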


    • AspenKle
      Junior Member
      • Feb 2024
      • 12

      #3
      Hi BradKnowles, currently we have:
      Code:
      cat /proc/sys/net/netfilter/nf_conntrack_max
      262144


      • BradKnowles
        Junior Member
        • May 2025
        • 24

        #4
        Originally posted by AspenKle
        Hi BradKnowles, currently we have:
        Code:
        cat /proc/sys/net/netfilter/nf_conntrack_max
        262144
        Have you tried boosting that value to 524288 and seeing if the problem persists?


        • AspenKle
          Junior Member
          • Feb 2024
          • 12

          #5
          Hi BradKnowles, I will check a bit more and try to set that parameter until the next boot, i.e. set it temporarily.
          Could you explain a bit more what you guys are seeing in your systems?
          After asking AI and Google I always end up with "high Recv-Q/Send-Q, you should focus on application and network tuning, not conntrack settings."
          If you look at the attached picture, this is what we see when Recv-Q is piling up: a sudden high inbound flow to the Zabbix server.

          In general:
          The housekeeper is always fast:
          housekeeper [deleted 236436 hist/trends, 0 items/triggers, 28 events, 12 problems, 62 sessions, 0 alarms, 0 audit, 0 autoreg_host, 0 records in 95.029074 sec, idle for 1 hour(s)]
          We do not see any Zabbix alerts over 75% busy or similar when the high Recv-Q sets in.
          Regards
          Attached Files
          Last edited by AspenKle; 17-06-2025, 09:23.


          • AspenKle
            Junior Member
            • Feb 2024
            • 12

            #6
            BradKnowles, I have just changed it temporarily:
            sudo sysctl -w net.netfilter.nf_conntrack_max=524288
            cat /proc/sys/net/netfilter/nf_conntrack_max
            524288
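            If it turns out to help, a minimal sketch for making it survive a reboot (the file name under /etc/sysctl.d is my own choice):
            Code:
            # Persist the setting across reboots
            echo 'net.netfilter.nf_conntrack_max = 524288' | sudo tee /etc/sysctl.d/99-conntrack.conf
            sudo sysctl --system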


            • AspenKle
              Junior Member
              • Feb 2024
              • 12

              #7
              BradKnowles
              We just got the same result again:
              Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
              tcp 0 0 127.0.0.1:12563 0.0.0.0:* LISTEN -
              tcp 4097 4096 0.0.0.0:10051 0.0.0.0:* LISTEN -
              tcp 0 0 0.0.0.0:10050 0.0.0.0:* LISTEN -

              and after a restart with sudo service zabbix-server restart the queue went down.
              Zabbix must be chewing on some data that takes way too long to process, but the funny thing is that we see no alerts over 75% busy for any Zabbix utilization item (trapper, poller, history syncer, unreachable poller, etc. processes).
              The thing we do see is that "Utilization of trapper data collector processes, in %" drops to 0 when Recv-Q goes over 2k.

              It is still:
              cat /proc/sys/net/netfilter/nf_conntrack_max
              524288
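              Since the trappers sit idle while Recv-Q is stuck at the backlog size, it looks like the accept queue itself is overflowing, so we are also watching the kernel-side counters (a sketch; the counters are cumulative since boot):
              Code:
              # Cumulative listen-queue overflow/drop counters
              nstat -az TcpExtListenOverflows TcpExtListenDrops
              # System-wide cap on the listen backlog
              cat /proc/sys/net/core/somaxconn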
              Last edited by AspenKle; 18-06-2025, 14:10.


              • AspenKle
                Junior Member
                • Feb 2024
                • 12

                #8
                This is very frustrating now, hm...
                Last edited by AspenKle; 17-06-2025, 17:42.


                • AspenKle
                  Junior Member
                  • Feb 2024
                  • 12

                  #9
                  It seems more stable today, or since last night. Did some changes to the agent configs. Will update here in 48 h with the next status on Recv-Q and the changes made, if this was the fix.


                  • Markku
                    Senior Member
                    Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                    • Sep 2018
                    • 1781

                    #10
                    Originally posted by AspenKle
                    The housekeeper is always fast:
                    housekeeper [deleted 236436 hist/trends, 0 items/triggers, 28 events, 12 problems, 62 sessions, 0 alarms, 0 audit, 0 autoreg_host, 0 records in 95.029074 sec, idle for 1 hour(s)]
                    Regards
                    I don't know if it is related to your original issue, but to me the housekeeper/database performance looks quite bad: 95 seconds spent on cleaning the database every hour. I would partition the database to take that load off the database housekeeping.

                    "62 sessions" also hints that maybe you are using some API connections that you don't log out from. (Again, probably not related to your issue, but a general observation.)

                    Markku


                    • Markku
                      Senior Member
                      Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
                      • Sep 2018
                      • 1781

                      #11
                      # 1357:20240130:133326.485 failed to accept an incoming connection: connection rejected, getpeername() failed: [107] Transport endpoint is not connected.
                      I'm a bit puzzled about this: why are you showing a server log from 1.5 years ago? Are the current server-side logs exactly the same now in 2025?

                      Markku



                      • AspenKle
                        AspenKle commented
                        Yes, the log statement we see now has the same content as the one we got 1.5 years back, so I just used that.
                    • AspenKle
                      Junior Member
                      • Feb 2024
                      • 12

                      #12
                      Anyway, Markku and BradKnowles, thanks for the information.
                      It turns out that this escalated as the number of hosts increased over the years.
                      Fix 1 (made it better): we use passive checks, but some agents had both passive and active configured, so we commented out ServerActive:
                      Code:
                      Server=ZABBIX-IP
                      # ServerActive=ZABBIX-IP
                      When ServerActive is configured, the agent asks for a new configuration every RefreshActiveChecks seconds; this causes TCP traffic on 10051 that we do not need. Recv-Q got lower.
                      But this was not the root cause.

                      Fix 2 (made it much better and seems to be the root cause):
                      Some trapping agents (now 3 agents x 68 servers) use https://www.nuget.org/packages/ZabbixSender.Async/1.2.0.
                      The call
                      Code:
                      await sender.Send("MonitoredHost1", "trapper.item1", "12");
                      was not handled correctly in the .NET agent, effectively spamming tcp 10051; it is now handled correctly by the developer, and Recv-Q is much lower.
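                      For anyone debugging something similar: the same trapper item can also be exercised from the command line with zabbix_sender, which made it easy to compare well-behaved traffic against what the library was producing (host and item names are the placeholders from above):
                      Code:
                      # CLI equivalent of the library call above: one connection, one value
                      zabbix_sender -z ZABBIX-IP -s "MonitoredHost1" -k trapper.item1 -o 12
                      # Or send many values in one connection from a file of "<host> <key> <value>" lines
                      zabbix_sender -z ZABBIX-IP -i values.txt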

                      Fix 3, or rather an observation: the environment is scanned frequently by a vulnerability scanner, and this also had an impact; it was the reason for the daily 01:00 problem.

                      Recv-Q is generally zero now.
                      Last edited by AspenKle; 23-06-2025, 09:36. Reason: Updated root cause

