Proxy connections unstable on large environments

  • karanpreetsingh1990
    Junior Member
    • Apr 2020
    • 16

    #1

    Proxy connections unstable on large environments

    Hello,

    I wanted to share something that has had me scratching my head for quite some time. The agents in our environment keep having trouble connecting to the proxy servers. We have 8700+ agents connecting, 200k+ items, and a required NVPS of over 1850.

    We are hosting on RHEL 7 and have 1 frontend server, 1 Zabbix server, 3 Zabbix proxies, and the MySQL databases hosted on separate servers.

    The specific error we receive is as follows:

    [369052]: 3244:20210316:224654.813 active check data upload to [zabbix-proxy-cluster.abc.xyz:10051] started to fail ([connect] cannot connect to [[zabbix-proxy-cluster.abc.xyz]:10051]: A connection timeout occurred.)

    Upon checking on the proxy server itself, it seems the issue is with the TCP connections. We've already increased the limits on the OS (net.core.somaxconn), but it looks like Zabbix isn't picking them up.


    Code:
    # ss -ntl '( sport = :10051 )'
    State    Recv-Q   Send-Q   Local Address:Port    Peer Address:Port
    LISTEN   129      128      10.x.x.x:10051        *:*
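
    For reference, this is roughly how we have been checking and raising the kernel backlog limit (the value below is just what we tried, adjust for your own load):

    Code:
    # current kernel ceiling for listen() backlogs
    sysctl net.core.somaxconn
    # raise it and make it persistent (value is illustrative)
    echo 'net.core.somaxconn = 4096' > /etc/sysctl.d/90-zabbix.conf
    sysctl --system
    # re-check what the listener actually got (Send-Q column = backlog in use)
    ss -ntl '( sport = :10051 )'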


    This is causing a lot of problems: new agents aren't able to connect and we get a flood of agent heartbeat alerts. All our agents use active checks.

    Please share if anyone has encountered or fixed this before. Any help would be much appreciated!

    Thanks,
    Karan
  • intrepidsilence
    Junior Member
    • Aug 2020
    • 3

    #2
    Hi,

    I had to create an override for the service:

    Code:
    cat /etc/systemd/system/zabbix-server.service.d/override.conf
    Code:
    [Service]
    LimitNOFILE=65536
    LimitSTACK=infinity
    LimitNPROC=16384
    TasksMax=8192
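
    In case it helps, a minimal way to apply a drop-in like this, assuming the service is managed by systemd:

    Code:
    systemctl edit zabbix-server        # creates/opens the override.conf drop-in
    systemctl daemon-reload
    systemctl restart zabbix-server
    # verify the limits the unit actually picked up
    systemctl show zabbix-server | grep -E 'LimitNOFILE|LimitNPROC|TasksMax'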

    Comment

    • karanpreetsingh1990
      Junior Member
      • Apr 2020
      • 16

      #3
      Originally posted by intrepidsilence
      Hi,

      I had to create an override for the service:

      Code:
      cat /etc/systemd/system/zabbix-server.service.d/override.conf
      Code:
      [Service]
      LimitNOFILE=65536
      LimitSTACK=infinity
      LimitNPROC=16384
      TasksMax=8192
      Could you explain a bit more about what this does? I was under the impression that this was more due to the accept performance of the Zabbix processes, but if this solves it, that'd be great!

      Comment

      • james.cook000@gmail.com
        Member
        • Apr 2018
        • 49

        #4
        Hi,

        We had a similar situation years ago and it was due to the number and frequency of connections.

        We had to allow the quicker reuse of tcp connections that were in the time-wait state.

        Check your netstat output and see what the majority of the connection states are.

        Here's a link where it lists some of the settings I'd be looking at...

        The TCP/IP parameters for tweaking a Linux-based machine for fast internet connections are located in /proc/sys/net/... (assuming 2.1+ kernel).
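
        As a rough starting point, something like this shows the state breakdown and the TIME-WAIT knobs (exact sysctl names and behaviour vary a bit between kernel versions):

        Code:
        # count connections by TCP state on the proxy
        ss -ant | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
        # allow reuse of sockets stuck in TIME-WAIT for new outgoing connections
        sysctl -w net.ipv4.tcp_tw_reuse=1
        # note: tcp_tw_recycle is unsafe behind NAT and was removed in kernel 4.12+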


        Cheers
        James

        Comment


        • james.cook000@gmail.com commented
          The settings I would be looking at are TCP_TW_RECYCLE and TCP_TW_REUSE (this one in particular, as this sorted our issue - agent connections are numerous but short)
      • karanpreetsingh1990
        Junior Member
        • Apr 2020
        • 16

        #5
        Originally posted by james.cook000@gmail.com
        Hi,

        We had a similar situation years ago and it was due to the number and frequency of connections.

        We had to allow the quicker reuse of tcp connections that were in the time-wait state.

        Check your netstat output and see what the majority of the connection states are.

        Here's a link where it lists some of the settings I'd be looking at...

        The TCP/IP parameters for tweaking a Linux-based machine for fast internet connections are located in /proc/sys/net/... (assuming 2.1+ kernel).


        Cheers
        James
        Hi James,

        I've looked at most of the TCP tweaks that can be done on the system and have most of them in place, but I'll give this another look and add anything I may have missed. Thanks for the details.

        Comment


      • karanpreetsingh1990
        Junior Member
        • Apr 2020
        • 16

        #6
        Originally posted by cyber
        Is there a specific reason you cannot add some more proxies? With that amount of hosts I would definitely add some.
        Is that NVPS per proxy or total? If per proxy, then it is way too much... I remember from somewhere that per proxy you should keep it at ~400...
        This is the total NVPS of the system, not per proxy. Adding new proxies isn't as easy in an enterprise environment, but I'm trying to get new servers to put the proxies on. That would solve most of the issues, but in the meantime, going by the sizing guidelines these servers seem sufficient.

        Comment

        • james.cook000@gmail.com
          Member
          • Apr 2018
          • 49

          #7
          A couple of other things...

          Are you running SELinux or a local firewall (iptables/firewalld)? Are they perhaps getting in the way?

          Are there any dropped packets or collisions on your interfaces and is the interface physical or virtual?

          Are there any messages in the proxy server's syslog when this occurs?

          Are there any messages in proxy log when this occurs?

          Are there enough proxy trappers, and how busy are they? They are the processes handling the client requests.

          What's the system performance like when this occurs, i.e. CPU, swap, interrupts, etc.?
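
          A quick sketch of commands covering most of those checks (RHEL 7 style; paths assume the default package install):

          Code:
          getenforce                            # SELinux mode
          firewall-cmd --state ; iptables -S    # any local firewall rules in play?
          ip -s link show                       # RX/TX errors and drops per interface
          vmstat 5 5                            # CPU, swap in/out, interrupts while it happens
          grep StartTrappers /etc/zabbix/zabbix_proxy.conf    # enough trappers configured?
          tail -n 50 /var/log/zabbix/zabbix_proxy.log         # proxy-side messages around the failure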

          Cheers
          Last edited by james.cook000@gmail.com; 20-05-2021, 00:07.

          Comment

          • cyber
            Senior Member
            Zabbix Certified Specialist, Zabbix Certified Professional
            • Dec 2006
            • 4806

            #8
            Is there a specific reason you cannot add some more proxies? With that amount of hosts I would definitely add some.
            Is that NVPS per proxy or total? If per proxy, then it is way too much... I remember from somewhere that per proxy you should keep it at ~400...

            Comment

            • intrepidsilence
              Junior Member
              • Aug 2020
              • 3

              #9
              This is an old post now and I feel terrible for missing your comments and questions. Did you ever try my suggestion? If you are starting Zabbix with systemd, this is the right way to increase those values for the service.

              Comment

              • karanpreetsingh1990
                Junior Member
                • Apr 2020
                • 16

                #10
                Originally posted by intrepidsilence
                This is an old post now and I feel terrible for missing your comments and questions. Did you ever try my suggestion? If you are starting Zabbix with systemd, this is the right way to increase those values for the service.
                This turned out to be an issue with the Zabbix server/proxy rather than with the settings on the OS side. A new parameter was introduced to set the listen backlog on the Zabbix server/proxy, and that resolved the issue. I did try the suggestions, but since the backlog was hard-coded, the OS-level settings were overridden and didn't have an impact.

                The Zabbix issue on this can be found here. ListenBacklog is now a parameter and can be set in the configuration file.
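
                For anyone who hits the same thing, the change boils down to one line in the proxy (and/or server) config plus a restart; the value below is just what we settled on, and the effective backlog is still capped by net.core.somaxconn:

                Code:
                # /etc/zabbix/zabbix_proxy.conf (the same parameter exists for zabbix_server.conf)
                ListenBacklog=4096
                # then restart and confirm: ss -ntl '( sport = :10051 )' should show the new Send-Q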

                Hope this helps.

                Comment
