Server hit the wall, I think - configuration steps to grow?

  • jjeff123
    Member
    • May 2022
    • 33

    #1

    Server hit the wall, I think - configuration steps to grow?


    We have on the order of 30 active proxies monitoring 725K items; nearly all are network switches doing SNMP polling. Our server is a single box running Postgres/TimescaleDB and Zabbix 7.0.5, handling 4700 VPS.
    I now have to add a hundred or so more switches to monitor, and I'm worried I'll push the box over the edge to where it'll never be able to keep up.
    My queue spikes every 2 minutes, from a low of 4K to a high of 15K. One of my proxies always seems to have data queued by 5 or 10 seconds. I'm unclear whether the server queue graph looks like this because that specific proxy is slow/broken, or because of the server itself.

    I need to migrate to newer hardware, but what configuration changes should I be making?

    A server restart takes on the order of 30 minutes, as the history sync takes a while and all the data from proxies has to be processed.

    The biggest performance issue I see is disk writes. Zabbix says disk utilization is low, but iotop shows the box writing 30-100 MB/s constantly.

    Throwing more CPU and memory at it doesn't seem to make any difference.
    I have 8 cores, and even during those stressful restarts my load doesn't get over 6. I have 48 GB of memory with 34 GB sitting in buffer/cache, so neither Zabbix nor Postgres is asking for more.



    [Screenshot: Disk load]

    [Screenshot: Server queue during restart]

    [Screenshot: Normal queue]
  • mrnobody
    Member
    • Oct 2024
    • 61

    #2
    Hi Jeff (this is the kind of thread I live for, heh).

    I've never seen an installation this huge: 4.7k VPS and almost a million hosts, wow.
    In my certification class I heard about Patroni for the first time (a clustering solution for PostgreSQL databases). I trust the experience of the teacher and others who said it's a good option, but I've never tested it myself. I would migrate the software first, to a more robust solution, wait until it's stable and available enough, and only after that migrate the hardware (given the 30 minutes it takes to restart the server, it looks like something is wrong).

    Good luck


    • jjeff123
      Member
      • May 2022
      • 33

      #3
      Originally posted by mrnobody
      Hi Jeff (this is the kind of thread I live for, heh).

      I've never seen an installation this huge: 4.7k VPS and almost a million hosts, wow.
      In my certification class I heard about Patroni for the first time (a clustering solution for PostgreSQL databases). I trust the experience of the teacher and others who said it's a good option, but I've never tested it myself. I would migrate the software first, to a more robust solution, wait until it's stable and available enough, and only after that migrate the hardware (given the 30 minutes it takes to restart the server, it looks like something is wrong).

      Good luck
      No, not millions of hosts. Around 1600 hosts, and 4700 VPS, not 4700K. But some of those hosts are stacks of six or eight 48-port switches; I've got a couple of hosts with 5000+ items, mostly because we used the default templates. I think we'll be going through an exercise to significantly reduce monitoring on edge switches.

      This project kind of grew organically, from more or less a lab/test bed to production. The hardware migration will be to much newer hardware, which should help a lot.


      • mrnobody
        Member
        • Oct 2024
        • 61

        #4
        Originally posted by jjeff123

        No, not millions of hosts. Around 1600 hosts, and 4700 VPS, not 4700K. But some of those hosts are stacks of six or eight 48-port switches; I've got a couple of hosts with 5000+ items, mostly because we used the default templates. I think we'll be going through an exercise to significantly reduce monitoring on edge switches.

        This project kind of grew organically, from more or less a lab/test bed to production. The hardware migration will be to much newer hardware, which should help a lot.
        Items*, not hosts; my beloved dyslexia.
        That's a shorter path: reduce the number of items, keeping only what is necessary. You can use mass update to do this right in the frontend.


        • guille.rodriguez
          Senior Member
          • Jun 2022
          • 114

          #5
          Maybe a good option is to disable items that you don't need. For example, on a switch, set the discovery rule to only add ports with admin status = 1 (active); if admin status = 0 (disabled), you don't need to monitor that port.

          Another option is to increase the monitoring interval. For example, if you increase the SNMP polling interval on a 48-port switch from 1 minute to 2 minutes, you halve the polling load for those items.
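          As a rough sketch of both ideas (the {#IFADMINSTATUS} macro is what the stock Zabbix interface-discovery templates expose; check which LLD macros your own templates actually provide):

          Discovery rule:   Network interfaces discovery
          Filter condition: {#IFADMINSTATUS} matches ^1$
          Item prototypes:  Update interval 1m -> 2m on non-critical ports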


          • packetdust
            Junior Member
            • Jan 2025
            • 1

            #6
            For what it’s worth, SNMP polling has always seemed to be a bottleneck in Zabbix for as long as I can remember. The release notes for version 7 mention a change to asynchronous SNMP polling, but I haven’t noticed any significant improvement in reducing the delays in proxies obtaining and delivering SNMP data to the Zabbix server.

            Here are some general considerations:
            1. PostgreSQL Tuning: How much tuning have you done on your PostgreSQL database? Proper optimization here can make a big difference (see the sketch after this list).
            2. Zabbix Server Configuration: Similarly, how much have you optimized the Zabbix server configuration? There are many settings that can support significant scaling, but any changes should be made incrementally and monitored closely to evaluate their impact.
            3. Proxies and Checks: It sounds like you’re running a lot of proxies. How many checks is the Zabbix server itself performing, versus those handled by proxies? In my experience with larger infrastructures, offloading everything to proxies (we use about 10) significantly reduced the server load and improved its responsiveness.
            4. Polling Frequencies: Reassess the polling intervals for your items. Are they set too frequently for certain use cases?
            5. Scope of Monitoring: Are you monitoring everything by default? Consider whether all the monitored items are necessary.
            6. Item History and Trends: Review your history and trends retention periods. For example, do you really need to keep 365 days of history for switch port utilization? Reducing retention for less critical data can alleviate storage and performance pressure.
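            On point 1, purely as an illustration of the kind of postgresql.conf parameters worth reviewing on a box like the one described (these are standard parameters, but every value below is a placeholder, not a recommendation):

            shared_buffers = 8GB                  # often sized around 25% of RAM on a dedicated DB host
            effective_cache_size = 24GB           # roughly what the OS page cache can hold
            max_wal_size = 8GB                    # larger values spread out checkpoint I/O
            checkpoint_completion_target = 0.9    # smooth checkpoint writes over the interval
            random_page_cost = 1.1                # SSD-class storage
            work_mem = 32MB                       # per-sort/hash memory; be careful with many connections

            Tools like PGTune can give you a starting point, but measure before and after each change.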


            • markfree
              Senior Member
              • Apr 2019
              • 868

              #7
              I would argue that this is not such a large environment, but it is just as relevant as any other.
              I handle some cases where each host can easily reach 11k+ items.
              So, the first thing I did when I started monitoring these types of devices was to remap all the relevant metrics and recreate the legacy template.

              Do your SNMP templates already use the newer SNMP walk method of data polling?
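              For anyone who hasn't seen it, the walk-based approach uses one master SNMP item per table and dependent items that pull individual values out of the result, roughly like this (the OIDs are just the standard IF-MIB octet counters, used here as an example):

              Master item key:              walk[1.3.6.1.2.1.2.2.1.10,1.3.6.1.2.1.2.2.1.16]
              Dependent item preprocessing: SNMP walk value -> 1.3.6.1.2.1.2.2.1.10.{#SNMPINDEX}

              One request fetches the whole table, instead of one get per port per metric.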

              Also, as guille.rodriguez pointed out, using default templates without any filter can lead to a bunch of unnecessary metrics filling up your DB.
              Usually, OotB templates provide some handy discovery filters, especially for switches and routers. If possible, configuring these filters can greatly reduce the load from hosts. Don't overlook overrides either.

              Adding more hardware is not always a solution to performance issues.
              It seems to me that your server and proxies may need some process and cache tuning.

              Organizing the environment for different roles may prove beneficial. For example, dedicating proxies to specific regions, data centers, device types, data collection types (passive or active), etc.

              Keep in mind that the Zabbix DB can be the main point of latency. Isolating and optimizing it is very important.
              Many DBMS provide load-balancing solutions...

              You can find some performance tuning tips in the forum.
              Last edited by markfree; 02-01-2025, 03:25.


              • Jason
                Senior Member
                • Nov 2007
                • 430

                #8
                As others have suggested, I'd start by looking long and hard at the templates and the setup on your hosts. I'd make sure bulk monitoring is enabled, as this makes a massive difference to proxy efficiency.
                Secondly, disable any items you don't need on the templates, and be quite ruthless about this. Unless you need an item for stats or reporting, disable it.
                For everything that's left, consider adding "discard unchanged with heartbeat" preprocessing, and set the heartbeat as long as you can, up to about a day. Anything over that and items will occasionally disappear from Latest data.
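                For anyone who hasn't used it, it's a per-item (or template-level) preprocessing step; roughly:

                Preprocessing step: Discard unchanged with heartbeat
                Parameter:          23h    # value is stored only when it changes, or at most once per heartbeat

                The step name and behaviour are the standard Zabbix ones; 23h is just an example that stays under the one-day limit mentioned above.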
                Split out your server functions and have a dedicated database server, Zabbix server and web frontend. Database-wise I've been really impressed with Postgres, especially when coupled with TimescaleDB. On each server, take time to tune for its specific function. The database will need as much RAM as you can throw at it to help with caching, along with fast disks; look at SSDs in RAID 10, possibly even a cluster if you can afford it.
                SNMP does seem to take up more resources on proxies than anything else, and especially when some large hosts go offline it can cause issues if it hasn't been carefully configured. I've yet to try 7 on our biggest setups, but I'm upgrading to it soon and looking forward to the improvements.


                • jjeff123
                  Member
                  • May 2022
                  • 33

                  #9
                  OP here.
                  I'm mostly doing the back end stuff, database and server setup. Templates are mostly other people's job.
                  But yes, we mostly have the default templates, which are gathering entirely too much data, and that needs to stop. I don't need stats on thousands of switch ports that have end user PCs attached.

                  I was cheap and just have 1 box for DB, web and server.
                  The original post was prior to us moving from a 9-year-old on-site server to Azure. Under Azure I've got better performance, which is great, but then I added enough hosts to push my VPS to 6200.

                  The biggest thing I did was DB tuning: I finally noticed today that I was getting WAL-forced checkpoints every 2-3 minutes. Changing max_wal_size from 4GB to 10GB eliminated that.
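                  For anyone checking for the same thing, a simple way to confirm checkpoints are being forced by WAL volume rather than by the timer is to enable checkpoint logging in postgresql.conf (a standard parameter; recent PostgreSQL versions log the trigger reason for each checkpoint):

                  log_checkpoints = on     # log lines then show "checkpoint starting: wal" (size-forced) vs "checkpoint starting: time"
                  max_wal_size = 10GB      # the change that stopped the size-forced checkpoints here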
                  Now my performance is reasonable, though I appreciate the tip about proxies.
                  I built a proxy image and deployed it, but discovered that a couple of my remote sites are large enough that I exceeded the configuration cache, and those sites always have thousands of items in the 5/10/30 second queue.
                  I thought that after fixing the configuration cache the queue issue would resolve itself, but no such luck. I've got 30K queued items from one proxy right now, even though that proxy is using only a tiny amount of CPU/memory.
                  I'll have to look at that next while I prod my folks to fix our templates.



                  • cyber
                    Senior Member
                    Zabbix Certified Specialist, Zabbix Certified Professional
                    • Dec 2006
                    • 4806

                    #10
                    How many pollers do you have on that proxy...? Default values will not work.


                    • jjeff123
                      Member
                      • May 2022
                      • 33

                      #11
                      For the proxy that's falling behind?

                      It has 1000 required VPS. Monitoring 94 devices, most of which are switch stacks, so on the order of 140K items.

                      Looking at the Zabbix proxy monitoring, there are no alerts on this proxy. I did bump up the CacheSize; originally it was at 128MB. But that was over a week ago and didn't really make any difference.
                      I'm also running an upgrade from 6.0, and the template is a 6.0 template without bulk SNMP queries.

                      CacheSize=256M
                      HistoryCacheSize=64M
                      HistoryIndexCacheSize=32M
                      ProxyMemoryBufferSize=128M
                      StartVMwareCollectors=1
                      VMwareFrequency=60
                      VMwarePerfFrequency=60
                      VMwareCacheSize=16M
                      ProxyOfflineBuffer=48
                      StartDiscoverers=3
                      StartPollers=10
                      StartSNMPPollers=12
                      StartPingers=5
                      StartPreprocessors=5

                      Last edited by jjeff123; 31-01-2025, 15:58.


                      • cyber
                        Senior Member
                        Zabbix Certified Specialist, Zabbix Certified Professional
                        • Dec 2006
                        • 4806

                        #12
                        OK... in v7 we have asynchronous pollers for SNMP, so those 12 might be fine. But I don't have a v7 with a big load, so I don't have a comparison. I have a network proxy with 240 hosts, 220k items and ~700 NVPS, so probably polling a bit less. You can look over the polling times there as well.

                        If you don't do any VMware monitoring through that proxy, you can always switch off the VMware collectors. Same with discoverers: if you're not doing network discoveries, don't start them up.
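                        In zabbix_proxy.conf terms that's just (both are standard proxy parameters; 0 means those processes are never started):

                        StartVMwareCollectors=0    # no VMware monitoring through this proxy
                        StartDiscoverers=0         # no network discovery rules run from this proxy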


                        • jjeff123
                          Member
                          • May 2022
                          • 33

                          #13
                          Responding to my own post so future people know how this worked out.

                          Spent considerable time tuning the database and Zabbix, mostly the database. Increasing max_wal_size, so that checkpoints were triggered by time instead of by WAL size, was a major factor.

                          The proxy with the high queue was just something goofy on that one box. Yes, it was my most heavily used proxy, but we fixed three things on it:
                          - apt-get update/upgrade: both the kernel and Zabbix to the latest 7.0.x release
                          - The NIC had both a static and a DHCP address on it, not sure how.
                          - Rebooted to fix the NIC issue and let the new kernel take effect.
                          After that all my queue problems on this one proxy went away.

                          The server in Azure worked great, much better than the on-site box. But eventually it also hit a bottleneck, and the issue was disk I/O.
                          I had built the server with the default standard SSD, assuming a modern Azure SSD would have far better performance than my on-site hardware. But the Azure standard SSD is limited to 500 IOPS and 100 MB/s.
                          And there's the issue: they drastically rate-limit the IOPS. A spinning disk will do 100-150 IOPS, but even a cheap, old SSD will be in the tens of thousands.
                          So the "standard SSD" in Azure should really be sold as a premium HDD.
                          Upgrading to premium SSD with 7500 IOPS has worked great.
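                          If anyone wants to sanity-check their own storage before and after a move like this, a quick random-write test with fio will expose that kind of IOPS cap (the path and sizes below are placeholders; 8k roughly matches the Postgres page size):

                          fio --name=dbdisk --filename=/var/lib/postgresql/fio.test --size=1G --rw=randwrite --bs=8k --iodepth=32 --ioengine=libaio --direct=1 --runtime=60 --time_based

                          Just remember to delete the test file afterwards.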

                          System has 6400 VPS and performance is fine.
