Real world, honest assessment

  • mwtzzz_2021
    Junior Member
    • Sep 2021
    • 8

    #1

    Real world, honest assessment

    Hi folks,

    Can I get some real-world feedback regarding scalability? Where I work, we're using a product that isn't working very well. It replaced our previous Nagios installation, which couldn't handle our scale, but the new product comes with its own problems, so we're considering replacing it.

    We're looking at Zabbix, but the main concern is whether it can scale. We've got 30,000 hosts with about 30 checks each, across 3 data centers.

    Has anyone had experience with Zabbix at this scale? In most of the threads I see, people are talking about a much smaller scale, only a few thousand hosts.
  • mwtzzz_2021
    Junior Member
    • Sep 2021
    • 8

    #2
    "Should" is different than "actual". Has anyone actually run Zabbix at this scale?

    • mwtzzz_2021
      Junior Member
      • Sep 2021
      • 8

      #3
      Thanks for those links, I will check them out.

      The frequency of our checks runs the gamut: some run once a minute, others every five minutes, others every 10 minutes, etc.

      • Spectator
        Member
        • Sep 2021
        • 71

        #4
        Hi everyone!

        First of all, sorry for my bad English.

        I have a question that is very similar to mwtzzz_2021's question.
        Our company has about 35k hosts.
        These hosts are mainly:
        - storage systems,
        - LTO tape drives,
        - servers (iLO, IPMI, ...),
        - Ethernet and SAN switches,
        - routers,
        - UPSes,
        - VMware and Hyper-V clusters,
        - operating systems such as Windows and Linux,
        - MySQL, PostgreSQL, and Oracle databases,
        - and so on.

        Each host has about 50 checks.
        Some checks run every minute, others every 5 minutes.
        History data must be kept for 3 months.
        Trend data must be kept for 3 years.

        Does anyone have practical experience monitoring a system of this size?

        What type of database is worth using for such a large system?
        MySQL? MariaDB? PostgreSQL? PostgreSQL time-series? Or something else?
        And approximately what size database can I expect for a system of this size?
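
As a rough way to approach the database size question, the sketch below estimates history and trend storage from the numbers in this post. It is only a back-of-the-envelope calculation under stated assumptions. The even split between 1-minute and 5-minute checks, the figure of about 90 bytes per stored row (a commonly quoted rule of thumb that varies with database engine, indexes, and compression), 90 days for "3 months", and every item being numeric (so that it has trends) are all assumptions, not figures from this thread.

```python
SECONDS_PER_DAY = 24 * 3600


def values_per_second(items_by_interval):
    """items_by_interval maps a check interval in seconds to an item count."""
    return sum(count / interval for interval, count in items_by_interval.items())


def history_bytes(nvps, days, bytes_per_value=90):
    # Every collected value becomes one history row (assumed ~90 bytes).
    return nvps * SECONDS_PER_DAY * days * bytes_per_value


def trends_bytes(numeric_items, days, bytes_per_row=90):
    # Trends hold one aggregated row per numeric item per hour (assumed ~90 bytes).
    return numeric_items * 24 * days * bytes_per_row


hosts, checks_per_host = 35_000, 50
items = hosts * checks_per_host

# Assumption: half of the items are checked every 60 s, half every 300 s.
mix = {60: items // 2, 300: items // 2}

nvps = values_per_second(mix)
history = history_bytes(nvps, days=90)       # "3 months" taken as 90 days
trends = trends_bytes(items, days=3 * 365)   # 3 years

print(f"items:   {items:,}")
print(f"NVPS:    {nvps:,.0f}")
print(f"history: {history / 1e12:.1f} TB")
print(f"trends:  {trends / 1e12:.1f} TB")
```

Under these assumptions the estimate comes out at roughly 12 TB of history and 4 TB of trends before any compression, so the retention period and the share of 1-minute checks dominate the result far more than the choice of database engine.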

        • Spectator
          Member
          • Sep 2021
          • 71

          #5
          Originally posted by cyber
          Our previously mentioned setup has PG +timescale and has ~1T DB... I guess compression would save some...14d history+365d trends.
          Thanks for your answer.

          How often do the checks in your system run (1 min, 5 min, ...)?
          Which version of Zabbix are you running?
          May I ask what hardware your Zabbix system runs on? How many servers and how many proxies? How much CPU, RAM, and disk?
          Are the Zabbix web front end and the Zabbix server running on the same host?
          Sorry for so many questions, but as you surely know, all of this information matters when designing such a large system. If you have any other ideas or advice, please share them with me.

          • mwtzzz_2021
            Junior Member
            • Sep 2021
            • 8

            #6
            Originally posted by Spectator
            Hi everyone!

            Some checks run every minute, others every 5 minutes.
            History data must be kept for 3 months.
            Trend data must be kept for 3 years.

            Does anyone have practical experience monitoring a system of this size?
            You typically cannot use a traditional monitoring system to store trend data. These systems normally only keep the last 10-30 check results. I don't know about Zabbix, though. But for example Icinga2 and Sensu only keep a very brief history.

            For longer periods, you must use metrics instead: Graphite, Wavefront, etc. Indeed, a compelling argument these days is to replace traditional alerting/monitoring with metrics. Gather everything using metrics, then process it.
            Metrics systems don't use a traditional database. Instead, they use their own storage for time-series data.
            My personal experience is with Graphite, which scales well. I had it processing on the order of a million metrics a minute at my last company.
            At my current company we use Wavefront on a larger scale.
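
For readers unfamiliar with the metrics approach, a check result becomes a single line of text in Graphite's plaintext protocol: metric path, value, and Unix timestamp. A minimal sketch follows. The metric path, the host name, and the helper names are made up for illustration, and port 2003 is assumed to be the Carbon plaintext listener.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one sample as '<path> <value> <unix timestamp>' plus a newline."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_lines(lines, host, port=2003):
    """Send already formatted lines to a Carbon plaintext listener (assumed port)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Hypothetical metric path; nothing is sent over the network here.
sample = graphite_line("dc1.web01.load.1min", 0.42, 1632500000)
print(sample, end="")
# To actually ship it: send_lines([sample], "graphite.example.com")
```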

            There are only a couple of reasons to keep using a traditional monitoring system (Zabbix, Icinga2) anymore:
            1. Running Nagios-style checks and custom scripts.
            2. Schedules, escalations, repeated attempts, etc.

            These two things are still not handled well by modern metrics-based replacement systems like Prometheus, Splunk, and Wavefront. Such systems have rudimentary alerting but don't offer the fine-grained control over alerting that the traditional platforms do. Nor can they run Nagios-style check scripts.

            This is the reason I am on this forum asking about Zabbix. We still need to support the Nagios-style checks, we need to do it at a scale similar to yours, and we need to do it for both our on-prem and our cloud hosts. Icinga2 can handle this, but it's not clear whether it's still under active development and support. Nagios cannot handle the scale.
            Last edited by mwtzzz_2021; 24-09-2021, 17:31.

            • mwtzzz_2021
              Junior Member
              • Sep 2021
              • 8

              #7
              Sorry, I was off by an order of magnitude in my last post about Graphite. We were piping 25 million metrics a minute through it.

              • mwtzzz_2021
                Junior Member
                • Sep 2021
                • 8

                #8
                Originally posted by cyber
                At the 2019 Summit there was at least one presentation about consolidating multiple Zabbix servers into one, resulting in over 1M items and 250k NVPS in some situations...
                https://assets.zabbix.com/files/zabb..._the_cloud.pdf
                This says Max Tested Processed Values /s 250k during a "burst". What is a "burst" and what does values mean?
                It also says Max Tested processed alerts up to 30k. What does this mean - 30k at one time? or 30k total defined alerts?
                It's not clear from this document what the size of their environment was.

                A bit of false advertising. This one says "will reach 65000+" devices in the future. It doesn't say how many they actually were running at the time they wrote the paper.

                So we're back to the original question: has anyone actually run Zabbix on 30,000+ devices in production?

                I'm beginning to suspect nobody has done it on this type of scale.

                • LenR
                  Senior Member
                  • Sep 2009
                  • 1005

                  #9
                  We have 8,250 hosts, 1,355,000 items, and 5,200 values per second. Almost all data is gathered by proxies. About 60% of hosts are network devices polled via SNMP; the rest are split between Linux and Windows, mostly with active agent items. The database is partitioned MySQL on a VM with good spinning disks. We tune MySQL with large buffers and huge pages to avoid reads, and we use partitioning to delete old history and trends so that we avoid housekeeping. We try to stay current on the LTS version, updating about twice a year. We run MySQL and the Zabbix server on the same VM, the console on another VM, and multiple proxies, some for load and some for access.

                  SSD would be faster, but we are avoiding physical hardware now.
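
The partition-based cleanup described above can be sketched roughly as follows. This is not LenR's actual setup, just an illustration. It assumes the history table is already range-partitioned by day on its integer `clock` column (Unix time) and has no catch-all MAXVALUE partition. The partition naming scheme, the retention period, and the helper names are hypothetical. The script only generates SQL text and does not connect to a database.

```python
from datetime import date, datetime, timedelta, timezone


def partition_name(day):
    # Hypothetical naming scheme: p2021_09_24
    return f"p{day:%Y_%m_%d}"


def add_partition_sql(table, day):
    """Partition for `day` holds rows with clock < midnight UTC of the next day."""
    upper = datetime.combine(
        day + timedelta(days=1), datetime.min.time(), tzinfo=timezone.utc
    )
    return (
        f"ALTER TABLE {table} ADD PARTITION "
        f"(PARTITION {partition_name(day)} "
        f"VALUES LESS THAN ({int(upper.timestamp())}));"
    )


def drop_partition_sql(table, day):
    return f"ALTER TABLE {table} DROP PARTITION {partition_name(day)};"


def maintenance_sql(table, today, keep_days, create_ahead=3):
    """Statements for one daily run: create upcoming partitions, drop one expired.

    Assumes the job runs every day, so only a single partition expires per run.
    A real job would also check which partitions already exist.
    """
    statements = [
        add_partition_sql(table, today + timedelta(days=n))
        for n in range(1, create_ahead + 1)
    ]
    statements.append(
        drop_partition_sql(table, today - timedelta(days=keep_days + 1))
    )
    return statements


for statement in maintenance_sql("history", date(2021, 9, 24), keep_days=90):
    print(statement)
```

Dropping a whole partition is far cheaper than deleting rows one by one, which is why this approach is usually paired with disabling the built-in housekeeper for history and trends.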

                  • logix88
                    Junior Member
                    • May 2015
                    • 7

                    #10
                    We have recently started using AWS Aurora RDS. In my experience, a large-scale Zabbix installation always starts out great, and issues appear after it has been in service for a while. A really performant DB is key. Don't do historical pruning from Zabbix; offload it to the DB using stored procedures. Use active items instead of passive ones. Obviously, use proxies and offload as much as possible onto them. I think if you follow the best practices, this shouldn't cause any trouble. With any large-scale deployment, though, get a Zabbix support contract! It's not too expensive, and if things go wrong, they can help like no one else can. I have personally had some critical issues resolved that way.

                    • mwtzzz_2021
                      Junior Member
                      • Sep 2021
                      • 8

                      #11
                      Thanks for that information, guys. Very helpful. Good tips.
                      It still sounds like nobody has run it on a scale of 30K+ hosts. I'd still be interested in hearing from someone who has actual real-world experience running it in production.

                    • mwtzzz_2021
                      Junior Member
                      • Sep 2021
                      • 8

                      #12
                      Another question for you all: is Zabbix a good option for a public cloud environment with auto-scaling, immutable images, etc.?
                      The dynamic nature of auto-scaling and short-lived instances always brings up the issue of programmatic cleanup and deregistration of hosts. Is this something Zabbix can handle gracefully?

                      • gofree
                        gofree commented
                        Depends on how dynamic we're talking.

                        You're asking too many broad questions; you need to be more specific. But the truth at the moment is that Zabbix does not scale horizontally very well.
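
On the deregistration question, one possible approach is to have an instance's termination hook call the Zabbix JSON-RPC API. The sketch below builds `host.get` and `host.delete` requests and includes a small sender that is not invoked here. The URL, the token handling, and the function names are placeholders. How the token is passed depends on the Zabbix version (an `auth` field in the request body in older releases, an `Authorization: Bearer` header in newer ones), so check it against the API documentation for your release.

```python
import json
import urllib.request

# Placeholder URL; substitute your own frontend address.
API_URL = "https://zabbix.example.com/api_jsonrpc.php"


def rpc_payload(method, params, request_id=1, auth=None):
    """Build a JSON-RPC 2.0 request body for the Zabbix API."""
    payload = {"jsonrpc": "2.0", "method": method, "params": params, "id": request_id}
    if auth is not None:
        # Older releases take the token in the body; newer ones prefer a header.
        payload["auth"] = auth
    return payload


def find_host_payload(hostname, auth=None):
    """Look up the numeric host ID by the host's technical name."""
    params = {"output": ["hostid"], "filter": {"host": [hostname]}}
    return rpc_payload("host.get", params, auth=auth)


def delete_hosts_payload(host_ids, auth=None):
    """host.delete takes a plain array of host IDs as its params."""
    return rpc_payload("host.delete", list(host_ids), auth=auth)


def call(payload, url=API_URL):
    """Send one request and return its result, raising on an API error."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json-rpc"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        body = json.load(response)
    if "error" in body:
        raise RuntimeError(body["error"])
    return body["result"]


def deregister(hostname, auth):
    """Delete the host if it exists; safe to call for unknown hosts."""
    hosts = call(find_host_payload(hostname, auth))
    if hosts:
        call(delete_hosts_payload([h["hostid"] for h in hosts], auth))


# Show the request body only; no network traffic happens here.
print(json.dumps(delete_hosts_payload(["10084"]), indent=2))
```

Whether this counts as "graceful" depends on how often instances churn: each termination costs two API calls, so very short-lived instances may be better served by batching deletions.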
                    • tobankeisha
                      Junior Member
                      • Dec 2021
                      • 2

                      #13
                      not sure if anyone has run Zabbix on this scale

                      • cyber
                        Senior Member
                        Zabbix Certified Specialist, Zabbix Certified Professional
                        • Dec 2006
                        • 4806

                        #14
                        These numbers should not be an issue....

                        • cyber
                          Senior Member
                          Zabbix Certified Specialist, Zabbix Certified Professional
                          • Dec 2006
                          • 4806

                          #15
                          Well, I have ~10k hosts and an average of 50 items per host, across several DCs. I would not hesitate to double the number of hosts; it just needs some extra proxies.
                          The number of hosts and items is one side, but how often you check is another. With the same number of items you get very different NVPS depending on whether your check interval is 1m or 10m, and the workload on your servers and proxies differs accordingly.
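
To make the interval point concrete, here is a minimal sketch of the arithmetic, using the 30,000 hosts and 30 checks per host from the first post. It assumes every item is collected at one uniform interval, which real installations are not, so the numbers only bracket the range. The last line applies the same relation in reverse to the figures LenR reported in this thread.

```python
def nvps(items, interval_seconds):
    """New values per second if every item is collected at the same interval."""
    return items / interval_seconds


def average_interval(items, values_per_second):
    """Average collection interval implied by an item count and an NVPS figure."""
    return items / values_per_second


hosts = 30_000
checks_per_host = 30
items = hosts * checks_per_host  # 900,000 items

for label, interval in (("1m", 60), ("5m", 300), ("10m", 600)):
    print(f"all checks every {label}: {nvps(items, interval):,.0f} values/s")

# LenR's installation: 1,355,000 items at 5,200 values per second.
print(f"implied average interval: {average_interval(1_355_000, 5200):.0f} s")
```

The same 900,000 items produce a tenfold difference in load between the 1-minute and 10-minute cases, which is why host counts alone say little about whether a deployment will cope.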
                          At the 2019 Summit there was at least one presentation about consolidating multiple Zabbix servers into one, resulting in over 1M items and 250k NVPS in some situations...

                          Another one mentions 65k+ devices.
