Zabbix server maximum number of NVPS

  • Amiram
    Member
    • Feb 2021
    • 59

    #1

    Zabbix server maximum number of NVPS

    Hey,

    I'm playing with my Zabbix server (5.2 - MySQL and Nginx) in order to scrape some Prometheus data.
    I have around 2000 servers with ~3000 items per server, which sums up to ~6,000,000 items across all of the servers.

    I ran the test on 10% of the servers (200) and ended up with 400K NVPS.
    The server starts to hiccup a bit on the GUI side:
    • It takes time to load Latest data
    • It gets stuck loading graphs
    • etc.
    How much can the server take?
    How can I "improve" the GUI so it won't get stuck or hiccup at all?

    P.S
    The current hardware I'm running on is:
    36 cores, 384 GB RAM, and a 4 TB SSD.
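
    As a rough sanity check on these load figures, here is a back-of-the-envelope NVPS estimate from the item count and a single shared interval (a sketch only; the 1-minute interval is the one mentioned later in the thread, and real NVPS depends on per-item intervals and on how many values each Prometheus scrape actually produces):

```python
# Back-of-the-envelope NVPS estimate: total items divided by a uniform interval.
# Assumption: every item is collected once per interval; real setups mix intervals.

def estimated_nvps(servers: int, items_per_server: int, interval_seconds: int) -> float:
    total_items = servers * items_per_server
    return total_items / interval_seconds

# Figures from this thread: ~2000 servers x ~3000 items, 1-minute interval.
print(estimated_nvps(2000, 3000, 60))   # full environment -> 100000.0
print(estimated_nvps(200, 3000, 60))    # the 10% test slice -> 10000.0
```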
    Last edited by Amiram; 06-05-2021, 15:03.
  • Amiram
    Member
    • Feb 2021
    • 59

    #2
    Yes, all the components are running on the same server.

    The GUI gets stuck when I try to query a lot of data, like showing the CPU load status of 200 servers for the last day.
    It can show the last hour, or maybe the last three, but not more.

    P.S. 1
    When I try to use the Python API to do it I get an exception.

    P.S. 2
    The problem is the same when I use the Grafana Zabbix plugin.
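
    Regarding the Python API exception in P.S. 1: one common workaround is to pull history in smaller time windows instead of a full day in one call. A minimal sketch, assuming the pyzabbix library and numeric (float) items; the URL, credentials, and item IDs are placeholders:

```python
from datetime import datetime, timedelta
from pyzabbix import ZabbixAPI  # assumption: the pyzabbix client library

zapi = ZabbixAPI("http://zabbix.example.com")   # placeholder URL
zapi.login("api_user", "api_password")          # placeholder credentials

itemids = ["12345", "12346"]                    # placeholder item IDs (CPU load items)
end = datetime.now()
start = end - timedelta(days=1)

values = []
window = timedelta(hours=1)                     # pull one hour at a time
t = start
while t < end:
    chunk = zapi.history.get(
        itemids=itemids,
        history=0,                              # 0 = numeric float
        time_from=int(t.timestamp()),
        time_till=int(min(t + window, end).timestamp()),
        sortfield="clock",
        output="extend",
    )
    values.extend(chunk)
    t += window

print(f"fetched {len(values)} values")
```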


    • gofree
      Senior Member
      Zabbix Certified Specialist, Zabbix Certified Professional
      • Dec 2017
      • 400

      #3
      I'm asking myself what the need for this design is; otherwise I think it's a bad idea. First, you never really need every single metric from every host. Second, why use Zabbix for what is clearly a Prometheus environment? Third, it will kill your DB. What's your interval, your intention, your DB setup... why do it?


      • Amiram
        Member
        • Feb 2021
        • 59

        #4
        The environment is not a Prometheus one; there are Prometheus exporters running alongside it, like node_exporter or cAdvisor.

        Why?
        node_exporter and Zabbix monitor a lot of identical metrics, so in order to avoid monitoring the same things twice I'm setting the Zabbix server as Grafana's data source for the node_exporter metrics.
        I need Zabbix for a lot of other advantages anyway.

        The interval is 1m and the database is MySQL, running on the same server as the frontend and the Zabbix server.

        So, why do it?
        I'm testing the idea of Zabbix being the data source for all the data (including the Prometheus data) because:
        • The data will be gathered only once.
        • There won't be two different data sources (and two different setups to gather the data).
        • I might want to use some of that data to create triggers.
        I'm trying to test this concept; it would simplify my environment, as I have around 3K servers in 2 datacenters (1.5K each).

        A valid result of the test might be: the Zabbix setup (server/database/frontend) will not be able to handle it.
        Another valid answer can be: you need more hardware resources (separate servers, faster storage devices, more CPU power, more RAM, etc.).


        P.S. 1
        My hardware resources are not the problem.


        P.S. 2
        I end up with 2M items for just 200 servers scraping their Prometheus data.
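
        For reference, the Zabbix-side mechanics of this are usually an HTTP agent "master" item that scrapes the Prometheus endpoint, plus dependent items with "Prometheus pattern" preprocessing, so one scrape feeds many items. A rough sketch via the API, assuming pyzabbix; the IDs, URL, and metric pattern are placeholders, and the exact preprocessing parameter layout differs between Zabbix versions, so check the item.create documentation for 5.2:

```python
from pyzabbix import ZabbixAPI  # assumption: the pyzabbix client library

zapi = ZabbixAPI("http://zabbix.example.com")   # placeholder URL
zapi.login("api_user", "api_password")          # placeholder credentials

HOSTID = "10501"        # placeholder host ID
INTERFACEID = "10201"   # placeholder interface ID

# Master item: HTTP agent (type 19) that scrapes the node_exporter endpoint
# every minute and keeps no history of the raw text itself.
master_id = zapi.item.create(
    hostid=HOSTID,
    interfaceid=INTERFACEID,
    name="node_exporter raw scrape",
    key_="node_exporter.scrape",
    type=19,                     # HTTP agent
    value_type=4,                # text
    url="http://{HOST.CONN}:9100/metrics",
    delay="1m",
    history="0",
)["itemids"][0]

# Dependent item (type 18): one metric extracted from the scrape with
# "Prometheus pattern" preprocessing (type 22). The "params" value shown here is
# only the pattern; your Zabbix version may expect extra newline-separated
# parameters (output selector / label name).
zapi.item.create(
    hostid=HOSTID,
    name="CPU seconds spent idle (cpu0)",
    key_="node_cpu_seconds_total.idle.cpu0",
    type=18,                     # dependent item
    master_itemid=master_id,
    value_type=0,                # numeric float
    delay="0",
    preprocessing=[{
        "type": "22",            # Prometheus pattern
        "params": 'node_cpu_seconds_total{cpu="0",mode="idle"}',
        "error_handler": "0",
        "error_handler_params": "",
    }],
)
```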


        • cyber
          Senior Member
          Zabbix Certified Specialist, Zabbix Certified Professional
          • Dec 2006
          • 4806

          #5
          That single host will never, ever cope with that amount of data... I am surprised you even get the data into the DB. 400K NVPS... that is an enormous amount.


          • cyber
            Senior Member
            Zabbix Certified Specialist, Zabbix Certified Professional
            • Dec 2006
            • 4806

            #6
            I can see the point, but IMHO it will not work. Zabbix is still monitoring software, meaning you gather the things you need to monitor your "thing"; you don't scrape every single metric you can find and then try to expose it all to a third party as well. With this kind of setup you first scrape all the Prometheus endpoints, do massive preprocessing, store everything, and then expose it to Grafana via a (let's say) questionable plugin, which queries your relational DB (which is probably already having a hard time accepting all this data). I think you would be better off running Grafana and its ecosystem for this amount of data (especially if you only need a small part of it for monitoring purposes). Whatever custom dashboard or report you might need will run better without Zabbix being the DB in the middle.


            • orbenet
              Junior Member
              • May 2022
              • 5

              #7
              Are you using default templates for everything without modification?

              I would suggest increasing the interval values for any templates you are using and starting off slow with data collection. Disable everything and then start enabling items one by one to figure out the ones you need.
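
              If you manage those intervals on the templates, the disable-then-re-enable pass can also be scripted through the API. A minimal sketch, assuming pyzabbix; the template name is a placeholder, and changes made on a template propagate to the linked hosts:

```python
from pyzabbix import ZabbixAPI  # assumption: the pyzabbix client library

zapi = ZabbixAPI("http://zabbix.example.com")   # placeholder URL
zapi.login("api_user", "api_password")

# Placeholder template whose items we want to start from "everything disabled".
tpl = zapi.template.get(filter={"host": "Template OS Linux by Zabbix agent"})[0]
items = zapi.item.get(templateids=tpl["templateid"], output=["itemid", "name"])

for item in items:
    # status 1 = disabled; re-enable the items you actually need afterwards,
    # and widen their intervals with item.update(itemid=..., delay="5m") as needed.
    zapi.item.update(itemid=item["itemid"], status=1)
```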

              To give you an example, my setup has ~1000 nodes (a mix of network, server, and IoT devices) running on a virtual machine with 8 GB of DDR4 and 2 cores. I have 3 proxies with 1 core and 1 GB of RAM each for my other sites.

              When I first deployed, I focused on just the critical items that I needed alerts for (like ICMP down, web pages returning 404/500 errors, RAID failures, TCP socket connection failures, DB services on servers not running, with an action to restart them), and I slowly added additional items for monitoring and some data collection for reporting.

              The critical stuff is set to collect every 10-20 seconds, but for some items I don't really need an INSTANT alert, so they may be set to 10-20 minute or even 24-hour intervals; the idea is to configure this to meet whatever internal requirements you have.

              Another thing to note is that you will need to play around with the values in the Zabbix server configuration file to get the optimal number of "workers" for collection.
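
              The "workers" in question are the poller, preprocessor, and DB syncer process counts in zabbix_server.conf, alongside the cache sizes. The values below are only illustrative placeholders, not sizing advice; raise whichever process type the "Zabbix internal process busy %" graphs show as saturated:

```
# zabbix_server.conf (illustrative values only)
StartPollers=100
StartPreprocessors=50        # Prometheus parsing is done by preprocessing workers
StartDBSyncers=8
CacheSize=4G
HistoryCacheSize=2G
HistoryIndexCacheSize=512M
ValueCacheSize=8G
Timeout=10
```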

              Lastly, you will have to figure out the data retention for each item and keep it to the bare minimum required.
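
              Retention can also be trimmed in bulk through the API, for example keeping only a short raw-history window and relying on trends for long-term graphs. A small sketch with placeholder values, assuming pyzabbix and template-level items:

```python
from pyzabbix import ZabbixAPI  # assumption: the pyzabbix client library

zapi = ZabbixAPI("http://zabbix.example.com")   # placeholder URL
zapi.login("api_user", "api_password")

tpl = zapi.template.get(filter={"host": "Template OS Linux by Zabbix agent"})[0]

# Placeholder retention: 7 days of raw history, 90 days of trends,
# applied only to numeric items (value_type 0 = float, 3 = unsigned).
numeric_items = zapi.item.get(templateids=tpl["templateid"],
                              filter={"value_type": [0, 3]},
                              output=["itemid"])
for item in numeric_items:
    zapi.item.update(itemid=item["itemid"], history="7d", trends="90d")
```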

              My end result is ~600 NVPS for ~100,000 items with ~60,000 triggers (only a few triggers send out emails; the rest we handle day to day using the reporting feature, to get an idea of current performance and future capacity needs).

              The gist of this is that Zabbix works pretty well out of the box but will require further tuning to fit your exact needs.

              You should really plan exactly what requirements you are trying to achieve with this solution, and then configure it to suit the requirements.


              • Colttt
                Senior Member
                Zabbix Certified Specialist
                • Mar 2009
                • 878

                #8
                Amiram, that sounds nice... but I would recommend PostgreSQL with the TimescaleDB plugin; it performs much better than MySQL. Also, a correct configuration of Postgres/Timescale is important. Which filesystem did you use? noatime? Maybe an M.2 (NVMe) drive instead of a SATA SSD? Did you use proxies (I would strongly recommend it)?
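
                For reference, the TimescaleDB route boils down to installing the extension and turning the history/trends tables into hypertables. Zabbix ships a timescaledb.sql script that does this; the SQL below is only a sketch of what that script does for one table (verify the details against the documentation for your version):

```sql
-- Illustrative only; use the timescaledb.sql shipped with the Zabbix server package.
CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;

-- Convert the existing history table into a hypertable with one-day chunks.
SELECT create_hypertable('history', 'clock',
                         chunk_time_interval => 86400,
                         migrate_data => true);
```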
                Debian-User

                Sorry for my bad english

