Zabbix for large environment setups

  • preeto
    Junior Member
    • Jan 2012
    • 8

    #1

    Zabbix for large environment setups

    Hi All,

    I am currently evaluating Zabbix for our environment. As it stands, we will be loading it with around 1500-2000 hosts from the start, and this number will increase to 5000-7000 hosts over the next couple of years.

    Currently the evaluation is taking place on the following hardware:

    Zabbix Server/ DB: HP ProLiant DL360 G5
    Proxies: Virtual Machines
    DB: Postgresql

    Zabbix is currently loaded with around 600 hosts at 318 nvps, and the DB size is around 90GB. I want to get some real-life setup examples to take to the guys here to convince them that Zabbix can scale to that many hosts. Is anyone out there using Zabbix with 5000+ hosts? What is the setup like in terms of hardware? Is the database set up with partitioning?
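As a rough back-of-the-envelope projection of the figures above, the sketch below scales the measured 318 nvps at 600 hosts up to the planned host counts. It assumes that load grows linearly with host count, which is an assumption and not something measured here; the function name `project_nvps` is made up for illustration.

```python
def project_nvps(current_hosts, current_nvps, target_hosts):
    """Linearly extrapolate new values per second (nvps) to a larger host count.

    Assumption: every host contributes the same average nvps as today.
    """
    if current_hosts <= 0:
        raise ValueError("current_hosts must be positive")
    return current_nvps * target_hosts / current_hosts


if __name__ == "__main__":
    for hosts in (2000, 5000, 7000):
        print(hosts, "hosts ->", project_nvps(600, 318, hosts), "nvps")
```

Under that assumption, 5000 hosts would need about 2650 nvps and 7000 hosts about 3710 nvps, so the database would have to absorb roughly ten times the current write rate.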

    Thanks,

    P
  • JoeChen
    Junior Member
    • Mar 2012
    • 1

    #2
    I do not have 5k+ hosts, but I have almost 4K hosts, 500K items and 1K values/sec.

    In my experience, the bottleneck is DB IO, not the proxies or the master.

    Our HW:
    master: Dell R610, with 32G
    DB: Dell R510, with 10 disks, using Oracle ASM to stripe IO.
    proxy: 3 virtual servers with 4 vCPUs and 8GB memory.

    The biggest problem for us is DB IO utilization.
    It is now continually 50% busy, sometimes 70%.

    In addition, due to the begin/end style SQL, the Oracle shared pool has problems.
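The item count and throughput in this post can be related through the ratio of items to new values per second, which gives the average update interval. The sketch below only illustrates that arithmetic; the function names are made up for illustration and are not part of Zabbix.

```python
def average_interval_seconds(items, nvps):
    """Average seconds between updates of one item: items / nvps."""
    if nvps <= 0:
        raise ValueError("nvps must be positive")
    return items / nvps


def required_nvps(items, interval_seconds):
    """nvps the database must absorb for a target average interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return items / interval_seconds


if __name__ == "__main__":
    # 500K items at 1K values/sec, as reported in this post.
    print(average_interval_seconds(500_000, 1_000))
    # The same item count with a 200-second average interval.
    print(required_nvps(500_000, 200))
```

With 500K items and 1K values/sec the average interval works out to 500 seconds. Tightening the same item count to a 200-second average would mean 2500 values/sec, which shows how directly update intervals drive database write load.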


    • preeto
      Junior Member
      • Jan 2012
      • 8

      #3
      Thanks for your reply Joe. We would be looking to scale higher than 4000 hosts.

      I suspected as much, that IO on the DB side would be an issue. Does anyone else have similar real-life examples, and what has been done to mitigate the IO issues?


      • justinc
        Junior Member
        • Aug 2011
        • 20

        #4
        Preeto,
        Yes, Zabbix can scale. I currently run a large production environment with ~9k hosts and still growing, spanning multiple data centers worldwide.

        As for your specs, you should be good. You are going to want as much memory in your DB server as possible, and you are going to need fast storage for your DB to keep up with your new values per second. Also, if you plan to leave your proxies online when you take down the server, you will wish you had some super fast storage: I see around 8k update statements per second when turning Zabbix Server back on, and this continues until it has caught up.
        The Zabbix server will not need the resources that the DB will.


        My Recommendations (in no particular order):
        1. Run your server and DB on separate systems. (I'm not sure whether you meant they are running on the same physical machine or just on similar HW.)
        2. Go ahead and partition your tables. (Partitioning was one of the largest performance improvements as I was scaling out. Without it, the housekeeper will be your bane.)
        3. Be realistic with your update intervals. Draw a line in the sand and tell yourself that your default update interval is at least 120 seconds. If you have no problem pushing it to 300 seconds, do so. (Overall I aim for a total average of ~200 seconds, calculated as # of items / nvps.)
        4. Keep an eye on your database performance. (You will most likely have to tune your configuration as you keep growing.)
        5. Plan out your templates. For most environments this is pretty straightforward, but if you are like me and your environment has crazy custom distributed applications on most of your systems, this may get a bit challenging, especially if you are adding dependencies to everything.
        6. I am not sure how many users you plan to give access to your instance, but restrict it. I made the mistake of leaving the guest account enabled with read access as we were growing. This was fine when nvps was only 1000. (More people means more queries, and a possibility of hitting the max connections set for your DB.)
        7. Add another DB server for replication. Importing a full DB backup is going to take a while, so the sooner you get this going the better: you want to avoid the large import and the wait for it to catch up. Run your backups on your slave as well. This will also be useful if you want to run reports on the data.
        8. You will need to increase pollers and cache as you scale. Use internal checks to keep a pulse on how your server is doing. (In the current version you are unable to use internal monitors to monitor % busy processes on the proxy, so you will have to keep an eye on your queue.)
        9. Start looking over the Zabbix API. It has saved me countless hours.
        10. Have a solid configuration management or package management system so you can deploy and manage your agent config, binaries, and scripts.
        11. Keep up with resolving any not supported items and unknown triggers. This can get away from you really quickly if you are doing large numbers of host adds.
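For recommendation 2, one common approach on MySQL is daily range partitions on the integer `clock` column of the history tables. The sketch below only generates the DDL text and is not necessarily the scheme used in this environment: the partition naming, the helper names, and the choice of UTC day boundaries are all assumptions, and the statement should be reviewed against your schema before running it.

```python
from datetime import date, datetime, timedelta, timezone


def daily_partitions(first_day, days):
    """Return (partition_name, upper_bound_epoch) pairs, one per day.

    Each upper bound is midnight UTC at the end of that day, for use in
    a VALUES LESS THAN clause. UTC boundaries are an assumption.
    """
    parts = []
    for offset in range(days):
        day = first_day + timedelta(days=offset)
        start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
        upper = start + timedelta(days=1)
        parts.append((day.strftime("p%Y%m%d"), int(upper.timestamp())))
    return parts


def partition_ddl(table, parts):
    """Build a MySQL ALTER TABLE ... PARTITION BY RANGE (clock) statement."""
    clauses = ",\n  ".join(
        f"PARTITION {name} VALUES LESS THAN ({bound})" for name, bound in parts
    )
    return f"ALTER TABLE {table} PARTITION BY RANGE (clock) (\n  {clauses}\n);"


if __name__ == "__main__":
    print(partition_ddl("history", daily_partitions(date(2012, 4, 19), 3)))
```

The point of partitioning in this context is that old data can be removed by dropping a whole partition instead of letting the housekeeper delete rows one by one.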

        Things that I do & don't:
        I don't use discovery.
        I don't use Zabbix Agent(Active Checks).
        I don't use host profiles.
        I try to avoid storing most string data in history. Of course I store some.
        I use a lot of external scripts that capture data, so I utilize Zabbix sender. (If you can collect a number of values at once, Zabbix sender is very efficient at doing bulk updates.)
        I have modified the frontend to show more than the last 20 issues on the dashboard.
        I have modified the frontend to reduce the time a trigger stays on the trigger status page after it has returned to an OK state.
        I removed Overview from the Monitoring tab. Too many times someone thought it would be a good idea to go to Overview and select ALL. With the refresh value on their frontend set to 1-2 min, it sat there making this large query over and over.
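To illustrate the bulk use of Zabbix sender mentioned above, the sketch below formats a batch of collected values as one `<hostname> <key> <value>` line each, which is the input-file layout `zabbix_sender` reads with `-i`. The host name and item keys are hypothetical, and the helper deliberately rejects whitespace in host names and keys rather than attempting any quoting.

```python
def sender_lines(host, values):
    """Return one '<host> <key> <value>' line per collected value.

    Assumption: host names and keys contain no whitespace, so no
    quoting is needed in the input file.
    """
    lines = []
    for key, value in values.items():
        for field in (host, key):
            if any(ch.isspace() for ch in field):
                raise ValueError(f"whitespace not supported in {field!r}")
        lines.append(f"{host} {key} {value}")
    return lines


if __name__ == "__main__":
    # Hypothetical host and keys, for illustration only.
    batch = sender_lines("web01", {"app.queue.length": 42, "app.errors": 3})
    print("\n".join(batch))
```

Written to a file, such a batch could be submitted in a single call along the lines of `zabbix_sender -z <server> -i <file>`, so that one connection carries many values.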


        Primary and Secondary DBs: 24cores, memory:96GB
        Zabbix Server: 12cores, memory:16GB
        Zabbix proxies: virtual - 2cores, memory:2GB


        • richlv
          Senior Member
          Zabbix Certified Trainer
          Zabbix Certified Specialist
          Zabbix Certified Professional
          • Oct 2005
          • 3112

          #5
          Originally posted by justinc
          The Zabbix server will not need the resources that the DB will.
          definitely (and it's great that it's scaling good enough for you).
          just to complete the nice set of information, what's the database you settled on for the central server and the proxies ?
          Zabbix 3.0 Network Monitoring book


          • justinc
            Junior Member
            • Aug 2011
            • 20

            #6
            Also, I use a PHP cache system for the frontend. If you are interested in this, you can look at APC or eAccelerator; both are easy to set up.

            Originally posted by richlv
            just to complete the nice set of information, what's the database you settled on for the central server and the proxies ?
            Primary DB: MySQL
            Proxy DBs: SQLite


            • gmcore
              Junior Member
              • Apr 2012
              • 2

              #7
              Hi,

              We have a Ubuntu Lucid LTS x86_64/Apache2/Mysql/Zabbix server (1.8.1) with 4k hosts on a Dell Poweredge r410/24 cores Xeon/32G ram.

              The server and frontend are pretty responsive except the Overview page which can take some minutes to refresh - spending time in Apache CPU utilization.

              I installed APC PHP opcode cache system but it didn't change anything on this specific page.

              Our current performance tunings are :

              Zabbix: CacheSize=512M
              Mysql: innodb_buffer_pool_size=1G

              Hope it can help, also hope for some help on the Overview page
              Last edited by gmcore; 19-04-2012, 10:35. Reason: added info


              • richlv
                Senior Member
                Zabbix Certified Trainer
                Zabbix Certified Specialist
                Zabbix Certified Professional
                • Oct 2005
                • 3112

                #8
                there are reports that enabling webserver compression (mod_deflate, for example) helps with overview page performance
                Zabbix 3.0 Network Monitoring book


                • gmcore
                  Junior Member
                  • Apr 2012
                  • 2

                  #9
                  Thanks for the hint, but mod_deflate is already enabled.

                  Anyway, the limiting factor seems to be the time Apache/PHP spends outputting the HTML for the page, not the transfer speed of the HTML over the connection.


                  • frankymryao
                    Member
                    • Oct 2011
                    • 52

                    #10
                    Originally posted by gmcore
                    Hi,

                    We have a Ubuntu Lucid LTS x86_64/Apache2/Mysql/Zabbix server (1.8.1) with 4k hosts on a Dell Poweredge r410/24 cores Xeon/32G ram.

                    The server and frontend are pretty responsive except the Overview page which can take some minutes to refresh - spending time in Apache CPU utilization.

                    I installed APC PHP opcode cache system but it didn't change anything on this specific page.

                    Our current performance tunings are :

                    Zabbix: CacheSize=512M
                    Mysql: innodb_buffer_pool_size=1G

                    Hope it can help, also hope for some help on the Overview page
                    Oops, I found my boss Joe upstairs...

                    For frontend performance, a workaround is to split it into several frontends and use FQDNs and DNS to locate them.

                    On our HW, Apache sometimes hangs and returns to normal after a restart.


                    • rsvancara
                      Member
                      • Jul 2012
                      • 42

                      #11
                      My Two Cents

                      We have zabbix and a Postgresql database installed on the same system:

                      96 GB of RAM
                      6 - 15K 300GB Disks
                      12 CPU cores

                      We are monitoring 3728 hosts with 200278 items. Outside of the database I/O issue, Zabbix scales fine. However, if you cannot overcome that issue, you will have missing data, which may or may not be a problem for you. For us it is a deal breaker, and I have done everything I can to tune PostgreSQL and Zabbix. The next steps are to add more disks to keep up with the I/O demands of Zabbix and the PostgreSQL database, or to migrate to a distributed setup, although I am skeptical about whether, or by how much, that would reduce the overall load on our Zabbix server. Like other people said, the database will be your biggest bottleneck. A good database server (48GB RAM, 12 cores) with a very fast disk system (20+ 10K 300GB drives in RAID 10) might be a good starting place, but it depends on what you monitor and on the retention period for your data.


                      • Colttt
                        Senior Member
                        Zabbix Certified Specialist
                        • Mar 2009
                        • 878

                        #12
                        Don't use hard disks!! Use SSDs! They are much faster, have very good IO performance, and are enterprise-ready!
                        Debian-User

                        Sorry for my bad english


                        • rsvancara
                          Member
                          • Jul 2012
                          • 42

                          #13
                          Ssd

                          SSDs are great... if you have the money to buy them. But the other consideration is space and retention time for your historical data.


                          • admin5795@gmail.com
                            Junior Member
                            • Aug 2012
                            • 7

                            #14
                            Why not use node mode to monitor your system?

                            Originally posted by preeto
                            Hi All,

                            I am currently evaluating Zabbix for our environment, as it stands we will be loading it with around 1500 - 2000 hosts from the start. This number will increase to between 5000 - 7000 hosts over the next couple of years.

                            Currently the evaluation is taking place on the following hardware:

                            Zabbix Server/ DB: HP ProLiant DL360 G5
                            Proxies: Virtual Machines
                            DB: Postgresql

                            Zabbix is currently loaded with around 600 hosts with 318 nvps and the DB size is around 90GB. I want to get some real life setup examples to take to the guys here to convince them that Zabbix can scale to that many number of hosts. Is there anyone out there using Zabbix with 5000+ hosts? What is the setup like in terms of hardware used? Is the database setup with partitioning?

                            Thanks,

                            P
                            As the title


                            • frankymryao
                              Member
                              • Oct 2011
                              • 52

                              #15
                              Originally posted by admin5795@gmail.com
                              As the title
                              In my experience in China, 'node-master' has a lot of weird problems. The most important one: I add a host to a node and it disappears! This happens very often.

                              In my opinion, 'node-master' is a plain architecture: the master receives the configuration and syncs it to the nodes. Because China has the worst network infrastructure in the world, the communication between master and node loses some information, so some host info gets lost. The result is that random hosts disappear.

                              We use proxies successfully: one server and five proxies. After some performance patching of the Zabbix code, Zabbix runs well with 600,000+ items and 150,000+ triggers. I think the 'proxy-server' architecture is more flexible and easier to manage. Most importantly, 'proxy-server' scales out very easily.

                              Last week I scaled out our Zabbix setup from 3 proxies to 5 online proxies plus 2 backup proxies. For each proxy, it took only 5 minutes to install the DB and the proxy. After that, I used Puppet to sync the agent configuration file (zabbix_agent.conf). It was very easy for me.

                              All in all, I think 'proxy-server' is better than 'node-master'. It's lighter.

