20k+ server environment, thoughts?

  • lukes87
    Junior Member
    • Mar 2014
    • 4

    #1

    20k+ server environment, thoughts?

    Hello,

    I'm exploring possible options to monitor ~20k servers with the potential to scale even larger.

    I would like to know the community's thoughts on an environment this large and what kind of architecture would be best suited to it.

    The requirements are to have a team looking at ideally one dashboard/frontend to take action on any alerts that will arise.

    I'm looking at the distributed proxy setup. Would deploying multiple proxies be enough for this, or would the master eventually run into issues polling data?
  • timbo
    Member
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Sep 2013
    • 50

    #2
    Hi there,

    While I have no experience with an enterprise environment of that size, I have two suggestions:
    1) Get support: http://www.zabbix.com/support.php or http://www.zabbix.com/integration_services.php
    2) Read this: http://www.packtpub.com/monitor-larg...ng-zabbix/book

    I have the ebook (it was published only a few months ago), and it runs through how to set up an enterprise environment and plan for future capacity.

    Good luck!

    -Timbo


    • Jason
      Senior Member
      • Nov 2007
      • 430

      #3
      I'd have thought that with a setup of that size you'd need lots of proxies rather than agents talking directly to the server. The exact number of proxies will depend on the spec of the servers and on whether you use agents, SNMP, etc. Use active agents where possible, and consider increasing the default item refresh interval.

      The database will need to be on its own backend server, with the Zabbix frontend on a separate server.

      The main thing will be to pay close attention to what you monitor and how often you monitor it, i.e. don't monitor something unless you really need that info, and make the intervals as long as they can be, especially on items that don't normally change.
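
To put rough numbers on the point about item counts and intervals, here is a minimal sketch of the average load the server and database have to absorb. The figure of 100 items per host and the function name are assumptions for illustration only, not values from this thread.

```python
def new_values_per_second(hosts: int, items_per_host: int, interval_s: float) -> float:
    """Average number of collected values arriving per second."""
    return hosts * items_per_host / interval_s


hosts = 20_000          # environment size from the first post
items_per_host = 100    # assumption: adjust to your own templates

# Compare a few candidate update intervals (in seconds).
for interval in (30, 60, 300):
    nvps = new_values_per_second(hosts, items_per_host, interval)
    print(f"interval {interval:>3}s -> {nvps:,.0f} values/s")
```

Under these assumptions, moving from a 30 s to a 300 s interval cuts the incoming rate by a factor of ten, which is why trimming items and lengthening intervals matters more than any single hardware choice.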


      • Colttt
        Senior Member
        Zabbix Certified Specialist
        • Mar 2009
        • 878

        #4
        Use a separate DB server with SSDs and a lot of memory (min. 48 GB).
        On the Zabbix server you don't need SSDs; HDDs are OK, but you will need plenty of RAM.
        The same goes for the frontend. Use a separate server for each component!
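
As a rough illustration of why the database gets its own machine, the sketch below estimates on-disk size from retention and collection rate. The item count, retention periods, and the 90 bytes per row are assumptions; the real per-row cost depends on the database engine and item types.

```python
SECONDS_PER_DAY = 24 * 3600


def history_bytes(items: int, interval_s: float, days: int, bytes_per_value: int) -> float:
    """Raw history: one row per item per collection interval."""
    return days * (items / interval_s) * SECONDS_PER_DAY * bytes_per_value


def trends_bytes(items: int, days: int, bytes_per_trend: int) -> float:
    """Trends: one aggregated row per item per hour."""
    return days * (items / 3600) * SECONDS_PER_DAY * bytes_per_trend


items = 20_000 * 100      # assumption: 100 items on each of 20k hosts
bytes_per_row = 90        # assumption: engine dependent

hist = history_bytes(items, interval_s=60, days=7, bytes_per_value=bytes_per_row)
trend = trends_bytes(items, days=365, bytes_per_trend=bytes_per_row)
print(f"history (7 days):  {hist / 1e12:.2f} TB")
print(f"trends (365 days): {trend / 1e12:.2f} TB")
```

The point of the exercise is the order of magnitude rather than the exact figure: at this scale the history tables alone can run to terabytes, so storage and memory for the DB server need to be planned up front.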
        Debian-User

        Sorry for my bad english


        • timbo
          Member
          Zabbix Certified Specialist, Zabbix Certified Professional
          • Sep 2013
          • 50

          #5
          Oh, I also just remembered this from the official Zabbix blog:
          One of the questions for those of us that use Zabbix on a large scale is “Just how much data can Zabbix ingest before it blows up spectacularly?” Some of the work I’ve been doing lately revolves around that question. I have an extremely large environment (around 32000+ devices) that could potentially be monitored entirely […]


          It's a good read also.

          -Timbo


          • lukes87
            Junior Member
            • Mar 2014
            • 4

            #6
            Thanks for the replies everyone, they definitely help. So the general consensus is that, with the addition of multiple proxies, a single master can handle 20k+ servers without problems?


            • jtrice
              Junior Member
              • Apr 2014
              • 1

              #7
              Tuning the frontend

              With thousands of hosts I'm finding the GUI is becoming almost unusable. It appears to load the list of hosts for almost every management function (creating a user or user group, etc.). I have cranked up the PHP memory limit just to get it to work at all. I must be missing something because I don't see this problem mentioned in any of the other discussions. Any ideas?
              Thanks,
              Jim


              • kloczek
                Senior Member
                • Jun 2006
                • 1771

                #8
                Originally posted by jtrice
                With thousands of hosts I'm finding the GUI is becoming almost unusable. It appears to load the list of hosts for almost every management function (creating a user or user group, etc.). I have cranked up the PHP memory limit just to get it to work at all. I must be missing something because I don't see this problem mentioned in any of the other discussions. Any ideas?
                Thanks,
                Jim
                IIRC, in such cases the web frontend issues SQL queries directly to obtain the necessary lists, so if I'm not wrong everything depends on how well the DB backend caches the necessary data in memory.
                I think you should "negotiate" this with your DB backend.
                At my scale the hints in input fields appear instantly.

                If I may share my recent experience:

                At the moment I'm running Zabbix 2.0.11 at a scale of ~700 systems and 150k items, and I cannot list all items in the unsupported state (even with the HTTP request time limit in php.ini extended to 0.5 h). The DB backend is on SSD, so I'm guessing that storage is not the bottleneck. Using DTrace on Linux, I see that the majority (~80%) of all I/Os generated by mysqld complete within a few microseconds (the quantize values below are in nanoseconds).
                Code:
                # dtrace -qn 'syscall::write:entry /execname == "mysqld"/ {self->stime = timestamp;} syscall::write:return /self->stime != 0/ {@LWrite = quantize(timestamp - self->stime);} tick-60s {printa(@LWrite);}'
                
                
                           value  ------------- Distribution ------------- count
                             512 |                                         0
                            1024 |@@@@@                                    10879
                            2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           68605
                            4096 |@@@@                                     8348
                            8192 |@                                        3252
                           16384 |@                                        1264
                           32768 |                                         78
                           65536 |                                         39
                          131072 |                                         15
                          262144 |                                         7
                          524288 |                                         5
                         1048576 |                                         1
                         2097152 |                                         1
                         4194304 |                                         2
                         8388608 |                                         1
                        16777216 |                                         1
                        33554432 |                                         0
                
                ^C
                At least 98% of the I/Os generated by the mysqld used by the Zabbix server are write() syscalls, which is why I'm tracing only write() latency above.
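
For anyone who wants to turn a quantize() histogram like the one above into a percentage, here is a minimal sketch. It relies on two facts about DTrace: `timestamp` is in nanoseconds, and each power-of-two row counts values from its own value up to (but not including) the next row's value. The helper names are made up for this example, and the counts are copied from the output above.

```python
import re

HISTOGRAM = """\
     512 |                                         0
    1024 |@@@@@                                    10879
    2048 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           68605
    4096 |@@@@                                     8348
    8192 |@                                        3252
   16384 |@                                        1264
   32768 |                                         78
   65536 |                                         39
  131072 |                                         15
  262144 |                                         7
  524288 |                                         5
 1048576 |                                         1
 2097152 |                                         1
 4194304 |                                         2
 8388608 |                                         1
16777216 |                                         1
33554432 |                                         0
"""


def parse_quantize(text):
    """Return (bucket_lower_bound_ns, count) pairs from quantize() output."""
    rows = []
    for line in text.splitlines():
        m = re.match(r"\s*(\d+)\s*\|@*\s*(\d+)\s*$", line)
        if m:
            rows.append((int(m.group(1)), int(m.group(2))))
    return rows


def share_below(rows, upper_ns):
    """Fraction of events in buckets that lie entirely below upper_ns."""
    total = sum(count for _, count in rows)
    # A bucket with lower bound v covers the range [v, 2*v).
    below = sum(count for v, count in rows if 2 * v <= upper_ns)
    return below / total


rows = parse_quantize(HISTOGRAM)
print(f"{share_below(rows, 4096):.1%} of write() calls returned in under 4096 ns")
```

Note that this measures the time spent inside the write() syscall, which for buffered writes is mostly a copy into the page cache, so it says little about how long the data takes to reach the SSD.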

                Less than two weeks ago I did a very painful upgrade from 1.8.20 to 2.0.11. The goal was to upgrade with all old data preserved and with the shortest possible downtime (in the end it was about 5 min). It was painful because it was not a standard out-of-the-box upgrade. I still haven't had time to investigate the problem with web frontend queries at whole-environment scale, or even to collect all the details needed to open an SR. In 1.8 it took ~1 min.
                I think some indexes may be missing here (as I mentioned, the DB backend is MySQL 5.1 with partitioned history*/trends* tables).

                Tomasz
                PS. In 3-4 weeks I should switch to 2.2.x. Hopefully it will be better.
                Last edited by kloczek; 03-04-2014, 23:56.
                http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
                https://kloczek.wordpress.com/
                zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
                My zabbix templates https://github.com/kloczek/zabbix-templates


                • natalia
                  Senior Member
                  • Apr 2013
                  • 159

                  #9
                  Originally posted by kloczek
                  In 3-4 weeks I should switch to 2.2.x. Hopefully it will be better.
                  Hi,

                  Could you post an update on the results? What is your solution for using the GUI?

                  Thanks


                  • kloczek
                    Senior Member
                    • Jun 2006
                    • 1771

                    #10
                    Originally posted by natalia
                    Hi,

                    Could you post an update on the results? What is your solution for using the GUI?

                    Thanks
                    I'm quite busy at the moment. I'll try to add some new details tomorrow or the day after.

                    PS. In the meantime I've done the final upgrade to 2.2.3.
