How to monitor 1000 servers in three datacenters with Zabbix?

  • Mirza
    Junior Member
    • Mar 2005
    • 7

    #1

    How to monitor 1000 servers in three datacenters with Zabbix?

    Hi,


    I would like to replace our Nagios-, MRTG- and OTRS-based solutions with Zabbix.

    At the moment we have three datacenters with a gigabit backbone and 350 servers each, with a 25% increase per year. The pre-production environment is connected with a gigabit and a 155 Mbit leased line.

    To provide 24x7 monitoring without any interruptions I need to size the solution properly.

    First I will need a Linux cluster. This may be Heartbeat, Red Hat Cluster Server, SteelEye's LifeKeeper or HP MC/ServiceGuard. Two possible strategies:
    • Place the cluster in the pre-production environment - so I may use shared storage (SCSI- or FC-based) and store the database on it.
    • Distribute the cluster across the datacenters and keep local copies of the database in each center.


    The second solution would be the best one to avoid monitoring problems during network outages. But I would need a Zabbix frontend which is able to collect data from three or more different SQL databases and aggregate the information.

    Placing the database on shared storage may also be difficult: if the primary node crashes, dirty buffers (filesystem buffers) will be lost. So MySQL may not be the first choice, as all dirty buffers are cached by the operating system. I thought about tuning the filesystem parameters and mounting the filesystem in "sync" mode - but that is not reliable enough. So would PostgreSQL or Oracle, using raw devices and logs, be the better choice...?

    How big does the hardware need to be to store the most detailed values for at least three months for 1000 servers? Do I need 2-CPU or 4-CPU servers? 300 GB in RAID 10 for the database?

    Does anyone have experience with such big environments?

    Is Zabbix able to schedule and run all items and triggers within 30 seconds?

    Thanks, Oliver
  • LEM
    Senior Member
    Zabbix Certified Specialist
    • Sep 2004
    • 112

    #2
    Originally posted by Mirza
    Hi,
    Placing the database on shared storage may also be difficult: if the primary node crashes, dirty buffers (filesystem buffers) will be lost. So MySQL may not be the first choice, as all dirty buffers are cached by the operating system.
    Have you envisaged the MySQL Cluster architecture?


    How big does the hardware need to be to store the most detailed values for at least three months for 1000 servers? Do I need 2-CPU or 4-CPU servers? 300 GB in RAID 10 for the database?
    You should consider using this (wannabe) sizer (for the 1.0 release): Zabbix (1.0) sizer.

    Does anyone have experience with such big environments?
    For us, for now, we run only one central monitoring point for about 250 servers and 200 network elements (switches/routers) in one datacenter and about 100 remote locations spread over a private WAN.

    The planned infrastructure (because of the IT landscape's growth) is a 'simple' Linux cluster using the MySQL clustering architecture, but for now (and for the past 5 months) we have been using a single old computer (2x P3 1 GHz, 2 GB RAM, Debian) running all parts (www, zabbix, mysql) pretty well (load average 2.5 most of the time, response time OK for now, database size about 4 GB) with about 30 triggers per server, 3 triggers per network element, and a retention time of 90 days.

    Is Zabbix able to schedule and run all items and triggers within 30 seconds?
    YMMV: it depends on your number of triggers per host (and ultimately, the number of checks per 30 seconds).
    --
    LEM

    • Alexei
      Founder, CEO
      Zabbix Certified Trainer
      Zabbix Certified Specialist, Zabbix Certified Professional
      • Sep 2004
      • 5654

      #3
      Originally posted by Mirza
      First I will need a Linux cluster. This may be Heartbeat, Red Hat Cluster Server, SteelEye's LifeKeeper or HP MC/ServiceGuard. Two possible strategies:


      • Place the cluster in the pre-production environment - so I may use shared storage (SCSI- or FC-based) and store the database on it.
      • Distribute the cluster across the datacenters and keep local copies of the database in each center.
      I think that the first option is the best alternative currently. I say 'currently' because ZABBIX doesn't support distributed monitoring yet, so having three servers reporting to one with a customised(?) front-end could be difficult from a maintenance point of view. The reliability of the connections between the datacenters is the key argument for the decision.

      Alternatively, you may have three independent ZABBIX servers, one per location.

      Originally posted by Mirza

      Placing the database on shared storage may also be difficult: if the primary node crashes, dirty buffers (filesystem buffers) will be lost. So MySQL may not be the first choice, as all dirty buffers are cached by the operating system. I thought about tuning the filesystem parameters and mounting the filesystem in "sync" mode - but that is not reliable enough. So would PostgreSQL or Oracle, using raw devices and logs, be the better choice...?
      Yes, filesystem buffers will be lost. But is the loss of hundreds of metrics really business critical? I don't think so.

      Crashes are not something that happens often. I'd be more concerned about database integrity (short recovery time) and fast switch-over. So a solution with shared storage (HP ServiceGuard, for example) seems preferable in this case.

      Both PostgreSQL and Oracle are not fast enough. I'd suggest using MySQL for your environment.

      Originally posted by Mirza
      How big does the hardware need to be to store the most detailed values for at least three months for 1000 servers? Do I need 2-CPU or 4-CPU servers? 300 GB in RAID 10 for the database?
      It all depends on the number of checks per second and the retention period for detailed history and trend data. I'm confident that 300 GB is more than enough for three months.

      I'd start with a 2-CPU setup with at least 1 GB of RAM for database usage. ZABBIX doesn't use much CPU power and uses virtually no RAM; the database does. Fast disk I/O is very important.

      Again, take it with a grain of salt; it really depends on the number of checks. It could even turn out that a 1-CPU server with 512 MB is enough for you if you plan to run only a few (up to 10) checks per server every 30 seconds.
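
      A rough back-of-envelope sketch of how such an estimate works out, using the lighter case mentioned above (about 10 checks per server every 30 seconds). The ~50 bytes per stored value is an assumed average row size for illustration, not an official ZABBIX figure; the result scales linearly with the number of items, the check interval and the retention period.

      # history_estimate.py - rough sizing sketch, all figures approximate
      def history_size_gb(servers, items_per_server, interval_s,
                          retention_days, bytes_per_value=50):
          """Estimate raw history storage for a given check rate and retention."""
          checks_per_second = servers * items_per_server / interval_s
          values = checks_per_second * 86400 * retention_days
          return checks_per_second, values * bytes_per_value / 1024**3

      # 1000 servers, 10 checks each, every 30 seconds, kept for three months
      rate, gb = history_size_gb(1000, 10, 30, 90)
      print(f"{rate:.0f} checks/second, about {gb:.0f} GB of history")   # ~333 checks/second, ~120 GB
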
      Originally posted by Mirza
      Does anyone have experience with such big environments?
      Check this: http://www.zabbix.com/forum/showthread.php?t=77
      Originally posted by Mirza

      Is Zabbix able to schedule and run all items and triggers within 30 seconds?
      Yes, it is!
      Alexei Vladishev
      Creator of Zabbix, Product manager
      New York | Tokyo | Riga
      My Twitter

      • Mirza
        Junior Member
        • Mar 2005
        • 7

        #4
        Originally posted by LEM
        Have you envisaged the MySQL Cluster architecture?
        I did - in fact we have the largest MySQL cluster (Master-Slave) in Europe.

        Well, I know about EAC, MySQL Cluster and MySQL master-slave replication. EAC has problems concerning an up-to-date MySQL version, MySQL Cluster is not very common or widely tested, and MySQL master-slave setups - well, ugly replication drift is one thing to be aware of, and even if the master dies, data may disappear.

        Thanks, I will check the sizer too.

        Mirza

        • Mirza
          Junior Member
          • Mar 2005
          • 7

          #5
          Originally posted by Alexei
           I think that the first option is the best alternative currently. I say 'currently' because ZABBIX doesn't support distributed monitoring yet, so having three servers reporting to one with a customised(?) front-end could be difficult from a maintenance point of view. The reliability of the connections between the datacenters is the key argument for the decision.
           Well, another approach: I would have a single clustered MySQL database, but three Zabbix servers writing data into it. A fourth Zabbix server would be configured for read-only access only, to display data; configuration would be done on the remote Zabbix servers.

           What happens when the Zabbix server loses the connection to the database? Will data be stored locally (I assume not)?

           Well, I could add a local database in each datacenter and do a replication to the central database...

           Alternatively, you may have three independent ZABBIX servers, one per location.
           In this case I lose my overview. As in any clustered environment, only the SLAs of the overall solution are interesting - if one server fails, it does not matter.

           Yes, filesystem buffers will be lost. But is the loss of hundreds of metrics really business critical? I don't think so.
           That depends on the business you are working in. If I have to pay a large amount of money to my customers whenever I miss my SLAs, I will definitely be interested in every single data point for at least 9 months.

           If it's my private server, I may lose data for a couple of hours - no problem.

           Implementing a monitoring solution in a datacenter is more than the technical solution - customers, fees, penalty fees etc. take more resources than the solution itself.

           Crashes are not something that happens often. I'd be more concerned about database integrity (short recovery time) and fast switch-over. So a solution with shared storage (HP ServiceGuard, for example) seems preferable in this case.
           Right. The recovery time is the most important. We have had a couple of MySQL data losses after a crash of the Linux OS. That's why I am asking for a database which is able to roll back its logs and recover data after a crash.

           Both PostgreSQL and Oracle are not fast enough. I'd suggest using MySQL for your environment.
           What about a 2- to 4-node Oracle RAC cluster with 2x 3.0 GHz CPUs each and load balancing?

           Again, take it with a grain of salt; it really depends on the number of checks. It could even turn out that a 1-CPU server with 512 MB is enough for you if you plan to run only a few (up to 10) checks per server every 30 seconds.
           1000 servers with a 25% increase per year, at least 60 items per server, keeping the data for 9 months.

          Thanks for your help...

          Mirza

          • Alexei
            Founder, CEO
            Zabbix Certified Trainer
             Zabbix Certified Specialist, Zabbix Certified Professional
            • Sep 2004
            • 5654

            #6
            Originally posted by Mirza
             What happens when the Zabbix server loses the connection to the database? Will data be stored locally (I assume not)?
             The data will not be stored locally. The Zabbix server will just wait until the database is up again.
            Originally posted by Mirza
             Implementing a monitoring solution in a datacenter is more than the technical solution - customers, fees, penalty fees etc. take more resources than the solution itself.

             Right. The recovery time is the most important. We have had a couple of MySQL data losses after a crash of the Linux OS. That's why I am asking for a database which is able to roll back its logs and recover data after a crash.
             Is the InnoDB MySQL database structure supposed to be immune to power failures and OS crashes?
            Originally posted by Mirza
             What about a 2- to 4-node Oracle RAC cluster with 2x 3.0 GHz CPUs each and load balancing?
             Oracle support is not finished yet, so I cannot comment on the performance of that database.
            Originally posted by Mirza
             1000 servers with a 25% increase per year, at least 60 items per server, keeping the data for 9 months.
             That translates to 2000 checks per second, provided the refresh rate is 30 seconds. I'm not sure the current 1.0 can handle this nicely, because timeout processing is not perfect and 1.0 is not optimised for such a heavy load.
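
             As a quick check of the arithmetic, and of what the stated 25% yearly growth does to it (a simple projection, nothing more):

             # check_rate.py - how the 2000 checks/second figure falls out of the numbers above
             servers = 1000
             items_per_server = 60
             interval_s = 30
             growth_per_year = 0.25

             rate = servers * items_per_server / interval_s
             print(rate)                                   # 2000.0 checks/second today
             print(rate * (1 + growth_per_year) ** 3)      # ~3906 checks/second after three years of growth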

             Fortunately, 1.1 will be here to address all these problems. Parallelism will be greatly improved (I already understand how to achieve this), so it will be possible to run tens or even hundreds of server processes simultaneously.

             I'm very interested in making Zabbix scale to big environments.
            Alexei Vladishev
            Creator of Zabbix, Product manager
            New York | Tokyo | Riga
            My Twitter

            • Mirza
              Junior Member
              • Mar 2005
              • 7

              #7
              Originally posted by Alexei
               Is the InnoDB MySQL database structure supposed to be immune to power failures and OS crashes?
              Not immune - as far as I understand.

               It's much better than MyISAM, as it has the doublewrite buffer. But data may still be lost: in detail, the amount of dirty buffers (90% of the buffer pool by default, depending on your my.cnf configuration) could be anything between 1 MB and 512 MB...
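
               A small illustration of that bound: the data sitting in un-flushed (dirty) InnoDB pages at any moment is limited by the buffer pool size times the dirty-page percentage, so the exposure is entirely a my.cnf question. The pool sizes below are just examples:

               # dirty_buffer_bound.py - upper bound on un-flushed page data, in MB
               def max_dirty_mb(buffer_pool_mb, dirty_pages_pct=90):
                   return buffer_pool_mb * dirty_pages_pct / 100

               print(max_dirty_mb(8))     # tiny buffer pool:  at most ~7 MB un-flushed
               print(max_dirty_mb(512))   # large buffer pool: at most ~461 MB un-flushed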

               But the underlying filesystem still causes trouble. I learned from http://www.ussg.iu.edu/hypermail/lin...03.2/0629.html that with ext3 in ordered mode the Linux kernel tries to do real fsyncs. According to that page, this does not work with IDE drives and is damned slow even with SCSI drives.

               I'm very interested in making Zabbix scale to big environments.
               Me too, as Zabbix is a wonderful tool.

              • Mirza
                Junior Member
                • Mar 2005
                • 7

                #8
                Originally posted by Mirza
                 But the underlying filesystem still causes trouble. I learned from http://www.ussg.iu.edu/hypermail/lin...03.2/0629.html
                 Quoting http://www.issociate.de/board/post/1...mentioned.html:
                 fsync is buggy even in Linux 2.6.

                Mirza

                • Alexei
                  Founder, CEO
                  Zabbix Certified Trainer
                   Zabbix Certified Specialist, Zabbix Certified Professional
                  • Sep 2004
                  • 5654

                  #9
                  All,

                   We have recently finished development of active checks (i.e. checks initiated by the ZABBIX agent). Using active checks will greatly decrease the hardware requirements of the ZABBIX server and eliminate polling.

                   Recent benchmarks performed on a 1-CPU Athlon 2800+ with 2 GB of RAM show that ZABBIX is able to process about 570 checks per second. In fact, this means that a 1-CPU ZABBIX server can handle monitoring of 1000 servers with 34 metrics each, monitored once per minute.
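
                   A quick sanity check of that claim, comparing the required check rate with the measured throughput:

                   # capacity_check.py - does the benchmarked rate cover the scenario above?
                   servers = 1000
                   metrics_per_server = 34
                   interval_s = 60            # each metric collected once per minute
                   measured_capacity = 570    # checks/second from the benchmark above

                   required = servers * metrics_per_server / interval_s
                   print(f"required: {required:.0f} checks/second")                         # ~567
                   print(f"headroom: {measured_capacity - required:.0f} checks/second")     # ~3 to spare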

                   I would still recommend a 2-CPU system, as some more power is required for the GUI and reports.

                  The functionality will be released as part of 1.1alpha8, next Monday (hopefully).
                  Alexei Vladishev
                  Creator of Zabbix, Product manager
                  New York | Tokyo | Riga
                  My Twitter
