HA, scalability & security: best practices

  • jeremyjr
    Junior Member
    • Jan 2013
    • 5

    #1

    Hello,

    I am looking for advice and best practices on Zabbix with regard to HA, scalability & security.

    I'm working on a project for which we will have to automatically deploy several VMs on an IaaS provider. At the start we will only have to manage hundreds of virtual servers, but I hope we can grow to thousands of hosts over the next few years.
    I plan to install the infrastructure services (monitoring, config management, databases) on dedicated servers.

    - 2 Zabbix servers (active/passive)
    Intel Xeon W3520
    24 GB DDR3
    SOFT RAID1 on SATA2 disks
    - 2 DB servers MySQL master/slave replication
    Intel Xeon E5-1620
    64 GB DDR3 ECC
    SOFT RAID1 on 2x120 GB SSD

    I've read docs about MySQL replication based on DRBD, but I think that running backups on the database will impact production performance. That won't be the case if we take the dumps on a slave. How do people handle this in large environments?
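
One way to take the dump on a slave, as described above, is to point a consistent `mysqldump` at the replica only. This is a minimal Python sketch under stated assumptions: the host name `db-slave.example`, the user `backup` and the database name `zabbix` are hypothetical, and the function only builds the command line rather than running it.

```python
import shlex

def build_replica_dump_command(replica_host, user, database):
    """Build a mysqldump command line that targets the replica, not the master."""
    return [
        "mysqldump",
        f"--host={replica_host}",   # connect to the slave so the master is not loaded
        f"--user={user}",
        "--single-transaction",     # consistent InnoDB snapshot without table locks
        "--quick",                  # stream rows instead of buffering whole tables
        database,
    ]

# Hypothetical names, for illustration only.
cmd = build_replica_dump_command("db-slave.example", "backup", "zabbix")
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run` from a cron job on the backup host; credentials would normally come from an option file rather than the command line.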

    About security: in the first stage the VMs will be in a public cloud and the dedicated servers won't be hosted by the same provider, so I need to secure the communication between the Zabbix server and the VMs. I know 3 ways to do this:
    - patch Zabbix
    - proxy + stunnel
    - SSH checks
    I won't patch Zabbix, as I want to use the packages (1.8.11) provided by the Linux distribution we will use in order to benefit from security fixes, and I don't have the ability to do this myself.
    The proxy could also be a VM hosted by the same provider as the target VMs to monitor. Depending on the provider (Rackspace, for example) we could use a private network for the communication between the proxy and the VMs, but it would still be in clear text. So could we also run SSH checks from a proxy?
    Or the SSH checks could be launched directly by the server.
    Maybe you have some advice about security?
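
To make the "SSH checks" option concrete: the idea is that the monitoring side runs a command on the target over an encrypted SSH session and reads the result. This is a rough Python sketch of that idea, not how Zabbix implements it internally. The host, user, key path and remote command are hypothetical, and the function only builds the `ssh` command line.

```python
def build_ssh_check_command(host, user, key_path, remote_command):
    """Build an ssh command line that runs one check command on a remote VM."""
    return [
        "ssh",
        "-i", key_path,                # dedicated key for the monitoring user
        "-o", "BatchMode=yes",         # never prompt for a password; fail instead
        "-o", "ConnectTimeout=10",     # give up quickly on unreachable hosts
        f"{user}@{host}",
        remote_command,
    ]

# Hypothetical values, for illustration only.
check = build_ssh_check_command(
    "vm01.example", "monitor", "/etc/monitor/id_ed25519", "cat /proc/loadavg"
)
print(check)
```

Whether the command is started from the server or from a proxy, the traffic between that machine and the VM is encrypted by SSH itself, which is the property being asked about.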

    About scalability: it seems better to let the VMs schedule checks themselves, but in that case we can't encrypt the communication (we can't use SSH checks). So do you have any advice on dealing with both security and scalability? Is it possible to use proxies with SSH active checks?

    About HA: as you can see here https://www.zabbix.com/forum/showthread.php?t=39058, I'm also looking for advice on avoiding a SPOF with proxies.

    I would be really glad if you could tell me about your experience, and I hope I was clear enough, as English is not my native language.
    Thanks in advance for your help.

    Regards,
    Jérémy
  • insider
    Junior Member
    • Jun 2013
    • 7

    #2
    2x120 GB SSD
    That will not be sufficient for a big environment, unless you keep no history at all.

    MySQL replication based on DRBD
    In my opinion, DRBD is too slow to cope with high I/O.
    I think it's better to use master-master replication with active/passive load balancing. I'm looking at Pacemaker + Corosync.
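
To put a rough number on the disk-space point, here is a back-of-the-envelope estimate in Python. It assumes about 90 bytes per stored value, which is the rule-of-thumb figure I recall from the Zabbix documentation on database sizing; the host count, items per host, polling interval and retention periods are all made-up assumptions.

```python
SECONDS_PER_DAY = 24 * 3600

def history_bytes(items, interval_s, days, bytes_per_value=90):
    """Estimated size of history: one row per item per polling interval."""
    values_per_day = items * SECONDS_PER_DAY / interval_s
    return values_per_day * days * bytes_per_value

def trends_bytes(items, days, bytes_per_trend=90):
    """Estimated size of trends: one aggregated row per item per hour."""
    return items * 24 * days * bytes_per_trend

# Assumed workload: 1000 hosts x 50 items each, polled every 60 s,
# 90 days of history and 365 days of trends.
items = 1000 * 50
history_gb = history_bytes(items, 60, 90) / 1e9
trends_gb = trends_bytes(items, 365) / 1e9
print(f"history: {history_gb:.1f} GB, trends: {trends_gb:.1f} GB")
```

With these assumed numbers the history alone comes out at several hundred GB, well beyond a 120 GB mirror, so retention has to be cut drastically or the disks have to grow.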


    • trikke76
      Member
      Zabbix Certified Trainer

      • Apr 2013
      • 42

      #3
      SOFT RAID1
      In my opinion the only reason to use software RAID is for compatibility afterwards.

      Switch to hardware RAID, it performs much better.


      • mushero
        Senior Member
        • May 2010
        • 101

        #4
        Agreed on both:

        - Use hardware RAID, always, with battery-backed cache, like Dell PERC

        - Use fast disks like 15K SAS; SSDs are nice but small if you want to keep data, as noted

        - DRBD is too slow in most configs; we have a customer now with 25-50 ms update latency, which is deadly for DBs; use master-slave instead

        - RAM is cheap, buy more, at least 64GB in 8x8 or 4x16GB so you can add more.

        A Dell R420 with 64GB of RAM, a PERC and 4x600GB 15K SAS in RAID10 for 1.2TB of fast space is a nice starting point for a DB. 2x 6-core CPUs. Or go bigger with 128GB in an R720 with 6-8 x 600GB disks in RAID10.
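
The capacity figures above follow from simple mirroring arithmetic: RAID1 keeps one disk's worth of data, RAID10 keeps half of the total. A small Python sketch to check the numbers (the function name is mine, and it ignores filesystem and controller overhead):

```python
def usable_capacity_gb(level, disks, disk_gb):
    """Usable space for mirrored RAID levels, ignoring filesystem overhead."""
    if level == "raid1":
        if disks < 2:
            raise ValueError("RAID1 needs at least 2 disks")
        return disk_gb                   # every disk holds the same data
    if level == "raid10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID10 needs an even number of disks, 4 or more")
        return (disks // 2) * disk_gb    # striped across mirrored pairs
    raise ValueError(f"unsupported level: {level}")

r420 = usable_capacity_gb("raid10", 4, 600)       # the R420 suggestion
r720_small = usable_capacity_gb("raid10", 6, 600)
r720_large = usable_capacity_gb("raid10", 8, 600)
original_ssd = usable_capacity_gb("raid1", 2, 120)
print(r420, r720_small, r720_large, original_ssd)
```

So the 4x600GB layout gives 1200 GB usable, against 120 GB for the 2x120 GB SSD mirror from the first post.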


        • Vaku
          Junior Member
          • Feb 2018
          • 24

          #5
          Originally posted by insider
          That will not be sufficient for a big environment, unless you keep no history at all.


          In my opinion, DRBD is too slow to cope with high I/O.
          I think it's better to use master-master replication with active/passive load balancing. I'm looking at Pacemaker + Corosync.

          DRBD is much faster than MySQL replication.
          1) DRBD works in kernel space and syncs only the data blocks that were changed. This data has already been processed by MySQL, so you are synchronizing the result.
          2) MySQL does a lot of things in userspace, which is already much slower, and it uses transactions, so the data is processed twice, which adds overhead and latency.

          The I/O impact is mostly caused by using InnoDB, which is not suitable for the Zabbix type of workload.
          Consider switching from InnoDB to NoSQL for the history and trend tables. This will significantly reduce I/O overhead.
          DRBD is most efficient with direct replication: join the two servers with a cross-over cable on a dedicated network interface, and avoid traffic routing and interference.
          Then this should be good to go.


          • Vaku
            Junior Member
            • Feb 2018
            • 24

            #6
            Originally posted by insider
            That will not be sufficient for a big environment, unless you keep no history at all.


            In my opinion, DRBD is too slow to cope with high I/O.
            I think it's better to use master-master replication with active/passive load balancing. I'm looking at Pacemaker + Corosync.

            Master-master replication is not stable for the Zabbix workload.
            You will eventually experience cluster-wide deadlocks and serious downtime, which defeats the whole purpose of HA.


            • Markku
              Senior Member
              Zabbix Certified Specialist, Zabbix Certified Professional, Zabbix Certified Expert
              • Sep 2018
              • 1781

              #7
              Originally posted by Vaku

              Master-master replication is not stable for the Zabbix workload.
              You will eventually experience cluster-wide deadlocks and serious downtime, which defeats the whole purpose of HA.

              Hi Vaku, can you tell more about this? I've run a 40+ GB Zabbix database with 200+ NVPS in MariaDB master-master replication with no problems. Are there specific scale or usage scenarios that you are thinking about? I don't have any clients connecting to the other master, however.

              Markku
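
For readers not used to the NVPS figure (new values per second): it is roughly the number of items divided by their polling interval, summed over all intervals. A quick Python sketch, where the item counts are invented purely to show what a load of 200 NVPS looks like:

```python
def new_values_per_second(items_by_interval):
    """Sum items/interval over a mapping of {interval in seconds: item count}."""
    return sum(count / interval for interval, count in items_by_interval.items())

# Assumed mix: 12,000 items polled once a minute.
one_minute_only = new_values_per_second({60: 12000})

# Assumed mix: 6,000 items every 60 s plus 30,000 items every 300 s.
mixed = new_values_per_second({60: 6000, 300: 30000})

print(one_minute_only, mixed)
```

Both assumed mixes work out to 200 values per second, so the same NVPS can hide quite different item counts.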


              • Vaku
                Junior Member
                • Feb 2018
                • 24

                #8
                Originally posted by Markku

                Hi Vaku, can you tell more about this? I've run a 40+ GB Zabbix database with 200+ NVPS in MariaDB master-master replication with no problems. Are there specific scale or usage scenarios that you are thinking about? I don't have any clients connecting to the other master, however.

                Markku

                Hi, that may depend on the environment. It may well be OK on bare metal.
                But a default installation of MariaDB with Galera master-master replication on a cluster of two VMware VMs cannot handle a load of 200 NVPS without special configuration and DB tuning.
                In production it often results in unexpected, nasty cluster-wide deadlocks, which defeat the purpose of HA. You can find many threads on the Percona forum from people frustrated by sudden cluster-wide deadlocks, with no useful response from Percona.
                We have experienced the same problems. We had an HA master-master Galera replication setup with ProxySQL separating reads and writes, and it was a nightmare having to fix the Zabbix "HA" cluster a couple of times every month.
                The default InnoDB storage engine is extremely inefficient for storing and accessing the time-based data that Zabbix produces.
                So for it to run stably at a high load such as 200+ NVPS without extra hardware resources, one option could be to use the RocksDB storage engine within MariaDB, or to move to ClickHouse.
                That is the reason Zabbix have finally enabled support for ClickHouse.



                • tim.mooney
                  tim.mooney commented
                  Really interesting information. I appreciate you sharing your insights about DB backends for large environments!

                • Markku
                  Markku commented
                  Ok, thanks for sharing your experience. I haven't had those issues with MariaDB master-master replication, even when running everything on VMs on shared platforms. But I haven't used Galera, and don't even know exactly what it is.

                  Markku

                • WanWizard
                  WanWizard commented
                  Can't confirm anything of the above.

                  We run a managed hosting business for the SME market.

                  We have a four-node MariaDB Galera cluster (one node is a standby for backups and offline loading/restoring) as a shared HA database platform, using ESX VMs with local SSD storage for the DBs. We load-balance client access to the nodes with 1 hour stickiness, so most client VMs choose a node on startup and stick with it.

                  We have a Zabbix active/passive cluster (CentOS 7 + Pacemaker/Corosync) using that for backend storage. We monitor our own DC (hardware, separate 1G/10G infrastructure, SSD, SAS and SATA NAS backend, firewalls, load balancers, the works), our cloud CDN nodes (for traffic offloading), and currently about 250 ESX VMs and a long list of web scenarios via proxies on those CDN nodes (to test internet response times).

                  It all runs like clockwork.