Successful HA setup

  • five0va
    Junior Member
    • Mar 2015
    • 26

    #1

    Successful HA setup

    Hello all! My co-worker and I hatched this plan over the past few months to set up a full Zabbix HA environment: a Galera Cluster (with MariaDB), two (or more) app servers, and the UI either on the app servers or on separate nodes. We decided to run the UI on the app servers for now, with the proxies in a cluster of sorts running Keepalived. Here is a link to my Google Drive, which has all the documentation. Please note that, as of right now, I don't have all the files from the DB nodes... but it's a standard Galera cluster. One thing I was unable to get working was auto-registration of agents. If anyone has any ideas, please share! The logs just show that the systems have no idea about each other.

    Drive data
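
    For reference, the agent auto-registration mentioned above usually hinges on the agent's active-check settings pointing at the floating VIP rather than at an individual node, plus a matching auto-registration action on the server. A minimal zabbix_agentd.conf sketch (the VIP address, hostname, and metadata value below are hypothetical):

        # /etc/zabbix/zabbix_agentd.conf (sketch)
        # Passive checks: accept connections from the server/proxy VIP
        Server=192.168.1.100
        # Active checks and auto-registration requests go to the VIP
        ServerActive=192.168.1.100
        Hostname=web01.example.com
        # Matched by an auto-registration action on the server
        HostMetadata=linux-autoreg

    If the agents report through the proxies instead, ServerActive must point at the proxy VIP, since registration requests follow the active-check path.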
  • kloczek
    Senior Member
    • Jun 2006
    • 1771

    #2
    Originally posted by five0va
    Hello all! My co-worker and I hatched this plan over the past few months to set up a full Zabbix HA environment: a Galera Cluster (with MariaDB), two (or more) app servers, and the UI either on the app servers or on separate nodes. We decided to run the UI on the app servers for now, with the proxies in a cluster of sorts running Keepalived. Here is a link to my Google Drive, which has all the documentation. Please note that, as of right now, I don't have all the files from the DB nodes... but it's a standard Galera cluster. One thing I was unable to get working was auto-registration of agents. If anyone has any ideas, please share! The logs just show that the systems have no idea about each other.

    Drive data
    Here https://www.zabbix.com/forum/showthread.php?t=28240 you can read why a Galera cluster is not the optimal solution; on anything bigger than a few-host POC environment it will not even work.
    Last edited by kloczek; 04-09-2015, 21:06.
    http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
    https://kloczek.wordpress.com/
    zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
    My zabbix templates https://github.com/kloczek/zabbix-templates


    • five0va
      Junior Member
      • Mar 2015
      • 26

      #3
      Excellent, thank you for this!


      • C2c
        Junior Member
        • Sep 2015
        • 12

        #4
        @five0va

        Can you please tell us what you did for HA on the Zabbix servers? Corosync+Pacemaker, HA proxies, or something else? And what is lsyncd.conf? Also, how is your MariaDB Galera Cluster working? Any issues?

        I am new to Zabbix and am setting up a new Zabbix infrastructure:

        1. Zabbix App on separate servers
        2. Database on separate servers
        3. Web UI on separate servers

        Or can anybody else please explain the right way to set up HA for the Zabbix servers only? As mentioned above, the Zabbix DB and Zabbix web UI are on different servers; I am looking to set up HA for the Zabbix server component only. I have gone through some posts explaining HA for Zabbix, but none of them answers my question.



        Thanks


        • five0va
          Junior Member
          • Mar 2015
          • 26

          #5
          We're using Keepalived for the HA setup; no Pacemaker/Corosync here. Keepalived is routing software for Linux: once configured, it sets up a VRRP virtual IP that is able to "float" between the nodes configured in keepalived.conf. As for sharing configuration, we use Lsyncd, which (for us, anyway) is configured to sync config files between the Zabbix server and UI hosts.
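
          A minimal keepalived.conf sketch of that kind of VRRP instance (the interface, router ID, priority, password, and VIP below are hypothetical; the standby node gets state BACKUP and a lower priority):

              vrrp_instance ZBX_VIP {
                  state MASTER              # BACKUP on the standby node
                  interface eth0
                  virtual_router_id 51      # must match on all nodes
                  priority 100              # lower (e.g. 90) on the standby
                  advert_int 1
                  authentication {
                      auth_type PASS
                      auth_pass s3cret
                  }
                  virtual_ipaddress {
                      192.168.1.100         # the floating VIP clients talk to
                  }
              }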

          I decided against doing Lsyncd for the proxies, as I'm designing them for horizontal growth.
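
          For completeness, an lsyncd.conf of the kind described above might look like this; lsyncd configs are Lua, and the peer hostname and paths here are hypothetical:

              settings {
                  logfile    = "/var/log/lsyncd.log",
                  statusFile = "/var/log/lsyncd.status"
              }

              sync {
                  default.rsyncssh,                 -- rsync over ssh to the peer node
                  source    = "/etc/zabbix/",
                  host      = "zabbix2.example.com",
                  targetdir = "/etc/zabbix/",
                  delay     = 5                     -- batch changes for 5 seconds
              }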


          • five0va
            Junior Member
            • Mar 2015
            • 26

            #6
            2nd part to this:

             For the databases, this is just a Galera MariaDB cluster with 3 nodes. As kloczek pointed out, there would normally be issues with swapping nodes back and forth, although that's easily worked around by using "sticky" connections.

             A 3-way setup where the databases and server are on the same nodes is doable... just set up Galera on 3 nodes (I wouldn't go with fewer than 3), then install the server and set up Keepalived and Lsyncd on those 3 nodes.

             Please check out my Google Drive (link in the first post); I'm including links to the things I've talked about.
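
             For reference, the Galera side of such a 3-node MariaDB cluster comes down to a handful of wsrep settings; a minimal my.cnf sketch (node addresses, cluster name, and library path are hypothetical and distro-dependent):

                 [galera]
                 wsrep_on                 = ON
                 wsrep_provider           = /usr/lib64/galera/libgalera_smm.so
                 wsrep_cluster_name       = zabbix_cluster
                 wsrep_cluster_address    = gcomm://db1.example.com,db2.example.com,db3.example.com
                 binlog_format            = ROW         # Galera requires row-based replication
                 default_storage_engine   = InnoDB      # only InnoDB tables are replicated
                 innodb_autoinc_lock_mode = 2           # interleaved mode, required by Galera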


            • kloczek
              Senior Member
              • Jun 2006
              • 1771

              #7
              Originally posted by five0va
              2nd part to this:

               For the databases, this is just a Galera MariaDB cluster with 3 nodes. As kloczek pointed out, there would normally be issues with swapping nodes back and forth, although that's easily worked around by using "sticky" connections.
               I've mentioned that Galera may be an overly complicated solution, provided your server is not used to monitor everything directly and all simple items, external scripts, Zabbix agent checks, and other things like SNMP monitoring are moved behind a proxy or proxies.
               With a buffer of the last few hours of data on the proxies, plus automatic switchover to the slave database as the new master, a master-master DB backend is unnecessary and/or an overcomplication... everything in keeping with the KISS principle.
               In such a scenario, even if some transactions were not committed on the slave, reconnecting to the DB backend and automatically resyncing with the proxies will fill in all the gaps.

               Zabbix without slave(s) is less useful in typical maintenance scenarios, when you must upgrade the DB engine software or perform a major Zabbix upgrade. There you can leave the current master alone and start using the slave as the new master; if anything goes wrong, downgrade the Zabbix server software -> repoint to the original master and restart. The whole rollback procedure takes a few seconds.
               A master-master setup does not provide that level of flexibility.
               From time to time it is good to run OPTIMIZE TABLE on some tables. Optimization locks the table in question, and since it may take significant time, it cannot be done on the master. Doing such maintenance on the slave and then failing over to the optimized slave as the new master keeps the maintenance window down to single seconds.
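
               A sketch of that slave-side maintenance in MariaDB terms (the table list is just the usual large Zabbix history/trends tables; treat this as an outline, not a runbook):

                   -- on the slave: pause replication, optimize while offline, catch up again
                   STOP SLAVE;
                   OPTIMIZE TABLE history, history_uint, trends, trends_uint;
                   START SLAVE;

                   -- once caught up, to promote the slave as the new master for Zabbix:
                   STOP SLAVE;
                   RESET SLAVE ALL;   -- drop the old replication config on the promoted node
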
              Last edited by kloczek; 28-09-2015, 20:03.
              http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
              https://kloczek.wordpress.com/
              zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
              My zabbix templates https://github.com/kloczek/zabbix-templates


              • five0va
                Junior Member
                • Mar 2015
                • 26

                #8
                Originally posted by kloczek
                 I've mentioned that Galera may be an overly complicated solution, provided your server is not used to monitor everything directly and all simple items, external scripts, Zabbix agent checks, and other things like SNMP monitoring are moved behind a proxy or proxies.
                 With a buffer of the last few hours of data on the proxies, plus automatic switchover to the slave database as the new master, a master-master DB backend is unnecessary and/or an overcomplication... everything in keeping with the KISS principle.
                 In such a scenario, even if some transactions were not committed on the slave, reconnecting to the DB backend and automatically resyncing with the proxies will fill in all the gaps.

                 Zabbix without slave(s) is less useful in typical maintenance scenarios, when you must upgrade the DB engine software or perform a major Zabbix upgrade. There you can leave the current master alone and start using the slave as the new master; if anything goes wrong, downgrade the Zabbix server software -> repoint to the original master and restart. The whole rollback procedure takes a few seconds.
                 A master-master setup does not provide that level of flexibility.
                 From time to time it is good to run OPTIMIZE TABLE on some tables. Optimization locks the table in question, and since it may take significant time, it cannot be done on the master. Doing such maintenance on the slave and then failing over to the optimized slave as the new master keeps the maintenance window down to single seconds.

                 I get it... but I don't think there is much concern. We have already upgraded the DB once. We have Keepalived running on the DB nodes and use the VRRP address for all of this, instead of the round-robin approach normally used with Galera. Stopping Keepalived on a node during an upgrade makes life easier; once the upgrade is done, we shuffle. Our proxies have plenty of storage on board, we stopped our Zabbix server during the upgrade, and everything turned out fine with no data lost. I'm all about KISS, and since we plan to move this from the current VM cluster to a hardware DB cluster, we went with Galera; the hardware DB cluster will be a much better move in the end. The DB is the largest part of our setup (though it takes us very little time to actually work on it); everything else in the "chain" is running nice and smooth.
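
                 Roughly, the per-node shuffle during such an upgrade might look like this (a sketch assuming systemd and the MariaDB repository package names; adjust for your distro):

                     # on the currently active DB node:
                     systemctl stop keepalived      # VIP fails over to a standby Galera node
                     systemctl stop mariadb         # the other two nodes keep quorum
                     yum upgrade MariaDB-server galera
                     systemctl start mariadb        # node rejoins and syncs via IST/SST
                     systemctl start keepalived     # node is eligible for the VIP again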


                • kloczek
                  Senior Member
                  • Jun 2006
                  • 1771

                  #9
                  Originally posted by five0va
                   I get it... but I don't think there is much concern. We have already upgraded the DB once. We have Keepalived running on the DB nodes and use the VRRP address for all of this, instead of the round-robin approach normally used with Galera.
                   OK. Have you already tested, a few times, what happens on a cold reboot of one of the master-master DB nodes?
                  Last edited by kloczek; 28-09-2015, 20:47.
                  http://uk.linkedin.com/pub/tomasz-k%...zko/6/940/430/
                  https://kloczek.wordpress.com/
                  zapish - Zabbix API SHell binding https://github.com/kloczek/zapish
                  My zabbix templates https://github.com/kloczek/zabbix-templates


                  • five0va
                    Junior Member
                    • Mar 2015
                    • 26

                    #10
                     On the VM, we shut down the currently active node from vSphere and did not have any issues (MariaDB 10, by the way). We got DB connection errors, but everything stayed cached; I looked through Latest Data and did not see anything dropped (viewed at the 1-hour view).


                    • Vaku
                      Junior Member
                      • Feb 2018
                      • 24

                      #11
                       Don't ever use a Galera cluster for Zabbix HA; it's extremely bad and unstable for such workloads.
                       Better to plan the database architecture to speed up the heavy history tables in NoSQL and to use direct DRBD replication, which works much faster because it runs in kernel space, unlike the slow, clumsy Galera master-master replication that eventually ends in cluster-wide deadlocks and hence downtime, defeating its whole HA purpose.
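
                       For the curious, DRBD replicates the DB block device per resource; a minimal drbd.conf-style sketch (hostnames, devices, and addresses are hypothetical):

                           resource zabbixdb {
                               protocol C;                # synchronous replication
                               on db1 {
                                   device    /dev/drbd0;  # replicated device MariaDB sits on
                                   disk      /dev/sdb1;   # backing disk on this node
                                   address   10.0.0.1:7789;
                                   meta-disk internal;
                               }
                               on db2 {
                                   device    /dev/drbd0;
                                   disk      /dev/sdb1;
                                   address   10.0.0.2:7789;
                                   meta-disk internal;
                               }
                           }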

