High availability and failover

  • gde
    Junior Member
    Zabbix Certified Specialist
    • Mar 2011
    • 5

    #1

    High availability and failover

    Hello,

    I haven't started installing anything yet, but I've been thinking about which tools to use for HA and failover handling.

    For Zabbix, we've decided to use corosync+pacemaker, as described in the wiki.
    For the database, we'll be using PostgreSQL 9.0 and the new streaming replication feature. We'll also use pgpool to handle failover and recovery of nodes.
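
    (For illustration, a minimal sketch of the kind of role check that could be hung off pacemaker or pgpool in this setup; the DSN, the exit-code convention and the psycopg2 dependency are assumptions, not part of the wiki recipe.)

    Code:
    #!/usr/bin/env python
    # Hypothetical helper: report whether the local PostgreSQL 9.0 node is the
    # primary or a streaming-replication standby, e.g. as input for a
    # pacemaker/pgpool health check. Connection details are made up.
    import sys
    import psycopg2

    def node_role(dsn="dbname=zabbix user=zabbix host=127.0.0.1"):
        conn = psycopg2.connect(dsn)
        try:
            cur = conn.cursor()
            # pg_is_in_recovery() returns true on a hot-standby node (9.0+)
            cur.execute("SELECT pg_is_in_recovery()")
            in_recovery = cur.fetchone()[0]
            return "standby" if in_recovery else "primary"
        finally:
            conn.close()

    if __name__ == "__main__":
        role = node_role()
        print(role)
        # non-zero exit on a standby so a resource agent or script can act on it
        sys.exit(0 if role == "primary" else 1)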

    Here's my question: what solution have you folks been using to ensure high availability?

    I don't think there are many approaches to this; corosync+pacemaker seems like the only choice, but right now all our ideas are pretty much theory. I'm interested in hearing any "field experience" with this: how well it works in practice, what problems have been encountered, etc.

    Thanks for your feedback!
  • guesommer
    Junior Member
    • Feb 2009
    • 4

    #2
    Re: High availability and failover

    We're using Linux-HA (Pacemaker) as our HA solution.

    The database in use is MySQL (for performance reasons).

    We are doing the replication via DRBD (and Linux-HA is controlling this).

    DRBD works absolutely fine; most of the trouble we've had is that the cluster stack sometimes does not recover properly after a full crash of the nodes (induced by hand).
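
    (For illustration, a minimal sketch of a check that reads /proc/drbd and flags any resource whose connection state is not "Connected" - for example StandAlone after a split brain. The exit-code convention is an assumption.)

    Code:
    #!/usr/bin/env python
    # Hypothetical watchdog: parse /proc/drbd and report any resource that is
    # not in the healthy "Connected" state, so the result can be fed into
    # monitoring or the cluster stack.
    import re
    import sys

    HEALTHY_CS = "Connected"

    def drbd_states(path="/proc/drbd"):
        """Return {minor number: connection state} for all DRBD resources."""
        states = {}
        with open(path) as fh:
            for line in fh:
                m = re.search(r"^\s*(\d+): cs:(\S+)", line)
                if m:
                    states[int(m.group(1))] = m.group(2)
        return states

    if __name__ == "__main__":
        bad = dict((minor, cs) for minor, cs in drbd_states().items()
                   if cs != HEALTHY_CS)
        if bad:
            for minor, cs in sorted(bad.items()):
                print("drbd minor %d: connection state %s" % (minor, cs))
            sys.exit(1)
        print("all DRBD resources Connected")
        sys.exit(0)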

    • dnshat
      Junior Member
      • Jul 2011
      • 1

      #3
      HA setup at dnshat.com

      Originally posted by gde
      Here's my question: what solution have you folks been using to ensure high availability?
      Hi gde - just saw a Twitter post that led me here. I'm using Zabbix at the core of dnshat.com to provide DNS failover and automated DNS load-balancing solutions on a subscription basis for client websites. Basically, I use Zabbix to monitor client websites for specific content strings; if the content strings are not found, triggered actions update MySQL records in a replicated backend database for a redundant PowerDNS setup.

      I have two core monitoring locations, and only one is active at a given time. I use MySQL master/master replication between these two locations for the Zabbix database. If the primary cloud site fails, custom scripts on the secondary cloud site see that the primary is down and start Zabbix on the secondary system, where it resumes site monitoring; when the primary is restored, my scripts shut down the Zabbix server on the secondary so that only the Zabbix server on the primary is running. Because of the replicated MySQL backend, everything stays in sync. If I lose both the primary and the secondary monitoring location, I have manual procedures in place to activate a third, slave-only MySQL instance in another datacenter (promoting it to a master and starting the Zabbix server processes there by hand). I have never had to use it in production, but it's nice to know that if I lose my primary and secondary, I have a third system ready to take over.
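
      (For illustration, a rough sketch of such a watchdog in Python - the primary's address, the trapper-port check, the init-script path and the timings are assumptions, not the actual scripts described above.)

      Code:
      #!/usr/bin/env python
      # Hypothetical watchdog for the secondary site: if the primary's Zabbix
      # server stops answering on its trapper port, start the local
      # zabbix_server; once the primary is back, stop the local one again.
      import socket
      import subprocess
      import time

      PRIMARY = ("primary.example.com", 10051)   # assumed address of the primary
      CHECK_INTERVAL = 60                        # seconds between checks
      FAIL_THRESHOLD = 3                         # consecutive failures before failover
      INIT_SCRIPT = "/etc/init.d/zabbix_server"  # assumed init script on the secondary

      def primary_up(addr=PRIMARY, timeout=5):
          try:
              s = socket.create_connection(addr, timeout)
              s.close()
              return True
          except socket.error:
              return False

      def local_server(action):
          # action is "start" or "stop"
          subprocess.call([INIT_SCRIPT, action])

      if __name__ == "__main__":
          failures = 0
          local_running = False
          while True:
              if primary_up():
                  failures = 0
                  if local_running:
                      local_server("stop")   # primary is back, stand down
                      local_running = False
              else:
                  failures += 1
                  if failures >= FAIL_THRESHOLD and not local_running:
                      local_server("start")  # take over monitoring here
                      local_running = True
              time.sleep(CHECK_INTERVAL)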

      I usually serve the dnshat website from the secondary monitoring system, with the Zabbix PHP web interface used on the secondary (writing to the secondary database, which flows through replication to the primary database where the zabbix_server binaries are running). The primary Zabbix watches the secondary webserver: if the secondary fails, the primary shifts DNS resolution, sending web traffic to the primary server. If the primary fails, the secondary is already active in DNS for web traffic, so no shift is needed (just the scripts to start up the zabbix_server binary).

      This arrangement works well for me because I am really only using the Zabbix web monitoring pieces for my DNS failover services. It would be more complicated if I were connecting to Zabbix agents and needed the monitoring to originate only from a single IP preconfigured in the agents' conf files, since as far as I know they only allow one IP for the authorized Zabbix server. (I could see a script-based system fired off from the secondary that connects to a list of agents, changes their config files and restarts them to allow polling from the secondary's IP - tricky to build, but it could be done. A much better solution would be for the Zabbix agent config to allow multiple source Zabbix server IPs.)

      Just sharing what I'm doing - if you "master" (pun intended) MySQL replication - it opens up new possibilities for how you can architect HA capabilities using Zabbix.

      • DSon
        Member
        • Sep 2009
        • 44

        #4
        dnshat: FYI..

        Just read your HA solution and this sounds very flexible.

        One thing I noticed is that you mentioned the possibility of adding multiple Zabbix servers in the agent.conf.

        Well, you might be pleased to know that you can already do this (the addresses need to be separated by commas).

        There is unfortunately a small caveat to this function, namely that only the first IP address can be used for active checks. This probably won't be a problem for you, however, since you mentioned that you don't need to monitor agents.

        Other than that, you may find this function useful.

        Hope this helps,
        Danny.

        • r3dn3ck
          Member
          • Jul 2008
          • 43

          #5
          mysql + heartbeat + stonith + shared storage. Simple and effective. The only failover event to occur to date did so seamlessly.

          • DSon
            Member
            • Sep 2009
            • 44

            #6
            Stonith, or not? (split brain)

            re: Stonith - I have thus far read mixed opinions on whether or not this is needed.

            e.g. YES - if more than 2 nodes in a cluster, otherwise - NO.

            Having been running several two-node (Heartbeat/Pacemaker) clusters for a while now, I have already observed several "split brain" occurrences (DRBD for shared storage).

            Each time, manual recovery was needed (using drbdadm - nothing to do with Heartbeat, from what I could tell).

            What are other people's experiences in this area?

            i.e. can Stonith be used to avoid DRBD split-brain with only 2 nodes?

            Danny.

            • richlv
              Senior Member
              Zabbix Certified Trainer
              Zabbix Certified Specialist
              Zabbix Certified Professional
              • Oct 2005
              • 3112

              #7
              STONITH is needed if two nodes running some service at the same time can cause problems. Node count does not matter; this can be true even if you only have two nodes.
              Zabbix 3.0 Network Monitoring book

              • frankymryao
                Member
                • Oct 2011
                • 52

                #8
                A concept: update_percent - the percentage of a host's items that have been updated in the last few minutes. It is very accurate.
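
                (One way to read this idea - an untested sketch that assumes a Zabbix 1.8-style schema where the items table still carries a lastclock column and status = 0 means an enabled item; the connection details and host name are made up.)

                Code:
                #!/usr/bin/env python
                # Hypothetical "update_percent": for a given host, the share of its
                # enabled items whose lastclock falls inside the last few minutes.
                import time
                import MySQLdb

                WINDOW = 300  # "last few minutes", in seconds

                def update_percent(hostname, window=WINDOW):
                    conn = MySQLdb.connect(host="localhost", user="zabbix",
                                           passwd="secret", db="zabbix")
                    try:
                        cur = conn.cursor()
                        cur.execute(
                            "SELECT COUNT(*), SUM(i.lastclock > %s) "
                            "FROM items i JOIN hosts h ON h.hostid = i.hostid "
                            "WHERE h.host = %s AND i.status = 0",
                            (int(time.time()) - window, hostname))
                        total, recent = cur.fetchone()
                        if not total:
                            return 0.0
                        return 100.0 * float(recent or 0) / float(total)
                    finally:
                        conn.close()

                if __name__ == "__main__":
                    print("%.1f%% of items updated in the last %d seconds"
                          % (update_percent("example-host"), WINDOW))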
