Originally Posted by mcmyst
There is a point that worries me about how people manage automated failover with database in master-master replication ?
I mean, if we look at my first idea, we suppose that the link between the 2 data centers break down : the zabbix primary server will keep running AND the second one will start up !! Here start the problem : 2 zabbix server writing on the database in the same time. When the replication will be able to synchronize (when the link between the 2 places will be ok) the database will be completely corrupted...
Thank you for you replies !
I don't know if this topic is still watched, but I might have some informations to the master-master replication.
If you're running a MySQL master/master (active/active) setup and the link between the both of them will break, there's no "auto" startup of the secondary zabbix server. You've to do that w/ separate tools like heartbeat and pacemaker - most likely you'll already know that
We're talking here about a master/master (active/active) MySQL setup in a active/passive cluster setup. This means the MySQL replication between the active and passive cluster nodes is running all the time, while the Zabbix services are only running on the active cluster node.
Now expect that the required configuration of a cluster (e.g. heartbeat + pacemaker) is already done and everything works as designed. Now the link between both data centers is going down and heartbeat reports a failure on both nodes. At this moment, both nodes are "thinking": "hey I'm alive and my partner's dead" - we call this a split-brain syndrome.
While all services on the active cluster node are still running, the passive cluster node is starting all Zabbix related services. So at this point, both cluster nodes are active and writing to their own DB (now w/o replication).
Now after a while the link between both data centers is coming back online, which means both cluster nodes can see each other again. Depending on the configuration, an alert is sent or a cluster node is going passive again (i.e. shutting down all Zabbix related services). MySQL is keeping up & running on both nodes, since it's not a cluster related service.
And here's where the MySQL replication comes into play. In a replication setup, there's always a master and a slave. The master is logging all transactions (called binary log) and the slave is registered to that binary log. A slave always knows it's position in this log and can catch up w/ the master, even if it's disconnected for a while (e.g. if slave has position 20 and master has position 25, then the slave has to catch up all transactions from 20 to 25). Now in a classic master/slave setup, only the master is allowed to write, because the slave can never be ahead of the master (because it's only one way replication). But in a master/master setup, both MySQL servers are masters and both MySQL servers are slave from each other. So they're able to write both to their databases and catch up, as soon as they're connected again (because every server has it's own binary log).
BUT THIS IS IMPORTANT: If both severs will write to the same tables, then you might have a problem w/ auto-increment values. To avoid that problem, you've to set the following configuration variables on the servers:
This will increment all auto-increment fields by 2 instead of 1. On the server 1 you've an offset of 1 (i.e. 27, 29, 31, ...) and on the server 2 you've an offset of 2 (i.e. 28, 30, 32, ...).