I am building a large-scale Zabbix deployment that includes distributed monitoring of remote sites. High availability is a requirement for the central node (node 1). Node 1 will also monitor a large number of local resources, which limits the usefulness of an active/passive high-availability solution.
In early trials I have run into the following problems:
1) Database primary keys: Zabbix does its own form of auto increment for the database tables, which rules out technologies like multi-master replication for database high availability. While MySQL does have a cluster option, multi-master replication was the preferred database solution because it is less complex to manage. With auto-increment primary keys, each zabbix_server in the HA cluster would connect to its own local database and changes would be replicated to the other servers, providing a mechanism to scale the central server further as needed.
2) Second zabbix_server startup: In initial observations, a second zabbix_server process using the same database appeared to refuse to start unless there were items due to be checked within some specific time frame. Once the second server had launched successfully, I did not observe any operational problems other than the database problems mentioned above. If this is caused by not having items ready to be checked, it would likely resolve itself as more checks are added during the full deployment.
3) Unknown: multiple servers running active host checks? If multiple zabbix_server processes are running, do active checks get run from all of them?
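To make item 1 concrete, here is a minimal sketch of why native auto-increment keys can work with multi-master replication. It assumes the masters would be configured with MySQL's `auto_increment_increment` and `auto_increment_offset` variables so that each one hands out IDs from its own interleaved sequence; that configuration is my assumption, not something I have tested with Zabbix, and the function names below are purely illustrative. The code only simulates the arithmetic and does not talk to a database.

```python
from itertools import islice


def id_stream(offset, increment):
    """Yield the IDs one master would allocate: offset, offset + increment, ..."""
    next_id = offset
    while True:
        yield next_id
        next_id += increment


def allocate(num_masters, per_master):
    """Return {master_number: [ids]} for masters numbered 1..num_masters.

    Hypothetical setup: master m uses offset=m and increment=num_masters,
    mirroring auto_increment_offset / auto_increment_increment.
    """
    return {
        m: list(islice(id_stream(m, num_masters), per_master))
        for m in range(1, num_masters + 1)
    }


ids = allocate(num_masters=2, per_master=4)
print(ids)  # {1: [1, 3, 5, 7], 2: [2, 4, 6, 8]}
```

Because the two sequences never overlap, rows inserted on either master can be replicated to the other without primary key collisions. As far as I can tell, Zabbix's own ID allocation gives no such guarantee when two servers write to separate databases.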
The ideal deployment in this case would be a server farm. Each active check would effectively be handled by a random host in the farm. Connections to the Zabbix server from agents, proxies, or subordinate servers would go to a load-balanced server farm address. An extension of this design that should be equally feasible is a server farm of zabbix_proxy servers.
Is anyone else pursuing a similar Zabbix deployment?