Master/Child Node in Large Distributed Monitoring (DM) Environments
Tested on: Zabbix version 1.5.3 Beta with a MySQL 5.0 database ONLY
Note: Before I get started, I would like to mention that this is my own hack, and that does not mean it will be supported by future versions of Zabbix. I am posting this in case it might benefit someone else. I spent several weeks testing and tweaking this with 893 hosts and 131,974 active items on a single child node pushing history to the master node, with very good performance. I was able to push history to the master node every 10 seconds, and configuration changes propagated within 5 minutes without skipping a beat on the history and graphs. I was also able to sync latest data so that lastcheck, lastvalue, and prevalue (Last check, Last value and Change) show up under Latest data for a child node, since this no longer impacts the history syncs.

If you do decide to try this hack, please remember to back up your database before you get started.
Problems found with the current settings for a Child/Master node setup in large environments with over 100 hosts and 20,000 active items
• The TCP connection on port 10051 between the master and child nodes times out while waiting for data to be returned from the child or master node.
• History data pauses for 5 to 10 minutes or more on graphs while nodesync syncs configuration changes between the nodes. Eventually history data can never catch back up because it falls farther and farther behind.
• The function “calculate_checksums” increases the database load and causes deadlocks, because it uses “UNION ALL” to hash all tables marked ZBX_SYNC in dbschema.c at once.
• If a template was removed from a host and re-added on the master node, the INSERTs for its items would sometimes occur before the DELETEs, causing corrupted items on the child node. You can tell this is happening because there will be duplicate key warnings in zabbix_server.log on the child node.
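For anyone who wants to see the last problem in miniature, here is a rough sketch. It uses an in-memory SQLite database instead of MySQL, and the table layout, the unique index on (hostid, key_) and the change tuples are simplified stand-ins I made up for illustration, not the real Zabbix schema or sync format. It only shows why an INSERT that arrives before the matching DELETE produces a duplicate key error, and that applying deletes first avoids it.

```python
import sqlite3

def apply_changes(changes):
    """Apply (op, itemid, key) changes to a fresh table and return the final rows."""
    db = sqlite3.connect(":memory:")
    # Hypothetical, simplified stand-in for the real items table.
    db.execute(
        "CREATE TABLE items (itemid INTEGER PRIMARY KEY, hostid INTEGER, "
        "key_ TEXT, UNIQUE (hostid, key_))"
    )
    db.execute("INSERT INTO items VALUES (1, 10, 'system.cpu.load')")
    for op, itemid, key in changes:
        if op == "delete":
            db.execute("DELETE FROM items WHERE itemid = ?", (itemid,))
        else:
            db.execute("INSERT INTO items VALUES (?, 10, ?)", (itemid, key))
    return db.execute("SELECT itemid, key_ FROM items ORDER BY itemid").fetchall()

# Template removed and re-added: old item 1 is deleted, new item 2 reuses the key.
changes = [("insert", 2, "system.cpu.load"), ("delete", 1, None)]

try:
    apply_changes(changes)
    arrival_order_ok = True
except sqlite3.IntegrityError:
    # The insert collides with the row that has not been deleted yet.
    arrival_order_ok = False

# Stable sort: deletes (key False) are applied before inserts (key True).
deletes_first = sorted(changes, key=lambda c: c[0] != "delete")
result = apply_changes(deletes_first)
```

In arrival order the insert fails with an integrity error; with deletes applied first only the new item remains.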
Solution
• Increased the timeout value on the trapper port (10051/tcp) to 10800 seconds (3 hours).
• Created separate daemons for nodesync_data (history, trends, etc.) and nodesync_config (configuration changes) so that configuration changes can be made while history is being pushed.
• Sorted the “calculate_checksums” data so that DELETEs occur before INSERTs.
• Created the node_cksum_temp table so that checksums can be determined one table (items, hosts, etc.) at a time, which decreases the number of deadlocks on the tables and reduces load on the database.
• Inserted records into the database on the master node with LOW_PRIORITY to reduce deadlocks on the database.
• Increased ZBX_ITEMS_SIZE 10000 to support 200,000 items in dbcaching.
Note: Most changes have to be done on both the master and child nodes.
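To give an idea of what "one table at a time" means for the checksums, here is a minimal sketch. It is not the code from the patch: it runs against in-memory SQLite, the per-row MD5 over comma-joined column values is my own assumption, cksumtype is simply set to 0, and the hosts/items tables are cut-down placeholders. The point is only that each table gets its own short pass and commit instead of one big UNION ALL over everything.

```python
import hashlib
import sqlite3

def checksum_table(db, nodeid, table, id_column):
    """Hash every row of one table and store the results in node_cksum_temp."""
    # Table and column names come from a trusted, hard-coded list in this sketch.
    rows = db.execute(f"SELECT * FROM {table} ORDER BY {id_column}").fetchall()
    for row in rows:
        # Assumed checksum format: MD5 of the comma-joined column values.
        digest = hashlib.md5(",".join(map(str, row)).encode()).hexdigest()
        db.execute(
            "INSERT INTO node_cksum_temp "
            "(nodeid, tablename, recordid, cksumtype, cksum) VALUES (?, ?, ?, 0, ?)",
            (nodeid, table, row[0], digest),  # row[0] is the id column here
        )
    db.commit()  # one short transaction per table
    return len(rows)

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE node_cksum_temp (nodeid INTEGER, tablename TEXT, "
    "recordid INTEGER, cksumtype INTEGER, cksum TEXT)"
)
# Cut-down placeholder tables, not the real Zabbix schema.
db.execute("CREATE TABLE hosts (hostid INTEGER PRIMARY KEY, host TEXT)")
db.execute("CREATE TABLE items (itemid INTEGER PRIMARY KEY, hostid INTEGER, key_ TEXT)")
db.executemany("INSERT INTO hosts VALUES (?, ?)", [(10, "web01"), (11, "db01")])
db.execute("INSERT INTO items VALUES (1, 10, 'system.cpu.load')")

# Walk the synced tables one by one.
counts = {
    table: checksum_table(db, 1, table, id_column)
    for table, id_column in [("hosts", "hostid"), ("items", "itemid")]
}
stored = db.execute("SELECT COUNT(*) FROM node_cksum_temp").fetchone()[0]
```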
Setup
1. Stop all Zabbix services on master and child nodes.
2. Back up your databases on both the master and child nodes.
3. Modify /etc/zabbix/zabbix_server.conf on both the master and child nodes by changing the following entry:
TrapperTimeout=10800
4. Create the following table named node_cksum_temp in the Zabbix database on both the master and child nodes for temporary checksum data:
MySQL database schema for table “node_cksum_temp”
Code:
CREATE TABLE `node_cksum_temp` (
  `nodeid` int(11) NOT NULL default '0',
  `tablename` varchar(64) NOT NULL default '',
  `recordid` bigint(20) unsigned NOT NULL default '0',
  `cksumtype` int(11) NOT NULL default '0',
  `cksum` text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
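5. On both the master and child nodes, install the attached patch NodeSync_Patch-Version_1.5.3_20080614_2308.patch:
zabbix-original (original version directory)
zabbix (patched version directory)
patch -p0 < NodeSync_Patch-Version_1.5.3_20080614_2308.patch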
6. Restart zabbix_agentd and zabbix_server on the master node.
7. Restart zabbix_agentd and zabbix_server on the child node.
8. Watch zabbix_server.log for NODE messages.
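If you want a quick sanity check that step 3 took effect on each node before restarting, something like the sketch below could be used. It is just a small helper of my own, not part of Zabbix or the patch; it reads the configuration text, skips commented lines, and returns the TrapperTimeout value. The sample text is made up for the example; on a real node you would read /etc/zabbix/zabbix_server.conf instead.

```python
def trapper_timeout(conf_text):
    """Return the TrapperTimeout value from zabbix_server.conf text, or None."""
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip comments and anything that is not a name=value pair.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name.strip() == "TrapperTimeout":
            return int(value.strip())
    return None

# Made-up sample; the commented-out line must be ignored.
sample = "# TrapperTimeout=5\nTrapperTimeout=10800\n"
value = trapper_timeout(sample)
is_three_hours = value == 3 * 60 * 60
```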
