Hi
I had been running Zabbix 1.4.5.2 for about 3 months with 2 nodes. The slave node was monitoring close to 200 hosts (about 28,000 items and 16,000 triggers), and the master node was monitoring just a thousand or so items.
Everything seemed to be going fine, but I wanted to move over to 1.6, mainly because of the mass update for host information. The 2 sites are in different parts of the country, with ADSL2 links at each end.
I updated the MySQL databases on both ends and compiled Zabbix 1.6 (both ends run CentOS 5). I found that the data was not updating to the master node. I made a backup of the Zabbix databases and started from scratch, recreating the 1.6 database on both ends and starting to add the hosts. I added just a few hosts to one end and found that the server was trying to send 17 MB across to the other node. I kept seeing "Timeout while answering request" in the log, and no other data was being sent from one node to the other. If I restarted the node that was sending the data, I would see it trying to send the same 17 MB again, and then after about 15 minutes the timeout error again.
I started again from scratch today and added only 6 host groups (no hosts, just the groups) on the master node, and they replicated to the slave node. I then added 10 hosts to the slave node, and they replicated as expected to the master node. Once this had happened, I imported my template and linked the template (198 items) to the 10 hosts on the master node. This information replicated to the slave, and it started to capture data.
This is when the problem starts: the master now tries to send about 1.7 MB of data to the slave, but it fails:
19163:20081005:121258 NODE 8: Received data from master node 5 for node 8 datalen 1845379
On the slave I see:
14965:20081005:123006 Timeout while answering request
14965:20081005:123006 NODE 5: Error while receiving answer from Node [8] error: ZBX_TCP_READ() failed [Interrupted system call]
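To put rough numbers on this, here is a minimal sketch that takes the "Received data" line and the timeout line above and works out how long the transfer had before it timed out. It assumes the log prefix is PID:YYYYMMDD:HHMMSS, that datalen is in bytes, and that the clocks on the two servers are roughly in sync (the lines come from different machines, so the result is only approximate). The helper name parse_stamp is made up for this example.

```python
from datetime import datetime

def parse_stamp(line):
    """Return the timestamp from a log line of the assumed form 'PID:YYYYMMDD:HHMMSS message'."""
    _pid, date, rest = line.split(":", 2)
    # The time is the first whitespace-separated token after the second colon.
    return datetime.strptime(date + rest.split()[0], "%Y%m%d%H%M%S")

# The two log lines quoted above, copied as-is.
received = "19163:20081005:121258 NODE 8: Received data from master node 5 for node 8 datalen 1845379"
timeout = "14965:20081005:123006 Timeout while answering request"

# Seconds between the data arriving and the timeout being logged.
elapsed = (parse_stamp(timeout) - parse_stamp(received)).total_seconds()

# Payload size is the last token on the line (assumed to be bytes).
datalen = int(received.rsplit(" ", 1)[1])

# Average rate the transfer would have needed to finish inside that window.
rate = datalen / elapsed

print(f"{datalen} bytes in {elapsed:.0f} s is about {rate:.0f} bytes/s")
```

If the clocks are close, that is a window of roughly 17 minutes for under 2 MB, which seems low for an ADSL2 link, so the link speed on its own may not explain the timeout.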
If I restart the master Zabbix server, it tries to do the same thing again. The interesting thing is that the master is getting history from the slave, but when I look at the monitoring screens, they show no data at all (the slave GUI is showing data for the hosts):
14964:20081005:123713 NODE 5: Received history from node 8 for node 8 datalen 57767
14971:20081005:123716 NODE 5: Received history_uint from node 8 for node 8 datalen 33788
14967:20081005:123717 NODE 5: Received history_str from node 8 for node 8 datalen 594
14967:20081005:123724 NODE 5: Received data from slave node 8 for node 8 datalen 8
When I connect to the MySQL database from MS Access using the ODBC driver, I can see that the master node does have data in the history, history_str and history_uint tables.
I am really confused and have no idea what else to try.
Is there a way to compress the data (or is it compressed already?), or to increase the timeout period, so that the 2 servers can work like they did on 1.4.5.2?
Thanks
Paul