After making some config changes, my master node (1) has stopped receiving history updates from the slave (2). If I restart the master, occasionally some history data will sneak through, but once node 2 sends over its config, history syncing stops. How do I troubleshoot this?
I saw some duplicate insert issues:
4176:20090723:163700 [Z3005] Query failed: [1062] Duplicate entry '200100000000056-Services' for key 2 [insert into applications (applicationid,hostid,name,templateid) values (200200000001057,200100000000056,'Services',200100000000012)]
So I cleared out the applications table and let it repopulate. I no longer see those errors in the log, but history syncing still stops as soon as the config goes over.
Here's some log info:
31389:20090723:170830 NODE 2: Unable to connect to Node [1] error: *** Cannot connect to [x.x.x.x]:10051 [Connection refused]
31389:20090723:170839 NODE 2: Sending history_sync of node 2 to node 1 datalen 352648
31389:20090723:170847 NODE 2: Sending history_uint_sync of node 2 to node 1 datalen 338537
31389:20090723:170852 NODE 2: Sending history_str_sync of node 2 to node 1 datalen 14300
31389:20090723:170853 NODE 2: Sending history_text of node 2 to node 1 datalen 3084
31389:20090723:170854 NODE 2: Sending auditlog of node 2 to node 1 datalen 110
31389:20090723:170855 NODE 2: Sending history_sync of node 2 to node 1 datalen 352555
31389:20090723:170910 NODE 2: Sending history_uint_sync of node 2 to node 1 datalen 337068
31389:20090723:170918 NODE 2: Sending history_str_sync of node 2 to node 1 datalen 658
31389:20090723:170918 NODE 2: Sending history_text of node 2 to node 1 datalen 126
31389:20090723:170920 NODE 2: Sending history_sync of node 2 to node 1 datalen 352605
31389:20090723:170935 NODE 2: Sending history_uint_sync of node 2 to node 1 datalen 337386
31389:20090723:170947 NODE 2: Sending history_str_sync of node 2 to node 1 datalen 264
31389:20090723:170947 NODE 2: Sending history_text of node 2 to node 1 datalen 126
31373:20090723:170948 Send list of active checks to [10.252.115.6] failed: host [demo1] not found
31389:20090723:171014 NODE 2: Sending configuration changes to master node 1 for node 2 datalen 3399648
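To compare what node 2 claims to send against what the master actually stores, it can help to total the `datalen` values per sync type. The following is only a rough sketch: the regular expression is inferred from the log lines above rather than taken from any Zabbix documentation, and `summarize_sync` is a hypothetical helper name. The "configuration changes" line is worded differently and is deliberately not matched.

```python
import re
from collections import defaultdict

# Assumed line shape, inferred from the excerpt above:
# <pid>:<yyyymmdd>:<hhmmss> NODE <n>: Sending <what> of node <n> to node <m> datalen <bytes>
SYNC_RE = re.compile(
    r"^\d+:\d{8}:\d{6} NODE \d+: "
    r"Sending (?P<what>\S+) of node \d+ to node \d+ datalen (?P<len>\d+)$"
)


def summarize_sync(lines):
    """Return {sync_type: (message_count, total_datalen)} for matching lines."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = SYNC_RE.match(line.strip())
        if match is None:
            continue  # not a history/auditlog sync line
        entry = totals[match.group("what")]
        entry[0] += 1
        entry[1] += int(match.group("len"))
    return {what: tuple(counts) for what, counts in totals.items()}


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python sync_summary.py < zabbix_server.log
    for what, (count, total) in sorted(summarize_sync(sys.stdin).items()):
        print(f"{what}: {count} messages, {total} bytes")
```

Running this over the slave log before and after the config transfer should show whether the history messages stop being sent at all or are sent but not applied on the master.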
On the master I set the debug level to 4 and found the following:
9798:20090723:172728 Query [select min(clock) from history where itemid=200100000003375]
9798:20090723:172728 In delete_history(history_uint,200100000003375,7,1248384414)
9798:20090723:172728 Query [select min(clock) from history_uint where itemid=200100000003375]
9798:20090723:172728 Query [delete from history_uint where itemid=200100000003375 and clock<1247779614]
9785:20090723:172728 In process_record [items<AD>200100000008665<AD>0<AD>status<AD>0<AD>0<AD>error<AD>1<AD>]
9785:20090723:172728 Query [select 0 from items where itemid=200100000008665]
9785:20090723:172728 Query [update items set status=0,error='' where itemid=200100000008665]
9785:20090723:172728 In calculate_checksums
That error occurs after the master receives the slave's config. I also noticed that the datalen values are not equal, and I'm not sure how that could happen.
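One thing that can be checked from the debug excerpt itself is whether the `delete_history` lines are just routine history cleanup rather than part of the failure. The sketch below assumes, based only on how the numbers line up, that the third argument (7) is a retention period in days and the fourth (1248384414) is the current Unix time; `history_cutoff` is a hypothetical helper, not a Zabbix function.

```python
SECONDS_PER_DAY = 86400


def history_cutoff(now, keep_days):
    """Oldest timestamp that would be kept for the given retention period."""
    return now - keep_days * SECONDS_PER_DAY


# Values taken from the delete_history(...) debug line above.
cutoff = history_cutoff(1248384414, 7)
print(cutoff)  # prints 1247779614
```

The result equals the `clock<1247779614` bound in the logged delete query, which suggests those lines are ordinary 7-day cleanup for that item and are probably unrelated to the sync problem.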
Any help would be greatly appreciated.
-Chris