Node 1 and Node 2 fail to sync history

  • chrisf
    Junior Member
    • Apr 2009
    • 25

    #1

    Node 1 and Node 2 fail to sync history

    After making some config changes, my master node (1) stopped receiving history updates from the slave (2). If I restart the master, some history data occasionally sneaks through, but once node 2 sends over its config, history syncing stops. How do I troubleshoot this?

    I saw some duplicate insert issues:
    4176:20090723:163700 [Z3005] Query failed: [1062] Duplicate entry '200100000000056-Services' for key 2 [insert into applications (applicationid,hostid,name,templateid) values (200200000001057,200100000000056,'Services',200100000000012)]

    So I cleared out the applications table and let it repopulate. I no longer see those errors in the log, but history syncing still stops once the config goes over. How do I troubleshoot this?
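
For anyone sorting through a log full of these, here is a minimal Python sketch that pulls the table name, the duplicated value, and the key number out of a "[Z3005] Query failed: [1062] Duplicate entry" line. The regular expression is based only on the line format quoted above, and the function name `parse_duplicate` is made up for illustration; other Zabbix versions may log this differently.

```python
import re

# Matches the duplicate-insert lines quoted in this thread. The pattern is
# an assumption derived from that one log format, not from Zabbix sources.
DUP_RE = re.compile(
    r"Query failed: \[1062\] Duplicate entry '(?P<entry>.*?)' for key (?P<key>\S+) "
    r"\[insert into (?P<table>\w+)"
)


def parse_duplicate(line):
    """Return (table, entry, key) for a duplicate-insert log line, else None."""
    m = DUP_RE.search(line)
    if m is None:
        return None
    return m.group("table"), m.group("entry"), m.group("key")


sample = (
    "4176:20090723:163700 [Z3005] Query failed: [1062] Duplicate entry "
    "'200100000000056-Services' for key 2 [insert into applications "
    "(applicationid,hostid,name,templateid) values "
    "(200200000001057,200100000000056,'Services',200100000000012)]"
)

print(parse_duplicate(sample))
```

Grouping the results by table shows which tables have rows that collide with what the slave sends, without having to clear a whole table first.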

    Here's some log info:

    31389:20090723:170830 NODE 2: Unable to connect to Node [1] error: *** Cannot connect to [x.x.x.x]:10051 [Connection refused]
    31389:20090723:170839 NODE 2: Sending history_sync of node 2 to node 1 datalen 352648
    31389:20090723:170847 NODE 2: Sending history_uint_sync of node 2 to node 1 datalen 338537
    31389:20090723:170852 NODE 2: Sending history_str_sync of node 2 to node 1 datalen 14300
    31389:20090723:170853 NODE 2: Sending history_text of node 2 to node 1 datalen 3084
    31389:20090723:170854 NODE 2: Sending auditlog of node 2 to node 1 datalen 110
    31389:20090723:170855 NODE 2: Sending history_sync of node 2 to node 1 datalen 352555
    31389:20090723:170910 NODE 2: Sending history_uint_sync of node 2 to node 1 datalen 337068
    31389:20090723:170918 NODE 2: Sending history_str_sync of node 2 to node 1 datalen 658
    31389:20090723:170918 NODE 2: Sending history_text of node 2 to node 1 datalen 126
    31389:20090723:170920 NODE 2: Sending history_sync of node 2 to node 1 datalen 352605
    31389:20090723:170935 NODE 2: Sending history_uint_sync of node 2 to node 1 datalen 337386
    31389:20090723:170947 NODE 2: Sending history_str_sync of node 2 to node 1 datalen 264
    31389:20090723:170947 NODE 2: Sending history_text of node 2 to node 1 datalen 126
    31373:20090723:170948 Send list of active checks to [10.252.115.6] failed: host [demo1] not found
    31389:20090723:171014 NODE 2: Sending configuration changes to master node 1 for node 2 datalen 3399648
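
To see at a glance what the slave is sending, the "Sending ... datalen" lines above can be tallied per sync type. This is a rough Python sketch that assumes the log format shown in this post; `summarize_sends` is a hypothetical helper name. It deliberately skips the "Sending configuration changes" line, which has a different wording.

```python
import re
from collections import defaultdict

# Matches e.g. "NODE 2: Sending history_sync of node 2 to node 1 datalen 352648".
# The pattern is an assumption based on the log lines quoted in this thread.
SEND_RE = re.compile(r"NODE \d+: Sending (\w+) of node \d+ to node \d+ datalen (\d+)")


def summarize_sends(lines):
    """Return {sync_type: (number_of_sends, total_datalen)}."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        m = SEND_RE.search(line)
        if m:
            entry = totals[m.group(1)]
            entry[0] += 1
            entry[1] += int(m.group(2))
    return {name: tuple(values) for name, values in totals.items()}


log_lines = [
    "31389:20090723:170839 NODE 2: Sending history_sync of node 2 to node 1 datalen 352648",
    "31389:20090723:170847 NODE 2: Sending history_uint_sync of node 2 to node 1 datalen 338537",
    "31389:20090723:170855 NODE 2: Sending history_sync of node 2 to node 1 datalen 352555",
    "31389:20090723:171014 NODE 2: Sending configuration changes to master node 1 for node 2 datalen 3399648",
]

for name, (count, total) in summarize_sends(log_lines).items():
    print(name, count, total)
```

Running it over the slave log before and after the configuration line would show whether the history sends really stop at that point.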

    On the master I set the debug level to 4 and found the following:
    9798:20090723:172728 Query [select min(clock) from history where itemid=200100000003375]
    9798:20090723:172728 In delete_history(history_uint,200100000003375,7,1248384414)
    9798:20090723:172728 Query [select min(clock) from history_uint where itemid=200100000003375]
    9798:20090723:172728 Query [delete from history_uint where itemid=200100000003375 and clock<1247779614]
    9785:20090723:172728 In process_record [items<AD>200100000008665<AD>0<AD>status<AD>0<AD>0<AD>error<AD>1<AD>]
    9785:20090723:172728 Query [select 0 from items where itemid=200100000008665]
    9785:20090723:172728 Query [update items set status=0,error='' where itemid=200100000008665]
    9785:20090723:172728 In calculate_checksums


    That happens after the master receives the slave's config. I also noticed that the datalen values are not equal, and I'm not sure how that could happen.
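
A note on reading the delete_history lines in that debug output: the numbers are self-consistent if the third argument (7) is the number of days of history to keep and the fourth is the current Unix time. That reading is an assumption from the log alone, not from the Zabbix source. A small Python check of the arithmetic:

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 24 * 60 * 60

# Values copied from the delete_history(...) debug line. Treating them as
# "days to keep" and "current time" is an assumption.
now = 1248384414
keep_days = 7

# Records older than this timestamp would be deleted.
cutoff = now - keep_days * SECONDS_PER_DAY
print(cutoff)

# Human-readable form of both timestamps, in UTC.
print(datetime.fromtimestamp(now, tz=timezone.utc))
print(datetime.fromtimestamp(cutoff, tz=timezone.utc))
```

The computed cutoff equals the 1247779614 in the "delete from history_uint ... clock<1247779614" query, so those lines look like routine history cleanup rather than part of the sync failure.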

    Any help would be greatly appreciated.

    -Chris
  • chrisf
    Junior Member
    • Apr 2009
    • 25

    #2
    Additional info

    Looks like this is an issue with the configs getting out of sync, but I have no idea how, since I make all config changes on the master node.

    I've dropped the Debug level to 3 and what I see is:

    16820:20090723:192157 NODE 1: Received data from slave node 2 for node 2 datalen 3416571
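
On the datalen mismatch: the 3399648 sent by node 2 was logged at 17:10:14 and the 3416571 received by node 1 at 19:21:57, so these may simply be two different transfers with a slightly different config in between. To compare like with like, the send and receive lines can be extracted with their timestamps from both logs. A minimal Python sketch, assuming the two line formats quoted in this thread (`config_transfers` is a hypothetical helper name):

```python
import re

# Patterns are assumptions based on the two log lines quoted in this thread.
SENT_RE = re.compile(
    r"(\d{8}:\d{6}) NODE \d+: Sending configuration changes to master node \d+ "
    r"for node \d+ datalen (\d+)"
)
RECV_RE = re.compile(
    r"(\d{8}:\d{6}) NODE \d+: Received data from slave node \d+ "
    r"for node \d+ datalen (\d+)"
)


def config_transfers(lines, pattern):
    """Return a list of (timestamp, datalen) for lines matching pattern."""
    found = []
    for line in lines:
        m = pattern.search(line)
        if m:
            found.append((m.group(1), int(m.group(2))))
    return found


slave_log = [
    "31389:20090723:171014 NODE 2: Sending configuration changes to master node 1 for node 2 datalen 3399648",
]
master_log = [
    "16820:20090723:192157 NODE 1: Received data from slave node 2 for node 2 datalen 3416571",
]

print(config_transfers(slave_log, SENT_RE))
print(config_transfers(master_log, RECV_RE))
```

If a send and a receive a few seconds apart still show different datalen values, that would point to a truncated or interrupted transfer; if they match, the difference above is just the config changing over time.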

    Here's some dupe info:
    15078:20090723:185542 [Z3005] Query failed: [1062] Duplicate entry '200200000010073-vfs.fs.inode[/,free]' for key 2 [insert into items (itemid,type,snmp_community,snmp_oid,snmp_port,hostid,description,key_,delay,history,trends,status,value_type,trapper_hosts,units,multiplier,delta,snmpv3_securityname,snmpv3_securitylevel,snmpv3_authpassphrase,snmpv3_privpassphrase,formula,error,logtimefmt,templateid,valuemapid,delay_flex,params,ipmi_sensor) values (200200000026608,0,'','',161,200200000010073,'Free number of inodes on $1','vfs.fs.inode[/,free]',60,7,365,0,3,'','',0,0,'',0,'','','0','','',200100000000030,0,'','','')]

    15078:20090723:185522 [Z3005] Query failed: [1062] Duplicate entry '200100000000056-200100000000001' for key 2 [insert into hosts_templates (hosttemplateid,hostid,templateid) values (200200000000090,200100000000056,200100000000001)]

    Now I tried deleting everything from these tables and restarting both systems to see if the sync would restore the data from the slave to the master... then I could restore the dump I made and ignore the failed inserts.

    But after a restart the tables remained empty, which makes me believe the master is no longer attempting to insert the data from the slave.

    I verified this by restoring the items and hosts_templates tables and restarting the master and slave again.

    We see the familiar:
    16820:20090723:192157 NODE 1: Received data from slave node 2 for node 2 datalen 3416571

    But I no longer see the duplicate insert errors, which would mean the master is ignoring the config from the slave.

    Can someone point me in the right direction here? Is there a procedure to delete the child and "resync"?

    • xaeth
      Member
      • Nov 2004
      • 67

      #3
      We just started having this issue this week as well. Any responses would be great.


      Another error we were getting along with it is:

      NODE 1: Error while receiving answer from Node [2] error: ZBX_TCP_READ() failed [Interrupted system call]

      and

      12475:20090724:114721 NODE 2: Unable to connect to Node [1] error: Cannot connect to [156.132.82.26:10051] [Connection refused]
      12475:20090724:115037 NODE 2: Error while receiving answer from Node [1] error: ZBX_TCP_READ() failed [Connection reset by peer]
      12475:20090724:115107 NODE 2: Error while sending data to Node [1] error: ZBX_TCP_WRITE() failed [Connection reset by peer]
      12475:20090724:115107 NODE 2: Unable to connect to Node [1] error: Cannot connect to [156.132.82.26:10051] [Connection refused]


      The network connectivity between the 2 nodes looks fine, so this is very confusing.
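
"Connection refused" usually means the host answered but nothing accepted the connection on that port, so ping-level connectivity can look fine while the port itself is closed, for example while the server process on the other node is restarting. A quick way to test the port is a plain TCP connect. This is a minimal Python sketch; the host below is a placeholder and `can_connect` is a made-up helper name, while 10051 is the port from the log lines above.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# Placeholder address: replace with the other node's IP before running.
print(can_connect("127.0.0.1", 10051))
```

Running this from node 2 against node 1 at the moments the errors appear would show whether the port is really closed at those times or whether the connection is being dropped later.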
