DM node sync problems

  • xs-
    Senior Member
    Zabbix Certified Specialist
    • Dec 2007
    • 393

    #1

    DM node sync problems

    Hi all,

    Situation
    I've got a DM setup consisting of 2 slave nodes and 1 master node.
    This setup has been running correctly for 6 months or so.

    I recently reinstalled the OS on one of the nodes, which took that node offline for 1 day. Apparently this made the master or the child node think it was out of sync, and the child stopped sending its regular updates to the master.

    I have had some experience in this as in 1.4 this would cause the child node to redo the whole sync (all history, trends, etc), but it seems it's not doing this now.
    When i restart the child node, it sends 1 'data' (i think configuration) update and that's it (looking at the master node logfile).

    I did recently upgrade from 1.6.2 to 1.6.4 tho, i read something was changed in the way syncs are done.

    Info:
    Zabbix servers: 2 child nodes -> 1 master node
    Zabbix version: 1.6.4
    Mysql version 5.0 (used by all nodes)
    OS: Ubuntu-8.04LTS (used by all nodes)

    Question
    - Is there a way to force resync somehow (i don't think there is an official method for this, is there?)
    - Is there a way to reset the sync status, i.e. by clearing out a sync table (i have had mixed results with this in the past and am a bit reluctant to do this without confirmation)
  • xs-
    Senior Member
    Zabbix Certified Specialist
    • Dec 2007
    • 393

    #2
    Some extra info.

    For some reason one of the child nodes has gone out of sync (in this case due to it being offline for a day).
    After this the child will start to 'try' and get in sync with the master, but for some reason it stops or chokes.

    I have tried clearing out all *_sync tables and restarting zabbix on the child node (there were several million entries here and counting). After this, it just started counting from 0 and is still going. No sync =\.

    Help

    • xs-
      Senior Member
      Zabbix Certified Specialist
      • Dec 2007
      • 393

      #3
      update

      After some more testing, it seems the whole sync process does start, but stops at a certain point, reason unknown.

      I am testing this by restarting the 'broken' child node, which restarts the sync process. Each time i restart it and look at the master node log, i see 1 data update (in the log) and a bit later it shows a 'Timeout while answering request'. After this, there are no more sync related log messages for this node on either the master or the child node.

      I have set the master node to debugging with the hope of seeing what causes this.
      The following debug log snippet is related to this issue, but it shows no clear reason why this is failing. The 'broken' child node has nodeid 30.
      11781:20090422:081022 In update_checksums
      11781:20090422:081022 Query [select curr.tablename,curr.recordid,prev.cksum,curr.cksum,prev.sync from node_cksum curr, node_cksum prev where curr.nodeid=30 and prev.nodeid=curr.nodeid and curr.tablename=prev.tablename and curr.recordid=prev.recordid and curr.cksumtype=1 and prev.cksumtype=0 and curr.tablename='functions' and curr.recordid=3001000000000836 union all select curr.tablename,curr.recordid,prev.cksum,curr.cksum,NULL from node_cksum curr left join node_cksum prev on prev.nodeid=curr.nodeid and prev.tablename=curr.tablename and prev.recordid=curr.recordid and prev.cksumtype=0 where curr.nodeid=30 and curr.cksumtype=1 and prev.tablename is null and curr.tablename='functions' and curr.recordid=3001000000000836 union all select prev.tablename,prev.recordid,prev.cksum,curr.cksum,prev.sync from node_cksum prev left join node_cksum curr on curr.nodeid=prev.nodeid and curr.tablename=prev.tablename and curr.recordid=prev.recordid and curr.cksumtype=1 where prev.nodeid=30 and prev.cksumtype=0 and curr.tablename is null and prev.tablename='functions' and prev.recordid=3001000000000836]
      11781:20090422:081022 In process_record [functions*3001000000000837*0*itemid*6*3003000000004920*triggerid*6*3003000000004533*function*1*6c617374*parameter*1*30]
      11781:20090422:081022 Query [select 0 from functions where functionid=3001000000000837]
      11781:20090422:081022 Query [update functions set itemid=3003000000004920,triggerid=3003000000004533,function='last',parameter='0' where functionid=3001000000000837]
      11781:20090422:081022 In calculate_checksums
      11781:20090422:081022 Query [delete from node_cksum where nodeid=30 and cksumtype=1]
      11781:20090422:081023 Query [insert into node_cksum (nodeid,tablename,recordid,cksumtype,cksum) select 30,'functions',functionid,1,concat_ws(',',itemid,triggerid,md5(function),md5(parameter)) from functions where 1=1 and functionid between 3000000000000000 and 3099999999999999 and functionid=3001000000000837]
      11781:20090422:081023 In update_checksums
      11781:20090422:081023 Query [select curr.tablename,curr.recordid,prev.cksum,curr.cksum,prev.sync from node_cksum curr, node_cksum prev where curr.nodeid=30 and prev.nodeid=curr.nodeid and curr.tablename=prev.tablename and curr.recordid=prev.recordid and curr.cksumtype=1 and prev.cksumtype=0 and curr.tablename='functions' and curr.recordid=3001000000000837 union all select curr.tablename,curr.recordid,prev.cksum,curr.cksum,NULL from node_cksum curr left join node_cksum prev on prev.nodeid=curr.nodeid and prev.tablename=curr.tablename and prev.recordid=curr.recordid and prev.cksumtype=0 where curr.nodeid=30 and curr.cksumtype=1 and prev.tablename is null and curr.tablename='functions' and curr.recordid=3001000000000837 union all select prev.tablename,prev.recordid,prev.cksum,curr.cksum,prev.sync from node_cksum prev left join node_cksum curr on curr.nodeid=prev.nodeid and curr.tablename=prev.tablename and curr.recordid=prev.recordid and curr.cksumtype=1 where prev.nodeid=30 and prev.cksumtype=0 and curr.tablename is null and prev.tablename='functions' and prev.recordid=3001000000000837]
      11781:20090422:081023 In process_record [functions*3001000000000838*0*itemid*6*3003000000004922*triggerid*6*3003000000004536*function*1*6c617374*parameter*1*30]
      11781:20090422:081023 Query [select 0 from functions where functionid=3001000000000838]
      11781:20090422:081023 Query [update functions set itemid=3003000000004922,triggerid=3003000000004536,function='last',parameter='0' where functionid=3001000000000838]
      11781:20090422:081023 In calculate_checksums
      11781:20090422:081023 Query [delete from node_cksum where nodeid=30 and cksumtype=1]
      11777:20090422:081023 Timeout while answering request
      /data/zabbix/sbin/zabbix_server [11777]: Lock failed [Interrupted system call]
      11777:20090422:081023 In node_sync(len:1738)
      @devs, can you please shed some light on this. I don't know what i can do more debugging-wise.
      If more information is required, let me know.
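
For anyone reading the debug log above: the `process_record` lines can be decoded by hand. The sketch below is only an illustration of the layout as it appears in this log, not Zabbix's own parser. It assumes the record is `table*recordid*op` followed by `name*type*value` triples, that type `1` marks a hex-encoded string, and that any other type is a plain integer; those meanings are inferred from the log and the `update` query that follows it. `parse_record` is a hypothetical helper name.

```python
from typing import Dict, Tuple, Union

# Assumption inferred from the log: type 1 = hex-encoded string,
# anything else (here 6) = plain integer id.
HEX_STRING_TYPE = 1

FieldValue = Union[int, str]


def parse_record(record: str) -> Tuple[str, int, int, Dict[str, FieldValue]]:
    """Split 'table*recordid*op*name*type*value*...' into its parts."""
    parts = record.split("*")
    table, record_id, op = parts[0], int(parts[1]), int(parts[2])

    rest = parts[3:]
    if len(rest) % 3 != 0:
        raise ValueError("record does not end on a complete name/type/value triple")

    fields: Dict[str, FieldValue] = {}
    for i in range(0, len(rest), 3):
        name, ftype, value = rest[i], int(rest[i + 1]), rest[i + 2]
        if ftype == HEX_STRING_TYPE:
            # '6c617374' is the hex form of the ASCII bytes of the string
            fields[name] = bytes.fromhex(value).decode("utf-8")
        else:
            fields[name] = int(value)
    return table, record_id, op, fields


# The record logged for functionid 3001000000000837
record = (
    "functions*3001000000000837*0*itemid*6*3003000000004920"
    "*triggerid*6*3003000000004533*function*1*6c617374*parameter*1*30"
)
table, record_id, op, fields = parse_record(record)
print(table, record_id, op, fields)
```

Decoded this way, `function` comes out as `last` and `parameter` as `0`, which matches the `update functions set ...` query logged right after the record.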


      still using the following software:
      Ubuntu-8.04LTS
      MySQL-5.0
      Zabbix-1.6.4
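
The checksum queries in the debug log can also be mimicked outside the database. This is a rough sketch under assumptions read off the log: the `functions` checksum is built like `concat_ws(',',itemid,triggerid,md5(function),md5(parameter))`, `cksumtype=0` holds the previous checksum and `cksumtype=1` the current one, and the three-part `union all` query covers records present in both sets, only in the current set, or only in the previous set. `functions_cksum` and `diff_checksums` are hypothetical names, the comparison step is a guess at what the server does with the query result, and the `prev`/`curr` values other than the computed checksum are made up.

```python
import hashlib
from typing import Dict, List, Tuple


def md5_hex(text: str) -> str:
    """Hex MD5 digest, the same form MySQL's md5() returns."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def functions_cksum(itemid: int, triggerid: int, function: str, parameter: str) -> str:
    """Mirror concat_ws(',', itemid, triggerid, md5(function), md5(parameter))."""
    return ",".join([str(itemid), str(triggerid), md5_hex(function), md5_hex(parameter)])


def diff_checksums(prev: Dict[int, str], curr: Dict[int, str]) -> List[Tuple[int, str]]:
    """Classify record ids as new, deleted or updated (assumed interpretation)."""
    changes: List[Tuple[int, str]] = []
    for record_id in sorted(prev.keys() | curr.keys()):
        if record_id not in prev:
            changes.append((record_id, "new"))        # only a current checksum
        elif record_id not in curr:
            changes.append((record_id, "deleted"))    # only a previous checksum
        elif prev[record_id] != curr[record_id]:
            changes.append((record_id, "updated"))    # both exist but differ
    return changes


# Row values taken from the update query for functionid 3001000000000837
curr_cksum = functions_cksum(3003000000004920, 3003000000004533, "last", "0")

# Placeholder checksum sets, purely illustrative
prev = {3001000000000836: "old-a", 3001000000000837: "stale"}
curr = {3001000000000837: curr_cksum, 3001000000000838: "new-b"}
changes = diff_checksums(prev, curr)
print(curr_cksum)
print(changes)
```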

      • ataylo13
        Senior Member
        • Feb 2007
        • 122

        #4
        I am also seeing similar issue.
        Version : 1.8.8
        Current Configuration 1 Master and 3 Child Nodes

        • side_control
          Member
          • Mar 2008
          • 37

          #5
          I'm also experiencing this issue.

          • Senator
            Junior Member
            • Feb 2009
            • 9

            #6
            maybe similar to this:

            • welkin
              Senior Member
              • Mar 2007
              • 132

              #7
              I have another slightly different issue with DM.

              I just updated all of my zabbix nodes to ver 1.6.4. I started with node 3, then node 2, and finally updated the master. After applying the two patches needed for DM to work in ver 1.6.4, i configured 6 hosts on the child node. The configuration synced with the master, but i do not get data from agent items. I do get the data for the icmp ping.

              regards welkin

              • xs-
                Senior Member
                Zabbix Certified Specialist
                • Dec 2007
                • 393

                #8
                Still same =\
                I do know that after the initial configuration data update, the slave starts to sync with the master node.
                Then a 'Lock failed [Interrupted system call]' will be shown on the master.
                After that, nothing =\

                7444:20090511:143149 Starting zabbix_server. ZABBIX 1.6.4.
                7444:20090511:143149 **** Enabled features ****
                7444:20090511:143149 SNMP monitoring: YES
                7444:20090511:143149 WEB monitoring: YES
                7444:20090511:143149 Jabber notifications: NO
                7444:20090511:143149 ODBC: NO
                7444:20090511:143149 IPv6 support: NO
                7444:20090511:143149 **************************
                7446:20090511:143158 server #1 started [Poller. SNMP:YES]
                7447:20090511:143158 server #2 started [Trapper]
                7448:20090511:143158 server #3 started [Trapper]
                7449:20090511:143158 server #4 started [Trapper]
                7450:20090511:143158 server #5 started [Trapper]
                7451:20090511:143158 server #6 started [Trapper]
                7452:20090511:143158 server #7 started [ICMP pinger]
                7453:20090511:143158 server #8 started [Alerter]
                7454:20090511:143158 server #9 started [Housekeeper]
                7454:20090511:143158 Executing housekeeper
                7456:20090511:143158 server #10 started [Timer]
                7459:20090511:143158 server #12 started [Node watcher. Node ID:10]
                7460:20090511:143158 server #13 started [HTTP Poller]
                7463:20090511:143158 server #14 started [Escalator]
                7444:20090511:143158 server #0 started [Watchdog]
                7444:20090511:143158 In main_watchdog_loop()
                7458:20090511:143158 server #11 started [Poller for unreachable hosts. SNMP:YES]
                7447:20090511:143222 NODE 10: Received data from slave node 1 for node 1 datalen 792113
                7454:20090511:143229 Deleted 42574 records from history and trends
                /data/zabbix/sbin/zabbix_server [7448]: Lock failed [Interrupted system call]
                7448:20090511:143735 NODE 10: Received data from slave node 30 for node 30 datalen 13830306
                Last edited by xs-; 11-05-2009, 14:39.

                • xs-
                  Senior Member
                  Zabbix Certified Specialist
                  • Dec 2007
                  • 393

                  #9
                  Possible cause found?

                  Ok, i have found a possible cause of this problem, perhaps others experiencing the same can check this in their network.

                  Our nodes are located in different networks, separated by routers and firewalls.
                  These firewalls have a connection idle timeout configured at 2 hours.

                  I have tested with a complete resync of data (deleted 1 node's data from node_cksum on master+child, deleted all data in *_sync tables on child).
                  In this case it started syncing but after 2 hours and a couple of minutes it failed.

                  9763:20090511:173611 NODE 10: Received data from slave node 30 for node 30 datalen 13862378
                  9770:20090511:180554 Executing housekeeper
                  9770:20090511:180716 Deleted 178029 records from history and trends
                  9770:20090511:190816 Executing housekeeper
                  9770:20090511:190938 Deleted 178267 records from history and trends
                  9763:20090511:191034 NODE 10: Error while sending data to Node [30] error: ZBX_TCP_WRITE() failed [Connection reset by peer]
                  The firewall logs of the device between these nodes state that the connection had been idle for 2 hours, and thus the firewall forcibly closed the connection.
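
To line up server log entries with a firewall idle timeout, it can help to compute the gap between two lines. A small sketch, assuming every line starts with the `PID:YYYYMMDD:HHMMSS` prefix seen in the logs in this thread; `log_time` is a hypothetical helper and the 2 hour value is the timeout mentioned above.

```python
from datetime import datetime, timedelta


def log_time(line: str) -> datetime:
    """Parse the 'PID:YYYYMMDD:HHMMSS' prefix of a zabbix_server log line."""
    _pid, date, clock = line.split(":", 3)[:3]
    # 'clock' still carries the rest of the message; keep only HHMMSS
    return datetime.strptime(date + clock[:6], "%Y%m%d%H%M%S")


received = "9763:20090511:173611 NODE 10: Received data from slave node 30 for node 30 datalen 13862378"
reset = ("9763:20090511:191034 NODE 10: Error while sending data to Node [30] "
         "error: ZBX_TCP_WRITE() failed [Connection reset by peer]")

IDLE_TIMEOUT = timedelta(hours=2)  # firewall idle timeout from the post

gap = log_time(reset) - log_time(received)
print("gap between the two log lines:", gap)
print("gap reaches the idle timeout:", gap >= IDLE_TIMEOUT)
```

The same helper can be pointed at any pair of lines from the master or child log, for example the last successful 'Received data' entry and the first error after it.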

                  I'm going to try and find a workaround for the time being, but i would like to see this addressed in development: using new connections when old ones close unexpectedly, sending keepalive packets, and perhaps making the sync process multithreaded to speed things up.

                  I will also update the jira ticket.

                  • welkin
                    Senior Member
                    • Mar 2007
                    • 132

                    #10
                    any news on this issue? i have the same problem, but the nodes are in the same subnet! I moved the database of one node to a 64-bit system and compiled a 64-bit Zabbix there. Now the slave won't sync with the master again.

                    regards
                    welkin

                    • xs-
                      Senior Member
                      Zabbix Certified Specialist
                      • Dec 2007
                      • 393

                      #11
                      Well, i tried several things.
                      The 2 slave nodes are accessible from the master via 2 separate firewalled paths.
                      I have tested the firewall timeout via one path and removed the issue where the firewall drops the connection. Didn't help. The connection stays dead, period.
                      I did check via traces that the zabbix daemon uses SO_KEEPALIVE for its connections. Even with the kernel SO_KEEPALIVE interval set to 5 min, it still doesn't complete the sync.
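
For reference, this is roughly what enabling keepalive on a TCP socket looks like. It is a generic sketch and not what the zabbix daemon itself does: the 300 second idle time mirrors the 5 minute setting mentioned above, while the probe interval and probe count are illustrative assumptions. The `TCP_KEEP*` options are platform specific, so they are only applied where Python exposes them; `enable_keepalive` is a hypothetical helper name.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 300,
                     interval: int = 60, probes: int = 5) -> None:
    """Turn on SO_KEEPALIVE and, where available, per-socket timing overrides."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Without these overrides the kernel-wide keepalive defaults apply.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
print("SO_KEEPALIVE enabled:", keepalive_on)
```

Keepalive probes only keep an idle connection from being dropped by a device in the path; they do nothing for a sync that stalls for another reason, which fits the result described here.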

                      I'm beginning to suspect a bug somewhere. Some query is failing (cause or consequence) at the end of the node_sync table processing.
                      I have looked at the DB schema (my databases have been upgraded several times, perhaps i'm out of sync with something), nothing wrong there.
                      I did do some manual DB cleaning in the past, perhaps i have some illegal/orphaned data there, on which the sync fails. But the errors (if any) give no hint as to what might cause it.

                      • welkin
                        Senior Member
                        • Mar 2007
                        • 132

                        #12
                        sad to hear that. If i can help you with debugging please contact me, but i think some of the devs should have a look at this....

                        regards welkin

                        • side_control
                          Member
                          • Mar 2008
                          • 37

                          #13
                          I had the same issue and I finally found a resolution today. I'm not a hundred percent sure it's the same issue but this is what I did to resolve my problem, where my master node sync'ed up with one child node but not the other.

                          I finally tuned my Zabbix installation because my mysql database was over 42GB for only 162 hosts, 6000 items and 2000 triggers. I modified the polling, history and trends settings: I now retain 7 days of data, and I moved most of the polling to 10 minutes and the critical triggers to 180 seconds, which improved my performance greatly.
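
A back-of-the-envelope estimate shows why the polling interval and retention matter so much for database size. The sketch below is a simplification: it assumes all 6000 items are polled at the same interval and that every poll stores exactly one history row, which is not how a real installation behaves. `history_rows` is a hypothetical helper.

```python
def history_rows(items: int, interval_s: int, retention_days: int) -> int:
    """Rows kept in history if every item stores one value per interval."""
    values_per_day = 86400 // interval_s
    return items * values_per_day * retention_days


ITEMS = 6000          # item count from the post
INTERVAL_S = 600      # 10 minute polling
RETENTION_DAYS = 7    # history kept for 7 days

rows = history_rows(ITEMS, INTERVAL_S, RETENTION_DAYS)
rate = ITEMS / INTERVAL_S  # new values per second

print(f"{rows:,} history rows retained")
print(f"{rate:.1f} new values per second")
```

Halving the interval or doubling the retention doubles the row count, so both settings are worth checking before suspecting the sync itself.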

                          Then I looked at the database. It still didn't free up any physical space (InnoDB tables), but I'm sure the database is a lot smaller. After I restarted my mysql database on the master node, with less data to sift through, it started working.

                          I looked into this because Richlv at #zabbix on irc.freenode.net mentioned that the developers thought this issue is caused by having too much data in the database to send over to the nodes, therefore I cleaned it up and it worked for me.

                          • Mikrodots
                            Member
                            • Mar 2008
                            • 37

                            #14
                            Child Node Problems 1.6.5

                            I'm having the same issue.

                            Installed a new master node v. 1.6.5 after a corrupted db on the master. The first child node (node 2) came up no problem. The second (node 3) did not. I couldn't get the existing node 3 to show up in the dropdown - so I went for a complete reinstall of 1.6.5 on all three.

                            I removed all zabbix files, deleted the zabbix db and even reinstalled mysql-server on all three. I used the script from bbrendon: http://blog.brendon.com/unix/install...-the-easy-way/ to install.

                            Once again, the first child node (node 2) came up no problem. The second (node 3) did not.

                            The master, node 1, and working child, node 2, are running on virtual CentOS machines. The problem child, node 3, is running on CentOS.

                            I had this configuration working for months with version 1.6.1 on all three - I made no changes to any server hardware, network hardware, firewalls, routing or addressing.

                            After installing 1.6.5 on all three the problem child, node 3, would not show up in the master dropdown so I followed this thread:


                            We found that the nodes will be removed by method get_tree_by_parentid() in method get_viewed_nodes() in file hosts.inc.php

                            We have now commented out the line:
                            $available_nodes = get_tree_by_parentid($ZBX_LOCALNODEID,$available_nodes,'masterid');
                            After that I restarted the zabbix_server on the master, the problem child, node 3, was deleted from the master's database (Confirmed using phpmyadmin to look at the nodes table on the master - node 3 was there, then it was gone...)

                            - I recreated node 3 and it did show up in the dropdown after that.

                            On the master node 1, the problem node 3 populated hosts, host groups and actions. It did not populate data, triggers, events, users, or user groups.

                            Locally on node 3 everything appears to be working fine.

                            In the master zabbix_server log:
                            NODE 1: Received data from slave node 3 for node 3 datalen 4407639

                            In the problem child zabbix_server log:
                            NODE 3: Sending configuration changes to master node 1 for node 3 data len 4407640

                            And that looks like it for communication between the two.

                            There are very few hosts on each - less than a dozen - that I imported from the 1.6.1 installation.

                            So I don't think it is database size or too much data causing the problem.

                            Any help would be appreciated. I don't mind troubleshooting but I need some direction.

                            Thanks,

                            Mikrodots

                            • Mikrodots
                              Member
                              • Mar 2008
                              • 37

                              #15
                              Well - I'm not sure what's going on- but this may be a useful observation.

                              I stopped zabbix_server on master node 1 for about 3 seconds and node 3 started sending.

                              I restarted zabbix_server on the master node and they communicated for about 3 minutes.

                              History, events, audit log, looked to be sync'ing.

                              That is until these entries:

                              In the problem child zabbix_server log:
                              NODE 3: Sending configuration changes to master node 1 for node 3 data len 4407640

                              In the master zabbix_server log:
                              NODE 1: Received data from slave node 3 for node 3 datalen 4407639

                              Stopped master zabbix_server again and quickly restarted - no effect.
                              Stopped master zabbix_server for about 3 seconds and restarted and they started communicating again.

                              Rinse and repeat - seems to work each time it hangs. Seems to always hang at the same log entries above - very consistent.

                              I'm going to leave it overnight and see what happens.

                              Any ideas?

                              Mikrodots
                              Last edited by Mikrodots; 07-07-2009, 05:37. Reason: clarification
