DM node sync problems

  • thissolution
    Junior Member
    • Jun 2008
    • 7

    #31
    A year, and the problem is still around

    G'day

    I have been having this problem since beta 1.6. I originally posted this bug: https://support.zabbix.com/browse/ZBX-540 but just today, over 11 months after it was posted, I received an email stating it was closed as fixed. The problem still happens even with yesterday's release of 1.6, so a year later there is still no fix.
    I have had to stay on 1.4.6, as it syncs between my master and slave nodes without any issues. Yesterday, when I tried to upgrade to 1.6, the first sync after the upgrade sent a 32Mb update from the slave to the master DB. After it was received at the master, the CPU stayed at over 100% (dual CPU) for a few hours. I gave up, went back to v1.4.6, and all is running well again. The problem, as I see it, is that the sync happens and the master receives it, but loading it into the DB takes forever and then fails. I also remember that when I first tried v1.6 on a new master and slave server, the first sync sent about 20Mb of data, whereas the first sync on v1.4.6 is about 0.5Mb.


    • data7
      Junior Member
      • May 2008
      • 18

      #32
      Another update

      Well, I moved the Zabbix master node to a better machine and tuned some MySQL parameters, but the connection from the child node still times out after the first hour.

      I found this old topic ( http://www.zabbix.com/forum/showthread.php?p=34600 ), which claims that the TrapperTimeout parameter is responsible for the disconnection. DO NOT apply that patch to 1.6.x!!!

      Anyway, to change the default server limit (300 seconds), you need to edit line 217 of zabbix-1.6.6/src/zabbix_server/server.c:

      Code:
      {"TrapperTimeout",&CONFIG_TRAPPER_TIMEOUT,0,TYPE_INT,PARM_OPT,1,300},
      Change 300 to the desired timeout. I haven't tested it, though.


      • thissolution
        Junior Member
        • Jun 2008
        • 7

        #33
        data7

        Have you tried this change? Also, I guess you would need to recompile the Zabbix server after changing this, correct?

        Thanks


        • NOB
          Senior Member
          Zabbix Certified Specialist
          • Mar 2007
          • 469

          #34
          Originally posted by thissolution
          data7

          Have you tried this change? Also, I guess you would need to recompile the Zabbix server after changing this, correct?

          Thanks
          Changing server.c alone won't help.

          In 1.6.6 the timeout is hard-coded as ZABBIX_TRAPPER_TIMEOUT, currently 300 seconds (5 minutes).
          Yes, a recompile is required, along with other change(s) in trapper.c.
          We removed the timeout completely for testing, and perhaps for production too.

          Regards

          Norbert.
          Last edited by NOB; 24-09-2009, 08:50. Reason: More details what to change, too.


          • NOB
            Senior Member
            Zabbix Certified Specialist
            • Mar 2007
            • 469

            #35
            Solution (proposal)

            Hi data7

            Originally posted by data7
            This is happening to me too.

            I've been searching for the problem since Friday, and it seems the whole process stops when the configuration changes are sent to the master node.

            On the master, the following query stays active for more than an hour:

            "delete from node_cksum where nodeid=8 and cksumtype=1"

            Node 8 is the problematic one. Note that the query is not constantly active. By filtering on this query, I concluded that it is updating every row of the node_cksum table, and very slowly compared to the other nodes. This is curious, because another active child node with a similar number of hosts/items/triggers runs this task very quickly and has never been a problem for me.

            more text removed [Norbert]
            Your post pointed in the right direction.

            We faced the same problem after upgrading from 1.4.6 to 1.6.6 and solved it in our case by creating the following additional index for MySQL:

            Code:
            create index node_cksum_index_2 on node_cksum (nodeid, cksumtype);
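The effect of the extra index can be sketched with Python's built-in sqlite3 module. This is an assumption-laden stand-in, not the MySQL setup from the thread: SQLite replaces MySQL, and the node_cksum table below is a simplified, hypothetical version with only a few columns. The sketch asks the query planner how it would run the slow DELETE before and after creating node_cksum_index_2.

```python
import sqlite3

SLOW_DELETE = "DELETE FROM node_cksum WHERE nodeid = 8 AND cksumtype = 1"


def plan_for(conn, sql):
    """Return the human-readable query-plan lines SQLite reports for sql."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of every plan row is the textual "detail" field.
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
# Hypothetical, simplified stand-in for the Zabbix node_cksum table.
conn.execute(
    "CREATE TABLE node_cksum ("
    " nodeid INTEGER, tablename TEXT, recordid INTEGER,"
    " cksumtype INTEGER, cksum TEXT)"
)

# Without a matching index, the planner has to scan the whole table.
before = plan_for(conn, SLOW_DELETE)

# The index proposed in the thread, on the two columns in the WHERE clause.
conn.execute("CREATE INDEX node_cksum_index_2 ON node_cksum (nodeid, cksumtype)")

# With the index, the planner can look up only the matching rows.
after = plan_for(conn, SLOW_DELETE)

print("before:", before)
print("after: ", after)
```

The exact wording of the plan differs between SQLite versions, and MySQL's EXPLAIN output looks different again, but the contrast between a full scan and an index lookup is what the extra index is meant to achieve.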
            We hope others can use this approach and that it works for everyone.

            That's why I like these forums.

            Regards, and thanks for your valuable hint!

            Norbert.
            Last edited by NOB; 24-09-2009, 11:35. Reason: Corrected wrong end of post


            • data7
              Junior Member
              • May 2008
              • 18

              #36
              thissolution, I already tested it and even upgraded the master node's memory again. As NOB said, a recompile was needed to change the default trapper timeout limit (to 18000 in my case), but it didn't change anything at all.

              NOB, I'll create that index as soon as possible. Hope it works. And thanks a lot for the hint too.

              ###############################################

              NOB, I'm VERY thankful for the info. It worked as you said!

              I strongly recommend that procedure (create index on node_cksum) to all that faced the initial sync problem here.

              @Devs: This should definitely be included in the next release notes.

              Thanks again!
              Last edited by data7; 24-09-2009, 13:09. Reason: Update + Typo


              • NOB
                Senior Member
                Zabbix Certified Specialist
                • Mar 2007
                • 469

                #37
                Hi data7

                Originally posted by data7
                NOB, I'm VERY thankful for the info. It worked as you said!

                I strongly recommend that procedure (create index on node_cksum) to all that faced the initial sync problem here.

                @Devs: This should definitely be included on the next release notes.
                I am very happy that you provided the observation regarding the DB query.

                I entered it as bug ZBX-1058 in ZABBIX support, so it should be added to the next release, or even to the upgrade script from 1.4.x to 1.6.x.
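If the index goes into an upgrade script, the step should be safe to run more than once. A minimal sketch, again using Python's sqlite3 as a stand-in for MySQL: ensure_cksum_index() is a hypothetical helper, and on MySQL the existence check would presumably query information_schema.statistics rather than SQLite's sqlite_master catalog.

```python
import sqlite3

INDEX_NAME = "node_cksum_index_2"  # index name from the thread


def ensure_cksum_index(conn):
    """Create the (nodeid, cksumtype) index only if it is missing.

    Hypothetical upgrade step. Returns True when the index was created
    and False when it already existed.
    """
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'index' AND name = ?",
        (INDEX_NAME,),
    ).fetchone()
    if exists:
        return False
    conn.execute(f"CREATE INDEX {INDEX_NAME} ON node_cksum (nodeid, cksumtype)")
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
# Minimal hypothetical table with just the two indexed columns.
conn.execute("CREATE TABLE node_cksum (nodeid INTEGER, cksumtype INTEGER)")

first_run = ensure_cksum_index(conn)   # creates the index
second_run = ensure_cksum_index(conn)  # finds it and does nothing
print(first_run, second_run)
```

Checking first, instead of letting a duplicate CREATE INDEX fail, keeps a re-run of the upgrade from aborting halfway.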

                Regards

                Norbert.


                • xs-
                  Senior Member
                  Zabbix Certified Specialist
                  • Dec 2007
                  • 393

                  #38
                  I too am very grateful; this seems to solve the issue.
                  I can finally re-enable our master node.

                  Thanks!


                  • andrea.consadori
                    Member
                    • Apr 2013
                    • 94

                    #39
                    Can this also be applied to v2.x?

                    I have the same issue.

