DM node sync problems

  • Mikrodots
    Member
    • Mar 2008
    • 37

    #16
    1.6.5 delayed updates from child nodes

    Overnight node 3 did populate - but the updates from both child nodes are delayed.

    For example - I caused a host unreachable problem by stopping the zabbix_agentd daemon on a host - the fault shows up on the child within 2 minutes -

    but on the master it's been 20 minutes and the fault still has not shown up.

    Mikrodots

    Comment

    • Mikrodots
      Member
      • Mar 2008
      • 37

      #17
      Master responding only to one child

      So very strange - the master seems to only respond to one host - currently node 3, which was causing all the trouble yesterday.

      Now if I create a fault on node 3 it shows up on the master as expected, a problem on node 2 does not (which is the reverse of yesterday)

      Both nodes 2 and 3 are communicating with master node 1 - I can see that from the logs by running this on all three and watching:
      tail /etc/log/zabbix-server/zabbix_server.log | grep "node"

      So it looks like it is doing the data collection, just not showing the results on the dashboard.
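
A rough Python take on the same log check, for anyone who wants counts rather than raw lines. This is only a sketch: the line shape (`PID:YYYYMMDD:HHMMSS NODE <id>: <message>`) is assumed from the log excerpts posted later in this thread, and `summarize_node_lines` is a made-up helper name, not part of Zabbix.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log excerpts in this thread:
#   PID:YYYYMMDD:HHMMSS NODE <id>: <message>
NODE_LINE = re.compile(r"^(\d+):(\d{8}):(\d{6}) NODE (\d+): (.*)$")


def summarize_node_lines(lines):
    """Count node-sync log lines per (node id, first word of the message)."""
    counts = Counter()
    for line in lines:
        match = NODE_LINE.match(line.strip())
        if match:
            node_id = int(match.group(4))
            action = match.group(5).split()[0]  # e.g. "Sending", "Received", "Unable"
            counts[(node_id, action)] += 1
    return counts


# Sample lines copied from posts in this thread.
sample = [
    "15970:20090721:141704 NODE 4: Unable to connect to Node [1] error: Cannot connect to [10.xx.xx.xx:10051] [Connection refused]",
    "15970:20090721:141714 NODE 4: Sending alerts of node 4 to node 1 datalen 1090",
    "15970:20090721:141715 NODE 4: Sending history_sync of node 4 to node 1 datalen 348458",
    "12228:20090730:093825 server #25 started [Node watcher. Node ID:2]",
]

summary = summarize_node_lines(sample)
for key, count in sorted(summary.items()):
    print(key, count)
```

A node that shows plenty of "Sending" lines but no "Received" lines would be a candidate for a closer look; lines that do not match the assumed shape are simply skipped.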

      I am at a loss now - any assistance would be appreciated!

      Mikrodots

      Comment

      • Mikrodots
        Member
        • Mar 2008
        • 37

        #18
        confirm node drops from dropdown

        To troubleshoot some more I uncommented the line:

        $available_nodes = get_tree_by_parentid($ZBX_LOCALNODEID,$available_nodes,'masterid');
        in file hosts.inc.php

        Node 3 once again disappeared from the dropdown but still exists in the database on the master and in the dashboard view. It seems to be working properly otherwise.

        The master still does not seem to be updating from node 2.

        So it looks like two different problems.
        Last edited by Mikrodots; 07-07-2009, 21:03.

        Comment

        • Mikrodots
          Member
          • Mar 2008
          • 37

          #19
          I deleted all hosts and host groups on node 2 then recreated them.

          I had originally imported them - maybe that's causing the issue.

          Node 2 seems to be working now.

          Glad I only had a dozen hosts and three host groups.

          Mikrodots

          Comment

          • side_control
            Member
            • Mar 2008
            • 37

            #20
            Mik: Go to your dashboard. What does it say for

            Required server performance, new values per second 15.1918 - ?

            Comment

            • Mikrodots
              Member
              • Mar 2008
              • 37

              #21
              Thanks for the reply Jawbrkr,

              Sorry for the delayed response - crazy week.

              Required server performance, new values per second 68.583

              Everything seems to be working right now...

              Mikrodots

              Comment

              • side_control
                Member
                • Mar 2008
                • 37

                #22
                My two cents: it looks like a performance issue. My values are between 10 and 20, and since I upgraded my MySQL database from 2GB to 16GB I have had no issues whatsoever with node sync.

                Comment

                • ataylo13
                  Senior Member
                  • Feb 2007
                  • 122

                  #23
                  I just updated all my nodes to 1.6.5 and am still having this issue. One of the nodes is humming along fine, sending data every few seconds, and the others are hanging. They both hang at "Sending configuration changes". As you can see, it is not the amount of data that gives the server a problem.

                  15970:20090721:141704 NODE 4: Unable to connect to Node [1] error: Cannot connect to [10.xx.xx.xx:10051] [Connection refused]
                  15970:20090721:141704 NODE 4: Unable to connect to Node [1] error: Cannot connect to [10.xx.xx.xx:10051] [Connection refused]
                  15970:20090721:141704 NODE 4: Unable to connect to Node [1] error: Cannot connect to [10.xx.xx.xx:10051] [Connection refused]
                  15970:20090721:141704 NODE 4: Unable to connect to Node [1] error: Cannot connect to [10.xx.xx.xx:10051] [Connection refused]
                  15970:20090721:141714 NODE 4: Sending alerts of node 4 to node 1 datalen 1090
                  15970:20090721:141715 NODE 4: Sending history_sync of node 4 to node 1 datalen 348458
                  15970:20090721:141728 NODE 4: Sending history_uint_sync of node 4 to node 1 datalen 345178
                  15970:20090721:141741 NODE 4: Sending history_str_sync of node 4 to node 1 datalen 5760
                  15970:20090721:141742 NODE 4: Sending events of node 4 to node 1 datalen 171
                  15970:20090721:141742 NODE 4: Sending history_sync of node 4 to node 1 datalen 348467
                  15970:20090721:141755 NODE 4: Sending history_uint_sync of node 4 to node 1 datalen 345330
                  15970:20090721:141815 NODE 4: Sending configuration changes to master node 1 for node 4 datalen 227530

                  What other information do you need?
                  Version : 1.8.8
                  Current Configuration 1 Master and 3 Child Nodes

                  Comment

                  • ataylo13
                    Senior Member
                    • Feb 2007
                    • 122

                    #24
                    Just a quick update... when I came in this morning, 12-plus hours later, all nodes were syncing in the logs. Two of the servers are not appearing in the drop-down on the web interface though.
                    Last edited by ataylo13; 22-07-2009, 14:47.

                    Comment

                    • Owl
                      Junior Member
                      • Sep 2008
                      • 7

                      #25
                      Any more news on this issue? My servers are out of sync and I only have 1 master and 1 child.

                      On the master server I am getting the messages:
                      Starting sync with nodes
                      14114:20090730:083837 NODE 1: Sending [200000000001035] to Node [2]
                      14113:20090730:083838 NODE 1: Received history from node 2 for node 2 datalen 350955
                      14110:20090730:083844 NODE 1: Received history_uint from node 2 for node 2 datalen 326297
                      14114:20090730:083900 NODE 1: Received data from slave node 2 for node 2 datalen 88704
                      14114:20090730:083944 NODE 1: Sending [Data?1?2

                      On the child node:
                      6500:20090730:091121 NODE 2: Sending history_sync of node 2 to node 1 datalen 350612
                      6500:20090730:091128 NODE 2: Sending history_uint_sync of node 2 to node 1 datalen 326293
                      6500:20090730:091134 NODE 2: Sending history_str_sync of node 2 to node 1 datalen 3220
                      6500:20090730:091135 NODE 2: Sending history_log of node 2 to node 1 datalen 3378721
                      6500:20090730:091159 NODE 2: Sending history_text of node 2 to node 1 datalen 837252
                      6500:20090730:091207 NODE 2: Sending configuration changes to master node 1 for node 2 datalen 1640
                      6500:20090730:091236 NODE 2: Received data from master node 1 for node 2 datalen 8474900
                      6500:20090730:093808 NODE 2: Sending history_sync of node 2 to node 1 datalen 350955
                      6500:20090730:093814 NODE 2: Sending history_uint_sync of node 2 to node 1 datalen 326297
                      12228:20090730:093825 server #25 started [Node watcher. Node ID:2]
                      12228:20090730:093830 NODE 2: Sending configuration changes to master node 1 for node 2 datalen 88705
                      12228:20090730:093919 NODE 2: Received data from master node 1 for node 2 datalen 8457

                      It appears to update, but when I look at the data under latest data from the Master server, nothing is being updated. Also, if I make configuration changes they are not updating in a timely manner.
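
One way to see where the sync pauses is to look for long gaps between consecutive log lines. The following is a minimal sketch, assuming the `PID:YYYYMMDD:HHMMSS message` line shape seen in the excerpts above; `find_gaps` and the 300-second threshold are illustrative choices, not anything Zabbix provides.

```python
import re
from datetime import datetime

# Assumed line shape: PID:YYYYMMDD:HHMMSS <message>
LINE = re.compile(r"^\d+:(\d{8}):(\d{6}) (.*)$")


def find_gaps(lines, threshold_seconds):
    """Return (gap_seconds, message_before_gap) for each pause longer than the threshold."""
    gaps = []
    previous = None  # (timestamp, message) of the previously parsed line
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed shape
        stamp = datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M%S")
        if previous is not None:
            delta = (stamp - previous[0]).total_seconds()
            if delta > threshold_seconds:
                gaps.append((int(delta), previous[1]))
        previous = (stamp, match.group(3))
    return gaps


# Four consecutive lines copied from the child node log above.
child_log = [
    "6500:20090730:091207 NODE 2: Sending configuration changes to master node 1 for node 2 datalen 1640",
    "6500:20090730:091236 NODE 2: Received data from master node 1 for node 2 datalen 8474900",
    "6500:20090730:093808 NODE 2: Sending history_sync of node 2 to node 1 datalen 350955",
    "6500:20090730:093814 NODE 2: Sending history_uint_sync of node 2 to node 1 datalen 326297",
]

gaps = find_gaps(child_log, threshold_seconds=300)
print(gaps)
```

On these four lines the only gap over the threshold comes right after the "Received data from master node 1" message, roughly 25 minutes before the next history_sync is sent. The sketch only reports the pause; it says nothing about its cause.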

                      Comment

                      • data7
                        Junior Member
                        • May 2008
                        • 18

                        #26
                        This is happening to me too.

                        I've been searching for the problem since Friday and it seems like the whole process stops when the configuration changes are sent to the master node.

                        On the master the following query keeps active for more than one hour:

                        "delete from node_cksum where nodeid=8 and cksumtype=1"

                        Where node 8 is the problematic one. Please note that this query is not constantly active. By filtering for it I concluded that it is updating every row of the node_cksum table at a very slow speed compared to the other nodes. That is curious, as there is another active child node with a similar number of hosts/items/triggers that completes this task very quickly and has never posed a problem for me.
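
To compare how much checksum data each node carries, a per-node row count over node_cksum may help. The sketch below runs against an in-memory SQLite stand-in so it is self-contained; only the table name and the `nodeid` and `cksumtype` columns come from the query quoted above. The real table has more columns and lives in the Zabbix database, and the rows inserted here (including node 7) are made-up placeholders, not real figures.

```python
import sqlite3

# In-memory stand-in for the master database (assumption: real setups use another DBMS).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node_cksum (nodeid INTEGER, cksumtype INTEGER)")

# Placeholder rows only: node 8 is the problematic node, node 7 is a hypothetical healthy one.
rows = [(8, 1)] * 5 + [(8, 0)] * 5 + [(7, 1)] * 2 + [(7, 0)] * 2
conn.executemany("INSERT INTO node_cksum (nodeid, cksumtype) VALUES (?, ?)", rows)


def cksum_counts(connection):
    """Return (nodeid, cksumtype, row_count) tuples, ordered by node and checksum type."""
    cursor = connection.execute(
        "SELECT nodeid, cksumtype, COUNT(*) FROM node_cksum "
        "GROUP BY nodeid, cksumtype ORDER BY nodeid, cksumtype"
    )
    return cursor.fetchall()


counts = cksum_counts(conn)
for nodeid, cksumtype, total in counts:
    print(nodeid, cksumtype, total)
```

If the slow node turned out to hold far more rows than a comparable node, that would point at data volume; similar counts would suggest looking elsewhere, such as locking or indexing on the master.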

                        If I stop the master node's Zabbix Server it looks like it aborts the operation, and the other information (trends, history, events, etc.) is sent as expected from the child node until it decides to send the configuration changes; then everything starts all over again.

                        In my situation I wouldn't need to administer the child nodes remotely, so a "NodeNoConfiguration" option would be welcome, since to me the historical data is far more important. Or at least it would help if the update interval of the configuration changes replication could be set (once a week would be acceptable to me).

                        In addition to this, on most nodes I've been using the one-way TCP method, which seems to work when the operation is done quickly. As the operation from the problematic node takes more than one hour, it probably can't use the same connection to send the data back to the child, so the whole process is stopped again.

                        I doubt that it is exclusively a performance issue. I'll try to set up communication from the other side, since that node never got a "Received data from master node" message. Maybe after the first ones this query will turn out to be quicker.

                        I'll try to see what I can change on the sources (nodewatcher process).

                        Regards,

                        Comment

                        • data7
                          Junior Member
                          • May 2008
                          • 18

                          #27
                          Updating

                          Just some additional information on my last post: the interval at which the information is sent can be configured in the zabbix_server sources (in src/zabbix_server/nodewatcher/nodewatcher.c, function process_nodes; the current and default value is 120 seconds).

                          But after all that I'm still in trouble, as the alerts need to be updated and the whole process gets interrupted. Besides that, the alerts aren't showing up on the dashboard (but all events are registered correctly for that node!).

                          Comment

                          • chrisf
                            Junior Member
                            • Apr 2009
                            • 25

                            #28
                            Still broken.

                            Why haven't any of the zabbix developers gotten involved with this thread?
                            Does anyone know the sync process well enough to help troubleshoot this issue? It seems to be a widespread problem.

                            I'm experiencing the same issue.
                            Logs are of no help. My slave node just up and stopped syncing.
                            In the middle of the night at that. No changes were made.
                            The only log info I get is the following on the slave:
                            11367:20090817:185321 NODE 2: Sending configuration changes to master node 1 for node 2 datalen 5732
                            Then it stops communicating to the master node.

                            If I stop both zabbix servers and start the slave it begins to complain it can't connect with the master node.
                            I restart the master and receive the following in the slave node log:
                            11367:20090817:185353 NODE 2: Sending history_sync of node 2 to node 1 datalen 353142
                            11367:20090817:185400 NODE 2: Sending history_uint_sync of node 2 to node 1 datalen 338392
                            11367:20090817:185419 NODE 2: Sending history_str_sync of node 2 to node 1 datalen 8764
                            11367:20090817:185420 NODE 2: Sending history_text of node 2 to node 1 datalen 2786
                            11367:20090817:185420 NODE 2: Sending events of node 2 to node 1 datalen 9759
                            11367:20090817:185421 NODE 2: Sending auditlog of node 2 to node 1 datalen 284
                            11367:20090817:185422 NODE 2: Sending history_sync of node 2 to node 1 datalen 352943
                            11367:20090817:185435 NODE 2: Sending history_uint_sync of node 2 to node 1 datalen 338640
                            11367:20090817:185442 NODE 2: Sending history_str_sync of node 2 to node 1 datalen 994
                            11367:20090817:185443 NODE 2: Sending history_text of node 2 to node 1 datalen 126
                            11367:20090817:185444 NODE 2: Sending history_sync of node 2 to node 1 datalen 352842
                            11367:20090817:185459 NODE 2: Sending history_uint_sync of node 2 to node 1 datalen 338529
                            11367:20090817:185509 NODE 2: Sending history_text of node 2 to node 1 datalen 126
                            11367:20090817:185536 NODE 2: Sending configuration changes to master node 1 for node 2 datalen 5422
                            11367:20090817:185700 NODE 2: Received data from master node 1 for node 2 datalen 972797

                            That looks great...
                            But then it stops. I am at a complete loss as to why. The logs as well provide no help.
                            Can someone give some insight into how to reset the sync between two nodes?

                            Is there a way to start from scratch? So that all the data on the slave gets pushed back over?
                            Ideally this issue could be solved cleanly without having to do that though.

                            Thanks

                            Chris

                            Comment

                            • data7
                              Junior Member
                              • May 2008
                              • 18

                              #29
                              Another update...

                              Well, I'm still stuck with this but it seems I found the problem.

                              When there's a large amount of data to be sent (and not enough performance on the master's DB), the master cannot process it completely and the child node doesn't get a response from the master node.

                              There is a second problem (on another node) in which the master node finishes processing the data, but since that takes a long time, it cannot use the same TCP connection to send a response to the child node...

                              Probably nobody is reading this, but my last attempt will be to copy table "node_cksum" from the problematic child nodes and insert them manually on master node. Hope it works.

                              All the other regular nodes sync correctly because there is very little data to exchange and few operations to be done on the database (caused by the configuration changes).

                              Comment

                              • chrisf
                                Junior Member
                                • Apr 2009
                                • 25

                                #30
                                Same here

                                Yea I never found the solution.
                                I blew away the master node and reinstalled the latest 1.6.5 and also upgraded to 1.6.5 on the child. They began syncing, but NONE of the configs came over from the child. No help in the logs or on these forums.
                                I split the systems into separate entities, as the Zabbix distributed system model, at least for me, was unreliable and next to impossible to troubleshoot.

                                -Chris

                                Comment
