Distributed Monitoring (1.4.1) - Not Syncing Between Nodes

  • morgan
    Junior Member
    • Jul 2007
    • 24

    #1

    I am currently in the process of evaluating Zabbix as a Nagios replacement. The issue I am running into (and it appears some others are as well) is that the sync aspect of Distributed Monitoring appears to be broken or nonfunctional in the current release.

    I have set up the servers as follows:

    Server 1: (master) NodeID 1 -- Trapper running on port 10051
    Server 2: (child) NodeID 2 -- Trapper running on port 10051

    MySQL is version: 5.0.32.


    I have installed Zabbix (clean install) with a fresh DB import (from the .sql files). I ran the respective 'zabbix_server -n <node ID>' on the specific systems and added the Nodes with Server 1 as the master and server 2 as the child.

    The log file shows the correct NodeWatcher ID respectively for each server:
    Server1: [Node watcher. Node ID:1]
    Server2: [Node watcher. Node ID:2]


    No matter what I have done (via the web interface) on either node, there appears to be no traffic flowing between the two nodes. I have run tcpdump (and waited well over 5 minutes) and have seen zero data flow between the two hosts. Any and all configuration changes stay local to the individual nodes. The log file also never shows a message of the form "NODE 2: Sending data of node 2 to node 1 datalen". The only sync messages that appear are at Debug=4 and look like "Starting sync with nodes".

    Unfortunately, the lack of functional distributed monitoring is a complete deal-breaker for me. While a lot of the features in Zabbix are PERFECT for what I have been tasked to do, I cannot afford to handle monitoring from a single system due to architecture of our network and systems (running a single monitor box is far too limited).

    Any insight as to why this isn't working would be appreciated (so that I can continue this evaluation and start developing the ground work to integrate Zabbix into our systems here).

    Edit: I can communicate between the servers without problem on the Zabbix Server port(s) [telnet, etc]. It appears that the zabbix server is just not passing config data.

    Thanks,

    Morgan
    Last edited by morgan; 07-07-2007, 00:24.
  • Alexei
    Founder, CEO
    Zabbix Certified Trainer
    Zabbix Certified Specialist
    Zabbix Certified Professional
    • Sep 2004
    • 5654

    #2
    Assuming you have Master Node1 and Child Node2. Please, double check the following:

    1. The configuration of nodes is absolutely identical in the ZABBIX frontend.

    Node1 is defined as a Master node, while Node2 is defined as a Child node on both nodes.

    2. NodeID is defined and correct on both nodes.

    3. ZABBIX server is restarted after you changed NodeID setting in configuration file.

    4. zabbix_server -n <nodeid> was executed with the correct nodeid (once for each node).

    5. Please do "select userid from users" on both nodes and post result here
    Alexei Vladishev
    Creator of Zabbix, Product manager
    New York | Tokyo | Riga

    • morgan
      Junior Member
      • Jul 2007
      • 24

      #3
      1) It is absolutely Identical (though I have attempted setting the localnode using 127.0.0.1 and the "public" ip address. Alas -- to no avail.)

      2) Both nodes are correctly defined. I have verified that it's as such in the zabbix_server.conf

      3) I have performed multiple restarts on the zabbix server (after each change) to ensure that everything is working properly and loaded. Each reload was done with about 20 minutes of time between changes (to see if anything had changed)

      4) the command was executed one time per server and with the correct node id. (To check this I did a "clean" db import from the .sql files)

      5) mysql> select userid from users;
      +-----------------+
      | userid          |
      +-----------------+
      | 100000000000001 |
      | 100000000000002 |
      +-----------------+

      mysql> select userid from users;
      +-----------------+
      | userid          |
      +-----------------+
      | 200000000000001 |
      | 200000000000002 |
      +-----------------+
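
A note on the IDs above: judging only from these values, the node ID appears to sit in the leading digits of each 64-bit record ID (1 followed by 14 digits for node 1, 2 followed by 14 digits for node 2). The sketch below splits an ID on that assumption. The multiplier and both helper names are hypothetical and are not taken from the Zabbix sources.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: the node ID is the part of the ID above 10^14.
 * This is inferred only from the userid values shown above. */
#define NODE_ID_MULTIPLIER UINT64_C(100000000000000)

/* Hypothetical helper: which node does this record ID belong to? */
static int node_of_id(uint64_t id)
{
    return (int)(id / NODE_ID_MULTIPLIER);
}

/* Hypothetical helper: the per-node part of the record ID. */
static uint64_t local_part_of_id(uint64_t id)
{
    return id % NODE_ID_MULTIPLIER;
}

/* Prints both parts; the full ID is 64-bit and needs a 64-bit
 * conversion (PRIu64), while the node ID is a plain int (%d). */
static void print_id_parts(uint64_t id)
{
    printf("id=%" PRIu64 " node=%d local=%" PRIu64 "\n",
           id, node_of_id(id), local_part_of_id(id));
}
```

Calling print_id_parts() on the four userid values above would report node 1 for the first pair and node 2 for the second. Values this large do not fit in 32 bits, which is why record IDs are handled as 64-bit integers while the node ID itself stays a small int.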

      Lastly, I have successfully gotten the history packets to flow, but there are still no configuration updates being pushed out. I am making the broad assumption that the web UI should look identical in the "hosts" and "items/triggers" sections whether I log in from the master or the child (provided I have given enough time for propagation).

      Thanks,

      --Morgan


      • Alexei
        Founder, CEO
        Zabbix Certified Trainer
        Zabbix Certified Specialist
        Zabbix Certified Professional
        • Sep 2004
        • 5654

        #4
        Please give me result of "select * from nodes" from both nodes.

        • morgan
          Junior Member
          • Jul 2007
          • 24

          #5
          Node 1:

          mysql> select * from nodes;
          +--------+--------+----------+---------------+-------+---------------+--------------+--------------+----------------+--------------------+---------------------+----------+----------+
          | nodeid | name   | timezone | ip            | port  | slave_history | slave_trends | event_lastid | history_lastid | history_str_lastid | history_uint_lastid | nodetype | masterid |
          +--------+--------+----------+---------------+-------+---------------+--------------+--------------+----------------+--------------------+---------------------+----------+----------+
          |      1 | Medius |        0 | 127.0.0.1     | 10051 |            30 |          365 |            0 |              0 |                  0 |                   0 |        1 |        0 |
          |      2 | ZMon01 |        0 | 192.168.1.53  | 10051 |            30 |          365 |            0 |              0 |                  0 |                   0 |        0 |        1 |
          +--------+--------+----------+---------------+-------+---------------+--------------+--------------+----------------+--------------------+---------------------+----------+----------+

          Node 2:

          mysql> select * from nodes;
          +--------+--------+----------+---------------+-------+---------------+--------------+--------------+----------------+--------------------+---------------------+----------+----------+
          | nodeid | name   | timezone | ip            | port  | slave_history | slave_trends | event_lastid | history_lastid | history_str_lastid | history_uint_lastid | nodetype | masterid |
          +--------+--------+----------+---------------+-------+---------------+--------------+--------------+----------------+--------------------+---------------------+----------+----------+
          |      1 | Medius |        0 | 192.168.1.54  | 10051 |            30 |          365 |            0 |              0 |                  0 |                   0 |        0 |        0 |
          |      2 | ZMon01 |        0 | 127.0.0.1     | 10051 |            30 |          365 |            0 |              0 |                  1 |                  58 |        1 |        1 |
          +--------+--------+----------+---------------+-------+---------------+--------------+--------------+----------------+--------------------+---------------------+----------+----------+

          And as stated before, the history packets are working properly now and I am seeing those in the logs on both servers. Just no updates to the configs.


          • welkin
            Senior Member
            • Mar 2007
            • 132

            #6
            I have exactly the same problem. Maybe my MySQL output can help you:

            Central Node :select userid from users;
            +-----------------+
            | userid          |
            +-----------------+
            | 100000000000001 |
            | 100000000000002 |
            +-----------------+
            Slave Node: select userid from users;
            +-----------------+
            | userid          |
            +-----------------+
            | 200000000000001 |
            | 200000000000002 |
            +-----------------+
            Central Node : select * from nodes;

            +--------+--------------+----------+---------------+-------+---------------+--------------+--------------+----------------+--------------------+---------------------+----------+----------+
            | nodeid | name         | timezone | ip            | port  | slave_history | slave_trends | event_lastid | history_lastid | history_str_lastid | history_uint_lastid | nodetype | masterid |
            +--------+--------------+----------+---------------+-------+---------------+--------------+--------------+----------------+--------------------+---------------------+----------+----------+
            |      1 | Central Node |        0 | 127.0.0.1     | 10051 |            30 |          365 |            0 |              0 |                  0 |                   0 |        1 |        0 |
            |      2 | Slave Node   |        0 | 87.230.24.226 | 10051 |            90 |          365 |            0 |              0 |                  0 |                   0 |        0 |        1 |
            +--------+--------------+----------+---------------+-------+---------------+--------------+--------------+----------------+--------------------+---------------------+----------+----------+
            Slave Node: select * from nodes;

            +--------+--------------+----------+--------------+-------+---------------+--------------+--------------+----------------+--------------------+---------------------+----------+----------+
            | nodeid | name         | timezone | ip           | port  | slave_history | slave_trends | event_lastid | history_lastid | history_str_lastid | history_uint_lastid | nodetype | masterid |
            +--------+--------------+----------+--------------+-------+---------------+--------------+--------------+----------------+--------------------+---------------------+----------+----------+
            |      1 | Central Node |        0 | 91.190.225.6 | 10051 |            90 |          365 |            0 |              0 |                  0 |                   0 |        0 |        0 |
            |      2 | Slave Node   |        0 | 127.0.0.1    | 10051 |            30 |          365 |            0 |              0 |                  0 |                   0 |        1 |        1 |
            +--------+--------------+----------+--------------+-------+---------------+--------------+--------------+----------------+--------------------+---------------------+----------+----------+


            greetings welkin


            • Alexei
              Founder, CEO
              Zabbix Certified Trainer
              Zabbix Certified Specialist
              Zabbix Certified Professional
              • Sep 2004
              • 5654

              #7
              Give me some time...

              • morgan
                Junior Member
                • Jul 2007
                • 24

                #8
                Alright. Let me know when you have some information as to what might be going on here. I'm hoping we can get this sync issue resolved sooner rather than later so that I can get a final decision on whether we can use Zabbix here (a lot of engineering is going into revamping/integrating monitoring in a clean way with the rest of our systems).


                • morgan
                  Junior Member
                  • Jul 2007
                  • 24

                  #9
                  After doing a bit of debugging myself, it appears the sync issue is specifically happening as follows:

                  in nodesender.c (should be line 98 or so) the while statement:

                  ... snip ...
                  while((row=DBfetch(result)))
                  {
                          found = 1;
                  ... snip ...

                  is not evaluating as true. This is causing "found" to remain set as zero. The SQL statements:
                  Code:
                  "select tablename,recordid,operation from node_configlog where nodeid=" ZBX_FS_UI64 " and sync_slave=0 and conflogid<=" ZBX_FS_UI64 " order by tablename,operation",
                  (slave or master) both apparently return the correct rows. I have verified based on the information from the debug line earlier for "In send_config_data". It appears that the conditional in the while statement is returning "false".

                  Edit: The DBfetch() is returning no data. Maybe an issue with "result".
                  Last edited by morgan; 06-07-2007, 23:48.


                  • morgan
                    Junior Member
                    • Jul 2007
                    • 24

                    #10
                    Fixed.

                    I have tested what I believe to be a fix for the problem I have been seeing.

                    What follows is a patch for nodesender.c (in the nodewatcher directory). It appears that ZBX_FS_UI64 was used in place of %d when building the SQL queries on lines 79 and 85. Because nodeid is a plain int rather than a bigint, this was causing an overflow-type (or underflow?) effect on the int value (nodeid) when it was passed into the SQL statement. By changing this I was able to get configuration syncs to work properly again.

                    Code:
                    zabbix:~/zabbix-1.4.1/src/zabbix_server/nodewatcher# diff -u nodesender.c.orig nodesender.c     
                    --- nodesender.c.orig   2007-07-06 15:06:36.000000000 -0700
                    +++ nodesender.c        2007-07-06 15:07:00.000000000 -0700
                    @@ -76,13 +76,13 @@
                            /* Begin work */
                            if(node_type == ZBX_NODE_MASTER)
                            {
                    -               result=DBselect("select tablename,recordid,operation from node_configlog where nodeid=" ZBX_FS_UI64  " and sync_master=0 and conflogid<=" ZBX_FS_UI64 " order by tablename,operation",
                    +               result=DBselect("select tablename,recordid,operation from node_configlog where nodeid=%d and sync_master=0 and conflogid<=" ZBX_FS_UI64 " order by tablename,operation",
                                            nodeid,
                                            maxlogid);
                            }
                            else
                            {
                    -               result=DBselect("select tablename,recordid,operation from node_configlog where nodeid=" ZBX_FS_UI64 " and sync_slave=0 and conflogid<=" ZBX_FS_UI64 " order by tablename,operation",
                    +               result=DBselect("select tablename,recordid,operation from node_configlog where nodeid=%d and sync_slave=0 and conflogid<=" ZBX_FS_UI64 " order by tablename,operation",
                                            nodeid,
                                            maxlogid);
                            }
                    Please let me know if there is a "cleaner" fix to this (or more appropriate).
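
As a minimal, standalone sketch of the rule the patch applies (this is not the Zabbix code itself): every conversion in a printf-style format string has to match the type of the argument actually passed. Passing an int where the format announces a 64-bit value is undefined behaviour, and on a typical 32-bit build the formatter reads more bytes than were supplied, so the value comes out wrong and later arguments can be misread too. FS_UI64 below is only a stand-in for ZBX_FS_UI64, which is assumed to expand to a 64-bit unsigned conversion, and build_configlog_query() is a hypothetical helper.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for ZBX_FS_UI64 (assumption: a 64-bit unsigned conversion).
 * PRIu64 is the portable C99 spelling of that conversion. */
#define FS_UI64 "%" PRIu64

/* Hypothetical helper that builds the slave-side query from the patch.
 * nodeid is a plain int, so it is formatted with %d.
 * maxlogid is 64-bit, so it is formatted with FS_UI64.
 * Returns the snprintf() result: the length the full string needs. */
static int build_configlog_query(char *buf, size_t len,
                                 int nodeid, uint64_t maxlogid)
{
    return snprintf(buf, len,
        "select tablename,recordid,operation from node_configlog"
        " where nodeid=%d and sync_slave=0 and conflogid<=" FS_UI64
        " order by tablename,operation",
        nodeid, maxlogid);
}
```

With nodeid 2 and a 64-bit maxlogid, the resulting string contains "nodeid=2" as intended. An alternative that keeps the 64-bit specifier would be to cast the argument to a 64-bit type at the call site; switching to %d simply matches the variable's declared type.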

                    Cheers,

                    Morgan
                    Last edited by morgan; 07-07-2007, 00:24.


                    • Alexei
                      Founder, CEO
                      Zabbix Certified Trainer
                      Zabbix Certified Specialist
                      Zabbix Certified Professional
                      • Sep 2004
                      • 5654

                      #11
                      Morgan, I appreciate your research very much! Well done!

                      I bet you use a 32-bit OS. Your fix is absolutely correct. It is integrated into the main code, revision 4451.

                      Feel free to download and test a nightly build of pre-1.4.2, available from http://www.zabbix.com/developers.php. Make sure the revision number is 4451 or higher.

                      Thank you.