Problems with distributed monitoring in Zabbix 1.5.3

  • disgruntleddutch
    Member
    • Oct 2006
    • 34

    #1

    Hi there,

    I currently have Zabbix 1.5.3 server installed on three servers in three different environments. The clients in each environment are reporting to their respective Zabbix servers.

    There is one master node and two child nodes each with MySQL 5 (5.0.51b) databases using InnoDB and it looks something like this:
    Master Node
    |
    |- Child Node
    |
    |- Child Node

    What I did was as follows:
    1. Added the first child node (NodeID 2) to the master node, then enabled the discovery rules and discovery action rules on that child.
    2. Did the same thing for the second child node (NodeID 3).

    Now the problem is that I can't see any of the child nodes' hosts in the master node's Zabbix interface (when you select the child node from the drop-down in the top right). I can see them if I log in to the child node's Zabbix interface on the machine in the other environment.
    Another issue I am seeing is that items such as system.hostname, system.uname, agent.version and vfs.file.cksum

    Lastly, the zabbix_server log on the child node displays a query error (see attached file).


    I know the nodes are communicating, as I also see this in the same child node's log:
    26418:20080703:151202 NODE 2: Sending history_sync of node 2 to node 1 datalen 346154
    26418:20080703:151331 NODE 2: Sending history_uint_sync of node 2 to node 1 datalen 310910
    26418:20080703:151422 NODE 2: Sending events of node 2 to node 1 datalen 50304
    26418:20080703:151450 NODE 2: Sending trends of node 2 to node 1 datalen 300766
    26418:20080703:151528 NODE 2: Sending auditlog of node 2 to node 1 datalen 184

    With corresponding logs in Node 1:
    18813:20080703:151202 NODE 1: Received history from node 2 for node 2 datalen 346154
    18812:20080703:151332 NODE 1: Received history_uint from node 2 for node 2 datalen 310910
    18823:20080703:151422 NODE 1: Received events from node 2 for node 2 datalen 50304
    18845:20080703:151451 NODE 1: Received trends from node 2 for node 2 datalen 300766
    18819:20080703:151528 NODE 1: Received auditlog from node 2 for node 2 datalen 184
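A rough way to check the pairing described above is to match each "Sending" line in the child log against a "Received" line with the same table and datalen in the master log. This is only a sketch: the function names are made up for illustration, the patterns are based solely on the log lines quoted in this post, and it assumes the child's `_sync` suffix is the only naming difference between the two sides.

```python
import re

# Patterns derived only from the log lines quoted in the post (an assumption,
# not a documented Zabbix log format).
SENT = re.compile(r"NODE (\d+): Sending (\w+) of node (\d+) to node (\d+) datalen (\d+)")
RECV = re.compile(r"NODE (\d+): Received (\w+) from node (\d+) for node (\d+) datalen (\d+)")


def normalize(table):
    """The child logs 'history_sync' where the master logs 'history'."""
    return table[:-5] if table.endswith("_sync") else table


def unmatched_transfers(child_log, master_log):
    """Return (sender, receiver, table, datalen) for sends with no matching receive."""
    received = set()
    for line in master_log.splitlines():
        m = RECV.search(line)
        if m:
            receiver, table, sender, _owner, datalen = m.groups()
            received.add((sender, receiver, table, int(datalen)))

    missing = []
    for line in child_log.splitlines():
        m = SENT.search(line)
        if m:
            sender, table, _owner, receiver, datalen = m.groups()
            key = (sender, receiver, normalize(table), int(datalen))
            if key not in received:
                missing.append(key)
    return missing


# Two of the line pairs quoted above.
CHILD_LOG = """\
26418:20080703:151202 NODE 2: Sending history_sync of node 2 to node 1 datalen 346154
26418:20080703:151331 NODE 2: Sending history_uint_sync of node 2 to node 1 datalen 310910
"""
MASTER_LOG = """\
18813:20080703:151202 NODE 1: Received history from node 2 for node 2 datalen 346154
18812:20080703:151332 NODE 1: Received history_uint from node 2 for node 2 datalen 310910
"""

if __name__ == "__main__":
    print(unmatched_transfers(CHILD_LOG, MASTER_LOG))
```

An empty result only means every logged send has a matching logged receive; it says nothing about whether the master's frontend can display the data.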

    Each environment has about 180-200 hosts, with probably 50 more to be added to each in the next few months, so I have to get this to work.

    Any help is greatly appreciated. I found one suggested workaround for the bad SQL query, which is to enable 'innodb_locks_unsafe_for_binlog', but to me that sounds a bit... scary.
    Last edited by disgruntleddutch; 04-07-2008, 00:32.
  • disgruntleddutch
    Member
    • Oct 2006
    • 34

    #2
    So I fixed the deadlock issue with a patch that Palmertree wrote (from this thread: http://www.zabbix.com/forum/showthread.php?t=9890).

    The good news is that the results from the subnodes are now showing up on the master node; however, it is god-awfully slow. With the master and the two subnodes selected (about 300 hosts), the page takes about 5-6 minutes to load.

    What can I do to speed up the overview.php page? Are there bugs in it causing redundant queries?

    • Palmertree
      Senior Member
      • Sep 2005
      • 746

      #3
      I have some improvements to my original patch that fix trends/history pushes and multiple-child-node issues, plus some speed improvements. They require some new tables with SQL triggers and some indexes. I am pushing about 972 hosts with 125,000 items to my master node, and the master node database is running 75% to 80% idle with a load average below 1.0 (it used to be 4.0+ before the improvements). It took me about 3 weeks of tuning, but I was able to achieve very good results.

      If anyone is interested, I will be happy to share them and post the patches.
      Last edited by Palmertree; 06-07-2008, 07:33.

      • disgruntleddutch
        Member
        • Oct 2006
        • 34

        #4
        Mark me as a volunteer. Your last patch already fixed some problems for me, so I am excited to see what your other improvements do :-).

        My biggest problem right now is that the overview page is insanely slow, so if you have any optimizations there, that's great as well.

        • Palmertree
          Senior Member
          • Sep 2005
          • 746

          #5
          I'll write up the information and post my patch.

          For the overview slowness, I would log MySQL slow queries by enabling slow query logging in your my.cnf config file, if you are using MySQL. Then browse to the overview page, wait until the page comes up, and review the log. That shows which query is taking the longest, and you can add indexes to your database to optimize it. If you can find out which query is taking the longest, I can try to give you some pointers that might help.
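Reviewing the log by hand gets tedious once it grows, so here is a minimal sketch of ranking the entries by query time. It assumes the log uses the `# Query_time: ... Lock_time: ... Rows_sent: ... Rows_examined: ...` header layout that MySQL 5.0 writes; the function name, the sample entries, and the table names in them are invented for illustration and are not real Zabbix queries.

```python
import re

# Header line that precedes each statement in the slow query log
# (layout assumed from MySQL 5.0; newer versions log fractional seconds).
HEADER = re.compile(
    r"^# Query_time: ([\d.]+)\s+Lock_time: ([\d.]+)\s+"
    r"Rows_sent: (\d+)\s+Rows_examined: (\d+)"
)


def slowest_queries(log_text, top=5):
    """Return (query_time, rows_examined, sql) tuples, slowest first."""
    entries = []
    current = None
    for line in log_text.splitlines():
        m = HEADER.match(line)
        if m:
            current = {
                "query_time": float(m.group(1)),
                "rows_examined": int(m.group(4)),
                "sql": [],
            }
            entries.append(current)
        elif line.startswith("#"):
            # "# Time:" / "# User@Host:" lines belong to the next entry.
            current = None
        elif current is not None and line.strip():
            current["sql"].append(line.strip())

    ranked = sorted(entries, key=lambda e: e["query_time"], reverse=True)
    return [
        (e["query_time"], e["rows_examined"], " ".join(e["sql"]))
        for e in ranked[:top]
    ]


# Invented sample entries, for illustration only.
SAMPLE = """\
# Time: 080710 15:12:02
# User@Host: zabbix[zabbix] @ localhost []
# Query_time: 3  Lock_time: 0  Rows_sent: 10  Rows_examined: 5000
SELECT * FROM example_a WHERE id = 1;
# Time: 080710 15:13:40
# User@Host: zabbix[zabbix] @ localhost []
# Query_time: 42  Lock_time: 0  Rows_sent: 1  Rows_examined: 2000000
SELECT * FROM example_b
WHERE flag = 0;
"""

if __name__ == "__main__":
    for seconds, rows, sql in slowest_queries(SAMPLE):
        print(seconds, rows, sql)
```

A query with a high Rows_examined relative to Rows_sent is usually the first candidate for an index. In MySQL 5.0 the log is typically enabled with the `log-slow-queries` and `long_query_time` options in the `[mysqld]` section of my.cnf, but check the manual for your exact version.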

          • disgruntleddutch
            Member
            • Oct 2006
            • 34

            #6
            I'll see if I can get those logs to you.

            How are you doing with the patch write up?

            • Palmertree
              Senior Member
              • Sep 2005
              • 746

              #7
              Sorry for the delay. Still working on the documentation of the patch. Hope to be done shortly. :-)

              • disgruntleddutch
                Member
                • Oct 2006
                • 34

                #8
                Great :-). Now to maybe explain more about my setup...

                Since it seems you have a big network to monitor, what are you using hardware-wise on the database side? I think some of my slowness issues stem from the database server not being able to keep up (IO-related, not processor-related).

                In all three environments I have the Zabbix server daemon running on an HP DL385 with RHEL 4.6 under virtually no load (0.20 - 0.30). There are also three database servers, one per environment, all Sun Fire T2000s running Solaris 10 with MySQL 5.0.51b, and all three (especially the master) appear to be under considerable load. The master in particular seems to have a high IO service time (70 - 75%).

                Perhaps I'm not optimizing my my.cnf file enough (the DB machines are dual-core, 32-thread servers with 8GB of RAM), or I just need to find a database server that has a hardware RAID card (the T2000 doesn't have one).

                Any input? I tried mysqltuner, but I don't think it likes Solaris 10 :-).

                PS. Attached is my my.cnf file in all its non-glory, saved as my.txt.
                Last edited by disgruntleddutch; 10-07-2008, 06:57.

                • Palmertree
                  Senior Member
                  • Sep 2005
                  • 746

                  #9
                  Ran into one small bug, which is now fixed. I'm testing and will post the patch and doc after making sure everything works 100%.

                  • disgruntleddutch
                    Member
                    • Oct 2006
                    • 34

                    #10
                    Thanks for the update!

                    • Palmertree
                      Senior Member
                      • Sep 2005
                      • 746

                      #11
                      Still working on this. Hope to be done by the end of the week. I'm making another modification so that the latest data is updated faster on the master node, just like history sync.
