Distributed monitoring problem

  • stalker
    Junior Member
    • Aug 2008
    • 29

    #1

    Distributed monitoring problem

    I am trying to set up a master node with one child node. Zabbix version 1.6, PostgreSQL build. The child node is a clean install.

    As standalone servers everything works, but after converting the databases (zabbix-server -n [1,2] -c /etc/zabbix/zabbix_server.conf) and adding the master/slave nodes in the GUI, I get an error when switching nodes in the GUI: Error: Unable to select configuration.

    In the log files I see:

    Master node zabbix-server.log:
    Code:
     23010:20080930:141743 NODE 1: Received auditlog from node 2 for node 2 datalen 904
     23010:20080930:141743 Query::insert into auditlog (auditid,userid,clock,action,resourcetype,details) values (200200000000001,200200000000002,1222768032,4,0,0x4d616e75616c204c6f676f7574)
     23010:20080930:141743 Query failed:PGRES_FATAL_ERROR:ERROR:  syntax error at or near "x4d616e75616c204c6f676f7574"
    LINE 1: ... (200200000000001,200200000000002,1222768032,4,0,0x4d616e756...
     23014:20080930:142214 NODE 1: Received auditlog from node 2 for node 2 datalen 904
     23014:20080930:142214 Query::insert into auditlog (auditid,userid,clock,action,resourcetype,details) values (200200000000001,200200000000002,1222768032,4,0,0x4d616e75616c204c6f676f7574)
     23014:20080930:142214 Query failed:PGRES_FATAL_ERROR:ERROR:  syntax error near "x4d616e75616c204c6f676f7574"
    LINE 1: ... (200200000000001,200200000000002,1222768032,4,0,0x4d616e756...
                                                                                                                 ^
    [etc..]
    Child node zabbix-server.log:
    Code:
     6078:20080930:141614 NODE 2: Sending auditlog of node 2 to node 1 datalen 904
      6078:20080930:141614 NOT OK
      6078:20080930:141624 NODE 2: Sending auditlog of node 2 to node 1 datalen 904
      6078:20080930:141624 NOT OK
      6078:20080930:141635 NODE 2: Sending auditlog of node 2 to node 1 datalen 904
      6078:20080930:141635 NOT OK
      6078:20080930:141644 NODE 2: Sending auditlog of node 2 to node 1 datalen 904
      6078:20080930:141644 NOT OK
      6078:20080930:141655 NODE 2: Sending auditlog of node 2 to node 1 datalen 904
      6078:20080930:141655 NOT OK
      6078:20080930:141705 NODE 2: Sending auditlog of node 2 to node 1 datalen 904
      6078:20080930:141705 NOT OK
    .....
      6078:20080930:141824 Query::insert into node_cksum (nodeid,tablename,recordid,cksumtype,cksum) select 2,'slideshows',slideshowid,1,md5(name)||','||delay from slideshows where 1=1 and slideshowid between 200000000000000 and 299999999999999
    union all select 2,'slides',slideid,1,slideshowid||','||screenid||','||step||','||delay from slides where 1=1 and slideid between 200000000000000 and 299999999999999
    union all select 2,'drules',druleid,1,proxy_hostid||','||md5(name)||','||md5(iprange)||','||delay||','||nextcheck||','||status from drules where 1=1 and druleid between 200000000000000 and 299999999999999
    union all select 2,'dchecks',dcheckid,1,druleid||','||type||','||md5(key_)||','||md5(snmp_community)||','||md5(ports) from dchecks where 1=1 and dcheckid between 200000000000000 and 299999999999999
    [... very long query with many unions ...]
      6078:20080930:141824 Query failed:PGRES_FATAL_ERROR:ERROR:  function md5(numeric) not found
    LINE 7: ...||status||','||md5(macros)||','||md5(agent)||','||md5(time)|...
                                                                 ^
    HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
    
      6078:20080930:141825 NODE 2: Sending auditlog of node 2 to node 1 datalen 904
      6078:20080930:141825 NOT OK
      6078:20080930:141834 NODE 2: Sending auditlog of node 2 to node 1 datalen 904
      6078:20080930:141834 NOT OK
      6078:20080930:141845 NODE 2: Sending auditlog of node 2 to node 1 datalen 904
    [etc... all errors repeated many times]
    Last edited by stalker; 30-09-2008, 12:47.
  • stalker
    Junior Member
    • Aug 2008
    • 29

    #2
    I deleted everything and installed new Zabbix servers with default settings (one master and two slaves). The problem remains.

    Everything was done exactly according to the official documentation.

    On the master, in the Zabbix log:
    Code:
    31661:20080930:164405 NODE 1: Received auditlog from node 2 for node 2 datalen 1866
     31661:20080930:164405 Query::insert into auditlog (auditid,userid,clock,action,resourcetype,details) values (200000000000001,200000000000002,1222767853,3,0,0x4c6f67696e206661696c6564205b4275646e696b6f765d)
     31661:20080930:164405 Query failed:PGRES_FATAL_ERROR:ERROR:  syntax error at or near "x4c6f67696e206661696c6564205b4275646e696b6f765d"
    LINE 1: ... (200000000000001,200000000000002,1222767853,3,0,0x4c6f67696...
                                                                 ^
    On both slaves:
    Code:
      9014:20080930:164108 NODE 3: Sending auditlog of node 3 to node 1 datalen 675
      9014:20080930:164108 NOT OK
      9014:20080930:164118 NODE 3: Sending auditlog of node 3 to node 1 datalen 675
      9014:20080930:164119 NOT OK
      9014:20080930:164128 NODE 3: Sending auditlog of node 3 to node 1 datalen 675
      9014:20080930:164128 NOT OK
      9014:20080930:164138 NODE 3: Sending auditlog of node 3 to node 1 datalen 675
      9014:20080930:164139 NOT OK
      9014:20080930:164148 NODE 3: Sending auditlog of node 3 to node 1 datalen 675
      9014:20080930:164148 NOT OK
      9014:20080930:164201 NODE 3: Sending auditlog of node 3 to node 1 datalen 675
      9014:20080930:164201 NOT OK


    • stalker
      Junior Member
      • Aug 2008
      • 29

      #3
      PS: Ubuntu 8.04 Server


      • stalker
        Junior Member
        • Aug 2008
        • 29

        #4
        Is distributed monitoring usable in version 1.6?

        With debug=4 on both sides:

        Freshly installed node 3:
        Code:
         24417:20081001:154429 NODE 3: Sending [ZBX_GET_HISTORY_LAST_ID�3�3
        auditlog�auditid] to Node [1]
         24417:20081001:154429 NODE 3: Receiving [0] from Node [1]
         24417:20081001:154429 Query [select auditid,userid,clock,action,resourcetype,details from auditlog where auditid>0 and auditid between 300000000000000 and 399999999999999 order by auditid limit 10000]
         24417:20081001:154429 NODE 3: Sending auditlog of node 3 to node 1 datalen 1362
         24417:20081001:154429 In connect_to_node(nodeid:1)
         24417:20081001:154429 Query [select ip,port from nodes where nodeid=1]
         24417:20081001:154429 NODE 3: Sending [History�3�3�auditlog
        300300000000001�300300000000002�1222797789�3�0�436f7272656374206c6f67696e205b41646d696e5d
        300300000000002�300300000000001�1222776685�1�2�485454502041757468656e7469636174696f6e
        300300000000003�300300000000001�1222776717�1�21�4e6f6465205b4c6f63616c206e6f64655d206964205b335d
        300300000000004�300300000000001�1222776761�0�21�4e6f6465205b4d6173746572206e6f64655d206964205b315d
        300300000000005�300300000000001�1222776771�1�21�4e6f6465205b454b54206e6f64655d206964205b335d
        300300000000006�300300000000001�1222776947�1�21�4e6f6465205b4d6f73636f775d206964205b315d
        300300000000007�300300000000001�1222776958�1�21�4e6f6465205b456b61746572696e627572675d206964205b335d
        300300000000008�300300000000002�1222850788�3�0�4c6f67696e206661696c6564205b41646d696e5d
        300300000000009�300300000000002�1222850804�3�0�436f7272656374206c6f67696e205b41646d696e5d
        300300000000010�300300000000001�1222852068�4�0�4d616e75616c204c6f676f7574
        300300000000011�300300000000002�1222852075�3�0�4c6f67696e206661696c6564205b41646d696e5d
        300300000000012�300300000000002�1222852084�3�0�436f7272656374206c6f67696e205b41646d696e5d
        300300000000013�300300000000001�1222852104�1�21�4e6f6465205b566f726f6e657a685d206964205b315d
        300300000000014�300300000000001�1222861365�4�0�4d616e75616c204c6f676f7574
        300300000000015�300300000000002�1222861373�3�0�436f7272656374206c6f67696e205b41646d696e5d] to Node [1]
         24417:20081001:154429 NODE 3: Receiving [] from Node [1]
         24417:20081001:154429 NOT OK
        Freshly installed node 1:
        Code:
         18031:20081001:155232 Trapper got [ZBX_GET_HISTORY_LAST_ID�3�3
        auditlog�auditid] len 44
         18031:20081001:155232 In send_list_of_history_ids()
         18031:20081001:155232 Query [select MAX(auditid) from auditlog where 1=1 and auditid between 300000000000000 and 399999999999999]
         18031:20081001:155232 NODE 1: Sending [0] to Node [3]
         18033:20081001:155232 Trapper got [History�3�3�auditlog
        300300000000001�300300000000002�1222797789�3�0�436f7272656374206c6f67696e205b41646d696e5d
        300300000000002�300300000000001�1222776685�1�2�485454502041757468656e7469636174696f6e
        300300000000003�300300000000001�1222776717�1�21�4e6f6465205b4c6f63616c206e6f64655d206964205b335d
        300300000000004�300300000000001�1222776761�0�21�4e6f6465205b4d6173746572206e6f64655d206964205b315d
        300300000000005�300300000000001�1222776771�1�21�4e6f6465205b454b54206e6f64655d206964205b335d
        300300000000006�300300000000001�1222776947�1�21�4e6f6465205b4d6f73636f775d206964205b315d
        300300000000007�300300000000001�1222776958�1�21�4e6f6465205b456b61746572696e627572675d206964205b335d
        300300000000008�300300000000002�1222850788�3�0�4c6f67696e206661696c6564205b41646d696e5d
        300300000000009�300300000000002�1222850804�3�0�436f7272656374206c6f67696e205b41646d696e5d
        300300000000010�300300000000001�1222852068�4�0�4d616e75616c204c6f676f7574
        300300000000011�300300000000002�1222852075�3�0�4c6f67696e206661696c6564205b41646d696e5d
        300300000000012�300300000000002�1222852084�3�0�436f7272656374206c6f67696e205b41646d696e5d
        300300000000013�300300000000001�1222852104�1�21�4e6f6465205b566f726f6e657a685d206964205b315d
        300300000000014�300300000000001�1222861365�4�0�4d616e75616c204c6f676f7574
        300300000000015�300300000000002�1222861373�3�0�436f7272656374206c6f67696e205b41646d696e5d] len 1362
         18033:20081001:155232 In node_history()
         18033:20081001:155232 Query [begin;]
         18033:20081001:155232 NODE 1: Received auditlog from node 3 for node 3 datalen 1362
         18033:20081001:155232 In process_record ()
         18033:20081001:155232 Query [insert into auditlog (auditid,userid,clock,action,resourcetype,details) values (300300000000001,300300000000002,1222797789,3,0,0x436f7272656374206c6f67696e205b41646d696e5d)]
         18033:20081001:155232 Query::insert into auditlog (auditid,userid,clock,action,resourcetype,details) values (300300000000001,300300000000002,1222797789,3,0,0x436f7272656374206c6f67696e205b41646d696e5d)
         18033:20081001:155232 Query failed:PGRES_FATAL_ERROR:ERROR:  syntax error "x436f7272656374206c6f67696e205b41646d696e5d"
        LINE 1: ... (300300000000001,300300000000002,1222797789,3,0,0x436f72726...
                                                                     ^
        
         18033:20081001:155232 Query [rollback;]
        In the database, the details field is a varchar, so its value should be quoted in the query.
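
To see what the last field of those records actually contains, here is a minimal Python sketch (not Zabbix code; the function name is made up for illustration). It assumes the field is plain hex-encoded text, which matches the samples in the log above.

```python
def decode_details(hex_text: str) -> str:
    """Decode the hex-encoded details field of an auditlog record."""
    return bytes.fromhex(hex_text).decode("utf-8")


# Values copied from the History message in the log above.
samples = [
    "436f7272656374206c6f67696e205b41646d696e5d",
    "4d616e75616c204c6f676f7574",
    "485454502041757468656e7469636174696f6e",
]

for sample in samples:
    print(decode_details(sample))
# Correct login [Admin]
# Manual Logout
# HTTP Authentication
```

So the payload is ordinary text destined for a varchar column; only the way it is written into the SQL statement differs between database engines.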


        • stalker
          Junior Member
          • Aug 2008
          • 29

          #5
          With the MySQL engine everything works.


          • tekknokrat
            Senior Member
            • Sep 2008
            • 140

            #6
            Did you also try using a proxy for distributed monitoring?
            By the way, I have packages for Ubuntu here


            • Alvils
              Junior Member
              • Dec 2007
              • 1

              #7
              A small patch

              It seems that the Zabbix server tries to insert data into PostgreSQL as if it were MySQL.



              I created a quick patch to check this:
              In file src/zabbix_server/trapper/nodehistory.c
              Find line 348. It should look like:
              zbx_snprintf_alloc(sql, sql_allocated, &sql_offset, len + 8, "0x%s,",
              Change that to
              zbx_snprintf_alloc(sql, sql_allocated, &sql_offset, len + 38, "encode(decode('%s','hex'),'escape'),",

              Recompile Zabbix and that should work.
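
As a rough illustration of what the two format strings produce, here is a small Python sketch (not Zabbix code; `details_literal` and the dialect names are hypothetical). It shows the MySQL-style `0x...` hex literal next to the PostgreSQL expression used in the patch, and rejects anything that is not pure hex before building the fragment.

```python
import re

HEX_RE = re.compile(r"(?:[0-9a-fA-F]{2})+")


def details_literal(hex_text: str, dialect: str) -> str:
    """Build the SQL value fragment for a hex-encoded text field."""
    if not HEX_RE.fullmatch(hex_text):
        raise ValueError("expected an even-length hex string")
    if dialect == "mysql":
        # MySQL accepts 0x... as a hexadecimal literal.
        return "0x" + hex_text
    if dialect == "postgresql":
        # PostgreSQL has no 0x literal; decode() turns hex into bytea
        # and encode(..., 'escape') turns that back into text.
        return "encode(decode('%s','hex'),'escape')" % hex_text
    raise ValueError("unknown dialect: " + dialect)


value = "4d616e75616c204c6f676f7574"  # from the log: "Manual Logout"
print(details_literal(value, "mysql"))
print(details_literal(value, "postgresql"))
```

The first line printed is the form that PostgreSQL rejects with the syntax error shown earlier in the thread; the second is the form the patch generates.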

              Yet, it still crashes occasionally. Perhaps after receiving data from Zabbix remote nodes. For now, I stopped the remote nodes and will see if it still crashes. If not, I will take a look at this problem again...


              • stalker
                Junior Member
                • Aug 2008
                • 29

                #8
                I do not use a proxy.

                I have switched to the MySQL version. With some kicking, reinstalling and shamanism I managed to join 4 nodes. It appears to be working.

                The following things are annoying:

                1. Periodically I see this message in the logs: "16602:20081006:165209 Timeout while answering request".

                2. Sometimes the mysqld process starts consuming 100% of the CPU, and during that time (judging by the logs) data acquisition from the slave nodes stops.


                • stalker
                  Junior Member
                  • Aug 2008
                  • 29

                  #9
                  After registering a new node and receiving the first portion of data (16603:20081006:171142 NODE 1: Received data from slave node 3 for node 3 datalen 4130113), CPU idle time drops to about 0% for a very long time (e.g. Cpu(s): 30.3%us, 45.3%sy, 0.0%ni, 0.3%id, 22.3%wa, 0.7%hi, 1.0%si, 0.0%st).

                  At this time, in the node 3 logs:
                  Code:
                    7538:20081006:181059 server #19 started [HTTP Poller]
                    7540:20081006:181059 server #20 started [HTTP Poller]
                    7542:20081006:181059 server #21 started [HTTP Poller]
                    7545:20081006:181059 server #23 started [Escalator]
                    7511:20081006:181059 server #0 started [Watchdog]
                    7511:20081006:181059 In main_watchdog_loop()
                    7529:20081006:181059 server #15 started [Poller for unreachable hosts. SNMP:YES]
                    7544:20081006:181059 server #22 started [Discoverer. SNMP:YES]
                    7525:20081006:181114 Deleted 0 records from history and trends
                    7530:20081006:181115 NODE 3: Sending configuration changes to master node 1 for node 3 datalen 4130114


                  • stalker
                    Junior Member
                    • Aug 2008
                    • 29

                    #10
                    After 30 minutes the node was added!


                    • stalker
                      Junior Member
                      • Aug 2008
                      • 29

                      #11
                      Problems:
                      1. Periodically, when I switch to some node in the GUI, I get the error "cannot select configuration".

                      2. Periodically, errors appear in the log while CPU usage is at 100%:
                      Code:
                       Query failed: [delete from node_cksum where nodeid=2 and cksumtype=1] Deadlock found when trying to get lock; try restarting transaction [1213]
                       Query failed: [delete from node_cksum where nodeid=3 and cksumtype=1] Deadlock found when trying to get lock; try restarting transaction [1213]
                       Query failed: [delete from node_cksum where nodeid=4 and cksumtype=1] Deadlock found when trying to get lock; try restarting transaction [1213]
                       Query failed: [delete from node_cksum where nodeid=5 and cksumtype=1] Deadlock found when trying to get lock; try restarting transaction [1213]
                       Query failed: [delete from node_cksum where nodeid=4 and cksumtype=1] Deadlock found when trying to get lock; try restarting transaction [1213]
                      ....


                      • stalker
                        Junior Member
                        • Aug 2008
                        • 29

                        #12
                        About the errors "Query failed: [delete from node_cksum where nodeid=2 and cksumtype=1] Deadlock found when trying to get lock; try restarting transaction [1213]":

                        This error arises when a new node is added. MySQL consumes 100% of resources after receiving the configuration of the new node. After 20 minutes, the errors "Deadlock found when trying to get lock; try restarting transaction" start to appear in the log.
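
The MySQL message itself suggests restarting the transaction. The sketch below shows that generic retry pattern in Python; it is not how Zabbix handles the error. `DbError` is a hypothetical stand-in for a driver exception that carries the MySQL error code, and the attempt count and delay are arbitrary.

```python
import time

MYSQL_ER_LOCK_DEADLOCK = 1213  # the code shown in brackets in the log


class DbError(Exception):
    """Hypothetical driver exception carrying a MySQL error code."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


def run_with_deadlock_retry(transaction, attempts=3, delay=0.1):
    """Run transaction(); if it fails with a deadlock, run it again."""
    for attempt in range(1, attempts + 1):
        try:
            return transaction()
        except DbError as exc:
            # Re-raise anything that is not a deadlock, or the last failure.
            if exc.code != MYSQL_ER_LOCK_DEADLOCK or attempt == attempts:
                raise
            time.sleep(delay * attempt)  # simple linear back-off


# Demo: a fake transaction that deadlocks twice and then succeeds.
attempts_seen = []


def flaky_delete():
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise DbError(1213, "Deadlock found when trying to get lock; "
                            "try restarting transaction")
    return "deleted"


print(run_with_deadlock_retry(flaky_delete, attempts=5, delay=0))
```

Retrying only hides the symptom; if the deadlocks come from several node synchronisations deleting from node_cksum at once, the underlying contention is still there.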


                        • Alexei
                          Founder, CEO
                          Zabbix Certified Trainer
                          Zabbix Certified Specialist
                          Zabbix Certified Professional
                          • Sep 2004
                          • 5654

                          #13
                          Registered as ZBX-537.
                          Alexei Vladishev
                          Creator of Zabbix, Product manager
                          New York | Tokyo | Riga


                          • stalker
                            Junior Member
                            • Aug 2008
                            • 29

                            #14
                            At this time, on the slave node:
                            Code:
                              PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                   
                            13537 zabbix    25   5 1177m 421m  768 S  0.3 29.0   0:43.91 zabbix_server                                                                             
                            11101 mysql     20   0  234m  84m 3504 S 95.0  5.8 976:27.65 mysqld


                            • stalker
                              Junior Member
                              • Aug 2008
                              • 29

                              #15
                              Possibly this is a timeout problem. How can I change the timeouts?

                              The links between my nodes are 2 Mbps, and after adding a new switch to a slave from the master node I repeatedly see on the master node:

                              Code:
                              16675:20081009:130543 Timeout while answering request
                               16675:20081009:130543 NODE 1: Error while receiving answer from Node [2] error: ZBX_TCP_READ() failed [Interrupted system call]
                              and on the slave node, repeatedly:

                              Code:
                              13537:20081009:130217 NODE 2: Received data from master node 1 for node 2 datalen 1944392
                              On the slave node, after receiving these 1944392 bytes, MySQL uses 100% CPU for 20-30 minutes, and then, because of the timeout, the same data is received again. A perpetuum mobile.

                              During the initial sync node 1 received 4 MB of data and turned into the same perpetuum mobile.

                              How can I change the timeouts?
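
For a sense of scale, here is a back-of-the-envelope estimate in Python of how long the payloads from the logs need just to cross the link. It assumes 2 Mbps means 2,000,000 bits per second and that the link is otherwise idle, and it ignores protocol overhead, so real transfers would take longer.

```python
def transfer_seconds(size_bytes: int, link_bits_per_second: float) -> float:
    """Ideal time to push size_bytes through a link of the given speed."""
    return size_bytes * 8 / link_bits_per_second


LINK_BPS = 2_000_000  # assumption: 2 Mbps, decimal megabits

# datalen values taken from the log lines quoted in this thread
for size in (1944392, 4130113):
    print(size, round(transfer_seconds(size, LINK_BPS), 1))
# 1944392 7.8
# 4130113 16.5
```

The raw transfer takes seconds, while the database work that follows takes 20-30 minutes, so any timeout measured in seconds would expire long before the receiving node has finished processing the data.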
