Distributed monitoring problems

  • fatiha
    Member
    • Nov 2007
    • 78

    #1

    Distributed monitoring problems

    Hi,

    I have the same problem and I really need your patch. I am running Zabbix 1.4.4 with one master and three child nodes.

    Here is the error from the zabbix_server log on one of my nodes:

    20629:20080808:155933 NOT OK
    20629:20080808:155934 NODE 3: Sending new history of node 3 to node 1 datalen 370521
    20629:20080808:155935 Error while sending data to Node [1]
    20629:20080808:155935 NODE 3: Sending new history_uint of node 3 to node 1 datalen 333691
    20629:20080808:155935 NODE 3: Sending new history_str of node 3 to node 1 datalen 40953
    20629:20080808:160001 Query failed: [update ids set nextid=nextid+1 where nodeid=3 and table_name='node_cksum' and field_name='cksumid'] Server shutdown in progress [1053]
    20612:20080808:160001 Query::select i.itemid,i.key_,h.host,h.port,i.delay,i.description,i.nextcheck,i.type,i.snmp_community,i.snmp_oid,h.useip,h.ip,i.history,i.lastvalue,i.prevvalue,i.hostid,h.status,i.value_type,h.errors_from,i.snmp_port,i.delta,i.prevorgvalue,i.lastclock,i.units,i.multiplier,i.snmpv3_securityname,i.snmpv3_securitylevel,i.snmpv3_authpassphrase,i.snmpv3_privpassphrase,i.formula,h.available,i.status,i.trapper_hosts,i.logtimefmt,i.valuemapid,i.delay_flex,h.dns from hosts h, items i where i.nextcheck<=1218204001 and i.status in (0,3) and i.type not in (2,7,9) and h.status=0 and h.disable_until<=1218204001 and h.errors_from=0 and h.hostid=i.hostid and mod(i.itemid,5)=1 and i.key_ not in ('status','icmpping','icmppingsec','zabbix[log]') and h.hostid>=100000000000000*3 and h.hostid<=(100000000000000*3+99999999999999) order by i.nextcheck
    20612:20080808:160001 Query failed:MySQL server has gone away [2006]


    There is nothing in the MySQL logs.

    After my node crashes, the master crashes as well, with this error:

    6387:20080808:115944 zbx_realloc: out of memory. requested '49152' bytes.
    6343:20080808:115944 One child process died. Exiting ...
    6343:20080808:115946 ZABBIX Server stopped

    or with this one:

    12653:20080808:155825 zbx_malloc: out of memory. requested '16384' bytes.
    12605:20080808:155825 One child process died. Exiting ...
    12605:20080808:155827 ZABBIX Server stopped

    Can anyone help me, please? I don't know whether this is a Zabbix problem or a MySQL problem.

    Thanks,
    Fatiha
  • xs-
    Senior Member
    Zabbix Certified Specialist
    • Dec 2007
    • 393

    #2
    Are you running MySQL 5 on the node that says 'Query failed:MySQL server has gone away'? If not, try that first.

    It's a known issue that when a node fails (master or slave, it doesn't matter) during a history or config update, the other nodes die as well.

    • fatiha
      Member
      • Nov 2007
      • 78

      #3
      Yes, it's MySQL 5.0.37.

      • xs-
        Senior Member
        Zabbix Certified Specialist
        • Dec 2007
        • 393

        #4
        When the Zabbix node dies with the 'MySQL server has gone away' error, is MySQL still running at that point? If so, try manually running the query shown in the zabbix_server.log file to see why it failed. You might have a data inconsistency somewhere in the database.
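When re-running the logged query by hand, it can help to decode its two constants first. The sketch below is only an illustration in Python: it reproduces the `hostid` bounds that appear in the logged `where` clause (`100000000000000*3` up to `100000000000000*3+99999999999999`) and converts the `nextcheck` value `1218204001` to a readable UTC time. The function names are hypothetical, and the per-node range size is inferred from that one logged query, not taken from Zabbix documentation.

```python
from datetime import datetime, timezone
from typing import Tuple

# Assumption: inferred from the logged query, in which node 3 is limited to
# hostids 100000000000000*3 .. 100000000000000*3 + 99999999999999.
IDS_PER_NODE = 100000000000000


def node_id_range(nodeid: int) -> Tuple[int, int]:
    """Return the (lowest, highest) hostid the logged query allows for a node."""
    low = IDS_PER_NODE * nodeid
    high = low + IDS_PER_NODE - 1
    return low, high


def epoch_to_utc(timestamp: int) -> str:
    """Format a Unix timestamp (as used for i.nextcheck) as a UTC string."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    print(node_id_range(3))
    print(epoch_to_utc(1218204001))
```

For node 3 this gives 300000000000000 to 399999999999999, and the timestamp decodes to 2008-08-08 14:00:01 UTC. That is two hours behind the 16:00:01 in the log prefix, which would be consistent with a server clock set to UTC+2.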

        • fatiha
          Member
          • Nov 2007
          • 78

          #5
          All my nodes are OK now, because I restored my database from a backup (2008/07/31).

          But I still get an error on my master node:

          5382:20080811:150817 zbx_realloc: out of memory. requested '49152' bytes.
          5340:20080811:150818 One child process died. Exiting ...
          5340:20080811:150820 ZABBIX Server stopped

          Thanks,
          Fatiha

          • fatiha
            Member
            • Nov 2007
            • 78

            #6
            I increased the number of trappers on the master and on one of the nodes, because I saw that the number of zabbix_server processes differed between nodes.

            But I still get this error on the master node:

            9495:20080811:155100 zbx_malloc: out of memory. requested '16384' bytes.
            9460:20080811:155100 One child process died. Exiting ...
            9460:20080811:155102 ZABBIX Server stopped

            So I have to restart zabbix_server every two or three minutes.

            Please help me.
            Fatiha

            • fatiha
              Member
              • Nov 2007
              • 78

              #7
              New information (still Zabbix 1.4.4): I now understand how it crashes.

              Just before the master node crashes, I see this in the log of one of the nodes:

              25526:20080811:165523 Deleted 1516 records from history and trends
              25532:20080811:165631 NODE 3: Sending configuration changes of node 3 to node 1 datalen 14941750
              25532:20080811:165730 NOT OK
              25532:20080811:165730 NODE 3: Sending new history of node 3 to node 1 datalen 351442
              25532:20080811:165732 Error while sending data to Node [1]

              And the master node stopped about one minute later:

              13883:20080811:165524 NODE 1: Received data from node 3 for node 3 datalen 14941750
              [...]
              13883:20080811:165621 zbx_realloc: out of memory. requested '49152' bytes.
              13823:20080811:165621 One child process died. Exiting ...
              13823:20080811:165623 ZABBIX Server stopped

              How can I increase the amount of data the master node is able to receive?
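To check how large the inter-node transfers were before a crash, the `datalen` values can be pulled out of the log automatically. The following Python sketch assumes the line layout seen in the excerpts above (`PID:YYYYMMDD:HHMMSS message`); the names `parse_line` and `largest_transfer_before_oom` are hypothetical helpers, not part of Zabbix. It only reports which transfer was largest and how long before the first "out of memory" line it was logged; it does not prove that the transfer caused the crash.

```python
import re
from datetime import datetime
from typing import Iterable, NamedTuple, Optional

# Assumed layout, taken from the log excerpts: "PID:YYYYMMDD:HHMMSS message"
LOG_LINE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6})\s+(.*)$")
DATALEN = re.compile(r"datalen (\d+)")


class LogEntry(NamedTuple):
    pid: int
    when: datetime
    message: str


class Transfer(NamedTuple):
    datalen: int
    entry: LogEntry
    seconds_before_oom: float


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one log line into pid, timestamp and message; None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    pid, date, time, message = match.groups()
    when = datetime.strptime(date + time, "%Y%m%d%H%M%S")
    return LogEntry(int(pid), when, message)


def largest_transfer_before_oom(lines: Iterable[str]) -> Optional[Transfer]:
    """Return the largest datalen logged before the first 'out of memory' line."""
    largest = None  # (datalen, entry) of the biggest transfer seen so far
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip blank lines and elisions such as "[...]"
        if "out of memory" in entry.message:
            if largest is None:
                return None
            datalen, seen = largest
            gap = (entry.when - seen.when).total_seconds()
            return Transfer(datalen, seen, gap)
        found = DATALEN.search(entry.message)
        if found and (largest is None or int(found.group(1)) > largest[0]):
            largest = (int(found.group(1)), entry)
    return None


if __name__ == "__main__":
    master_log = [
        "13883:20080811:165524 NODE 1: Received data from node 3 for node 3 datalen 14941750",
        "[...]",
        "13883:20080811:165621 zbx_realloc: out of memory. requested '49152' bytes.",
    ]
    print(largest_transfer_before_oom(master_log))
```

On the master excerpt above, this picks out the 14941750-byte configuration transfer (roughly 14 MiB), logged 57 seconds before the `zbx_realloc` failure in the same process (PID 13883).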

              Thanks,
              Fatiha
              Last edited by fatiha; 11-08-2008, 17:01.

              • xs-
                Senior Member
                Zabbix Certified Specialist
                • Dec 2007
                • 393

                #8
                I'm not sure, but this could be related to the amount of shared memory and the number of semaphore segments available.
                There are several examples and links to related reading on this subject on the forum.

                • fatiha
                  Member
                  • Nov 2007
                  • 78

                  #9
                  You're right, there were two problems:

                  - a database problem => data inconsistency
                  - a system problem

                  I changed these parameters in /etc/sysctl.conf and rebooted my master server:
                  fs.file-max = 65535
                  kernel.random.poolsize = 1024
                  kernel.shmmax = 536870912
                  kernel.shmmni = 4096
                  kernel.shmall = 536870912
                  kernel.sem = 250 32000 100 128
                  fs.file-max = 65536

                  I deleted all the files (ibdata1, logs, ...), dropped the zabbix database, and restored the databases on the master and on the child nodes.

                  So now it works!
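For anyone reusing the sysctl settings listed above, here is a small Python sketch that reads them the way they will take effect. It rests on two assumptions that are labeled in the code: that a key repeated in the file ends up with its last value (so `fs.file-max` becomes 65536, not 65535), and that the page size is 4096 bytes. On Linux `kernel.shmall` is counted in pages, not bytes, so the value above allows far more than `kernel.shmmax` suggests. The helper name `parse_sysctl` is hypothetical.

```python
from typing import Dict

# Settings copied from the post above, including the repeated fs.file-max line.
SYSCTL_TEXT = """
fs.file-max = 65535
kernel.random.poolsize = 1024
kernel.shmmax = 536870912
kernel.shmmni = 4096
kernel.shmall = 536870912
kernel.sem = 250 32000 100 128
fs.file-max = 65536
"""

# Assumption: 4096-byte pages, typical for x86 Linux; check with `getconf PAGE_SIZE`.
ASSUMED_PAGE_SIZE = 4096


def parse_sysctl(text: str) -> Dict[str, str]:
    """Parse 'key = value' lines; a later line for the same key overrides an earlier one."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        key, separator, value = line.partition("=")
        if separator:
            settings[key.strip()] = value.strip()
    return settings


if __name__ == "__main__":
    settings = parse_sysctl(SYSCTL_TEXT)
    print("fs.file-max in effect:", settings["fs.file-max"])

    # kernel.shmmax is the largest single segment, in bytes.
    print("shmmax MiB:", int(settings["kernel.shmmax"]) // (1024 * 1024))

    # kernel.shmall is the system-wide total, in pages.
    shmall_bytes = int(settings["kernel.shmall"]) * ASSUMED_PAGE_SIZE
    print("shmall GiB:", shmall_bytes // (1024 ** 3))

    # kernel.sem holds four fields: SEMMSL SEMMNS SEMOPM SEMMNI.
    semmsl, semmns, semopm, semmni = map(int, settings["kernel.sem"].split())
    print("max semaphores system-wide (SEMMNS):", semmns)
```

The sketch only interprets the file; it does not apply anything. The values themselves are the ones from the post and are not a recommendation for other systems.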
