Zabbix Master Node-Server 1.8.4 Crashing in DM Setup

  • ApolloDS
    Junior Member
    • Dec 2010
    • 11

    #1


    Hello,
    Maybe someone has run into the same problem as me?
    I have a brand-new setup with Zabbix 1.8.3 and PostgreSQL,
    using a fresh, clean install of CentOS 5.5 as the host OS.
    Both machines have exactly the same setup.
    I then configured distributed monitoring as follows:

    Node 10=172.22.0.10, Master
    Node 20=172.23.0.10, Slave

    I set up the nodes in both web interfaces and everything seemed to work.

    Today I updated to 1.8.4, and since then I get the following error, but only on the master node:

    21207:20110105:143017.479 Starting Zabbix Server. Zabbix 1.8.4 (revision 16604).
    21207:20110105:143017.481 ****** Enabled features ******
    21207:20110105:143017.481 SNMP monitoring: YES
    21207:20110105:143017.481 IPMI monitoring: YES
    21207:20110105:143017.481 WEB monitoring: YES
    21207:20110105:143017.481 Jabber notifications: YES
    21207:20110105:143017.481 Ez Texting notifications: YES
    21207:20110105:143017.481 ODBC: NO
    21207:20110105:143017.481 SSH2 support: YES
    21207:20110105:143017.481 IPv6 support: NO
    21207:20110105:143017.481 ******************************
    21210:20110105:143017.628 server #1 started [DB Cache]
    21212:20110105:143017.697 server #2 started [Poller. SNMP:YES]
    21213:20110105:143017.803 server #3 started [Poller. SNMP:YES]
    21214:20110105:143017.915 server #4 started [Poller. SNMP:YES]
    21221:20110105:143018.074 server #8 started [Trapper]
    21218:20110105:143018.102 server #5 started [Poller. SNMP:YES]
    21219:20110105:143018.153 server #6 started [Poller. SNMP:YES]
    21220:20110105:143018.227 server #7 started [Poller for unreachable hosts. SNMP:YES]
    21227:20110105:143018.371 server #10 started [Trapper]
    21229:20110105:143018.415 server #11 started [Trapper]
    21231:20110105:143018.460 server #12 started [Trapper]
    21233:20110105:143018.491 server #13 started [ICMP pinger]
    21222:20110105:143018.508 server #9 started [Trapper]
    21234:20110105:143018.595 server #14 started [Alerter]
    21238:20110105:143018.657 server #15 started [Housekeeper]
    21238:20110105:143018.658 Executing housekeeper
    21240:20110105:143018.728 server #16 started [Timer]
    21242:20110105:143019.135 server #17 started [Node watcher. Node ID:10]
    21246:20110105:143019.269 server #20 started [DB Syncer]
    21243:20110105:143019.731 server #18 started [HTTP Poller]
    21245:20110105:143019.812 server #19 started [Discoverer. SNMP:YES]
    21251:20110105:143019.931 server #21 started [DB Syncer]
    21252:20110105:143019.973 server #22 started [DB Syncer]
    21258:20110105:143020.011 server #23 started [DB Syncer]
    21260:20110105:143020.057 server #24 started [Escalator]
    21262:20110105:143020.085 server #25 started [Proxy Poller]
    21207:20110105:143020.127 server #0 started [Watchdog]
    21221:20110105:143021.827 NODE 10: Received history from node 20 for node 20 datalen 9878
    21231:20110105:143022.117 NODE 10: Received history_uint from node 20 for node 20 datalen 3526
    21231:20110105:143022.423 NODE 10: Received auditlog from node 20 for node 20 datalen 329
    21231:20110105:143022.495 NODE 10: Received auditlog_details from node 20 for node 20 datalen 10356
    21231:20110105:143022.495 Got signal [signal:11(SIGSEGV),reason:128,refaddr:(nil)]. Crashing ...
    21231:20110105:143022.496 ====== Fatal information: ======
    21231:20110105:143022.496 Program counter: 0x3d35479a10
    21231:20110105:143022.496 === Registers: ===
    21231:20110105:143022.496 r8 = 0 = 0 = 0
    21231:20110105:143022.496 r9 = adadadadadadadad = 12514849900987264429 = -5931894172722287187
    21231:20110105:143022.496 r10 = 22 = 34 = 34
    21231:20110105:143022.496 r11 = 246 = 582 = 582
    21231:20110105:143022.496 r12 = adadadadadadadad = 12514849900987264429 = -5931894172722287187
    21231:20110105:143022.496 r13 = 73 = 115 = 115
    21231:20110105:143022.496 r14 = d = 13 = 13
    21231:20110105:143022.497 r15 = 7fff9bac92ec = 140735805166316 = 140735805166316
    21231:20110105:143022.497 rdi = adadadadadadadad = 12514849900987264429 = -5931894172722287187
    21231:20110105:143022.497 rsi = 1 = 1 = 1
    21231:20110105:143022.497 rbp = 7fff9bac8b10 = 140735805164304 = 140735805164304
    21231:20110105:143022.497 rbx = 7fff9bac8ca0 = 140735805164704 = 140735805164704
    21231:20110105:143022.497 rdx = 7fff9bac8ce8 = 140735805164776 = 140735805164776
    21231:20110105:143022.497 rax = adadadadadadadad = 12514849900987264429 = -5931894172722287187
    21231:20110105:143022.497 rcx = 3 = 3 = 3
    21231:20110105:143022.498 rsp = 7fff9bac8468 = 140735805162600 = 140735805162600
    21231:20110105:143022.498 rip = 3d35479a10 = 262886890000 = 262886890000
    21231:20110105:143022.498 efl = 10217 = 66071 = 66071
    21231:20110105:143022.498 csgsfs = 33 = 51 = 51
    21231:20110105:143022.498 err = 0 = 0 = 0
    21231:20110105:143022.498 trapno = d = 13 = 13
    21231:20110105:143022.498 oldmask = 0 = 0 = 0
    21231:20110105:143022.498 cr2 = 0 = 0 = 0
    21231:20110105:143022.498 === Backtrace: ===
    21231:20110105:143022.504 15: zabbix_server(print_fatal_info+0xcd) [0x43bc3d]
    21231:20110105:143022.505 14: zabbix_server(child_signal_handler+0xeb) [0x43b48b]
    21231:20110105:143022.505 13: /lib64/libc.so.6 [0x3d354302d0]
    21231:20110105:143022.505 12: /lib64/libc.so.6(strlen+0x10) [0x3d35479a10]
    21231:20110105:143022.505 11: /lib64/libc.so.6(_IO_vfprintf+0x4479) [0x3d35446b69]
    21231:20110105:143022.505 10: /lib64/libc.so.6(vsnprintf+0x9a) [0x3d3546988a]
    21231:20110105:143022.505 9: zabbix_server(zbx_vsnprintf+0x16) [0x443bf6]
    21231:20110105:143022.505 8: zabbix_server(__zbx_zbx_snprintf_alloc+0x112) [0x443d42]
    21231:20110105:143022.505 7: zabbix_server [0x42040d]
    21231:20110105:143022.505 6: zabbix_server(node_history+0x409) [0x420cb9]
    21231:20110105:143022.505 5: zabbix_server(process_trapper_child+0x2b4) [0x41ea64]
    21231:20110105:143022.506 4: zabbix_server(child_trapper_main+0xa2) [0x41f272]
    21231:20110105:143022.506 3: zabbix_server(MAIN_ZABBIX_ENTRY+0x5a8) [0x4106f8]
    21231:20110105:143022.506 2: zabbix_server(daemon_start+0x1fe) [0x43b27e]
    21231:20110105:143022.506 1: /lib64/libc.so.6(__libc_start_main+0xf4) [0x3d3541d994]
    21231:20110105:143022.506 0: zabbix_server [0x40cd29]

    It seems that the trapper goes haywire right after receiving the auditlog_details from node 20.
    The error is reproducible every time I restart node 10.
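
    For what it's worth, the backtrace is consistent with vsnprintf() being handed a bad pointer for a "%s" argument: frames 9-12 show zbx_vsnprintf() calling into libc and dying inside strlen(), and the repeated 0xad byte pattern in rax/rdi/r12 looks like uninitialized or freed memory. A minimal C sketch of that failure mode (hypothetical function, not actual Zabbix source):

    ```c
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical stand-in for the __zbx_zbx_snprintf_alloc() call in the
     * backtrace: snprintf("%s", ...) has to strlen() the pointer it is given,
     * so if 'field' is dangling (e.g. points at freed memory filled with 0xad
     * bytes), the strlen() inside libc walks into unmapped memory and the
     * process takes SIGSEGV -- matching frames 9-12 of the crash above. */
    static int format_record(char *buf, size_t len, const char *field)
    {
        return snprintf(buf, len, "auditlog_details: %s", field);
    }

    int main(void)
    {
        char buf[64];

        /* With a valid pointer this is fine: */
        format_record(buf, sizeof(buf), "ok");
        puts(buf);

        /* With a dangling or uninitialized pointer the strlen() inside
         * snprintf() would segfault; not demonstrated here. */
        return 0;
    }
    ```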

    This is not a production environment yet, but it should become one in the next few days.

    Can anyone help with this, or should I open a bug with Zabbix?

    Thanks
    Peter
  • ApolloDS
    Junior Member
    • Dec 2010
    • 11

    #2
    Addition

    I've been playing around with this new installation and tried the following:

    - Restored the DB backup from the initial clean 1.8.3 config.
    - Set both nodes to "0" (without DM).
    - Did an upgrade to 1.8.4 on both nodes.
    - Tested everything on both nodes; everything works fine. The zabbix_server.log is clean on both nodes, without any problem messages. Each node monitors itself, and one node also monitors two additional hosts.

    So that all seems to run well.
    Then I tried to enable DM again:
    - Master node is NodeID=10, slave node is NodeID=20.
    - Ran "zabbix_server -n <id> -c /etc/zabbix/zabbix_server.conf" on both nodes; it completed without errors.
    - Restarted the Zabbix server on each node.
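
    The sequence above can be sketched as a shell snippet (the zabbix_server invocation is the one from this post; the stop/start commands are assumptions, adjust them to your init scripts):

    ```shell
    # DM re-enable sequence from the steps above. Only the echo actually runs
    # here; the real commands are left commented because the -n conversion
    # rewrites the IDs in the database and must run with the server stopped.
    CONF=/etc/zabbix/zabbix_server.conf
    NODEID=10                                 # use 20 on the slave node

    # /etc/init.d/zabbix_server stop          # assumption: init-script name
    # zabbix_server -n "$NODEID" -c "$CONF"   # convert DB IDs for this node
    # /etc/init.d/zabbix_server start

    echo "node $NODEID will be converted using $CONF"
    ```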

    BAM! Again the same error as in my first post.

    It seems to be a bug in the 1.8.4 release, so DM is not working there.


    • jonh
      Junior Member
      • Aug 2010
      • 8

      #3
      Be sure to file a bug at http://support.zabbix.com if you haven't already!


      • ApolloDS
        Junior Member
        • Dec 2010
        • 11

        #4
        The right link would be https://support.zabbix.com; the http:// address doesn't respond.
        I opened a bug here: https://support.zabbix.com/browse/ZBX-3384
        Last edited by ApolloDS; 06-01-2011, 10:26.


        • Bill Wang
          Member
          • Jul 2010
          • 66

          #5
          Originally posted by ApolloDS
           [...]
           - Did on both Nodes a "zabbix_server -n <id> -c /etc/zabbix/zabbix_server.conf" and it was running without error.
           [...]
           How long did "zabbix_server -n <id> -c /etc/zabbix/zabbix_server.conf" take?
           This command has been running for a couple of hours on my server with no progress indication, and I don't know what to do now.


          • ApolloDS
            Junior Member
            • Dec 2010
            • 11

            #6
            Hi Bill,
             I switched to 2.0.0rc4 and am currently using DM with proxies instead of Zabbix servers, and it is working well.
             When 2.0 is released I'll think about trying DM with Zabbix servers again.

            Peter

