zabbix_server crashes on host addition

  • AdamLundrigan
    Junior Member
    • Jan 2008
    • 15

    #1

    zabbix_server crashes on host addition

    I start from a clean installation of Zabbix 1.4.4, which hums along smoothly without any issues while there are no hosts. I have also compiled the agent on a machine I wish to monitor. I made no configuration changes on either the client or server, except for the MySQL login details. The only thing I did through the front-end is add a single host, one with the agent running on it, and associate it with the Template_Solaris template.

    When I start the agent and add the host via the Zabbix front-end (and vice versa) the server immediately panics:

    Code:
     20146:20080130:103328 End update_triggers [19821]
     20149:20080130:103328 End update_functions()
     20146:20080130:103328 Query [commit;]
     20149:20080130:103328 In update_triggers [itemid:19824]
     20149:20080130:103328 Query [select distinct t.triggerid,t.expression,t.description,t.url,t.comments,t.status,t.value,t.priority from triggers t,functions f,items i where i.status<>3 and i.itemid=f.itemid and t.status=0 and f.triggerid=t.triggerid and f.itemid=19824]
     20147:20080130:103328 End update_functions()
     20147:20080130:103328 In update_triggers [itemid:19767]
     20147:20080130:103328 Query [select distinct t.triggerid,t.expression,t.description,t.url,t.comments,t.status,t.value,t.priority from triggers t,functions f,items i where i.status<>3 and i.itemid=f.itemid and t.status=0 and f.triggerid=t.triggerid and f.itemid=19767]
     20145:20080130:103328 In evaluate_expression({12088}=0)
     20147:20080130:103328 In evaluate_expression({12090}=0)
     20145:20080130:103328 In substitute_simple_macros()
     20145:20080130:103328 In substitute_simple_macros (data:{12088}=0)
     20147:20080130:103328 In substitute_simple_macros()
     20147:20080130:103328 In substitute_simple_macros (data:{12090}=0)
     20144:20080130:103328 One child process died. Exiting ...
     20146:20080130:103328 Got signal. Exiting ...
       ....
     20149:20080130:103328 Got signal. Exiting ...
     20144:20080130:103331 ZABBIX Server stopped
    The data for the new host is successfully fetched from the client and shows up in the web front-end (on the "latest data" page), so the fault isn't with the client connections themselves; it is something that happens later (?)
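One way to narrow down which child died is to look for a PID whose last log entry is not one of the shutdown messages. The following is only a sketch: the `PID:YYYYMMDD:HHMMSS message` layout and the list of shutdown messages are inferred from the excerpts in this thread, and the function names are hypothetical. It needs the complete log, since the excerpt above elides lines with `....`.

```python
import re

# Line layout inferred from the excerpts: "<pid>:<YYYYMMDD>:<HHMMSS> <message>"
LOG_LINE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6})\s+(.*)$")

# Messages seen in the excerpts when a process shuts down in an orderly way.
SHUTDOWN_PREFIXES = (
    "Got signal. Exiting",
    "One child process died",
    "ZABBIX Server stopped",
)


def last_message_per_pid(lines):
    """Return {pid: last message logged by that pid}, skipping unparsable lines."""
    last = {}
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            last[int(match.group(1))] = match.group(4)
    return last


def silent_exits(lines):
    """Return sorted PIDs whose final log entry is not a shutdown message."""
    return sorted(
        pid
        for pid, message in last_message_per_pid(lines).items()
        if not message.startswith(SHUTDOWN_PREFIXES)
    )
```

Run over the full server log, for example `silent_exits(open("/tmp/zabbix_server.log"))`, this lists the processes that stopped logging without announcing an exit. A PID in that list is only a candidate, not proof of a crash.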

    Running pstack on the main server process shows it receives signal 18 (SIGCLD), which makes sense since "One child process died":

    Code:
    20144:  zabbix_server
     feea079c unlink   (13ee08, 0, 0, 0, 0, 0) + 8
     00052e84 daemon_stop (0, f, ffbfee14, 7d8, 1, 1e) + c
     00036a68 zbx_on_exit (a, 21, 0, 0, 0, 0) + 188
     000529f4 parent_signal_handler (12, 0, ffbfefe8, 0, 0, 0) + 3c
     feba56c8 __sighndlr (12, 0, ffbfefe8, 529b8, 0, 0) + c
     feb9f320 call_user_handler (12, 0, ffbfefe8, 0, 0, 0) + 234
     feb9f4d0 sigacthandler (12, 0, ffbfefe8, 0, 0, 0) + 64
     --- called from signal handler with signal 18 (SIGCLD) ---
     fee9ccd0 _libc_nanosleep (3c, 0, 0, 0, 0, 0) + 8
     00049d80 main_watchdog_loop (a, 1f, 0, fffffff8, 0, 145ea5) + 28
     0003624c MAIN_ZABBIX_ENTRY (a6710, ffbffc74, 0, 0, 8d, 21) + 4ec
     00052e38 daemon_start (0, ffbffd74, c8ff0, c8ff8, 0, 2) + 3d8
     00035d28 main     (1, ffbffd74, ffbffd7c, c8c00, 0, 0) + 248
     00024e40 _start   (0, 0, 0, 0, 0, 0) + 108
    I have a simple system of shell scripts which runs pstack on each zabbix_server process and prints the result to a text file. pstack is not catching any of the main process's children dying, which is perplexing: it should find something, since they are exiting on a signal. It is a rather coarse "monitoring system", with pstack output recorded, on average, 3 times per second, so that probably explains it. I wish I had DTrace on this machine.
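The polling setup described above might look roughly like the following sketch. It is not the script used in this thread: the function names, the output directory and the 0.3-second interval are assumptions, and `pstack` and `pgrep -x` are assumed to be on the PATH. Because it only samples a few times per second, it shares the limitation noted above: a child that crashes between two samples leaves no stack behind.

```python
import subprocess
import time
from pathlib import Path


def find_pids(name="zabbix_server"):
    """Return the PIDs of processes whose name matches exactly (pgrep -x)."""
    out = subprocess.run(
        ["pgrep", "-x", name], stdout=subprocess.PIPE, text=True, check=False
    )
    return [int(token) for token in out.stdout.split()]


def snapshot_stacks(pids, out_dir, tool=("pstack",)):
    """Run the stack tool once per PID, appending output to <out_dir>/<pid>.txt."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pid in pids:
        result = subprocess.run(
            [*tool, str(pid)],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            check=False,  # a process that has already exited is expected here
        )
        path = out_dir / f"{pid}.txt"
        with path.open("a") as handle:
            stamp = time.strftime("%Y%m%d:%H%M%S")
            handle.write(f"--- {stamp} rc={result.returncode}\n")
            handle.write(result.stdout)
        written.append(path)
    return written


def poll(out_dir, interval=0.3, rounds=None, name="zabbix_server"):
    """Snapshot all matching processes repeatedly; rounds=None runs until interrupted."""
    done = 0
    while rounds is None or done < rounds:
        snapshot_stacks(find_pids(name), out_dir)
        time.sleep(interval)
        done += 1
```

Calling `poll("/tmp/zabbix_pstack")` (a placeholder path) would keep sampling until interrupted. Attaching a tracer to each process catches the crash itself rather than relying on sampling.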

    Any ideas what might be going on here? The log for the server run can be found here

    Thanks in advance!
    --
    Adam Lundrigan
    Computer Systems Programmer
    Biological & Physical Oceanography Section
    Science, Oceans & Environment Branch
    Department of Fisheries and Oceans Canada
    Northwest Atlantic Fisheries Centre
    St. John's, Newfoundland & Labrador
    CANADA A1C 5X1

    Tel: (709) 772-8136
    Fax: (709) 772-8138
    Cell: (709) 277-4575
    Office: G10-117J
    Email: [email protected]
    Last edited by AdamLundrigan; 30-01-2008, 16:20. Reason: Updated with latest information about crashes
  • AdamLundrigan
    Junior Member
    • Jan 2008
    • 15

    #2
    truss'd

    OK. I revised my strategy slightly, and turned to truss for better stack tracing.

    I disabled the lone host in my test system, restarted the server, and waited to make sure it would keep running (it did). Then I attached an instance of truss to each zabbix_server process and dumped the output to files. When I enabled the lone monitored host again, the server crapped out, as per my expectations.

    Here is the truncated zabbix_server.log:

    Code:
     28701:20080130:110554 End process_httptests()
     28701:20080130:110554 Spent 0 seconds while processing HTTP tests
     28701:20080130:110554 Query [select count(*),min(nextcheck) from httptest t where t.status=0 and mod(t.httptestid,5)=3 and  t.httptestid>=100000000000000*0 and t.httptestid<=(100000000000000*0+99999999999999) ]
     28701:20080130:110554 No httptests to process in get_minnextcheck.
     28701:20080130:110554 Nextcheck:-1 Time:1201703754
     28701:20080130:110554 Sleeping for 5 seconds
     28686:20080130:110554 End update_functions()
     28682:20080130:110554 In evaluate_expression({12088}=0)
     28686:20080130:110554 In update_triggers [itemid:19824]
     28682:20080130:110554 In substitute_simple_macros()
     28686:20080130:110554 Query [select distinct t.triggerid,t.expression,t.description,t.url,t.comments,t.status,t.value,t.priority from triggers t,functions f,items i where i.status<>3 and i.itemid=f.itemid and t.status=0 and f.triggerid=t.triggerid and f.itemid=19824]
     28682:20080130:110554 In substitute_simple_macros (data:{12088}=0)
     28681:20080130:110554 One child process died. Exiting ...
     28683:20080130:110554 Got signal. Exiting ...
        ....
     28694:20080130:110554 Got signal. Exiting ...
     28681:20080130:110557 ZABBIX Server stopped
    A snippet of the interesting bit of truss output from the parent process:

    Code:
        Received signal #18, SIGCLD, in nanosleep() [caught]
          siginfo: SIGCLD CLD_KILLED pid=28682 status=0x000B
    nanosleep(0xFFBFF318, 0xFFBFF310)               Err#4 EINTR
    sigprocmask(SIG_SETMASK, 0xFFBFEF14, 0x00000000) = 0
    open("/tmp/zabbix_server.log", O_RDWR|O_APPEND|O_CREAT, 0666) = 7
    time()                                          = 1201703754
    getpid()                                        = 28681 [1]
    It says pid# 28682 triggered the signal by being killed. Here is the snippet of truss output from pid# 28682:

    Code:
    close(7)                                        = 0
    stat("/tmp/zabbix_server.log", 0xFFBED4A0)      = 0
    open("/tmp/zabbix_server.log", O_RDWR|O_APPEND|O_CREAT, 0666) = 7
    time()                                          = 1201703754
    getpid()                                        = 28682 [28681]
    fstat64(7, 0xFFBEBEA0)                          = 0
    fstat64(7, 0xFFBEBD48)                          = 0
    ioctl(7, TCGETA, 0xFFBEBE2C)                    Err#25 ENOTTY
        Incurred fault #6, FLTBOUNDS  %pc = 0xFEE34694
          siginfo: SIGSEGV SEGV_MAPERR addr=0x00000000
        Received signal #11, SIGSEGV [default]
          siginfo: SIGSEGV SEGV_MAPERR addr=0x00000000
    According to this article on Sun.com, SEGV_MAPERR generally denotes a stack overflow.

    Oh dear...
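The `status=0x000B` reported alongside `CLD_KILLED` in the parent's truss output above is the number of the signal that killed the child: 0x0B is 11, which matches the `Received signal #11, SIGSEGV` line in the child's trace. The sketch below is only an illustration in Python, not Zabbix code. It decodes such a number and shows how a parent can observe the same thing for a child of its own. The helper names and the deliberately crashing child command are hypothetical.

```python
import signal
import subprocess
import sys


def signal_name(number):
    """Translate a signal number into its symbolic name on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return f"unknown signal {number}"


def run_and_report(argv):
    """Run a child; return the name of the signal that killed it, else None."""
    proc = subprocess.run(argv, check=False)
    # subprocess reports death by signal N as a return code of -N.
    if proc.returncode < 0:
        return signal_name(-proc.returncode)
    return None


# A child that sends SIGSEGV to itself, standing in for a crashing process.
CRASHING_CHILD = [
    sys.executable,
    "-c",
    "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)",
]
```

Here `signal_name(0x000B)` names the signal behind the status value, and `run_and_report(CRASHING_CHILD)` plays the role of the parent that is notified when a child is killed.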


    • Alexei
      Founder, CEO
      Zabbix Certified Trainer
      Zabbix Certified Specialist
      Zabbix Certified Professional
      • Sep 2004
      • 5654

      #3
      It looks very much like a problem we fixed recently in pre-1.4.5 code. It was related to incorrect processing of zero-length strings for trigger functions str(), regexp() and iregexp().

      May I ask you to try the latest code? Get it by executing:

      svn checkout svn://svn.zabbix.com/branches/1.4 1.4

      Thank you.
      Alexei Vladishev
      Creator of Zabbix, Product manager
      New York | Tokyo | Riga


      • AdamLundrigan
        Junior Member
        • Jan 2008
        • 15

        #4
        No joy with the svn version. Same truss output as before:

        Code:
        open("/tmp/zabbix_server.log", O_RDWR|O_APPEND|O_CREAT, 0666) = 6
        time()                                          = 1201803212
        getpid()                                        = 11466 [11465]
        fstat64(6, 0xFFBEBE90)                          = 0
        fstat64(6, 0xFFBEBD38)                          = 0
        ioctl(6, TCGETA, 0xFFBEBE1C)                    Err#25 ENOTTY
            Incurred fault #6, FLTBOUNDS  %pc = 0xFEE34694
              siginfo: SIGSEGV SEGV_MAPERR addr=0x00000000
            Received signal #11, SIGSEGV [default]
              siginfo: SIGSEGV SEGV_MAPERR addr=0x00000000
        I wiped out the Zabbix 1.4.4 installation on both the client and server, built the version checked out from SVN, and used that to install the client and server.

        I have a spare machine kicking around of modest capabilities (AMD64 3500+) and an Ubuntu Server 7.10 CD. I will combine the two, install Zabbix, and report back how that goes.

        Here are the logs (zabbix_server and truss): ftp://ocgftp.nfl.dfo-mpo.gc.ca/outgo...abbix_logs.tgz
        Last edited by AdamLundrigan; 31-01-2008, 20:31. Reason: Added link to logs


        • twydyn
          Junior Member
          • Feb 2008
          • 4

          #5
          I am having the exact same problem mentioned above. I have tried the latest version as well with no luck. What are our options at this point?


          • marcelein
            Junior Member
            • Apr 2008
            • 21

            #6
            same errors

            Code:
             17024:20080614:000033 Requested [vfs.fs.size[/var,free]]
             17024:20080614:000033 Sending back [130952196]
             17022:20080614:000033 Processing request.
             17022:20080614:000033 In check_security()
             17022:20080614:000033 Requested [net.if.out[eth1,bytes]]
             17022:20080614:000033 Sending back [16636352054]
             17023:20080614:000033 Processing request.
             17023:20080614:000033 In check_security()
             17023:20080614:000033 Requested [system.cpu.load[,avg5]]
             17023:20080614:000033 Sending back [1.090000]
             17021:20080614:000037 Got signal. Exiting ...
             17023:20080614:000037 Got signal. Exiting ...
             17024:20080614:000037 Got signal. Exiting ...
             17025:20080614:000037 Got signal. Exiting ...
             17022:20080614:000037 Got signal. Exiting ...
             17020:20080614:000037 Got signal. Exiting ...
             17020:20080614:000037 zbx_on_exit() called.
             17020:20080614:000039 ZABBIX Agent stopped
            That's what I get every day at the same time, 00:00:39. It's a vserver.
            The other servers running zabbix_agentd don't get this error and are running stably.

            I'm really looking forward to a solution or a new stable version of Zabbix.
