Zabbix v1.4.4 died suddenly

  • nitestar
    Junior Member
    • Feb 2008
    • 3

    #1

    Zabbix v1.4.4 died suddenly

    I had been running Zabbix v1.4.4 for several months and it had always been stable. A few days ago, however, it died suddenly and has been very unstable since. Below are the last lines of the server log at debug level 4, captured during startup.

    14337:20080225:093514 In delete_history(history,17152,31268674383,699408)
    14337:20080225:093514 Query [select min(clock) from history where itemid=17152]
    14337:20080225:093514 In delete_history(history_uint,17152,31268674383,699408)
    14337:20080225:093514 Query [select min(clock) from history_uint where itemid=17152]
    14337:20080225:093514 In delete_history(history_str,17152,31268674383,699408)
    14337:20080225:093514 Query [select min(clock) from history_str where itemid=17152]
    14307:20080225:093514 One child process died. Exiting ...
    14324:20080225:093514 Got signal. Exiting ...
    14325:20080225:093514 Got signal. Exiting ...
    14327:20080225:093514 Got signal. Exiting ...
    14328:20080225:093514 Got signal. Exiting ...
    14329:20080225:093514 Got signal. Exiting ...
    14330:20080225:093514 Got signal. Exiting ...
    14331:20080225:093514 Got signal. Exiting ...
    14332:20080225:093514 Got signal. Exiting ...
    14333:20080225:093514 Got signal. Exiting ...
    14335:20080225:093514 Got signal. Exiting ...
    14339:20080225:093514 Got signal. Exiting ...
    14340:20080225:093514 Got signal. Exiting ...
    14341:20080225:093514 Got signal. Exiting ...
    14342:20080225:093514 Got signal. Exiting ...
    14343:20080225:093514 Got signal. Exiting ...
    14344:20080225:093514 Got signal. Exiting ...
    14347:20080225:093514 Got signal. Exiting ...
    14337:20080225:093514 Got signal. Exiting ...
    14307:20080225:093516 ZABBIX Server stopped

    From this log I can't work out which child process died and brought Zabbix down, so I don't know where to start troubleshooting.

    I hope I can fix this problem, because I think Zabbix is a good monitoring tool. If needed, I can provide the full server log as well as the audit log for investigation.
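One way to narrow this down from the log alone is to group the lines by PID. The sketch below is only an illustration, assuming the `pid:date:time message` layout seen above and assuming that a child which crashes writes no exit line of its own. The function name `summarize` and the short `SAMPLE` excerpt are made up for the example.

```python
import re

# Matches lines such as "14337:20080225:093514 Got signal. Exiting ..."
LOG_LINE = re.compile(r"^\s*(\d+):(\d{8}):(\d{6})\s+(.*)$")

def summarize(log_text):
    """Return (parent_pid, pids_that_exited_on_signal, last_other_message_per_pid)."""
    parent = None
    clean_exit = set()
    last_message = {}
    for raw in log_text.splitlines():
        m = LOG_LINE.match(raw)
        if m is None:
            continue  # skip lines that are not in the expected layout
        pid, msg = int(m.group(1)), m.group(4)
        if msg.startswith("One child process died"):
            parent = pid              # the parent is the process that reports the death
        elif msg.startswith("Got signal"):
            clean_exit.add(pid)       # this child lived long enough to log its exit
        else:
            last_message[pid] = msg   # remember what each PID was doing last
    return parent, clean_exit, last_message

# A few lines copied from the log above.
SAMPLE = """\
14337:20080225:093514 Query [select min(clock) from history_str where itemid=17152]
14307:20080225:093514 One child process died. Exiting ...
14324:20080225:093514 Got signal. Exiting ...
14337:20080225:093514 Got signal. Exiting ...
14307:20080225:093516 ZABBIX Server stopped
"""

parent, clean_exit, last_message = summarize(SAMPLE)
print("parent:", parent)
print("exited on signal:", sorted(clean_exit))
print("last activity of 14337:", last_message.get(14337))
```

Under those assumptions, the child that crashed would be a PID that is missing from the "Got signal" set. Process 14337 does log "Got signal", so it is probably not the one that died, even though its delete_history queries are the last activity before the parent reports the death.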
  • nitestar
    Junior Member
    • Feb 2008
    • 3

    #2
    What is the fourth child process?

    I ran truss against the server and found that it exits because the fourth child process dies. What is that process? Thanks for your input.

    lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
    schedctl() = 0xFEFBA000
    fork1() = 21584
    lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
    schedctl() = 0xFEFBA000
    fork1() = 21585
    lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
    schedctl() = 0xFEFBA000
    fork1() = 21586
    lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
    sigaction(SIGCLD, 0xFFBFF328, 0x00000000) = 0
    Received signal #18, SIGCLD [caught]
    siginfo: SIGCLD CLD_KILLED pid=21565 status=0x000B
    schedctl() = 0xFEFBA000
    setcontext(0xFFBFF010)
    getcontext(0xFFBFF0A0)
    lwp_sigmask(SIG_SETMASK, 0x00020000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
    kill(21562, SIGTERM) = 0
    kill(21563, SIGTERM) = 0
    kill(21564, SIGTERM) = 0
    kill(21565, SIGTERM) = 0
    kill(21566, SIGTERM) = 0
    kill(21567, SIGTERM) = 0
    kill(21568, SIGTERM) = 0
    kill(21569, SIGTERM) = 0
    kill(21570, SIGTERM) = 0
    kill(21571, SIGTERM) = 0
    kill(21572, SIGTERM) = 0
    kill(21576, SIGTERM) = 0
    kill(21577, SIGTERM) = 0
    kill(21578, SIGTERM) = 0
    kill(21579, SIGTERM) = 0
    kill(21580, SIGTERM) = 0
    kill(21581, SIGTERM) = 0
    kill(21582, SIGTERM) = 0
    kill(21583, SIGTERM) = 0
    kill(21584, SIGTERM) = 0
    kill(21585, SIGTERM) = 0
    kill(21586, SIGTERM) = 0
    getpid() = 21558 [1]
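The reading of this trace can be checked mechanically. The sketch below is a rough illustration, assuming that the parent sends SIGTERM to its children in the order it forked them and that the `status` shown for CLD_KILLED is the number of the signal that killed the child. The helper names `killed_child` and `child_position` are invented for the example, and `TRUSS` holds only a few lines from the output above.

```python
import re
import signal

TRUSS = """\
siginfo: SIGCLD CLD_KILLED pid=21565 status=0x000B
kill(21562, SIGTERM) = 0
kill(21563, SIGTERM) = 0
kill(21564, SIGTERM) = 0
kill(21565, SIGTERM) = 0
kill(21566, SIGTERM) = 0
"""

def killed_child(truss_text):
    """Return (pid, signal_number) from the CLD_KILLED siginfo line, or None."""
    m = re.search(r"CLD_KILLED pid=(\d+) status=(0x[0-9A-Fa-f]+)", truss_text)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2), 16)

def child_position(truss_text, pid):
    """Return the 1-based position of pid among the kill(..., SIGTERM) targets."""
    targets = [int(p) for p in re.findall(r"kill\((\d+), SIGTERM\)", truss_text)]
    return targets.index(pid) + 1 if pid in targets else None

pid, signo = killed_child(TRUSS)
position = child_position(TRUSS, pid)
print(pid, signo, signal.Signals(signo).name, position)
```

With these lines, the killed PID is 21565, the status 0x000B is signal 11 (SIGSEGV), and 21565 is the fourth PID in the SIGTERM list, which is where "the fourth child" comes from.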


    • nitestar
      Junior Member
      • Feb 2008
      • 3

      #3
      This is the truss output from the fourth child process at the moment it died.

      pollsys(0xFFBFA688, 1, 0xFFBFA620, 0x00000000) = 0
      write(6, "A5\0\0\003 s e l e c t ".., 169) = 169
      read(6, "01\0\00101 *\0\00203 d e".., 16384) = 78
      pollsys(0xFFBFD6E8, 1, 0xFFBFD680, 0x00000000) = 0
      write(6, " X\0\0\003 s e l e c t ".., 92) = 92
      read(6, "01\0\00105 <\0\00203 d e".., 16384) = 363
      time() = 1204167623
      open("/usr/share/lib/zoneinfo/Hongkong", O_RDONLY) = 7
      fstat64(7, 0xFFBFCF38) = 0
      read(7, " T Z i f\0\0\0\0\0\0\0\0".., 426) = 426
      close(7) = 0
      time() = 1204167623
      pollsys(0xFFBFD5C0, 1, 0xFFBFD558, 0x00000000) = 0
      write(6, " W\0\0\003 s e l e c t ".., 91) = 91
      read(6, "01\0\00101 .\0\00203 d e".., 16384) = 73
      Incurred fault #6, FLTBOUNDS %pc = 0xFEE2EA58
      siginfo: SIGSEGV SEGV_MAPERR addr=0x00000017
      Received signal #11, SIGSEGV [default]
      siginfo: SIGSEGV SEGV_MAPERR addr=0x00000017


      So, is it a bug? My OS is Solaris 10 on SPARC.
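For anyone reading the fault lines above, the sketch below pulls out the program counter and the faulting address and checks whether the address is very close to zero. It is only an illustration: `parse_fault` and `LOW_ADDRESS_LIMIT` are names and values made up for the example, and the threshold is an arbitrary assumption rather than anything taken from Solaris.

```python
import re

TRUSS_TAIL = """\
Incurred fault #6, FLTBOUNDS %pc = 0xFEE2EA58
siginfo: SIGSEGV SEGV_MAPERR addr=0x00000017
Received signal #11, SIGSEGV [default]
"""

# Hypothetical cut-off: addresses below this are treated as "near NULL".
LOW_ADDRESS_LIMIT = 0x1000

def parse_fault(text):
    """Return (program_counter, fault_address) as integers, or None if absent."""
    pc = re.search(r"%pc = (0x[0-9A-Fa-f]+)", text)
    addr = re.search(r"SEGV_\w+ addr=(0x[0-9A-Fa-f]+)", text)
    if pc is None or addr is None:
        return None
    return int(pc.group(1), 16), int(addr.group(1), 16)

pc, addr = parse_fault(TRUSS_TAIL)
near_null = addr < LOW_ADDRESS_LIMIT
print(hex(pc), addr, near_null)
```

A fault address as small as 0x17 (decimal 23) is often, though not always, a sign that the code read a field through a NULL pointer. The trace alone does not show which function did this; a core file or debugger backtrace from the crashing child would be needed to confirm where the %pc value points.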
