Agent's kernel panic causes whole zabbix monitoring to become unreliable

  • just2blue4u
    Senior Member
    • Apr 2006
    • 347

    #1


    I had big problems with Zabbix (server v1.1, client v1.1beta9) this morning:
    One monitored server had a kernel panic (not syncing: mm/slab.c:1982: spin_lock(mm/slab.c:f7c044c4) already locked by mm/slab.c/2854).

    That caused the machine to freeze, but it was still answering ping (and ONLY ping)!
    The Zabbix server didn't realize that this machine's agent was unreachable. As a result, the Zabbix logs looked like this:
    027424:20061204:080001 Timeout while receiving data from [zds-file-2-pbs.bfk]
    027424:20061204:080001 Getting value of [io[disk_io]] from host [zds-file-2-pbs.bfk] failed
    027424:20061204:080001 The value is not stored in database.
    027424:20061204:080006 Timeout while receiving data from [zds-file-2-pbs.bfk]
    027424:20061204:080006 Getting value of [net.tcp.service.perf[smtp,,25 ]] from host [zds-file-2-pbs.bfk] failed
    027424:20061204:080006 The value is not stored in database.
    027424:20061204:080011 Timeout while receiving data from [zds-file-2-pbs.bfk]
    027424:20061204:080011 Getting value of [system[procrunning]] from host [zds-file-2-pbs.bfk] failed
    027424:20061204:080011 The value is not stored in database.
    ...
    027424:20061204:080031 Timeout while receiving data from [zds-file-2-pbs.bfk]
    027424:20061204:080031 Getting value of [proc.num[]] from host [zds-file-2-pbs.bfk] failed
    027424:20061204:080031 The value is not stored in database.
    027424:20061204:080036 Timeout while receiving data from [zds-file-2-pbs.bfk]
    027424:20061204:080036 Getting value of [vm.memory.size[buffers]] from host [zds-file-2-pbs.bfk] failed
    027424:20061204:080036 The value is not stored in database.
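
The timestamps in the log excerpt are spaced exactly 5 seconds apart, which would be consistent with the server waiting out a timeout on each item of the frozen host in turn (an interpretation, not something the log itself states). A minimal Python sketch, assuming the line layout `pid:YYYYMMDD:HHMMSS message` seen above; `timeout_gaps` is a hypothetical helper, not part of Zabbix:

```python
from datetime import datetime

# Three "Timeout" lines copied from the log excerpt above.
LOG = """\
027424:20061204:080001 Timeout while receiving data from [zds-file-2-pbs.bfk]
027424:20061204:080006 Timeout while receiving data from [zds-file-2-pbs.bfk]
027424:20061204:080011 Timeout while receiving data from [zds-file-2-pbs.bfk]
"""

def timeout_gaps(log: str) -> list[int]:
    """Seconds between consecutive 'Timeout' lines (hypothetical helper)."""
    stamps = []
    for line in log.splitlines():
        if "Timeout while receiving data" not in line:
            continue
        # Assumed layout: pid:date:time message
        _pid, date, rest = line.split(":", 2)
        clock = rest.split()[0]
        stamps.append(datetime.strptime(date + clock, "%Y%m%d%H%M%S"))
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]

print(timeout_gaps(LOG))
```

If every item on the unreachable host costs the server a few seconds like this, checks for the other hosts would plausibly be pushed back as well.
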
    Zabbix's graphs didn't show data from any host when the time period was below 12h; from a 24h period upward, data was shown. Here is some captured data in plain-text format, taken from a host that was otherwise reachable normally while the problem lasted:
    2006-12-04 11:09:32 1165226972 587090
    2006-12-04 11:05:54 1165226754 431928
    2006-12-04 11:02:12 1165226532 782793
    2006-12-04 10:58:29 1165226309 804615
    2006-12-04 10:54:52 1165226092 687077
    2006-12-04 10:51:10 1165225870 803026
    2006-12-04 10:47:31 1165225651 482231
    2006-12-04 10:43:53 1165225433 673026
    2006-12-04 10:40:11 1165225211 553468
    2006-12-04 10:36:31 1165224991 195923
    ...
    2006-12-04 09:31:14 1165221074 407798
    2006-12-04 09:20:33 1165220433 305847
    2006-12-04 09:09:40 1165219780 321637
    2006-12-04 08:58:54 1165219134 372145
    2006-12-04 08:48:02 1165218482 89440
    2006-12-04 08:37:14 1165217834 149001
    2006-12-04 08:26:24 1165217184 113331
    2006-12-04 08:15:34 1165216534 123223
    2006-12-04 08:04:41 1165215881 170538
    Normally, the item should have been updated every 60s.
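
To put numbers on the delay, the gaps between consecutive samples can be computed from the epoch column of the export. This is a minimal Python sketch, assuming each row is `date time epoch value` with the newest row first, as in the data above; `sample_gaps` and the two constants are hypothetical names chosen for illustration:

```python
EXPECTED_INTERVAL = 60  # seconds, the update interval stated in the post

# Rows copied from the export above (newest first).
LATE_MORNING = """\
2006-12-04 11:09:32 1165226972 587090
2006-12-04 11:05:54 1165226754 431928
2006-12-04 11:02:12 1165226532 782793
"""
EARLY_MORNING = """\
2006-12-04 08:26:24 1165217184 113331
2006-12-04 08:15:34 1165216534 123223
2006-12-04 08:04:41 1165215881 170538
"""

def sample_gaps(export: str) -> list[int]:
    """Seconds between consecutive samples; the third column is the epoch."""
    epochs = [int(line.split()[2]) for line in export.strip().splitlines()]
    return [newer - older for newer, older in zip(epochs, epochs[1:])]

for label, rows in (("early", EARLY_MORNING), ("late", LATE_MORNING)):
    gaps = sample_gaps(rows)
    print(label, gaps, [round(g / EXPECTED_INTERVAL, 1) for g in gaps])
```

For these rows the gaps come out at roughly 650 s early in the morning and roughly 220 s later on, so the item was being updated about 11 times and about 3.7 times less often than configured.
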

    After I manually disabled monitoring of that host, everything was fine again and all hosts were monitored correctly.

    Of course Zabbix didn't send me an alert email, because it thought the client was alive. That's very bad.
    I also don't like that one host's kernel panic makes the whole Zabbix monitoring unreliable.
    If this hasn't already been improved in the latest versions, perhaps it could be done in the future?


    Thanks,
    J2B4U
    Last edited by just2blue4u; 04-12-2006, 13:41.
    Big ZABBIX is watching you!
    (... and my 48 hosts, 4513 items, 1280 triggers via zabbix v1.6 on CentOS 5.0)
  • Calimero
    Senior Member
    • Nov 2006
    • 481

    #2
    Did you create the required triggers?
    If ICMP ping is your only trigger, you won't get an alert while ping is still working (I'm stating the obvious).

    Do you have a trigger on the special item 'status' to see whether you're still getting data from your agents?
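
The idea behind such a trigger can be sketched independently of Zabbix: alert when an item has not delivered a value for several expected intervals, regardless of whether the host still answers ping. The Python below is only an illustration of that logic; `agent_is_stale` and its `tolerance` parameter are hypothetical and do not correspond to any Zabbix trigger syntax:

```python
import time

def agent_is_stale(last_epoch, expected_interval=60, tolerance=3, now=None):
    """True if no value has arrived for more than `tolerance` intervals.

    last_epoch: Unix time of the most recent stored value.
    now: current Unix time; defaults to the system clock.
    """
    if now is None:
        now = time.time()
    return (now - last_epoch) > expected_interval * tolerance

# A 650 s silence on a 60 s item counts as stale; a 30 s silence does not.
print(agent_is_stale(1165215881, now=1165215881 + 650))
print(agent_is_stale(1165215881, now=1165215881 + 30))
```

A check of this kind would have fired for the frozen host even though ping kept succeeding.
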
