Agent's kernel panic causes whole zabbix monitoring to become unreliable

just2blue4u

Senior Member

Joined: Apr 2006

Posts: 347
#1

04-12-2006, 13:14

I had big problems with zabbix (Server V1.1, client V1.1beta9) this morning:
One monitored server had a kernel panic (not syncing: mm/slab.c:1982: spin_lock(mm/slab.c:f7c044c4) already locked by mm/slab.c/2854).

That caused the machine to freeze, but it was still answering ping (and ONLY ping)!
The Zabbix server didn't realize that this machine's agent was unreachable. As a result, the zabbix log looked like this:

027424:20061204:080001 Timeout while receiving data from [zds-file-2-pbs.bfk]
027424:20061204:080001 Getting value of [io[disk_io]] from host [zds-file-2-pbs.bfk] failed
027424:20061204:080001 The value is not stored in database.
027424:20061204:080006 Timeout while receiving data from [zds-file-2-pbs.bfk]
027424:20061204:080006 Getting value of [net.tcp.service.perf[smtp,,25 ]] from host [zds-file-2-pbs.bfk] failed
027424:20061204:080006 The value is not stored in database.
027424:20061204:080011 Timeout while receiving data from [zds-file-2-pbs.bfk]
027424:20061204:080011 Getting value of [system[procrunning]] from host [zds-file-2-pbs.bfk] failed
027424:20061204:080011 The value is not stored in database.
...
027424:20061204:080031 Timeout while receiving data from [zds-file-2-pbs.bfk]
027424:20061204:080031 Getting value of [proc.num[]] from host [zds-file-2-pbs.bfk] failed
027424:20061204:080031 The value is not stored in database.
027424:20061204:080036 Timeout while receiving data from [zds-file-2-pbs.bfk]
027424:20061204:080036 Getting value of [vm.memory.size[buffers]] from host [zds-file-2-pbs.bfk] failed
027424:20061204:080036 The value is not stored in database.
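To make the pattern in this log easier to see, here is a minimal Python sketch that pulls the timestamps out of the "Timeout while receiving data" lines and prints the spacing between them. The field layout (a leading process ID, then date, then time) is an assumption read off the lines above, and the names `LINE_RE`, `timeout_events` and `gaps_seconds` are made up for illustration; none of this is part of zabbix.

```python
import re
from datetime import datetime

# Assumed layout: <pid>:<YYYYMMDD>:<HHMMSS> Timeout while receiving data from [<host>]
LINE_RE = re.compile(
    r"^(\d+):(\d{8}):(\d{6}) Timeout while receiving data from \[(.+)\]$"
)

def timeout_events(lines):
    """Return (host, timestamp) pairs for every timeout line."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(2) + match.group(3), "%Y%m%d%H%M%S")
            events.append((match.group(4), stamp))
    return events

def gaps_seconds(events):
    """Seconds between consecutive timeout events, in log order."""
    times = [stamp for _, stamp in events]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]

sample = [
    "027424:20061204:080001 Timeout while receiving data from [zds-file-2-pbs.bfk]",
    "027424:20061204:080001 Getting value of [io[disk_io]] from host [zds-file-2-pbs.bfk] failed",
    "027424:20061204:080001 The value is not stored in database.",
    "027424:20061204:080006 Timeout while receiving data from [zds-file-2-pbs.bfk]",
    "027424:20061204:080011 Timeout while receiving data from [zds-file-2-pbs.bfk]",
]

print(gaps_seconds(timeout_events(sample)))  # [5, 5]
```

The timeouts in this excerpt are 5 seconds apart and all come from the same leading number. One possible reading is that a single server process waits out a timeout for each item of the frozen host in turn, but the log itself does not confirm that.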

Zabbix's graphs didn't show data for any host when the time period was below 12h; from a 24h period upward, data was shown. Here is some captured data in plain text format, taken from a host that was reachable as normal while the problem was occurring:

2006-12-04 11:09:32 1165226972 587090
2006-12-04 11:05:54 1165226754 431928
2006-12-04 11:02:12 1165226532 782793
2006-12-04 10:58:29 1165226309 804615
2006-12-04 10:54:52 1165226092 687077
2006-12-04 10:51:10 1165225870 803026
2006-12-04 10:47:31 1165225651 482231
2006-12-04 10:43:53 1165225433 673026
2006-12-04 10:40:11 1165225211 553468
2006-12-04 10:36:31 1165224991 195923
...
2006-12-04 09:31:14 1165221074 407798
2006-12-04 09:20:33 1165220433 305847
2006-12-04 09:09:40 1165219780 321637
2006-12-04 08:58:54 1165219134 372145
2006-12-04 08:48:02 1165218482 89440
2006-12-04 08:37:14 1165217834 149001
2006-12-04 08:26:24 1165217184 113331
2006-12-04 08:15:34 1165216534 123223
2006-12-04 08:04:41 1165215881 170538

Normally, the item should have been updated every 60s.
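To put numbers on how far the updates drifted from 60s, here is a small Python sketch that computes the gap between consecutive rows from the epoch column. It assumes each row is "date time epoch value", newest first, as in the listing above; the function name `update_gaps` is made up for illustration.

```python
def update_gaps(rows):
    """Seconds between consecutive updates; rows are newest first."""
    # Third whitespace-separated field is the Unix timestamp of the value.
    clocks = [int(row.split()[2]) for row in rows]
    return [newer - older for newer, older in zip(clocks, clocks[1:])]

late_morning = [
    "2006-12-04 11:09:32 1165226972 587090",
    "2006-12-04 11:05:54 1165226754 431928",
    "2006-12-04 11:02:12 1165226532 782793",
]
early_morning = [
    "2006-12-04 08:26:24 1165217184 113331",
    "2006-12-04 08:15:34 1165216534 123223",
    "2006-12-04 08:04:41 1165215881 170538",
]

print(update_gaps(late_morning))   # [218, 222]
print(update_gaps(early_morning))  # [650, 653]
```

So this item was being updated roughly every 11 minutes early in the morning and roughly every 3.5 minutes later on, instead of every 60s.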

After I manually disabled monitoring of that host, everything was fine and all hosts were monitored correctly again.

Of course zabbix didn't send me an alert email, because it thought the client was alive. That's very bad.
I also don't like that one host's kernel panic makes the whole zabbix monitoring unreliable.
If this hasn't already been improved in the latest versions, perhaps it could be done in the future?

Thanks,
J2B4U

Last edited by just2blue4u; 04-12-2006, 13:41.

Big ZABBIX is watching you!
(... and my 48 hosts, 4513 items, 1280 triggers via zabbix v1.6 on CentOS 5.0)
Calimero

Senior Member

Joined: Nov 2006

Posts: 481
#2

04-12-2006, 14:09

Did you create the required triggers?
If ICMP ping is your only trigger, you won't get an alert as long as ping is still working (I'm stating the obvious).

Do you have a trigger on the special item 'status' to see whether you're still getting data from your agents?
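The idea behind such a check can be sketched in a few lines of Python: alert when the newest value of an item is older than a few update intervals, regardless of whether the host answers ping. This is only an illustration of the logic, not zabbix trigger syntax; the function name `agent_is_stale` and its parameters are hypothetical.

```python
def agent_is_stale(last_clock, now, update_interval=60, missed_updates=3):
    """True when no value has arrived for more than `missed_updates` intervals.

    last_clock and now are Unix timestamps in seconds.
    """
    return (now - last_clock) > update_interval * missed_updates

# Epoch taken from the data in post #1: a value arrived at 1165226754.
last_value = 1165226754
print(agent_is_stale(last_value, last_value + 46))   # False: within 3 x 60s
print(agent_is_stale(last_value, last_value + 200))  # True: more than 180s old
```

With a 60s item and a threshold of three missed updates, the gaps of 200s and more seen in post #1 would have tripped this check even though ping kept working.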