I had big problems with Zabbix (server V1.1, client V1.1beta9) this morning:
One monitored server had a kernel panic (not syncing: mm/slab.c:1982: spin_lock(mm/slab.c:f7c044c4) already locked by mm/slab.c/2854).
That caused the machine to freeze, but it was still answering ping (and ONLY ping)!
The Zabbix server didn't realize that this machine's agent was unreachable. As a result, the Zabbix logs looked like this:
027424:20061204:080001 Timeout while receiving data from [zds-file-2-pbs.bfk]
027424:20061204:080001 Getting value of [io[disk_io]] from host [zds-file-2-pbs.bfk] failed
027424:20061204:080001 The value is not stored in database.
027424:20061204:080006 Timeout while receiving data from [zds-file-2-pbs.bfk]
027424:20061204:080006 Getting value of [net.tcp.service.perf[smtp,,25 ]] from host [zds-file-2-pbs.bfk] failed
027424:20061204:080006 The value is not stored in database.
027424:20061204:080011 Timeout while receiving data from [zds-file-2-pbs.bfk]
027424:20061204:080011 Getting value of [system[procrunning]] from host [zds-file-2-pbs.bfk] failed
027424:20061204:080011 The value is not stored in database.
...
027424:20061204:080031 Timeout while receiving data from [zds-file-2-pbs.bfk]
027424:20061204:080031 Getting value of [proc.num[]] from host [zds-file-2-pbs.bfk] failed
027424:20061204:080031 The value is not stored in database.
027424:20061204:080036 Timeout while receiving data from [zds-file-2-pbs.bfk]
027424:20061204:080036 Getting value of [vm.memory.size[buffers]] from host [zds-file-2-pbs.bfk] failed
027424:20061204:080036 The value is not stored in database.
Zabbix's graphs didn't show any data from any host when the time period was below 12h; from a 24h period on, data was shown. Here is some captured data in plain text format, taken from a host that was reachable as usual while the problem was occurring:
2006-12-04 11:09:32 1165226972 587090
2006-12-04 11:05:54 1165226754 431928
2006-12-04 11:02:12 1165226532 782793
2006-12-04 10:58:29 1165226309 804615
2006-12-04 10:54:52 1165226092 687077
2006-12-04 10:51:10 1165225870 803026
2006-12-04 10:47:31 1165225651 482231
2006-12-04 10:43:53 1165225433 673026
2006-12-04 10:40:11 1165225211 553468
2006-12-04 10:36:31 1165224991 195923
...
2006-12-04 09:31:14 1165221074 407798
2006-12-04 09:20:33 1165220433 305847
2006-12-04 09:09:40 1165219780 321637
2006-12-04 08:58:54 1165219134 372145
2006-12-04 08:48:02 1165218482 89440
2006-12-04 08:37:14 1165217834 149001
2006-12-04 08:26:24 1165217184 113331
2006-12-04 08:15:34 1165216534 123223
2006-12-04 08:04:41 1165215881 170538
Normally, the item should have been updated every 60s.
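To put a number on how late the updates were, the gaps between consecutive timestamps can be computed from the plain-text rows. This is only a minimal Python sketch, assuming each row is laid out as "date time unix_timestamp value" like in the listing above. The names update_gaps, LATE_ROWS and EARLY_ROWS are made up for illustration, and only a few of the rows are copied in:

```python
EXPECTED_INTERVAL = 60  # seconds, the configured update interval of the item

# A few consecutive rows copied from the plain-text listing (newest first).
LATE_ROWS = """\
2006-12-04 11:09:32 1165226972 587090
2006-12-04 11:05:54 1165226754 431928
2006-12-04 11:02:12 1165226532 782793
2006-12-04 10:58:29 1165226309 804615
"""

EARLY_ROWS = """\
2006-12-04 08:26:24 1165217184 113331
2006-12-04 08:15:34 1165216534 123223
2006-12-04 08:04:41 1165215881 170538
"""


def update_gaps(rows):
    """Return the seconds between consecutive updates, oldest first."""
    # The third whitespace-separated column is assumed to be the Unix timestamp.
    stamps = sorted(int(line.split()[2]) for line in rows.splitlines() if line.strip())
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


for label, rows in (("08:04-08:26", EARLY_ROWS), ("10:58-11:09", LATE_ROWS)):
    gaps = update_gaps(rows)
    print(label, "gaps in seconds:", gaps, "expected:", EXPECTED_INTERVAL)
```

For these rows the gaps come out at roughly 650 s in the early sample and roughly 220 s in the later one, instead of the expected 60 s.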
After manually disabling monitoring of that host, everything was fine again and all hosts were monitored correctly.
Of course Zabbix didn't send me an alert email, because it thought the client was alive. That's very bad.
I also don't like that one host's kernel panic makes the whole Zabbix monitoring unreliable.
If this hasn't already been improved in the latest versions, perhaps it could be done in the future?
Thanks,
J2B4U

