One of our machines ("A") had a kernel panic and crashed. Another host ("B") had a filesystem from the crashed machine mounted. As a result, "lsof" hung on "B".
We have a UserParameter "files.open" in our agents that uses lsof:
Code:
UserParameter=files.open,/usr/sbin/lsof|grep -v lsof|grep -v grep|wc -l|sed s/" "//g
Normally, this works quite well. But now that lsof hangs, there are many processes owned by user zabbix that belong to "files.open".
The agent's log says:
Timeout while answering request
The Zabbix server log says:
027424:20061204:111220 Timeout while receiving data from [backup-pbs.bfk]
027424:20061204:111220 Getting value of [proc.num[]] from host [backup-pbs.bfk] failed
027424:20061204:111220 The value is not stored in database.
027424:20061204:111225 Timeout while receiving data from [backup-pbs.bfk]
027424:20061204:111225 Getting value of [net.if.in[eth0]] from host [backup-pbs.bfk] failed
027424:20061204:111225 The value is not stored in database.
...
As long as lsof did not work, the server fetched no data from that client for any item, not just for "files.open".
Can someone please explain to me
- why the server got no data from the whole host, and
- why the processes that hit the timeout aren't killed?
The Zabbix agent's version is 1.1.1, the server's is 1.1.
Thanks,
J2B4U
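For illustration only, here is a minimal sketch of one way to keep a hanging check from piling up processes: run the pipeline under its own time limit and kill the whole process group when the limit is exceeded. It does not explain the agent's behavior asked about above. The function name run_with_timeout and the 3-second limit are assumptions made for this example; they are not part of Zabbix.

```python
import os
import signal
import subprocess


def run_with_timeout(command, timeout_s):
    """Run a shell pipeline; kill its whole process group on timeout.

    Returns the stripped stdout on success, or None if the time limit was hit.
    (Hypothetical helper for illustration, not a Zabbix API.)
    """
    # start_new_session=True gives the shell and all pipeline stages their
    # own process group, so a single killpg() reaches every one of them.
    proc = subprocess.Popen(
        command,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
        start_new_session=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited between the timeout and the kill
        try:
            # Reap the children, but do not wait forever: a process blocked
            # in uninterruptible I/O may not die until the I/O returns.
            proc.communicate(timeout=1)
        except subprocess.TimeoutExpired:
            pass
        return None
    return out.strip()


# Example call (assumed 3-second limit), mirroring the pipeline from the post:
# run_with_timeout("/usr/sbin/lsof | grep -v lsof | grep -v grep | wc -l", 3)
```

One caveat: if lsof is blocked in uninterruptible I/O on the dead mount, even SIGKILL may not take effect until the mount recovers, so a wrapper like this limits the pile-up rather than guaranteeing a clean kill.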

