If you've got an external script that ever hangs, it appears to create quite a bit of badness.
First of all, zabbix is multithreaded so the listen_sock is fd 3 in the zabbix server, and that leaks into the called external script because zabbix only does a popen(). It would be better to have zabbix manually do a fork()+exec(), or fork()+system(), and at the very least close(listen_sock) in the child. If you try to restart the zabbix server while an external script is hung, the script still has the listen_sock open, and that prevents the zabbix server from restarting.
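To show what I mean by fork()+exec() with a close() in between, here is a rough sketch in Python rather than the server's C. The function name run_closing_fds and its arguments are made up for illustration; this is not zabbix code, just the general shape of the idea.

```python
import os


def run_closing_fds(argv, fds_to_close):
    """fork()+exec() argv, closing the given descriptors in the child first.

    Returns the child's exit code, or the negated signal number if it was
    killed by a signal. Hypothetical helper, for illustration only.
    """
    pid = os.fork()
    if pid == 0:
        # Child: drop descriptors the script has no business holding
        # (e.g. the server's listening socket) before exec'ing.
        try:
            for fd in fds_to_close:
                try:
                    os.close(fd)
                except OSError:
                    pass  # already closed, nothing to do
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed; never return into the parent's code.
            os._exit(127)
    # Parent: wait for the child and translate the wait status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)
```

Marking the socket close-on-exec (FD_CLOEXEC) when it is created should get the same result without having to track the fd number at the call site.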
The zabbix_server should also probably try to SIGINT and then SIGKILL any external script that it times out on and gives up on (although if zabbix does a fork()+system(), that invokes a subshell, and killing the subshell doesn't necessarily kill the hung subprocess that the shell started).
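One way around the subshell problem is to put the script into its own process group and signal the whole group instead of a single pid. A minimal sketch, again in Python and not taken from zabbix; run_with_group_timeout and the grace parameter are names I invented for the example.

```python
import os
import signal
import subprocess


def run_with_group_timeout(argv, timeout, grace=2.0):
    """Run argv in its own process group; on timeout, SIGINT then SIGKILL the group.

    Returns the exit code of the direct child (negative if it died from a
    signal). Hypothetical helper, for illustration only.
    """
    # start_new_session=True makes the child call setsid(), so it leads a new
    # process group whose id equals its pid; anything it spawns joins that group.
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        pass

    # Polite attempt first: SIGINT to every process in the group.
    try:
        os.killpg(proc.pid, signal.SIGINT)
    except ProcessLookupError:
        pass
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        pass

    # Sweep up anything that ignored SIGINT, including children of the shell.
    try:
        os.killpg(proc.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the whole group is already gone
    return proc.wait()
```

There is a small window where the group id could be reused after everything has exited, and a process that calls setsid() itself escapes the group, so this is a mitigation rather than a guarantee.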
Writers of external scripts should probably close file descriptors >=3 (up to 255 or 1024 or whatever) and should aggressively time out the entire process. Since an alarm() timeout set in the same process can be overridden by library calls that use the same facility for their own timeouts, it might make sense to wrap an external script in a parent process that only fork()s off a worker and then manages timing out that child.
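For script writers, the wrapper could look something like the sketch below. It assumes the real check lives in a Python function; MAX_FD, close_inherited_fds and run_with_deadline are all hypothetical names, and 1024 is just a guess at a reasonable upper bound. The parent never uses alarm(), so nothing the worker calls can clobber the timeout.

```python
import os
import signal
import time

MAX_FD = 1024  # assumed upper bound; adjust for your system


def close_inherited_fds(low=3, high=MAX_FD):
    """Close every descriptor in [low, high), ignoring ones that are not open."""
    os.closerange(low, high)


def run_with_deadline(worker, timeout, poll=0.05):
    """Fork worker() into a child and kill it if it outlives the deadline.

    Returns the worker's exit code, the negated signal number if it died from
    a signal, or None if it had to be killed for running too long.
    """
    pid = os.fork()
    if pid == 0:
        code = 1  # reported if worker() raises
        try:
            code = worker() or 0
        finally:
            os._exit(code)

    # Parent: poll with WNOHANG so the deadline does not depend on alarm().
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        done, status = os.waitpid(pid, os.WNOHANG)
        if done:
            if os.WIFEXITED(status):
                return os.WEXITSTATUS(status)
            return -os.WTERMSIG(status)
        time.sleep(poll)

    # Deadline passed: kill the worker and reap it so no zombie is left behind.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    return None
```

In a real script you would call close_inherited_fds() once at the very top, before doing anything else, and then run the actual check through run_with_deadline() with a timeout shorter than the one the server uses.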
I also noticed that with a hung external check, I started losing datapoints on other items (e.g. 3 different agent.ping items stopped being polled although the rest of the items on the zabbix agent were still getting polled). This is with 1.4.6.