If externalscripts hang...

  • lamont
    Member
    • Nov 2007
    • 89

    #1

    If an external script ever hangs, it appears to cause quite a few problems.

    First of all, Zabbix is multithreaded, so listen_sock is fd 3 in the Zabbix server, and it leaks into the called external script because Zabbix only does a popen(). It would be better for Zabbix to do a manual fork()+exec(), or fork()+system(), and at the very least close(listen_sock) in the child. As it stands, if you try to restart the Zabbix server while an external script is hung, the script still holds listen_sock open and prevents the server from restarting.
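
    The descriptor leak described above can be reproduced outside Zabbix. The following is a minimal sketch in Python, not Zabbix code: the names `CHILD`, `child_sees_socket` and `demo` are made up for illustration, and it assumes a POSIX system. It opens a listening socket, marks it inheritable to mimic a plain C socket() followed by popen(), and asks a child process whether it can see that descriptor, once without and once with descriptor closing.

```python
import socket
import subprocess
import sys

# Child program: exit 0 if the descriptor number passed as argv[1]
# refers to an open socket, non-zero otherwise (including EBADF).
CHILD = (
    "import os, stat, sys\n"
    "mode = os.fstat(int(sys.argv[1])).st_mode\n"
    "sys.exit(0 if stat.S_ISSOCK(mode) else 1)\n"
)


def child_sees_socket(fd, close_fds):
    """Run the child and report whether it inherited `fd` as a socket."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, str(fd)],
        close_fds=close_fds,
        stderr=subprocess.DEVNULL,  # hide the traceback from a closed fd
    )
    return result.returncode == 0


def demo():
    """Return (leaked, closed) for a listening socket (hypothetical helper)."""
    listen_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listen_sock.bind(("127.0.0.1", 0))  # any free port
        listen_sock.listen(1)
        # Python makes descriptors non-inheritable by default; undo that
        # here to imitate a C program that never sets close-on-exec.
        listen_sock.set_inheritable(True)
        fd = listen_sock.fileno()
        leaked = child_sees_socket(fd, close_fds=False)
        closed = child_sees_socket(fd, close_fds=True)
        return leaked, closed
    finally:
        listen_sock.close()
```

    On a POSIX system, `demo()` should report that the child sees the socket when descriptors are left alone and does not see it when they are closed before exec, which is the behaviour the post asks Zabbix to adopt.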

    The zabbix_server should also probably try to SIGINT and then SIGKILL any external script that it times out on and gives up on. (If Zabbix does a fork()+system(), though, that invokes a subshell, and killing the subshell doesn't necessarily kill the hung subprocess that the shell started.)
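
    One common way around the subshell problem is to put the command in its own process group and signal the whole group rather than a single PID. The sketch below illustrates that idea in Python; it is not how Zabbix behaves, and `run_with_group_timeout` and its `grace` parameter are hypothetical names. It assumes a POSIX system.

```python
import os
import signal
import subprocess


def run_with_group_timeout(cmd, timeout, grace=1.0):
    """Run a shell command; on timeout, signal its whole process group.

    Returns (returncode, stdout) on success, or (None, b"") on timeout.
    """
    proc = subprocess.Popen(
        cmd,
        shell=True,
        start_new_session=True,  # child becomes leader of a new group
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    try:
        out, _ = proc.communicate(timeout=timeout)
        return proc.returncode, out
    except subprocess.TimeoutExpired:
        # SIGINT first, then SIGKILL, sent to every process in the group,
        # so children started by the shell are signalled as well.
        for sig in (signal.SIGINT, signal.SIGKILL):
            try:
                os.killpg(proc.pid, sig)
            except ProcessLookupError:
                break  # the whole group is already gone
            try:
                proc.communicate(timeout=grace)
                break  # everything exited and the pipe closed
            except subprocess.TimeoutExpired:
                continue  # escalate to the next signal
        return None, b""
```

    For example, `run_with_group_timeout("sleep 30; echo never", 0.5)` should give up after roughly half a second instead of waiting for the sleep to finish. A script that deliberately moves itself into another session or group would still escape this, so it does not replace timeouts inside the script.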

    Writers of external scripts should probably close file descriptors >= 3 (up to 255 or 1024 or whatever) and aggressively time out the entire process. Since alarm timeouts in the same process can be overridden by library calls that use the same facility for their own timeouts, it might make sense to wrap an external script in a parent process that only fork()s off a worker and then manages timing out that child.
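
    The parent/worker arrangement suggested above might look roughly like this in Python. It is a sketch under assumptions: `run_worker_with_deadline` is a hypothetical helper, the upper bound of 1024 descriptors is arbitrary, and it needs a POSIX system for fork(). The parent does nothing except wait, so no library call in the worker can interfere with its timer.

```python
import os
import signal
import time


def run_worker_with_deadline(worker, deadline_s, poll_s=0.05):
    """Fork a child to run worker(); kill it if it exceeds deadline_s.

    Returns the child's exit code, a negative signal number if it was
    killed by a signal on its own, or None if the deadline was hit.
    """
    pid = os.fork()
    if pid == 0:
        # Child: drop inherited descriptors >= 3 (bound is an assumption).
        os.closerange(3, 1024)
        try:
            code = worker()
        except BaseException:
            code = 1
        os._exit(code or 0)

    # Parent: only watches the clock and the child.
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        done, status = os.waitpid(pid, os.WNOHANG)
        if done:
            if os.WIFEXITED(status):
                return os.WEXITSTATUS(status)
            return -os.WTERMSIG(status)
        time.sleep(poll_s)

    os.kill(pid, signal.SIGKILL)  # worker is considered hung
    os.waitpid(pid, 0)            # reap it so no zombie is left behind
    return None
```

    A check function that returns normally yields its exit code, while something like `run_worker_with_deadline(lambda: time.sleep(30), 0.3)` should come back with None shortly after the deadline.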

    I also noticed that with a hung external check, I started losing data points on other items (e.g. three different agent.ping items stopped being polled, although the rest of the items on the Zabbix agent were still being polled). This is with 1.4.6.
  • Alexei
    Founder, CEO
    Zabbix Certified Trainer
    Zabbix Certified Specialist, Zabbix Certified Professional
    • Sep 2004
    • 5654

    #2
    An external script must be written so that it times out, exits and clears all associated resources without Zabbix intervention. Sure, we can make Zabbix kill a timed-out script, but this won't solve the problem. The script may do fork(), system(), whatever, so killing just one process is not a proper solution anyway.
    Alexei Vladishev
    Creator of Zabbix, Product manager
    New York | Tokyo | Riga

    • Alexei
      Founder, CEO
      Zabbix Certified Trainer
      Zabbix Certified Specialist, Zabbix Certified Professional
      • Sep 2004
      • 5654

      #3
      By the way, the same applies to user parameters. It is the Zabbix administrator's responsibility to make sure that all scripts handle timeouts properly and clean up resources, temporary files, etc.

      • lamont
        Member
        • Nov 2007
        • 89

        #4
        Yes, but sometimes you can get bitten.

        I had a simple script using Perl LWP that set timeouts on the LWP connection. That works 99.9% of the time, but I've found an edge case, which I think occurs only during SSL negotiation, where it hangs indefinitely.

        This means that I just need very paranoid explicit timeouts around the external script (like I said, fork() a child and then have the parent time out and kill the child if it hangs).

        It would be good if Zabbix closed the LISTEN socket before forking off the process, though; otherwise you inherit a file descriptor you do not expect. That can be mitigated by simply closing that socket in the external script, but it surprised me to find it was holding the socket open.


        • alixen
          Senior Member
          • Apr 2006
          • 474

          #5
          Originally posted by lamont
          Yes, but sometimes you can get bitten.

          I had a simple script using Perl LWP that set timeouts on the LWP connection. That works 99.9% of the time, but I've found an edge case, which I think occurs only during SSL negotiation, where it hangs indefinitely.
          I have had the same problem with wget and SSL, and I found a nice shell wrapper that solved it.

          I found it here: http://www.pixelbeat.org/scripts/timeout

          Hope this helps
          Alixen
          http://www.alixen.fr/zabbix.html


          • steev
            Member
            • Aug 2010
            • 38

            #6
            This condition hangs the whole agent.

            I have a curl check that was hanging, and I noticed that this stopped the agent from doing ANYTHING.

            I discovered this through another Zabbix agent (active) check that sends localtime every thirty seconds. I use this as a 'heartbeat'. If I get nodata(600) from this check, a trigger pages me because the server isn't responding.
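
            The heartbeat logic above can be modelled in a few lines. This is only an illustration of the idea behind a nodata(600) trigger, not Zabbix's implementation; the `Heartbeat` class and its method names are invented for the example, and timestamps are passed in explicitly so the behaviour is easy to follow.

```python
import time


class Heartbeat:
    """Tracks when the last heartbeat value arrived (illustrative model)."""

    def __init__(self, now=None):
        # Treat creation time as the first heartbeat.
        self.last_seen = time.time() if now is None else now

    def record(self, now=None):
        """Call whenever a new heartbeat value is received."""
        self.last_seen = time.time() if now is None else now

    def nodata(self, window_s, now=None):
        """True if no heartbeat arrived within the last window_s seconds."""
        current = time.time() if now is None else now
        return (current - self.last_seen) >= window_s


# Heartbeats every 30 seconds, alert after 600 seconds of silence.
hb = Heartbeat(now=0)
hb.record(now=30)
still_alive = not hb.nodata(600, now=60)   # last value is 30 s old
should_page = hb.nodata(600, now=700)      # last value is 670 s old
```

            With a 30-second interval and a 600-second window, about twenty consecutive heartbeats have to go missing before the alert fires, which tolerates short hiccups but still catches an agent that has stopped entirely.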

            I've also encountered a condition where an init script that I run via an action to restart a process completely hung the Zabbix agent.

            I can see the point that scripts should time themselves out in a reasonable amount of time, but if they don't, I really don't think that should break the whole Zabbix agent.
