Timeout while executing a shell script.

  • niall.porter
    Junior Member
    • Jun 2021
    • 8

    #1

    Timeout while executing a shell script.

    We have a custom template supported by a couple of client UserParameters for monitoring status of filesystems defined in /etc/fstab on Linux clients. The one for checking individual filesystems looks like this:

    Code:
    UserParameter=filesystem.mounted[*],timeout 10s sudo stat -f $1 >/dev/null 2>&1 ; echo $?
    The "stat" command is wrapped in a "timeout" command so that it doesn't hang indefinitely if an NFS server is offline. This only works if I set the duration on the timeout command to 1s (one second). If I set it to a more sensible duration like 10s, the item becomes unsupported with the following error in the Zabbix server log:

    item "<hostname>:filesystem.mounted[/]" became not supported: Timeout while executing a shell script.
    The duration value given to "timeout" isn't how long it runs for; it's the maximum time the child command is allowed to run before being killed. So I can't figure out why setting it to anything more than 1s causes Zabbix to think it's timing out. I've tried setting the Timeout value in the client settings to a longer duration than the one we give to the "timeout" command, but that didn't help. Any ideas gratefully received...
  • tim.mooney
    Senior Member
    • Dec 2012
    • 1427

    #2
    You don't say what version of Zabbix you're using.

    Some versions of Zabbix have the client timeout for (passive) Zabbix agent checks compiled into the software, so that it cannot be adjusted with a configuration setting. You literally have to modify the source code and adjust the timeout and recompile.

    That's changed with more recent versions of Zabbix, where it is now runtime-controllable, but I forget which Zabbix 5.x series added it.

    This doesn't apply to active checks though.

    • niall.porter
      Junior Member
      • Jun 2021
      • 8

      #3
      It's Zabbix server and agent 5.4.3.

      The odd thing is that the command "timeout 1s <other command>" means that if the other command is still running after 1 second, it gets killed. It doesn't keep it running for 1 second or whatever you set that to. If the other command only takes 0.1 second to run, then setting the duration to 10 seconds, 1 minute, 42 years etc. won't make it take any longer to run, so I can't figure out why Zabbix thinks it's still running and timing out...
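
The behaviour described above can be checked outside Zabbix. This is a minimal sketch assuming GNU coreutils "timeout"; "true" and "sleep" stand in for the real check, and the variable names are only illustrative. A fast command returns immediately with its own exit status regardless of the limit, while a command that overruns the limit is killed and "timeout" exits with status 124.

```shell
# Fast command under a generous limit: it finishes at once and the
# exit status is the command's own (0 for "true").
start=$(date +%s)
timeout 10s true
rc_fast=$?
elapsed=$(( $(date +%s) - start ))

# Slow command under a short limit: killed after about 1 second,
# and GNU timeout reports exit status 124.
timeout 1s sleep 5
rc_slow=$?

echo "fast: rc=$rc_fast elapsed=${elapsed}s"
echo "slow: rc=$rc_slow"
```

If the fast case behaves like this on the affected host but the Zabbix item still times out, the delay is coming from something other than the wrapped command's normal run time.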

      • splitek
        Senior Member
        • Dec 2018
        • 101

        #4
        Can you change your logic to first check whether the filesystem is mounted, and only then run the stat command (otherwise just exit)? I think a trick like 'ls $1' could be used to check whether the filesystem is available.
        Last edited by splitek; 15-11-2021, 22:18.

        • riBoon
          Junior Member
          • May 2017
          • 25

          #5
          Originally posted by splitek
          Can you change your logic to first check whether the filesystem is mounted, and only then run the stat command (otherwise just exit)? I think a trick like 'ls $1' could be used to check whether the filesystem is available.
          If the mounted filesystem hangs, the ls command would hang as well, so this would not solve his problem.
          @niall.porter: Are you sure you've increased the timeout for the agent and for the proxies/server? (for an active agent check)
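
One way to verify the timeouts asked about above is to read the Timeout parameter from each configuration file. This is a sketch only: "conf_timeout" is a hypothetical helper, the file paths are assumptions that depend on packaging, and the fallback of 3 reflects the default the Zabbix 5.x documentation gives for this parameter (documented range 1 to 30 seconds). For the check in this thread, the value would need to be larger than the duration passed to "timeout".

```shell
# Hypothetical helper: print the Timeout value set in a Zabbix config file.
# Falls back to 3 when the parameter is unset, commented out, or the file
# cannot be read. If the parameter appears more than once, the last match
# is printed, which may not be what the daemon actually uses.
conf_timeout() {
    value=$(sed -n 's/^[[:space:]]*Timeout[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" 2>/dev/null | tail -n 1)
    echo "${value:-3}"
}

# Paths are assumptions; adjust them to your installation.
echo "agent:  $(conf_timeout /etc/zabbix/zabbix_agentd.conf)"
echo "server: $(conf_timeout /etc/zabbix/zabbix_server.conf)"
```

The daemons only read these files at startup, so a changed value takes effect after the agent and the server or proxy are restarted.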

          • niall.porter
            Junior Member
            • Jun 2021
            • 8

            #6
            Yes, so what we're really checking for here is whether the device/NFS server hosting the mounted filesystem has gone offline or become unreachable. This whole request was prompted by a few events where an NFS server in our datacenter became unreachable from our SAP servers in the cloud. When that happens, running a simple "mount | grep <mountpoint>" still returns true because nobody told the system to unmount the filesystem. We use the stat command but wrap it in the timeout command, so that when stat hangs because the server/device is offline, timeout kills it after the set duration. That gives a non-zero return code for Zabbix to use and presumably avoids ending up with loads of hung "stat" commands.
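
The approach described above can be sketched as a small function that is run by hand before wiring it into a UserParameter. "check_fs" is a hypothetical name, and the sketch leaves out the sudo used in the original item so that it measures only the stat call. It assumes GNU coreutils "timeout", which exits with 124 when it has to kill the command.

```shell
# Hypothetical helper: print 0 if the filesystem answers within the limit,
# 124 if stat had to be killed by timeout, or stat's own non-zero status
# for any other failure (for example a path that does not exist).
check_fs() {
    mountpoint=$1
    limit=${2:-10s}
    timeout "$limit" stat -f "$mountpoint" >/dev/null 2>&1
    echo $?
}

check_fs /                      # a healthy local filesystem
check_fs /nonexistent/path 2s   # a path that stat cannot resolve
```

Running the same function once with and once without sudo in front of stat may help show whether the delay comes from the filesystem or from the privilege escalation step.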

            I think we found the root cause of the issue: SELinux. I tried testing the item prototype in the template to run the check from the Zabbix side on an affected host and found that it times out, and the duration of the timeout corresponds to the Timeout setting in the Zabbix client configuration. Checking /var/log/secure showed a load of this:

            Code:
            Nov 15 14:12:48 TPVINF093 sudo[1329156]: zabbix : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/stat -f /
            Nov 15 14:12:48 TPVINF093 systemd[1329160]: pam_unix(systemd-user:session): session opened for user root by (uid=0)
            Nov 15 14:13:13 TPVINF093 sudo[1329156]: pam_systemd(sudo:session): Failed to create session: Connection timed out
            and in the messages file there are loads of SELinux errors for zabbix. The device we're testing on is just a simple test box so I disabled SELinux and bingo - working fine. SELinux has caused us a lot of trouble with Zabbix, guess it's just not done yet...
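
As a hedged sketch of how the SELinux finding above might be confirmed without disabling SELinux outright: "getenforce" reports the current mode and "ausearch" can list recent AVC denials. Both are guarded here because they are only present where the SELinux and audit tools are installed, and "ausearch" generally needs root. Filtering on the string "zabbix" is an assumption about how the denials are labelled.

```shell
# Report the current SELinux mode, or "unavailable" where the tools are missing.
if command -v getenforce >/dev/null 2>&1; then
    selinux_mode=$(getenforce)
else
    selinux_mode="unavailable"
fi
echo "SELinux mode: $selinux_mode"

# List recent AVC denials that mention zabbix (needs root and the audit tools).
if command -v ausearch >/dev/null 2>&1; then
    ausearch -m avc -ts recent 2>/dev/null | grep -i zabbix \
        || echo "no recent zabbix denials found"
fi

# To test without disabling SELinux permanently, permissive mode lasts
# until the next reboot:
#   setenforce 0
```

If the item works in permissive mode and the denials still show up in the audit log, those entries indicate which accesses a local policy would need to allow.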
