Timeout while executing a shell script.

**tim.mooney** · 13-11-2021, 10:46

You don't say what version of Zabbix you're using.

Some versions of Zabbix have the client timeout for (passive) Zabbix agent checks compiled into the software, so it cannot be adjusted with a configuration setting. You literally have to modify the source code to change the timeout and then recompile.

That's changed with more recent versions of Zabbix, where it is now runtime-controllable, but I forget which Zabbix 5.x series added it.

This doesn't apply to active checks, though.

**niall.porter** · 15-11-2021, 12:54

It's Zabbix server and agent 5.4.3.

The odd thing is that the command "timeout 1s <other command>" only kills the other command if it is still running after 1 second; it doesn't keep it running for 1 second (or whatever duration you set). If the other command takes only 0.1 seconds, setting the duration to 10 seconds, 1 minute, 42 years etc. won't make it take any longer, so I can't figure out why Zabbix thinks it's still running and times out...
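To illustrate that point about "timeout", here is a minimal sketch, assuming GNU coreutils timeout and sleep on Linux (GNU sleep accepts fractional seconds). The duration is only an upper bound, and a command that hits the limit makes timeout exit with status 124.

```bash
#!/usr/bin/env bash
# "timeout DURATION COMMAND" only caps how long COMMAND may run;
# it never makes COMMAND run longer. Assumes GNU coreutils.

# Fast command, generous limit: returns after ~0.1 s with sleep's own status.
start=$SECONDS
fast_rc=0
timeout 10s sleep 0.1 || fast_rc=$?
fast_elapsed=$(( SECONDS - start ))   # 0 or 1 (SECONDS has 1 s resolution), not 10
echo "fast: exit=$fast_rc"            # fast: exit=0

# Slow command, short limit: sleep gets SIGTERM after 1 s and timeout exits 124.
slow_rc=0
timeout 1s sleep 5 || slow_rc=$?
echo "slow: exit=$slow_rc"            # slow: exit=124
```

So if the wrapped command itself is quick, the wrapper's duration cannot explain a Zabbix timeout on its own; the delay has to come from somewhere else in the execution path.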

**splitek** · 15-11-2021, 22:14

Can you change your logic to first check whether the filesystem is mounted, run the stat command only if it is, and just exit otherwise? I think a trick like 'ls $1' could be used here to check whether the filesystem is available.

**riBoon** · 16-11-2021, 10:03

Originally posted by splitek

> Can you change your logic to first check whether the filesystem is mounted, run the stat command only if it is, and just exit otherwise? I think a trick like 'ls $1' could be used here to check whether the filesystem is available.

If the mounted filesystem hangs, the ls command would hang as well, so this would not solve his problem.
@niall.porter: Are you sure you've increased the timeout for both the agent and the proxies/server (for an active agent check)?
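For reference, the parameter in question in 5.4 is `Timeout` (range 1-30 seconds): in zabbix_agentd.conf it limits how long the agent spends processing an item, and in zabbix_server.conf / zabbix_proxy.conf it sets how long the server or proxy waits for an agent. The duration inside a "timeout ..." wrapper should stay below the relevant values. Below is a small hypothetical helper (the function name and the package-default paths are assumptions) for checking what a given file actually sets; it does not follow Include= directives.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the Timeout value set (uncommented) in a Zabbix
# agent, proxy or server config file. Prints nothing if Timeout only appears
# as a commented-out default. If several lines match, the last one is printed.
zbx_timeout() {
    sed -n 's/^[[:space:]]*Timeout[[:space:]]*=[[:space:]]*\([0-9][0-9]*\)[[:space:]]*$/\1/p' "$1" \
        | tail -n 1
}

# Example (paths assumed; adjust to your installation):
#   for f in /etc/zabbix/zabbix_agentd.conf /etc/zabbix/zabbix_proxy.conf; do
#       printf '%s: Timeout=%s\n' "$f" "$(zbx_timeout "$f")"
#   done
```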

**niall.porter** · 16-11-2021, 10:57

Yes. What we're really checking for here is whether the device/NFS server hosting the mounted filesystem has gone offline or become unreachable. This whole request was prompted by a few incidents where an NFS server in our datacenter became unreachable from our SAP servers in the cloud. When that happens, a simple "mount | grep <mountpoint>" still succeeds, because nobody told the system to unmount the filesystem. So we use the stat command wrapped in the timeout command: when stat hangs because the server/device is offline, timeout kills it after the set duration, which gives Zabbix a non-zero return code to use and presumably avoids ending up with loads of hung "stat" commands.
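The approach described above can be sketched as a small check script. This is a minimal sketch under stated assumptions: the function name check_mount_alive, the 3 s default limit and the script path are hypothetical, it relies on GNU timeout (for `-k`) and Linux /proc/self/mounts, and it assumes the zabbix user can run `stat -f` on the mount point without sudo. It folds in the mount check suggested earlier in the thread by reading the mount table, which never contacts the NFS server, and then lets a timeout-wrapped `stat -f` decide whether the server still answers.

```bash
#!/usr/bin/env bash
# Hypothetical check script (e.g. /usr/local/bin/check_mount_alive.sh).
# Prints 1 if the filesystem at the given mount point answers in time, else 0,
# so the Zabbix item always gets a quick numeric value.

check_mount_alive() {
    local mountpoint="$1"
    local limit="${2:-3s}"   # assumed default; keep limit + 1 s below the agent Timeout

    # 1) Is it in the mount table? Reading /proc/self/mounts never contacts the
    #    NFS server, so it cannot hang. (Mount points containing spaces appear
    #    there as \040 and are not handled by this sketch.)
    if ! awk -v mp="$mountpoint" '$2 == mp { found = 1 } END { exit !found }' /proc/self/mounts; then
        echo 0
        return
    fi

    # 2) Does the filesystem answer? stat -f queries it and blocks on an
    #    unreachable hard NFS mount; timeout sends SIGTERM after $limit and,
    #    via -k, SIGKILL one second later if stat is still running.
    if timeout -k 1s "$limit" stat -f "$mountpoint" >/dev/null 2>&1; then
        echo 1
    else
        echo 0
    fi
}

# Run only when called with arguments, e.g. check_mount_alive.sh /mnt/nfs
if [ "$#" -gt 0 ]; then
    check_mount_alive "$@"
fi
```

As an illustration, it could be wired up with a flexible user parameter such as `UserParameter=nfs.mount.alive[*],/usr/local/bin/check_mount_alive.sh "$1"` (hypothetical key and path).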

I think we found the root cause of the issue: SELinux. I tried testing the item prototype in the template, running the check from the Zabbix side on an affected host, and found that it times out; the duration of the timeout matches the Timeout setting in the Zabbix agent (client) configuration. Checking /var/log/secure showed a load of entries like this:

```
Nov 15 14:12:48 TPVINF093 sudo[1329156]: zabbix : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/stat -f /
Nov 15 14:12:48 TPVINF093 systemd[1329160]: pam_unix(systemd-user:session): session opened for user root by (uid=0)
Nov 15 14:13:13 TPVINF093 sudo[1329156]: pam_systemd(sudo:session): Failed to create session: Connection timed out
```

And in the messages file there are loads of SELinux errors for zabbix. The device we're testing on is just a simple test box, so I disabled SELinux and, bingo, it's working fine. SELinux has caused us a lot of trouble with Zabbix; I guess it's just not done yet...
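The /var/log/secure lines above hint at where the time went: sudo logged the command at 14:12:48, and pam_systemd gave up creating a session at 14:13:13, about 25 seconds later. That suggests the check was stuck in sudo's PAM session setup rather than in stat itself. Where SELinux has to stay enabled (i.e. on hosts that are not a throwaway test box), one common alternative to disabling it is to collect the AVC denials raised by the Zabbix domains and build a local policy module from them. The sketch below is a hypothetical filter (the function name and module name zabbix_local are made up); it assumes the reference-policy domain names zabbix_t / zabbix_agent_t and the standard ausearch, audit2allow and semodule tools. A generated module allows whatever was denied, so it should be reviewed before loading.

```bash
#!/usr/bin/env bash
# Hypothetical filter: keep only SELinux AVC denials raised by Zabbix domains
# (assumed names zabbix_t / zabbix_agent_t, as in the reference policy).
# Reads audit records on stdin and prints the matching lines.
zabbix_avc_denials() {
    grep -E 'avc:[[:space:]]+denied' \
        | grep -E 'scontext=[^[:space:]]*:zabbix[a-z_]*_t(:|[[:space:]]|$)'
}

# Typical use on the affected host (as root, with the audit tools installed):
#   ausearch -m avc -ts recent | zabbix_avc_denials
# Turn the denials into a local policy module, review zabbix_local.te, then load it:
#   ausearch -m avc -ts recent | zabbix_avc_denials | audit2allow -M zabbix_local
#   semodule -i zabbix_local.pp
```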
