It gets worse and stranger by the day.
I created the following script:
<script>
#!/bin/bash
# Poll the Zabbix agent once a second and log each result.
#HOST_NAME="censored_hostname_here.amazonaws.com"
HOST_NAME="censored_hostname_here2.amazonaws.com"
#HOST_NAME="censored_hostname_here3.amazonaws.com" # This is down and WILL fail.
LOG_FILE="/tmp/uptime.test.${HOST_NAME}"
while :
do
    DATE="$(date +%D-%T)"
    # agent.ping returns 1 when the agent answers; stderr is captured so errors get logged too.
    STATUS="$(zabbix_get -s "${HOST_NAME}" -p 10050 -k "agent.ping" 2>&1)"
    echo "${DATE} - Host : ${HOST_NAME} - Status = ${STATUS}" | tee -a "${LOG_FILE}"
    sleep 1
done
</script>
I let this run endlessly so I can watch the situation in "real time". I got an alert at 6:22 AM this morning. I set that script off on the affected host at around 9 AM, and everything came back as status=1 (no error) all day. Then around 11:30 I received an "All OK" notice for that alert. Then at 1:33, as I was writing this, I got another alert that the host had been unreachable for 10 minutes.
However, the host never stops responding "ALL OK" in the script. It keeps coming back OK while the interface is sending out e-mails stating that it has been out of touch with the node for more than 10 minutes, which can't be possible. Something is seriously amiss here.
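To cross-check the log against the alert times, something like the following could list every probe that did not come back as 1. This is only a sketch: the function name `find_failed_probes` is made up for illustration, and it assumes the log keeps exactly the line format written by the loop above.

```shell
#!/bin/bash
# Hypothetical helper: print every logged probe whose status is not exactly "1".
# Assumes each log line looks like:
#   <date> - Host : <host> - Status = <status>
find_failed_probes() {
    local log_file="$1"
    # Split each line on the " - Status = " separator; field 2 is the status text.
    # Lines without the separator (e.g. continuation lines of a multi-line error)
    # have an empty field 2 and are printed as well.
    awk -F' - Status = ' '$2 != "1"' "$log_file"
}

# When run directly with a log file argument, scan that file.
if [ "$#" -gt 0 ]; then
    find_failed_probes "$1"
fi
```

Run against the log (for example `find_failed_probes "/tmp/uptime.test.${HOST_NAME}"`), no output around the 1:33 alert would suggest the agent itself was answering the whole time.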
Does that give you any ideas? I am working toward migrating to CentOS first and then upgrading to the latest version once the migration is done.

J