Hi all,
I have a bit of a strange requirement which I'm looking for some advice on.
I have a large number of HP servers, initially about 200 but this could grow to as many as 2,000 in the future, where I need to be able to detect and report on the power state of the servers.
I have a fair amount of scripting which allows me to pull back various server metrics via IPMI (iLO). Here's a small script for just the power state:
Code:
#!/bin/bash
# Setup some variables
USER=user
PASS=password
IPMIPATH=/usr/sbin/ipmi-chassis
SCRIPTTIMEOUT=5
# Get the power state ($1 is the iLO hostname or IP address)
IPMIRESULT=$(timeout "$SCRIPTTIMEOUT" "$IPMIPATH" -D LAN2_0 -h "$1" -u "$USER" -p "$PASS" -l USER -W discretereading --get-status 2>/dev/null)
if [ -z "$IPMIRESULT" ]
then
    # Set a value of 2 if no result is received
    echo 2
else
    # Strip out just the power state line
    SYSTEMPOWER=$(echo "$IPMIRESULT" | grep "System Power")
    if [ -z "$SYSTEMPOWER" ]
    then
        # Set a value of 2 if the System Power status is not returned
        echo 2
    else
        # Strip out just the power state
        SYSTEMPOWERSTATE=$(echo "$SYSTEMPOWER" | awk '{ print $4 }')
        if [ "$SYSTEMPOWERSTATE" = "on" ]
        then
            # Set a value of 1 for powered on
            echo 1
        else
            # Set a value of 0 for powered off
            echo 0
        fi
    fi
fi
It delivers the following:
- 0 - powered off
- 1 - powered on
- 2 - error (no result, timeout (5s) or incorrect values)
This is added as an external check on each monitored server.
My struggle is in producing a count of servers in an off state, on state and error state.
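To make the goal concrete, this is roughly the tally I'm after. It's only a sketch: it assumes the per-server values (0, 1 or 2) have already been collected somewhere as one value per line, and the function name count_states is made up for illustration. It doesn't solve the problem of getting the values out of the monitoring system in the first place.

```shell
#!/bin/bash
# Hypothetical helper (not part of my existing scripts): tally power-state
# values read from stdin, one value per line.
#   0 = powered off, 1 = powered on, 2 = error
count_states() {
    awk '
        # Compare as strings so blank lines are not mistaken for 0
        $1 == "0" { off++ }
        $1 == "1" { on++ }
        $1 == "2" { err++ }
        END { printf "off=%d on=%d error=%d\n", off, on, err }
    '
}
```

Fed five values such as 0, 1, 1, 2, 1 on stdin, it would print a single summary line with the three counts, which is the sort of figure I'd like to graph and alert on.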
In aggregate checks the group function of count doesn't seem to exist. Additionally, there doesn't seem to be a way of filtering the items in the aggregate key based on their values.
I'm also wary that as this scales, the number of external scripts running may become an issue.
Hoping someone out there has some advice for me please?
Thanks,
Sean