Linux software RAID monitoring


ghislain
12-06-2006, 11:06
Hello,

Before I go and build the best regex in the world to monitor software RAID disks on Linux, I wanted to know whether any of you have already built such a command :)

Basically, the idea is to look inside /proc/mdstat to see whether any disk has failed.


regards,
Ghislain.

Clansman
12-06-2006, 17:00
Should not be hard, but it gets tricky if you want to monitor several arrays.

Anyway, I'd suggest writing a script that produces predictable output (say 0 for all OK, 1 for at least one array not OK) and then calling it from Zabbix, instead of polluting the zabbix_agent configuration file with a multi-line, 10-pipe command with awk scripting between the pipes... :-)
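
A minimal sketch of that wrapper-script idea, assuming the status line under each array in /proc/mdstat ends with a bracketed map such as [UU] or [U_]. The function name raid_summary and its optional file argument are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print 0 if every array looks healthy, 1 otherwise.
# $1 = path of an mdstat-style file (defaults to /proc/mdstat).
raid_summary() {
    file="${1:-/proc/mdstat}"
    # A status map containing "_" (for example [U_]) marks a missing member.
    if grep -q '\[[U_]*_[U_]*]' "$file" 2>/dev/null; then
        echo 1
    else
        echo 0
    fi
}

# In a real script the last line would be:  raid_summary "$@"
```

Saved as a script, this could be wired up with a single UserParameter line pointing at it. Note that it also prints 0 when the file cannot be read, which may or may not be what you want.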

Cheers,

LEM
23-06-2006, 23:14
I personally use things like:

UserParameter=custom.md.md0,/etc/zabbix/bin/custom.md md0
UserParameter=custom.md.md1,/etc/zabbix/bin/custom.md md1

Where /etc/zabbix/bin/custom.md is a script that just:
. runs 'mdadm --detail /dev/$1' as a system call
. uses cut/grep to return the 'State : ' string.

Hope this'll help.

cameronsto
24-06-2006, 22:37
LEM, any chance you can post the full script you're using? I'm trying to piece one together and am having issues for some reason. If I run my command manually it works, but if I run it via a bash script I get errors.

Thanks,

Cameron

LEM
28-06-2006, 11:39
Here is what I use in zabbix_agentd.conf:

UserParameter=custom.raidstate.md0,/etc/zabbix/bin/custom.raidstate md0
UserParameter=custom.raidstate.md1,/etc/zabbix/bin/custom.raidstate md1


And here is the code for /etc/zabbix/bin/custom.raidstate :

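#!/usr/bin/perl
#
# Prints 0 if the state of the md array is exactly 'clean', 1 otherwise.
# Shell equivalent of the command run below:
#sudo /sbin/mdadm --detail /dev/md0|grep -i "State :"|cut -d ":" -f 2
#

use strict;
use warnings;

# Array name without the /dev/ prefix, e.g. md0
my $device = $ARGV[0];

my $return = `/usr/bin/sudo /sbin/mdadm --detail /dev/$device |grep -i \"State :\"|cut -d \":\" -f 2`;

chomp ($return);
$return =~ s/\ //g;    # strip the spaces around the state string

if ( $return eq 'clean' ) {
    print "0";
} else {
    print "1";
}

# - The End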


I use Numeric (float) to store this kind of value with no custom multiplier. For triggering, I use something like:

{MyHost:custom.raidstate.md0.last(0)}>0


To be able to run mdadm --detail as the zabbix user, I use sudo with the following statements in the sudoers file:

# Cmnd alias specification
Cmnd_Alias ZABBIXCMD = /sbin/mdadm --detail *
# ZABBIX special privileges
zabbix ALL=NOPASSWD: ZABBIXCMD


Hope this'll help you.

Cheers,

cameronsto
28-06-2006, 15:05
That definitely helped. I don't know why I didn't think to use Perl; I was using bash and for whatever reason it wasn't working. I did get an error when trying to set up the sudo command: running it as zabbix gave "permission denied" on /dev/md0. In the meantime I just have a cron job running as root and printing the status out to a file.

Thanks for the tips.

cameron

Nate Bell
28-06-2006, 16:59
If you want an alternative bash script, here's what I use:

#!/bin/bash
# Usage: raid.sh <disk device name to check>
# Ex: ./raid.sh md0
disk=$1
# Prints 1 if the status line below the array name contains UU, else 0
temp=$(grep -A1 "$disk" /proc/mdstat | grep UU | wc -l)
echo "$temp"

Since mdstat in /proc keeps track of the RAID arrays and prints UU if things are kosher, and either _U or U_ or even __ if things have gone really downhill, grepping for UU works. Count the matching lines and you get a 1 or 0 response. My results are the opposite of LEM's, since a 1 for me is good, but a 1 for LEM is bad.

You know, now that I look at that script again, I could just make it one line and throw the script away.
UserParameter=mdstat[*],grep -A1 $1 /proc/mdstat | grep UU | wc -l
Huh, that's even easier. Hell, there might be a way to make that a system.run command. Maybe it's time to take a look at my scripts and see what I've learned since I wrote them. Anyhow, just giving more options.

Nate

cameronsto
28-06-2006, 21:06
My output with 4 drives is [UUUU]. So even if 1 drive failed, it could still pass your script if the output was [UUU_], right?

-cameron

Nate Bell
28-06-2006, 22:56
Ah, true, though you could just grep for UUUU and it would work.

How about trying this one on for size:
UserParameter=mdstat[*],grep -A1 $1 /proc/mdstat | tail -n1 | grep _ | wc -l
That one doesn't care how many drives you have, only that one or more of them has gone missing, and it even gives the same results LEM's does.
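
To make the difference between the two checks concrete, here is a sketch that runs both against a made-up status block for a 4-disk RAID5 array with one member missing. The sample text and the function names uu_check and missing_check are illustrative, not taken from a real host.

```shell
#!/bin/sh
# Made-up /proc/mdstat excerpt: 4-disk RAID5 with one member missing.
sample_degraded='md1 : active raid5 sdc1[2] sdb1[1] sda1[0]
      2930279808 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]'

# Original check: counts status lines containing "UU".
# $1 = array name, $2 = mdstat text
uu_check() {
    printf '%s\n' "$2" | grep -A1 "$1" | grep UU | wc -l
}

# Revised check: counts status lines containing an underscore.
missing_check() {
    printf '%s\n' "$2" | grep -A1 "$1" | tail -n1 | grep _ | wc -l
}

uu_check md1 "$sample_degraded"       # still finds "UU" inside [UUU_]
missing_check md1 "$sample_degraded"  # finds the "_" for the missing member
```

The first call reports the array as fine even though a member is gone; the second one flags it regardless of how many disks the array has.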

Nate

pdwalker
18-10-2006, 09:33
Just a small revision

Change:
UserParameter=mdstat[*],grep -A1 $1 /proc/mdstat | tail -n1 | grep _ | wc -l
To:
UserParameter=mdstat[*],grep -A1 $1 /proc/mdstat | tail -n1 | grep -c _
The -c argument makes grep print the number of matching lines itself.

Actually, you can even remove the tail command, since (at least on my Linux systems) the underscore ('_') only occurs when a device has failed and does not appear on the first status line for the device:

UserParameter=mdstat[*],grep -A1 $1 /proc/mdstat | grep -c _
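
One caveat with grep -A1 $1: the pattern is a substring match, so md1 would also select md10 or md127 on a host with that many arrays. Below is a sketch that matches the array name exactly, using awk. The function name array_missing is made up, and the mdstat layout (array name in the first field, status map on the following line) is assumed from the examples in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print 1 if the named array has a missing member, else 0.
# $1 = array name (e.g. md1), $2 = mdstat-style file (defaults to /proc/mdstat)
array_missing() {
    awk -v name="$1" '
        # Match the array name exactly, then inspect the line that follows it.
        $1 == name && $2 == ":" { getline; found = ($0 ~ /\[[U_]*_[U_]*]/) }
        END { print found + 0 }
    ' "${2:-/proc/mdstat}"
}
```

As far as I know, Zabbix substitutes $1 inside flexible user parameters, so awk code like this is probably easier to keep in a script file than inline in zabbix_agentd.conf.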

simix
26-10-2006, 14:26
I maintain my own raidmon tool, which can easily be integrated with Zabbix. The tool is here: http://www.invoca.ch/pub/packages/raidmon/

In zabbix, I have this config to monitor disks for zabbix-1.1.x:
Items:
RAID number of failed devices in arrays | system.run[raidmon status failed,wait] | 60 | 7 | 365 | ZABBIX agent
RAID number of syncing arrays | system.run[raidmon status syncing,wait] | 60 | 7 | 365 | ZABBIX agent
RAID number of arrays | system.run[raidmon status number,wait] | 60 | 7 | 365 | ZABBIX agent

Triggers:
RAID has failed devices in arrays on {HOSTNAME} | {Unix_t:system.run[raidmon status failed,wait].last(0)}>0 | High
RAID is syncing arrays on {HOSTNAME} | {Unix_t:system.run[raidmon status syncing,wait].last(0)}>0 | Average
RAID number of arrays has changed on {HOSTNAME} | {Unix_t:system.run[raidmon status number,wait].diff(0)}>0 | Information

prh
04-11-2006, 13:26
If you have EnableRemoteCommands set in your agents (WARNING: potential security issues involved), you could just use this item:

Description: Failed RAID devices
Key: system.run[cat /proc/mdstat | egrep '(U_|_U)' | wc -l]

Returns the number of arrays with a failed device (one matching status line per array).
Returns zero if there are no failed RAID devices or no RAID devices at all.

zalink
30-12-2006, 18:55
Sadly none of the solutions presented here work for me.

I have found that the State reported by mdadm sometimes returns dirty when it's still busy raiding data.

Also, the solutions based on /proc/mdstat do not work well for multiple devices.

It seems, however, that mdadm returns a numeric result code that can very easily be used (the man page documents these for --detail combined with --test):

0 The array is functioning normally.
1 The array has at least one failed device.
2 The array has multiple failed devices and hence is unusable (raid4 or raid5).
4 There was an error while trying to get information about the device.


Thus I used the following:


UserParameter=mdstat[*],sudo /sbin/mdadm --detail --test -b /dev/$1 >/dev/null 2>&1; echo $?



You will still need the sudoers addition mentioned earlier (edit it with visudo):


Cmnd_Alias ZABBIXCMD = /sbin/mdadm --detail *
# ZABBIX special privileges
zabbix ALL=NOPASSWD: ZABBIXCMD
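
The exit codes listed above are what the mdadm man page describes for --detail combined with --test. A small sketch that turns the number into a label, for example to mirror a Zabbix value map. The function name raid_status_text and the label wording are mine, not mdadm's.

```shell
#!/bin/sh
# Hypothetical helper: map an `mdadm --detail --test` exit status to a label.
raid_status_text() {
    case "$1" in
        0) echo "ok" ;;
        1) echo "degraded" ;;
        2) echo "unusable" ;;
        4) echo "error" ;;
        *) echo "unknown" ;;
    esac
}

# Example use on a host with mdadm and the sudoers entry in place:
#   sudo /sbin/mdadm --detail --test /dev/md0 >/dev/null 2>&1
#   raid_status_text $?
```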

Pak
27-01-2010, 12:15
I had a few small issues implementing this monitor, because I'm a newbie (I've only been using Zabbix for one week).

So I'm writing down how I did it; maybe it will help someone.


In /etc/zabbix/zabbix_agentd.conf I added this:

#RAID CHECK
UserParameter=custom.mdstat,cat /proc/mdstat | grep -c _

Then I added an item to the host with
Key: custom.mdstat
and no particular settings.

Then I added a trigger:
Expression: {hostname:custom.mdstat.last(0)}>0


It works like a charm :) and it counts the status lines that show a failed disk.

I tried these settings in an environment with 3 mirrors (/dev/md0 (sda1,sdb1), /dev/md1 (sda2,sdb2), /dev/md2 (sda3,sdb3)); I put each mirror into a failed state and it worked...
I don't know whether it works with more than 2 devices (for example RAID 5), but I suppose it does :)

my 2 cents :)

Paolo

sybex
01-02-2010, 07:18
Hi, ...

I just write the output of the RAID status to a file. After that I use the standard Zabbix function to checksum this file.

If anything changes in the RAID status, the checksum changes too, and in that case I have to check the status anyway, because it means something has changed there.

A trigger with high severity on this item also helps to get informed about an error.

Btw, I don't have a Linux software RAID; it is a hardware RAID in an HP machine.

fratotec
28-06-2010, 16:37
Hi, LEM's script gives false positives on a busy RAID1.
I realized that the "State" of my RAID1 array changes from "clean" to "active" when heavy writes occur.

So I changed the following line to
my $return = `/usr/bin/sudo /sbin/mdadm --detail /dev/$device |grep \"Active\"|cut -d \":\" -f 2`;

This hopefully alerts me if the number of Active devices is not equal to 2.
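
With only that line changed, the script would still compare the result with 'clean', so presumably the comparison was adjusted as well. A sketch of the same idea that avoids hard-coding 2, assuming mdadm --detail prints "Raid Devices : N" and "Active Devices : N" lines. The function name active_matches_raid is made up.

```shell
#!/bin/sh
# Hypothetical helper: read `mdadm --detail` output on stdin and print
# 0 if "Active Devices" equals "Raid Devices", 1 otherwise.
active_matches_raid() {
    awk -F' *: *' '
        /Raid Devices/   { raid = $2 }
        /Active Devices/ { active = $2 }
        # No "Raid Devices" line at all is treated as a failure.
        END { print ((raid != "" && raid + 0 == active + 0) ? 0 : 1) }
    '
}

# Example use on a host with mdadm and the sudoers entry in place:
#   sudo /sbin/mdadm --detail /dev/md0 | active_matches_raid
```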



Cheers

Franz

casshan
09-07-2010, 10:32
How about?

sudo mdadm -D /dev/md3 | grep 'Failed Devices' | cut -d ':' -f 2 | tr -d ' '


Just have a trigger if the value is > 0. No client-side scripts needed, just remote commands enabled.

nack
07-10-2011, 14:44
Monitoring software raid:

Item1
To check whether software RAID exists

Name: Software RAID exists
Key: vfs.file.regmatch[/proc/mdstat,raid]


Item2
To check whether the RAID is broken

Name: Software RAID broken
Key: vfs.file.regmatch[/proc/mdstat,_]


Trigger

If RAID exists and the RAID is broken, then send an alert.

Name: Software raid broken {HOSTNAME}
Expression: {Template:vfs.file.regmatch[/proc/mdstat,raid].last(0)}>0&{Template:vfs.file.regmatch[/proc/mdstat,_].last(0)}>0

Gav
16-10-2011, 13:09
My script

#!/bin/sh
[ -b "/dev/$1" ] || { echo -1; exit 1; }

/sbin/mdadm -D /dev/$1 | /bin/grep '^[\t ]*State' | /bin/sed 's/^[\t ]*State :[\t ]*//g' | /usr/bin/awk 'BEGIN{a=0};/clean/{a+=1};/degraded/{a+=2};/resyncing/{a+=4};/recovering/{a+=8};/Not Started/{a+=16};END{if (NR==1) print a; else print -1 }'

Usage: myscript.sh mdX

It returns the following values (corresponding to a value map in Zabbix):

0 ⇒ OK
1 ⇒ OK
2 ⇒ Degraded
3 ⇒ Degraded
4 ⇒ Resyncing
5 ⇒ Resyncing
6 ⇒ Degraded
7 ⇒ Degraded
8 ⇒ Recovering
9 ⇒ Recovering
10 ⇒ Degraded
11 ⇒ Degraded
12 ⇒ N/A
13 ⇒ N/A
14 ⇒ N/A
15 ⇒ N/A
16 ⇒ Not Started
99999 ⇒ Not Found

This script uses mdadm, which can be run only by root, so you need to edit sudoers to allow the zabbix user to execute this script:

zabbix ALL=(root) NOPASSWD: /etc/scripts/mdstat.sh
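
Reading the awk program above, the return value is a sum of flags: 1 for clean, 2 for degraded, 4 for resyncing, 8 for recovering and 16 for Not Started, with -1 when the device is missing or the State line cannot be isolated (the value map lists 99999 for Not Found, which does not match the -1 the script prints). A sketch that decodes such a value back into flag names; decode_state and the label spellings are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list the state flags encoded in the script's return value.
decode_state() {
    v="$1"
    [ "$v" -lt 0 ] && { echo "not found"; return; }
    out=""
    [ $((v & 1)) -ne 0 ]  && out="$out clean"
    [ $((v & 2)) -ne 0 ]  && out="$out degraded"
    [ $((v & 4)) -ne 0 ]  && out="$out resyncing"
    [ $((v & 8)) -ne 0 ]  && out="$out recovering"
    [ $((v & 16)) -ne 0 ] && out="$out not-started"
    [ -z "$out" ] && out=" no flags"
    # Drop the leading space left over from concatenation.
    echo "${out# }"
}
```

For example, decode_state 10 lists degraded and recovering, which lines up with the "10 ⇒ Degraded" entry in the value map.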

Shad0w
17-10-2011, 09:29
Or take a look at this site (German): http://lab4.org/wiki/Zabbix_linux_software_raid_ueberwachen

Jason
18-10-2011, 10:52
I have a script somewhere for monitoring Linux servers with MegaRAID that I found and hacked a bit. It works with LSI/Dell PERC cards and can also be installed on Openfiler boxes, although it is quite basic and in need of updating. I will try to post it later this week.

frater
18-10-2011, 21:41
Reading all the different posts I think this is a clean/efficient interpretation....

UserParameter=vfs.softraid.faulty, [ -e /proc/mdstat ] && grep blocks /proc/mdstat | egrep -vc '\[U+\]'
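
A sketch of how that pipeline behaves, using the same two commands but reading mdstat text from stdin so it can be tried on sample data. The function name count_faulty is made up. One thing worth checking on your own hosts: I believe raid0 and linear arrays print a "blocks" line without a [UU] map, in which case this pattern would count them as faulty.

```shell
#!/bin/sh
# Hypothetical helper: same pipeline as the UserParameter, reading from stdin.
# Keeps only the "blocks" status lines, then counts those whose status map
# is NOT made up entirely of U characters.
count_faulty() {
    grep blocks | grep -Evc '\[U+\]'
}

# Example use on a real host:
#   count_faulty < /proc/mdstat
```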