

linux software RAID monitoring



    Before I set out to build the best regex in the world for monitoring software RAID disks on Linux, I wanted to know whether any of you have already built such a command.

    Basically, this means looking inside /proc/mdstat to see whether any disk has failed.


    Should not be hard, but tricky if you want to monitor several arrays.

    Anyway, I'd suggest writing a script that does this and generates predictable output (e.g. 0 = all OK; 1 = at least one array not OK), and then using it in Zabbix, instead of polluting the zabbix_agent configuration file with a multi-line, 10-pipe command with awk scripting between the pipes... :-)
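A minimal sketch of such a script, assuming the usual /proc/mdstat layout in which each array has a status field like [UU] or [U_]. The function name raid_all_ok and the optional file argument are made up for illustration; they are not from any poster's setup.

```shell
#!/bin/sh
# Sketch: print 0 if every md array looks healthy, 1 otherwise.
# raid_all_ok is a hypothetical helper name; the optional argument
# lets you point it at a saved copy of /proc/mdstat for testing.
raid_all_ok() {
    mdstat_file="${1:-/proc/mdstat}"
    # Treat an unreadable file as a problem rather than as "all OK".
    [ -r "$mdstat_file" ] || { echo 1; return; }
    # A status field such as [UU] or [U_] marks each member as up (U) or missing (_).
    if grep -Eq '\[[U_]*_[U_]*\]' "$mdstat_file"; then
        echo 1
    else
        echo 0
    fi
}
```

Saved as a script that ends with a call to raid_all_ok, this could be wired into a single UserParameter line, which keeps the agent configuration free of long pipelines.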



      I personally use things like:
      Code:,/etc/zabbix/bin/ md0,/etc/zabbix/bin/ md1
      where /etc/zabbix/bin/ is a script that just:
      . makes a system call to 'mdadm --detail /dev/$1'
      . uses cut/grep to return the 'State : ' string.

      Hope this'll help.


        LEM, any chance you can post the full script you're using? I'm trying to piece one together and am having issues for some reason. If I run my command manually it works, but if I run it via a bash script I get errors.




          UserParameter script used for md monitoring (sample)

          Here is what I use in zabbix_agentd.conf:
          UserParameter=custom.raidstate.md0,/etc/zabbix/bin/custom.raidstate md0
          UserParameter=custom.raidstate.md1,/etc/zabbix/bin/custom.raidstate md1
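          And here is the code for /etc/zabbix/bin/custom.raidstate :
          #!/usr/bin/perl
          # Shell equivalent: sudo /sbin/mdadm --detail /dev/md0|grep -i "State :"|cut -d ":" -f 2
          use strict;
          use warnings;
          # Array name (e.g. md0) is passed as the first argument
          my $device = $ARGV[0];
          my $return = `/usr/bin/sudo /sbin/mdadm --detail /dev/$device  |grep -i \"State :\"|cut -d \":\" -f 2`;
          chomp ($return);
          # Strip spaces so the state can be compared exactly
          $return =~ s/\ //g;
          # Print 0 when the state is exactly "clean", 1 for anything else
          if ( $return eq 'clean' ) {
            print "0";
          } else {
            print "1";
          }
          # - The End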
          I use Numeric (float) to store this kind of value, with no custom multiplier. For triggering, I use something like:
          To be able to use mdadm --detail as the zabbix user, I use sudo with the following statements in the sudoers file:
          # Cmnd alias specification
          Cmnd_Alias ZABBIXCMD = /sbin/mdadm --detail *
          # ZABBIX special privileges
          zabbix  ALL=NOPASSWD:   ZABBIXCMD
          Hope this'll help you.
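For comparison, here is a shell sketch of the same idea that flags only states mentioning a problem instead of requiring exactly "clean". It assumes that mdadm --detail reports states such as "active" or "clean, degraded" on its "State :" line; that assumption, and the function name raid_state_flag, are not from the script above.

```shell
# raid_state_flag: read `mdadm --detail` output on stdin and print
# 1 if the "State :" line mentions degraded or failed, 0 otherwise.
# Hypothetical helper; the set of state strings is an assumption.
raid_state_flag() {
    # Take the text after the colon on the first "State :" line.
    state=$(awk -F':' '/State :/ { print $2; exit }')
    # No State line at all (e.g. mdadm failed): report a problem.
    [ -n "$state" ] || { echo 1; return; }
    case "$(printf '%s' "$state" | tr '[:upper:]' '[:lower:]')" in
        *degraded*|*failed*) echo 1 ;;
        *) echo 0 ;;
    esac
}
```

It would be fed like this: sudo /sbin/mdadm --detail /dev/md0 | raid_state_flag. The return convention matches the Perl script (0 is good, 1 is bad).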

          Last edited by LEM; 28-06-2006, 10:42.


            That definitely helped. I don't know why I didn't think to use Perl, but I was using bash and for whatever reason it wasn't working. I did get an error trying to set up the sudo command: when trying to run it as zabbix I received "permission denied" on /dev/md0. In the meantime I just have a cron job running as root and printing the status out to a file.

            Thanks for the tips.



              If you want an alternative bash script, here's what I use:
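              #!/bin/bash
              # Usage: <disk device name to check>
              # Ex:     ./ md0
              # The array name (e.g. md0) is the first argument
              disk=$1
              # Count status lines showing UU for that array: 1 = good, 0 = bad
              temp=$(grep -A1 "$disk" /proc/mdstat | grep UU | wc -l)
              echo "$temp"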
              Since mdstat in /proc keeps track of the raid arrays, and prints UU if things are kosher, and either _U or U_ or even __ if things have gone really downhill, then grepping for UU works. Do a word count on the results and you get a 1 or 0 response. My results are the opposite of LEM's since a 1 for me is good, but a 1 for LEM is bad.

              You know, now that I look at that script again, I could just make it one line and throw the script away.
              UserParameter=mdstat[*],grep -A1 $1 /proc/mdstat | grep UU | wc -l
              Huh, that's even easier. Hell, there might be a way to make that a command. Maybe it's time to take a look at my scripts and see what I've learned since I wrote them. Anyhow, just giving more options.



                Originally posted by Nate Bell
                Since mdstat in /proc keeps track of the raid arrays, and prints UU if things are kosher, and either _U or U_ or even __ if things have gone really downhill, then grepping for UU works. Do a word count on the results and you get a 1 or 0 response. My results are the opposite of LEM's since a 1 for me is good, but a 1 for LEM is bad.
                My output with 4 drives is [UUUU]. So even if 1 drive failed it could still pass your script if the output was [UUU_] right?



                  Ah, true, though you could just grep for UUUU and it would work.

                  How about trying this one on for size:
                  UserParameter=mdstat[*],grep -A1 $1 /proc/mdstat | tail -n1 | grep _ | wc -l
                  That one doesn't care how many drives you have, only that one or more of them has gone missing, and it even gives the same results LEM's does.

                  Last edited by Nate Bell; 28-06-2006, 21:59.


                    Just a small revision

                    UserParameter=mdstat[*],grep -A1 $1 /proc/mdstat | tail -n1 | grep _ | wc -l
                    UserParameter=mdstat[*],grep -A1 $1 /proc/mdstat | tail -n1 | grep -c _
                    The -c argument makes grep print the number of matching lines, so wc -l is no longer needed.

                    Actually, you can even remove the tail command, since (at least on my Linux systems) the underscore ('_') only occurs when a device has failed and does not appear on the first status line for the device:

                    UserParameter=mdstat[*],grep -A1 $1 /proc/mdstat | grep -c _
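One caveat with grep -A1 $1 is that asking for md1 also matches md10, and grep -c counts lines rather than missing disks. A hedged sketch of a stricter variant follows; md_failed_count is a made-up name, and it assumes the status field looks like [UU_] on a line after the array header.

```shell
# md_failed_count NAME [FILE]: print how many members of array NAME are missing.
# Prints nothing if NAME is not found or has no [UU]-style status field.
md_failed_count() {
    awk -v name="$1" '
        # Header line, e.g. "md0 : active raid1 ..."; the name must match exactly.
        $1 == name && $2 == ":" { found = 1; next }
        # First status field after the header, e.g. [UU_]
        found && match($0, /\[[U_]+\]/) {
            status = substr($0, RSTART, RLENGTH)
            n = gsub(/_/, "", status)   # gsub returns the number of underscores replaced
            print n
            exit
        }
        # Reached the next unindented line without seeing a status field.
        found && /^[^ \t]/ { exit }
    ' "${2:-/proc/mdstat}"
}
```

For a 4-disk array showing [UUU_] this prints 1, and for [UU__] it prints 2, so the trigger can still be "greater than 0".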


                      I'm maintaining my own raidmon tool, which can easily be integrated with Zabbix. The tool is here

                      In zabbix, I have this config to monitor disks for zabbix-1.1.x:
                      RAID number of failed devices in arrays[raidmon status failed,wait] 60 7 365 ZABBIX agent
                      RAID number of syncing arrays[raidmon status syncing,wait] 60 7 365 ZABBIX agent
                      RAID number of arrays[raidmon status number,wait] 60 7 365 ZABBIX agent

                      RAID has failed devices in arrays on {HOSTNAME} {[raidmon status failed,wait].last(0)}>0 High
                      RAID is syncing arrays on {HOSTNAME} {[raidmon status syncing,wait].last(0)}>0 Average
                      RAID number of arrays has changed on {HOSTNAME} {[raidmon status number,wait].diff(0)}>0 Information


                        RAID monitoring using

                        If you have EnableRemoteCommands set in your agents (WARNING: potential security issues involved), you could just use this item:

                        Description: Failed RAID devices
                        Key:[cat /proc/mdstat | egrep '(U_|_U)' | wc -l]

                        Returns the number of arrays with a failed device (status lines matching U_ or _U).
                        Returns zero if there are no failed RAID devices or no RAID devices at all.


                          Sadly none of the solutions presented here work for me.

                          I have found that the State value sometimes reads dirty while the array is still busy syncing data.

                          Also, the /proc/mdstat solutions do not work well for multiple devices.

                          It seems, however, that mdadm returns a numeric exit status that can very easily be used. According to the man page, the status reflects the array state when --detail is combined with --test:

                          0 The array is functioning normally.
                          1 The array has at least one failed device.
                          2 The array has multiple failed devices and hence is unusable (raid4 or raid5).
                          4 There was an error while trying to get information about the device.

                          Thus I used the following:

                          UserParameter=mdstat[*],sudo /sbin/mdadm --detail --test -b /dev/$1 >/dev/null 2>&1; echo $?
                          You will still need the mentioned addition to the sudo config via visudo

                          Cmnd_Alias ZABBIXCMD = /sbin/mdadm --detail *
                          # ZABBIX special privileges
                          zabbix  ALL=NOPASSWD:   ZABBIXCMD
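If a readable label is handier than the raw number, the status values listed above can be translated with a small helper. This is only a sketch: mdadm_status_text and its labels are made up, and as far as I can tell from the mdadm man page the status is only set this way when --detail is run together with --test.

```shell
# mdadm_status_text CODE: translate the mdadm exit status into a short label.
# Hypothetical helper; the labels are illustrative, the codes are from the list above.
mdadm_status_text() {
    case "$1" in
        0) echo "ok" ;;        # array is functioning normally
        1) echo "degraded" ;;  # at least one failed device
        2) echo "unusable" ;;  # multiple failed devices
        4) echo "error" ;;     # could not get information about the device
        *) echo "unknown" ;;
    esac
}

# Example use (not executed here):
#   sudo /sbin/mdadm --detail --test /dev/md0 >/dev/null 2>&1
#   mdadm_status_text $?
```

For triggers the numeric value is still the easier thing to store; the label is mainly useful in notification text or when checking by hand.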


                            I had a little trouble implementing this monitor because I'm a newbie (I've only been using Zabbix for a week).

                            So I'll write down how I did it; maybe it will help someone.

                            In /etc/zabbix/zabbix_agentd.conf I added this:

                            #RAID CHECK
                            UserParameter=custom.mdstat[*],cat /proc/mdstat | grep -c _

                            Then I added an item to the host with
                            Key: custom.mdstat[*]
                            and no particular settings.

                            Then I added a trigger:
                            Expression: {hostname:custom.mdstat[*].last(0)}>0

                            It works like a charm and counts how many arrays have a failed disk.

                            I tried these settings in an environment with 3 mirrors (/dev/md0 (sda1,sdb1), /dev/md1 (sda2,sdb2), /dev/md1 (sda3,sdb3)); I put each mirror into a failed state in turn, and it works...
                            I don't know whether it works with more than 2 devices (for example RAID 5), but I suppose it does.

                            my 2 cents
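On the RAID 5 question, the pipeline can be tried against a saved sample instead of a live array. The sample below is made up (a 3-disk RAID 5 with one member marked failed), and count_degraded_lines is a hypothetical wrapper around the same grep -c _ call.

```shell
# count_degraded_lines [FILE]: the same grep as the UserParameter above, on a given file.
count_degraded_lines() {
    grep -c _ "${1:-/proc/mdstat}"
}

# Made-up sample of a 3-disk RAID 5 with one member missing (not from a real host).
sample=$(mktemp)
cat > "$sample" <<'EOF'
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdc1[2](F) sdb1[1] sda1[0]
      1953260928 blocks level 5, 64k chunk, algorithm 2 [3/2] [UU_]

unused devices: <none>
EOF

count_degraded_lines "$sample"    # prints 1
rm -f "$sample"
```

So the check should also fire for RAID 5. Note that the result is the number of lines containing an underscore, so an array with two missing members still counts as 1.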



                              Hi, ...

                              I just write the output of the RAID status into a file. After that I use the Zabbix standard function to checksum this file.

                              If anything changes in the RAID status, the checksum changes as well, and I have to check the status anyway whenever that happens, because it means something has changed there.

                              A trigger at high severity could also help you get notified of an error.

                              Btw, I don't have Linux software RAID; it is a hardware RAID in an HP machine.
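The checksum idea can also be tried by hand before wiring it into Zabbix. A rough sketch, assuming two saved snapshots of the status output; status_changed is a made-up helper and uses the standard cksum utility rather than anything Zabbix-specific.

```shell
# status_changed OLD NEW: print 1 if the two status snapshots differ, 0 if they match.
# Hypothetical helper; OLD and NEW are paths to saved status output.
status_changed() {
    # Reading from stdin keeps the file name out of the cksum output.
    old_sum=$(cksum < "$1")
    new_sum=$(cksum < "$2")
    if [ "$old_sum" = "$new_sum" ]; then
        echo 0
    else
        echo 1
    fi
}
```

The trade-off is that any change raises the alarm, including harmless ones, so the status still has to be read by a person afterwards.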