Monitoring LSI / Symbios MegaRAID SAS raid controller (found in several Dell servers)


    Hello fellow zabbixers!

    I thought it would be nice to get notified about RAID disk failures by Zabbix, so I've put together a small bash script that generates monitoring XML templates for LSI and Symbios MegaRAID controllers, using the MegaCli executable downloadable from www.lsi.com.

    You should run the script on the host to be monitored, since it only generates template items and notification triggers for the drives that actually exist. Set the path to the MegaCli executable in the header of the script, then run it like:

    Code:
    bash confgen_zabbix_megacli.sh > megaraid_template.xml
    On success you should see something like:
    Code:
    + detecting adapters
    + found 1 adapter(s)
    + examining adapter 0
    + found disk: 32:0
    + found disk: 32:1
    + found disk: 32:2
    + done
    And of course the file 'megaraid_template.xml' will contain the template generated for your configuration.

    Don't forget to add the following line to your zabbix_agentd.conf and restart your agent:

    Code:
    UserParameter=megaraid[*],sudo $CMD -pdInfo -PhysDrv[$2:$3] -a$1 | grep '$4' | cut -f2 -d':' | cut -b2-
    Where $CMD is the path to your copy of the MegaCli executable. Also note that the command above assumes the user running your agent is permitted to use 'sudo' (e.g., the zabbix user is in the sudoers file).
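
    For reference, a minimal sudoers sketch for that setup (the file name and MegaCli path below are assumptions; adjust them to your install, and edit with visudo):

    ```
    # /etc/sudoers.d/zabbix_megacli -- example only, edit with 'visudo -f'
    # Let the zabbix agent user run MegaCli as root without a password
    zabbix ALL=(root) NOPASSWD: /opt/MegaRAID/MegaCli/MegaCli64
    # The agent has no TTY, so exempt it where 'Defaults requiretty' is set
    Defaults:zabbix !requiretty
    ```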

    Cheers,
    Erno Rigo
    http://rigo.info
    Attached Files

    #2
    It's always interesting to see how someone else solved the same problem I did... Yours is more elegant, but this works for us.

    We have some servers with 32-bit OS installs and some with 64-bit, so I tried to make this universal, able to work on either. A return value of zero is good; anything non-zero means an error of some kind, which could be an actual RAID error or a problem with the command itself. So a simple trigger alerts on non-zero values.

    system.run[(/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL 2>&1 || /opt/MegaRAID/MegaCli/MegaCli -LDInfo -Lall -aALL 2>&1) | grep -i 'State\|Permission' | grep -v Optimal | wc -l]
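
    A matching trigger is then a one-liner: alert whenever the item returns non-zero. Sketched here in the classic Zabbix 2.x expression syntax, with `your-host` as a placeholder:

    ```
    {your-host:system.run[(/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL 2>&1 || /opt/MegaRAID/MegaCli/MegaCli -LDInfo -Lall -aALL 2>&1) | grep -i 'State\|Permission' | grep -v Optimal | wc -l].last(0)}>0
    ```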



      #3
      Thanks, mcree and wdingus, for the solutions!

      wdingus, I have a question about your solution:
      I have never seen a failed drive with MegaCli. Under normal conditions the MegaCli state is:
      # ---
      ...
      State : Optimal
      ...
      # ---
      but what does MegaCli say in a failing state?
      My thought is: if in a failing state it says "Not Optimal", then your command "... | grep -v Optimal" will not detect the error. Have you ever tested your command in a failing environment?

      Thanks again,
      Todor

      Originally posted by wdingus View Post
      It's always interesting to see how someone else solved the same problem I did... Yours is more elegant, but this works for us.

      We have some servers with 32-bit OS installs and some with 64-bit, so I tried to make this universal, able to work on either. A return value of zero is good; anything non-zero means an error of some kind, which could be an actual RAID error or a problem with the command itself. So a simple trigger alerts on non-zero values.

      system.run[(/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL 2>&1 || /opt/MegaRAID/MegaCli/MegaCli -LDInfo -Lall -aALL 2>&1) | grep -i 'State\|Permission' | grep -v Optimal | wc -l]



        #4
        Good question... I'm pretty sure this monitor has successfully alerted us to failures, but I don't have a specific example handy. How about exploring the possibilities this way, though:

        # strings /opt/MegaRAID/MegaCli/MegaCli64 | grep Optimal -B3
        Offline
        Partially Degraded
        Degraded
        Optimal

        It looks like those are the four possible values following "State:", so the "grep -v Optimal" should be fine. We're just interested in being alerted to *anything* other than Optimal; human eyes can investigate further.
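
        A quick way to sanity-check that pipeline without waiting for a real failure is to feed it fabricated "State" lines (the sample text below is an assumption modelled on MegaCli output):

        ```shell
        # Healthy volume: the State line matches but is filtered out by 'grep -v Optimal'
        printf 'State               : Optimal\n' \
          | grep -i 'State\|Permission' | grep -v Optimal | wc -l    # prints 0

        # Degraded volume: the State line survives the filter, giving a non-zero count
        printf 'State               : Degraded\n' \
          | grep -i 'State\|Permission' | grep -v Optimal | wc -l    # prints 1
        ```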



          #5
          Yes, you are right about the "strings" output; those should be the possible values.
          I like the solution, particularly its simplicity. Of course, any change in the RAID status is of interest and should be investigated. I am implementing this in our environment.
          Thanks a lot,
          Todor



            #6
            Hi,

            To monitor the hardware of my Dell servers under Ubuntu, I am using OMSA to populate SNMP with all the hardware info. So there is no need for sudo access with the Zabbix agent, and no need to add a UserParameter to the agent config.

            If you want, I can share my Dell R720 SNMP template tomorrow.

            Cheers.



              #7
              geek74, please share. We would be thankful for one more solution.



                #8
                Hi,

                So when you have OMSA populating SNMP, you can use the attached template.
                To make it work under Ubuntu 12.04 LTS, install OMSA from the Dell repository and apply the fix described at http://administratosphere.wordpress....-ubuntu-a-fix/

                It needs a lot of value mapping to be human readable.

                DellArrayDiskState
                1 ⇒ ready
                2 ⇒ failed
                3 ⇒ online
                4 ⇒ offline
                6 ⇒ degraded
                7 ⇒ recovering
                11 ⇒ removed
                13 ⇒ non-raid
                15 ⇒ resynching
                24 ⇒ rebuild
                25 ⇒ noMedia
                26 ⇒ formatting
                28 ⇒ diagnostics
                34 ⇒ predictiveFailure
                35 ⇒ initializing
                39 ⇒ foreign
                40 ⇒ clear
                41 ⇒ unsupported
                53 ⇒ incompatible


                DellBatteryState
                1 ⇒ ready
                2 ⇒ failed
                6 ⇒ degraded
                7 ⇒ reconditioning
                9 ⇒ high
                10 ⇒ low
                12 ⇒ charging
                21 ⇒ missing
                36 ⇒ learning

                DellLogDriveState
                1 ⇒ ready
                2 ⇒ failed
                3 ⇒ online
                4 ⇒ offline
                6 ⇒ degraded
                7 ⇒ verifying
                15 ⇒ resynching
                16 ⇒ regenerating
                18 ⇒ failedRedundancy
                24 ⇒ rebuilding
                26 ⇒ formatting
                32 ⇒ reconstructing
                35 ⇒ initializing
                36 ⇒ backgroundInit
                52 ⇒ permanentlyDegraded

                DellLogDriveType
                1 ⇒ concatenated
                2 ⇒ raid-0
                3 ⇒ raid-1
                4 ⇒ raid-2
                5 ⇒ raid-3
                6 ⇒ raid-4
                7 ⇒ raid-5
                8 ⇒ raid-6
                9 ⇒ raid-7
                10 ⇒ raid-10
                11 ⇒ raid-30
                12 ⇒ raid-50
                13 ⇒ addSpares
                14 ⇒ deleteLogical
                15 ⇒ transformLogical
                18 ⇒ raid-0-plus-1
                19 ⇒ concatRaid-1
                20 ⇒ concatRaid-5
                21 ⇒ noRaid
                22 ⇒ volume
                23 ⇒ raidMorph
                24 ⇒ raid-60
                25 ⇒ cacheCade

                Dell Open Manage System Status
                1 ⇒ Other
                2 ⇒ Unknown
                3 ⇒ OK
                4 ⇒ NonCritical
                5 ⇒ Critical
                6 ⇒ NonRecoverable

                DellsDiskControllerState
                1 ⇒ ready
                2 ⇒ failed
                3 ⇒ online
                4 ⇒ offline
                6 ⇒ degraded

                DellStatus
                1 ⇒ other
                2 ⇒ unknown
                3 ⇒ ok
                4 ⇒ nonCritical
                5 ⇒ critical
                6 ⇒ nonRecoverable

                DellStatusProbe
                1 ⇒ other
                2 ⇒ unknown
                3 ⇒ ok
                4 ⇒ nonCriticalUpper
                5 ⇒ criticalUpper
                6 ⇒ nonRecoverableUpper
                7 ⇒ nonCriticalLower
                8 ⇒ criticalLower
                9 ⇒ nonRecoverableLower
                10 ⇒ failed

                DellStatusRedundancy
                1 ⇒ other
                2 ⇒ unknown
                3 ⇒ full
                4 ⇒ degraded
                5 ⇒ lost
                6 ⇒ notRedundant
                7 ⇒ redundancyOffline

                DellStorageGlobalStatus
                1 ⇒ critical
                2 ⇒ warning
                3 ⇒ normal
                4 ⇒ unknown


                Please comment and update if you find anything wrong.

                Cheers
                Attached Files



                  #9
                  template with discovery

                  First, thanks for this nice contribution to the community. I've used your template on our Fujitsu servers with LSI MegaRAID.

                  A suggestion for you: there is already a script that generates an XML template, so how about converting it to use auto-discovery? It should be really easy, since you have most of the pieces in place already.

                  BTW, I am building an MD RAID (software) template with auto-discovery and borrowed a couple of ideas from you.

                  Thanks again

                  OB



                    #10
                    Not related to Zabbix, but perhaps this can be adapted. As is, these instructions show how to get MegaCli to send an email if the RAID array is degraded.

                    Configure LSI MegaRAID email alerts
                    Code:
                    cd /etc/cron.hourly
                    Once you are in the folder, use your favorite editor to create a new file called MegaRAIDcron. For the purpose of this guide, we are going to use nano.
                    Code:
                    nano MegaRAIDcron
                    In this file, we are going to place the following. Be sure to replace <REPLACE WITH EMAIL> with the email address that the alerts will be sent to.
                    Code:
                    #!/bin/bash
                    cd /opt/MegaRAID/MegaCli
                    ./MegaCli64 -AdpAllInfo -aALL | grep "Degraded" > degraded.txt
                    ./MegaCli64 -AdpAllInfo -aALL | grep "Failed" >> degraded.txt
                    cat degraded.txt | grep "1" > /dev/null
                    if [[ $? -eq 0 ]]
                    then
                        cat degraded.txt | mailx -s "Degraded RAID on $HOSTNAME" <REPLACE WITH EMAIL>
                    fi
                    Save the changes to the file. Once that is done, we need to assign execute permissions:
                    Code:
                    chmod +x MegaRAIDcron
                    To test the cron job, we need to make one small change to the file. Change the following:
                    From
                    Code:
                    cat degraded.txt | grep "1" > /dev/null
                    To
                    Code:
                    cat degraded.txt | grep "0" > /dev/null
                    Save the changes and run the cron job manually:
                    Code:
                    /etc/cron.hourly/MegaRAIDcron
                    If you have set everything up correctly, you should receive an email showing the following:

                    Degraded : 0
                    Security Key Failed : No
                    Failed Disks : 0
                    Deny Force Failed : No

                    To change the cron job from testing back to production use, change the 0 back to a 1 and you are set.
                    Again, the cron job will only send you an email if the array is degraded or a disk has failed. No news is good news.
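
                    If anyone wants to adapt this to Zabbix without the cron/email step, here is a hedged sketch of a UserParameter built on the same grep logic (the key name is made up, the path is the same assumption as above, and the agent user needs sudo rights to MegaCli as discussed earlier in the thread):

                    ```
                    UserParameter=megaraid.errors,sudo /opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL | grep -E 'Degraded|Failed Disks' | grep -c '[1-9]'
                    ```

                    The item then returns 0 while all counters are zero, so a trigger can simply alert on any non-zero value.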
                    Last edited by vic; 13-07-2013, 00:47.



                      #11
                      One of Zabbix's virtues, as with many other IT tools, is centralized management and maintenance.

                      First, Zabbix notifications allow escalation and other actions to be taken for a specific trigger event. You can easily change who is alerted and when.

                      Second, relying on email servers for communication is not always prudent. While the Zabbix dashboard shows all ongoing issues, you can have other notification mechanisms: SMS, a sound alarm, flashing lights in the building, etc.

                      Third, I don't think managing LSI RAID alerts locally on each server scales well...

                      OB



                        #12
                        Originally posted by linuxsquad View Post
                        One of Zabbix's virtues, as with many other IT tools, is centralized management and maintenance.

                        First, Zabbix notifications allow escalation and other actions to be taken for a specific trigger event. You can easily change who is alerted and when.

                        Second, relying on email servers for communication is not always prudent. While the Zabbix dashboard shows all ongoing issues, you can have other notification mechanisms: SMS, a sound alarm, flashing lights in the building, etc.

                        Third, I don't think managing LSI RAID alerts locally on each server scales well...

                        OB
                        Depends on your situation. For me, a RAID failure requires no escalation; it goes straight to the top at the highest priority. It's not like it's a common occurrence, and if it is, you have bigger problems.

                        I will probably try to adapt my email method to Zabbix.
                        Last edited by vic; 13-07-2013, 00:49.



                          #13
                          1) Zabbix really shines when the IT department comprises more than a couple of hands and/or there are other entities with a vested interest in IT resources. They might want to see historical data to address bottlenecks.

                          2) Zabbix lets you prioritize and classify events. From this perspective, I like having the ability to select which event triggers which action:

                          - a yellow alert on the Zabbix dashboard
                          - an email to a single IT person
                          - a blast email to the whole IT department
                          - an SMS to the IT department head and whoever is on call

                          3) For instance, hysteresis. For LSI RAID it is not an issue. However, if storage capacity or network bandwidth values fluctuate around the trigger line (for instance, 80%), alerts will spam your inbox until someone screams "... get the f### Zabbix off my email!!!". A single change in a trigger configuration avoids such annoyance.
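
                          That hysteresis tweak can be sketched as the classic two-threshold trigger expression of Zabbix 2.x (host, item, and thresholds below are placeholders): the trigger fires when disk usage passes 80%, but only recovers once it drops back below 75%, so values bouncing around 80% don't flap:

                          ```
                          ({TRIGGER.VALUE}=0 & {your-host:vfs.fs.size[/,pused].last(0)}>80) |
                          ({TRIGGER.VALUE}=1 & {your-host:vfs.fs.size[/,pused].last(0)}>75)
                          ```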

                          So there are plenty of reasons to spend the time and bring Zabbix up on your network, even if local alerts can do all you need ... for now.

                          OB



                            #14
                            Where possible we like to use SNMP monitoring. It's not complicated to use SNMP to discover all drives, both physical and logical, in a server and then report on them. This also gets around the problem of someone adding a drive and then forgetting to update the template.

                            This won't work for ESXi systems, but for those you can just use Python WBEM to read the health status (assuming you have loaded the LSI providers) and then parse that straight into Zabbix.

                            For remote sites where there isn't direct SNMP access, we just shove in a proxy, either as a small virtual machine or on a Raspberry Pi.



                              #15
                              We use LSI Logic MegaRAID controllers in AberNAS units. I made a low-level discovery template which automatically monitors what's there; you should be able to use it with Dell as well. The template uses low-level discovery to detect and monitor virtual devices (volumes), physical devices (drives), adapters, enclosures and batteries.

                              Monitoring MegaRAID via SNMP requires the sas_snmp RPM, available for your controller from the LSI website (buried in "megaRAID_SNMP_Installers"). I also suggest updating net-snmp to version 5.5-44 or higher, so you get the correct size for large volumes.

                              Substitute your own community string for {{ your_snmp_community }}, and your Zabbix Server IP address for {{ zabbix_server_ip_address }} in the example below. Install/update net-snmp before installing sas_snmp.

                              Code:
                              # copy the following files:
                              #     net-snmp-5.5-44.el6.x86_64.rpm
                              #     net-snmp-libs-5.5-44.el6.x86_64.rpm
                              #     net-snmp-utils-5.5-44.el6.x86_64.rpm
                              #     sas_snmp-13.04-0301.x86_64.rpm
                              #
                              # as root:
                              
                              rpm -Uvh net-snmp-5.5-44.el6.x86_64.rpm net-snmp-libs-5.5-44.el6.x86_64.rpm net-snmp-utils-5.5-44.el6.x86_64.rpm
                              rpm -ivh sas_snmp-13.04-0301.x86_64.rpm
                              vi /etc/snmp/snmpd.conf
                              # at the top of /etc/snmp/snmpd.conf, add the following lines.  Use the IP address of the zabbix server in place of "{{ zabbix_server_ip_address }}"
                              
                              rocommunity public 127.0.0.1
                              rocommunity {{ your_snmp_community }} 0.0.0.0
                              trapcommunity public
                              trap2sink {{ zabbix_server_ip_address }}
                              # report fake allocation unit size for large volumes so size calc is right
                              realStorageUnits 0
                              
                              # save the edited /etc/snmp/snmpd.conf, then restart snmpd:
                              /sbin/service snmpd restart
                              Attached Files
                              Last edited by kevind; 20-09-2013, 09:39.

