2021-09-02
This HowTo shows how to check the health of hard disks connected to an LSI Logic/Symbios Logic MegaRAID SAS 2108 RAID controller under Linux. Much of it is also useful for other hardware RAID controllers.
First, check that the controller is present in the system:
~] lspci | grep RAID
01:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS 2108 [Liberator] (rev 03)
Bingo! We can work with this one.
LSI provides megacli, a proprietary command-line management utility. A Debian repository with packages of the proprietary and open-source tools for your hardware RAID card can be found at hwraid.le-vert.net.
My system currently runs Debian bullseye. Add the repository to the /etc/apt/sources.list file in this format:
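If you want to script this check, the following is a minimal sketch. The function name find_raid_controllers is my own, and it assumes lspci labels the device class as "RAID bus controller", as in the output above. It reads lspci output on stdin and prints the PCI address and model of each RAID controller.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): reads `lspci` output on
# stdin and prints "PCI-address | model" for each RAID bus controller line.
find_raid_controllers() {
    awk -F': ' '/RAID bus controller/ {
        split($1, a, " ")      # a[1] is the PCI address, e.g. 01:00.0
        print a[1] " | " $2    # $2 is the text after "RAID bus controller: "
    }'
}
```

Usage: lspci | find_raid_controllers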
deb http://hwraid.le-vert.net/distrib branch main
For my server it is:
deb http://hwraid.le-vert.net/debian bullseye main
Edit your /etc/apt/sources.list and add the repository as the last line:
/etc/apt/sources.list
deb http://hwraid.le-vert.net/debian bullseye main
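The "distrib branch" format above can be filled in by a small helper. This is only a sketch: hwraid_sources_line is a name I made up, and it simply substitutes its two arguments into the line shown above.

```shell
#!/bin/sh
# Hypothetical helper: prints the sources.list line for the hwraid repository.
#   $1 = distribution (e.g. debian)
#   $2 = branch / release codename (e.g. bullseye)
hwraid_sources_line() {
    printf 'deb http://hwraid.le-vert.net/%s %s main\n' "$1" "$2"
}
```

Usage: hwraid_sources_line debian bullseye
On a Debian system the codename can usually be read from /etc/os-release (VERSION_CODENAME). Whether the repository actually publishes a branch for your release is something to check first.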
The packages are now signed, so import the repository key after adding the repository to sources.list (apt-key still works on bullseye, but it is deprecated):
wget -O - https://hwraid.le-vert.net/debian/hwraid.le-vert.net.gpg.key | sudo apt-key add -
Run apt-get update, then install the MegaCli utility and the megaclisas-status wrapper script.
~] apt-get update
~] apt-get install megacli
~] apt-get install megaclisas-status
megacli is a proprietary tool by LSI that can perform both reporting and management for MegaRAID SAS cards. However, it is really hard to use because it takes tons of command-line parameters and there is no documentation.
Get the status and configuration of all adapters:
~] megacli -AdpAllInfo -aAll
Adapter #0
==============================================================================
Versions
================
Product Name : ServeRAID M5014 SAS/SATA Controller
Serial No : SV01506370
FW Package Build: 12.15.0-0199
Mfg. Data
================
Mfg. Date : 04/10/10
Rework Date : 00/00/00
Revision No :
Battery FRU : N/A
Image Versions in Flash:
================
FW Version : 2.130.403-3588
BIOS Version : 3.30.02.2_4.16.08.00_0x06060A05
Preboot CLI Version: 04.04-020:#%00009
WebBIOS Version : 6.0-53-e_49-Rel
NVDATA Version : 2.09.03-0051
Boot Block Version : 2.02.00.00-0000
BOOT Version : 09.250.01.219
[...]
Show the status and type of logical drive 0 on adapter 0:
~] megacli -LDInfo -L0 -a0
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 135.972 GB
Sector Size : 512
Mirror Data : 135.972 GB
State : Optimal
Strip Size : 128 KB
Number Of Drives : 2
Span Depth : 1
Default Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy : Disabled
Encryption Type : None
Is VD Cached: No
Exit Code: 0x00
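The line that matters most in this output is "State". As a rough sketch (the function names ld_states and ld_all_optimal are my own, and I assume the output keeps the "Virtual Drive:" and "State :" lines shown above), the state of each virtual drive can be pulled out like this:

```shell
#!/bin/sh
# Hypothetical helpers: read `megacli -LDInfo ...` output on stdin.

# Print one "VD <n>: <state>" line per virtual drive.
ld_states() {
    awk -F': *' '
        /^Virtual Drive:/ { split($2, a, " "); vd = a[1] }   # drive number
        /^State/          { print "VD " vd ": " $2 }          # e.g. Optimal
    '
}

# Return 0 if every reported state is "Optimal", 1 otherwise.
# Note: empty input also returns 0, so check the megacli exit code too.
ld_all_optimal() {
    if ld_states | grep -qv ': Optimal$'; then
        return 1
    fi
    return 0
}
```

Usage: megacli -LDInfo -Lall -aAll | ld_states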
Show the physical disks on the first controller:
~] megacli -PDList -a0
Adapter #0
Enclosure Device ID: 252
Slot Number: 0
Enclosure position: N/A
Device Id: 13
WWN: 5000C50023595FFC
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 1
Last Predictive Failure Event Seq Number: 54338
PD Type: SAS
Hotspare Information:
Type: Global, is revertible
Raw Size: 136.731 GB [0x11176d60 Sectors]
Non Coerced Size: 136.231 GB [0x11076d60 Sectors]
Coerced Size: 135.972 GB [0x10ff2000 Sectors]
Sector Size: 0
Firmware state: Hotspare, Spun Up
Device Firmware Level: B62C
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c50023595ffd
SAS Address(1): 0x0
Connected Port Number: 2(path0)
Inquiry Data: IBM-ESXSST9146852SS B62C3TB19TDM0324B62C
IBM FRU/CRU: 42D0668
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive: Not Certified
Drive Temperature :32C (89.60 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : Yes
Enclosure Device ID: 252
Slot Number: 1
Drive's position: DiskGroup: 0, Span: 0, Arm: 0
Enclosure position: N/A
Device Id: 10
WWN: 5000C500235A1D08
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 136.731 GB [0x11176d60 Sectors]
Non Coerced Size: 136.231 GB [0x11076d60 Sectors]
Coerced Size: 135.972 GB [0x10ff2000 Sectors]
Sector Size: 0
Firmware state: Online, Spun Up
Commissioned Spare : No
Emergency Spare : No
Device Firmware Level: B62C
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c500235a1d09
SAS Address(1): 0x0
Connected Port Number: 1(path0)
Inquiry Data: IBM-ESXSST9146852SS B62C3TB1H60G0324B62C
IBM FRU/CRU: 42D0668
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive: Not Certified
Drive Temperature :35C (95.00 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No
Enclosure Device ID: 252
Slot Number: 2
Drive's position: DiskGroup: 0, Span: 0, Arm: 1
Enclosure position: N/A
Device Id: 9
WWN: 5000C500235A08D8
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 136.731 GB [0x11176d60 Sectors]
Non Coerced Size: 136.231 GB [0x11076d60 Sectors]
Coerced Size: 135.972 GB [0x10ff2000 Sectors]
Sector Size: 0
Firmware state: Online, Spun Up
Commissioned Spare : No
Emergency Spare : No
Device Firmware Level: B62C
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c500235a08d9
SAS Address(1): 0x0
Connected Port Number: 0(path0)
Inquiry Data: IBM-ESXSST9146852SS B62C3TB1H5J50324B62C
IBM FRU/CRU: 42D0668
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive: Not Certified
Drive Temperature :31C (87.80 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No
Exit Code: 0x00
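This listing is long, and the interesting fields are easy to miss. Note that the hot spare in slot 0 above reports a predictive failure count of 1 and a S.M.A.R.T alert. The sketch below condenses the output to one line per disk. The function name pd_summary and the "CHECK"/"ok" labels are my own, and it assumes the field names appear exactly as in the listing above.

```shell
#!/bin/sh
# Hypothetical helper: reads `megacli -PDList ...` output on stdin and
# prints one summary line per physical disk. A disk is marked CHECK when
# it has media errors, predictive failures, or a S.M.A.R.T alert.
pd_summary() {
    awk -F': *' '
        /^Slot Number/              { slot = $2 }
        /^Media Error Count/        { media = $2 }
        /^Predictive Failure Count/ { pred = $2 }
        /^Firmware state/           { state = $2 }
        /^Drive has flagged a S.M.A.R.T alert/ {
            # This is the last field of each disk block, so print here.
            flag = (media + 0 > 0 || pred + 0 > 0 || $2 == "Yes") ? "CHECK" : "ok"
            printf "slot %s | %s | media=%s pred=%s smart=%s | %s\n", slot, state, media, pred, $2, flag
        }
    '
}
```

Usage: megacli -PDList -a0 | pd_summary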
megaclisas-status is a wrapper script around megacli that reports a summarized RAID status and has a periodic check feature. It is available in the same package repository.
The package comes with a Python wrapper around megacli and an init script that periodically runs this wrapper to check the status. It keeps a file with the latest status, so it can detect RAID status changes and failures. It logs a line to syslog when something fails and sends you a mail. Until the arrays are healthy again, a reminder is sent every 2 hours.
Wrapper output example:
~] megaclisas-status
-- Controller information --
-- ID | H/W Model | RAM | Temp | BBU | Firmware
c0 | ServeRAID M5014 SAS/SATA Controller | 256MB | N/A | REPL | FW: 12.15.0-0199
-- Array information --
-- ID | Type | Size | Strpsz | Flags | DskCache | Status | OS Path | CacheCade |InProgress
c0u0 | RAID-1 | 136G | 128 KB | RA,WT | Disabled | Optimal | /dev/sda | None |None
-- Disk information --
-- ID | Type | Drive Model | Size | Status | Speed | Temp | Slot ID | LSI ID
c0u0p0 | HDD | IBM-ESXSST9146852SS B62C3TB1H60G0324B62C | 135.9 Gb | Online, Spun Up | 6.0Gb/s | 35C | [252:1] | 10
c0u0p1 | HDD | IBM-ESXSST9146852SS B62C3TB1H5J50324B62C | 135.9 Gb | Online, Spun Up | 6.0Gb/s | 32C | [252:2] | 9
-- Unconfigured Disk information --
-- ID | Type | Drive Model | Size | Status | Speed | Temp | Slot ID | LSI ID | Path
c0uXpY | HDD | IBM-ESXSST9146852SS B62C3TB19TDM0324B62C | 135.9 Gb | Hotspare, Spun Up | 6.0Gb/s | 33C | [252:0] | 13 | N/A
The script can be called with the --nagios parameter. This forces single-line output and returns exit code 0 if everything is fine, or 2 if at least one thing is wrong. These are the standard return codes that Nagios expects (0 = OK, 2 = CRITICAL).
~] megaclisas-status --nagios
RAID OK - Arrays: OK:1 Bad:0 - Disks: OK:3 Bad:0
~] echo $?
0
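For reference, the exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A tiny sketch that turns an exit code into its state name, where nagios_state is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: maps a Nagios plugin exit code to its state name.
nagios_state() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;   # 3 and anything unexpected
    esac
}
```

Usage: megaclisas-status --nagios; nagios_state $?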
# find full path to megaclisas-status script
~] which megaclisas-status
/usr/sbin/megaclisas-status
# go to nagios plugins directory
~] cd /usr/lib/nagios/plugins/
# create symlink with name check_megaclisas_status
~] ln -s /usr/sbin/megaclisas-status check_megaclisas_status
megaclisas-status needs root privileges to run. So go to the /etc/sudoers.d/ directory and create a file named monitoring with this content:
/etc/sudoers.d/monitoring
Cmnd_Alias CMD_MONITORING = /usr/lib/nagios/plugins/check_megaclisas_status, /usr/sbin/megaclisas-status
nagios ALL=(ALL) NOPASSWD: CMD_MONITORING
Check that it works:
~] su - nagios
~] sudo /usr/sbin/megaclisas-status --nagios
RAID OK - Arrays: OK:1 Bad:0 - Disks: OK:3 Bad:0
~] sudo /usr/lib/nagios/plugins/check_megaclisas_status --nagios
RAID OK - Arrays: OK:1 Bad:0 - Disks: OK:3 Bad:0
Create a megaclisas_status.conf file in your icinga2 config directory with this content:
megaclisas_status.conf
object CheckCommand "megaclisas_status" {
  command = [ "sudo", "/usr/lib/nagios/plugins/check_megaclisas_status" ]
  arguments = {
    "--nagios" = {
      required = true
    }
  }
}
Go to the icinga2 config directory, create a file with the service definition, and add the service to the server:
megaraid.conf
object Service "megaraid" {
  host_name = "monitoring.secar.cz"
  check_command = "megaclisas_status"
  check_interval = 1m
  retry_interval = 30s
  max_check_attempts = 5
}
Check the integrity of the icinga2 configuration files and restart icinga2 to load the new config:
~] icinga2 daemon -C
~] /etc/init.d/icinga2 restart