Hi all. I use Zabbix Agent 2 SMART disk monitoring (this one: https://www.zabbix.com/integrations/smart) on our servers, but I get some alerts that look like near false positives.
For example:
Then I log in to this server, run smartctl -a /dev/sdb and get:
SMART overall-health self-assessment test result: PASSED
The SMART attributes look almost fine:
Code:
  1 Raw_Read_Error_Rate     0x002f   200   199   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   180   163   021    Pre-fail  Always       -       5958
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       18
  5 Reallocated_Sector_Ct   0x0033   180   180   140    Pre-fail  Always       -       644
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   035   035   000    Old_age   Always       -       47766
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15
183 Runtime_Bad_Block       0x0032   001   001   000    Old_age   Always       -       279
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       53
194 Temperature_Celsius     0x0022   108   096   000    Old_age   Always       -       42
196 Reallocated_Event_Count 0x0032   181   181   000    Old_age   Always       -       19
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
SMART Error Log Version: 1
No Errors Logged
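To make "almost fine" concrete, here is a minimal Python sketch that parses attribute rows like the ones above and checks whether the normalized VALUE has dropped to or below THRESH. The names SmartAttr, parse_attr_line and at_or_below_threshold are made up for this illustration, and the column layout (ID, name, flag, value, worst, thresh, type, updated, when_failed, raw) is assumed from the pasted output. It is not how the Zabbix template evaluates anything.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int
    worst: int
    thresh: int
    attr_type: str
    raw: str


def parse_attr_line(line: str) -> SmartAttr:
    """Parse one attribute row as pasted from smartctl output."""
    # Assumed columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
    parts = line.split()
    return SmartAttr(
        attr_id=int(parts[0]),
        name=parts[1],
        value=int(parts[3]),
        worst=int(parts[4]),
        thresh=int(parts[5]),
        attr_type=parts[6],
        raw=" ".join(parts[9:]),  # raw value may contain spaces, e.g. "32 (Min/Max 20/50)"
    )


def at_or_below_threshold(attr: SmartAttr) -> bool:
    """True if the normalized value has reached the vendor threshold."""
    # A threshold of 000 is treated here as "cannot fail" (assumption of this sketch).
    return attr.thresh > 0 and attr.value <= attr.thresh


# Rows copied from the /dev/sdb output above.
rows = [
    "5 Reallocated_Sector_Ct 0x0033 180 180 140 Pre-fail Always - 644",
    "183 Runtime_Bad_Block 0x0032 001 001 000 Old_age Always - 279",
    "199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 1",
]
attrs = [parse_attr_line(r) for r in rows]
failing = [a.name for a in attrs if at_or_below_threshold(a)]
print(failing)
```

By this check none of the three rows is at its threshold, even though Reallocated_Sector_Ct has a raw count of 644, which matches the PASSED overall result.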
Ok, next one:
smartctl -a /dev/sdc
Device Model: TOSHIBA MG03ACA200
SMART overall-health self-assessment test result: PASSED
The SMART report is a bit strange, with enormous raw values, but I think that is not a problem; it is just how this disk reports them.
Code:
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   100   100   000    Old_age   Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       7488
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       40
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   100   100   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   001   001   000    Old_age   Always       -       50096
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       40
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       35
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       44
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       32 (Min/Max 20/50)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       798731482533
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       169762696429
No Errors Logged
And another example:
smartctl -a /dev/sda
Everything looks good, except that the disk really does have an error in its log, but it is very old.
The log says the error happened at 9300 lifetime hours:
And now the disk is at about 21500 hours:
9 Power_On_Hours 0x0012 097 097 000 Old_age Always - 21535 -
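To put the age of that logged error in numbers, a small Python sketch using only the two figures quoted above (9300 hours from the error log and the raw Power_On_Hours value 21535). The function name hours_since_error is made up for this example.

```python
ERROR_AT_HOURS = 9300    # power-on lifetime of the logged error, as quoted above
CURRENT_HOURS = 21535    # raw value of attribute 9 Power_On_Hours


def hours_since_error(current: int, error_at: int) -> int:
    """Power-on hours elapsed since the logged error."""
    return current - error_at


age_hours = hours_since_error(CURRENT_HOURS, ERROR_AT_HOURS)
age_days = age_hours / 24
print(f"{age_hours} power-on hours ago (~{age_days:.0f} days of runtime)")
# prints: 12235 power-on hours ago (~510 days of runtime)
```

So the disk has run for well over a year of power-on time since the error was recorded.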
I don't know how to suppress these alerts, because they will come back in 6 hours (I think). They all come from the Get Disk Attributes item, which is then preprocessed into smart.disk.es - exit code.
I don't think everybody just replaces disks with new ones when they get this kind of error.
I'd like to ask for advice or your experience: what do you do in this kind of situation?
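For context on the exit code: the smartctl man page documents its exit status as a bit field, so one number can carry several conditions at once. Below is a minimal Python sketch that decodes it. The descriptions are paraphrased from the man page, and the names SMARTCTL_EXIT_BITS and decode_exit_status are made up for this illustration. How the Zabbix template turns smart.disk.es into triggers is not checked here.

```python
# Bit meanings paraphrased from the smartctl man page (EXIT STATUS section).
SMARTCTL_EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed or device is in a low-power mode",
    2: "a SMART or other ATA command failed, or checksum error",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "attributes were at or below threshold at some time in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_exit_status(status: int) -> list[str]:
    """Return the description of every bit set in a smartctl exit status."""
    return [desc for bit, desc in SMARTCTL_EXIT_BITS.items() if status & (1 << bit)]


# Example: an exit status of 64 has only bit 6 set.
for reason in decode_exit_status(64):
    print(reason)
# prints: device error log contains records of errors
```

If /dev/sda returns 64, that would be consistent with the old entry in its error log: bit 6 stays set for as long as the log contains any record, however old, which would explain why the alert keeps coming back.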