**etosamoe** · 30-11-2022, 13:52

Hi all. I use the Zabbix Agent 2 SMART disk monitoring template (this one: https://www.zabbix.com/integrations/smart) on our servers, but some of the alerts I get look like near false positives.
For example:

Then I go to this server, run smartctl -a /dev/sdb and get:
SMART overall-health self-assessment test result: PASSED

The SMART attributes look almost fine:

Code:

  1 Raw_Read_Error_Rate     0x002f   200   199   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   180   163   021    Pre-fail  Always       -       5958
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       18
  5 Reallocated_Sector_Ct   0x0033   180   180   140    Pre-fail  Always       -       644
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   035   035   000    Old_age   Always       -       47766
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15
183 Runtime_Bad_Block       0x0032   001   001   000    Old_age   Always       -       279
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       53
194 Temperature_Celsius     0x0022   108   096   000    Old_age   Always       -       42
196 Reallocated_Event_Count 0x0032   181   181   000    Old_age   Always       -       19
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

And no errors:
SMART Error Log Version: 1
No Errors Logged
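
To make the "almost fine" reading above concrete, here is a minimal Python sketch that parses rows of the attribute table and separates two things: attributes whose normalized VALUE is at or below THRESH (which is what smartctl reports as failing) and attributes that merely have a non-zero raw counter. The function names, the list of "interesting" attribute IDs and the rule that a threshold of 000 never fails are my own assumptions for illustration, not something taken from the Zabbix template.

```python
import re

# One row of the smartctl attribute table:
# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

def parse_attributes(text):
    """Return one dict per attribute row found in the text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m.group(1)),
                "name": m.group(2),
                "value": int(m.group(3)),
                "worst": int(m.group(4)),
                "thresh": int(m.group(5)),
                "type": m.group(6),
                "raw": int(m.group(9)),  # leading number of RAW_VALUE only
            })
    return rows

def below_threshold(rows):
    """Attributes whose normalized value is at or below the threshold.

    Assumption: a threshold of 0 means the attribute cannot fail.
    """
    return [r for r in rows if r["thresh"] > 0 and r["value"] <= r["thresh"]]

def nonzero_raw(rows, ids=(5, 183, 196, 197, 198, 199)):
    """Attributes from a hypothetical watch list that have a non-zero raw count."""
    return [r for r in rows if r["id"] in ids and r["raw"] > 0]

# Four rows copied from the /dev/sdb output above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   180   180   140    Pre-fail  Always       -       644
183 Runtime_Bad_Block       0x0032   001   001   000    Old_age   Always       -       279
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
"""

rows = parse_attributes(SAMPLE)
print([r["name"] for r in below_threshold(rows)])
print([(r["name"], r["raw"]) for r in nonzero_raw(rows)])
```

For these rows nothing is at or below its threshold, while Reallocated_Sector_Ct, Runtime_Bad_Block and UDMA_CRC_Error_Count do carry non-zero raw counts, which matches the "PASSED but not perfectly clean" picture.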

Ok, next one:

smartctl -a /dev/sdc
Device Model: TOSHIBA MG03ACA200
SMART overall-health self-assessment test result: PASSED
The SMART report looks a bit strange, with enormous RAW values, but I think that is not a problem, just a feature of this disk.

Code:

  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   100   100   000    Old_age   Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       7488
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       40
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   100   100   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   001   001   000    Old_age   Always       -       50096
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       40
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       35
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       44
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       32 (Min/Max 20/50)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       798731482533
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       169762696429

SMART Error Log Version: 1
No Errors Logged

And another example:

smartctl -a /dev/sda

Everything is fine, except that the disk really does have an error in its log, but it is really, really old.
The error says that it happened at 9300 power-on hours:

And now the disk is at about 21500 hours:
9 Power_On_Hours 0x0012 097 097 000 Old_age Always - 21535 -

I don't know how to suppress these alerts, because they will come back in 6 hours (I think). They all come from the Get Disk Attributes item, which is then preprocessed into smart.disk.es (the exit code).
I don't think everybody just replaces disks with new ones when they get this kind of error.
I'd like to ask for advice or your experience: what do you do in this kind of situation?
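
For reference, the smartctl exit status is a bitmask, and the smartctl man page documents what each bit means. Below is a minimal Python sketch that decodes such a value and optionally clears selected bits, for example bit 6 ("error log contains records"), which would explain the alert on the disk with the old logged error. I am assuming that smart.disk.es holds the raw smartctl exit status; the function names and the choice of which bit to ignore are hypothetical, and how to apply the same mask inside Zabbix (preprocessing or trigger expression) is not shown here.

```python
# Meaning of each bit of the smartctl exit status, per the smartctl man page.
SMARTCTL_EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed, no IDENTIFY data, or device in low-power mode",
    2: "a SMART/ATA command failed or a SMART data checksum error occurred",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes found at or below threshold",
    5: "some attributes were at or below threshold at some time in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}

def decode_exit_status(status):
    """Return the descriptions of all bits set in a smartctl exit status."""
    return [
        desc for bit, desc in SMARTCTL_EXIT_BITS.items()
        if status & (1 << bit)
    ]

def mask_bits(status, ignored_bits=(6,)):
    """Clear the given bits, e.g. to ignore an old entry in the error log.

    Which bits are safe to ignore is a judgment call; bit 6 is only an example.
    """
    for bit in ignored_bits:
        status &= ~(1 << bit)
    return status

if __name__ == "__main__":
    for status in (0, 64, 72):
        print(status, decode_exit_status(status), "masked:", mask_bits(status))
```

With this mask, an exit status of 64 (only the error-log bit) becomes 0, while 72 still leaves bit 3 set, so a disk that actually reports DISK FAILING would not be hidden.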

A lot of alerts from Zabbix Agent2 SMART monitoring