A lot of alerts from Zabbix Agent2 SMART monitoring

  • etosamoe
    Junior Member
    • Nov 2022
    • 5

    #1

    Hi all. I use the Zabbix Agent 2 SMART disk monitoring template (this one: https://www.zabbix.com/integrations/smart) on our servers, but I get some alerts that look like false positives.
    For example:
    [Attached screenshot: image.png]
    Then I go to this server, run smartctl -a /dev/sdb and get:
    SMART overall-health self-assessment test result: PASSED

    The SMART attributes look almost fine:
    Code:
      1 Raw_Read_Error_Rate     0x002f   200   199   051    Pre-fail  Always       -       0
      3 Spin_Up_Time            0x0027   180   163   021    Pre-fail  Always       -       5958
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       18
      5 Reallocated_Sector_Ct   0x0033   180   180   140    Pre-fail  Always       -       644
      7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   035   035   000    Old_age   Always       -       47766
     10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15
    183 Runtime_Bad_Block       0x0032   001   001   000    Old_age   Always       -       279
    192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       3
    193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       53
    194 Temperature_Celsius     0x0022   108   096   000    Old_age   Always       -       42
    196 Reallocated_Event_Count 0x0032   181   181   000    Old_age   Always       -       19
    197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
    200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
    And no errors:
    SMART Error Log Version: 1
    No Errors Logged

    Ok, next one:
    [Attached screenshot: image.png]
    smartctl -a /dev/sdc
    Device Model: TOSHIBA MG03ACA200
    SMART overall-health self-assessment test result: PASSED
    The SMART report is a bit strange, with enormous raw values, but I think that is a quirk of this disk model rather than a problem.
    Code:
      1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
      2 Throughput_Performance  0x0004   100   100   000    Old_age   Offline      -       0
      3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       7488
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       40
      5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000a   100   100   000    Old_age   Always       -       0
      8 Seek_Time_Performance   0x0004   100   100   000    Old_age   Offline      -       0
      9 Power_On_Hours          0x0032   001   001   000    Old_age   Always       -       50096
     10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       40
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       35
    193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       44
    194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       32 (Min/Max 20/50)
    196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
    241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       798731482533
    242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       169762696429
    SMART Error Log Version: 1
    No Errors Logged


    And another example:
    [Attached screenshot: image.png]

    smartctl -a /dev/sda

    Everything looks good, except that the disk does have an error in its log, but it is very old.
    The log says the error happened at about 9300 power-on hours:
    [Attached screenshot: image.png]

    And now the disk is at about 21500 hours:
    9 Power_On_Hours 0x0012 097 097 000 Old_age Always - 21535 -


    I don't know how to suppress these alerts, because they will come back in 6 hours (I think). They all come from the Get Disk Attributes item, which is then preprocessed into smart.disk.es (the exit code).
    I don't think everybody just replaces their disks when they get this kind of error.
    I'd like to ask for advice or your experience: what do you do in this kind of situation?
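
    For reference, the smartctl man page documents the exit code as a bitmask, so a nonzero value can be broken down into the individual conditions behind it. Below is a minimal Python sketch of that decoding. The bit descriptions are paraphrased from the EXIT STATUS section of smartctl(8); the names SMARTCTL_EXIT_BITS and decode_smartctl_exit are hypothetical helpers for illustration and are not part of Zabbix or smartmontools.

```python
from __future__ import annotations

# Bit meanings paraphrased from the "EXIT STATUS" section of smartctl(8).
# The dictionary and function names are illustrative, not an official API.
SMARTCTL_EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed, or device is in a low-power mode",
    2: "a SMART or ATA command failed, or a SMART data checksum error occurred",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes found at or below threshold",
    5: "DISK OK now, but some attributes were at or below threshold in the past",
    6: "the device error log contains records of errors",
    7: "the device self-test log contains records of errors",
}


def decode_smartctl_exit(status: int) -> list[str]:
    """Return a description for every bit that is set in a smartctl exit status."""
    return [text for bit, text in SMARTCTL_EXIT_BITS.items() if status & (1 << bit)]


if __name__ == "__main__":
    # Example: an exit status of 64 has only bit 6 set.
    for reason in decode_smartctl_exit(64):
        print(reason)
```

    If this reading of the man page is right, an old entry in the error log, like the one on /dev/sda, would be expected to keep bit 6 set even while the overall health result is PASSED. Running smartctl by hand and checking `echo $?` right afterwards shows which bits are set on a given disk.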