Hello,
A few days ago I had a serious server breakdown caused by a RAM failure.
So I looked for a place to read ECC errors in order to monitor them.
It seems the Zabbix agent has no key for that?
I know I can do this with a UserParameter, but this is important
enough to be implemented directly in the agent.
The corrected and uncorrected errors should be visible in
/sys/devices/system/edac/mc/mc0/ce_count
and
/sys/devices/system/edac/mc/mc0/ue_count
In my opinion a value > 0 in ue_count should be considered serious enough
to shut down the server immediately and have it repaired.
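
To illustrate what such a check could look like in the meantime, here is a minimal sketch of a script that reads the two counters. It is not an agent feature: the function name read_edac_counts is made up for this example, and summing over every mc* directory (rather than only mc0) is my assumption for machines with more than one memory controller.

```python
from pathlib import Path

# Location of the EDAC memory controller counters, as given in the post.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Sum ce_count and ue_count over all mc* directories under root.

    Hypothetical helper for illustration only. Iterating over every
    mc* directory is an assumption; the post itself only mentions mc0.
    """
    totals = {"ce_count": 0, "ue_count": 0}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        for name in totals:
            try:
                totals[name] += int((mc_dir / name).read_text().strip())
            except (OSError, ValueError):
                # Counter file missing, unreadable, or not a number: skip it.
                continue
    return totals


if __name__ == "__main__":
    counts = read_edac_counts()
    # Print corrected and uncorrected totals, separated by a space.
    print(counts["ce_count"], counts["ue_count"])
```

A script like this could be called from a UserParameter, with a trigger on the uncorrected total being greater than 0. If the EDAC driver is not loaded, no mc* directories exist and both totals stay at 0, so a real check should probably treat that case separately.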
Kind regards
Thomas