Too many ECC corrections on one memory module of granet
A hardware error related to an ECC threshold / CPU issue is logged on the granet iDRAC.
All the alerts were raised on 2021-10-01 around 21:00:
2021-10-01 21:05:33 MEM8000 Correctable memory error logging disabled for a memory device at location DIMM_B3.
2021-10-01 21:05:33 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:32 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:31 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:31 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:30 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:30 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:29 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:28 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:28 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:27 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:26 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:25 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:24 CPU0012 Correctable Machine Check Exception detected on CPU 2.
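To get a quick overview of such a burst of events, the log lines above can be grouped by message ID. The following is only a minimal sketch: the line layout (`date time ID text`) is inferred from the excerpt above, the excerpt is abridged, and the `summarize` helper is a hypothetical name, not part of any Dell tool.

```python
import re
from collections import Counter

# Log excerpt copied from the lines above (abridged to four entries).
LOG = """\
2021-10-01 21:05:33 MEM8000 Correctable memory error logging disabled for a memory device at location DIMM_B3.
2021-10-01 21:05:33 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:32 CPU9000 An OEM diagnostic event occurred.
2021-10-01 21:05:24 CPU0012 Correctable Machine Check Exception detected on CPU 2.
"""

# Assumed line layout: "<date> <time> <message ID> <free text>".
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) ([A-Z]+\d+) (.*)$")


def summarize(log_text):
    """Return (count per message ID, set of DIMM locations mentioned)."""
    counts = Counter()
    dimms = set()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a log entry
        message_id, text = match.group(3), match.group(4)
        counts[message_id] += 1
        # Collect DIMM slot names such as DIMM_B3 from the message text.
        dimms.update(re.findall(r"DIMM_[A-Z]\d+", text))
    return counts, dimms


counts, dimms = summarize(LOG)
print(dict(counts), sorted(dimms))
```

Run against the full excerpt, this makes it easy to see that a single DIMM slot is involved and that most entries are informational CPU9000 events.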
According to the Dell manual:
- CPU0012 [1]
  Message: Correctable Machine Check Exception detected on CPU arg1.
  Arguments: arg1 = number
  Detailed Description: None.
  Recommended Response Action: Review System Event Log and Operating System Logs. If the issue persists, contact technical support. Refer to the product documentation to choose a convenient contact method.
  Category: System Health
  Subcategory: CPU = Processor
  Severity: Severity 2 (Warning)
  Trap/EventID: 2242
  LCD Message: No LCD message display defined.
  Initial Default: IPMI Alert;LC Log
  Server Administrator Event ID: 5603
  Server Administrator Trap ID: 5603
- CPU9000 [1]
  Message: An OEM diagnostic event occurred.
  Detailed Description: None
  Recommended Response Action: No response action is required.
  Category: System Health
  Subcategory: CPU = Processor
  Severity: Severity 3 (Informational)
  LCD Message: No LCD message display defined.
  Initial Default: LC Log
  Server Administrator Event ID: Not Applicable
  Server Administrator Trap ID: Not Applicable
- MEM8000 [2]
  Message: Correctable memory error logging disabled for a memory device at location arg1.
  Arguments: arg1 = location
  Detailed Description: Errors are being corrected but no longer logged.
  Recommended Response Action: Review system logs for memory exceptions. Re-install memory at location <location>
  Category: System Health
  Subcategory: MEM = Memory
  Severity: Severity 1 (Critical)
  Trap/EventID: 2265
  LCD Message: SBE log disabled on <location>. Reseat memory
  Initial Default: LC Log
  Server Administrator Event ID: Not Applicable
  Server Administrator Trap ID: Not Applicable
The BIOS version of this server is 2.3.2.
According to the memory self-healing documentation [3], a PPR (Post Package Repair) is not scheduled unless other errors are detected.
The recommended first action is a reboot:
With BIOS 2.1.x or later, the first recommended step is to reboot/restart (without moving DIMMs to a different slot). This allows the new BIOS enhancements to run, potentially resolving (self-healing) the DIMM errors without the need to schedule any DIMM replacements.
It is also recommended to upgrade the BIOS and iDRAC firmware to improve error detection, but this can be left for later, in case the problem is still present after the reboot.
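The reboot-first reasoning depends on the BIOS being at 2.1.x or later, which is the case here (2.3.2). As a minimal sketch of that check, assuming plain dotted numeric version strings (`parse_version` and `reboot_first` are hypothetical helper names, not Dell tooling):

```python
def parse_version(version):
    """Turn a dotted version string such as "2.3.2" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def reboot_first(bios_version, minimum="2.1"):
    """True when the BIOS is at or above the 2.1.x threshold quoted from [3]."""
    # Compare only major.minor, since the threshold is given as "2.1.x".
    return parse_version(bios_version)[:2] >= parse_version(minimum)


print(reboot_first("2.3.2"))
```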
- [1] https://www.dell.com/support/manuals/fr-fr/dell-opnmang-sw-v8.0.1/eemi_13g-v1/cpu-event-messages?guid=guid-789ec7d2-2a52-4063-a753-c5dc51e91359&lang=en-us
- [2] https://www.dell.com/support/manuals/fr-fr/dell-opnmang-sw-v8.0.1/eemi_13g-v1/mem-event-messages?guid=guid-ff360c01-4e4c-4f20-871d-1d24ced52985&lang=en-us
- [3] https://www.dell.com/support/kbdoc/fr-fr/000053203/what-is-ddr4-self-healing-on-dell-poweredge-servers-with-intel-xeon-scalable-processors?lang=en
Migrated from T3699 (view on Phabricator)