
Too many ECC corrections on one memory module of granet

A hardware error related to an ECC threshold / CPU issue is logged on the granet iDRAC.

All the alerts were raised on 2021-10-01 around 21:00:

2021-10-01 21:05:33 	MEM8000 	Correctable memory error logging disabled for a memory device at location DIMM_B3.
2021-10-01 21:05:33 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:32 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:31 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:31 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:30 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:30 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:29 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:28 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:28 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:27 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:26 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:25 	CPU9000 	An OEM diagnostic event occurred.
2021-10-01 21:05:24 	CPU0012 	Correctable Machine Check Exception detected on CPU 2.
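
To make the burst easier to read, the log lines above can be grouped by message ID. The following is a minimal Python sketch, assuming the log is pasted as plain text with whitespace-separated columns as shown here; `parse_events` and `summarize` are hypothetical helper names, not part of any Dell or iDRAC tooling.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line layout: "<date> <time> <message ID> <message>", separated by spaces/tabs.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+([A-Z]{3}\d{4})\s+(.+?)\s*$"
)


def parse_events(text):
    """Return a list of (timestamp, message_id, message) tuples; skip other lines."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            timestamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((timestamp, match.group(2), match.group(3)))
    return events


def summarize(events):
    """Count events per message ID and return the first and last timestamps."""
    counts = Counter(message_id for _, message_id, _ in events)
    first = min(timestamp for timestamp, _, _ in events)
    last = max(timestamp for timestamp, _, _ in events)
    return counts, first, last
```

Calling `summarize(parse_events(log_text))` on the excerpt gives the number of occurrences of each message ID and the time window of the burst.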

According to the Dell manual:

  • CPU0012 [1]
CPU0012

Message
    Correctable Machine Check Exception detected on CPU arg1 . 
Arguments

        arg1 = number

Detailed Description
    None. 
Recommended Response Action
    Review System Event Log and Operating System Logs. If the issue persists, contact technical support. Refer to the product documentation to choose a convenient contact method. 
Category
    System Health 
Subcategory
    CPU = Processor 
Severity
    Severity 2 (Warning)
Trap/EventID
    2242
LCD Message
    No LCD message display defined.
Initial Default
    IPMI Alert;LC Log
Server Administrator Event ID
    5603
Server Administrator Trap ID
    5603 
  • CPU9000 [1]
CPU9000

Message
    An OEM diagnostic event occurred. 
Detailed Description
    None 
Recommended Response Action
    No response action is required. 
Category
    System Health 
Subcategory
    CPU = Processor 
Severity
    Severity 3 (Informational)
LCD Message
    No LCD message display defined.
Initial Default
    LC Log
Server Administrator Event ID
    Not Applicable
Server Administrator Trap ID
    Not Applicable 
  • MEM8000 [2]
MEM8000

Message
    Correctable memory error logging disabled for a memory device at location arg1 . 
Arguments

        arg1 = location

Detailed Description
    Errors are being corrected but no longer logged. 
Recommended Response Action
    Review system logs for memory exceptions. Re-install memory at location <location> 
Category
    System Health 
Subcategory
    MEM = Memory 
Severity
    Severity 1 (Critical)
Trap/EventID
    2265
LCD Message
    SBE log disabled on <location>. Reseat memory
Initial Default
    LC Log
Server Administrator Event ID
    Not Applicable
Server Administrator Trap ID
    Not Applicable 
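
The three manual entries above rank differently: MEM8000 is Severity 1 (Critical), CPU0012 is Severity 2 (Warning) and CPU9000 is Severity 3 (Informational). A small Python sketch of that ordering, assuming only the severity values quoted above; the `SEVERITY` table and the helper names are illustrative and would need extending for other message IDs.

```python
# Severity levels copied from the manual excerpts above (1 is the most severe).
SEVERITY = {"MEM8000": 1, "CPU0012": 2, "CPU9000": 3}
LABELS = {1: "Critical", 2: "Warning", 3: "Informational"}


def rank_codes(codes):
    """Return the unique message IDs, most severe first; unknown IDs go last."""
    return sorted(set(codes), key=lambda code: (SEVERITY.get(code, 99), code))


def describe(code):
    """Return a short 'ID: label' string, or 'Unknown' for IDs not in the table."""
    return f"{code}: {LABELS.get(SEVERITY.get(code), 'Unknown')}"
```

Ranked this way, the single MEM8000 event on DIMM_B3 comes first, ahead of the more numerous but informational CPU9000 entries.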

The BIOS version of this server is 2.3.2. According to the memory auto-repair documentation [3], a PPR (Post Package Repair) is not scheduled unless further errors are detected. The recommended first action is to reboot:

With BIOS 2.1.x or later, the first recommended step is to reboot/restart (without moving DIMMs to a different slot). This allows the new BIOS enhancements to run, potentially resolving (self-healing) the DIMM errors without the need to schedule any DIMM replacements. 
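
The "BIOS 2.1.x or later" condition quoted above can be checked mechanically. This is a minimal Python sketch, assuming BIOS versions are plain dotted integers such as 2.3.2; `reboot_first_applies` is a hypothetical helper name, and only the major and minor components are compared.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.3.2' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def reboot_first_applies(bios_version, minimum="2.1"):
    """Return True when the BIOS is at least the minimum major.minor version."""
    return parse_version(bios_version)[:2] >= parse_version(minimum)
```

For granet, `reboot_first_applies("2.3.2")` returns `True`, so the reboot-first recommendation applies.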

Upgrading the BIOS and iDRAC firmware is also recommended to improve error detection, but this can be left for later and done if the problem is still present after the reboot.


Migrated from T3699 (view on Phabricator)

Edited by Vincent Sellier