Troubleshoot Memory Errors on UCS Servers

Available Languages

Download Options

PDF (808.5 KB)
View with Adobe Reader on a variety of devices
ePub (777.6 KB)
View in various apps on iPhone, iPad, Android, Sony Reader, or Windows Phone
Mobi (Kindle) (741.0 KB)
View on Kindle device or Kindle app on multiple devices

Updated:October 25, 2024

Document ID:222546

Bias-Free Language

The documentation set for this product strives to use bias-free language. For the purposes of this documentation set, bias-free is defined as language that does not imply discrimination based on age, disability, gender, racial identity, ethnic identity, sexual orientation, socioeconomic status, and intersectionality. Exceptions may be present in the documentation due to language that is hardcoded in the user interfaces of the product software, language used based on RFP documentation, or language that is used by a referenced third-party product. Learn more about how Cisco is using Inclusive Language.

Introduction

This document describes the troubleshooting steps to handle memory errors on UCS Servers.

Prerequisites

Requirements

Cisco recommends that you have knowledge of these topics.

Basic understanding of UCS.
Basic understanding of Memory Architecture.

Components Used

The information in this document is based on these software and hardware versions:

UCS Family Servers M5, M6, M7 and higher.
UCS Manager
Cisco Integrated Management Controller (CIMC)
Cisco Intersight Managed Mode (IMM)

The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.

Background Information

Memory Errors

Memory errors are encountered when an attempt is made to read a memory location. The value read from the memory does not match the value that is supposed to be there. These errors are classified into two types:

1. Soft errors

Soft errors are transient and do not continue to be repeated. These are temporary and can often be corrected by retrying the read or rewriting the memory location.

2. Hard Errors

Permanent physical defects cause these. Rewriting the memory location and retrying the read access does not eliminate a hard error. As a result, this memory error is uncorrectable, and the memory needs to be replaced as the error continues to repeat.

Correctable Errors

If errors are detected and corrected, they are considered correctable. This can be accomplished by retrying the read or by calculating the correct memory contents using Error Correction Code (ECC) data and writing the proper data back into memory. After an error is detected and corrected, the Cisco Integrated Management Controller (IMC) logs the event in the System Event Log.

Typically, correctable errors are the result of soft errors. If correctable errors persist within the same memory location over an extended period, it could indicate a potential hard error.

Adaptive Double Device Data Correction (ADDDC)

ADDDC Sparing can correct two successive DRAM failures if they reside in the same region. ADDDC dynamically moves data from failing bits to spare memory, preventing correctable errors from becoming uncorrectable. A threshold of correctable ECC errors is required to trigger the mechanism.

ADDDC helps in some scenarios where correctable ECC errors precede Uncorrectable ECC errors.

Post Package Repair (PPR)

Post Package Repair (PPR) can permanently repair failing memory regions within a DIMM by leveraging redundant DRAM rows. This permanent in-field repair allows for rapid recovery from hard errors without needing to replace the DIMM. To perform a repair, the system must experience an ADDDC event and go through at least one reboot cycle. This repair activity does not affect performance or the total memory available to the OS.

PPR and ADDDC are enabled by default, however, can be configurable. PPR requires ADDDC Sparing RAS mode to also be enabled. If the RAS setting is other than ADDDC Sparing or Platform Default, PPR is not operational. The only supported PPR mode is Hard PPR, which means repairs are permanent.

Partial Cache Line Sparing (PCLS)

There is an error-prevention mechanism in the memory controller. It works by identifying faulty small portions of data in memory. These faulty locations are recorded in a special directory, along with backup data that can replace them. When the memory is accessed, if there is an error in those faulty spots, the controller uses the backup data from the directory to ensure everything runs smoothly.

Note: The features are available depending on the CPU architecture and the firmware version running on the server. Ensure you are in the last recommended version to handle the memory errors better.

Troubleshoot RAS Faults

UCS Manager

Generally, you see these faults in UCS Manager as an RAS event.

In the health summary, you can find more information about the error, whether PCLS or PPR was triggered.

PCLS example

On M6 servers and newer, you have the option to enable Patrial cache line sparing (PCLS) as a BIOS option, which is an error-prevention mechanism. The server must be rebooted as soon as possible, so PPR can kick in and repair the DIMM. Once the server is rebooted, monitor for additional UCS Manager faults for the same DIMM.

As the alert mentions, it is recommended to reboot the server at the earliest convenience, since there is an associated risk of experiencing an Uncorrectable Error, and consequently an unexpected server downtime.

PCLS2 Example

PPR example

The server has ADDDC and PPR enabled, and a RAS event occurred. The fault suggests rebooting for PPR to repair the DIMM. The server needs to be rebooted as soon as possible for PPR to kick in and repair the DIMM.

Once the server is rebooted, monitor for additional UCS Manager faults for the same DIMM.

PPR Example

Intersight Managed Mode

The Server has ADDDC enabled, and a BANK VLS event occurred, creating the fault you see. In this scenario, the next step is to perform a server reboot as soon as possible to allow PPR to be executed.

Intersight Managed Mode

Cisco Integrated Management Controller (CIMC)

The fault appears as shown when using the Cisco Integrated Management Controller. If the server has ADDDC and a VLS event occurred, then this is working as designed to prevent uncorrectable errors.

Cisco Integrated Management Controller

Troubleshooting Steps

Verify no other DIMM faults are present for instance and Uncorrectable Error.
Schedule a maintenance window.
Place a host in maintenance mode, and reboot the server to attempt permanent repair of the DIMM using Post Package Repair (PPR).

UCSM Reboot Steps

Note: You can reboot the server from the OS as well. This example uses the reboot option from the server UI.

Navigate to your UCS Manager web Interface.

Blade Server

Navigate to Equipment > Chassis > Server X.

Integrated Server

Navigate to Equipment > Rack-Mounts > Server X.

Click KVM console.

Reboot KVM Console

on the KVM windows, click server actions, select Reset, and click OK.

KVM Console

Monitor in the KVM the reboot process, and ensure the OS boots up correctly.

IMM Reboot Steps

Navigate to the Servers Tab, identify the server, and click the Action (three dots) menu.

IMM Reboot Steps

Next, select the Power menu and then Power Cycle option.

Power Menu

Click the Power Cycle button to confirm the action.

Power Cycle Server

Validate the progress under the Requests menu.

Requests Menu

CIMC Reboot Steps

Navigate to the Host Power option and select Power Cycle.

Launch the KVM to monitor the reboot process, and ensure the OS boots up correctly.

Monitor for New Faults

If no errors occur after reboot meaning that there is no other RAS event or fault related to the DIMM, PPR was successful and the server can be put back into use.

If new ADDDC events occur, repeat the reboot process outlined in the previous steps to perform additional permanent repairs with PPR.

If an Uncorrectable Error, or inoperable fault occurs after reboot, the fault indicates that a memory needs to be replaced.

Note: Please open a case with Cisco TAC to replace the DIMM if you encounter any of these faults.

UCS Manager Uncorrectable Memory Error

UCS Manager Uncorrectable Memory Error - 1

UCS Manager Uncorrectable Memory Error - 2

IMM Memory Uncorrectable Error

Uncorrectable error fault. The fault indicates the DIMM has an uncorrectable error and needs to be replaced.

IMM Memory Uncorrectable Error

CIMC Uncorrectable Memory Error

Related Information

Memory Technical Overview - Memory RAS Feat ures

Revision History

Revision	Publish Date	Comments
1.0	25-Oct-2024	Initial Release

Contributed by Cisco Engineers

Ricardo Martinez
Customer Delivery Engineering Technical Leader

Was this Document Helpful?

Feedback

Contact Cisco

Open a Support Case
(Requires a Cisco Service Contract)

Troubleshoot Memory Errors on UCS Servers

Available Languages

Download Options

Bias-Free Language

Contents

Introduction

Prerequisites

Requirements

Components Used

Background Information

Memory Errors

Correctable Errors

Adaptive Double Device Data Correction (ADDDC)

Post Package Repair (PPR)

Partial Cache Line Sparing (PCLS)

Troubleshoot RAS Faults

UCS Manager

Intersight Managed Mode

Cisco Integrated Management Controller (CIMC)

Troubleshooting Steps

UCSM Reboot Steps

IMM Reboot Steps

CIMC Reboot Steps

Monitor for New Faults

UCS Manager Uncorrectable Memory Error

IMM Memory Uncorrectable Error

CIMC Uncorrectable Memory Error

Related Information

Revision History

Contributed by Cisco Engineers

Was this Document Helpful?

Contact Cisco

This Document Applies to These Products