THIS FIELD NOTICE IS PROVIDED ON AN "AS IS" BASIS AND DOES NOT IMPLY ANY KIND OF GUARANTEE OR WARRANTY, INCLUDING THE WARRANTY OF MERCHANTABILITY. YOUR USE OF THE INFORMATION ON THE FIELD NOTICE OR MATERIALS LINKED FROM THE FIELD NOTICE IS AT YOUR OWN RISK. CISCO RESERVES THE RIGHT TO CHANGE OR UPDATE THIS FIELD NOTICE AT ANY TIME.
Affected Product Name | Description | Comments |
---|---|---|
APIC-MR-X16G1RS-H | ^16GB DDR4-2666-MHz RDIMM/PC4-21300/single rank/x4/1.2v | |
APIC-SERVER-L3 | ^APIC Appliance - Large Config. (> 1200 Edge Ports) | |
APIC-SERVER-M3 | ^APIC Appliance - Medium Configuration (Upto 1200 Edge Ports) | |
N9K-C93108TC-FX | Nexus 9300 with 48p 10G-T, 6p 100G QSFP28 | |
N9K-C93108TC-FX-24 | Nexus 9300-FX w/ 24p 100M/1/10GT & 6p 40/100G | |
N9K-C93108TC-FX3P | Nexus 9300 48x 100M/1/2.5/5/10GT, 6x 100G Switch | |
N9K-C93108TC-FX3P= | Nexus 9300 N9K-C93108TC-FXP Spare (no fan/PSU/acc) | |
N9K-C93108TC-FX= | Nexus 9300 with 48p 10G-T, 6p 100G QSFP28 | |
N9K-C93180YC-FX | Nexus 9300 with 48p 1/10/25G, 6p 40/100G, MACsec | |
N9K-C93180YC-FX-24 | Nexus 9300-FX w/24p 1/10/25G & 6p 40/100G | |
N9K-C93180YC-FX3 | Nexus 9300 48p 1/10/25G, 6p 40/100G, MACsec,SyncE | |
N9K-C93180YC-FX3= | Nexus 9300 48p 1/10/25G, 6p 40/100G, MACsec,SyncE | |
N9K-C93180YC-FX3S | Nexus 9300 with 48p 1/10/25G SFP, 6p 40/100G QSFP, SyncE | |
N9K-C93180YC-FX3S= | Nexus 9300 with 48p 1/10/25G SFP, 6p 40/100G QSFP, SyncE | |
N9K-C93216TC-FX2 | Nexus 9300 with 96p 10G-T, 12p 100G QSFP, MACsec capable | |
N9K-C93240YC-FX2 | Nexus 9300 with 48p 10/25G SFP+ and 12p 100G QSFP28 | |
N9K-C93240YC-FX2= | Nexus 9K fixed,48p 10/25G SFP+12p 100G spare(no PSU/fan/acc) | |
N9K-C9332C | ^Nexus 9K ACI & NX-OS Spine, 32p 40/100G & 2p 10G | |
N9K-C93360YC-FX2 | Nexus 9300 w/ 96p 1/10/25G, 12p 100G, MACsec capable | |
N9K-C9336C-FX2 | Nexus 9300 Series, 36p 40/100G QSFP28 | |
N9K-C9336C-FX2= | Nexus 9K fixed, 36p 40/100G QSFP28,spare(no fan/psu/acc) | |
N9K-C9348GC-FXP | Nexus 9300 with 48p 100M/1GT, 4p 10/25G & 2p 40/100G QSFP28 | |
N9K-C9348GC-FXP= | Nexus 9K,48x1GT,4x10/25G,2x40/100G Spare(No Acc kit,PS&fan) | |
N9K-C9364C-GX | Nexus 9K ACI & NX-OS Leaf/Spine, 64p 40/100G QSFP28 | |
N9K-C9364C-GX= | Nexus 9K ACI & NX-OS Leaf/Spine, 64p 40/100G QSFP28 | |
N9K-SUP-A+ | Supervisor for Nexus 9500 | |
NXK-MEM-16GB | Additional memory of 16GB for Nexus Switches | |
SE-NODE-G2 | ^Service Node | |
Defect ID | Headline |
---|---|
CSCwb98743 | FN72464: Some DIMMs failing at higher than expected rate |
A limited number of dual in-line memory modules (DIMMs) shipped from Cisco are impacted by a known deviation in the memory supplier's manufacturing process. This deviation can result in a higher-than-expected rate of failure.
DIMM manufacturers build each DIMM from multiple memory modules to reach the desired capacity. In this case, a manufacturing deviation in specific modules impacts 8GB DIMMs and 16GB DIMMs. The deviation was contained to a specific date range: DIMMs manufactured from the middle to the end of 2020.
Since the discovery of this deviation, additional limits have been imposed on the manufacturing process to help prevent future DIMMs from experiencing early failure due to this process variation.
The DIMMs with this manufacturing deviation will exhibit persistent correctable memory errors. If not replaced, the DIMMs can eventually encounter an uncorrectable memory event. If encountered during runtime, uncorrectable errors will cause an unexpected switch reset.
Various DIMM Reliability, Availability, and Serviceability (RAS) features, or even operating system features, can mask the extent of these correctable errors. Cisco recommends that you check your DIMMs for exposure using the Serial Number Validation Tool described in the Serial Number Validation section of this field notice. Only specific DIMMs are impacted by this issue.
Solution
Customers should replace the affected DIMMs to avoid the potential for unexpected switch or server failure. For information about requesting a replacement, see the Upgrade Program Information section of this field notice after validating the serial number(s) as described in the How To Identify Affected Products section.
Important: For an updated list of impacted products, re-check serial numbers for all products that are listed in this field notice. All 8 GB and 16 GB DIMMs are included in the Serial Number Validation Tool.
All Serial Numbers are the Switch or Server Serial Number, not the DIMM Serial Number.
DIMM Replacement
Cisco is offering field services free of charge for DIMM replacement through a qualified Cisco third-party Field Engineer.
After the replacement DIMMs have arrived onsite and you are ready to schedule the replacement(s), send an email to ciscodimmswap@parkplacetech.com to engage the field services team.
Impacted DIMMs can be identified based on the Product ID (PID) and their serial number. To determine if a serial number is affected, see the Serial Number Validation section of this field notice.
Note: Cisco recommends that you contact the field services engineers to schedule onsite repair or replacement of the device. See the Additional Information section.
Switch Logs That Contain Memory Errors
When running NX-OS Standalone, a unit that experiences this issue can show these messages in the syslogs on the device:
%DAEMON-3-SYSTEM_MSG: Location: SOCKET:0 CHANNEL:? DIMM:? [] - mcelog
%DEVICE_TEST-3-MCE_24HR_FAIL: Module 1 has exceeded MCE 24 hour correctable threshold of 100 with #### correctable errors within 24 hours.
%DAEMON-3-SYSTEM_MSG: corrected Socket memory error count exceeded threshold: #### in 24h - mcelog
or
%DAEMON-3-SYSTEM_MSG: MESSAGE : corrected DIMM memory error count exceeded threshold: #### in 24h - mcelog
%DAEMON-3-SYSTEM_MSG: MESSAGE_Location: /var/log/mcelog - mcelog
Note: A syslog that references the error count is generated each time 100 errors are corrected, and it prints the cumulative error count.
These messages indicate that correctable errors are being generated, which should not impact switch performance. If errors continue, an uncorrectable error can occur, and the device will then undergo a kernel panic.
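As an illustration only, the NX-OS messages above could be picked out of a saved syslog export with a short script. This is a minimal sketch, not a Cisco tool: the function name `find_memory_error_events` is hypothetical, the patterns are derived solely from the sample messages in this field notice, and it assumes the `####` placeholders are replaced by decimal numbers in real output.

```python
import re

# Patterns derived from the sample messages in this field notice.
# Assumption: "####" in the samples is a decimal error count in real logs.
MCE_24HR_RE = re.compile(
    r"%DEVICE_TEST-3-MCE_24HR_FAIL: Module (?P<module>\d+) has exceeded MCE 24 hour "
    r"correctable threshold of (?P<threshold>\d+) with (?P<count>\d+) correctable errors"
)
MCELOG_RE = re.compile(
    r"%DAEMON-3-SYSTEM_MSG:.*corrected (?P<scope>Socket|DIMM) memory error count "
    r"exceeded threshold: (?P<count>\d+) in 24h"
)


def find_memory_error_events(lines):
    """Return one dict per log line that matches a correctable-error message."""
    events = []
    for line in lines:
        match = MCE_24HR_RE.search(line)
        if match:
            events.append({
                "type": "mce_24hr_fail",
                "module": int(match["module"]),
                "threshold": int(match["threshold"]),
                "count": int(match["count"]),
            })
            continue
        match = MCELOG_RE.search(line)
        if match:
            events.append({
                "type": "mcelog_threshold",
                "scope": match["scope"],
                "count": int(match["count"]),
            })
    return events
```

A match only shows that correctable errors were logged. Whether the unit is covered by this field notice is still determined by the Serial Number Validation Tool.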
When running ACI software, a unit that experiences this issue can display errors in the archived log at /mnt/pss/bootlogs/current/dmesg or in the latest logs shown by dmesg -T in the CLI. For example, the following logs confirm a memory error and identify DIMM 0 on channel 1 as the faulty DIMM:
[ 167.751610] sbridge: HANDLING MCE MEMORY ERROR
[ 167.751614] CPU 0: Machine Check Exception: 0 Bank 7: 8c00004000010091
[ 168.415928] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Channel#1_DIMM#0 (channel:1 slot:0 page:0x53232 offset:0xfc0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0091 socket:0 channel_mask:2 rank:0)
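The EDAC line above already names the socket, channel, and DIMM slot. As a hedged sketch, those fields could be extracted from saved dmesg output as follows. The function name `parse_edac_line` is hypothetical, and the pattern is based only on the single sample line in this notice; accepting `UE` alongside `CE` is an assumption that follows the usual Linux EDAC abbreviations for uncorrected and corrected errors.

```python
import re

# Pattern based on the one EDAC sample line shown in this field notice.
# Assumption: "UE" (uncorrected error) can appear where the sample shows "CE".
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) memory \w+ error on "
    r"CPU_SrcID#(?P<socket>\d+)_Channel#(?P<channel>\d+)_DIMM#(?P<dimm>\d+)"
)


def parse_edac_line(line):
    """Return the memory location named in an EDAC line, or None if no match."""
    match = EDAC_RE.search(line)
    if match is None:
        return None
    # Every captured group except "kind" is numeric.
    fields = {key: int(value) for key, value in match.groupdict().items() if key != "kind"}
    fields["kind"] = match["kind"]
    return fields
```

Applied to the sample line, the result reports socket 0, channel 1, DIMM 0 with kind `CE`, which matches the location stated in the log text itself.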
Switch Special Notes
This field notice covers onsite memory replacement. If your impacted switch or server has already failed, or has memory errors and is degraded, use the standard RMA replacement process instead.
Onsite replacements are typically scheduled and performed during maintenance windows.
Some switches are originally assembled with 24GB of memory, which consists of one 16GB DIMM and one 8GB DIMM. Because of this, Cisco sends both DIMMs for those devices. When the switch cover is removed, both DIMMs are replaced even though the 8GB DIMM is known to be extremely reliable. This is a proactive decision made by the Cisco team.
Cisco highly recommends that you take advantage of the field service offering of a field engineer. Note that if you decide to change the DIMMs on your own, some switch models have a more extensive process to gain access to the DIMM location.
Cisco highly recommends that a Cisco Field Engineer complete the replacement on switches with high screw counts.
The following table lists the switch models, the DIMM access method, and the number of screws involved.
Switch Model (PID) | DIMM Access Method | Number of Screws |
---|---|---|
N9K-C93180YC-FX3S | DIMM Door Access | 6 |
N9K-C93240YC-FX2 | DIMM Door Access | 6 |
N9K-C9364C-GX | DIMM Door Access | 6 |
N9K-C93180YC-FX-24 | DIMM Door Access | 6 |
N9K-C93180YC-FX3 | DIMM Door Access | 6 |
N9K-C93108TC-FX3P | DIMM Door Access | 5 |
N9K-C9336C-FX2 | Top Cover Access | 37 |
N9K-C9348GC-FXP | Top Cover Access | 35 |
N9K-C93108TC-FX | Top Cover Access | 33 |
N9K-C93108TC-FX-24 | Top Cover Access | 33 |
APIC Server Special Notes
The CIMC BIOS issue is noted in UCS field notice FN72272. Because of this BIOS issue, the reported EC error count can be higher than the actual EC error count. Uncorrectable errors can also be reported because of the older BIOS.
Additional Details About Revisions
Revision 3.1
Minor text changes to the CLI command. Switch replacements and RMAs all need to be requested through this field notice.
Revision 3.0
In revision 3.0 Cisco Product number N9K-C93216TC-FX2 was added to the field notice.
Revision 2.0
DIMMs with 8 GB density that were previously Replace on Failure were moved to Proactive replacement. This is due to a higher-than-expected failure rate in the field.
Note: Any switch that was repaired previously has had all of its memory replaced. There is no need for the memory to be replaced again.
For Cisco N9K-C93360YC-FX2 or N9K-C93216TC-FX2, the DIMMs are not field replaceable. Cisco will replace the switch, not DIMMs, using the field notice to create the switch RMA.
Revision 1.2
Cisco Application Policy Infrastructure Controller (APIC) products were moved from Replace on Failure to Proactive DIMM replacement.
You must replace the DIMMs and update from the older Cisco Integrated Management Controller (CIMC) BIOS (Version 4.1(3c) or earlier) in the same maintenance window.
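To illustrate the "4.1(3c) or earlier" condition, the sketch below compares a version string against that cutoff. It is an assumption-laden example rather than a Cisco procedure: the helper names are hypothetical, it assumes version strings follow the `major.minor(maintenance + patch letter)` shape seen in `4.1(3c)`, and it orders patch letters alphabetically. Confirm the actual version against the CIMC itself and FN72272.

```python
import re

# Assumption: versions look like "4.1(3c)": major.minor(maintenance + letters).
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\((\d+)([a-z]*)\)$")


def parse_cimc_version(text):
    """Split a version string such as '4.1(3c)' into a comparable tuple."""
    match = VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version format: {text!r}")
    major, minor, maintenance, patch = match.groups()
    return (int(major), int(minor), int(maintenance), patch)


def needs_bios_update(version, cutoff="4.1(3c)"):
    """True when the version is at or below the cutoff named in this notice."""
    return parse_cimc_version(version) <= parse_cimc_version(cutoff)
```

For example, `needs_bios_update("4.1(3c)")` is true and `needs_bios_update("4.2(1a)")` is false under these assumptions. The second version string is only an illustrative input, not a recommended release.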
Cisco provides the Serial Number Validation Tool to verify whether a device is impacted by this issue. To check the device, enter the serial number in the Serial Number Validation Tool.
Important: For security reasons, you must click the Serial Number Validation Tool link that is provided in this section. Do not copy and paste the link into a browser. Use of the Serial Number Validation Tool URL external to this field notice will fail.
Support Case Manager (SCM) must be used for ordering replacement parts for this Field Notice. To open SCM in a new tab, click the following link:
https://mycase.cloudapps.cisco.com/fieldnotice?fn=FN72464
SCM will validate eligibility and ensure that a request for a particular serial number has not already been submitted. If there is already a request, SCM will indicate that an RMA was already submitted and that the serial number is NOT eligible for replacement.
Provide the following information:
Order entry supports up to 50 serial numbers per request. For more than 50, submit additional requests.
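When an inventory holds more than 50 affected units, the list has to be split before submission. The following is a minimal sketch of that split. The function name `batch_serials` is hypothetical, and trimming whitespace and dropping duplicates and blank entries are convenience assumptions, not SCM requirements.

```python
def batch_serials(serials, batch_size=50):
    """Split switch or server serial numbers into groups of at most batch_size.

    Blank entries are skipped and duplicates are dropped, keeping first-seen order.
    """
    cleaned = (serial.strip() for serial in serials)
    unique = list(dict.fromkeys(serial for serial in cleaned if serial))
    return [unique[i:i + batch_size] for i in range(0, len(unique), batch_size)]
```

With 120 distinct serial numbers this yields three requests of 50, 50, and 20 entries. As stated earlier in this notice, the values must be the switch or server serial numbers, not the DIMM serial numbers.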
Version | Description | Section | Date |
---|---|---|---|
3.1 | Added information about errors in archived and latest logs. | How to Identify Affected Products | 2024-OCT-02 |
3.0 | Added product number N9K-C93216TC-FX2. Updated ACI DIMM error messages. | Products Affected, How to Identify Affected Products | 2024-AUG-08 |
2.0 | Added instructions for 8GB DIMM support. Updated the email address for the field services team. | Problem Description, Background, Workaround/Solution, Additional Information | 2024-JUL-11 |
1.4 | Updated the Upgrade Program Section. | — | 2023-AUG-24 |
1.3 | Updated the How To Identify Affected Products Section. | — | 2023-JUN-23 |
1.2 | Updated the Products Affected, Problem Description, Workaround/Solution, and Additional Information Sections. | — | 2022-DEC-13 |
1.1 | Updated the Background, Workaround/Solution, and Additional Information Sections. | — | 2022-NOV-14 |
1.0 | Initial Release | — | 2022-OCT-13 |
For further assistance or for more information about this field notice, contact the Cisco Technical Assistance Center (TAC) using one of the following methods:
To receive email updates about Field Notices (reliability and safety issues), Security Advisories (network security issues), and end-of-life announcements for specific Cisco products, set up a profile in My Notifications.