THIS FIELD NOTICE IS PROVIDED ON AN "AS IS" BASIS AND DOES NOT IMPLY ANY KIND OF GUARANTEE OR WARRANTY, INCLUDING THE WARRANTY OF MERCHANTABILITY. YOUR USE OF THE INFORMATION ON THE FIELD NOTICE OR MATERIALS LINKED FROM THE FIELD NOTICE IS AT YOUR OWN RISK. CISCO RESERVES THE RIGHT TO CHANGE OR UPDATE THIS FIELD NOTICE AT ANY TIME.
Revision | Publish Date | Comments |
---|---|---|
1.5 | 18-Sep-22 | Updated the Background Section |
1.4 | 28-Oct-21 | Updated the Problem Symptom Section |
1.3 | 13-Jun-21 | Updated the Workaround/Solution Section |
1.23 | 03-Jun-21 | Updated the Workaround/Solution Section |
1.22 | 01-Jun-21 | Updated the Defect Information Section |
1.21 | 24-May-21 | Updated the Problem Description and Background Sections |
1.2 | 14-May-21 | Updated the How to Identify Affected Products Section and Added the Serial Number Validation Section |
1.1 | 21-Apr-21 | Updated the Workaround/Solution Section |
1.0 | 29-Mar-21 | Initial Release |
Affected Product ID | Comments |
---|---|
N9K-C9396TX | |
N9K-C9396PX | |
N9K-C93128TX | |
N9K-C9332PQ | |
N9K-C9372PX | |
N9K-C9372TX | |
N9K-C93120TX | |
N9K-C9372PX-E | |
N9K-C9372TX-E | |
N9K-C93180YC-EX | |
N9K-C93108TC-EX | |
N9K-C93180LC-EX | |
N9K-SUP-B+ | |
N9K-SUP-B | |
N9K-SUP-A+ | |
N9K-SUP-A | |
N9K-C9396TX= | |
N9K-C9396PX= | |
N9K-C93128TX= | |
N9K-C9332PQ= | |
N9K-C9372PX= | |
N9K-C9372TX= | |
N9K-C93120TX= | |
N9K-C9372PX-E= | |
N9K-C9372TX-E= | |
N9K-C93180YC-EX= | |
N9K-C93108TC-EX= | |
N9K-C93180LC-EX= | |
N9K-SUP-B+= | |
N9K-SUP-B= | |
N9K-SUP-A+= | |
N9K-SUP-A= | |
N9K-C9336PQ | |
N9K-C9336PQ= | |
Defect ID | Headline |
---|---|
CSCvx19640 | Nexus 9000 switch in read-only mode with M500IT SSD |
Due to a flaw in the Solid State Drive (SSD) firmware, the SSD stops responding after approximately 3.2 years of cumulative operation. Restarting the system allows the drive to operate for another 1008 hours (approximately six weeks) before it stops responding again.
After approximately 3.2 years (28,224 accumulated Power On Hours (POH)), a memory buffer overrun condition occurs which triggers the firmware event in the SSD.
This causes the drive to become unresponsive until the drive is power-cycled. No data loss will occur when the memory buffer overrun firmware event occurs. A power-cycle restores normal operation of the drive.
The drive continues to operate normally for approximately six weeks (1008 additional accumulated POH), at which time the drive will become unresponsive again.
Power-cycle the system in order to temporarily recover from this problem. However, this failure will reappear after 1008 hours of operation.
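The hour counts above can be cross-checked against the quoted durations with a few lines of arithmetic. This is only an illustrative sketch: the constant and function names are hypothetical, and it assumes a 24-hour day and a 365-day year.

```python
# Convert the Power On Hours (POH) figures quoted in this notice into
# days, weeks, and years. Names are illustrative only.
FIRST_FAILURE_POH = 28_224  # accumulated POH at which the drive first stops responding
RECURRENCE_POH = 1_008      # additional POH before the failure repeats after a power cycle


def poh_to_days(hours: int) -> float:
    """Convert accumulated power-on hours to days of continuous operation."""
    return hours / 24


first_failure_days = poh_to_days(FIRST_FAILURE_POH)
first_failure_years = first_failure_days / 365  # assumes a 365-day year
recurrence_days = poh_to_days(RECURRENCE_POH)
recurrence_weeks = recurrence_days / 7

print(f"{FIRST_FAILURE_POH} POH = {first_failure_days:.0f} days = {first_failure_years:.2f} years")
print(f"{RECURRENCE_POH} POH = {recurrence_days:.0f} days = {recurrence_weeks:.0f} weeks")
```

The result is 1176 days (about 3.22 years) for the first failure and 42 days (six weeks) for each recurrence, which matches the approximate figures given in this notice.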
The SSD goes into read-only mode, which prevents normal switch functionality. When this happens, the behavior of the switch is unpredictable and can result in operational impact and an unexpected reload. Diagnostic test ssd-acc will fail repeatedly and fault F1222 will be raised on the switch until it is reloaded.
The failure conditions can be identified with these outputs:
Fault Code : F1222
Cause : diag-failed
Affected Object : topology/pod-1/node-350/sys/diag/rule-ssd-acc-trig-forever/subj-[topology/pod-1/node-350/sys/ch/supslot-1/sup]/rslt-2019-03-11T00:32:59.000+04:00
Description : Diagnostics test failed. reason:Failed to write to file
leaf101# show diagnostic result module 1 test 24 detail
Current bootup diagnostic level: bypass
Module 1: 48x10/25G (Active)
Test results: (. = Pass, F = Fail, I = Incomplete, U = Untested, A = Abort, E = Error disabled)
----------------------------------------------------------------------
24) ssd-acc F
Error code ------------------> DIAG TEST FAIL
Total run count -------------> 216
Last test execution time ----> 2019-03-11 00:32:59
First test failure time -----> 2019-03-11 00:32:59
Last test failure time ------> 2019-03-11 00:32:59
Last test pass time ---------> 2019-03-11 00:27:59
Total failure count ---------> 1
Consecutive failure count ---> 1
Last failure reason ---------> Failed to write to file
Next Execution time ---------> 2019-03-11 00:37:59
----------------------------------------------------------------------
leaf101# mount | grep bootflash
/dev/sda4 on /bootflash type ext4 (ro,nodev,noexec,noatime,data=ordered)
/dev/sda4 on /bootflash type ext4 (ro,nodev,noexec,noatime,data=ordered)
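In the mount output above, the failed state is visible as the ro option on the /bootflash line. As a minimal sketch, assuming the output has already been captured as text, the check could be automated as follows. The function name bootflash_is_read_only is hypothetical and not part of any Cisco tooling.

```python
def bootflash_is_read_only(mount_output: str) -> bool:
    """Return True if any /bootflash mount line lists the 'ro' option."""
    for line in mount_output.splitlines():
        fields = line.split()
        # Expected shape: <device> on <mountpoint> type <fstype> (<options>)
        if len(fields) >= 6 and fields[1] == "on" and fields[2] == "/bootflash":
            options = fields[5].strip("()").split(",")
            if "ro" in options:
                return True
    return False


# Sample line taken from the output shown in this notice.
failed = "/dev/sda4 on /bootflash type ext4 (ro,nodev,noexec,noatime,data=ordered)"
# Hypothetical healthy line, shown only for contrast.
healthy = "/dev/sda4 on /bootflash type ext4 (rw,nodev,noexec,noatime,data=ordered)"

print(bootflash_is_read_only(failed))
print(bootflash_is_read_only(healthy))
```

Matching on the parsed option list rather than on the substring "ro" avoids false positives from other text on the line.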
Workaround
Restart the system in order to temporarily recover from this problem. However, this failure will reappear after 1008 hours of operation.
Solution
Upgrade the firmware of the SSD.
To prevent this issue and the resulting disruption to the network and operations, Cisco recommends that you upgrade the SSD firmware proactively, before the accumulated Power On Hours reach 28,224. See the How to Identify Affected Products section and follow the firmware upgrade procedure accordingly.
If the system is already impacted, the SSD firmware upgrade will permanently resolve this defect.
Notes:
Option 1. Upgrade the Application Centric Infrastructure NX-OS Version
First, verify whether the /bootflash directory is already in the failed (read-only) state. If so, reload the switch before you start the software upgrade.
The new firmware with the fix for this issue will be packaged in 13.2(10), 14.2(7), 15.1(4), 15.2(1), and later Application Centric Infrastructure (ACI) NX-OS versions.
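The release list above can be turned into a quick check against a switch's running version. The sketch below is an interpretation, not a Cisco-provided rule: it assumes version strings of the form 14.2(7f), treats each listed release as the minimum for its own train, assumes trains newer than 15.2 carry the fix, and treats unlisted older trains as not fixed. All names are hypothetical.

```python
import re

# Minimum fixed maintenance release per (major, minor) train, from this notice.
FIXED_MAINTENANCE = {(13, 2): 10, (14, 2): 7, (15, 1): 4, (15, 2): 1}
LATEST_LISTED_TRAIN = max(FIXED_MAINTENANCE)  # (15, 2)


def parse_version(version: str) -> tuple:
    """Split a string such as '14.2(7f)' into (major, minor, maintenance)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\((\d+)[a-z]*\)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def has_fixed_firmware(version: str) -> bool:
    """Return True if the version is expected to bundle the fixed SSD firmware."""
    major, minor, maintenance = parse_version(version)
    train = (major, minor)
    if train in FIXED_MAINTENANCE:
        return maintenance >= FIXED_MAINTENANCE[train]
    # Assumption: trains newer than the latest listed one include the fix;
    # unlisted older trains are treated as not fixed.
    return train > LATEST_LISTED_TRAIN


for candidate in ("14.2(6d)", "14.2(7f)", "15.2(1g)"):
    print(candidate, has_fixed_firmware(candidate))
```

When in doubt about a train that is not listed, confirm against the release notes rather than relying on this comparison.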
When the switch is upgraded to a fixed ACI NX-OS version, the SSD firmware is upgraded automatically. See the Cisco APIC Installation, Upgrade, and Downgrade Guide for more information.
Option 2. Upgrade the SSD Firmware with SSDUpgrader
Use this procedure to download the SSDUpgrader app from the Cisco DC App Center and upgrade the SSD firmware.
Notes:
Check the model and firmware revision of the SSD.
apic1# moquery -c eqptFlash -f 'eqpt.Flash.model*"Micron_M500IT"'
Total Objects shown: 2
# eqpt.Flash
acc : read-write
cap : 61057
childAction :
cimcVersion :
deltape : 9
descr : flash
dn : topology/pod-1/node-101/sys/ch/supslot-1/sup/flash
gbb : 0
id : 1
lba : 0
lifetime : 1
majorAlarm : no
mfgTm : 2021-01-27T14:28:18.226+00:00
minorAlarm : no
modTs : 2021-02-22T14:27:10.100+00:00
model : Micron_M500IT_MTFDDAT064SBD
monPolDn : uni/fabric/monfab-default
operSt : ok
peCycles : 565
readErr : 0
rev : MC02.00
rn : flash
ser : MSA241001K2
status :
tbw : 2.881872
type : flash
vendor : Micron
warning : no
wlc : 0
leaf101# cat /mnt/pss/smartctl_full_dump.log | tail -n 103 | egrep "Device Model|Firmware Version|ATTRIBUTE_NAME|Power_On_Hours"
Device Model: Micron_M500IT_MTFDDAT064SBD
Firmware Version: MC02.00
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 3347
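When many switches need to be checked, the filtered smartctl text shown above can be parsed programmatically. The following is a rough sketch under the assumption that the output keeps the layout shown in this notice, with RAW_VALUE as the last column of the attribute row. The function name parse_smart_summary is hypothetical.

```python
import re


def parse_smart_summary(text: str) -> dict:
    """Extract model, firmware revision, and Power_On_Hours from filtered smartctl text."""
    summary = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Device Model:"):
            summary["model"] = line.split(":", 1)[1].strip()
        elif line.startswith("Firmware Version:"):
            summary["firmware"] = line.split(":", 1)[1].strip()
        elif re.match(r"\d+\s+Power_On_Hours\b", line):
            # RAW_VALUE is assumed to be the last whitespace-separated column.
            summary["power_on_hours"] = int(line.split()[-1])
    return summary


# Sample text copied from the output shown in this notice.
sample = """\
Device Model: Micron_M500IT_MTFDDAT064SBD
Firmware Version: MC02.00
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 3347
"""

print(parse_smart_summary(sample))
```

Fields that are absent from the input are simply left out of the returned dictionary, so callers should check for missing keys before use.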
From the output of the previous commands, both of these conditions are true on an affected switch:
If either condition is not true, the switch is not affected and no action is required.
The Power_On_Hours raw value from the output can be used to calculate how much time is left before this issue occurs: subtract it from the 28,224-hour threshold.
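As a worked illustration of that calculation, the sketch below uses the Power_On_Hours value from the sample output in this notice (3347). The function names are hypothetical, and the projected date assumes the switch stays powered on continuously from the reference time.

```python
from datetime import datetime, timedelta

FIRST_FAILURE_POH = 28_224  # threshold stated in this notice


def hours_until_first_failure(power_on_hours: int) -> int:
    """Remaining POH before the threshold; 0 if it has already been reached."""
    return max(FIRST_FAILURE_POH - power_on_hours, 0)


def projected_failure_time(power_on_hours: int, now: datetime) -> datetime:
    """Projected time of first failure, assuming continuous operation from 'now'."""
    return now + timedelta(hours=hours_until_first_failure(power_on_hours))


remaining = hours_until_first_failure(3347)
print(f"{remaining} hours remaining (about {remaining / 24:.0f} days)")
print(projected_failure_time(3347, datetime(2021, 3, 29)))
```

For the sample switch this leaves 24877 hours. A drive that has already passed the threshold returns 0, in which case the 1008-hour recurrence described earlier applies instead.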
This field notice provides the ability to determine whether the serial number(s) of a device are impacted by this issue. To verify your serial number(s), enter them in the Serial Number Validation tool at https://snvui.cisco.com/snv/FN72145.
If you require further assistance, or if you have any further questions regarding this field notice, please contact the Cisco Systems Technical Assistance Center (TAC) by one of the following methods:
My Notifications—Set up a profile to receive email updates about reliability, safety, network security, and end-of-sale issues for the Cisco products you specify.