THIS FIELD NOTICE IS PROVIDED ON AN "AS IS" BASIS AND DOES NOT IMPLY ANY KIND OF GUARANTEE OR WARRANTY, INCLUDING THE WARRANTY OF MERCHANTABILITY. YOUR USE OF THE INFORMATION ON THE FIELD NOTICE OR MATERIALS LINKED FROM THE FIELD NOTICE IS AT YOUR OWN RISK. CISCO RESERVES THE RIGHT TO CHANGE OR UPDATE THIS FIELD NOTICE AT ANY TIME.
Revision | Publish Date | Comments |
---|---|---|
1.0 | 15-Dec-16 | Initial Release |
10.0 | 09-Oct-17 | Migration to new field notice system |
Affected Product ID | Comments |
---|---|
N5K-C56128P | |
N5K-C56128P= | |
N5K-C5624Q | |
N5K-C5624Q= | |
N5K-C5648Q | |
N5K-C5648Q= | |
N5K-C5672UP | |
N5K-C5672UP-16G | |
N5K-C5672UP-16G= | |
N5K-C5672UP= | |
N5K-C5696Q | |
N5K-C5696Q= | |
N6K-C6001-64P | |
N6K-C6001-64P= | |
N6K-C6001-64T | |
N6K-C6001-64T= | |
Defect ID | Headline |
---|---|
CSCva91270 | Nexus 6001/5600: NOHMS-2-NOHMS_ENV_SERR |
CSCus68610 | Nexus 5672/56128 - Hang or silent reset, uC reset code: 0x4800 or 0x400b |
CSCux41730 | N56K/6001: New BIOS to addresses source of correctable PCIE errors |
CSCuv79564 | Nexus 600x: Hang due to NMI interrupts |
CSCuv40217 | Excessive NMIs due to PCIe correctable errors causing reboot or hang |
A Nexus 6001 Series or Nexus 5600 Series switch might hang or reboot with no response via the console, mgmt0, or inband interfaces.
A Nexus 6001 or 5600 Series switch might hang or reboot during a storm of non-maskable hardware interrupts (NMIs) to the main CPU. An interrupt is a message sent between hardware components in the system; "non-maskable" means that the receiving side normally cannot ignore the message. In the context of this issue, the NMIs originate from the chassis PCIe bus and are caused by either a software defect or a hardware component failure.
There are three methods to identify this issue:
1. If the switch runs NX-OS 7.0(3)N1(1) or later, it can raise a syslog event when excessive interrupts are generated on the PCIe bus.
Example:
%USER-0-SYSTEM_MSG: 2032: PCIe critical FAILURE DETECTED, contact Cisco TAC - pfm
2. During an NMI storm to the CPU, erroneous values are often written to the "uC reset code" field of the reload records. These values indicate that a "microcontroller brownout" (platform reload) occurred when none actually did.
Example:
SWITCH# show logging onboard internal reset-reason
Mon Jan 1 00:00:00 2015: Card Uptime Record
----------------------------------------------
Uptime: 0, 0 days 0 hour(s) 0 minute(s) 0 second(s)
Reset Reason: Unknown (0)
Reset Reason SW: Unknown (0)
Reset Reason (HW): uC reset code: 0x4800
Host Requested Reset: reload
Microcontroller Detected Platform Reset <=====
3. The most reliable method to identify the issue is to run the following command several times in succession, waiting about a minute between runs. If the NMI counter increases from one run to the next, NMI events are occurring in the background:
show system internal file /proc/interrupts | incl NMI
Example:
SWITCH# show system internal file /proc/interrupts | incl NMI
NMI: 57000 0 0 0 0 0 0 0 Non-maskable interrupts
SWITCH# show system internal file /proc/interrupts | incl NMI
NMI: 58000 0 0 0 0 0 0 0 Non-maskable interrupts
SWITCH# show system internal file /proc/interrupts | incl NMI
NMI: 59000 0 0 0 0 0 0 0 Non-maskable interrupts
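Comparing successive outputs by eye works, but when the command output is captured repeatedly (for example, in a session log), the check can be automated. The following Python sketch assumes the captured lines follow the Linux `/proc/interrupts` layout shown above (the label `NMI:`, one count per CPU, then a description) and sums the per-CPU counts. The function names `nmi_total`, `nmi_deltas`, and `nmi_storm_suspected` are hypothetical, and treating a rise in every sampling interval as a suspected storm is an assumed heuristic, not a Cisco threshold.

```python
from itertools import takewhile


def nmi_total(line: str) -> int:
    """Sum the per-CPU counts on one 'NMI:' line of /proc/interrupts.

    Assumes the layout 'NMI: <cpu0> <cpu1> ... Non-maskable interrupts'.
    """
    fields = line.split()
    if not fields or fields[0] != "NMI:":
        raise ValueError(f"not an NMI line: {line!r}")
    # The per-CPU counts are the numeric fields that follow the label.
    return sum(int(f) for f in takewhile(str.isdigit, fields[1:]))


def nmi_deltas(samples: list[str]) -> list[int]:
    """Return the change in the NMI total between consecutive samples."""
    totals = [nmi_total(s) for s in samples]
    return [later - earlier for earlier, later in zip(totals, totals[1:])]


def nmi_storm_suspected(samples: list[str]) -> bool:
    """Hypothetical heuristic: the NMI total rose in every sampling interval."""
    deltas = nmi_deltas(samples)
    return bool(deltas) and all(d > 0 for d in deltas)


if __name__ == "__main__":
    # The three sample lines from the example output, about a minute apart.
    samples = [
        "NMI: 57000 0 0 0 0 0 0 0 Non-maskable interrupts",
        "NMI: 58000 0 0 0 0 0 0 0 Non-maskable interrupts",
        "NMI: 59000 0 0 0 0 0 0 0 Non-maskable interrupts",
    ]
    print(nmi_deltas(samples), nmi_storm_suspected(samples))
```

With the sample lines from the example, this prints `[1000, 1000] True`.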
PCIe errors can also result from a hardware component failure. If the switch runs one of the fixed NX-OS releases, it generates the following syslog messages when the PCIe errors are due to a hardware component failure:
In NX-OS 7.0(8)N1(1)
2016 Aug 11 06:56:20 N6001 %NOHMS-2-NOHMS_ENV_SERR: Bus:Dev.Func 0a:00.00 Vendor 0x1137 Device0xbe is giving out high correctable Error ..
2016 Aug 11 06:56:20 N6001 %NOHMS-2-NOHMS_ENV_SERR: Bus:Dev.Func 13:00.00 Vendor 0x1137 Device0xbe is giving out high correctable Error ..
2016 Aug 11 06:56:20 N6001 %NOHMS-2-NOHMS_ENV_SERR: Bus:Dev.Func 0f:00.00 Vendor 0x1137 Device0xbe is giving out high correctable Error ..
In NX-OS 7.1(4)N1(1) or 7.3(1)N1(1)
2016 Aug 16 13:44:06.686 N56128 %NOHMS-2-NOHMS_ENV_PCI: Bus:Dev.Func 00:01.01 Vendor 0x8086 DeviceID 0x155 PCIe critical FAILURE DETECTED (1-C/1), Contact Cisco TAC
2016 Aug 16 13:44:06.686 N56128 %NOHMS-2-NOHMS_ENV_PCI: Bus:Dev.Func 18:00.00 Vendor 0x10b5 DeviceID 0x8608 PCIe critical FAILURE DETECTED (1-C/1), Contact Cisco TAC
In the fixed NX-OS releases only, the show nohms pcie counter command displays these details.
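When many of these messages accumulate, it can help to list the distinct PCIe devices that report errors before opening a TAC case. The following Python sketch extracts the mnemonic, Bus:Dev.Func address, vendor ID, and device ID from lines shaped like the NOHMS examples above. The regular expression is inferred only from those sample lines, which spell the device field both as `Device0x..` and as `DeviceID 0x..`; other releases may format the message differently. The names `NOHMS_RE`, `parse_nohms`, and `distinct_devices` are hypothetical.

```python
import re

# Pattern inferred from the sample NOHMS messages only; not an official format.
NOHMS_RE = re.compile(
    r"%NOHMS-2-(?P<mnemonic>NOHMS_ENV_SERR|NOHMS_ENV_PCI): "
    r"Bus:Dev\.Func (?P<bdf>[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]{2}) "
    r"Vendor (?P<vendor>0x[0-9a-f]+) "
    r"Device(?:ID)?\s*(?P<device>0x[0-9a-f]+)",
    re.IGNORECASE,
)


def parse_nohms(lines):
    """Return one dict per NOHMS PCIe message found in the given syslog lines."""
    events = []
    for line in lines:
        match = NOHMS_RE.search(line)
        if match:
            events.append(match.groupdict())
    return events


def distinct_devices(events):
    """Collapse repeated messages to sorted unique (bdf, vendor, device) tuples."""
    return sorted({(e["bdf"], e["vendor"], e["device"]) for e in events})
```

Lines that do not match the pattern are skipped, so a full `show logging` capture can be passed in directly.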
The software issue is fixed by a BIOS update delivered in later NX-OS releases. To update the BIOS, upgrade the impacted switch to any Fixed-In NX-OS release by using an "install operation". If the switch is upgraded non-disruptively, reload it once afterward so that the new BIOS takes effect.
If an upgrade cannot be performed immediately, contact the Technical Assistance Center (TAC) to load a debug plugin. The plugin grants access to the Linux shell, from which the ports can be temporarily blocked via the CLI. This workaround does not survive a reload, so the NX-OS upgrade remains the recommended long-term fix.
Use the following table to determine the NX-OS upgrade path to an appropriate Fixed-In NX-OS release.
Impacted NX-OS Release | Fixed-In NX-OS Release |
---|---|
7.0(7) and prior | 7.0(8)N1(1) |
7.1(x) | 7.1(4)N1(1) |
7.2(x) | 7.3(0)N1(1), 7.3(1)N1(1) |
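The upgrade-path table can be expressed as a small lookup. The following Python sketch parses the train and maintenance number from a release string such as `7.1(3)N1(1)` and returns the Fixed-In target releases. It assumes that "7.0(7) and prior" also covers trains older than 7.0, and that a release at or after the Fixed-In release of its own train (including any 7.3 release) already contains the fix; the function name `fixed_in_targets` is hypothetical.

```python
import re

# Matches the leading "major.minor(maintenance)" part, e.g. "7.1(3)" in "7.1(3)N1(1)".
RELEASE_RE = re.compile(r"^(\d+)\.(\d+)\((\d+)\)")


def fixed_in_targets(release: str) -> list[str]:
    """Return the Fixed-In release(s) for an impacted release, or [] if none apply."""
    match = RELEASE_RE.match(release.strip())
    if match is None:
        raise ValueError(f"unrecognized NX-OS release: {release!r}")
    major, minor, maint = (int(g) for g in match.groups())
    train = (major, minor)
    # Row 1: 7.0(7) and prior (assumed to include trains older than 7.0).
    if train < (7, 0) or (train == (7, 0) and maint <= 7):
        return ["7.0(8)N1(1)"]
    # Row 2: 7.1(x) releases before the 7.1(4)N1(1) fix.
    if train == (7, 1) and maint < 4:
        return ["7.1(4)N1(1)"]
    # Row 3: any 7.2(x) release moves to a fixed 7.3 release.
    if train == (7, 2):
        return ["7.3(0)N1(1)", "7.3(1)N1(1)"]
    # Assumed already fixed (7.0(8)+, 7.1(4)+, 7.3(x)) or not covered by the table.
    return []
```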
The following table lists the fixed BIOS version for each platform:
PLATFORM | FIXED BIOS VERSION for 7.0(8)N1(1) and 7.3(0)N1(1) | FIXED BIOS VERSION for 7.1(4)N1(1) and 7.3(1)N1(1) |
---|---|---|
Nexus 5624 | 1.1.3 | 1.1.6 |
Nexus 5648 | 1.1.4 | 1.1.7 |
Nexus 5672UP | 2.1.5 | 2.1.7 |
Nexus 5672UP-16G | 0.1.6 | 0.2.0* |
Nexus 5696 | 2.3.0 | 2.6.0 |
Nexus 56128 | 3.3.0 | 3.7.0 |
Nexus 6001 | 2.2.0 | 2.5.0 |
*The Nexus 5672UP-16G is supported only in 7.3 releases.
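To check a switch against the BIOS table, the values can be transcribed into a lookup. The following Python sketch returns the expected BIOS version for a platform and Fixed-In release, honors the footnote that the Nexus 5672UP-16G is supported only in 7.3 releases, and compares dotted BIOS versions numerically. The running BIOS version is an input you supply (for example, read from the switch's `show version` output). The names `FIXED_BIOS`, `expected_bios`, and `bios_at_least` are hypothetical, and treating a BIOS newer than the listed version as also fixed is an assumption, not a statement from this notice.

```python
# Transcribed from the BIOS table:
# platform -> (BIOS for 7.0(8)N1(1)/7.3(0)N1(1), BIOS for 7.1(4)N1(1)/7.3(1)N1(1))
FIXED_BIOS = {
    "Nexus 5624": ("1.1.3", "1.1.6"),
    "Nexus 5648": ("1.1.4", "1.1.7"),
    "Nexus 5672UP": ("2.1.5", "2.1.7"),
    "Nexus 5672UP-16G": ("0.1.6", "0.2.0"),
    "Nexus 5696": ("2.3.0", "2.6.0"),
    "Nexus 56128": ("3.3.0", "3.7.0"),
    "Nexus 6001": ("2.2.0", "2.5.0"),
}

FIRST_COLUMN = {"7.0(8)N1(1)", "7.3(0)N1(1)"}
SECOND_COLUMN = {"7.1(4)N1(1)", "7.3(1)N1(1)"}


def expected_bios(platform: str, release: str):
    """Return the fixed BIOS version from the table, or None if not applicable."""
    if platform == "Nexus 5672UP-16G" and not release.startswith("7.3("):
        return None  # Footnote: this platform is supported only in 7.3 releases.
    if release in FIRST_COLUMN:
        column = 0
    elif release in SECOND_COLUMN:
        column = 1
    else:
        raise ValueError(f"{release!r} is not a Fixed-In release in the table")
    return FIXED_BIOS[platform][column]


def bios_at_least(running: str, expected: str) -> bool:
    """Compare dotted versions numerically, so 3.10.0 sorts after 3.7.0."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(running) >= as_tuple(expected)
```

Comparing integer tuples rather than raw strings avoids the string-ordering trap in which "3.10.0" would sort before "3.7.0".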
If it has been determined that the error symptoms are due to a hardware component failure, contact the Technical Assistance Center (TAC) or your Account Team to request a hardware replacement.
If you require further assistance, or if you have any further questions regarding this field notice, contact the Cisco Systems Technical Assistance Center (TAC).
Cisco Notification Service—Set up a profile to receive email updates about reliability, safety, network security, and end-of-sale issues for the Cisco products you specify.