Introduction
This article is an extension to the document “Nexus 7000 Supervisor 2/2E Compact Flash Failure Recovery,” which addresses all possible failure scenarios. This document is useful in the case where the flash recovery tool fails to run. It is recommended to have console access to the device in order to perform the changes. It is also strongly recommended not to make any changes at the Linux kernel level other than those described in this document, as this can impact switch operations. Cisco TAC supervision is advisable.
Background
As explained in that document, each Nexus 7000 Supervisor 2/2E is equipped with two eUSB flash devices in a RAID1 configuration, one primary and one mirror. Together they provide non-volatile repositories for boot images, the startup configuration, and persistent application data. When the RAID fails on a supervisor in the chassis, the flash recovery tool is run in order to repair it. In almost all cases, if the flash recovery tool fails to run, the supervisor must be reloaded or failed over. However, in certain scenarios this can be fixed without a reload or failover.
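For reference, the RAID status byte mentioned throughout this document (0xf0 for a healthy array, 0xe1 in the failure scenario described here) can be read on the affected supervisor with the same command that is used later in this document to confirm the recovery. Replace x with the slot number of the affected supervisor:
Switch# slot x show system internal raid | i "cmos|block" | head line 5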
Prerequisites
Requirements
Cisco recommends that you have knowledge of Cisco NX-OS, storage or flash disk recovery methods, and Linux-level debugging.
Components Used
The information in this document is based on Cisco Nexus 7000 Series switches with Supervisor 2/2E modules.
Symptom
A RAID failure is observed on a supervisor, and the switch reports RAID failure status code 0xe1. When the flash recovery tool is run in order to recover the flash on the affected supervisor, this error appears:
ERROR: Cannot perform recovery. /dev/sdb has incorrect partition info.
ERROR: Disk /dev/sdb needs to be manually inspected for errors.
INFO: No recovery was attempted on module 5. All flashes left intact.
INFO: A detailed copy of this log was saved as volatile:flash_repair_log_mod5.tgz.
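For context, the flash recovery tool that produces this output is distributed by Cisco as a loadable plugin and is run from the switch CLI in the same way as the debug plugin shown in the next section. The filename below is only a placeholder; use the recovery tool image that Cisco provides for your release:
Switch# load bootflash:n7000-s2-flash-recovery-tool.<version>.gbin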
Solution
Load the debug plugin on the switch in order to log in to the Linux shell:
Switch# load bootflash:n7000-s2-debug-sh.6.1.4a.gbin
Be careful while you run commands from this shell.
Once at the Linux prompt, look for the affected device as indicated in the error message. In this case it is /dev/sdb, but it could be a different device:
Linux(debug)# ls -l /dev/sd?
brw-r----- 1 root root 8, 0 Aug 28 2015 sda
brw-rw-r-- 1 root disk 8, 32 Dec 18 2013 sdc
brw-rw-r-- 1 root disk 8, 48 Dec 18 2013 sdd
brw-rw-r-- 1 root disk 8, 64 Dec 18 2013 sde
brw-rw-r-- 1 root disk 8, 80 Dec 18 2013 sdf
brw-rw-r-- 1 root disk 8, 96 Dec 18 2013 sdg
brw-rw-r-- 1 root disk 8, 112 Dec 18 2013 sdh
brw-rw-r-- 1 root disk 8, 128 Dec 18 2013 sdi
brw-rw-r-- 1 root disk 8, 144 Dec 18 2013 sdj
brw-rw-r-- 1 root disk 8, 160 Dec 18 2013 sdk
brw-rw-r-- 1 root disk 8, 176 Dec 18 2013 sdl
brw-rw-r-- 1 root disk 8, 192 Dec 18 2013 sdm
The device node for /dev/sdb is missing, which causes the error when the recovery tool runs. Create the missing device node manually, with the same permissions as the other block devices:
Linux(debug)# mknod -m 664 /dev/sdb b 8 16
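In this mknod command, b creates a block device node, -m 664 sets the same permissions as the other disk nodes, and 8 16 are the major and minor numbers: major 8 is used for these disk devices, and the minor number increases by 16 for each whole disk (sda is 8 0, sdc is 8 32, and so on), so sdb is 8 16, consistent with the ls output above. If you want to confirm the numbers that the kernel itself reports before you create the node, the kernel partition list can be checked; this is standard Linux behavior and not specific to NX-OS. The columns are major, minor, number of blocks, and device name:
Linux(debug)# cat /proc/partitions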
The sdb device node is now present under /dev:
Linux(debug)# ls -l /dev/sd?
brw-r----- 1 root root 8, 0 Aug 28 2015 sda
brw-rw-r-- 1 root root 8, 16 May 26 07:31 sdb
brw-rw-r-- 1 root disk 8, 32 Dec 18 2013 sdc
brw-rw-r-- 1 root disk 8, 48 Dec 18 2013 sdd
brw-rw-r-- 1 root disk 8, 64 Dec 18 2013 sde
brw-rw-r-- 1 root disk 8, 80 Dec 18 2013 sdf
brw-rw-r-- 1 root disk 8, 96 Dec 18 2013 sdg
brw-rw-r-- 1 root disk 8, 112 Dec 18 2013 sdh
brw-rw-r-- 1 root disk 8, 128 Dec 18 2013 sdi
brw-rw-r-- 1 root disk 8, 144 Dec 18 2013 sdj
brw-rw-r-- 1 root disk 8, 160 Dec 18 2013 sdk
brw-rw-r-- 1 root disk 8, 176 Dec 18 2013 sdl
brw-rw-r-- 1 root disk 8, 192 Dec 18 2013 sdm
Exit from the Linux shell and run the flash recovery tool again.
This time the tool runs without any error messages, and the RAID failure on the primary flash is recovered (status 0xf0). This can be confirmed with this command, where x is the slot of the affected supervisor:
Switch# slot x show system internal raid | i "cmos|block" | head line 5
The tool should now run without such errors and should recover the affected supervisor from the RAID failure state. If the recovery tool still fails to run, there could be another cause, such as actual corruption of the partition, and a reload or supervisor failover may be required.
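If a supervisor failover is needed on a chassis with dual supervisors, the standard NX-OS switchover command can be used; it is shown here only as a reminder, and the standby supervisor should be verified as ha-standby before you use it:
Switch# show module | include Supervisor
Switch# system switchover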
Related Information