THIS FIELD NOTICE IS PROVIDED ON AN "AS IS" BASIS AND DOES NOT IMPLY ANY KIND OF GUARANTEE OR WARRANTY, INCLUDING THE WARRANTY OF MERCHANTABILITY. YOUR USE OF THE INFORMATION ON THE FIELD NOTICE OR MATERIALS LINKED FROM THE FIELD NOTICE IS AT YOUR OWN RISK. CISCO RESERVES THE RIGHT TO CHANGE OR UPDATE THIS FIELD NOTICE AT ANY TIME.
Affected Software Product | Affected Release | Affected Release Number | Comments |
---|---|---|---|
HyperFlex HX Data Platform | 5.0 | 5.0(2a), 5.0(2b), 5.0(2c), 5.0(2d) |
Defect ID | Headline |
---|---|
CSCwf98678 | Potential APD in presence of stale disk mirror clean requests |
CSCwf34019 | A race condition during handling of flusher restart may result into discarding data to be flushed |
Cisco HyperFlex Data Platform (HXDP) Software Releases 5.0(2a), 5.0(2b), 5.0(2c), and 5.0(2d) include software defects that may result in data unavailability or, in rare cases, data loss. The impact of these two bugs is typically triggered by a cluster maintenance event or an event such as drive failure.
CSCwf98678: All Paths Down (APD) may occur sometime after a drive is replaced. This bug can cause the following issues:
CSCwf34019: A race condition during cache-to-persistent destaging may discard the data to be flushed. This bug can cause the following issues:
CSCwf98678: APD may occur sometime after a drive is replaced.
A Cisco HyperFlex file system maintains copies, or mirrors, of data objects on different disks on different nodes. A mirror can be moved into or out of a disk under various conditions, such as rebalancing after a drive replacement. When a mirror is moved out of a disk, a request to clean the mirror is generated and stored in a cluster database. Due to changes made in Cisco HXDP Releases 5.0(2a) through 5.0(2d), repeated rebalancing activity might cause different mirrors to be assigned to the same disk over time. Blocks from older mirrors might then overwrite the index, so that data is not fetched correctly, which could lead to an APD situation.
CSCwf34019: A race condition during cache-to-persistent destaging may discard the data to be flushed.
During destaging activity, under an extremely rare scenario of remote node mirror changes, data may be impacted.
The Reduce Resync issues described in this Field Notice have been resolved in Cisco HXDP Software Release 5.0(2e) and later. Remediations described in this section are required before upgrading to Cisco HXDP Software Release 5.0(2e) because the action of performing the upgrade can cause an APD condition if stale mirrors exist prior to the upgrade. However, if upgrading a cluster to Cisco HXDP Software Release 5.0(2g) or later, no additional action is required prior to upgrade. Cisco HXDP Software Release 5.0(2g) automatically addresses stale mirrors by running remediation scripts before performing the update. Cisco recommends upgrading to HXDP Software Release 5.0(2g) or later for the smoothest remediation and upgrade experience.
Clusters running affected releases in a Hyper-V environment should immediately upgrade to Cisco HXDP Software Release 5.0(2g) or later. No Cisco TAC assistance or additional remediation is required if upgrading to this release.
Clusters running affected releases in an ESXi environment do not require any additional remediation steps if upgrading to Cisco HXDP Software Release 5.0(2g) or later. One of the two remediation options described below is required if upgrading to Cisco HXDP Software Release 5.0(2e).
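The upgrade guidance above can be summarized as a small decision helper. The following is a minimal sketch, not part of any Cisco tool: the function names are hypothetical, and treating every fixed release earlier than 5.0(2g) as requiring manual remediation is an assumption extrapolated from the statement about 5.0(2e). It covers only the "is remediation needed before upgrade" question and does not model the separate Hyper-V recommendation.

```python
import re

def parse_release(release):
    """Turn a release string such as '5.0(2e)' into a sortable tuple."""
    m = re.fullmatch(r"(\d+)\.(\d+)\((\d+)([a-z])\)", release.strip())
    if m is None:
        raise ValueError(f"unrecognized release format: {release!r}")
    major, minor, maint, patch = m.groups()
    return (int(major), int(minor), int(maint), patch)

# Releases named in this Field Notice.
FIXED_RELEASE = parse_release("5.0(2e)")             # first release with the fix
AUTO_REMEDIATION_RELEASE = parse_release("5.0(2g)")  # runs remediation during upgrade

def remediation_needed_before_upgrade(target_release):
    """Return True if a manual remediation should be run before upgrading.

    Assumption: any target earlier than 5.0(2g) is handled like 5.0(2e).
    """
    target = parse_release(target_release)
    if target < FIXED_RELEASE:
        raise ValueError(f"{target_release} does not contain the fix")
    return target < AUTO_REMEDIATION_RELEASE

if __name__ == "__main__":
    for release in ("5.0(2e)", "5.0(2g)"):
        print(release, remediation_needed_before_upgrade(release))
```

Comparing parsed tuples rather than raw strings keeps the ordering correct for releases outside the 5.0(2x) train.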
Remediation by using Cisco Intersight API
Version 1.1 of this Field Notice introduced a method to remediate the stale mirror issue using an API that can be accessed through Cisco Intersight. This remediation method has the advantage of handling all clusters within an account with a single action. The ability to pre-select specific clusters also exists. Either remediation method can be used whether an Intersight alarm exists or not. One of the remediation methods must be run if a warning such as the following is present:
Note: For instructions on using Cisco Intersight API remediation, see the video at the following link:
https://video.cisco.com/detail/videos/data-center-hyperflex/video/6345071116112?autoStart=true
Payload: The following is an example of the syntax:
{"Operation": "StartReduceResync", "ClusterMoIds": ["<cluster moid 1>", "<cluster moid 2>", ... "<cluster moid n>"]}
Example with multiple MOIDs:
{"Operation": "StartReduceResync", "ClusterMoIds": ["6598825b656c6c301f6e2851","6597b6ed656c6c301f6b07db", "65971fce656c6c301f68b49a","6594a2d4656c6c301f5d8bf2"]}
If only one MOID is specified, it must also be enclosed in square brackets, as shown in the following example:
{"Operation": "StartReduceResync", "ClusterMoIds": ["6598825b656c6c301f6e2851"]}
Example with no ClusterMoIds specified:
{"Operation": "StartReduceResync"}
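The payload shapes shown above can be generated programmatically, which avoids bracket and quoting mistakes when many MOIDs are involved. The following is a minimal sketch: the function name build_reduce_resync_payload is hypothetical, and only the "Operation" and "ClusterMoIds" keys come from this Field Notice. The sketch builds the JSON text only; how the payload is submitted through Cisco Intersight is not modeled here.

```python
import json

def build_reduce_resync_payload(cluster_moids=None):
    """Build the StartReduceResync payload as a JSON string.

    cluster_moids may be None (key omitted), a single MOID string,
    or an iterable of MOID strings. A single MOID is still wrapped
    in a list, as the Field Notice requires.
    """
    payload = {"Operation": "StartReduceResync"}
    if isinstance(cluster_moids, str):
        cluster_moids = [cluster_moids]
    if cluster_moids:
        payload["ClusterMoIds"] = list(cluster_moids)
    return json.dumps(payload)

if __name__ == "__main__":
    # Single cluster: the MOID is still emitted inside square brackets.
    print(build_reduce_resync_payload("6598825b656c6c301f6e2851"))
    # No MOIDs given: only the Operation key is emitted.
    print(build_reduce_resync_payload())
```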
This example shows a workflow in the Request page where all clusters in the account are in progress:
Alarm Descriptions
There are three types of alarms that may be generated when StartReduceResync is run:
INFO alarm: This alarm is generated when a cluster has successfully run the StartReduceResync API and the mirror count is 50 or less. No action is required for this alarm, which is shown in the example below:
Remediation by Running Script Directly
Cisco has created a tool that can be run on any affected cluster to correct the configuration and address these issues. If the tool fails to run, you will be directed to contact Cisco TAC. Running this tool is not expected to cause downtime for any node or operational impact to the cluster, and it makes no changes to ESXi. As a general best practice, Cisco recommends running a backup before making any cluster changes and scheduling a maintenance window to complete these types of activities. If you are running one of the affected software releases, Cisco highly recommends deferring planned maintenance on the cluster until this tool can be run. Maintenance examples include:
After the remediation steps are complete, Cisco still advises upgrading to Cisco HXDP Software Release 5.0(2e) or later, which addresses the defects and contains a number of other fixes for the HyperFlex platform. The HXDP release marked as Suggested Release is expected to have the highest levels of stability, reliability, and longevity.
How to run the Hot Patch Tool
For instructions on how to run the hot patch tool, see the video at the following link:
https://video.cisco.com/detail/videos/technical-assistance-center-tac/video/6345069678112?autoStart=true
4. Choose Create directory and create a temporary folder such as tmp. After the folder is created, select the folder and then click Upload to copy hx_patch_1.1.zip to the folder, as shown in the image below:
5. Once the file is uploaded, use SSH to access the ESXi host using root credentials to get to the console. SSH may need to be temporarily enabled if Lockdown mode is enabled. If SSH cannot be enabled, work with Cisco TAC.
6. From the console, navigate to the directory where the file is uploaded, using the following command:
cd /vmfs/volumes/SpringpathDS-<serialno>/tmp
7. Extract the hx_patch_1.1.zip file, using the following command:
unzip hx_patch_1.1.zip
8. Execute the patch.py script, as shown below. Enter the HX admin password at the prompt. This can be done at any time and does not require the cluster to be taken offline.
python patch.py
When the script completes, the message Successfully executed checks will appear. Contact Cisco TAC for assistance with any failures or error messages. Script run time can vary, depending on the number of stale mirrors on the affected releases. It may take several hours for stale mirrors to clean up in the background. Re-running the script will display the current active stale mirror count.
The following example shows successful execution and decreasing Pending Mirror Clean Count:
[root@hostname01:/tmp] python patch.py
Enter HX Cluster Admin Password:
HX Hot Patch - 1.2 2023-12-19 04:07:13.222516 UTC
HXDP Release 5.0.2b
Cluster UUID 5dc701bdefc3be1e:7921c7b98e5a2ac8
Copying Package SUCCESS
Applying RR Patch SUCCESS
Applying MC Patch SUCCESS
Pending Mirror Clean Count 1008
Successfully executed checks.
An alarm referencing FN74071 will appear in HX Connect and/or Intersight. This alarm can be safely ignored and reset.
[root@hostname01:/tmp] python patch.py --count
Enter HX Cluster Admin Password:
HX Hot Patch - 1.2 2023-12-19 04:09:18.327193 UTC
HXDP Release 5.0.2b
Cluster UUID 5dc701bdefc3be1e:7921c7b98e5a2ac8
Copying Package SUCCESS
Pending Mirror Clean Count 900
An alarm referencing FN74071 will appear in HX Connect and/or Intersight. This alarm can be safely ignored and reset.
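When the script is re-run to watch cleanup progress, the Pending Mirror Clean Count line can be extracted from saved output and compared between runs. The following is a minimal sketch that assumes the output has been captured as text and that the line format matches the examples above; the function name and sample variables are hypothetical and are not part of patch.py.

```python
import re

# Matches a line such as "Pending Mirror Clean Count 1008".
COUNT_PATTERN = re.compile(
    r"^Pending Mirror Clean Count[ \t]+(\d+)[ \t]*$", re.MULTILINE
)

def pending_mirror_clean_count(output):
    """Return the pending mirror clean count from captured output, or None."""
    m = COUNT_PATTERN.search(output)
    return int(m.group(1)) if m else None

# Excerpts from the two example runs shown above.
FIRST_RUN = """Copying Package SUCCESS
Applying RR Patch SUCCESS
Applying MC Patch SUCCESS
Pending Mirror Clean Count 1008
Successfully executed checks.
"""
SECOND_RUN = """Copying Package SUCCESS
Pending Mirror Clean Count 900
"""

if __name__ == "__main__":
    before = pending_mirror_clean_count(FIRST_RUN)
    after = pending_mirror_clean_count(SECOND_RUN)
    print(f"stale mirrors cleaned between runs: {before - after}")
```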
The following is an example of an unsuccessful execution:
[root@hostname02:/vmfs/volumes/618e400a-70a1bb0c-2753-6887c6bab650/tmp] python patch.py
Enter HX Cluster Admin Password:
HX Hot Patch - 1.2 2024-01-08 20:13:50.022455 UTC
HXDP Release 5.0.2d
Cluster UUID 5f101f547036bff2:6000cb0c8d475479
Copying Package SUCCESS
Applying RR Patch FAILED
Applying MC Patch FAILED
Error while executing script, reach out to Cisco TAC for assistance.
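If script output is collected from several clusters, the per-step SUCCESS and FAILED results can be picked out automatically so that failures are escalated to Cisco TAC promptly. The following is a minimal sketch that assumes each step is reported on its own line ending in SUCCESS or FAILED, as in the examples above; the function names are hypothetical and are not part of patch.py.

```python
import re

# A step line is a label followed by SUCCESS or FAILED at end of line.
STEP_PATTERN = re.compile(
    r"^(?P<step>.+?)[ \t]+(?P<status>SUCCESS|FAILED)[ \t]*$", re.MULTILINE
)

def step_results(output):
    """Map each reported step name to 'SUCCESS' or 'FAILED'."""
    return {m.group("step"): m.group("status")
            for m in STEP_PATTERN.finditer(output)}

def failed_steps(output):
    """Return the names of steps reported as FAILED, in order of appearance."""
    return [step for step, status in step_results(output).items()
            if status == "FAILED"]

# Excerpt from the unsuccessful example shown above.
FAILED_RUN = """Copying Package SUCCESS
Applying RR Patch FAILED
Applying MC Patch FAILED
Error while executing script, reach out to Cisco TAC for assistance.
"""

if __name__ == "__main__":
    failures = failed_steps(FAILED_RUN)
    if failures:
        print("Contact Cisco TAC; failed steps:", ", ".join(failures))
```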
To identify affected products using Controller VM SSH CLI access, use the following command:
hxshell:~$ stcli about
serviceType: stMgr
displayVersion: 5.0(2d)
name: HyperFlex StorageController
apiVersion: 0.1
productVersion: 5.0.2d-42558
build: 5.0.2d-42558 (internal)
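The displayVersion field in this output can be compared against the affected releases listed in this Field Notice. The following is a minimal sketch that assumes the stcli about output has been saved as text; the function names are hypothetical, and the release list is copied from the Affected Release Number column above.

```python
import re

# Affected releases listed in this Field Notice.
AFFECTED_RELEASES = {"5.0(2a)", "5.0(2b)", "5.0(2c)", "5.0(2d)"}

def display_version(stcli_about_output):
    """Extract the displayVersion value, or None if the field is absent."""
    m = re.search(r"^[ \t]*displayVersion:[ \t]*(\S+)[ \t]*$",
                  stcli_about_output, re.MULTILINE)
    return m.group(1) if m else None

def is_affected(stcli_about_output):
    """Return True if the reported release is one of the affected releases."""
    return display_version(stcli_about_output) in AFFECTED_RELEASES

# Sample output copied from the example above.
SAMPLE_OUTPUT = """serviceType: stMgr
displayVersion: 5.0(2d)
name: HyperFlex StorageController
apiVersion: 0.1
productVersion: 5.0.2d-42558
build: 5.0.2d-42558 (internal)
"""

if __name__ == "__main__":
    print(display_version(SAMPLE_OUTPUT), is_affected(SAMPLE_OUTPUT))
```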
To identify affected products using the HX Connect interface, see the following:
Version | Description | Section | Date |
---|---|---|---|
1.2 | Clarified remediation methods and added reference to Cisco HXDP Release 5.0(2g) automatic remediation. | Workaround/Solution | 2024-APR-23 |
1.1 | Clarified remediations, added video links, and added Cisco Intersight instructions. | Workaround/Solution | 2024-JAN-24 |
1.0 | Initial Release | — | 2023-DEC-13 |
For further assistance or for more information about this field notice, contact the Cisco Technical Assistance Center (TAC) using one of the following methods:
To receive email updates about Field Notices (reliability and safety issues), Security Advisories (network security issues), and end-of-life announcements for specific Cisco products, set up a profile in My Notifications.