Field Notice: FN74071 - Cisco HyperFlex Potential All Paths Down conditions in presence of stale disk mirror clean requests or Reduce Resync - Workaround Provided

Updated: April 23, 2024

Document ID: FN74071

Notice

THIS FIELD NOTICE IS PROVIDED ON AN "AS IS" BASIS AND DOES NOT IMPLY ANY KIND OF GUARANTEE OR WARRANTY, INCLUDING THE WARRANTY OF MERCHANTABILITY. YOUR USE OF THE INFORMATION ON THE FIELD NOTICE OR MATERIALS LINKED FROM THE FIELD NOTICE IS AT YOUR OWN RISK. CISCO RESERVES THE RIGHT TO CHANGE OR UPDATE THIS FIELD NOTICE AT ANY TIME.

Products Affected

Affected Software Product	Affected Release	Affected Release Number	Comments
HyperFlex HX Data Platform	5.0	5.0(2a), 5.0(2b), 5.0(2c), 5.0(2d)

Defect Information

Defect ID	Headline
CSCwf98678	Potential APD in presence of stale disk mirror clean requests
CSCwf34019	A race condition during handling of flusher restart may result into discarding data to be flushed

Problem Description

Cisco HyperFlex Data Platform (HXDP) Software Releases 5.0(2a), 5.0(2b), 5.0(2c), and 5.0(2d) include two software defects that may result in data unavailability or, in rare cases, data loss. These defects are typically triggered by a cluster maintenance event or by an event such as a drive failure.

CSCwf98678: All Paths Down (APD) may occur sometime after a drive is replaced. This bug can cause the following issues:

- The cluster becomes inaccessible (APD) due to a Cisco HyperFlex file system process (storfs) crash.
- Node panic/outage might occur.

CSCwf34019: A race condition during cache-to-persistent destaging may discard the data to be flushed. This bug can cause the following issues:

- Multiple disks across multiple nodes may be retired.
- The cluster state can become CRITICAL, which stops cluster read/write activity.

Background

CSCwf98678: APD may occur sometime after a drive is replaced.

A Cisco HyperFlex file system maintains copies, or mirrors, of data objects on different disks on different nodes. A mirror can be moved in or out of a disk due to various conditions, such as rebalancing requirements after a drive replacement. When a mirror is moved out of a disk, a request to clean the mirror is generated and stored in a cluster database. Due to changes that were made in Cisco HXDP Releases 5.0(2a) through 5.0(2d), repeated rebalancing activity might cause different mirrors to become assigned to the same disk over time. Blocks from older mirrors might then overwrite the index, so that data is not fetched correctly, which could lead to an APD situation.

CSCwf34019: A race condition during cache-to-persistent destaging may discard the data to be flushed.

During destaging activity, under an extremely rare scenario of remote node mirror changes, data may be impacted.

Problem Symptom

- Multiple disks across multiple nodes are retired at random.
- Disks may become available for some time and then be retired again.
- The cluster state can become CRITICAL, which stops cluster I/O.
- The cluster becomes inaccessible (APD) due to a Cisco HyperFlex file system process (storfs) crash.
- A node panic/outage occurs.
- If stale entries are not cleaned up prior to an upgrade, long healing times lead to upgrade failure.

Workaround/Solution

The Reduce Resync issues described in this Field Notice have been resolved in Cisco HXDP Software Release 5.0(2e) and later. Remediations described in this section are required before upgrading to Cisco HXDP Software Release 5.0(2e) because the action of performing the upgrade can cause an APD condition if stale mirrors exist prior to the upgrade. However, if upgrading a cluster to Cisco HXDP Software Release 5.0(2g) or later, no additional action is required prior to upgrade. Cisco HXDP Software Release 5.0(2g) automatically addresses stale mirrors by running remediation scripts before performing the update. Cisco recommends upgrading to HXDP Software Release 5.0(2g) or later for the smoothest remediation and upgrade experience.

Clusters running affected releases in a Hyper-V environment should immediately upgrade to Cisco HXDP Software Release 5.0(2g) or later. No Cisco TAC assistance or additional remediation is required if upgrading to this release.

Clusters running affected releases in an ESXi environment do not require any additional remediation steps if upgrading to Cisco HXDP Software Release 5.0(2g) or later. One of the two remediation options described below is required if upgrading to Cisco HXDP Software Release 5.0(2e).
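
The upgrade guidance above can be summarized as a small decision helper. The following Python sketch is illustrative only and is not a Cisco tool: the function names are hypothetical, and it assumes that any target release earlier than 5.0(2g) should be treated the same way this notice treats 5.0(2e).

```python
import re

# Releases listed under "Products Affected" in this notice.
AFFECTED_RELEASES = {"5.0(2a)", "5.0(2b)", "5.0(2c)", "5.0(2d)"}


def parse_release(release):
    """Turn a string such as '5.0(2e)' into a sortable tuple (5, 0, 2, 'e')."""
    match = re.fullmatch(r"(\d+)\.(\d+)\((\d+)([a-z])\)", release.strip())
    if match is None:
        raise ValueError(f"unrecognized release string: {release!r}")
    major, minor, maint, letter = match.groups()
    return int(major), int(minor), int(maint), letter


def manual_remediation_needed(current, target):
    """Return True if stale-mirror remediation should be run before upgrading.

    Assumption (not stated in the notice): every target release earlier than
    5.0(2g) is handled like 5.0(2e), which requires remediation first.
    """
    if current not in AFFECTED_RELEASES:
        return False
    return parse_release(target) < parse_release("5.0(2g)")


if __name__ == "__main__":
    print(manual_remediation_needed("5.0(2b)", "5.0(2e)"))
    print(manual_remediation_needed("5.0(2b)", "5.0(2g)"))
```

Comparing parsed tuples rather than raw strings keeps the ordering correct if the maintenance number ever reaches two digits.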

Remediation by using Cisco Intersight API

Version 1.1 of this Field Notice introduced a method to remediate the stale mirror issue using an API that can be accessed through Cisco Intersight. This remediation method has the advantage of handling all clusters within an account with a single action. The ability to pre-select specific clusters also exists. Either remediation method can be used whether an Intersight alarm exists or not. One of the remediation methods must be run if a warning such as the following is present:

Note:

- Only clusters running HXDP versions 5.0.2a/b/c/d will run the mitigation script.
- Only Intersight connected clusters will run the mitigation script.
- Only clusters in an OK state will run the mitigation script. Unhealthy clusters should be mitigated beforehand.
- A Device Connector in read-only state will not run the mitigation script.

For instructions on using Cisco Intersight API remediation, see this video:
https://video.cisco.com/detail/videos/data-center-hyperflex/video/6345071116112?autoStart=true

1. Log in to Intersight.com as an Account Administrator. Administrator privileges are required to run this script.

Steps 2-6 describe the process of selecting specific clusters for remediation. Skip to step 7 to run the mitigation script on ALL clusters in the account.

2. It is important to run this on all clusters before upgrading software, but a user may choose to update a subset initially. To select specific clusters, take the following steps:
a. Identify the Managed Object IDs (MOIDs) of the clusters to which you wish to apply the mitigation script.
b. To get the cluster MOID, go to HyperFlex Clusters in the left navigation menu and select the cluster. The MOID is in the URL.
c. Repeat to get multiple MOIDs for different clusters.
d. Save the MOIDs into a temporary document such as Notepad for use in the next step. To run the mitigation script on ALL clusters, skip to step 7.
3. Open another browser tab and go to https://www.intersight.com/apidocs.
4. On the top menu, navigate to API Reference.
5. In the search field on the left side, type StartReduceResync, then select POST.
6. To run the mitigation script on selected clusters in the account, enter the saved MOID text for the intended target cluster(s) using the following syntax and click Send. The result will be a green message, 200 Success, along with the message OK at the bottom.
Payload:
```
{"Operation":"StartReduceResync","ClusterMoIds":["<cluster moid 1>","<cluster moid 2>",...,"<cluster moid n>"]}
```
The following is an example of the syntax:
```
{"Operation": "StartReduceResync", "ClusterMoIds": ["6598825b656c6c301f6e2851","6597b6ed656c6c301f6b07db", "65971fce656c6c301f68b49a","6594a2d4656c6c301f5d8bf2"]} 
```
A single MOID must also be enclosed in square brackets, as shown in the following example:
```
{"Operation": "StartReduceResync", "ClusterMoIds": ["6598825b656c6c301f6e2851"]}
```
Continue to step 8 if you are running the script on specific selected clusters.
7. To run the mitigation script on ALL clusters in the account, enter the payload text below and click Send. Specific MOIDs are not required. The result will be a green message, 200 Success, along with the message OK at the bottom.

Payload:
{"Operation": "StartReduceResync"}

8. Return to the Intersight.com main page and click the requests icon on the right side of the top ribbon (circled checkmark) to see workflows. For details, click the name Start Reduce Re-sync and Clean Stale Mirrors. When the processes are complete, the status displays a green highlighted Success message on the left. The following example shows four clusters running the StartReduceResync API:

This example shows a workflow in the Request page where all clusters in the account are in progress:
All alarms in Cisco Intersight will remain visible until they are manually acknowledged by the user. After running this procedure, all associated alarms should be acknowledged.
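
Collecting MOIDs and assembling the request body by hand is error-prone, so the payloads from steps 2 through 7 can also be generated. The following Python sketch is illustrative only. It assumes that a cluster MOID is the 24-character lowercase hexadecimal token seen in the examples above, the function names are hypothetical, and the URL in the usage lines is a made-up placeholder rather than a real Intersight address. The script only prints the JSON text to paste into the API Reference page; it does not call Intersight.

```python
import json
import re

# Assumption: a MOID is a standalone run of exactly 24 lowercase hex characters,
# as in the example payloads in this notice.
MOID_PATTERN = re.compile(r"(?<![0-9a-f])[0-9a-f]{24}(?![0-9a-f])")


def moid_from_url(url):
    """Return the first MOID-looking token in a browser URL, or None."""
    match = MOID_PATTERN.search(url)
    return match.group(0) if match else None


def build_payload(cluster_moids=None):
    """Build the StartReduceResync request body as a JSON string.

    With no MOIDs the payload targets all clusters in the account;
    otherwise the MOIDs are always sent as a list, even if there is only one.
    """
    payload = {"Operation": "StartReduceResync"}
    if cluster_moids:
        payload["ClusterMoIds"] = list(cluster_moids)
    return json.dumps(payload)


if __name__ == "__main__":
    # Hypothetical URL copied from the browser address bar.
    url = "https://example.invalid/clusters/6598825b656c6c301f6e2851/overview"
    print(build_payload([moid_from_url(url)]))
    print(build_payload())
```

Building the body with `json.dumps` avoids the unbalanced quotes and brackets that are easy to introduce when typing the list manually.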

Alarm Descriptions

There are three types of alarms that may be generated when StartReduceResync is run:
INFO alarm: This alarm is generated when a cluster has successfully run StartReduceResync API and the mirror count is 50 or less. No action is required for this alarm, which is shown in the example below:

INFO: <cluster name>: StartReduceResync operation triggered successfully on the cluster. Please check the workflow details for more information.

WARNING alarm: This alarm is generated when a cluster has successfully run StartReduceResync API with a mirror count of more than 50. No action is required unless the mirror count exceeds 50 and remains unchanged for more than one hour. Contact Cisco TAC if this occurs. The following example is of a WARNING alarm:

WARNING test1642_cluster_FI_1: Reduce Re-Sync and stale mirror cleanup operation triggered successfully on the cluster. The number of active mirrors on the cluster are: 51. Please check the workflow details for more information and re-trigger the workflow after some time to check if the active mirror count reduced.

CRITICAL alarm: This alarm is generated when the StartReduceResync API fails on a cluster, either because the cluster is unhealthy or for another reason. Review the workflow details and contact Cisco TAC if the mitigation is not obvious. The following is an example of a CRITICAL alarm:

CRITICAL aulani1-hx-cluster: Reduce Re-Sync and stale mirror cleanup has failed. Please check the workflow details for more information.
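
The alarm rules above reduce to a threshold of 50 mirrors. The following Python sketch restates them for anyone tracking results across many clusters. It is illustrative only: the function names are hypothetical, and the message parsing assumes that WARNING alarms keep the wording shown in the example above.

```python
import re

# Threshold taken from the alarm descriptions in this notice.
MIRROR_THRESHOLD = 50


def expected_alarm(run_succeeded, mirror_count):
    """Return the alarm severity this notice describes for a given outcome."""
    if not run_succeeded:
        return "CRITICAL"
    return "WARNING" if mirror_count > MIRROR_THRESHOLD else "INFO"


def active_mirrors_from_warning(message):
    """Extract the active mirror count from a WARNING alarm message, or None.

    Assumption: the message contains the phrase used in the sample alarm,
    'active mirrors on the cluster are: <number>'.
    """
    match = re.search(r"active mirrors on the cluster are:\s*(\d+)", message)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    sample = (
        "WARNING test1642_cluster_FI_1: Reduce Re-Sync and stale mirror cleanup "
        "operation triggered successfully on the cluster. The number of active "
        "mirrors on the cluster are: 51."
    )
    count = active_mirrors_from_warning(sample)
    print(count, expected_alarm(True, count))
```

Recording the extracted count each time the workflow is re-triggered makes it easier to see whether the count has stayed above 50 for more than one hour, which is the condition for contacting Cisco TAC.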

Remediation by Running Script Directly

Cisco has created a tool that can be run on any affected cluster to correct the configuration and address the issues. If the tool fails to run, you will be directed to contact Cisco TAC. Running this tool is not expected to cause downtime for any node or operational impact to the cluster, and it makes no changes to ESXi. As a general best practice, Cisco recommends running a backup before making any cluster changes and scheduling a maintenance window to complete these types of activities. If you are running one of the affected software releases, Cisco highly recommends deferring planned maintenance on the cluster until this tool has been run. Maintenance examples include:

- Server reboots
- Disk replacement
- Entering/exiting Maintenance Mode
- Cisco HXDP upgrades

After completing the remediation steps, Cisco still advises upgrading to Cisco HXDP Software Release 5.0(2e) or later, which addresses the defects and contains a number of other fixes for the HyperFlex platform. The HXDP release marked as Suggested Release is expected to have the highest levels of stability, reliability, and longevity.

How to run the Hot Patch Tool

For instructions on how to run the Hot Patch Tool, see this video:
https://video.cisco.com/detail/videos/technical-assistance-center-tac/video/6345069678112?autoStart=true

1. Confirm that the cluster is in a healthy state. The tool will not run if a cluster is in an unhealthy state or if nodes are not active in the cluster. Contact Cisco TAC in this case.
2. Download the Hot Patch Tool at: https://software.cisco.com/download/home/286305544/type/286305994/release/5.0(2e)
3. Open a browser window and navigate to an ESXi host that belongs to the cluster where the patch should be applied.
a. Upload the hx_patch_1.1.zip file to one of the converged nodes on the HyperFlex cluster. A converged node will usually have a VM named stCtlVM-<serial number> deployed on it.
b. To upload the file, click the SpringpathDS-<serial number> datastore and select Datastore Browser.

4. Choose Create directory and create a temporary folder such as tmp. After the folder is created, select the folder and then click Upload to copy hx_patch_1.1.zip to the folder.

5. Once the file is uploaded, use SSH to access the ESXi host using root credentials to get to the console. SSH may need to be temporarily enabled if Lockdown mode is enabled. If SSH cannot be enabled, work with Cisco TAC.

6. From the console, navigate to the directory where the file is uploaded, using the following command:

cd /vmfs/volumes/SpringpathDS-<serialno>/tmp

7. Extract the hx_patch_1.1.zip file, using the following command:

unzip hx_patch_1.1.zip

8. Execute the patch.py script, as shown below. Enter the HX admin password at the prompt. This can be done at any time and does not require the cluster to be taken offline.

python patch.py

When the script completes, the message Successfully executed checks will appear. Contact Cisco TAC for assistance with any failures or error messages. Script run time can vary, depending on the number of stale mirrors on the affected releases. It may take several hours for stale mirrors to clean up in the background. Re-running the script will display the current active stale mirror count.

The following example shows successful execution and decreasing Pending Mirror Clean Count:

[root@hostname01:/tmp] python patch.py
Enter HX Cluster Admin Password:
HX Hot Patch - 1.2 2023-12-19 04:07:13.222516 UTC
HXDP Release 5.0.2b
Cluster UUID 5dc701bdefc3be1e:7921c7b98e5a2ac8
Copying Package SUCCESS
Applying RR Patch SUCCESS
Applying MC Patch SUCCESS
Pending Mirror Clean Count 1008

Successfully executed checks.

An alarm referencing FN74071 will appear in HX Connect and/or Intersight. This alarm can be safely ignored and reset.

[root@hostname01:/tmp] python patch.py --count
Enter HX Cluster Admin Password:
HX Hot Patch - 1.2 2023-12-19 04:09:18.327193 UTC
HXDP Release 5.0.2b
Cluster UUID 5dc701bdefc3be1e:7921c7b98e5a2ac8
Copying Package SUCCESS
Pending Mirror Clean Count 900

An alarm referencing FN74071 will appear in HX Connect and/or Intersight. This alarm can be safely ignored and reset.

The following is an example of an unsuccessful execution:

[root@hostname02:/vmfs/volumes/618e400a-70a1bb0c-2753-6887c6bab650/tmp] python patch.py
Enter HX Cluster Admin Password:
HX Hot Patch - 1.2 2024-01-08 20:13:50.022455 UTC
HXDP Release 5.0.2d
Cluster UUID 5f101f547036bff2:6000cb0c8d475479
Copying Package SUCCESS
Applying RR Patch FAILED
Applying MC Patch FAILED

Error while executing script, reach out to Cisco TAC for assistance.
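
When the script is run on several clusters, or re-run to watch the cleanup progress, it can help to summarize saved console output. The following Python sketch is illustrative only and is not part of the Hot Patch Tool. It assumes that the output format matches the examples above (step lines ending in SUCCESS or FAILED, and a Pending Mirror Clean Count line), and the function name is hypothetical.

```python
import re


def summarize_patch_output(text):
    """Summarize captured patch.py console output.

    Returns a dict with:
      ok                         - True if no step failed and no error line is present
      failed_steps               - list of lines that end in FAILED
      pending_mirror_clean_count - integer count, or None if the line is absent
    """
    failed = [
        line.strip()
        for line in text.splitlines()
        if line.strip().endswith("FAILED")
    ]
    count_match = re.search(r"Pending Mirror Clean Count\s+(\d+)", text)
    return {
        "ok": not failed and "Error while executing script" not in text,
        "failed_steps": failed,
        "pending_mirror_clean_count": (
            int(count_match.group(1)) if count_match else None
        ),
    }


if __name__ == "__main__":
    captured = """Copying Package SUCCESS
Applying RR Patch SUCCESS
Applying MC Patch SUCCESS
Pending Mirror Clean Count 1008

Successfully executed checks.
"""
    print(summarize_patch_output(captured))
```

Comparing `pending_mirror_clean_count` between successive runs shows whether the background cleanup is making progress, as in the 1008 to 900 decrease in the example above.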

How to Identify Affected Products

To identify affected products using Controller VM SSH CLI access, use the following command:

hxshell:~$ stcli about
serviceType: stMgr
displayVersion: 5.0(2d)
name: HyperFlex StorageController
apiVersion: 0.1
productVersion: 5.0.2d-42558
build: 5.0.2d-42558 (internal)
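
The displayVersion field is the one to compare against the affected release list. The following Python sketch shows one way to check saved `stcli about` output. It is illustrative only: the function names are hypothetical, and it assumes the `key: value` layout shown in the output above.

```python
# Releases listed under "Products Affected" in this notice.
AFFECTED_RELEASES = {"5.0(2a)", "5.0(2b)", "5.0(2c)", "5.0(2d)"}


def display_version(stcli_about_output):
    """Return the displayVersion value from 'stcli about' output, or None."""
    for line in stcli_about_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "displayVersion":
            return value.strip()
    return None


def is_affected(stcli_about_output):
    """Return True if the reported release is one of the affected releases."""
    return display_version(stcli_about_output) in AFFECTED_RELEASES


if __name__ == "__main__":
    sample = """serviceType: stMgr
displayVersion: 5.0(2d)
name: HyperFlex StorageController
apiVersion: 0.1
productVersion: 5.0.2d-42558
build: 5.0.2d-42558 (internal)
"""
    print(display_version(sample), is_affected(sample))
```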

To identify affected products using the HX Connect interface, see the following:

Revision History

Version	Description	Section	Date
1.2	Clarified remediation methods and added reference to Cisco HXDP Release 5.0(2g) automatic remediation.	Workaround/Solution	2024-APR-23
1.1	Clarified remediations, added video links, and added Cisco Intersight instructions.	Workaround/Solution	2024-JAN-24
1.0	Initial Release	—	2023-DEC-13

For More Information

For further assistance or for more information about this field notice, contact the Cisco Technical Assistance Center (TAC).

Receive Email Notification About New Field Notices

To receive email updates about Field Notices (reliability and safety issues), Security Advisories (network security issues), and end-of-life announcements for specific Cisco products, set up a profile in My Notifications.

This Document Applies to These Products

HyperFlex HX Data Platform