Introduction
This document describes how to troubleshoot the Diameter peer issue that occurs when an In-Service Software Migration (ISSM) fails in Cisco Policy Suite (CPS).
Prerequisites
Requirements
Cisco recommends that you have knowledge of these topics:
- Linux
- CPS
- Diameter
- Open Service Gateway Initiative (OSGI) framework
Note: Cisco recommends that you have privileged root access to the CPS CLI.
Components Used
The information in this document is based on these software and hardware versions:
- CPS 19.4, 21.1
- CentOS Linux release 8.1.1911 (Core)
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.
Background Information
Users have the option to perform an ISSM from CPS 19.4.0 or CPS 19.5.0 to CPS 21.1.0. This migration allows traffic to continue without any impact while it completes.
ISSM to CPS 21.1.0 is supported only for Mobile High Availability (HA) and Geographic Redundancy (GR) installations. Other CPS installation types (mog|pats|arbiter|andsf|escef) cannot be migrated.
Problem
When the ISSM from CPS 19.4 to CPS 21.1 fails due to an invalid Hosts.csv entry, the Diameter peer connections on both Load Balancers (LB) go down, and a normal restart does not restore them.
[root@lab-lb02 ~]# ./show_peers.sh --all --summary
###############################################################################
[Wed Sep 21 01:57:47 CDT 2022]
SUMMARY of Peers in OKAY State:
| Gx | Re | Rx | Sh | Sy |
-------------|------|------|------|------|------|
lb01 peers | 0 | 0 | 0 | 0 | 0 |
-------------|------|------|------|------|------|
lb02 peers | 0 | 0 | 0 | 0 | 0 |
-------------|------|------|------|------|------|
This is the log entry you can see in consolidated-qns.log when you enable the debug-level logger.
2022-09-21 08:25:00,188 [pool-3-thread-1] DEBUG c.b.d.i.server.DelayedStartManager.? - isWorkerConnected: true queueSystem.enabled: false queueSystem.available: true isUpgradeState: false
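To confirm which upgrade state the system currently reports, the value can be pulled from the debug log rather than read by eye. This is a minimal sketch, assuming the debug-level logger is enabled and the log lines keep the `isUpgradeState: <value>` form shown above; the function name `latest_upgrade_state` is hypothetical and not part of CPS.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a CPS tool): print the isUpgradeState value from
# the last log line on stdin that contains it. Prints nothing if no line
# matches. Assumes the "isUpgradeState: true|false" format shown above.
latest_upgrade_state() {
    sed -n 's/.*isUpgradeState: \([a-z]*\).*/\1/p' | tail -n 1
}
```

For example, `latest_upgrade_state < consolidated-qns.log` prints `false` while the system is in the failed state described here.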
When this step is executed, the process hangs because of an invalid entry in the Hosts.csv file.
/mnt/iso/migrate.sh disable set 1
2022-09-21 02:52:48,913 INFO [__main__.migrate_disable_set] Waiting for build init.d background task
Replica-set Configuration
-------------------------------------------------------------------------------
The progress of this script can be monitored in the following log:
/var/log/broadhop/scripts//build_set_21092022_024648_1663728408306850218.log
-------------------------------------------------------------------------------
[ Done ] file creation [ In Progress ]
2022-09-21 02:58:16,385 INFO [__main__.migrate_disable_set] build init.d successfully.
2022-09-21 02:58:16,385 INFO [__main__.run_recipe] Performing installation stage: QuiesceClusterSet
[lab-cc02 PSZ06PCRFCC02] Executing task 'DisableArbiterVipNode'
[lab-cc02 PSZ06PCRFCC02] run: /var/qps/bin/support/disable_arbiter_vip_node.sh
Fatal error: Name lookup failed for lab-cc02 PSZ06PCRFCC02 --> Error highlight. Invalid host entry is noticed.
Underlying exception:
Name or service not known
Aborting.
2022-09-21 02:58:16,967 ERROR [__main__.<module>] Error during installation
2022-09-21 02:58:16,970 INFO [__main__.<module>] =====================
2022-09-21 02:58:16,970 INFO [__main__.<module>] FAILURE
2022-09-21 02:58:16,970 INFO [__main__.<module>] ======== END ========
2022-09-21 02:58:16,970 INFO [__main__.<module>] To have the environment variable updated, please logout and login from all opened shell on the current system
[root@lab-cm csv]#
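The name lookup failure above shows a host value with an embedded space. A quick pre-check of Hosts.csv before the migration can catch this kind of entry. The sketch below is an illustration under stated assumptions: it treats the first row as a header, checks every column because the exact column layout is not given here, and the function name `find_fields_with_spaces` is hypothetical. Fields that legitimately contain spaces are also reported, so review each hit manually.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a CPS tool): read CSV on stdin and report every
# field that contains embedded whitespace as "line:column:value".
# Returns 1 if any such field is found, 0 otherwise.
find_fields_with_spaces() {
    awk -F',' '
        NR == 1 { next }                              # assume row 1 is a header
        {
            for (i = 1; i <= NF; i++) {
                field = $i
                gsub(/^[ \t]+|[ \t\r]+$/, "", field)  # trim outer whitespace
                if (field ~ /[ \t]/) {                # whitespace inside value
                    printf "%d:%d:%s\n", NR, i, field
                    bad = 1
                }
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

A possible invocation is `find_fields_with_spaces < Hosts.csv`, run from the directory that holds the CSV files on the Cluster Manager.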
The script trigger_silo.sh, as part of the migrate.sh execution, pauses all qns processes in the selected LB for the set 1 migration.
2022-09-21 03:11:34,885 INFO [migrate_traffic.run] running - ['bash', '-c', 'source /var/qps/install/current/scripts/migrate/trigger_silo.sh && trigger_silo_pre_set1_upgrade /var/tmp/cluster-upgrade-set-1.txt /var/tmp/cluster-upgrade-set-2.txt /var/log/trigger_silo.log']
2022-09-21 03:17:27,594 INFO [command.execute] (stdout): LB qns process count : 7
Running pause on lb02-1
checking JMX port 9045 ....
Done - Paused qns-1
Running pause on lb02-2
checking JMX port 9046 ....
Done - Paused qns-2
Running pause on lb02-3
checking JMX port 9047 ....
Done - Paused qns-3
Running pause on lb02-4
checking JMX port 9048 ....
Done - Paused qns-4
Running pause on lb02-5
checking JMX port 9049 ....
Done - Paused qns-5
Running pause on lb02-6
checking JMX port 9050 ....
Done - Paused qns-6
Running pause on lb02-7
checking JMX port 9051 ....
Done - Paused qns-7
Solution
The upgrade is only partially complete, and the ISSM process keeps the CPS system in isUpgradeState: false.
In order to recover from this condition, you must set isUpgradeState: true in the OSGI framework of CPS.
Procedure to Set the Correct Upgrade State
Step 1. Log in to the Cluster Manager node.
Step 2. Connect to the OSGI framework of the CPS system.
[root@installer ~]# telnet qns01 9091
Trying 192.168.10.11...
Connected to qns01.
Escape character is '^]'.
osgi>
Step 3. Execute this command.
osgi> markNodeUpgraded
Upgraded status set to true
osgi>
Step 4. Disconnect from the OSGI framework gracefully with this command.
osgi> disconnect
Disconnect from console? (y/n; default=y) y
Connection closed by foreign host.
[root@installer ~]#
Once you apply the solution, check the diameter peer status with this command and ensure all needed peers are active.
[root@lab-lb02 ~]# ./show_peers.sh --all --summary
###############################################################################
[Wed Sep 21 01:57:47 CDT 2022]
SUMMARY of Peers in OKAY State:
| Gx | Re | Rx | Sh | Sy |
-------------|------|------|------|------|------|
lb01 peers | 72 | 120 | 36 | 0 | 12 |
-------------|------|------|------|------|------|
lb02 peers | 72 | 120 | 36 | 0 | 12 |
-------------|------|------|------|------|------|
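The peer check can also be scripted so that a recovery is confirmed without reading the table manually. This is a rough sketch, assuming the summary keeps the pipe-separated layout shown above. The function name `check_peer_summary` is hypothetical, and treating "at least one peer in OKAY state per LB" as healthy is a simplifying assumption, because the expected count per interface depends on the deployment.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a CPS tool): read show_peers.sh summary output on
# stdin, print "<lb> <total OKAY peers>" for each "... peers |" row, and
# return 1 if any LB has a total of 0 or if no LB rows were found.
check_peer_summary() {
    awk -F'|' '
        $1 ~ /peers/ {
            total = 0
            for (i = 2; i <= NF; i++) total += $i   # empty fields add 0
            split($1, name, " ")                    # name[1] is the LB name
            printf "%s %d\n", name[1], total
            if (total == 0) down = 1
            seen = 1
        }
        END { exit ((down || !seen) ? 1 : 0) }
    '
}
```

One way to use it is `./show_peers.sh --all --summary | check_peer_summary`, which exits with a non-zero status while the peers are still down.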