Introduction
This document describes issues related to UPF state mismatches in RCM.
Prerequisites
Requirements
There are no specific requirements for this document.
Components Used
The information in this document is based on these software and hardware versions:
- Redundancy Configuration Manager (RCM)
- User Plane Function (UPF/UP)
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.
Logs Collection
RCM
Step 1. Capture Some Command Outputs
First, identify the problematic UP and the pattern of the issue. Document the reason for each switchover in order to determine which UPs experienced a switchover and where the current issue is located.
rcm show-statistics switchover
rcm show-statistics switchover-verbose
rcm show-statistics configmgr --------------- to check how many UPs are registered for config push
rcm show-statistics controller --------------- to check the number of UPs registered with the controller and their states
Step 2. Collect Controller and Configmgr Logs
Once you identify the problematic UPs, collect the controller logs and configmgr logs in order to identify the cause of the switchover and the reason the UPs got stuck in the Pending state.
Refer to the RCM Log Collection link for the log collection procedure.
UP
Collect the SSD, syslogs, and SNMP traps for the problematic timestamp. Cover a timeframe that starts at least two hours before the issue begins.
Troubleshooting
Scenario for UPs Getting Stuck in the Pending State
- Generally, every UP registers itself to the RCM via the controller.
- The controller maintains and compiles two states for each UP: the state it receives from the UP and the state assigned by RCM.
rcm show-statistics controller
message :
{
"keepalive_version": "f1ab207c5d3120f8a4286b999b9f4cd207034e7c61e204d74e41f48578c476de",
"keepalive_timeout": "20s",
"num_groups": 2,
"groups": [
{
"groupid": 1,
"endpoints_configured": 7,
"standby_configured": 1,
"pause_switchover": false,
"active": 2,
"standby": 0,
"endpoints": [
{
"endpoint": "X.X.X.X", -------- UP IP
"bfd_status": "STATE_UP",
"upf_registered": true,
"upf_connected": true,
"upf_state_received": "UpfMsgState_Active",
"bfd_state": "BFDState_UP",
"upf_state": "UPFState_PendActive",
"route_modifier": 32,
"pool_received": false,
"echo_received": 253,
"management_ip": "X.X.X.X",
"host_id": "SEUD2413",
"ssh_ip": "Y.Y.Y.Y",
"force_nso_registration": false
},
The controller statistics show several states that the controller maintains for each UP, and each state has its own meaning.
BFD state - Indicates the BFD state between RCM and UP (do not treat it as the UPF state; it is purely the BFD state)
UPF state - The current state of the UPF in the RCM
UPF state received - UP state sent by UP towards RCM
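A mismatch between these two fields can be spotted programmatically. The next block is a minimal sketch, not an RCM tool: it assumes the controller statistics have already been saved as valid JSON (without the "message :" prefix and the inline annotations shown in the output), and it compares only the text after the last underscore in each state value. The function names, the second endpoint, the documentation-range addresses, and the value "UPFState_Active" are assumptions made for illustration.

```python
import json

# Trimmed sample that reuses the field names from the controller output.
# The addresses, the second endpoint, and "UPFState_Active" are illustrative assumptions.
SAMPLE = """
{
  "groups": [
    {
      "groupid": 1,
      "endpoints": [
        {"endpoint": "192.0.2.1", "bfd_state": "BFDState_UP",
         "upf_state_received": "UpfMsgState_Active",
         "upf_state": "UPFState_PendActive"},
        {"endpoint": "192.0.2.2", "bfd_state": "BFDState_UP",
         "upf_state_received": "UpfMsgState_Active",
         "upf_state": "UPFState_Active"}
      ]
    }
  ]
}
"""


def role(state):
    """Return the part after the last underscore, for example 'PendActive'."""
    return state.rsplit("_", 1)[-1]


def find_mismatches(stats):
    """List endpoints whose RCM-side state differs from the state reported by the UP."""
    found = []
    for group in stats.get("groups", []):
        for ep in group.get("endpoints", []):
            rcm_side = role(ep.get("upf_state", ""))
            up_side = role(ep.get("upf_state_received", ""))
            if rcm_side != up_side:
                found.append((group.get("groupid"), ep.get("endpoint"), up_side, rcm_side))
    return found


mismatches = find_mismatches(json.loads(SAMPLE))
for groupid, endpoint, up_side, rcm_side in mismatches:
    print(f"group {groupid} endpoint {endpoint}: UP reports {up_side}, RCM holds {rcm_side}")
```

With this sample, only the first endpoint is reported, because the UP reports Active while RCM still holds PendActive.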
- In general, whenever there is a switchover from the Active UP to the Standby UP, RCM must complete these procedures to ensure a smooth handover:
1. Checkpointmgr flush from old UP and checkpoint sync with new Active UP
2. Config flush
3. Config push
4. Managing UP states
Consider a UP pair that consists of UP-A (Active UP) and UP-B (Standby UP). When a switchover occurs, each UP first enters a Pending state before it reaches its final Active or Standby state.
UP-A (Active UP) --------------------- PendStandby ---------------------- Standby
UP-B (Standby UP) ------------------- PendActive ---------------------- Active
As shown, the procedures mentioned take place between RCM and the UP before the UP becomes Active or Standby, in order to ensure a smooth switchover.
- Whenever there is a switchover from Active to Standby and vice versa, RCM must perform a config push where it pushes the Active UP configuration in the UP which becomes Active, and pushes the Standby UP configuration in the UP which becomes Standby.
Note: RCM normally pushes the configuration of all currently Active UPs to the Standby UP, so that whenever this UP becomes Active it removes all the unwanted configuration.
- As soon as the switchover is initiated, RCM starts a timer of 15 minutes (the value varies based on the configuration). The switchover must complete within this time, and it concludes once the config push is completed.
- If, for any reason, the config push is not completed before the timer expires, RCM initiates a reload of the UP. This repeats until the config push is completed.
- While RCM pushes the configuration to the UP, it expects a configuration complete signal from the UP. Based on this signal, RCM understands that the config push is completed and considers the switchover successful.
These are the logs seen in the syslogs and the SNMP traps when the config push is complete.
Syslogs
Nov 13 12:01:09 INVIGJ02GNR1D1UP12CO evlogd: [local-60sec9.041] [cli 30000 debug] [1/0/10935 <cli:1010935> cliparse.c:571] [context: local, contextID: 1] [software internal system syslog] CLI command [user rcmadmin, mode [local]INVIGJ02GNR1D1UP12CO]: rcm-config-push-complete
Nov 13 12:01:09 INVIGJ02GNR1D1UP12CO evlogd: [local-60sec9.041] [cli 30000 debug] [1/0/10935 <cli:1010935> cliparse.c:571] [context: local, contextID: 1] [software internal system syslog] CLI command [user rcmadmin, mode [local]INVIGJ02GNR1D1UP12CO]: rcm-config-push-complete end-of-config
SNMP
Fri Mar 24 09:59:01 2023 Internal trap notification 1425 (RCMTCPConnect) Context Name: rcm
Fri Mar 24 09:59:01 2023 Internal trap notification 1421 (RCMConfigPushCompleteSent) Context Name: rcm
Fri Mar 24 09:59:01 2023 Internal trap notification 1426 (RCMChassisState) RCM Chassis State: (2) Chassis State Standby
Fri Mar 24 09:59:04 2023 Internal trap notification 1276 (BFDSessionUp) vpn n6 OurAddr fc00:10:5:132::10 NeighborAddr fc00:10:5:132::254 Session(6/1090552866), Diagnostic code 0 PhyPortId 0
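When the collected log files are large, a short script can locate these completion markers. This is only a sketch under stated assumptions: it treats a syslog line that ends with rcm-config-push-complete end-of-config and a trap line that contains RCMConfigPushCompleteSent as the markers, based solely on the excerpts in this document. The function name is hypothetical, and the sample lines are abridged copies of those excerpts.

```python
SYSLOG_MARKER = "rcm-config-push-complete end-of-config"
TRAP_MARKER = "(RCMConfigPushCompleteSent)"


def push_complete_lines(lines):
    """Return (line_number, kind) for each line that signals config push completion."""
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip()
        if text.endswith(SYSLOG_MARKER):
            hits.append((number, "syslog"))
        elif TRAP_MARKER in text:
            hits.append((number, "snmp"))
    return hits


# Abridged copies of the excerpts in this document.
sample = [
    "Nov 13 12:01:09 INVIGJ02GNR1D1UP12CO evlogd: CLI command [user rcmadmin, mode [local]INVIGJ02GNR1D1UP12CO]: rcm-config-push-complete",
    "Nov 13 12:01:09 INVIGJ02GNR1D1UP12CO evlogd: CLI command [user rcmadmin, mode [local]INVIGJ02GNR1D1UP12CO]: rcm-config-push-complete end-of-config",
    "Fri Mar 24 09:59:01 2023 Internal trap notification 1425 (RCMTCPConnect) Context Name: rcm",
    "Fri Mar 24 09:59:01 2023 Internal trap notification 1421 (RCMConfigPushCompleteSent) Context Name: rcm",
]

hits = push_complete_lines(sample)
for number, kind in hits:
    print(f"line {number}: config push complete ({kind})")
```

If no marker appears in the timeframe after a switchover, that is consistent with a config push that did not complete.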
- If any issue delays the config push completion long enough for the timer to expire, the UP gets stuck in the Pending state.
- Because RCM did not receive the config push completion status, it considers the switchover incomplete and keeps the UP in the Pending state.
- Different reasons for config push issues are explained in UP Reload Causes.
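The timer behaviour described in this scenario can be summarised with a small model. This is a simplified illustration of the logic in the text, not RCM code: the class, the function, and the per-attempt durations are hypothetical, and the 15-minute default stands in for whatever value is configured.

```python
from dataclasses import dataclass

SWITCHOVER_TIMER_S = 15 * 60  # 15 minutes per the text; the real value is configurable


@dataclass
class UpModel:
    name: str
    state: str        # "PendActive" or "PendStandby" while the switchover is in progress
    reloads: int = 0  # reloads initiated by RCM after a timer expiry


def settle(up, push_durations_s, timer_s=SWITCHOVER_TIMER_S):
    """Replay config push attempts until one finishes inside the timer.

    push_durations_s holds a hypothetical duration, in seconds, for each attempt.
    An attempt that overruns the timer counts as one RCM-initiated reload.
    """
    for duration in push_durations_s:
        if duration <= timer_s:
            if up.state.startswith("Pend"):
                up.state = up.state[len("Pend"):]  # PendActive -> Active
            return up
        up.reloads += 1
    return up  # still Pending: no attempt completed in time


up_b = settle(UpModel("UP-B", "PendActive"), [1200, 1000, 300])
print(up_b)

up_a = settle(UpModel("UP-A", "PendStandby"), [1200])
print(up_a)
```

In this model UP-B reaches Active only on the third attempt, after two reloads, while UP-A never completes a push in time and remains in PendStandby, which mirrors the stuck state described here.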
Workaround
1. As a temporary measure, you can force the config push complete signal from the UP towards RCM with this command in order to bring the UP back to the Active/Standby state:
rcm-config-push-complete end-of-config
2. This workaround is only temporary. You must still identify the issue that delays the config push, as described in UP Reload Causes.