Introduction
This document describes how to troubleshoot Evolved GPRS Tunnelling Protocol (EGTP) path failures caused by a mismatch in restart counter values between the SGSN/MME and the GGSN/Serving Gateway or PDN Gateway (SPGW).
Troubleshooting Commands
show egtpc peers interface
show egtpc peers path-failure-history
show egtpc statistics path-failure-reasons
show egtp-service all
show egtpc sessions
show egtpc statistics
egtpc test echo gtp-version 2 src-address <source node IP address> peer-address <remote node IP address>
For more details about these commands, refer to this link:
https://www.cisco.com/c/en/us/support/docs/wireless-mobility/gateway-gprs-support-node-ggsn/119246-technote-ggsn-00.html
Analysis
The logs and statistics show that the restart counter value at the Mobility Management Entity (MME) end is 11, and at the EPG end it is 12.
You can observe the traps as mentioned here:
Internal trap notification 1112 (EGTPCPathFail) context s11mme, service s11-mme, interface type mme, self address <X.X.X.X>, peer address <Y.Y.Y.Y>, peer old restart counter 4, peer new restart counter 4, peer session count 240, failure reason no-response-from-peer, path failure detection Enabled.
Internal trap notification 1112 (EGTPCPathFail) context XGWin, service EGTP1, interface type pgw-ingress, self address <X.X.X.X>, peer address <Y.Y.Y.Y>, peer old restart counter 54, peer new restart counter 12, peer session count 1107240, failure reason restart-counter-change
Internal trap notification 1112 (EGTPCPathFail) context XGWin, service EGTP1, interface type pgw-ingress, self address <X.X.X.X>, peer address <Y.Y.Y.Y>, peer old restart counter 12, peer new restart counter 54, peer session count 1107207, failure reason create-sess-restart-counter-change
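When many such traps are collected, it can help to pull out the old and new restart counters and the failure reason programmatically. This is a minimal Python sketch, not a Cisco tool. The regular expression is an assumption derived only from the three sample trap lines above, and the function name parse_path_fail_trap is hypothetical.

```python
import re

# Assumed layout, inferred only from the sample EGTPCPathFail traps above.
TRAP_RE = re.compile(
    r"\(EGTPCPathFail\) context (?P<context>\S+), service (?P<service>\S+), "
    r"interface type (?P<interface>\S+),.*?"
    r"peer old restart counter (?P<old_rc>\d+), "
    r"peer new restart counter (?P<new_rc>\d+), "
    r"peer session count (?P<sessions>\d+), "
    r"failure reason (?P<reason>[\w-]+)"
)


def parse_path_fail_trap(line):
    """Return a dict with the trap fields, or None if the line does not match."""
    match = TRAP_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    for key in ("old_rc", "new_rc", "sessions"):
        info[key] = int(info[key])
    # True when the peer advertised a different restart counter than the stored one.
    info["rc_changed"] = info["old_rc"] != info["new_rc"]
    return info
```

A trap with failure reason no-response-from-peer and identical old and new counters yields rc_changed as False, which separates plain reachability problems from restart counter mismatches.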
The vendor Gateway (GW) does not accept a lower restart counter value from the Serving GPRS Support Node (SGSN) when the restart counter changes. If the vendor GW has stored a higher (old) value and the Cisco SGSN sends a lower value after a node reload, the vendor GW does not accept it.
Note: As per TS 29.060:
1. If the SGSN is in contact with the Gateway GPRS Support Node (GGSN) for the first time or has recently restarted without indicating the new Restart Counter value to the GGSN, it includes a Recovery information element in the Create Packet Data Protocol (PDP) Context Request. The GGSN that receives a Recovery information element in the Create PDP Context Request handles it in the same way as when it receives an Echo Response message. The Create PDP Context Request message is considered a valid activation request for the PDP context included in the message.
2. The GGSN includes the Recovery information element in the Create PDP Context Response if the GGSN is in contact with the SGSN for the first time, or if the GGSN has restarted recently and the new Restart Counter value has not yet been indicated to the SGSN. The SGSN that receives the Recovery information element handles it in the same way as when it receives an Echo Response message. However, it considers the PDP context being created as active if the response indicates successful context activation at the GGSN.
3. The GTP interface uses a restart counter in order to track the number of restarts. As per TS 23.060, GTP nodes must use persistent storage in order to keep track of their local GTP restart counters, so these restart counters are expected to only ever increase. However, if a peer node detects a decrease in the restart counter, the GTP node behavior is described in section '18 GTP-C based restart procedures' of TS 23.007. If the restart counter value previously stored for a peer is larger than the value received in the Echo Response message or the GTP-C message, taking the integer rollover into account, this indicates a possible race condition (a newer message arrived before an older one). The newly received restart counter value is discarded and an error can be logged. In other words, when the GTP node detects a lower restart counter from a peer, it never records that new restart counter.
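The comparison described in item 3 can be sketched in Python as shown here. This is an illustration only, not the StarOS or vendor GW implementation. The restart counter is carried in one octet, so it wraps at 256. The half-range window used to decide whether a value is "lower, taking rollover into account" is an assumption, since the text above does not define the exact rule. The function name classify_restart_counter and the window parameter are hypothetical.

```python
RC_MODULUS = 256  # the restart counter is a one-octet value, so it wraps at 256


def classify_restart_counter(stored, received, window=RC_MODULUS // 2):
    """Classify a received peer restart counter against the stored one.

    Returns "unchanged", "restart" (peer counter moved forward, store it)
    or "discard" (counter appears to have gone backward, keep the old one).
    The window is an ASSUMED threshold: forward steps smaller than it are
    treated as increases, everything else as a decrease.
    """
    for value in (stored, received):
        if not 0 <= value < RC_MODULUS:
            raise ValueError("restart counter must be in the range 0..255")
    delta = (received - stored) % RC_MODULUS  # forward distance with rollover
    if delta == 0:
        return "unchanged"
    if delta < window:
        return "restart"
    return "discard"
```

With the values from the traps above, a stored counter of 54 and a received counter of 12 is classified as "discard", which matches the vendor GW behavior described in this section, while a stored 255 followed by a received 0 is treated as a normal rollover.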
StarOS Perspective
From the StarOS end, you can explicitly change the restart counter (RC) value in the file /flash/restart_file_cntr.txt, which is done at the time of the upgrade.
Consistent with this behavior, the current configuration showed that the MME RC value was lower than the RC value stored at the vendor GW. In order to address the issue, the RC value at the vendor GW node was modified.
After the RC value was changed, the EGTPC path failures stopped, but the session count did not increase and the EGTPC links still showed as inactive.
These are the commands that were used during troubleshooting:
show sgtp-service all | grep "restart" ----------------- to check RC value
[local]Nodename# show egtp-service all | more
Service name : egtpc_sv_service
Service-Id : 5
Context : SGs
Interface Type : mme
Status : STARTED
Restart Counter : 11 ----------------- RC value to verify
Max Remote Restart Counter Change : 255
Message Validation Mode : Standard
GTPU-Context :
GTPC Retransmission Timeout : 5000 (milliseconds)
GTPC Maximum Request Retransmissions : 4
GTPC IP QOS DSCP value : 10
GTPC Echo : Enabled
GTPC Echo Mode : Default
[local]Nodename# show egtpc peers ------------ To check link status
Sunday February 05 15:31:00 IST 2023
+----Status: (I) - Inactive (A) - Active
|
|+---GTPC Echo: (D) - Disabled (E) - Enabled
||
||+--Restart Counter Sent: (S) - Sent (N) - Not Sent
|||
|||+-Peer Restart Counter: (K) - Known (U) - Unknown
||||
||||+-Type of Node: (S) - SGW (P) - PGW
||||| (M) - MME (G) - SGSN
||||| (L) - LGW (E) - ePDG
||||| (C) - CGW (B) - MBMS
||||| (U) - Unknown
|||||
||||| Service Restart--------+ No. of
||||| ID Counter | restarts
||||| | | | Current Max
vvvvv v Peer Address v v sessions sessions LCI OCI
----- --- --------------------------------------- --- --- ----------- ------------------
IDSKS 10 X.X.X.X 91 0 0 0 X X
IDNKS 11 Y.Y.Y.Y 4 95 0 34005 X X
IDNKS 11 Z.Z.Z.Z 10 103 0 16805 X X
IDNKS 11 A.A.A.A 104 95 0 7250 X X
AESKS 11 B.B.B.B 0 0 4004 47649 X X
AESKS 11 C.C.C.C 0 0 4053 46571 X X
AESKS 11 D.D.D.D 0 0 4026 46734 X X
In the output above, the peers marked inactive (I) have no current sessions.
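On a node with many peers, the inactive entries can be filtered from a saved copy of this output. This is a minimal Python sketch that assumes each data row has exactly the nine whitespace-separated columns shown above (flags, service ID, peer address, restart counter, number of restarts, current sessions, max sessions, LCI, OCI). The function names parse_peer_row and inactive_peers are hypothetical.

```python
def parse_peer_row(row):
    """Parse one data row of 'show egtpc peers'; return None for legend lines."""
    fields = row.split()
    if len(fields) != 9 or len(fields[0]) != 5:
        return None
    flags, service_id, address, rc, restarts, current, maximum, _lci, _oci = fields
    numeric = (service_id, rc, restarts, current, maximum)
    if not all(value.isdigit() for value in numeric):
        return None
    return {
        "active": flags[0] == "A",         # Status: (I) Inactive, (A) Active
        "echo_enabled": flags[1] == "E",   # GTPC Echo: (D) Disabled, (E) Enabled
        "rc_sent": flags[2] == "S",        # Restart Counter Sent: (S) / (N)
        "peer_rc_known": flags[3] == "K",  # Peer Restart Counter: (K) / (U)
        "node_type": flags[4],
        "service_id": int(service_id),
        "address": address,
        "restart_counter": int(rc),
        "restarts": int(restarts),
        "current_sessions": int(current),
        "max_sessions": int(maximum),
    }


def inactive_peers(output):
    """Return the parsed rows whose status flag is not Active."""
    rows = (parse_peer_row(line) for line in output.splitlines())
    return [row for row in rows if row is not None and not row["active"]]
```

Legend and separator lines are skipped because they do not have nine columns with numeric counters, so the whole command output can be passed to inactive_peers unchanged.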
Further, check the echo request/response (to be run in hidden mode):
egtpc test echo gtp-version 2 src-address <MME end IP> peer-address <EPG end IP>
This is the output after the restart counter value is corrected and set to the same value as on the MME for the S11 interface of the affected EGTP peer. The echo request/response succeeds, but the link is still inactive.
[s11mme]Nodename# egtpc test echo gtp-version 2 src-address <X.X.X.X> peer-address <Y.Y.Y.Y>
Sunday February 05 16:22:42 IST 2023
EGTPC test echo
---------------
Peer: X.X.X.X Tx/Rx: 1/1 RTT(ms): 1 (COMPLETE) Recovery: 10 (0x0A)
However, the same change does not work as expected on the other affected GWs. The echo request/response still fails, as shown here.
[s11mme]Nodename# egtpc test echo gtp-version 2 src-address <X.X.X.X> peer-address <Y.Y.Y.Y>
Sunday February 05 16:46:11 IST 2023
EGTPC test echo
---------------
Peer: X.X.X.X Tx/Rx: 1/0 RTT(ms): 0 (FAILURE)
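When the echo test is repeated across several peers, the result lines can be checked with a short script. This is a Python sketch that assumes the result line always has the layout seen in the two outputs above. The Recovery field is treated as optional because it is absent in the failure case. The function name parse_echo_result is hypothetical.

```python
import re

# Assumed layout, inferred only from the two sample result lines above.
ECHO_RE = re.compile(
    r"Peer:\s*(?P<peer>\S+)\s+Tx/Rx:\s*(?P<tx>\d+)/(?P<rx>\d+)\s+"
    r"RTT\(ms\):\s*(?P<rtt>\d+)\s+\((?P<status>\w+)\)"
    r"(?:\s+Recovery:\s*(?P<recovery>\d+))?"
)


def parse_echo_result(line):
    """Return the echo test fields as a dict, or None if the line does not match."""
    match = ECHO_RE.search(line)
    if match is None:
        return None
    recovery = match["recovery"]
    return {
        "peer": match["peer"],
        "tx": int(match["tx"]),
        "rx": int(match["rx"]),
        "rtt_ms": int(match["rtt"]),
        "status": match["status"],
        # Restart counter reported by the peer; None when no response arrived.
        "recovery": int(recovery) if recovery is not None else None,
    }
```

The recovery value returned for a successful echo is the restart counter the peer currently advertises, so it can be compared directly with the value expected for that peer.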
Workaround
1. In order to fix this issue, take note of the current restart counter in /flash/restart_file_cntr.txt before the VNF deactivation. Later, when the VNF is activated with the new software, log in to the CF and update the file /flash/restart_file_cntr.txt with the old restart counter. Then, as in a normal upgrade procedure, reload the VNF with the day-N configuration.
2. Modify the value in /flash/restart_file_cntr.txt to the required value and reload the node with the current configuration.
Note: As an initial step, you can also try an SGTPC restart once.