This document describes several SD-WAN issues along with the relevant data that must be collected before you open a TAC case, in order to speed up troubleshooting and problem resolution. The document is divided into two main technical sections: vManage and Edge routers. Relevant outputs and command syntax are provided for each device type.
Cisco recommends that you have knowledge of these topics:
This document is not restricted to specific software and hardware versions.
The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, ensure that you understand the potential impact of any command.
The issues here are common problem conditions reported for vManage, along with useful outputs for each problem that must be collected in addition to one or more admin-tech files. For cloud-hosted controllers, a Technical Assistance Center (TAC) engineer can collect the required admin-tech outputs for the devices, based on the feedback in the Base Information Requested section, if you provide explicit consent. However, we recommend that you capture the admin-tech outputs yourself, along with the steps described here, to ensure the data they contain is relevant to the time of the problem. This is especially important if the problem is not persistent, meaning that it can disappear by the time TAC is engaged. For on-prem controllers, an admin-tech must also be included with each set of data here. For a vManage cluster, ensure you capture an admin-tech for each node in the cluster or for only the affected node(s).
Problem Report: Slowness in accessing the vManage GUI, latency when performing operations inside of the GUI, general slowness or sluggishness seen within vManage
Step 1. Capture 2-3 instances of a thread print and give each thread-print file a numerical suffix (note that the file path uses the username that you log in to vManage with). For example:
vManage# request nms application-server jcmd thread-print | save /home/<username>/thread-print.1
Step 2. Log in to vshell and run vmstat as below:
vManage# vshell
vManage:~$ vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 316172 1242608 5867144 0 0 1 22 3 5 6 1 93 0 0
0 0 0 316692 1242608 5867336 0 0 0 8 2365 4136 6 1 93 0 0
0 0 0 316204 1242608 5867344 0 0 0 396 2273 4009 6 1 93 0 0
0 0 0 316780 1242608 5867344 0 0 0 0 2322 4108 5 2 93 0 0
0 0 0 318136 1242608 5867344 0 0 0 0 2209 3957 9 1 90 0 0
0 0 0 318300 1242608 5867344 0 0 0 0 2523 4649 5 1 94 0 0
1 0 0 318632 1242608 5867344 0 0 0 44 2174 3983 5 2 93 0 0
0 0 0 318144 1242608 5867344 0 0 0 64 2182 3951 5 2 94 0 0
0 0 0 317812 1242608 5867344 0 0 0 0 2516 4289 6 1 93 0 0
0 0 0 318036 1242608 5867344 0 0 0 0 2600 4421 8 1 91 0 0
vManage:~$
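When several vmstat samples are collected, it can help to summarize them before you attach them to the case. The Python sketch below is illustrative only and is not a Cisco tool: the function name summarize_vmstat and the embedded sample are assumptions, and the parser assumes the 17-column layout shown in the output above.

```python
def summarize_vmstat(text):
    """Summarize vmstat output that uses the column layout shown above."""
    header = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The column-name line starts with "r" and contains the "id" column.
        if fields[0] == "r" and "id" in fields:
            header = fields
            continue
        # Data lines have one integer per column name.
        if header and len(fields) == len(header) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(header, map(int, fields))))
    if not rows:
        raise ValueError("no vmstat samples found")
    count = len(rows)
    return {
        "samples": count,
        "avg_idle_pct": sum(r["id"] for r in rows) / count,
        "avg_iowait_pct": sum(r["wa"] for r in rows) / count,
        "max_run_queue": max(r["r"] for r in rows),
        "swapping": any(r["si"] or r["so"] for r in rows),
    }


# Three of the sample lines from the output above.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 316172 1242608 5867144 0 0 1 22 3 5 6 1 93 0 0
0 0 0 316692 1242608 5867336 0 0 0 8 2365 4136 6 1 93 0 0
0 0 0 318136 1242608 5867344 0 0 0 0 2209 3957 9 1 90 0 0
"""

if __name__ == "__main__":
    print(summarize_vmstat(SAMPLE))
```

A low average idle percentage, a non-zero I/O wait, or any swap activity in the summary are the values worth pointing out in the case notes. The raw vmstat output must still be provided.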
Step 3. Collect additional details from the vshell:
vManage:~$ top (press '1' to get CPU counts)
vManage:~$ free -h
vManage:~$ df -kh
Step 4. Capture all NMS services diagnostics:
vManage# request nms application-server diagnostics
vManage# request nms configuration-db diagnostics
vManage# request nms messaging-server diagnostics
vManage# request nms coordination-server diagnostics
vManage# request nms statistics-db diagnostics
Problem Report: API calls fail to return any data or the correct data, general problems executing queries
Step 1. Check the memory available:
vManage:~$ free -h
total used free shared buff/cache available
Mem: 31Gi 24Gi 280Mi 60Mi 6.8Gi 6.9Gi
Swap: 0B 0B 0B
vManage:~$
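To turn the free -h output into a single figure, the following Python sketch computes the share of memory reported as available. It is a minimal example under stated assumptions: the names to_bytes and memory_summary are hypothetical, and the parser only understands the unit suffixes (B, Ki, Mi, Gi, Ti) and the column order shown in the output above.

```python
UNITS = {"B": 1, "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def to_bytes(value):
    """Convert a `free -h` style value such as '6.9Gi' or '0B' to bytes."""
    # Check two-letter suffixes before the bare "B" suffix.
    for suffix in sorted(UNITS, key=len, reverse=True):
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * UNITS[suffix]
    return float(value)  # no suffix: assume the value is already in bytes


def memory_summary(free_output):
    """Return total, available, and available percentage from the Mem: line."""
    for line in free_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "Mem:":
            total = to_bytes(fields[1])
            available = to_bytes(fields[6])  # last column: "available"
            return {
                "total_bytes": total,
                "available_bytes": available,
                "available_pct": 100.0 * available / total,
            }
    raise ValueError("no 'Mem:' line found")


# Sample copied from the output above.
SAMPLE = """\
total used free shared buff/cache available
Mem: 31Gi 24Gi 280Mi 60Mi 6.8Gi 6.9Gi
Swap: 0B 0B 0B
"""

if __name__ == "__main__":
    print(memory_summary(SAMPLE))
```

For the sample above, roughly 22 percent of the total memory is reported as available.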
Step 2. Capture 2-3 instances of a thread print with a 5-second gap in between, rename each thread-print file with a numerical designation after each run of the command (note the use of the username that you log into vManage with in the file path):
vManage# request nms application-server jcmd thread-print | save /home/<username>/thread-print.1
<WAIT 5 SECONDS>
vManage# request nms application-server jcmd thread-print | save /home/<username>/thread-print.2
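If you prefer to prepare the command sequence in advance, the short Python sketch below prints the numbered thread-print commands with the wait between runs. It only builds text to paste into the vManage CLI and does not connect to any device. The function name thread_print_commands and the example username admin are assumptions.

```python
def thread_print_commands(username, count=3, gap_seconds=5):
    """Return the CLI lines to run, in order, for `count` thread prints."""
    steps = []
    for number in range(1, count + 1):
        steps.append(
            "request nms application-server jcmd thread-print "
            f"| save /home/{username}/thread-print.{number}"
        )
        # Insert a reminder to wait between runs, but not after the last one.
        if number < count:
            steps.append(f"<WAIT {gap_seconds} SECONDS>")
    return steps


if __name__ == "__main__":
    # "admin" is a placeholder; use the username you log in to vManage with.
    for line in thread_print_commands("admin"):
        print(line)
```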
Step 3. Collect details for any active HTTP sessions:
vManage# request nms application-server jcmd gc-class-histo | i io.undertow.server.protocol.http.HttpServerConnection
Step 4. Provide these details:
1. API calls executed
2. Invocation frequency
3. Login method (for example, use of a single token to execute subsequent API calls, or use of basic authentication to execute the call and then log out)
4. Is the JSESSIONID being re-used?
Note: Starting with vManage software version 19.2, only token-based authentication is supported for API calls. For more details on token generation, timeout, and expiration, see this link.
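One way to answer the JSESSIONID question in Step 4 is to inspect the Cookie request headers that your API client sent. The Python sketch below assumes you can export those headers from client-side logging as a list of strings. That input format, and the function names jsessionid_usage and session_is_reused, are hypothetical and are not part of vManage.

```python
from http.cookies import SimpleCookie


def jsessionid_usage(cookie_headers):
    """Count how many requests carried each JSESSIONID value."""
    counts = {}
    for header in cookie_headers:
        cookie = SimpleCookie()
        cookie.load(header)  # parses "name=value; name2=value2"
        morsel = cookie.get("JSESSIONID")
        if morsel is not None:
            counts[morsel.value] = counts.get(morsel.value, 0) + 1
    return counts


def session_is_reused(cookie_headers):
    """True when every request that sent a JSESSIONID sent the same one."""
    counts = jsessionid_usage(cookie_headers)
    return len(counts) == 1 and sum(counts.values()) > 1


if __name__ == "__main__":
    # Made-up header values for illustration only.
    headers = ["JSESSIONID=abc123", "JSESSIONID=abc123; theme=dark"]
    print(jsessionid_usage(headers), session_is_reused(headers))
```

Many distinct JSESSIONID values over a short period suggest that the client logs in for every call instead of re-using a session, which is useful context for TAC.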
Problem Report: With DPI enabled, statistics processing can be slow or introduce slowness inside of the vManage GUI.
Step 1. Check the disk size allocated for DPI inside of vManage by navigating to Administration > Settings > Statistics Database > Configuration.
Step 2. Check the index health by running the following CLI command from vManage:
vManage# request nms statistics-db diagnostics
Step 3. Confirm if any API calls related to DPI stats are executed externally.
Step 4. Check disk I/O stats with the help of this CLI command from vManage:
vManage# request nms application-server diagnostics
Problem Report: Template push or device template update fails or times out.
Step 1. Capture the Config Preview and Intent config from vManage before you click the Configure Devices button (navigation example provided here):
Step 2. Enable viptela.enable.rest.log from the logsettings page (this must be disabled after capturing the required information):
https://<vManage IP>:8443/logsettings.html
Step 3. If the template push failure involves a NETCONF issue or error, enable viptela.enable.device.netconf.log in addition to the REST log from Step 2. Note that this log must also be disabled after the outputs from Step 4 and Step 5 are captured.
Step 4. Attempt to attach the failed template again from vManage and capture an admin-tech using this CLI (for a cluster, capture this on each node):
vManage# request admin-tech
Step 5. Provide screenshots from the task in vManage and the Config Diff to confirm the failure details along with any CSV files used for the template.
Step 6. Include details about the failure and task, including the time of the failed push, system-ip of the device that failed, and error message you see in the vManage GUI.
Step 7. If a template push failure happens with an error message reported for the configuration by the device itself, collect an admin-tech from the device as well.
Problem Report: Cluster instability leading to GUI timeouts, sluggishness, or other anomalies.
Step 1. Capture the output from server_configs.json from each vManage node in the cluster. For example:
vmanage# vshell
vmanage:~$ cd /opt/web-app/etc/
vmanage:/opt/web-app/etc$ more server_configs.json | python -m json.tool
{
    "clusterid": "",
    "domain": "",
    "hostsEntryVersion": 12,
    "mode": "SingleTenant",
    "services": {
        "cloudAgent": {
            "clients": {
                "0": "localhost:8553"
            },
            "deviceIP": "localhost:8553",
            "hosts": {
                "0": "localhost:8553"
            },
            "server": true,
            "standalone": false
        },
        "container-manager": {
            "clients": {
                "0": "169.254.100.227:10502"
            },
            "deviceIP": "169.254.100.227:10502",
            "hosts": {
                "0": "169.254.100.227:10502"
            },
            "server": true,
            "standalone": false
        },
        "elasticsearch": {
            "clients": {
                "0": "169.254.100.227:9300",
                "1": "169.254.100.254:9300",
                "2": "169.254.100.253:9300"
            },
            "deviceIP": "169.254.100.227:9300",
            "hosts": {
                "0": "169.254.100.227:9300",
                "1": "169.254.100.254:9300",
                "2": "169.254.100.253:9300"
            },
            "server": true,
            "standalone": false
        },
        "kafka": {
            "clients": {
                "0": "169.254.100.227:9092",
                "1": "169.254.100.254:9092",
                "2": "169.254.100.253:9092"
            },
            "deviceIP": "169.254.100.227:9092",
            "hosts": {
                "0": "169.254.100.227:9092",
                "1": "169.254.100.254:9092",
                "2": "169.254.100.253:9092"
            },
            "server": true,
            "standalone": false
        },
        "neo4j": {
            "clients": {
                "0": "169.254.100.227:7687",
                "1": "169.254.100.254:7687",
                "2": "169.254.100.253:7687"
            },
            "deviceIP": "169.254.100.227:7687",
            "hosts": {
                "0": "169.254.100.227:5000",
                "1": "169.254.100.254:5000",
                "2": "169.254.100.253:5000"
            },
            "server": true,
            "standalone": false
        },
        "orientdb": {
            "clients": {},
            "deviceIP": "localhost:2424",
            "hosts": {},
            "server": false,
            "standalone": false
        },
        "wildfly": {
            "clients": {
                "0": "169.254.100.227:8443",
                "1": "169.254.100.254:8443",
                "2": "169.254.100.253:8443"
            },
            "deviceIP": "169.254.100.227:8443",
            "hosts": {
                "0": "169.254.100.227:7600",
                "1": "169.254.100.254:7600",
                "2": "169.254.100.253:7600"
            },
            "server": true,
            "standalone": false
        },
        "zookeeper": {
            "clients": {
                "0": "169.254.100.227:2181",
                "1": "169.254.100.254:2181",
                "2": "169.254.100.253:2181"
            },
            "deviceIP": "169.254.100.227:2181",
            "hosts": {
                "0": "169.254.100.227:2888:3888",
                "1": "169.254.100.254:2888:3888",
                "2": "169.254.100.253:2888:3888"
            },
            "server": true,
            "standalone": false
        }
    },
    "vmanageID": "0"
}
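Once server_configs.json has been captured from every node, a quick consistency check can highlight nodes whose cluster membership differs. The Python sketch below is a hedged example, not a Cisco utility: the function names are hypothetical, and it assumes that every service with more than one entry under "hosts" is expected to list the same members on each node, regardless of ordering. Single-entry services (for example, those that point to localhost) are skipped because they are node-local in the sample above.

```python
import json


def load_config(path):
    """Load one captured server_configs.json file."""
    with open(path) as handle:
        return json.load(handle)


def compare_cluster_members(configs):
    """Compare clustered services across nodes.

    `configs` maps a node label to its parsed server_configs.json content.
    Returns a list of mismatch descriptions (empty when consistent).
    """
    labels = list(configs)
    ref_label = labels[0]
    problems = []
    for name, ref in configs[ref_label]["services"].items():
        ref_hosts = sorted(ref["hosts"].values())
        if len(ref_hosts) < 2:
            continue  # node-local service, nothing to compare
        for label in labels[1:]:
            other = configs[label]["services"].get(name)
            hosts = sorted(other["hosts"].values()) if other else []
            if hosts != ref_hosts:
                problems.append(
                    f"{name}: {label} lists {hosts}, {ref_label} lists {ref_hosts}"
                )
    return problems


if __name__ == "__main__":
    # Trimmed, illustrative data; real input comes from load_config().
    node1 = {"services": {"kafka": {"hosts": {
        "0": "169.254.100.227:9092", "1": "169.254.100.254:9092"}}}}
    node2 = {"services": {"kafka": {"hosts": {
        "0": "169.254.100.227:9092"}}}}
    for problem in compare_cluster_members({"node1": node1, "node2": node2}):
        print(problem)
```

Any mismatch reported by a check like this is worth noting in the case, together with the raw file from each node.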
Step 2. Capture details on which services are enabled or disabled for each node. For this, navigate to Administration > Cluster Management in the vManage GUI.
Step 3. Confirm underlay reachability on the cluster interface. For this, run ping <ip-address> from each vManage node in VPN 0 to the cluster interface IP of the other nodes.
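The list of peers to ping can be derived from the server_configs.json content captured in Step 1. The Python sketch below assumes that the "hosts" entries of a clustered service (elasticsearch is used here as an arbitrary choice) carry the cluster interface IP of each node and that "deviceIP" is the local node. Both points are assumptions based on the sample output, and the function name peer_ping_commands is hypothetical.

```python
def peer_ping_commands(server_config, service="elasticsearch"):
    """Build `ping <ip-address>` lines for every cluster peer of this node."""
    entry = server_config["services"][service]
    local_ip = entry["deviceIP"].split(":")[0]
    # Strip the port from each host entry and drop the local node.
    peers = {host.split(":")[0] for host in entry["hosts"].values()} - {local_ip}
    return [f"ping {ip}" for ip in sorted(peers)]


if __name__ == "__main__":
    # Trimmed copy of the elasticsearch section from the sample above.
    config = {"services": {"elasticsearch": {
        "deviceIP": "169.254.100.227:9300",
        "hosts": {
            "0": "169.254.100.227:9300",
            "1": "169.254.100.254:9300",
            "2": "169.254.100.253:9300",
        },
    }}}
    for line in peer_ping_commands(config):
        print(line)
```

Run the resulting commands from the vManage CLI on each node as described in this step.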
Step 4. Collect diagnostics from all NMS services for each vManage node in the cluster:
vManage# request nms application-server diagnostics
vManage# request nms configuration-db diagnostics
vManage# request nms messaging-server diagnostics
vManage# request nms coordination-server diagnostics
vManage# request nms statistics-db diagnostics
The issues here are common problem conditions reported for Edge devices, along with useful outputs for each that must be collected. Ensure that, for each problem, an admin-tech is collected for all necessary and relevant Edge devices. For cloud-hosted controllers, TAC can collect the required admin-tech outputs for the devices based on the feedback in the Base Information Requested section. However, as with vManage, it can be necessary to capture these before you open a TAC case to ensure the data they contain is relevant to the time of the problem. This is especially true if the problem is not persistent, meaning that it can disappear by the time TAC is engaged.
Problem Report: Control connection not forming from a vEdge/cEdge to one or more of the controllers
Step 1. Identify the local/remote error of the control connection failure:
Step 2. Confirm the state of the TLOC(s) and that any and all show 'up':
Step 3. For errors around timeouts or connection failures (i.e., DCONFAIL or VM_TMO), take control-plane captures on both the edge device as well as the controller in question:
vManage# tcpdump vpn 0 interface eth1 options "-vvvvvv host 192.168.44.6"
tcpdump -p -i eth1 -s 128 -vvvvvv host 192.168.44.6 in VPN 0
tcpdump: listening on eth1, link-type EN10MB (Ethernet), capture size 128 bytes
20:02:07.427064 IP (tos 0xc0, ttl 61, id 50139, offset 0, flags [DF], proto UDP (17), length 168)
192.168.44.6.12346 > 192.168.40.1.12346: UDP, length 140
20:02:07.427401 IP (tos 0xc0, ttl 64, id 37220, offset 0, flags [DF], proto UDP (17), length 210)
192.168.40.1.12346 > 192.168.44.6.12346: UDP, length 182
vEdge-INET-Branch2# tcpdump vpn 0 interface ge0/2 options "-vvvvvv host 192.168.40.1"
tcpdump -p -i ge0_2 -vvvvvv host 192.168.40.1 in VPN 0
tcpdump: listening on ge0_2, link-type EN10MB (Ethernet), capture size 262144 bytes
20:14:16.136276 IP (tos 0xc0, ttl 64, id 55858, offset 0, flags [DF], proto UDP (17), length 277)
10.10.10.1 > 192.168.40.1.12446: [udp sum ok] UDP, length 249
20:14:16.136735 IP (tos 0xc0, ttl 63, id 2907, offset 0, flags [DF], proto UDP (17), length 129)
192.168.40.1.12446 > 10.10.10.1.12346: [udp sum ok] UDP, length 101
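When a tcpdump capture is long, a small script can confirm whether packets were seen in both directions between the edge device and the controller. The Python sketch below is illustrative only: it assumes the verbose IPv4 output format shown above (an address line of the form source > destination:), and the names count_directions and is_bidirectional are hypothetical.

```python
import re
from collections import Counter

# Matches "A.B.C.D[.port] > A.B.C.D[.port]:" at the start of a line.
FLOW = re.compile(
    r"^\s*(\d+\.\d+\.\d+\.\d+)(?:\.\d+)? > (\d+\.\d+\.\d+\.\d+)(?:\.\d+)?:"
)


def count_directions(text):
    """Count packets per (source IP, destination IP), ignoring ports."""
    counts = Counter()
    for line in text.splitlines():
        match = FLOW.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


def is_bidirectional(counts, first_ip, second_ip):
    """True when at least one packet was seen in each direction."""
    return counts[(first_ip, second_ip)] > 0 and counts[(second_ip, first_ip)] > 0


# Sample copied from the vManage capture above.
SAMPLE = """\
20:02:07.427064 IP (tos 0xc0, ttl 61, id 50139, offset 0, flags [DF], proto UDP (17), length 168)
192.168.44.6.12346 > 192.168.40.1.12346: UDP, length 140
20:02:07.427401 IP (tos 0xc0, ttl 64, id 37220, offset 0, flags [DF], proto UDP (17), length 210)
192.168.40.1.12346 > 192.168.44.6.12346: UDP, length 182
"""

if __name__ == "__main__":
    counts = count_directions(SAMPLE)
    print(is_bidirectional(counts, "192.168.44.6", "192.168.40.1"))
```

Traffic seen in only one direction on either device is a useful detail to include with the captures.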
cEdge-Branch1#config-transaction
cEdge-Branch1(config)# ip access-list extended CTRL-CAP
cEdge-Branch1(config-ext-nacl)# 10 permit ip host 10.10.10.1 host 192.168.40.1
cEdge-Branch1(config-ext-nacl)# 20 permit ip host 192.168.40.1 host 10.10.10.1
cEdge-Branch1(config-ext-nacl)# commit
cEdge-Branch1(config-ext-nacl)# end
cEdge-Branch1#monitor capture CAP control-plane both access-list CTRL-CAP buffer size 10
cEdge-Branch1#monitor capture CAP start
cEdge-Branch1#show monitor capture CAP buffer brief
----------------------------------------------------------------------------
# size timestamp source destination dscp protocol
----------------------------------------------------------------------------
0 202 0.000000 192.168.20.1 -> 50.50.50.3 48 CS6 UDP
1 202 0.000000 192.168.20.1 -> 50.50.50.4 48 CS6 UDP
2 220 0.000000 50.50.50.3 -> 192.168.20.1 48 CS6 UDP
3 66 0.000992 192.168.20.1 -> 50.50.50.3 48 CS6 UDP
4 220 0.000992 50.50.50.4 -> 192.168.20.1 48 CS6 UDP
5 66 0.000992 192.168.20.1 -> 50.50.50.4 48 CS6 UDP
6 207 0.015991 50.50.50.1 -> 12.12.12.1 48 CS6 UDP
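The brief buffer output from the cEdge can be checked in a similar way for flows that never receive a reply. The following Python sketch assumes the column layout shown above (packet number, size, timestamp, source, an arrow, destination). The function names parse_brief and one_way_flows are hypothetical.

```python
def parse_brief(text):
    """Extract (source, destination) pairs from the brief buffer output."""
    packets = []
    for line in text.splitlines():
        fields = line.split()
        # Data lines start with the packet number and contain "->".
        if len(fields) >= 6 and fields[0].isdigit() and fields[4] == "->":
            packets.append((fields[3], fields[5]))
    return packets


def one_way_flows(packets):
    """Return the pairs for which no packet was captured in the reverse direction."""
    seen = set(packets)
    return sorted(pair for pair in seen if (pair[1], pair[0]) not in seen)


# Sample copied from the capture above.
SAMPLE = """\
# size timestamp source destination dscp protocol
0 202 0.000000 192.168.20.1 -> 50.50.50.3 48 CS6 UDP
1 202 0.000000 192.168.20.1 -> 50.50.50.4 48 CS6 UDP
2 220 0.000000 50.50.50.3 -> 192.168.20.1 48 CS6 UDP
3 66 0.000992 192.168.20.1 -> 50.50.50.3 48 CS6 UDP
4 220 0.000992 50.50.50.4 -> 192.168.20.1 48 CS6 UDP
5 66 0.000992 192.168.20.1 -> 50.50.50.4 48 CS6 UDP
6 207 0.015991 50.50.50.1 -> 12.12.12.1 48 CS6 UDP
"""

if __name__ == "__main__":
    print(one_way_flows(parse_brief(SAMPLE)))
```

In the sample, only the flow from 50.50.50.1 to 12.12.12.1 has no packet in the reverse direction within the captured window. A short capture can miss replies, so treat the result as a hint and not as proof of a failure.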
Step 4. For other errors observed in the control connection history outputs, and for more details on the issues described, refer to the following guide.
Problem Report: One or more control connections flap between a vEdge/cEdge and one or more controllers. This can be frequent, intermittent, or random in nature.
Problem Report: BFD session is down or is flapping up and down between two edge devices.
Step 1. Collect the state of the BFD session on each device:
Step 2. Collect Rx and Tx packet counts on each edge router:
Step 3. If the counters do not increase for the BFD session on one end of the tunnel in the outputs above, take captures using ACLs to confirm whether packets are received locally. More details on this, along with other validations that can be done, can be found here.
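To decide whether the counters increase, compare two samples taken some seconds apart on each edge router. The Python sketch below shows the comparison only. The counter names (tx_pkts, rx_pkts), the numeric values, and the function name stalled_counters are hypothetical placeholders, so substitute the values you read from the actual device output.

```python
def stalled_counters(first_sample, second_sample):
    """Return the names of counters that did not increase between samples."""
    return sorted(
        name
        for name, value in first_sample.items()
        # A counter missing from the second sample is treated as not increased.
        if second_sample.get(name, value) <= value
    )


if __name__ == "__main__":
    # Made-up values for illustration: Tx increases, Rx does not.
    first = {"tx_pkts": 100, "rx_pkts": 50}
    second = {"tx_pkts": 130, "rx_pkts": 50}
    print(stalled_counters(first, second))
```

A receive counter that stalls while the transmit counter increases points to the direction in which a capture is most useful.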
Problem Report: Device reloaded unexpectedly and problems with power are ruled out. The device shows indications of a potential crash.
Step 1. Check the device to confirm if a crash or unexpected reload was observed:
Step 2. If confirmed, capture an admin-tech from the device through vManage by navigating to Tools > Operational Commands. Once there, select the Options button for the device and select Admin Tech. Ensure all check boxes are checked, which will include all logs and core files on the device.
Problem Report: Application does not work or HTTP pages do not load, slowness/latency in performance, failures after policy or configuration changes
Step 1. Identify the source/destination IP pair for an application or flow exhibiting the problem.
Step 2. Determine all Edge devices in the path and collect an admin-tech from each through vManage.
Step 3. Take a packet capture on the edge devices at each site for this flow when the problem is seen:
monitor capture CAP interface GigabitEthernet0/0/0 both access-list BROKEN-FLOW buffer size 10
monitor capture CAP start
show monitor capture CAP parameter
show monitor capture CAP buffer [brief]
monitor capture CAP export bootflash:cEdge1-Broken-Flow.pcap
debug platform packet-trace packet 2048 fia-trace
debug platform packet-trace copy packet input l3 size 2048
debug platform condition ipv4 access-list BROKEN-FLOW both
debug platform condition start
show platform packet-trace summary
show platform packet-trace packet all | redirect bootflash:cEdge1-PT-OUTPUT.txt
Step 4. If possible, repeat Step 3 in a working scenario for comparison.
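The captures in Step 3 reference an access list named BROKEN-FLOW that matches the source/destination pair identified in Step 1. As a sketch, the Python helper below generates the access-list lines for one IPv4 host pair, following the same two-entry pattern as the CTRL-CAP example earlier in this document. The function name flow_acl and the example addresses are assumptions; review the generated lines before you apply them to a device.

```python
import ipaddress


def flow_acl(name, source, destination):
    """Return extended ACL lines that match one IPv4 flow in both directions."""
    src = ipaddress.ip_address(source)  # raises ValueError on bad input
    dst = ipaddress.ip_address(destination)
    if src.version != 4 or dst.version != 4:
        raise ValueError("this sketch only handles IPv4 hosts")
    return [
        f"ip access-list extended {name}",
        f" 10 permit ip host {src} host {dst}",
        f" 20 permit ip host {dst} host {src}",
    ]


if __name__ == "__main__":
    # Placeholder addresses; use the pair identified in Step 1.
    for line in flow_acl("BROKEN-FLOW", "10.1.1.10", "172.16.5.20"):
        print(line)
```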
Tip: If there is no other way to copy the corresponding files off the cEdge directly, the files can be copied to vManage first using the method described here. Run the command on vManage:
request execute scp -P 830 <username>@<cEdge system-IP>:/bootflash/<filename> .
The file is then stored in the /home/<username>/ directory for the username you used to log in to vManage. From there, you can use Secure Copy Protocol (SCP) or Secure File Transfer Protocol (SFTP) to copy the file off vManage with a third-party SCP/SFTP client or a Linux/Unix machine CLI with OpenSSH utilities.
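When several files must be copied, the commands can be prepared from the template shown in the tip. The Python sketch below only builds the command strings for you to paste into the vManage CLI. The function name scp_commands, the username admin, and the system IP 10.0.0.1 are placeholders, and the file names are the ones used in the Step 3 examples.

```python
import ipaddress


def scp_commands(username, system_ip, filenames):
    """Build one `request execute scp` line per file on the cEdge bootflash."""
    ipaddress.ip_address(system_ip)  # raises ValueError if not a valid IP
    return [
        f"request execute scp -P 830 {username}@{system_ip}:/bootflash/{name} ."
        for name in filenames
    ]


if __name__ == "__main__":
    files = ["cEdge1-Broken-Flow.pcap", "cEdge1-PT-OUTPUT.txt"]
    for line in scp_commands("admin", "10.0.0.1", files):
        print(line)
```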
Revision | Publish Date | Comments |
---|---|---|
1.0 | 10-Sep-2020 | Initial Release |