Use the BGP "Slow Peer" Feature to Resolve Slow Peer Issues

Updated: June 16, 2015

Document ID: 119000

Contents

Introduction

Background Information

Update Groups

Problem

Solution

Detection

Slow Peer Identification

Movement

Movement Without Slow Peer Feature

Static Slow Peer Movement

Dynamic Slow Peer Movement

Recovery

Clear the Slow Peer Status

Introduction

This document describes how to resolve a slow peer problem with the use of the Border Gateway Protocol (BGP) slow peer feature, which identifies a slow peer in a BGP update group and can move the slow peer out of the update group permanently or temporarily.

Background Information

This section provides an overview of the slow peer feature and the use of update groups.

Update Groups

The slow peer feature operates on update groups. An update group is a dynamically formed group of BGP peers that share the same outbound policy. The benefit of update groups is that the group policy is used to format messages once, and the formatted messages are then replicated and transmitted to the members of the group. This is more efficient than formatting BGP updates for each peer separately.

Update group membership is calculated dynamically: if the outbound policy of a peer changes, the peer is moved to the update group that matches its new policy. The update groups are formed per Address Family (AF).

Here is an example of two BGP peers in different update groups for AF IPv4 unicast, but with the same update group for the AF VPNv4:

R2#show ip bgp update-group
BGP version 4 update-group 1, external, Address Family: IPv4 Unicast
  Has 1 member (* indicates the members currently being sent updates):
   10.1.3.4
 
BGP version 4 update-group 2, external, Address Family: IPv4 Unicast
  Has 1 member (* indicates the members currently being sent updates):
   10.1.2.3
 
R2#show ip bgp vpnv4 all update-group
BGP version 4 update-group 1, external, Address Family: VPNv4 Unicast
  Has 2 members (* indicates the members currently being sent updates):
   10.1.2.3         10.1.3.4

The update group becomes more efficient as the number of BGP peers that are included in the update group increases. Typically, internal BGP (iBGP) peers have the same outbound policy. For iBGP, a Route Reflector (RR) can have many iBGP peers; thus, it will have large update groups. Provider Edge (PE) routers can have many external BGP (eBGP) peers towards the Customer Edge (CE) routers in one Virtual Routing and Forwarding (VRF) instance. The PE routers can have large update groups as well for the peerings with CE routers on the VRF interfaces.
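
The grouping rule described above can be sketched in a few lines of Python. This is only an illustrative model, not how the router implements it: the `Peer` type, the `outbound_policy` tuple, and the route-map names `A` and `B` are hypothetical stand-ins for whatever outbound settings the peers actually have. It reproduces the example above, where two peers are in separate groups for IPv4 unicast but share one group for VPNv4.

```python
from collections import defaultdict
from typing import NamedTuple


class Peer(NamedTuple):
    address: str
    address_family: str
    outbound_policy: tuple  # hypothetical, hashable summary of outbound settings


def form_update_groups(peers):
    """Group peers that share an address family and an outbound policy."""
    groups = defaultdict(list)
    for peer in peers:
        groups[(peer.address_family, peer.outbound_policy)].append(peer.address)
    return dict(groups)


peers = [
    Peer("10.1.3.4", "IPv4 Unicast", ("route-map", "A")),  # assumed policy
    Peer("10.1.2.3", "IPv4 Unicast", ("route-map", "B")),  # assumed policy
    Peer("10.1.3.4", "VPNv4 Unicast", ()),
    Peer("10.1.2.3", "VPNv4 Unicast", ()),
]
groups = form_update_groups(peers)
for (address_family, policy), members in groups.items():
    print(address_family, policy, members)
```

Because the key includes the address family, the same neighbor can land in different groups per AF, which matches the per-AF behavior described earlier.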

Problem

A slow peer is a peer in an update group that cannot keep up with the rate at which the router generates BGP update messages over a prolonged period of time (on the order of minutes). The cause can be persistent network issues, such as packet loss, heavily loaded links, or throughput issues with the BGP session. Alternatively, the BGP peer might be heavily loaded in terms of CPU and unable to service the TCP connection at the required speed.

Slow peers affect the BGP convergence of the complete update group. If one BGP peer is slow, it causes the entire update group to slow down. The result is that the other update group members will have slower convergence as well. For this reason, the issue should be resolved.

You can identify the slow peer and move it out of the update group. In order to complete this task, you can change the outbound policy for that BGP peer; however, this is a manual task. You must first identify the peer that is slow, and then move it out of the update group. The slow peer feature can do this automatically, so that no user intervention is required.

Solution

There are three parts to the slow peer feature:

Detection of the slow peer
Movement of the slow peer into a slow update group
Recovery of the slow peer (which moves the recovered peer back to its original update group)

These processes are described in further detail in the sections that follow.

Detection

The slow peer feature detects slow peers in an update group. Each update group has a caching queue, where formatted BGP updates are temporarily stored before transmission.

Here is an example of such an update group cache:

R2#show ip bgp replication

                                                                    Current    Next
Index  Members          Leader       MsgFmt    MsgRepl     Csize    Version Version
    1        1        10.1.1.1            0          0    0/100           6/0
    2        3        10.1.2.3            2          6    0/1000          6/0
    3        1        10.1.2.6            3          0    0/100           6/0

The size of the cache is dynamically calculated and depends on:

The number of peers in the update group
The installed system memory
The type of peers in the update group
The type of AF

The number of formatted BGP updates that await transmission can build up in an update group when one peer (the slow one) does not acknowledge the BGP messages as quickly as the other members. When the cache limit is reached, the group has no remaining quota to queue new messages. No new messages can be formatted until the cache is reduced (that is, until the slow peer acknowledges some messages). This prevents the router from sending new messages (updates or withdrawals) to the faster members of the group, which slows down the convergence of all peers in the update group.
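
To make the blocking effect concrete, here is a toy simulation in Python. It is a deliberately simplified model under stated assumptions: time advances in discrete ticks, each peer acknowledges a fixed number of messages per tick, and a message leaves the cache only when every member has acknowledged it. The function name, the rates, and the tick model are all hypothetical and are not taken from the router implementation.

```python
def simulate(cache_limit, ack_rates, ticks):
    """Return how many messages each peer has acknowledged after `ticks` rounds.

    ack_rates maps a peer name to the number of messages it can
    acknowledge per tick (an assumed, fixed rate).
    """
    formatted = 0  # total messages formatted so far
    acked = {peer: 0 for peer in ack_rates}
    for _ in range(ticks):
        # A message stays cached until the slowest member has acknowledged it.
        cached = formatted - min(acked.values())
        # New messages are formatted only while the cache has free quota.
        formatted += cache_limit - cached
        for peer, rate in ack_rates.items():
            acked[peer] = min(formatted, acked[peer] + rate)
    return acked


with_slow_peer = simulate(100, {"fast": 50, "slow": 1}, ticks=10)
without_slow_peer = simulate(100, {"fast": 50, "fast2": 50}, ticks=10)
print(with_slow_peer["fast"], without_slow_peer["fast"])
```

In this model, the fast peer is limited to roughly the pace of the slow one as soon as the cache fills, even though it could accept far more messages per tick.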

In order for the slow peer feature to identify a slow peer, it refers to the BGP update timestamps and peer TCP parameters.

The slow peer detection is disabled by default. In order to enable the slow peer detection, use one of these methods:

Enable the feature for the BGP process (can be configured from AF/VRF):
```
bgp slow-peer detection [threshold <seconds>]

[no] bgp slow-peer detection
```
Note: The threshold value can range between 120 and 3,600 seconds, and the default value is 300 seconds.

Enable the feature per peer:

neighbor {<nbr-addr>/<peer-grp-name>} slow-peer detection [threshold <seconds>]

[no] neighbor {<nbr-addr>/<peer-grp-name>} slow-peer detection

Enable the feature via peer policy template:

slow-peer detection [threshold <seconds>]

[no] slow-peer detection
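
If you generate configuration with a script, the three command forms and the threshold range can be captured in a small helper. The following Python sketch is hypothetical (the function name, the `scope` values, and the parameters are assumptions for illustration); only the command strings and the 120 to 3,600 second range come from the syntax and note above.

```python
MIN_THRESHOLD, MAX_THRESHOLD = 120, 3600  # seconds, per the note above


def detection_command(scope="global", target=None, threshold=None):
    """Build a slow-peer detection command line for the given scope.

    scope: "global" (BGP process), "neighbor" (per peer or peer group),
           or "template" (peer policy template).
    """
    if scope == "global":
        command = "bgp slow-peer detection"
    elif scope == "neighbor":
        if not target:
            raise ValueError("a neighbor address or peer-group name is required")
        command = f"neighbor {target} slow-peer detection"
    elif scope == "template":
        command = "slow-peer detection"
    else:
        raise ValueError(f"unknown scope: {scope}")
    if threshold is not None:
        if not MIN_THRESHOLD <= threshold <= MAX_THRESHOLD:
            raise ValueError("threshold must be between 120 and 3600 seconds")
        command += f" threshold {threshold}"
    return command


print(detection_command())
print(detection_command("neighbor", "10.1.6.7", threshold=600))
```

When no threshold is passed, the helper omits the keyword so that the router default applies.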

When a slow peer is detected, a syslog message similar to this is printed:

%BGP-5-SLOWPEER_DETECT: Neighbor IPv4 Unicast 10.1.6.7 has been detected
 as a slow peer.

You can enter these show commands in order to view the slow peers:

show ip bgp summary slow
show ip bgp neighbors slow
show ip bgp update-group summary slow

Here is an example show command output when the slow keyword is used:

R2#show ip bgp update-group summary slow
Summary for  Update-group 1, Address Family IPv4 Unicast
Summary for  Update-group 2, Address Family IPv4 Unicast
Summary for  Update-group 3, Address Family IPv4 Unicast
Summary for  Update-group 4, Address Family IPv4 Unicast
BGP router identifier 10.1.6.2, local AS number 2
BGP table version is 966013, main routing table version 966013
BGP main update table version 966013
50000 network entries using 6050000 bytes of memory
50000 path entries using 2600000 bytes of memory
5001/5000 BGP path/bestpath attribute entries using 700140 bytes of memory
5000 BGP AS-PATH entries using 183632 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
BGP using 9533772 total bytes of memory
BGP activity 208847/158847 prefixes, 508006/458006 paths, scan interval 60 secs
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.1.6.7        4     7     165   50309        0    0  100 00:10:35        0

As shown in the output, the peer 10.1.6.7 is a slow peer for the AF IPv4 unicast. The other AFs do not show any slow peers.

In order to verify whether the detection timer is currently running, and to view the time that remains, enter this command:

R2#show ip bgp update-group
BGP version 4 update-group 3, external, Address Family: IPv4 Unicast
  BGP Update version : 116013/0, messages 164 queue 164, not converged
  Private AS number removed from updates to this neighbor
  Update messages formatted 5948, replicated 11589
  Number of NLRIs in the update sent: max 249, min 1
  Minimum time between advertisement runs is 30 seconds
  Slow-peer detection timer (expires in 111 seconds)
  Has 3 members (* indicates the members currently being sent updates):
   10.1.4.5         10.1.5.6         10.1.6.7

As shown in the example output, the detection timer has started. The detection timer starts when the update group cache is full.

In this example, you can see that a slow peer is detected, but it only moves out of the update group after the slow peer detection timer expires:

R2#show ip bgp update-group
...
BGP version 4 update-group 3, external, Address Family: IPv4 Unicast
  BGP Update version : 516013/566013, messages 357 queue 357, not converged
  Private AS number removed from updates to this neighbor
  Update messages formatted 27044, replicated 53645
  Number of NLRIs in the update sent: max 249, min 0
  Minimum time between advertisement runs is 30 seconds
  Slow-peer detection timer (expires in 20 seconds)
  Has 3 members (* indicates the members currently being sent updates)
   (1 dynamically detected as slow): 

  *10.1.4.5        *10.1.5.6         10.1.6.7

Slow Peer Identification

If the slow peer detection feature is not enabled, then you must identify the slow peer manually. First, check the table version and the output queue of the peers in the update group:

R2#show ip bgp update-group 3 summary
Summary for  Update-group 3, Address Family IPv4 Unicast
BGP router identifier 10.1.6.2, local AS number 2
BGP table version is 552583, main routing table version 552583
BGP main update table version 552583
37870 network entries using 4582270 bytes of memory
37870 path entries using 1969240 bytes of memory
5002/3788 BGP path/bestpath attribute entries using 700280 bytes of memory
5001 BGP AS-PATH entries using 183656 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
BGP using 7435446 total bytes of memory
BGP activity 158847/108847 prefixes, 295876/258006 paths, scan interval 60 secs
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.1.4.5        4     5      77   26840   516013    0    0 01:07:12        0
10.1.5.6        4     6      69   26833   516013    0    0 01:00:30        0
10.1.6.7        4     7      79   26761   516013    0  194 00:45:42        0

In this example, verify whether the table version (TblVer) of the peers ever catches up with the main BGP table version or whether it is always behind. Second, check for one or more peers with very high output queue values. It is likely that these are the slow peers.
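
When there are many peers, the same check can be scripted against captured command output. The Python sketch below is a minimal example that assumes the column layout shown above and IPv4 neighbor addresses. The function name and the `outq_threshold` default of 100 are arbitrary assumptions, not values recommended by this document.

```python
import re

# Matches a neighbor row: Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ ...
ROW_RE = re.compile(
    r"^(?P<neighbor>\d+\.\d+\.\d+\.\d+)\s+\d+\s+\d+\s+\d+\s+\d+\s+"
    r"(?P<tblver>\d+)\s+\d+\s+(?P<outq>\d+)\s",
    re.MULTILINE,
)


def suspect_slow_peers(summary_text, outq_threshold=100):
    """Return peers whose output queue is at or above the threshold."""
    main_version = int(
        re.search(r"BGP table version is (\d+)", summary_text).group(1)
    )
    suspects = []
    for match in ROW_RE.finditer(summary_text):
        out_q = int(match.group("outq"))
        if out_q >= outq_threshold:
            suspects.append({
                "neighbor": match.group("neighbor"),
                "out_q": out_q,
                # How far the peer lags behind the main BGP table version.
                "versions_behind": main_version - int(match.group("tblver")),
            })
    return suspects


SAMPLE = """\
BGP table version is 552583, main routing table version 552583
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.1.4.5        4     5      77   26840   516013    0    0 01:07:12        0
10.1.5.6        4     6      69   26833   516013    0    0 01:00:30        0
10.1.6.7        4     7      79   26761   516013    0  194 00:45:42        0
"""
suspects = suspect_slow_peers(SAMPLE)
print(suspects)
```

A single snapshot is only a hint. As noted above, what matters is whether the table version of a peer ever catches up, so it is worth comparing several captures taken over time.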

When you view the suspected slow BGP peer, consider these questions (on both sides of the BGP session):

How long ago was the last write performed?
Are the keepalives in throttle?
Is the output queue high?
Is the SRTT/RTTO high?
Does the number of retransmits increase?
Are there any queued retransmit packets?
Is the TCP send window very low or zero?

Here is an example:

R2#show ip bgp neighbors 10.1.6.7
BGP neighbor is 10.1.6.7,  remote AS 7, external link
 Member of peer-group group3 for session parameters
  BGP version 4, remote router ID 10.1.6.7
  BGP state = Established, up for 00:56:09
  Last read 00:00:43, last write 00:00:17, hold time is 180, keepalive interval
   is 60 seconds
  Keepalives are temporarily in throttle due to closed TCP window
  Neighbor capabilities:
    Route refresh: advertised and received(new)
    Address family IPv4 Unicast:
     advertised and received
  Message statistics
    InQ depth is 0
    OutQ depth is 0    Partial message pending
                         Sent       Rcvd
    Opens:                  5          4
    Notifications:          0          0
    Updates:            29004          0
    Keepalives:             0       1426
    Route Refresh:          0          0
    Total:              30336       1431
  Default minimum time between advertisement runs is 30 seconds
For address family: IPv4 Unicast
  BGP table version 250001, neighbor version 200001/250001
  Output queue size : 410
  Index 3, Offset 0, Mask 0x8
  3 update-group member
  group3 peer-group member
  Inbound soft reconfiguration allowed
  Private AS number removed from updates to this neighbor
  Inbound path policy configured
  Route map for incoming advertisements is eBGP-in
                                 Sent       Rcvd
  Prefix activity:               ----       ----
    Prefixes Current:            2596          0
    Prefixes Total:            102624          0
    Implicit Withdraw:             28          0
    Explicit Withdraw:         100000          0
    Used as bestpath:             n/a          0
    Used as multipath:            n/a          0
                                   Outbound    Inbound
  Local Policy Denied Prefixes:    --------    -------
    Total:                                0          0
  Maximum prefixes allowed 20000
  Threshold for warning message 80%, restart interval 300 min
  Number of NLRIs in the update sent: max 249, min 0
  Last detected as dynamic slow peer: never
 Dynamic slow peer recovered: never
  Oldest update message was formatted: 00:02:24
  Address tracking is enabled, the RIB does have a route to 10.1.6.7
  Connections established 4; dropped 3
  Last reset 00:57:39, due to User reset
  Transport(tcp) path-mtu-discovery is enabled
Connection state is ESTAB, I/O status: 1, unread input bytes: 0       
Connection is ECN Disabled
Mininum incoming TTL 0, Outgoing TTL 1
Local host: 10.1.6.2, Local port: 20298
Foreign host: 10.1.6.7, Foreign port: 179
Connection tableid (VRF): 0
 
Enqueued packets for retransmit: 15, input: 0  mis-ordered: 0 (0 bytes)
Event Timers (current time is 0x4A63D14):
Timer          Starts    Wakeups            Next
Retrans           697         29       0x4A6590C
TimeWait            0          0             0x0
AckHold            64         63             0x0
SendWnd             0          0             0x0
KeepAlive           0          0             0x0
GiveUp              0          0             0x0
PmtuAger          128        127       0x4A64CB7
DeadWait            0          0             0x0
Linger              0          0             0x0
 
iss:  130287252  snduna:  131516888  sndnxt:  131532233     sndwnd:  16384
irs: 1184181084  rcvnxt: 1184182346  rcvwnd:      15123  delrcvwnd:   1261
 
SRTT: 20122 ms, RTTO: 20440 ms, RTV: 318 ms, KRTT: 0 ms
minRTT: 20028 ms, maxRTT: 20796 ms, ACK hold: 200 ms
Status Flags: none
Option Flags: nagle, path mtu capable, higher precendence
 
Datagrams (max data segment is 1460 bytes):
Rcvd: 922 (out of order: 0), with data: 65, total data bytes: 1261
Sent: 1463 (retransmit: 29 fastretransmit: 1),with data: 1391, total
 data bytes: 1245129
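
The indicators from the checklist can also be pulled out of the output with a short script, which helps when you compare both sides of the session or repeated captures. This Python sketch assumes the field labels appear exactly as in the example output above. The function name and dictionary keys are hypothetical, and the script only extracts values; deciding what counts as "high" is left to the operator.

```python
import re


def tcp_indicators(neighbor_output):
    """Extract slow-peer indicators from 'show ip bgp neighbors' output."""

    def first_int(pattern):
        match = re.search(pattern, neighbor_output)
        return int(match.group(1)) if match else None

    return {
        "keepalives_throttled":
            "Keepalives are temporarily in throttle" in neighbor_output,
        "output_queue": first_int(r"Output queue size\s*:\s*(\d+)"),
        "retransmit_queued": first_int(r"Enqueued packets for retransmit:\s*(\d+)"),
        "srtt_ms": first_int(r"SRTT:\s*(\d+)\s*ms"),
        "rtto_ms": first_int(r"RTTO:\s*(\d+)\s*ms"),
        "send_window": first_int(r"sndwnd:\s*(\d+)"),
        # The leading parenthesis avoids matching the "Enqueued packets" line.
        "retransmits": first_int(r"\(retransmit:\s*(\d+)"),
    }


SAMPLE = """\
  Keepalives are temporarily in throttle due to closed TCP window
  Output queue size : 410
Enqueued packets for retransmit: 15, input: 0  mis-ordered: 0 (0 bytes)
iss:  130287252  snduna:  131516888  sndnxt:  131532233     sndwnd:  16384
SRTT: 20122 ms, RTTO: 20440 ms, RTV: 318 ms, KRTT: 0 ms
Sent: 1463 (retransmit: 29 fastretransmit: 1),with data: 1391, total
"""
info = tcp_indicators(SAMPLE)
print(info)
```

Run it on two captures taken a few minutes apart and compare `retransmits` to answer the question of whether the retransmit count increases.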

Movement

This section describes how a slow peer is moved out of its update group in various scenarios.

Movement Without Slow Peer Feature

A slow peer can be moved manually into a new update group without the slow peer feature.

Before the slow peer feature was available, you were required to identify the slow peer and then move it out of the update group manually. This is done with a change to the outbound policy of that BGP peer. The new outbound policy must differ from every other policy in use, because you must ensure that the slow peer does not join another update group that currently exists (which would only move the problem to that update group). The best change that you can apply is one that does not affect the actual routing policy. For example, you could change the Minimum Route Advertisement Interval (MRAI) of the peer (under the specific AF).

Here is an example that shows the manual movement of a slow peer when the slow peer feature is not available:

RR1#debug ip bgp groups 
BGP groups debugging is on

RR1(config)#router bgp 1                                    
RR1(config-router)#address-family vpnv4                           
RR1(config-router-af)#neighbor 10.100.1.3 advertisement-interval 3 
 
BGP-DYN(4): 10.100.1.3 cannot join update-group 1 due to an advertisement-interval
 mismatch
BGP(4): Scheduling withdraws and update-group membership change for 10.100.1.3
BGP(4): Resetting 10.100.1.3's version for its transition out of update-group 1
BGP-DYN(4): 10.100.1.3 cannot join update-group 1 due to an advertisement-interval
 mismatch
BGP-DYN(4): Removing 10.100.1.3 from update-group 1
BGP-DYN(4): 10.100.1.3 cannot join update-group 1 due to an advertisement-interval
 mismatch
BGP-DYN(4): Created update-group 0 from neighbor 10.100.1.3
BGP-DYN(4): Adding 10.100.1.3 to update-group 0

Static Slow Peer Movement

In order to move one peer from an update group into a new update group, you can configure it as a static slow peer. If there are multiple slow peers, then static slow peers with the same outbound policy are placed into the same slow update group.

In order to move a slow peer statically, you can configure it with the use of these commands:

Enable static peer movement per neighbor or per peer-group:

[no] neighbor {<nbr-addr>/<peer-grp-name>} slow-peer split-update-group static

Enable static peer movement via a peer policy template:
```
[no] slow-peer split-update-group static
```

Dynamic Slow Peer Movement

Slow peer movement is disabled by default. In order to enable the slow peer movement, you can configure it via one of these methods:

Enable slow peer movement for the BGP process:
```
bgp slow-peer split-update-group dynamic [permanent]

[no] bgp slow-peer split-update-group dynamic
```
Note: This can be configured from the address-family/topology/VRF view.

Enable slow peer movement per peer:

neighbor {<nbr-addr>/<peer-grp-name>} slow-peer split-update-group dynamic [permanent]

[no] neighbor {<nbr-addr>/<peer-grp-name>} slow-peer split-update-group dynamic

Enable slow peer movement via a peer policy template:

slow-peer split-update-group dynamic [permanent]

[no] slow-peer split-update-group dynamic

Note: The permanent keyword indicates that the slow peer is not moved back to its original update group automatically after it recovers. In this case, you can move the recovered slow peer back to its original update group via one of the clear commands.

The static slow peers and dynamic slow peers are in the same slow peer update group. In this example you can see one slow peer in a slow update group:

R2#show ip bgp update-group
...
BGP version 4 update-group 4, external, Address Family: IPv4 Unicast
  BGP Update version : 0/566013, messages 100 queue 100, not converged
  Slow update group
  Private AS number removed from updates to this neighbor
  Update messages formatted 2497, replicated 0
  Number of NLRIs in the update sent: max 10, min 1
  Minimum time between advertisement runs is 30 seconds
  Has 1 member (* indicates the members currently being sent updates)
   (1 dynamically detected as slow):
  *10.1.6.7

Recovery

A slow peer can be regrouped under its original update group (that matches the outbound policy) once it is confirmed that it is no longer a slow peer (it catches up). The recovery timer starts when the slow peer update group has converged. When the recovery timer expires, the slow peer is moved back to the regular update group.
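
The interaction between movement, recovery, and the permanent keyword can be summarized as a small state model. The Python sketch below is a simplification for illustration only: the class and method names are hypothetical, and the timers themselves are not modeled, only what happens when they expire.

```python
class SlowPeerState:
    """Tracks which update group a dynamically detected slow peer is in."""

    def __init__(self, permanent=False):
        self.permanent = permanent  # models the 'permanent' keyword
        self.group = "regular"

    def detection_timer_expired(self):
        # The peer is moved out of its update group into the slow update group.
        self.group = "slow"

    def recovery_timer_expired(self):
        # Automatic recovery applies only when 'permanent' is not configured.
        if not self.permanent:
            self.group = "regular"

    def clear(self):
        # A manual clear moves the peer back to its original update group.
        self.group = "regular"


dynamic_peer = SlowPeerState()
dynamic_peer.detection_timer_expired()
dynamic_peer.recovery_timer_expired()

permanent_peer = SlowPeerState(permanent=True)
permanent_peer.detection_timer_expired()
permanent_peer.recovery_timer_expired()

print(dynamic_peer.group, permanent_peer.group)
```

With `permanent=True`, the peer stays in the slow update group until `clear()` is called, which mirrors the manual clear described in this document.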

Note: In order to see the behavior that is related to the detection/recovery timer, enter the debug ip bgp updates events command.

When a slow peer is moved back to the original update group (this means a recovery), a syslog message similar to this is printed:

%BGP-5-SLOWPEER_RECOVER: Slow peer IPv4 Unicast 10.1.6.7 has recovered.
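
If syslog messages are collected centrally, the detect and recover messages can be paired to track which peers are currently marked as slow. This Python sketch assumes that each message arrives on a single line (the detect message is wrapped in this document only for display), that the wording matches the two examples shown here, and that neighbors are IPv4 addresses. The function name is hypothetical.

```python
import re

SLOWPEER_RE = re.compile(
    r"%BGP-5-SLOWPEER_(?P<event>DETECT|RECOVER): "
    r"(?:Neighbor|Slow peer) (?P<af>.+?) (?P<peer>\d+\.\d+\.\d+\.\d+) "
)


def track_slow_peers(log_lines):
    """Return the set of (address family, peer) pairs still marked slow."""
    slow = set()
    for line in log_lines:
        match = SLOWPEER_RE.search(line)
        if not match:
            continue
        key = (match.group("af"), match.group("peer"))
        if match.group("event") == "DETECT":
            slow.add(key)
        else:
            slow.discard(key)
    return slow


logs = [
    "%BGP-5-SLOWPEER_DETECT: Neighbor IPv4 Unicast 10.1.6.7 has been detected as a slow peer.",
    "%BGP-5-SLOWPEER_DETECT: Neighbor IPv4 Unicast 10.1.5.6 has been detected as a slow peer.",
    "%BGP-5-SLOWPEER_RECOVER: Slow peer IPv4 Unicast 10.1.6.7 has recovered.",
]
still_slow = track_slow_peers(logs)
print(still_slow)
```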

In order to verify whether the recovery timer is currently running, and to view the time that remains, enter this command:

R2#show ip bgp update-group
BGP version 4 update-group 1, external, Address Family: IPv4 Unicast
  BGP Update version : 165973/0, messages 0 queue 0, converged
  Route map for outgoing advertisements is dummy
  Update messages formatted 0, replicated 0
  Number of NLRIs in the update sent: max 0, min 0
  Minimum time between advertisement runs is 30 seconds
  Slow-peer recovery timer (expires in 16 seconds)
  Has 1 member (* indicates the members currently being sent updates):
   10.1.1.1

In this example, the recovery timer, with a value of 16 seconds, indicates that a possibly slow peer might move back to its original update group in 16 seconds.

In this example, you can see a peer that has recovered from the slow peer status:

R2#show ip bgp neighbor 10.1.6.7 

BGP neighbor is 10.1.6.7,  remote AS 7, external link
 Member of peer-group group3 for session parameters
  BGP version 4, remote router ID 10.1.6.7
...
 3 update-group member
  group3 peer-group member
...
 Number of NLRIs in the update sent: max 249, min 0
  Last detected as dynamic slow peer: 00:12:49
  Dynamic slow peer recovered: 00:01:57
  Oldest update message was formatted: 00:00:55

Clear the Slow Peer Status

The slow peer status can be cleared manually with these commands:

clear ip bgp * slow
clear ip bgp AF {unicast|multicast} <AS number> slow
clear ip bgp AF {unicast|multicast} peer-group <group-name> slow
clear ip bgp <neighbor-address> slow
clear bgp AF {unicast|multicast} * slow
clear bgp AF {unicast|multicast} <AS number> slow
clear bgp AF {unicast|multicast} peer-group <group-name> slow
clear bgp AF {unicast|multicast} <neighbor-address> slow

Note: When you use these commands, replace AF with the actual address family.

With the use of these commands, the peer is moved back to the original update group.

Enter the show ip bgp internal command in order to view the slow peer detection and movement settings:

R2#show ip bgp internal
Time left for bestpath timer: 593 secs
Address-family IPv4 Unicast, Mode : RW
    Table Versions : Current 622091, RIB 622091
    Start time : 00:00:01.168    Time elapsed 01:21:56.740
    First Peer up in : 00:00:07    Exited Read-Only in : 00:02:16
    Done with Install in : 00:02:26    Last Update-done in : never
    0 updates expanded
    Attribute list queue size: 0
    Slow-peer detection is enabled  Threshold is 300 seconds
    Slow-peer split-update-group dynamic is enabled
    BGP Nexthop scan:-
        penalty: 0, Time since last run: never,  Next due in: none
        Max runtime : 0 ms Latest runtime : 0 ms Scan count: 0
    BGP General Scan:-
        Max runtime : 14572 ms Latest runtime : 14572 ms Scan count: 78
    BGP future scanner version: 79
    BGP scanner version: 0

Note: In summary, the BGP slow peer is a feature that detects a slow peer in a BGP update group and allows for faster BGP convergence with the movement of the slow peer out of the update group.

Revision History

Revision	Publish Date	Comments
1.0	16-Jun-2015	Initial Release

Contributed by Cisco Engineers
