Troubleshooting High CPU Caused by The BGP Scanner or BGP Router Process
Introduction
This document describes situations in which a Cisco IOS router might experience high CPU utilization due to either the Border Gateway Protocol (BGP) router process or the BGP scanner process, as shown in the output of the show processes cpu command. The duration of the high CPU condition varies with several factors, in particular the size of the Internet routing table and the number of routes that a particular router holds in its routing and BGP tables.

The show processes cpu command shows CPU utilization averaged over the past five seconds, one minute, and five minutes. These numbers do not provide a true linear indication of utilization with respect to the offered load. These are some of the major reasons:

In a real-world network, the CPU has to handle various system maintenance functions, such as network management.
The CPU has to process periodic and event-triggered routing updates.
There are other internal system overhead operations, such as polling for resource availability, that are not proportional to traffic load.

Once you read this document, you should understand the role of each BGP process and when each process runs. In addition, you should understand BGP convergence and techniques to optimize convergence time.
Prerequisites
This document requires an understanding of how to interpret the show processes cpu command. Refer to Troubleshooting High CPU Utilization on Cisco Routers as reference material.
Components Used
The information in this document is based on Cisco IOS Software Release 12.0. The information in this document was created from the devices in a specific lab environment. All of the devices used in this document started with a cleared (default) configuration. If your network is live, make sure that you understand the potential impact of any command.
These are the BGP processes and when each runs:

BGP I/O: Handles queuing and processing of BGP packets, such as UPDATEs and KEEPALIVEs.

BGP Scanner: Walks the BGP table and confirms reachability of the next hops. BGP scanner also checks conditional advertisement to determine whether BGP should advertise conditional prefixes, and performs route dampening. In an MPLS VPN environment, BGP scanner imports and exports routes into a particular VPN routing and forwarding instance (VRF). Runs once a minute.

BGP Router: Calculates the best BGP path and processes any route "churn". It also sends and receives routes, establishes peers, and interacts with the routing information base (RIB). Runs once a second and when adding, removing, or soft-reconfiguring a BGP peer.
While BGP scanner runs, low priority processes must wait longer to access the CPU. One low priority process handles Internet Control Message Protocol (ICMP) packets, such as pings. Packets destined to or originated from the router can experience higher than expected latency, since the ICMP process must wait behind BGP scanner. The cycle is that BGP scanner runs for some time and suspends itself, and then ICMP runs. In contrast, pings sent through a router should be switched via Cisco Express Forwarding (CEF) and should not experience any additional latency. When you troubleshoot periodic spikes in latency, compare forwarding times for packets forwarded through a router against packets processed directly by the CPU of the router. Note: Ping commands that specify IP options, such as record route, also require direct processing by the CPU and can experience longer forwarding delays. Use the show processes | include BGP Scanner command to see the CPU priority. In the following sample output, the "L" in "Lsi" indicates a low priority process.
6513# show processes | include BGP Scanner
 172 Lsi 407A1BFC      29144     29130       1000  8384/9000   0 BGP Scanner
2. Once the test starts, the CPU reaches 100 percent utilization. The show process cpu command shows that the high CPU condition is caused by BGP router, denoted by 139 (the IOS process ID for BGP router) in the following output.
router# show process cpu
CPU utilization for five seconds: 100%/0%; one minute: 99%; five minutes: 81%
! Output omitted.
 139  6795740  1020252  6660  88.34%  91.63%  74.01%  0 BGP Router
3. Monitor the router by capturing multiple outputs of the show ip bgp summary and show process cpu commands during the event. The show ip bgp summary command captures the state of the BGP neighbors.
router# show ip bgp summary
Neighbor    V    AS MsgRcvd MsgSent  TblVer  InQ OutQ Up/Down  State/PfxRcd
10.1.1.1    4 64512  309453  157389   19981    0  253 22:06:44       111633
172.16.1.1  4 65101  188934    1047   40081   41    0 00:07:51        58430
4. When the router completes prefix exchange with its BGP peers, the CPU utilization rates should return to normal levels. The computed one-minute and five-minute averages settle back down as well, though they remain above normal levels for a longer period than the five-second rate.
router# show proc cpu
CPU utilization for five seconds: 3%/0%; one minute: 82%; five minutes: 91%
5. Use the captured output of the above show commands to calculate the BGP convergence time. In particular, use the "Up/Down" column of the show ip bgp summary command and compare the start and stop times of the high CPU condition. Typically, BGP convergence can take several minutes when exchanging a large Internet routing table.
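As a minimal sketch of this calculation, the hh:mm:ss values from the "Up/Down" column can be parsed and subtracted to estimate how long the high CPU condition lasted. The two sample timestamps below are hypothetical, not taken from the outputs above:

```python
from datetime import timedelta

def parse_updown(value: str) -> timedelta:
    """Parse an hh:mm:ss Up/Down value (e.g. '22:06:44') into a timedelta.
    Real output can also use day/week formats (e.g. '6d11h'), which this
    sketch deliberately does not handle."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

# Hypothetical example: if a peer showed Up/Down 00:07:51 when the high CPU
# condition began and 00:19:33 when CPU returned to normal, convergence took
# roughly the difference between the two captures.
elapsed = parse_updown("00:19:33") - parse_updown("00:07:51")
print(elapsed)  # 0:11:42
```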
Performance Improvements
As the number of routes in the Internet routing table grows, so too does the time it takes for BGP to converge. In general, convergence is defined as the process of bringing all route tables to a state of consistency. BGP is considered to be converged when all of these conditions are true:

All routes have been accepted.
All routes have been installed in the routing table.
The table version for all peers equals the table version of the BGP table.
The InQ and OutQ for all peers are zero.
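The last two conditions can be sketched as a simple check over the per-peer fields reported by show ip bgp summary. This is an illustrative model, not a Cisco API; the Peer structure and field names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Peer:
    """Hypothetical model of one row of show ip bgp summary output."""
    table_version: int  # TblVer column
    in_q: int           # InQ column
    out_q: int          # OutQ column

def is_converged(peers: list[Peer], bgp_table_version: int) -> bool:
    """BGP is considered converged when every peer's table version matches
    the BGP table version and all InQ/OutQ counters are zero."""
    return all(
        p.table_version == bgp_table_version and p.in_q == 0 and p.out_q == 0
        for p in peers
    )
```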
This section describes some IOS performance improvements to reduce BGP convergence time, which result in a reduction of a high CPU condition due to a BGP process.
The advantage of a 536 byte MSS is that packets are unlikely to be fragmented at an IP device along the path to the destination, since most links use an MTU of at least 1500 bytes. The disadvantage is that smaller packets increase the amount of bandwidth used to transport overhead. Since BGP builds a TCP connection to all peers, a 536 byte MSS affects BGP convergence times.

The solution is to enable the Path MTU (PMTU) feature with the ip tcp path-mtu-discovery command. You can use this feature to dynamically determine how large the MSS value can be without creating packets that need to be fragmented. PMTU allows TCP to determine the smallest MTU size among all links in a TCP session. TCP then uses this MTU value, minus room for the IP and TCP headers, as the MSS for the session. If a TCP session only traverses Ethernet segments, then the MSS is 1460 bytes. If it only traverses Packet over SONET (POS) segments, then the MSS is 4430 bytes. The increase in MSS from 536 to 1460 or 4430 bytes reduces TCP/IP overhead, which helps BGP converge faster. After you enable PMTU, again use the show ip bgp neighbors | include max data command to see the MSS value per peer:
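The MSS arithmetic above is simply the path MTU minus the standard 20-byte IP and 20-byte TCP headers (without options). A minimal sketch of that calculation:

```python
IP_HEADER = 20   # bytes, IPv4 header without options
TCP_HEADER = 20  # bytes, TCP header without options

def mss_for_path_mtu(path_mtu: int) -> int:
    """MSS derived from the smallest MTU on the path, per the PMTU logic
    described above: MTU minus room for the IP and TCP headers."""
    return path_mtu - IP_HEADER - TCP_HEADER

print(mss_for_path_mtu(1500))  # Ethernet path -> 1460
print(mss_for_path_mtu(4470))  # POS path     -> 4430
```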
Router# show ip bgp neighbors | include max data
Datagrams (max data segment is 1460 bytes):
Datagrams (max data segment is 1460 bytes):
Datagrams (max data segment is 1460 bytes):
Datagrams (max data segment is 1460 bytes):
Increasing the interface input queue depth with the hold-queue <1-4096> in command helps reduce the number of dropped TCP ACKs, which reduces the amount of work BGP must do to converge. Normally, a value of 1000 resolves problems caused by input queue drops. Note: The Cisco 12000 Series now uses a default SPD headroom value of 1000. It retains the default input queue size of 75. Use the show spd command to view these special input queues.
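As a configuration sketch, the increased input queue depth is applied per interface; the interface name here is hypothetical:

```
Router(config)# interface POS0/0
Router(config-if)# hold-queue 1000 in
```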
The following sample output of the show ip bgp peer-group | include replicated command on a router running IOS 12.0(18)S displays this information:
Update messages formatted 836500, replicated 1668500
Update messages formatted 1050000, replicated 1455000
Update messages formatted 660500, replicated 1844500
Update messages formatted 656000, replicated 1849000
Update messages formatted 501250, replicated 2003750
! The first five lines are for eBGP peer groups.
Update messages formatted 2476715, replicated 12114785
! The last line is for an iBGP peer group.
In order to calculate the replication rate for each peer group, divide the number of updates replicated by the number of updates formatted:

1668500/836500 = 1.99
1455000/1050000 = 1.38
1844500/660500 = 2.79
1849000/656000 = 2.81
2003750/501250 = 3.99
12114785/2476715 = 4.89

If BGP replicated perfectly, then the eBGP peer groups would each have a replication rate of 19, because there are 20 peers in each peer group. The update should be formatted for the peer group leader and then replicated to the other 19 peers, providing an optimal replication rate of 19. The ideal replication rate for the iBGP peer group would be 99, since there are 100 peers. If BGP packed updates perfectly, then only 36,250 updates would be formatted for each peer group, because that is the number of attribute combinations in the BGP table. Instead, the iBGP peer group alone formats almost 2,500,000 updates, while the eBGP peer groups each generate anywhere from 500,000 to 1,000,000 updates. On a router running IOS 12.0(19)S, the show ip bgp peer-group | include replicated command provides this information:
Update messages formatted 36250, replicated 688750
Update messages formatted 36250, replicated 688750
Update messages formatted 36250, replicated 688750
Update messages formatted 36250, replicated 688750
Update messages formatted 36250, replicated 688750
Update messages formatted 36250, replicated 3588750
Note: Update packing is optimal. Exactly 36,250 updates are formatted for each peer group.

688750/36250 = 19
688750/36250 = 19
688750/36250 = 19
688750/36250 = 19
688750/36250 = 19
3588750/36250 = 99

Note: Update replication is also perfect.
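The replication rate calculation above can be sketched as a one-line helper; the figures below are the ones from the sample outputs:

```python
def replication_rate(formatted: int, replicated: int) -> float:
    """Replication rate = updates replicated / updates formatted.
    The optimum equals the number of peers in the peer group minus one
    (the leader, for whom the update is formatted)."""
    return replicated / formatted

# Values from the 12.0(18)S output: far from the optimum of 19.
print(round(replication_rate(836500, 1668500), 2))   # 1.99

# Values from the 12.0(19)S output: optimal for 20-peer eBGP and
# 100-peer iBGP peer groups.
print(replication_rate(36250, 688750))               # 19.0
print(replication_rate(36250, 3588750))              # 99.0
```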
Troubleshooting
Use these guidelines when you troubleshoot high CPU due to BGP scanner or BGP router:

Gather information on your BGP topology. Determine the number of BGP peers and the number of routes advertised by each peer. Is the duration of the high CPU condition reasonable for your environment?
Determine when the high CPU condition happens. Does it coincide with a regularly scheduled walk of the BGP table? Execute the show ip bgp flap-statistics command. Did the high CPU condition follow an interface flap?

Ping through the router and then ping from the router. ICMP echoes are handled as a low priority process. The document Understanding the Ping and Traceroute Commands explains this in more detail.

Ensure that regular forwarding is not affected. Ensure that packets can follow a fast forwarding path by checking whether fast switching and/or CEF are enabled on both the inbound and the outbound interfaces. Ensure that you do not see the no ip route-cache cef command on the interface or the no ip cef command in the global configuration. In order to enable CEF, use the ip cef command in global configuration mode.
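A minimal configuration sketch for enabling CEF globally and per interface; the interface name is hypothetical:

```
Router(config)# ip cef
Router(config)# interface POS0/0
Router(config-if)# ip route-cache cef
```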
Related Information
BGP Support Page
IP Routing Support Page
Technical Support - Cisco Systems