Cisco Press - IP Quality of Service - Vegesna (2001)
IP Quality of Service
Srinivas Vegesna Copyright 2001 Cisco Press Cisco Press logo is a trademark of Cisco Systems, Inc. Published by: Cisco Press 201 West 103rd Street Indianapolis, IN 46290 USA All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without written permission from the publisher, except for the inclusion of brief quotations in a review. Printed in the United States of America 1 2 3 4 5 6 7 8 9 0 04 03 02 01 First Printing December 2000 Library of Congress Cataloging-in-Publication Number: 98-86710
Warning and Disclaimer
This book is designed to provide information about IP Quality of Service. Every effort has been made to make this book as complete and as accurate as possible, but no warranty or fitness is implied. The information is provided on an "as is" basis. The author, Cisco Press, and Cisco Systems, Inc., shall have neither liability nor responsibility to any person or entity with respect to any loss or damages arising from the information contained in this book or from the use of the discs or programs that may accompany it. The opinions expressed in this book belong to the author and are not necessarily those of Cisco Systems, Inc.
Trademark Acknowledgments
All terms mentioned in this book that are known to be trademarks or service marks have been appropriately capitalized. Cisco Press or Cisco Systems, Inc., cannot attest to the accuracy of this information. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark.
Feedback Information
At Cisco Press, our goal is to create in-depth technical books of the highest quality and value. Each book is crafted with care and precision, undergoing rigorous development that involves the unique expertise of members from the professional technical community. Readers' feedback is a natural continuation of this process. If you have any comments regarding how we could improve the quality of this book, or otherwise alter it to better suit your needs, you can contact us through e-mail at [email protected]. Please make sure to include the book title and ISBN in your message. We greatly appreciate your assistance.
Publisher Editor-in-Chief Cisco Systems Program Manager Managing Editor Acquisitions Editor Development Editors Senior Editor Copy Editor Technical Editors John Wait John Kane Bob Anstey Patrick Kanouse Tracy Hughes Kitty Jarrett Allison Johnson Jennifer Chisholm Audrey Doyle Vijay Bollapragada Sanjay Kalra Kevin Mahler Erick Mar Sheri Moran Louisa Klucznick Argosy Bob LaRoche Larry Sweazy
Corporate Headquarters Cisco Systems, Inc. 170 West Tasman Drive San Jose, CA 95134-1706 USA http://www.cisco.com/ 408 526-4000 800 553-NETS (6387) 408 526-4100 European Headquarters Cisco Systems Europe 11 Rue Camille Desmoulins 92782 Issy-les-Moulineaux Cedex 9 France http://www.europe.cisco.com/ 33 1 58 04 60 00 33 1 58 04 61 00
Americas Headquarters Cisco Systems, Inc. 170 West Tasman Drive San Jose, CA 95134-1706 USA http://www.cisco.com/ 408 526-7660 408 527-0883 Asia Pacific Headquarters Cisco Systems Australia, Pty., Ltd Level 17, 99 Walker Street North Sydney NSW 2059 Australia http://www.cisco.com/ +61 2 8448 7100 +61 2 9957 4350 Cisco Systems has more than 200 offices in the following countries. Addresses, phone numbers, and fax numbers are listed on the Cisco Web site at http://www.cisco.com/go/offices Copyright 2000, Cisco Systems, Inc. All rights reserved. Access Registrar, AccessPath, Are You Ready, ATM Director, Browse with Me, CCDA, CCDE, CCDP, CCIE, CCNA, CCNP, CCSI, CD-PAC, CiscoLink, the Cisco NetWorks logo, the Cisco Powered Network logo, Cisco Systems Networking Academy, Fast Step, FireRunner, Follow Me Browsing, FormShare, GigaStack, IGX, Intelligence in the Optical Core, Internet Quotient, IP/VC, iQ Breakthrough, iQ Expertise, iQ FastTrack, iQuick Study, iQ Readiness Scorecard, The iQ Logo, Kernel Proxy, MGX, Natural Network Viewer, Network Registrar, the Networkers logo, Packet, PIX, Point and Click Internetworking, Policy Builder, RateMUX, ReyMaster, ReyView, ScriptShare, Secure Script, Shop with Me, SlideCast, SMARTnet, SVX, TrafficDirector, TransPath, VlanDirector, Voice LAN, Wavelength Router, Workgroup Director, and Workgroup Stack are trademarks of Cisco Systems, Inc.; Changing the Way We Work, Live, Play, and Learn, Empowering the Internet Generation, are service marks of Cisco Systems, Inc.; and Aironet, ASIST, BPX, Catalyst, Cisco, the Cisco Certified Internetwork Expert Logo, Cisco IOS, the Cisco IOS logo, Cisco Press, Cisco Systems, Cisco Systems Capital, the Cisco Systems logo, Collision Free, Enterprise/Solver, EtherChannel, EtherSwitch, FastHub, FastLink, FastPAD, IOS, IP/TV, IPX, LightStream, LightSwitch, MICA, NetRanger, Post-Routing, Pre-Routing, Registrar, StrataView Plus, Stratm, 
SwitchProbe, TeleRouter, are registered trademarks of Cisco Systems, Inc. or its affiliates in the U.S. and certain other countries. All other brands, names, or trademarks mentioned in this document or Web site are the property of their respective owners. The use of the word partner does not imply a partnership relationship between Cisco and any other company. (0010R)
Dedication
About the Author
Acknowledgments
About the Technical Reviewers
I: IP QoS
1. Introducing IP Quality of Service - Levels of QoS; IP QoS History; Performance Measures; QoS Functions; Layer 2 QoS Technologies; Multiprotocol Label Switching; End-to-End QoS; Objectives; Audience; Scope and Limitations; Organization; References
2. Differentiated Services Architecture - Intserv Architecture; Diffserv Architecture; Summary; References
3. Network Boundary Traffic Conditioners: Packet Classifier, Marker, and Traffic Rate Management - Packet Classification; Packet Marking; The Need for Traffic Rate Management; Traffic Policing; Traffic Shaping; Summary; Frequently Asked Questions; References
4. Per-Hop Behavior: Resource Allocation I - Scheduling for Quality of Service (QoS) Support; Sequence Number Computation-Based WFQ; Flow-Based WFQ; Flow-Based Distributed WFQ (DWFQ); Class-Based WFQ; Priority Queuing; Custom Queuing; Scheduling Mechanisms for Voice Traffic; Summary; Frequently Asked Questions; References
5. Per-Hop Behavior: Resource Allocation II - Modified Weighted Round Robin (MWRR); Modified Deficit Round Robin (MDRR); MDRR Implementation; Summary; Frequently Asked Questions; References
6. Per-Hop Behavior: Congestion Avoidance and Packet Drop Policy - TCP Slow Start and Congestion Avoidance; TCP Traffic Behavior in a Tail-Drop Scenario; RED: Proactive Queue Management for Congestion Avoidance; WRED; Flow WRED; ECN; SPD; Summary; Frequently Asked Questions; References
7. Integrated Services: RSVP - RSVP; Reservation Styles; Service Types; RSVP Media Support; RSVP Scalability; Case Study 7-1: Reserving End-to-End Bandwidth for an Application Using RSVP; Case Study 7-2: RSVP for VoIP; Summary; Frequently Asked Questions; References
II: Layer 2, MPLS QoS: Interworking with IP QoS
8. Layer 2 QoS: Interworking with IP QoS - ATM; ATM Interworking with IP QoS; Frame Relay; Frame Relay Interworking with IP QoS; The IEEE 802.3 Family of LANs; Summary; Frequently Asked Questions; References
9. QoS in MPLS-Based Networks - MPLS; MPLS with ATM; Case Study 9-1: Downstream Label Distribution; MPLS QoS; End-to-End IP QoS; MPLS VPN; Case Study 9-3: MPLS VPN; MPLS VPN QoS; Case Study 9-4: MPLS VPN QoS; Summary; Frequently Asked Questions; References
III: Traffic Engineering
10. MPLS Traffic Engineering - The Layer 2 Overlay Model; RRR; TE Trunk Definition; TE Tunnel Attributes; Link Resource Attributes; Distribution of Link Resource Information; Path Selection Policy; TE Tunnel Setup; Link Admission Control; TE Path Maintenance; TE-RSVP; IGP Routing Protocol Extensions; TE Approaches; Case Study 10-1: MPLS TE Tunnel Setup and Operation; Summary; Frequently Asked Questions; References
IV: Appendixes
A. Cisco Modular QoS Command-Line Interface - Traffic Class Definition; Policy Definition; Policy Application; Order of Policy Execution
B. Packet Switching Mechanisms - Process Switching; Route-Cache Forwarding; CEF; Summary
C. Routing Policies - Using QoS Policies to Make Routing Decisions; QoS Policy Propagation Using BGP; Summary; References
D. Real-time Transport Protocol (RTP) - Reference
E. General IP Line Efficiency Functions - The Nagle Algorithm; Path MTU Discovery; TCP/IP Header Compression; RTP Header Compression; References
F. Link-Layer Fragmentation and Interleaving - References
G. IP Precedence and DSCP Values
Acknowledgments
I would like to thank all my friends and colleagues at Cisco Systems for a stimulating work environment over the last six years. I value the many technical discussions we had on the internal e-mail aliases and in hallway conversations. My special thanks go to the technical reviewers of the book, Sanjay Kalra and Vijay Bollapragada, and the development editors of the book, Kitty Jarrett and Allison Johnson. Their input has considerably enhanced the presentation and content of the book. I would like to thank Mosaddaq Turabi for his thoughts on the subject and interest in the book. I would also like to remember a special colleague and friend at Cisco, Kevin Hu, who passed away in 1995. Kevin and I started at Cisco the same day and worked as a team for the one year I knew him. He was truly an all-round person. Finally, the book wouldn't have been possible without the support and patience of my family. I would like to express my deep gratitude and love for my wife, Latha, for her understanding all along the course of the book. I would also like to thank my brother, Srihari, for being a great brother and a friend. A very special thanks goes to my two-year-old son, Akshay, for his bright smile and cute words, and my newborn son, Karthik, for his innocent looks and sweet nothings.
Part I: IP QoS
Chapter 1 Introducing IP Quality of Service
Chapter 2 Differentiated Services Architecture
Chapter 3 Network Boundary Traffic Conditioners: Packet Classifier, Marker, and Traffic Rate Management
Chapter 4 Per-Hop Behavior: Resource Allocation I
Chapter 5 Per-Hop Behavior: Resource Allocation II
Chapter 6 Per-Hop Behavior: Congestion Avoidance and Packet Drop Policy
Chapter 7 Integrated Services: RSVP
Levels of QoS
Traffic in a network is made up of flows originated by a variety of applications on end stations. These applications differ in their service and performance requirements. Any flow's requirements depend inherently on the application it belongs to. Hence, understanding the application types is key to understanding the different service needs of flows within a network. The network's capability to deliver the service needed by specific network applications, with some level of control over performance measures (that is, bandwidth, delay/jitter, and loss), is categorized into three service levels:
Best-effort service Basic connectivity with no guarantee as to whether or when a packet is delivered to the destination, although a packet is usually dropped only when the router input or output buffer queues are exhausted. Best-effort service is not really a part of QoS because no service or delivery guarantees are made in forwarding best-effort traffic. This is the only service the Internet offers today. Most data applications, such as File Transfer Protocol (FTP), work correctly with best-effort service, albeit with degraded performance. To function well, all applications require certain network resource allocations in terms of bandwidth, delay, and minimal packet loss.
Differentiated service In differentiated service, traffic is grouped into classes based on service requirements. Each traffic class is differentiated by the network and serviced according to the QoS mechanisms configured for the class. This scheme for delivering QoS is often referred to as class of service (CoS). Note that differentiated service doesn't give service guarantees per se; it only differentiates traffic and allows preferential treatment of one traffic class over another. For this reason, this service is also referred to as soft QoS. This QoS scheme works well for bandwidth-intensive data applications. It is important that network control traffic be differentiated from the rest of the data traffic and prioritized so as to ensure basic network connectivity at all times.
Guaranteed service A service that requires network resource reservation to ensure that the network meets a traffic flow's specific service requirements. Guaranteed service requires prior network resource reservation over the connection path. It also is referred to as hard QoS because it requires rigid guarantees from the network. Path reservations with a granularity of a single flow don't scale over the Internet backbone, which services thousands of flows at any given time. Aggregate reservations, however, which call for only minimal state information in the Internet core routers, should be a scalable means of offering this service. Applications requiring such service include multimedia applications such as audio and video. Interactive voice applications over the Internet need to limit latency to 100 ms to meet human ergonomic needs; this latency also is acceptable to a large spectrum of multimedia applications. Internet telephony needs at a minimum an 8-kbps bandwidth and a 100-ms round-trip delay. The network needs to reserve resources to be able to meet such guaranteed service requirements.
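The reservation logic that guaranteed service depends on can be sketched as a simple per-link admission check. The function name, link capacity, and flow rates below are illustrative assumptions, not from the book.

```python
# Sketch: minimal admission check for guaranteed service reservations.
# A new flow is admitted only if the unreserved capacity covers its request.

def admit(link_capacity_kbps, reserved_kbps, request_kbps):
    """Return True if the new reservation fits on the link."""
    return reserved_kbps + request_kbps <= link_capacity_kbps

reservations = []
for flow_kbps in [8, 64, 128, 1500]:
    if admit(1536, sum(reservations), flow_kbps):  # roughly a T1's capacity
        reservations.append(flow_kbps)

print(reservations)  # [8, 64, 128]: the 1500-kbps request exceeds what remains
```

Aggregate reservations apply the same check per class rather than per flow, which is why they need far less state in core routers.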
Layer 2 QoS refers to all the QoS mechanisms that either are targeted for or exist in the various link layer technologies. Chapter 8, "Layer 2 QoS: Interworking with IP QoS," covers Layer 2 QoS. Layer 3 QoS refers to QoS functions at the network layer, which is IP. Table 1-1 outlines the three service levels and their related enabling QoS functions at Layers 2 and 3. These QoS functions are discussed in detail in the rest of this book.
Table 1-1. Service Levels and Enabling QoS Functions

Best effort: Layer 3 - basic connectivity; Layer 2 - ATM Unspecified Bit Rate (UBR), Frame Relay Committed Information Rate (CIR) = 0.
Differentiated (CoS): Layer 3 - Committed Access Rate (CAR), Weighted Fair Queuing (WFQ), Weighted Random Early Detection (WRED); Layer 2 - IEEE 802.1p.
Guaranteed: Layer 3 - Resource Reservation Protocol (RSVP); Layer 2 - Subnet Bandwidth Manager (SBM), ATM Constant Bit Rate (CBR), Frame Relay CIR.
IP QoS History
IP QoS is not an afterthought. The Internet's founding fathers envisioned this need and provisioned a Type of Service (ToS) byte in the IP header to facilitate QoS as part of the initial IP specification, which described the purpose of the ToS byte as follows: "The Type of Service provides an indication of the abstract parameters of the quality of service desired. These parameters are to be used to guide the selection of the actual service parameters when transmitting a datagram through the particular network."[1] Until the late 1980s, the Internet was still within its academic roots and had limited applications and traffic running over it. Hence, ToS support wasn't especially important, and almost all IP implementations ignored the ToS byte. IP applications didn't specifically mark the ToS byte, nor did routers use it to affect the forwarding treatment given to an IP packet. The importance of QoS over the Internet has grown with its evolution from its academic roots to its present commercial and popular stage. The Internet is based on a connectionless end-to-end packet service, which traditionally provided a best-effort means of data transportation using the Transmission Control Protocol/Internet Protocol (TCP/IP) suite. Although the connectionless design gives the Internet its flexibility and robustness, its packet dynamics also make it prone to congestion problems, especially at routers that connect networks of widely different bandwidths. The congestion collapse problem was discussed by John Nagle during the Internet's early growth phase in the mid-1980s[2]. The initial QoS function set was for Internet hosts. One major problem with expensive wide-area network (WAN) links is the excessive overhead due to small Transmission Control Protocol (TCP) packets created by applications such as telnet and rlogin. The Nagle algorithm, which solves this issue, is now supported by all IP host implementations[3].
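The Nagle algorithm's send decision can be sketched as follows. Real TCP stacks implement this inside the kernel; the simplified predicate and MSS value here are illustrative assumptions.

```python
# Sketch of the Nagle algorithm's send decision (simplified). A small
# segment is held back while earlier data is still unacknowledged, so
# many one-byte writes coalesce into fewer, larger segments.

MSS = 1460  # maximum segment size in bytes (typical for Ethernet; assumed)

def nagle_send_now(pending_bytes, unacked_bytes):
    """Send immediately if a full segment is ready or nothing is in flight."""
    return pending_bytes >= MSS or unacked_bytes == 0

print(nagle_send_now(1, 0))       # True: nothing in flight, send the byte
print(nagle_send_now(1, 512))     # False: hold back and wait for the ACK
print(nagle_send_now(2000, 512))  # True: a full segment is ready
```

The payoff on a slow WAN link is fewer 40-byte headers carrying one-byte payloads.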
The Nagle algorithm heralded the beginning of Internet QoS-based functionality in IP. In 1986, Van Jacobson developed the next set of Internet QoS tools: the congestion avoidance mechanisms for end systems that are now required in TCP implementations. These mechanisms, slow start and congestion avoidance, have helped greatly in preventing a congestion collapse of the present-day Internet. They primarily make TCP flows responsive to the congestion signals (dropped packets) within the network. Two additional mechanisms, fast retransmit and fast recovery, were added in 1990 to provide optimal performance during periods of packet loss[4]. Though QoS mechanisms in end systems are essential, they didn't complete the end-to-end QoS story until adequate mechanisms were provided within routers to transport traffic between end systems. Hence, around 1990 the QoS focus shifted to routers. Routers limited to only first-in, first-out (FIFO) scheduling offer no mechanism to differentiate or prioritize traffic within the packet-scheduling algorithm. FIFO queuing causes tail drops and doesn't protect well-behaving flows from misbehaving flows. WFQ, a packet-scheduling algorithm[5], and WRED, a queue management algorithm[6], are widely accepted to fill this gap in the Internet backbone. Internet QoS development continued with standardization efforts in delivering end-to-end QoS over the Internet. The Integrated Services (intserv) Internet Engineering Task Force (IETF) Working Group[7] aims to provide the means for applications to express end-to-end resource requirements with support mechanisms in
routers and subnet technologies. RSVP is the signaling protocol for this purpose. The intserv model requires per-flow state along the path of the connection, which doesn't scale in the Internet backbones, where thousands of flows exist at any time. Chapter 7, "Integrated Services: RSVP," provides a discussion of RSVP and the intserv service types. The IP ToS byte hasn't been used much in the past, but it is increasingly being used as a way to signal QoS. The ToS byte is emerging as the primary mechanism for delivering diffserv over the Internet, and for this purpose, the IETF Differentiated Services (diffserv) Working Group[8] is working on standardizing its use as a diffserv byte. Chapter 2, "Differentiated Services Architecture," discusses the diffserv architecture in detail.
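The slow start and congestion avoidance behavior described earlier can be sketched as a per-round-trip trace of the TCP congestion window. The threshold and window values below are illustrative, not from the book.

```python
# Sketch: growth of the TCP congestion window (cwnd), in segments, per
# round trip. Slow start doubles cwnd each RTT until it reaches the
# slow-start threshold (ssthresh); congestion avoidance then adds one
# segment per RTT.

def cwnd_trace(rtts, ssthresh=16, initial=1):
    cwnd, trace = initial, []
    for _ in range(rtts):
        trace.append(cwnd)
        if cwnd < ssthresh:
            cwnd *= 2  # slow start: exponential growth
        else:
            cwnd += 1  # congestion avoidance: linear growth
    return trace

print(cwnd_trace(8))  # [1, 2, 4, 8, 16, 17, 18, 19]
```

A packet drop would, in the real protocol, halve ssthresh and restart the window; this sketch shows only the loss-free growth phases.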
Performance Measures
QoS deployment intends to provide a connection with certain performance bounds from the network. Bandwidth, packet delay and jitter, and packet loss are the common measures used to characterize a connection's performance within a network. They are described in the following sections.
Bandwidth
The term bandwidth is used to describe the rated throughput capacity of a given medium, protocol, or connection. It effectively describes the "size of the pipe" required for the application to communicate over the network. Generally, a connection requiring guaranteed service has certain bandwidth requirements and wants the network to allocate a minimum bandwidth specifically for it. A digitized voice application produces voice as a 64-kbps stream. Such an application becomes nearly unusable if it gets less than 64 kbps from the network along the connection's path.
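As a worked example of sizing the pipe, the following sketch computes the wire bandwidth of a 64-kbps voice stream once it is packetized. The 20-ms packetization interval and 40-byte IP/UDP/RTP header overhead are assumed figures for illustration, not from the book.

```python
# Sketch: wire bandwidth of a 64-kbps voice stream once packetized.
# Assumes 20-ms packetization and 40 bytes of IP/UDP/RTP headers
# (20 IP + 8 UDP + 12 RTP) per packet.

codec_bps = 64_000
packet_interval_s = 0.020
header_bytes = 40

payload_bytes = codec_bps * packet_interval_s / 8  # 160 bytes per packet
packets_per_s = 1 / packet_interval_s              # 50 packets per second
wire_bps = (payload_bytes + header_bytes) * 8 * packets_per_s

print(int(wire_bps))  # 80000: the 64-kbps stream needs 80 kbps on the wire
```

The point of the arithmetic is that a bandwidth reservation must cover the packetized rate, not just the codec rate.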
Packet Delay and Jitter
If the network is not congested, queues do not build at routers, and serialization delay at each hop plus propagation delay accounts for the total packet delay. This constitutes the minimum delay the network can offer. Note that serialization delays become insignificant compared to propagation delays at fast link speeds. If the network is congested, queuing delays start to influence end-to-end delays and contribute to the delay variation among the different packets in the same connection. The variation in packet delay is referred to as packet jitter. Packet jitter is important because it estimates the maximum spread between individual packet delays seen at the receiver. A receiver, depending on the application, can offset the jitter by adding a receive buffer that stores packets up to the jitter bound. Playback applications that send a continuous information stream, including applications such as interactive voice calls, videoconferencing, and distribution, fall into this category. Figure 1-1 illustrates the impact of the three delay types on the total delay with increasing link speeds. Note that the serialization delay becomes minimal compared to propagation delay as the link's bandwidth increases. The switching delay is negligible if the queues are empty, but it can increase drastically as the number of packets waiting in the queue increases. Figure 1-1 Delay Components of a 1500-byte Packet on a Transcontinental U.S. Link with Increasing Bandwidths
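The relationship Figure 1-1 describes can be reproduced with simple arithmetic. The 4500-km transcontinental distance, propagation speed, and link speeds below are assumptions for illustration.

```python
# Sketch: delay components of a 1500-byte packet on a long-haul link.
# Serialization delay shrinks as link speed grows; propagation delay
# is fixed by distance and the speed of light in fiber.

PACKET_BITS = 1500 * 8
DISTANCE_KM = 4500                 # assumed transcontinental U.S. distance
PROPAGATION_KM_PER_S = 200_000     # roughly 2/3 the speed of light

propagation_ms = DISTANCE_KM / PROPAGATION_KM_PER_S * 1000  # 22.5 ms, fixed

for name, bps in [("64 kbps", 64_000), ("T1", 1_536_000), ("OC-3", 155_000_000)]:
    serialization_ms = PACKET_BITS / bps * 1000
    print(f"{name}: serialization {serialization_ms:.3f} ms, "
          f"propagation {propagation_ms:.1f} ms")
```

At 64 kbps the serialization delay (187.5 ms) dominates; at OC-3 speeds it is well under a tenth of a millisecond, leaving propagation as the floor.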
Packet Loss
Packet loss specifies the number of packets lost by the network during transmission. Packet drops at network congestion points and corrupted packets on the transmission wire cause packet loss. Packet drops generally occur at congestion points, when incoming packets far exceed the queue size limit at the output queue; they also occur due to insufficient input buffers on packet arrival. Packet loss is generally specified as a fraction of packets lost while transmitting a certain number of packets over some time interval. Certain applications don't function well or are highly inefficient when packets are lost. Such loss-intolerant applications call for packet loss guarantees from the network. Packet loss should be rare for a well-designed, correctly subscribed or undersubscribed network. It is also rare for guaranteed service applications, for which the network has already reserved the required resources. With fiber transmission lines, whose bit error rate (BER) of about 10^-9 makes them relatively loss-free, packet loss is mainly due to packet drops at network congestion points. Packet drops, however, are a fact of life when transmitting best-effort traffic, although such drops are done only when necessary. Keep in mind that dropped packets waste network resources, because they already consumed certain network resources on their way to the loss point.
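The contribution of wire corruption to packet loss can be estimated directly from the BER. The sketch below assumes independent bit errors, which is a simplification.

```python
# Sketch: probability that a 1500-byte packet is corrupted on a line
# with a given bit error rate (BER), assuming independent bit errors.
# At fiber's ~10^-9 BER, corruption loss is negligible next to
# congestion drops.

def corruption_probability(packet_bytes, ber):
    bits = packet_bytes * 8
    return 1 - (1 - ber) ** bits

print(corruption_probability(1500, 1e-9))  # about 1.2e-5 per packet
print(corruption_probability(1500, 1e-4))  # about 0.70 on a very noisy line
```

The comparison makes the book's point concrete: on fiber, roughly one packet in 80,000 is lost to bit errors, so almost all observed loss is congestion drop.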
QoS Functions
This section briefly discusses the various QoS functions, their related features, and their benefits. The functions are discussed in further detail in the rest of the book.
Resource Allocation
FIFO scheduling is the widely deployed, traditional queuing mechanism within routers and switches on the Internet today. Though it is simple to implement, FIFO queuing has some fundamental problems in providing QoS. It provides no way for delay-sensitive traffic to be prioritized and moved to the head of the queue. All traffic is treated exactly the same, with no scope for differentiating traffic or the service it receives. For the scheduling algorithm to deliver QoS, at a minimum it needs to be able to differentiate among the different packets in the queue and know the service level of each packet. A scheduling algorithm determines which packet goes next from a queue. How often a flow's packets are served determines the bandwidth or resource allocation for the flow. Chapter 4, "Per-Hop Behavior: Resource Allocation I," covers the QoS features in this section in detail.
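The difference between FIFO and a differentiating scheduler can be sketched as follows. The two-class packet mix and priority values are illustrative, and the priority queue stands in for any scheduler that can distinguish service levels.

```python
# Sketch: FIFO serves packets strictly in arrival order; a priority-aware
# scheduler can move delay-sensitive packets ahead of bulk data.

import heapq

arrivals = [("data", 2), ("voice", 0), ("data", 2), ("voice", 0)]  # (class, priority)

# FIFO: arrival order is service order.
fifo_order = [cls for cls, _ in arrivals]

# Priority scheduling: lower priority value is served first; the arrival
# index breaks ties so order stays stable within a class.
heap = [(prio, i, cls) for i, (cls, prio) in enumerate(arrivals)]
heapq.heapify(heap)
priority_order = [cls for _, _, cls in (heapq.heappop(heap) for _ in range(len(heap)))]

print(fifo_order)      # ['data', 'voice', 'data', 'voice']
print(priority_order)  # ['voice', 'voice', 'data', 'data']
```

A strict priority queue like this can starve low-priority traffic, which is why the book's later chapters spend so much time on weighted schedulers such as WFQ.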
Switching
A router's primary function is to quickly and efficiently switch all incoming traffic to the correct output interface and next-hop address based on the information in the forwarding table. The traditional cache-based forwarding mechanism, although efficient, has scaling and performance problems because it is traffic-driven and can lead to increased cache maintenance and poor switching performance during network instability. The topology-based forwarding method solves the problems involved with cache-based forwarding mechanisms by building a forwarding table that exactly matches the router's routing table. The topology-based forwarding mechanism is referred to as Cisco Express Forwarding (CEF) in Cisco routers. Appendix B, "Packet Switching Mechanisms," offers more detail on these QoS functions.
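Whether cache-based or topology-based, the forwarding step ultimately performs a longest-prefix-match lookup against the forwarding table. The sketch below illustrates that lookup with a hypothetical table; the prefixes and interface names are invented, and a real router uses an optimized structure (such as a trie) rather than a linear scan.

```python
# Sketch: longest-prefix-match lookup over a forwarding table that
# mirrors the routing table, as in topology-based forwarding.

import ipaddress

fib = {
    ipaddress.ip_network("10.0.0.0/8"):  "Serial0",
    ipaddress.ip_network("10.1.0.0/16"): "Serial1",
    ipaddress.ip_network("0.0.0.0/0"):   "Ethernet0",  # default route
}

def lookup(dst):
    """Return the output interface for the most specific matching prefix."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in fib if addr in net]
    return fib[max(matches, key=lambda net: net.prefixlen)]  # longest prefix wins

print(lookup("10.1.2.3"))   # Serial1: the /16 beats the /8
print(lookup("10.9.9.9"))   # Serial0
print(lookup("192.0.2.1"))  # Ethernet0: only the default route matches
```

Because this table is derived from the routing table rather than from observed traffic, it never needs invalidation when a new destination appears, which is the scaling advantage of CEF over route caching.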
Routing
Traditional routing is destination-based only and routes packets on the shortest path derived in the routing table. This is not flexible enough for certain network scenarios. Policy routing is a QoS function that enables the user to change destination-based routing to routing based on various user-configurable packet parameters. Current routing protocols provide shortest-path routing, which selects routes based on a metric value such as administrative cost, weight, or hop count. Packets are routed based on the routing table, without any knowledge of the flow requirements or the resource availability along the route. QoS routing is a routing mechanism that takes into account a flow's QoS requirements and has some knowledge of the resource availability in the network in its route selection criteria. Appendix C, "Routing Policies," offers more detail on these QoS functions.
End-to-End QoS
Layer 2 QoS technologies offer solutions of a smaller scope only and can't provide end-to-end QoS, simply because the Internet or any large-scale IP network is made up of a large group of diverse Layer 2 technologies. In a network, end-to-end connectivity starts at Layer 3; hence, only a network layer protocol, which is IP in the TCP/IP-based Internet, can deliver end-to-end QoS. The Internet is made up of diverse link technologies and physical media. IP, being the layer providing end-to-end connectivity, needs to map its QoS functions to the link QoS mechanisms, especially those of switched networks, to facilitate end-to-end QoS. Some service provider backbones are based on switched networks such as ATM or Frame Relay. In this case, you need ATM and Frame Relay QoS-to-IP interworking to provide end-to-end QoS. This enables the IP QoS request to be honored within the ATM or Frame Relay cloud. Switched LANs are an integral part of Internet service providers (ISPs) that provide Web-hosting services and of corporate intranets. IEEE 802.1p and IEEE 802.1Q offer priority-based traffic differentiation in switched LANs. Interworking these protocols with IP is essential to making QoS end to end. Chapter 8 discusses IP QoS interworking with switches, backbones, and LANs in detail. MPLS facilitates IP QoS delivery and provides extensive traffic engineering capabilities that help provide MPLS-based VPNs. For end-to-end QoS, IP QoS needs to interwork with the QoS mechanisms in MPLS and MPLS-based VPNs. Chapter 9 focuses on this topic.
Objectives
This book is intended to be a valuable technical resource for network managers, architects, and engineers who want to understand and deploy IP QoS-based services within their networks. IP QoS functions are indispensable in today's scalable IP network designs, which are intended to deliver guaranteed and differentiated Internet services by giving control of the network resources and their usage to the network operator. This book's goal is to discuss IP QoS architectures and their associated QoS functions that enable end-to-end QoS in corporate intranets, service provider networks, and, in general, the Internet. On the subject of IP QoS architectures, this book's primary focus is on the diffserv architecture. This book also focuses on ATM, Frame Relay, IEEE 802.1p, IEEE 802.1Q, MPLS, and MPLS VPN QoS technologies and on how they interwork with IP QoS in providing an end-to-end service. Another important topic of this book is MPLS traffic engineering. This book provides complete coverage of IP QoS and all related technologies, complete with case studies. Readers will gain a thorough understanding of the following areas to help deliver and deploy IP QoS and MPLS-based traffic engineering:

Fundamentals of and the need for IP QoS

The diffserv QoS architecture and its enabling QoS functionality

The intserv QoS model and its enabling QoS functions

ATM, Frame Relay, and IEEE 802.1p/802.1Q QoS technologies and their interworking with IP QoS

MPLS and MPLS VPN QoS and their interworking with IP QoS

MPLS traffic engineering

Routing policies, general IP QoS functions, and other miscellaneous QoS information
QoS applies to any IP-based network. As such, this book targets all IP networkscorporate intranets, service provider networks, and the Internet.
Audience
The book is written for internetworking professionals who are responsible for designing and maintaining IP services for corporate intranets and for service provider network infrastructures. If you are a network engineer, architect, planner, designer, or operator who has a rudimentary knowledge of QoS technologies, this book will
provide you with practical insights on what you need to consider to design and implement varying degrees of QoS in the network. This book also includes useful information for consultants, systems engineers, and sales engineers who design IP networks for clients. The information in this book addresses a wide audience because incorporating some measure of QoS is an integral part of any network design process.
Organization
This book consists of four parts: Part I, "IP QoS," focuses on IP QoS architectures and the QoS functions enabling them. Part II, "Layer 2, MPLS QoS: Interworking with IP QoS," lists the QoS mechanisms in ATM, Frame Relay, Ethernet, MPLS, and MPLS VPN and discusses how they map to IP QoS. Part III, "Traffic Engineering," discusses traffic engineering using MPLS. Finally, Part IV, "Appendixes," discusses the modular QoS command-line interface and miscellaneous QoS functions and provides some useful reference material. Most chapters include a case study section to help in implementation, as well as a question-and-answer section.
Part I
This part of the book discusses the IP QoS architectures and their enabling functions. Chapter 2 introduces the two IP QoS architectures, diffserv and intserv, and goes on to discuss the diffserv architecture. Chapters 3, 4, 5, and 6 discuss the different functions that enable the diffserv architecture. Chapter 3, for instance, discusses the QoS functions that condition the traffic at the network boundary to facilitate diffserv within the network. Chapters 4 and 5 discuss packet scheduling mechanisms that provide minimum bandwidth guarantees for traffic. Chapter 6 focuses on the active queue management techniques that proactively drop packets to signal congestion. Finally, Chapter 7 discusses the RSVP protocol and its two integrated service types.
Part II
This part of the book, comprising Chapters 8 and 9, discusses ATM, Frame Relay, IEEE 802.1p, IEEE 802.1Q, MPLS, and MPLS VPN QoS technologies and how they interwork to provide end-to-end IP QoS.
Part III
Chapter 10, the only chapter in Part III, talks about the need for traffic engineering and discusses MPLS traffic engineering operation.
Part IV
This part of the book has useful information that didn't fit well with previous sections but still is relevant in providing IP QoS. Appendix A, "Cisco Modular QoS Command-Line Interface," details the new user interface that enables flexible and modular QoS configuration. Appendix B, "Packet Switching Mechanisms," introduces the various packet-switching mechanisms available on Cisco platforms. It compares the switching mechanisms and recommends CEF, which also is a required packet-switching mechanism for certain QoS features. Appendix C, "Routing Policies," discusses QoS routing, policy-based routing, and QoS Policy Propagation using Border Gateway Protocol (QPPB). Appendix D, "Real-time Transport Protocol (RTP)," talks about the transport protocol used to carry real-time packetized audio and video traffic. Appendix E, "General IP Line Efficiency Functions," talks about some IP functions that help improve available bandwidth. Appendix F, "Link-Layer Fragmentation and Interleaving," discusses fragmentation and interleaving functionality with the Multilink Point-to-Point Protocol. Appendix G, "IP Precedence and DSCP Values," tabulates IP precedence and DSCP values. It also shows how IP precedence and DSCP values are mapped to each other.
Intserv Architecture
The Internet Engineering Task Force (IETF) set up the intserv Working Group (WG) in 1994 to expand the Internet's service model to better meet the needs of emerging, diverse voice/video applications. Its aims are to define a new, enhanced Internet service model and to provide the means for applications to express end-to-end resource requirements, with supporting mechanisms in routers and subnet technologies. The goal is to manage separately those flows that have requested a specific QoS. Two services, guaranteed[2] and controlled load[1], are defined for this purpose. Guaranteed service provides deterministic delay guarantees, whereas controlled load service provides a network service close to that provided by a best-effort network under lightly loaded conditions. Both services are discussed in detail in Chapter 7. Resource Reservation Protocol (RSVP) is suggested as the signaling protocol that delivers end-to-end service requirements[3]. The intserv model requires per-flow guaranteed QoS on the Internet. With the thousands of flows existing on the Internet today, the amount of state information required in the routers can be enormous. This creates scaling problems, because the state information grows as the number of flows increases, and it makes intserv hard to deploy on the Internet. In 1998, the diffserv Working Group was formed under the IETF. Diffserv is a bridge between intserv's guaranteed QoS requirements and the best-effort service offered by the Internet today. Diffserv provides traffic differentiation by classifying traffic into a few classes, with relative service priority among the traffic classes.
Diffserv Architecture
The diffserv approach[4] to providing QoS in networks employs a small, well-defined set of building blocks from which you can build a variety of services. Its aim is to redefine the Type of Service (ToS) byte of the Internet Protocol (IP) Version 4 header and the Traffic Class byte of IP Version 6 as the differentiated services (DS) byte, and to mark the standardized DS byte of the packet such that it receives a particular forwarding treatment, or per-hop behavior (PHB), at each network node. The diffserv architecture provides a framework[5] within which service providers can offer customers a range of network services, each differentiated based on performance. A customer can choose the performance level needed on a packet-by-packet basis by simply marking the packet's Differentiated Services Code Point (DSCP) field to a specific value. This value specifies the PHB given to the packet within the service provider network. Typically, the service provider and customer negotiate a profile describing the rate at which traffic can be submitted at each service level. Packets submitted in excess of the agreed profile might not be allotted the requested service level. The diffserv architecture specifies only the basic mechanisms for treating packets; you can build a variety of services by using these mechanisms as building blocks. A service defines some significant characteristic of packet transmission, such as throughput, delay, jitter, or packet loss, in one direction along a path in a network. In addition, you can characterize a service in terms of the relative priority of access to resources in a network. After a service is defined, a PHB is specified on all the nodes of the network offering this service, and a DSCP is assigned to the PHB. A PHB is an externally observable forwarding
behavior given by a network node to all packets carrying a specific DSCP value. Traffic requiring a specific service level carries the associated DSCP value in its packets. All nodes in the diffserv domain apply the PHB based on the DSCP field in the packet. In addition, the network nodes on the diffserv domain's boundary perform the important function of conditioning the traffic entering the domain. Traffic conditioning involves functions such as packet classification and traffic policing and is typically carried out on the input interface on which traffic arrives into the domain. Traffic conditioning plays a crucial role in engineering the traffic carried within a diffserv domain, such that the network can deliver the PHB for all traffic entering the domain. The diffserv architecture is illustrated in Figure 2-1. The two major functional blocks in this architecture are shown in Table 2-1. Figure 2-1 Diffserv Overview
Table 2-1. Functional Blocks in the Diffserv Architecture

Traffic Conditioners
  Location: Typically, on the input interface of the diffserv domain boundary router
  Enabling functions: Packet classification, traffic shaping, and policing (Chapter 3)
  Action: Polices incoming traffic and sets the DSCP field based on the traffic profile

PHB
  Location: All routers in the entire diffserv domain
  Enabling functions: Resource allocation (Chapters 4 and 5); packet drop policy (Chapter 6)
  Action: PHB applied to packets based on the service characteristic defined by the DSCP

Apart from these two functional blocks, resource allocation policy plays an important role in defining the policy for admission control, the ratio of resource overbooking, and so on. Note: Cisco introduced the modular QoS command-line interface (CLI) (discussed in Appendix A, "Cisco Modular QoS Command-Line Interface") to provide a clean separation and modular configuration of the different enabling QoS functions.
A general QoS operational model is shown in Figure 2-2. Figure 2-2 General QoS Operational Model
DSCP
The IETF diffserv group is standardizing the use of 6 bits of the ToS byte in the IP header as a DSCP. The lowest-order 2 bits are currently unused (CU). DSCP is an extension of the 3 bits used by IP precedence. Like IP precedence, you can use DSCP to provide differential treatment to appropriately marked packets. Figure 2-3 shows the ToS byte[6]. The ToS byte is renamed the DS byte with the standardization of the DSCP field. Figure 2-4 shows the DS byte. The DSCPs defined[7] thus far by the IETF Working Group are as follows:
Class Selector DSCPs: Defined to be backward-compatible with IP precedence; they are tabulated in Table 2-2.
Table 2-2. Class Selector DSCPs

  Class Selector    DSCP
  1                 001000
  2                 010000
  3                 011000
  4                 100000
  5                 101000
  6                 110000
  7                 111000
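Because each class selector carries the IP precedence value in its three high-order bits, mapping between the two is a simple bit shift. The following sketch (illustrative, not from the book; function names are mine) shows the relationship:

```python
def precedence_to_class_selector(precedence):
    """Class selector DSCP: the IP precedence in the three high-order bits,
    with the three low-order bits set to zero."""
    assert 0 <= precedence <= 7
    return precedence << 3            # e.g., precedence 5 -> 101000

def dscp_to_precedence(dscp):
    """Backward compatibility: any DSCP maps to an IP precedence by taking
    its three high-order bits."""
    return dscp >> 3                  # e.g., 101110 (EF) -> precedence 5
```

For example, precedence_to_class_selector(5) yields 101000, the class selector DSCP shown in Table 2-2 for precedence 5.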
Expedited Forwarding (EF) PHB: Defines a premium service. The recommended DSCP is 101110.
Assured Forwarding (AF) PHB: Defines four service levels, with each service level having three drop precedence levels. As a result, AF PHB recommends 12 code points, as shown in Table 2-3.
Table 2-3. AF PHB Recommended Code Points

  Drop Precedence    Class 1    Class 2    Class 3    Class 4
  Low                001010     010010     011010     100010
  Medium             001100     010100     011100     100100
  High               001110     010110     011110     100110
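The 12 AF code points follow a fixed bit layout: three class bits, two drop precedence bits, and a trailing zero. A small sketch (the function name is mine) derives any AF code point:

```python
def af_dscp(af_class, drop_precedence):
    """AF code point: class (1-4) in the high-order 3 bits, drop precedence
    (1 = low, 2 = medium, 3 = high) in the next 2 bits, low bit zero."""
    assert 1 <= af_class <= 4 and 1 <= drop_precedence <= 3
    return (af_class << 3) | (drop_precedence << 1)

# Reproduce the Class 1 column of Table 2-3:
column = [format(af_dscp(1, dp), "06b") for dp in (1, 2, 3)]
print(column)   # -> ['001010', '001100', '001110']
```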
Shaper: The shaper function delays traffic by buffering some packets so that they comply with the profile. This action is also referred to as traffic shaping.

Dropper: The dropper function drops all traffic that doesn't comply with the traffic profile. This action is also referred to as traffic policing.
PHB
Network nodes with diffserv support use the DSCP field in the IP header to select a specific PHB for a packet. A PHB is a description of the externally observable forwarding behavior of a diffserv node applied to a set of packets with the same DSCP. You can define a PHB in terms of its resource priority relative to other PHBs, or in terms of observable traffic service characteristics, such as packet delay, loss, or jitter. You can view a PHB as a black box, because it defines an externally observable forwarding behavior without mandating a particular implementation. In a diffserv network, best-effort behavior can be viewed as the default PHB. Diffserv recommends specific DSCP values for each PHB, but a network provider can choose to use DSCPs other than the recommended values in his or her network. The recommended DSCP for best-effort behavior is 000000. The PHB of a specific traffic class depends on a number of factors:

Arrival rate or load for the traffic class: This is controlled by the traffic conditioning at the network boundary.

Resource allocation for the traffic class: This is controlled by the resource allocation on the nodes in the diffserv domain. Resource allocation in the network nodes is discussed in Chapter 4, "Per-Hop Behavior: Resource Allocation I," and Chapter 5, "Per-Hop Behavior: Resource Allocation II."

Traffic loss: This depends on the packet discard policy on the nodes in the diffserv domain. This function is covered in Chapter 6, "Per-Hop Behavior: Congestion Avoidance and Packet Drop Policy."

Two PHBs, EF and AF, are standardized. They are discussed in the following sections.

EF PHB

You can use the EF PHB to build a low-loss, low-latency, low-jitter, assured-bandwidth, end-to-end service through diffserv domains.[8] EF PHB targets applications such as Voice over IP (VoIP) and video conferencing, and services such as virtual leased line, because the service looks like a point-to-point connection to a diffserv network's end nodes.
Such a service is also often termed a premium service. The main contributors to high packet delay and jitter are queuing delays caused by large, accumulated queues. Such queues are typical at network congestion points. Network congestion occurs when the arrival rate of packets exceeds their departure rate. You can essentially eliminate queuing delays if the maximum arrival rate is less than the minimum departure rate. The EF service sets the departure rate, whereas you can control the traffic arrival rate at the node by using appropriate traffic conditioners at the network boundary. An EF PHB needs to ensure that the traffic sees no or minimal queuing and, hence, needs a configured departure rate for the traffic that is equal to or greater than the traffic's arrival rate. The departure rate, or bandwidth,
should be independent of the other traffic at any time. The packet arrival and departure rates are typically measured at intervals equal to the time it takes to transmit a link's maximum transmission unit (MTU)-sized packet. A router can allocate resources for a certain departure rate on an interface by using different EF functionality implementations. Packet scheduling techniques, such as Class-Based Weighted Fair Queuing (CBWFQ), Weighted Round Robin (WRR), and Deficit Round Robin (DRR), provide this functionality when the EF traffic is carried over a highly weighted queue; that is, a weight that allocates a much higher rate to EF traffic than the actual EF traffic arrival rate. Further, you can modify these scheduling techniques to include a priority queue to carry EF traffic. The scheduling techniques are discussed in detail in Chapter 5. When EF traffic is implemented using a priority queue, it is important to ensure that a busy EF priority queue does not starve the remaining traffic queues beyond a certain configurable limit. To alleviate this problem, a user can set up a maximum rate against which the traffic serviced by the priority queue is policed. If the traffic exceeds the configured rate limit, all excess EF traffic is dropped. The network boundary traffic conditioners should be configured such that the EF traffic never exceeds its maximum configured rate at any hop in the network. The recommended DSCP for EF traffic in the network is 101110.

AF PHB

AF PHB[9] is a means for a service provider to offer different levels of forwarding assurance for IP packets received from a customer diffserv domain. It is suitable for most Transmission Control Protocol (TCP)-based applications. An AF PHB provides differentiated service levels among the four AF traffic classes. Each AF traffic class is serviced in its own queue, enabling independent capacity management for the four traffic classes.
Within each AF class are three drop precedence levels (Low, Medium, and High) for Random Early Detection (RED)-like queue management.
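The RED-like treatment of the three drop precedence levels can be sketched as follows. The thresholds and probabilities here are illustrative assumptions, not Cisco defaults; the point is that a higher drop precedence gets lower thresholds, so its packets are discarded earlier as the average queue depth grows:

```python
# Illustrative (min_threshold, max_threshold, max_drop_probability) per AF
# drop precedence level; real values are platform- and configuration-specific.
PROFILES = {
    1: (30, 40, 0.10),   # low drop precedence: dropped last
    2: (20, 40, 0.20),   # medium drop precedence
    3: (10, 40, 0.30),   # high drop precedence: dropped first
}

def drop_probability(avg_queue_depth, drop_precedence):
    """RED-style drop probability for the given average queue depth."""
    min_th, max_th, max_p = PROFILES[drop_precedence]
    if avg_queue_depth < min_th:
        return 0.0                     # below the min threshold: never drop
    if avg_queue_depth >= max_th:
        return 1.0                     # above the max threshold: always drop
    # linear ramp between the two thresholds
    return max_p * (avg_queue_depth - min_th) / (max_th - min_th)
```

With these assumed profiles, at an average depth of 25 packets, high drop precedence (level 3) packets already see a nonzero drop probability while low drop precedence (level 1) packets see none.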
RSVP can scale well to support a few thousand per-flow sessions running in parallel. In addition, work is ongoing to provide aggregated RSVP, in which multiple RSVP reservations are aggregated into a single reservation for large-scale RSVP deployment across a core network backbone that requires topology-aware admission control. An aggregated RSVP reservation is a fat, slowly adjusting reservation state that reduces the signaling state information in the network core. As with a normal RSVP reservation, you can map the aggregate reservation to a diffserv class.

QoS Policy Manager

The policy definition determines the QoS applied to a traffic flow. The policy identifies the critical application traffic in the network and specifies its QoS level. Policy is simply the configuration needed in all the individual network nodes to enable QoS. How does a QoS node get its policy? In simple terms, a network engineer can configure the policies by making changes to a router's configuration. On a large-scale network, however, this process becomes tedious and unmanageable. To deliver end-to-end QoS on a large-scale network, the policies applied across all the individual nodes in the network should be consistent. As such, a centralized policy manager makes the task of defining policies less daunting. The policy manager can then distribute the policy to all the network devices. Common Open Policy Service (COPS) is an IETF protocol for distributing policy. In COPS terminology, the centralized policy server is called the Policy Decision Point (PDP). The network node implementing or enforcing the policy is called the Policy Enforcement Point (PEP). The PDP uses the COPS protocol to download policies into the PEPs in the network. A PEP device can generate a message informing the PDP if it cannot implement a policy given to it by the PDP.
IP Precedence Versus DSCP

As discussed in this chapter, the diffserv architecture needs traffic conditioners at the network boundary, and resource allocation and packet discard functions in the network core, to provide EF and AF services. Because the DSCP field definitions were not fully settled until recently, the diffserv architecture was initially supported using the 3-bit IP Precedence field because, historically, IP Precedence is used to mark QoS or precedence in IP packets. Cisco IOS is fully aligned with the diffserv architecture and provides all required network edge and core QoS functions based on the 3-bit IP Precedence field. The 3-bit IP Precedence and 6-bit DSCP fields are used for exactly the same purpose in a diffserv network: marking packets at the network edge and triggering specific packet queuing and discard behavior in the network. Further, the DSCP field definition is backward-compatible with the IP Precedence values. Hence, DSCP field support doesn't require any change in the existing basic functionality and architecture. Soon, all IP QoS functions will support the DSCP field along with IP Precedence. Cisco introduced the modular QoS CLI to provide a clean separation and modular configuration of the different enabling QoS functions. The modular QoS CLI is discussed in Appendix A, "Cisco Modular QoS Command-Line Interface." DSCP support for the various QoS functions is part of the modular QoS architecture.
Summary
Two Internet QoS architectures, intserv and diffserv, are introduced in this chapter. The intserv architecture is defined to enable applications to request end-to-end network resource requirements, with RSVP suggested as the signaling protocol for those requests. This architecture can run into scalability issues in large networks, especially on the Internet, where the number of traffic flows can run in the tens of thousands. Intserv is discussed in detail in Chapter 7. This chapter focuses on the diffserv architecture. Diffserv defines the architecture for implementing scalable service differentiation on the Internet. Diffserv uses the newly standardized DSCP field in the IP header to mark the QoS required by a packet, and a diffserv-enabled network delivers the packet with the PHB indicated by the DSCP field. Traffic is policed and marked appropriately at the edge of the diffserv-enabled network. A PHB is an externally observable forwarding behavior given by a network node to all packets carrying a specific DSCP value. You can specify PHBs in terms of their relative priority in access to resources, or in terms of their relative traffic performance characteristics. Two PHBs, EF and AF, are defined. EF is targeted at real-time applications, such as VoIP. AF provides different forwarding assurance levels for packets based on their DSCP field. COPS is a new IETF protocol for distributing QoS policy information across the network.
References
1. "Specification of the Controlled-Load Network Element Service," J. Wroclawski, RFC 2211, 1997.
2. "Specification of Guaranteed Quality of Service," S. Shenker, C. Partridge, and R. Guerin, RFC 2212, 1997.
3. "The Use of RSVP with IETF Integrated Services," J. Wroclawski, RFC 2210, 1997.
4. IETF Differentiated Services Working Group, http://www.ietf.org/html.charters/diffserv-charter.html
5. "A Framework for Differentiated Services," Y. Bernet and others, Internet Draft.
6. "Type of Service in the Internet Protocol Suite," P. Almquist, RFC 1349, 1992.
7. "Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers," K. Nichols and others, RFC 2474, 1998.
8. "An Expedited Forwarding PHB," V. Jacobson, K. Nichols, and K. Poduri, RFC 2598, 1999.
9. "Assured Forwarding PHB Group," J. Heinanen, F. Baker, W. Weiss, and J. Wroclawski, RFC 2597, 1999.
Chapter 3. Network Boundary Traffic Conditioners: Packet Classifier, Marker, and Traffic Rate Management
Traffic conditioning functions at the network boundary are vital to delivering differentiated services within a network domain. These functions include packet classification, packet marking, and traffic rate management. In a network, packets are generally differentiated on a flow basis by the five flow fields in the Internet Protocol (IP) packet header: source IP address, destination IP address, IP protocol field, source port, and destination port. An individual flow is made up of packets going from an application on a source machine to an application on a destination machine, and packets belonging to a flow carry the same values for the five packet header flow fields. Quality of service (QoS) applied on an individual flow basis, however, is not scalable, because the number of flows can be large. Instead, routers at the network boundary perform classifier functions to identify packets belonging to a certain traffic class based on one or more Transmission Control Protocol/Internet Protocol (TCP/IP) header fields. A marker function is used to color the classified traffic by setting either the IP Precedence or the Differentiated Services Code Point (DSCP) field. Within the network core, you can apply a per-hop behavior (PHB) to packets based on either the IP Precedence or the DSCP field marked in the packet header. Another important traffic conditioning function at the network boundary is traffic rate management. It enables a service provider to meter a customer's traffic entering the network against the customer's traffic profile by using a policing function. Conversely, an enterprise connecting to its service provider can shape its traffic, sending it out at a constant rate so that all of it passes through the service provider's policing functions. Network boundary traffic conditioners are essential to delivering differentiated services in a network.
Packet Classification
Packet classification is a means of identifying packets as belonging to a certain class based on one or more fields in the packet. The identification function can range from straightforward to complicated. The different classification support types include the following:

- IP flow identification based on the five flow parameters: source IP address, destination IP address, IP protocol field, source port number, and destination port number.
- Identification based on the IP Precedence or DSCP field.
- Packet identification based on other TCP/IP header parameters, such as packet length.
- Identification based on source and destination Media Access Control (MAC) addresses.
- Application identification based on port numbers, Web Universal Resource Locator (URL) addresses, and so on. This functionality is available in Cisco products as Network-Based Application Recognition (NBAR).
You can use access lists to match packets based on the various flow parameters. Access lists also can identify packets based on the IP Precedence or DSCP field. NBAR enables a router to recognize traffic flows as belonging to a specific application, enabling packet classification based on application. Packet classification also can be done based on information internal to the router; examples are identification based on the input interface on which the packet arrived and on the QoS group field in the internal packet data structure. All the preceding classification mechanisms are supported across all QoS functions as part of the modular QoS command-line interface (CLI). The modular QoS CLI is discussed in Appendix A, "Cisco Modular QoS Command-Line Interface." The action taken on classified packets is referred to as packet marking or packet coloring: packets identified as belonging to a class are colored accordingly.
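As a rough sketch of flow-based classification (the rule format and class names here are invented for illustration), a classifier can key on the five flow fields and return the first matching class, in the spirit of a first-match access list:

```python
def flow_tuple(pkt):
    """The five IP header fields that identify an individual flow."""
    return (pkt["src"], pkt["dst"], pkt["proto"], pkt["sport"], pkt["dport"])

def classify(pkt, rules):
    """Return the traffic class of the first rule whose fields all match;
    unmatched packets fall through to best-effort."""
    for match, traffic_class in rules:
        if all(pkt.get(field) == value for field, value in match.items()):
            return traffic_class
    return "best-effort"

rules = [
    ({"proto": 6, "dport": 80}, "web"),     # TCP to port 80
    ({"proto": 17}, "udp"),                 # any UDP traffic
]

pkt = {"src": "10.1.1.1", "dst": "10.2.2.2",
       "proto": 6, "sport": 33000, "dport": 80}
print(classify(pkt, rules))                 # -> web
```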
Packet Marking
You can mark classified packets to indicate their traffic class. You can color packets by marking the IP Precedence or the DSCP field in the packet's IP header, or the QoS group field in the packet's internal data structure within a router.
IP Precedence
The IP Precedence field in the packet's IP header is used to indicate the relative priority with which a particular packet should be handled. It is made up of 3 bits in the IP header's Type of Service (ToS) byte. Apart from IP Precedence, the ToS byte contains the ToS bits, which were designed to indicate how each packet should be handled in a network; this field, however, has seen little use in the real world. The ToS byte[1], showing both the IP Precedence and ToS bits, is illustrated in Figure 2-3 in Chapter 2, "Differentiated Services Architecture." Table 3-1 shows the different IP Precedence bit values and their names[2].

Table 3-1. IP Precedence Values and Names

  IP Precedence Bits    IP Precedence Name
  000                   Routine
  001                   Priority
  010                   Immediate
  011                   Flash
  100                   Flash Override
  101                   Critical
  110                   Internetwork Control
  111                   Network Control
All routing control traffic in the network uses IP Precedence 6 by default. IP Precedence 7 also is reserved for network control traffic. Hence, use of IP Precedences 6 and 7 is not recommended for user traffic.
Packet coloring by setting the IP Precedence field can be done either by the application originating the traffic or by a node in the network. Cisco QoS features supporting this function include Committed Access Rate (CAR), Policy-Based Routing (PBR), and QoS Policy Propagation using Border Gateway Protocol (BGP) (QPPB). Case Study 3-1, later in this chapter, discusses setting IP Precedence by using these different QoS features. CAR is discussed in the "Traffic Policing" section of this chapter. PBR and QPPB are discussed in Appendix C, "Routing Policies." Note: PBR is primarily a feature to route packets based on policy, though it has functionality for marking packets with IP Precedence. As such, PBR is recommended for marking packets only when CAR or QPPB support is unavailable, or when IP Precedence needs to be marked while routing packets based on a policy.
DSCP
DSCP field is used to indicate a certain PHB in a network. It is made up of 6 bits in the IP header and is being standardized by the Internet Engineering Task Force (IETF) Differentiated Services Working Group. The
original ToS byte containing the DSCP bits has been renamed the DS byte. The DS byte is shown in Figure 2-4 in Chapter 2. The DSCP field definitions[3] and the recommended DSCP values for the different forwarding behaviors are also discussed in Chapter 2. The DSCP field is part of the IP header, similar to IP Precedence. In fact, the DSCP field is a superset of the IP Precedence field. Hence, the DSCP field is used and set in ways similar to those described for IP Precedence. Note that the DSCP field definition is backward-compatible with the IP Precedence values.
This case study discusses the IP Precedence setting function using CAR, PBR, and QPPB. On the ISP router's High-Speed Serial Interface (HSSI) connecting to the enterprise customer, you configure CAR, PBR, or QPPB commands to set IP Precedence values based on the source IP address for all incoming traffic.

IP Precedence Using CAR

Listing 3-1 is a sample configuration to set IP Precedence using CAR.

Listing 3-1 Using CAR to Classify Traffic Based on IP Precedence

interface Hssi0/0/1
 rate-limit input access-group 1 45000000 22500 22500 conform-action set-prec-transmit 5 exceed-action set-prec-transmit 5
 rate-limit input 45000000 22500 22500 conform-action set-prec-transmit 4 exceed-action set-prec-transmit 4
!
access-list 1 permit 215.215.215.0 0.0.0.255

Two rate-limit statements are defined to set the IP Precedence values. The first statement sets IP Precedence 5 for all traffic carrying a source address in the 215.215.215.0/24 address space. The rest of the traffic arriving on the Hssi0/0/1 interface is set with an IP Precedence of 4. Note that though rate-limit parameters are defined in the two statements, the purpose is not to rate-limit but simply to set IP Precedence based on an IP access list.

Note: CAR provides two functions in a single statement: rate limiting and IP Precedence setting. In this case study, the purpose is only to set IP Precedence, but a CAR statement requires both the rate-limiting and IP Precedence functions to be configured. In the future, with support for the modular QoS CLI, setting IP Precedence will not require enabling rate-limiting functions.
IP Precedence Using PBR

Listing 3-2 is a sample configuration to set IP Precedence using PBR.

Listing 3-2 Using PBR to Classify Traffic Based on IP Precedence

interface Hssi0/0/1
 ip policy route-map tasman
!
route-map tasman permit 10
 match ip address 1
 set ip precedence 5
route-map tasman permit 20
 set ip precedence 4
!
access-list 1 permit 215.215.215.0 0.0.0.255

The route map tasman is used to set IP Precedence 5 for traffic with a source address in the 215.215.215.0/24 address space and IP Precedence 4 for the rest of the traffic.

IP Precedence Using QPPB

The ISP router receives a BGP route of 215.215.215.0 from the enterprise network. Within the ISP router's BGP configuration, a table map tasman is defined to tag the route 215.215.215.0 with a precedence of 5. The rest of the enterprise's BGP routes are tagged with IP Precedence 4. BGP installs each route in the routing table with its associated IP Precedence value. Cisco Express Forwarding (CEF) carries the IP Precedence value along with the forwarding information from the routing table. CEF is discussed in Appendix B, "Packet Switching Mechanisms." When CEF switches incoming traffic on the Hssi0/0/1 interface, it checks the packet's source address and tags the packet with the associated IP Precedence value from the matching CEF entry before transmitting the packet on the outgoing interface. The configuration for this purpose is shown in Listing 3-3.

Listing 3-3 Using QPPB to Classify Traffic Based on IP Precedence

interface Hssi0/0/1
 ip address 217.217.217.1 255.255.255.252
 bgp-policy source ip-prec-map
!
router bgp 10
 table-map tasman
 neighbor 217.217.217.2 remote-as 2345
!
route-map tasman permit 10
 match ip address 1
 set ip precedence 5
route-map tasman permit 20
 set ip precedence 4
!
access-list 1 permit 215.215.215.0 0.0.0.255
Case Study 3-2: Packet Classification and Marking Using QoS Groups
As in Case Study 3-1, assume that the ISP prefers to use QoS groups to indicate different traffic service levels on its routers that connect to customers. In this case study, the enterprise customer traffic from network 215.215.215.0/24 gets a QoS group 3 service level, and the rest of the traffic gets a service level of QoS group 0, as shown in Figure 3-1.
Note You can choose to classify packets by QoS group rather than IP Precedence either when you want more than eight traffic classes, or when you don't want to change the original packets' IP Precedence values.
This case study discusses packet classification based on the QoS group using CAR and QPPB. It shows sample CAR and QPPB configurations on the ISP router to classify and mark packets using QoS group values.

QoS Groups Using CAR

Listing 3-4 is a sample configuration to classify traffic based on QoS groups using CAR.

Listing 3-4 Using CAR to Classify Traffic Based on QoS Groups

interface Hssi0/0/1
 rate-limit input access-group 1 45000000 22500 22500 conform-action set-qos-transmit 3 exceed-action drop
 rate-limit input 45000000 22500 22500 conform-action set-qos-transmit 0 exceed-action drop
!
access-list 1 permit 215.215.215.0 0.0.0.255

QoS Groups Using QPPB

Listing 3-5 is a sample configuration to classify traffic based on QoS groups using QPPB.

Listing 3-5 Using QPPB to Classify Traffic Based on QoS Groups

interface Hssi0/0/1
 ip address 217.217.217.1 255.255.255.252
 bgp-policy source ip-qos-map
!
router bgp 10
 table-map tasman
 neighbor 217.217.217.2 remote-as 2345
!
route-map tasman permit 10
 match ip address 1
 set ip qos-group 3
route-map tasman permit 20
 set ip qos-group 0
!
access-list 1 permit 215.215.215.0 0.0.0.255
The service provider needs to enforce an IP Precedence setting of 0 for all traffic coming from best-effort service customers. For traffic coming from premium customers, the service provider has to mark the packets with an IP Precedence of 5. Input CAR is configured on the HSSI interfaces of the router to enforce the IP Precedence values. Listing 3-6 shows a sample configuration.

Listing 3-6 Using CAR to Enforce IP Precedence on Traffic

interface Hssi1/0
 rate-limit input 45000000 22500 22500 conform-action set-prec-transmit 0 exceed-action set-prec-transmit 0
!
interface Hssi1/1
 rate-limit input 10000000 22500 22500 conform-action set-prec-transmit 5 exceed-action set-prec-transmit 5

Interface Hssi1/0 connects to a customer with normal, best-effort service. Therefore, all incoming traffic on this interface is tagged with an IP Precedence of 0. Similarly, interface Hssi1/1 connects to a customer with premium service. Therefore, all incoming traffic on this interface is tagged with an IP Precedence of 5.
Table 3-3. Comparison Between Policing and Shaping Functions

Policing Function (CAR):
- Sends conforming traffic up to the line rate and allows bursts.
- When tokens are exhausted, it can drop packets.
- Works for both input and output traffic.
- Transmission Control Protocol (TCP) senses the full line speed but adapts to the configured rate by lowering its window size when a packet drop occurs.

Shaping Function (TS):
- Smoothes traffic and sends it out at a constant rate.
- When tokens are exhausted, it buffers packets and sends them out later, when tokens are available.
- Implemented for output traffic only.
- TCP can detect that it has a lower-speed line and adapt its retransmission timer accordingly. This results in less scope for retransmissions and is TCP-friendly.
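The contrast in Table 3-3 shows up even in a toy, tick-based single-rate simulation (parameters are illustrative): the policer drops what it cannot send immediately, whereas the shaper buffers the excess and sends it in later ticks.

```python
def run(tick_arrivals, tokens_per_tick, bucket_cap, shape):
    """Process per-tick packet arrivals (byte sizes) against a token bucket.
    shape=False models a policer (excess dropped); shape=True models a
    shaper (excess buffered and sent when tokens accumulate)."""
    tokens, backlog, sent_per_tick, dropped = bucket_cap, [], [], 0
    for arrivals in tick_arrivals:
        tokens = min(bucket_cap, tokens + tokens_per_tick)
        queue, backlog = backlog + list(arrivals), []
        sent = 0
        for size in queue:
            if size <= tokens:
                tokens -= size
                sent += size
            elif shape:
                backlog.append(size)   # shaper: hold for a later tick
            else:
                dropped += 1           # policer: drop on the spot
        sent_per_tick.append(sent)
    return sent_per_tick, dropped, backlog
```

With three 1500-byte packets arriving in one tick against a 1500-byte-per-tick rate, the policer sends one packet and drops two, while the shaper sends all three, one per tick.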
Traffic Policing
The traffic policing or rate-limiting function is provided by CAR. CAR offers two primary functions: packet coloring by setting IP Precedence, and rate-limiting. As a traffic policing function, CAR doesn't buffer or smooth traffic and might drop packets when the allowed bursting capability is exceeded. CAR is implemented as a list of rate-limit statements. You can apply it to both the output and input traffic on an interface. A rate-limit statement is implemented as follows:
rate-limit {input | output} [access-group [rate-limit] acl-index] <CIR> <conformed burst> <extended burst> conform-action <action> exceed-action <action>
Each rate-limit statement is made up of three elements: a traffic matching specification, the token bucket parameters (rate, conformed burst, and extended burst), and the conform and exceed actions. The rate-limit statements are processed as shown in Figure 3-3.

Figure 3-3 The Evaluation Flow of Rate-Limit Statements
Two special rate-limit access lists are provided to match IP Precedence and MAC addresses. Examples of the rate-limit access lists and other traffic match specification usage are given in the case studies in this chapter.
The size of the token bucket (the maximum number of tokens it can hold) is equal to the conformed burst (BC). For each packet to which the CAR limit is applied, tokens are removed from the bucket in accordance with the packet's byte size. When enough tokens are available to service a packet, the packet is transmitted. If a packet arrives and the number of available tokens in the bucket is less than the packet's byte size, however, the extended burst (BE) comes into play. Consider the following two cases:

Standard token bucket, where BE = BC. A standard token bucket has no extended burst capability; its extended burst is equal to its BC. In this case, the packet is dropped when tokens are unavailable.

Token bucket with extended burst capability, where BE > BC. A token bucket with an extended burst capability allows a stream to borrow more tokens, unlike the standard token bucket scheme. Because this discussion concerns borrowing, this section introduces two terms related to debt: actual debt (DA) and compounded debt (DC). They are used to explain the behavior of an extended burst-capable token bucket.

DA is the number of tokens the stream has currently borrowed. It is reduced at regular intervals through the accumulation of tokens at the configured committed rate. Say you borrow 100 tokens for each of the three packets you send after the last packet drop. The DA is 100, 200, and 300 after the first, second, and third packets are sent, respectively.
DC is the sum of the DA values of all packets sent since the last time a packet was dropped. Unlike DA, which is an actual count of the borrowed tokens since the last packet drop, DC is the sum of the actual debts for all the packets that borrowed tokens since the last CAR packet drop. Say, as in the previous example, you borrow 100 tokens for each of the three packets you send after the last packet drop. DC equals 100, 300 (= 100 + 200), and 600 (= 100 + 200 + 300) after the first, second, and third packets are sent, respectively. Note that for the first packet that needs to borrow tokens after a packet drop, DC is equal to DA. The DC value is set to zero after a packet is dropped, and the next packet that needs to borrow has a new value computed, which is equal to the DA. In the example, if the fourth packet gets dropped, the next packet that needs to borrow tokens (for example, 100) has its DC = DA = 400 (= 300 + 100). Note that, unlike DC, DA is not forgiven after a packet drop. If DA is greater than the extended limit, all packets are dropped until DA is reduced through the accumulation of tokens. The purpose of a token bucket with extended burst capability is to avoid entering immediately into a tail-drop scenario, as a standard token bucket does, and instead to drop packets gradually, in a more Random Early Detection (RED)-like fashion. RED is discussed in Chapter 6, "Per-Hop Behavior: Congestion Avoidance and Packet Drop Policy." If a packet arrives and needs to borrow tokens, a comparison is made between BE and DC. If DC is greater than BE, the packet is dropped and DC is set to zero. Otherwise, the packet is sent, DA is incremented by the number of tokens borrowed, and DC is incremented by the newly computed DA value. Note that tokens are not removed from the bucket for dropped packets (that is, dropped packets do not count against any rate or burst limits).
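The borrowing and debt bookkeeping described above can be sketched in Python. This is a simplified illustrative model, not Cisco's implementation: token replenishment at the committed rate is omitted, and each packet is assumed to borrow its full byte count.

```python
def car_decision(state, borrowed, be):
    """One CAR send/drop decision for a packet that must borrow `borrowed`
    tokens, per the extended-burst scheme described above (simplified)."""
    if state["dc"] > be:            # compounded debt exceeds the extended burst
        state["dc"] = 0             # DC is forgiven after the drop...
        return "drop"               # ...but DA is not
    state["da"] += borrowed         # actual debt grows by the tokens borrowed
    state["dc"] += state["da"]      # compounded debt grows by the new DA
    return "send"

# Walk through the example: each packet borrows 100 tokens; assume Be = 500.
state = {"da": 0, "dc": 0}
decisions = [car_decision(state, 100, 500) for _ in range(4)]
# After three sends, DA = 300 and DC = 600; the fourth packet finds
# DC (600) > Be (500), so it is dropped and DC resets to zero while DA stays.
```

Running the walkthrough reproduces the DA sequence 100, 200, 300 and the DC sequence 100, 300, 600 from the text, followed by a drop.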
It is important to note that CIR is a rate expressed in bits per second, whereas the bursts are expressed in bytes. A burst counter counts the current burst size. The burst counter can be either less than or greater than BC. When the burst counter exceeds BC, the burst counter equals BC + DA. When a packet arrives, the burst counter is evaluated, as shown in Figure 3-5.

Figure 3-5 Action Based on the Burst Counter Value
For cases when the burst counter value is between BC and BE, you can approximately represent the exceed action probability as:

(burst counter - BC) / (BE - BC)

Based on this approximation, the CAR packet drop probability is shown in Figure 3-6. The concept of exceed action packet drop probability between the conformed and extended burst is similar to the packet drop probability of RED between the minimum and maximum thresholds.
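The approximation above can be expressed as a small Python helper (the function name is invented here for illustration):

```python
def exceed_probability(burst_counter, bc, be):
    """Approximate probability of taking the exceed action (for example, drop)
    when the burst counter sits between Bc and Be, per the formula above."""
    if burst_counter <= bc:
        return 0.0                  # within the conformed burst
    if burst_counter >= be:
        return 1.0                  # beyond the extended burst
    return (burst_counter - bc) / (be - bc)

# Halfway between Bc = 8000 and Be = 16000, the exceed probability is 0.5.
```

As with RED between its minimum and maximum thresholds, the probability rises linearly from 0 at BC to 1 at BE.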
Note The discussion in this section assumes packet transmission and packet drop as conform and exceed actions, respectively, for simplicity. In reality, a number of options are available to define both conform and exceed actions. They are discussed in the next section, "The Action Policy."
In a simple rate-limit statement where BC and BE have the same value, no variable drop probability exceed region exists. The CAR implementation puts the following constraints on the token bucket parameters:

- The rate (bps) should be in increments of 8 Kbps.
- The minimum BC is the rate (bps) divided by 2000, and no less than 8000 bytes; 8000 bytes is the lowest value allowed for both the conformed and extended burst sizes.
- BE is always equal to or greater than BC.
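These constraints can be captured in a small validation helper. This is a sketch; the function name and return convention are invented here, and the rules are exactly those listed above.

```python
def validate_car_params(rate_bps, bc_bytes, be_bytes):
    """Return the list of CAR token-bucket constraints (from the text above)
    violated by the given parameters; an empty list means they are legal."""
    violations = []
    if rate_bps % 8000 != 0:
        violations.append("rate must be in increments of 8 kbps")
    if bc_bytes < max(8000, rate_bps // 2000):
        violations.append("Bc must be at least max(8000 bytes, rate / 2000)")
    if be_bytes < bc_bytes:
        violations.append("Be must be equal to or greater than Bc")
    return violations

# The parameters from Listing 3-4 (45 Mbps, 22500/22500 bytes) are legal.
```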
Case study 3-9 discusses how you can use action continue. Note that classification using qos-group functionality is available only in Versatile Interface Processor (VIP)-based 7500 series routers.
Note Per-Interface Rate Configuration (PIRC) is a limited-scope feature implementation of CAR targeted for the Cisco 12000 series routers. PIRC fits within the microcode space of the packet-switching application-specific integrated circuit (ASIC) on certain line cards of the router. You can apply only one CAR rule to the interface to rate-limit and/or set IP Precedence. PIRC is supported only in the ingress direction.
Case Study 3-4: Limiting a Particular Application's Traffic Rate at a Service Level
A service provider defines a traffic service level for one of its premium customers: All customer traffic except Hypertext Transfer Protocol (HTTP) (Web) traffic over a rate of 15 Mbps is marked with an IP Precedence of 4. HTTP traffic over a rate of 15 Mbps is transmitted with an IP Precedence of 0. The customer has a 30-Mbps service from the service provider. On the service provider boundary router connecting the premium customer, you enable CAR as shown in Listing 3-7.

Listing 3-7 Limiting HTTP Traffic at a Service Level to a Specific Rate

interface Hssi1/0/0
 rate-limit input 30000000 15000 15000 conform-action continue exceed-action drop
 rate-limit input access-group 101 15000000 10000 10000 conform-action set-prec-transmit 4 exceed-action set-prec-transmit 0
 rate-limit input 30000000 15000 15000 conform-action set-prec-transmit 4 exceed-action set-prec-transmit 4
!
access-list 101 permit tcp any any eq www
access-list 101 permit tcp any eq www any

The first CAR statement rate-limits all incoming traffic to 30 Mbps. It uses a continue action so that evaluation continues with the next rate-limit statement. The second rate-limit statement sets the IP Precedence value for HTTP traffic based on its traffic rate. The last rate-limit statement sets the IP Precedence value to 4 for all non-HTTP traffic. Listings 3-8 and 3-9 give the output of some relevant show commands for CAR.
Listing 3-8 CAR Parameters and Packet Statistics

#show interfaces hssi1/0/0 rate-limit
Hssi1/0/0
 Input
  matches: all traffic
   params: 30000000 bps, 15000 limit, 15000 extended limit
   conformed 0 packets, 0 bytes; action: continue
   exceeded 0 packets, 0 bytes; action: drop
   last packet: 338617304ms ago, current burst: 0 bytes
   last cleared 00:11:11 ago, conformed 0 bps, exceeded 0 bps
  matches: access-group 101
   params: 15000000 bps, 10000 limit, 10000 extended limit
   conformed 0 packets, 0 bytes; action: set-prec-transmit 4
   exceeded 0 packets, 0 bytes; action: set-prec-transmit 0
   last packet: 338617201ms ago, current burst: 0 bytes
   last cleared 00:11:11 ago, conformed 0 bps, exceeded 0 bps
  matches: all traffic
   params: 15000000 bps, 10000 limit, 10000 extended limit
   conformed 0 packets, 0 bytes; action: set-prec-transmit 4
   exceeded 0 packets, 0 bytes; action: set-prec-transmit 4
   last packet: 338617304ms ago, current burst: 0 bytes
   last cleared 00:03:30 ago, conformed 0 bps, exceeded 0 bps

The show interfaces rate-limit command displays the CAR configuration parameters as well as CAR packet statistics. In the display, current burst gives a snapshot of the value in the token bucket at the time the value is printed. The conformed bps is obtained by dividing the total conformed traffic passed by the time elapsed since the counters were last cleared.

Listing 3-9 Input and Output IP Precedence Accounting Information on Interface Hssi1/0/0

#show interfaces hssi1/0/0 precedence
Hssi1/0/0
 Input
  Precedence 4: 10 packets, 1040 bytes
 Output
  Precedence 4: 10 packets, 1040 bytes

The show interfaces precedence command shows input and output packet accounting on an interface when ip accounting precedence input/output is applied on the interface.
On the HSSI interface going to the partner network, the service provider adds the configuration shown in Listing 3-10 for this functionality.

Listing 3-10 Rate-Limiting Traffic Based on Packet Precedence Values

interface Hssi1/0/0
 rate-limit input access-group rate-limit 1 10000000 10000 10000 conform-action transmit exceed-action drop
!
access-list rate-limit 1 mask 07

To match a range of precedence values, you use a mask of 8 bits, where each bit refers to a precedence value. Precedence 7 is 10000000, precedence 6 is 01000000, precedence 5 is 00100000, and so on. In this example, the mask 07 is in hex, which is 00000111 in binary. Hence, the mask matches precedence values 0, 1, and 2.
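The mask-to-precedence mapping can be checked with a few lines of Python (an illustrative helper; the name is invented here):

```python
def matched_precedences(mask_hex):
    """Expand a rate-limit access-list precedence mask, given as a hex
    string, into the IP Precedence values it matches (bit n = precedence n)."""
    mask = int(mask_hex, 16)
    return [p for p in range(8) if mask & (1 << p)]

# Mask 07 (binary 00000111) matches precedences 0, 1, and 2;
# mask E0 (binary 11100000) would match precedences 5, 6, and 7.
```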
We use a rate-limit access list statement to match the precedence values 0, 1, and 2. IP-precedence-based rate-limit access lists range from 1 to 99. The rate-limit statement limits all traffic with precedence 0, 1, and 2 to 10 Mbps.
All packets being sent by the web server with an IP address of 209.11.212.1 are set with an IP Precedence value based on the traffic rate. All traffic within the 5-Mbps rate gets an IP Precedence of 4, and the rest get an IP Precedence of 0.
Listing 3-13 Limiting Denial-of-Service Attacks

interface Hssi1/0
 rate-limit input access-group 100 256000 8000 8000 conform-action transmit exceed-action drop
!
access-list 100 permit icmp any any
The Tier-1 ISP wants to rate-limit all traffic received from the ISP X and ISP Y routers to a mean rate of 30 Mbps and 40 Mbps, respectively. Any traffic exceeding the mean rate is dropped. The Tier-1 ISP drops any packets arriving from non-peers on the FDDI (other than ISPs X and Y). Listing 3-14 shows how you can enable the Tier-1 ISP router for this functionality.

Listing 3-14 Enforcing Public Exchange Point Traffic Example

interface Fddi1/0/0
 ip address 162.111.10.1 255.255.255.192
 rate-limit input access-group rate-limit 110 30000000 15000 15000 conform-action transmit exceed-action drop
 rate-limit input access-group rate-limit 120 40000000 40000 40000 conform-action transmit exceed-action drop
 rate-limit input access-group 100 4000000 40000 40000 conform-action drop exceed-action drop
!
access-list rate-limit 110 0000.0c10.7819
access-list rate-limit 120 0000.0c89.6900
access-list 100 permit ip any any

0000.0c10.7819 and 0000.0c89.6900 are the MAC addresses on the FDDI interface of ISP X's and ISP Y's routers, respectively.
Rate-limit access lists ranging from 100 to 199 are used to match traffic by MAC addresses. A rate-limit access list of 110 limits the traffic coming from ISP X to 30 Mbps. A rate-limit access list of 120 is used to rate-limit the traffic from ISP Y to 40 Mbps. The last rate-limit statement is a "catch-all" statement that drops any traffic arriving from non-peer routers connected on the FDDI.
Traffic Shaping
TS is a mechanism to smooth the traffic flow on an interface to avoid congestion on the link and to meet a service provider's requirements. TS smoothes bursty traffic to meet the configured CIR by queuing or buffering packets exceeding the mean rate. The queued packets are transmitted as tokens become available. The queued packets' transmission is scheduled in either the first-in, first-out (FIFO) or Weighted Fair Queuing (WFQ) order. TS operation is illustrated in Figure 3-9. Figure 3-9 Traffic Shaping Operation
TS also can be configured in an adaptive mode on a Frame Relay interface. In this mode, TS estimates the available bandwidth from the Backward Explicit Congestion Notification (BECN)/Forward Explicit Congestion Notification (FECN) field and Discard Eligible (DE) bit information (discussed in Chapter 8, "Layer 2 QoS: Interworking with IP QoS"). This section covers traffic shaping on all interfaces/subinterfaces, regardless of the interface encapsulation. Traffic shaping on an individual Frame Relay permanent virtual circuit (PVC)/switched virtual circuit (SVC) is covered in Chapter 8.
Figure 3-10 The Token Bucket Scheme for the Traffic Shaping Function
You can accomplish traffic shaping on any generic interface using one of two implementations: Generic Traffic Shaping (GTS) and Distributed Traffic Shaping (DTS). Table 3-4 compares the two TS implementations.

Note TS works only for outbound traffic; hence, TS cannot be applied to the inbound traffic on an interface.
Table 3-4. Comparison of TS Implementations: GTS and DTS

Order of transmission of buffered packets:
- GTS: Uses WFQ as the scheduling algorithm.
- DTS: Can use either FIFO or Distributed WFQ (DWFQ) as the scheduling algorithm.

Traffic matching specification:
- GTS: Has two modes: either all traffic, or traffic matched by a simple or extended IP access list.
- DTS: Traffic classes as defined by a user by means of one of the classification features (CAR or QPPB).

Per Frame Relay Data-Link Connection Identifier (DLCI) traffic shaping:
- GTS: Doesn't support per-PVC/SVC traffic shaping on a Frame Relay interface.
- DTS: Supports per-PVC/SVC traffic shaping on a Frame Relay interface.

Availability:
- GTS: All single-processor (nondistributed) router platforms.
- DTS: VIP-based 7500 series routers.

Protocol support:
- GTS: All protocols.
- DTS: IP only.
The serial interface of the enterprise's router that connects to the service provider has an access rate of 256 Kbps. Traffic shaping by both the GTS and DTS features is discussed in the following sections.

Shaping Traffic Using GTS

Listings 3-15 through 3-18 show a sample configuration to shape traffic using GTS to the access rate of 256 Kbps, and some of the relevant show commands to monitor the shaping operation.

Listing 3-15 Shaping Traffic to 256 Kbps on Average Using GTS

interface Serial0
 traffic-shape rate 256000

Note that in the configuration, only the desired CIR is provided. The appropriate sustained and excess burst values, which are shown in Listing 3-16, are picked automatically.

Listing 3-16 GTS Parameters on Interface Serial0

#show traffic-shape serial0
     Access  Target  Byte   Sustain   Excess    Interval
I/F  List    Rate    Limit  bits/int  bits/int  (ms)
Se0          256000  1984   7936      7936      31
The show traffic-shape command shows the traffic shaping configuration and the token bucket parameters.

Listing 3-17 Information on the GTS Shaping Queue

#show traffic-shape queue serial0
Traffic queued in shaping queue on Serial0
  Queueing strategy: weighted fair
  Queueing Stats: 0/1000/64/0 (size/max total/threshold/drops)
     Conversations 4/8/256 (active/max active/max total)
     Reserved Conversations 0/0 (allocated/max allocated)
  (depth/weight/discards/tail drops/interleaves) 9/585/0/0/0
  Conversation 118, linktype: ip, length: 70
  source: 199.199.199.199, destination: 2.2.2.1, id: 0x212D, ttl: 255, TOS: 192
  prot: 6, source port 60568, destination port 711
  (depth/weight/discards/tail drops/interleaves) 6/1365/0/0/0
  Conversation 84, linktype: ip, length: 60
  source: 254.6.140.76, destination: 172.16.1.3, id: 0x45C0, ttl: 213, prot: 158
  (depth/weight/discards/tail drops/interleaves) 8/4096/0/0/0
  Conversation 124, linktype: ip, length: 114
  source: 172.16.69.115, destination: 2.2.2.1, id: 0x0079, ttl: 254, prot: 1
  (depth/weight/discards/tail drops/interleaves) 5/4096/26/0/0
  Conversation 257, linktype: cdp, length: 337

The show traffic-shape queue command shows the WFQ information as well as the packets queued by WFQ on the interface.

Listing 3-18 GTS Statistics on Interface Serial0

#show traffic-shape statistics serial0
     Access  Queue            Packets
I/F  List    Depth   Packets  Bytes    Delayed
Se0          4       2000     185152   20

The show traffic-shape statistics command shows the GTS packet statistics. Queue Depth shows the number of packets in the WFQ queue at the moment the command is issued. Packets and Bytes show the total traffic switched through the interface. Packets Delayed and Bytes Delayed show the part of the total transmitted traffic that was delayed in the WFQ queue. Shaping Active shows whether traffic shaping is currently active; shaping is active when packets must be held in the WFQ queue before transmission.

Shaping Traffic Using DTS

A class map classifies all IP traffic into the class myclass. A policy map mypolicy applies a policy to shape the traffic of class myclass to an average rate of 256 Kbps. The mypolicy policy is then applied on the interface to shape the outgoing traffic. Note that the serial interface is Serial0/0/0 in Figure 3-11 for the purpose of the DTS discussion. DTS applies only to VIP-based 7500 routers. Listings 3-19 and 3-20 are sample configurations and relevant show command output for shaping traffic to 256 Kbps using DTS.

Listing 3-19 Shaping Traffic to 256 Kbps Using DTS

class-map myclass
 match any
!
policy-map mypolicy
 class myclass
  shape average 256000 16384 0
!
interface Serial0/0/0
 service-policy output mypolicy

The shape average command is used to send traffic at the configured CIR of 256000 bps without sending any excess burst bits per interval. The conformed burst (BC) is 16384 bits; hence, the interval is 64 ms (16384 / 256000 seconds). The command allows only BC bits of data to be sent per interval. The excess burst is not allowed at all, even if it is configured to a value other than 0. The shape peak command, in contrast, is used to send a burst of up to BC + BE bits per interval while keeping the configured CIR.
Listing 3-20 DTS Parameters and Queue Statistics

#show interface shape
Serial0/0/0 nobuffer drop 0
Serial0/0/0 (class 2): cir 256000, Bc 16384, Be 0
 packets output 0, bytes output 0
 queue limit 0, queue size 0, drops 0
 last clear = 00:01:39 ago, shape rate = 0 bps

The show interface shape command shows the packet and queue statistics along with the configured DTS parameters.
Case Study 3-11: Shaping Incoming and Outgoing Traffic for a Host to a Certain Mean Rate
At a large company's remote sales office, Host A, with an IP address of 200.200.200.1, is connected to the network on an Ethernet interface. It connects to the corporate office through a router using a T1. The network administrator wants to shape Host A's incoming and outgoing traffic to 128 Kbps, as in Figure 3-12.

Figure 3-12 Traffic Shaping Incoming and Outgoing Traffic of Host A
Host A is connected to the remote sales office router on interface ethernet0. The router's interface serial0 connects to the corporate network through a T1. The following configurations show how to shape traffic to and from Host A using GTS and DTS. Listings 3-21 and 3-22 show samples of the configuration required for GTS and DTS, respectively, to achieve this functionality.

Shaping Incoming and Outgoing Traffic of a Host Through GTS

Listing 3-21 shows a sample configuration for GTS.
Listing 3-21 Shaping Traffic to 128 Kbps Using GTS

interface serial 0
 traffic-shape group 101 128000
interface ethernet 0
 traffic-shape group 102 128000
!
access-list 101 permit ip host 200.200.200.1 any
access-list 102 permit ip any host 200.200.200.1

Access lists 101 and 102 match host 200.200.200.1's outgoing and incoming traffic, respectively. The outgoing traffic is shaped on interface serial0 to 128 Kbps. The incoming traffic is shaped similarly on the ethernet0 interface.

Shaping Incoming and Outgoing Traffic of a Host Through DTS

Listing 3-22 shows a sample configuration for DTS. Note that the serial interface and the Ethernet interface in Figure 3-12 are Serial0/0/0 and Ethernet1/0/0 for the purpose of the DTS discussion. DTS applies only to VIP-based 7500 routers.

Listing 3-22 Shaping Traffic to 128 Kbps Using DTS

class-map FromHostA
 match ip access-list 101
class-map ToHostA
 match ip access-list 102
!
policy-map frompolicy
 class FromHostA
  shape peak 128000 8192 1280
policy-map topolicy
 class ToHostA
  shape peak 128000 8192 1280
!
interface serial0/0/0
 service-policy output frompolicy
interface ethernet1/0/0
 service-policy output topolicy

The class maps FromHostA and ToHostA match the traffic from and to Host A, respectively. On both traffic classes, a policy to shape the traffic to 128 Kbps is applied. The service-policy commands activate the policies on the interfaces' output traffic. The shape peak 128000 8192 1280 command shapes traffic to an average rate of 128 Kbps, with bursts of up to 9472 bits (8192 + 1280) per interval.
Listings 3-23 and 3-24 show samples of the configuration required for GTS and DTS, respectively, to achieve this functionality.

Shaping Frame Relay Traffic Through GTS

Listing 3-23 shows a sample configuration for GTS.

Listing 3-23 Shaping Traffic to 256 Kbps Using GTS

interface Serial0/0/0.1 point-to-point
 traffic-shape rate 256000
 traffic-shape adaptive 64000

The traffic-shape adaptive command makes the router adapt its traffic-shaping rate to the incoming BECNs. The outgoing traffic on the interface is shaped at 256 Kbps on average (as configured by the traffic-shape rate command) when it receives no BECNs. The CIR is throttled to 64 Kbps, however, if it receives a series of BECNs, as in Figure 3-13.

Figure 3-13 Adaptive Traffic Shaping on a Frame Relay Interface
Shaping Frame Relay Traffic Through DTS

Listing 3-24 shows a sample configuration for DTS.

Listing 3-24 Shaping Traffic to 256 Kbps Using DTS

class-map myclass
 match any
!
policy-map mypolicy
 class myclass
  shape peak 256000 16384 1280
  shape adaptive 64000
!
interface serial0/0/0.1 point-to-point
 service-policy output mypolicy

The shape adaptive command makes the router adapt its traffic-shaping rate to the incoming BECNs. The outgoing traffic on the interface is shaped at 256 Kbps (as configured by the shape peak command) on average when it receives no BECNs. The CIR is throttled to 64 Kbps, however, if it receives a series of BECNs.
Summary
Network boundary traffic conditioners provide packet classifier, marker, and traffic rate management functions. Packets are classified at the network boundary so that they receive differentiated service within the network. Packet classification is necessary to identify different classes of traffic based on their service level. You can perform packet classification based on one or more fields in the packet's IP header. After a packet is identified to be of a certain class, a marker function is used to color the packet by setting the IP Precedence, the DSCP, or the QoS group value. Traffic rate management on network boundary routers is essential to ensure resource availability and QoS within the network core. CAR is used to rate-limit any traffic trying to go over the configured rate. CAR can send some burst of traffic at up to the line rate and then start dropping packets after a given rate is reached. TS smoothes the traffic by queuing packets and sending them at a configured rate. TS is more TCP-friendly than CAR, as a packet drop can cause a TCP stream to lower the window size to 1. This can lower the rate taken by the TCP stream below the allowed rate. By choosing the right burst parameters, which can vary from one TCP implementation to another, a TCP stream can take up the entire configured rate.
Q: On a router, how can I mark Telnet packets destined to or originating from it with a certain IP Precedence value?

A: You can mark Telnet packets from or to the router with a certain IP Precedence value by using the ip telnet tos command. As an example, you can set Telnet packets with an IP Precedence of 6 by using the ip telnet tos C0 command.

Q: I am rate-limiting for 1 Mbps of traffic. Why do I see exceed traffic when the conformed traffic is as low as 250 Kbps?

A: The exceed traffic might have occurred over a short period of time. On average, the traffic seen by CAR is only 250 Kbps.
Q: How do I choose the right BC parameter for CAR?

A: No single right answer exists. It depends on the type of flows and how accommodating you want to be to traffic burstiness. For UDP and ICMP traffic flows, the burst parameters don't matter much, as they only lead to more short- or long-term bursty behavior. This isn't a big concern because, over time, the rate is limited to the configured CIR. For protocols such as TCP, however, which use adaptive window-based rate control, a drop leads to a retransmission timer timeout at the sender and causes its window size to be reduced to 1. Though the right burst parameters for a TCP flow vary based on the TCP implementations of the flow's sender and receiver, in general the best choices for CAR burst parameters are:

conformed burst = CIR (bps) / 8 bits per byte x 1.5 seconds, where CIR is the configured rate
extended burst = 2 x conformed burst
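The rule of thumb above works out as follows in Python (a hypothetical helper; the returned burst sizes are in bytes):

```python
def recommended_car_bursts(cir_bps):
    """Rule-of-thumb CAR burst sizes for TCP traffic, per the answer above:
    Bc = 1.5 seconds of traffic at the CIR (in bytes), Be = 2 * Bc."""
    bc = int(cir_bps / 8 * 1.5)     # bits/sec -> bytes/sec, times 1.5 s
    return bc, 2 * bc

# For a 10-Mbps CIR: Bc = 1,875,000 bytes and Be = 3,750,000 bytes.
```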
Q: How big are the CAR traffic counters?

A: CAR uses 64-bit counters.

Q: How does CAR penalize flows that continually exceed the BC value?

A: When a packet is dropped, the DC is set to zero, but the DA is left untouched. The next time a packet needs to borrow tokens, the compounded debt becomes equal to the DA. If the flow continually borrows tokens, the DA can quickly grow to a value above the extended limit, such that even compounding is not necessary to cause packets to be dropped. Packets continue to be dropped until the DA is reduced through token accumulation.

Q: How can a network administrator enable traffic rate limiting to be active during a certain time period of the day only?

A: A network administrator can make this possible by using time-based IP access lists within CAR or modular QoS. Please refer to Cisco documentation for more information on time-based IP access lists.

Q: What amount of processor resources is taken by CAR on a router?

A: The amount of processor resources taken by CAR depends on the match condition and the depth of the rate-limit statements. A match condition using a complicated extended IP access list takes more processor power relative to one using a basic IP access list. In general, the precedence and MAC address rate-limit access lists need fewer processor resources than the IP access lists. The rate limits are evaluated sequentially, so the amount of processor resources used increases linearly with the number of rate limits checked.
References
1. Almquist, P. "Type of Service in the Internet Protocol Suite." RFC 1349.
2. Postel, J. "Internet Protocol." RFC 791.
3. Nichols, K., et al. "Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers." RFC 2474.
Note It is important to keep in mind the conservation law from queuing theory, which states that any scheduling discipline can only reduce a flow's mean delay at another flow's expense. Occasionally, latency is traded for bandwidth. Some flows are delayed to offer a particular bandwidth to other flows. When someone gets preferential treatment, someone else will suffer.
FIFO Queuing
FIFO queuing is a queuing mechanism in which the order of the packets coming into a queue is the same as the order in which they are serviced or transmitted out of the queue. Figure 4-1 illustrates a FIFO queue. Figure 4-1 FIFO Queue
FIFO, the most common scheduling mechanism in routers today, is easy to implement. It has no mechanism to differentiate between flows, however, and hence cannot prioritize among them. In addition to being unable to prioritize one flow over another, FIFO queuing offers neither protection nor fairness to equal-priority traffic flows, because a large, badly behaving flow can take the resource share of well-behaving flows that use end-to-end, adaptive flow-control schemes, such as Transmission Control Protocol (TCP) dynamic window control. With FIFO, flows receive service approximately in proportion to the rate at which they send data into the network. Such a scheme is obviously not fair, because it rewards greedy flows over well-behaved ones. Any fairness algorithm, by its nature, offers protection against greedy flows.
Consider an example in which a resource has a capacity of 14, servicing five users, A, B, C, D, and E, with demands 2, 2, 3, 5, and 6, respectively. Initially, the source with the smallest demand is given a resource equal to the resource capacity divided by the total number of users. In this case, Users A and B are given a resource of 14 / 5 = 2.8. But Users A and B actually need only 2 each. So the unused excess, 1.6 (0.8 each from Users A and B), is distributed evenly among the other three users. So Users C, D, and E each get a resource of 2.8 + (1.6 / 3) = 3.33. Now, the user with the next-smallest demand is serviced. In this case, it is User C. The resource allocated to User C is 0.33 units in excess of its demand of 3. This unused excess is distributed evenly between Users D and E, so that each now has a resource of 3.33 + (0.33 / 2) = 3.5.
We can calculate the fair allocation at each step as follows:

Fair allocation = (resource capacity - resource capacity already allocated to users) / number of users who still need resource allocation

This example is illustrated in Figures 4-2, 4-3, and 4-4. In Step 1, shown in Figure 4-2, the demands of Users A and B are fully allocated because their resource requests fall within the fair allocation. In this step, the demands of C, D, and E exceed the fair allocation of 2.8 and, hence, cannot be allocated. In the next step, the unused excess bandwidth of A and B's fair allocation is equally distributed among the three remaining users, C, D, and E.

Figure 4-2 Resource Allocation for Users A and B
In Step 2, shown in Figure 4-3, the demand of User C is fully allocated because its resource request falls within the fair allocation. In this step, the demands of D and E exceed fair allocation of 3.33 and, hence, cannot be allocated. In the next step, the unused excess bandwidth of C's fair allocation is equally distributed between the two remaining users, D and E. In Step 3, shown in Figure 4-4, the fair allocation of 3.5 falls below the requests of both Users D and E, which are each allocated 3.5 and have unsatisfied demands of 1.5 and 2.5, respectively. This scheme allocates resources according to the max-min fair-share scheme. Note that all users with unsatisfied demands (beyond what is their max-min fair share) get equal allocation. So, you can see that this scheme is referred to as max-min fair-share allocation because it maximizes the minimum share of a user whose demand is not fully satisfied. Consider an extension to the max-min fair-share allocation scheme in which each user is assigned a weight. Such a scheme is referred to as weighted max-min fair-share allocation, in which a user's fair share is proportional to its assigned weight.
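The (unweighted) allocation procedure walked through above can be sketched in Python. The function name and return convention are invented here for illustration; the algorithm repeatedly grants the smallest unsatisfied demand in full when it fits under the current equal share, redistributing the excess among the rest.

```python
def max_min_fair_share(capacity, demands):
    """Max-min fair-share allocation: grant the smallest demand fully when it
    fits under the current equal share; otherwise split the remainder evenly
    among all users whose demands cannot be fully satisfied."""
    allocation = [0.0] * len(demands)
    remaining = capacity
    unsatisfied = sorted(range(len(demands)), key=lambda i: demands[i])
    while unsatisfied:
        share = remaining / len(unsatisfied)    # current fair allocation
        i = unsatisfied[0]                      # smallest remaining demand
        if demands[i] <= share:
            allocation[i] = demands[i]          # demand fits: grant it fully
            remaining -= demands[i]             # excess goes back into the pool
            unsatisfied.pop(0)
        else:
            for j in unsatisfied:               # no demand fits: split equally
                allocation[j] = share
            break
    return allocation

# The chapter's example: capacity 14, demands 2, 2, 3, 5, 6 yields
# allocations 2, 2, 3, 3.5, 3.5, matching Steps 1 through 3 above.
```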
FQ simulates GPS by computing a sequence number for each arriving packet. The assigned sequence numbers are essentially service tags, which define the relative order in which the packets are to be serviced. The service order of packets using sequence number computation emulates the service order of a GPS scheduler. To intuitively understand how GPS simulation is done, consider a variable called round number, which denotes the number of rounds of service a byte-by-byte round-robin scheduler has completed at a given time. The computation of a sequence number depends on the round number. To illustrate how GPS is simulated by FQ, consider three flows, A, B, and C, with packet sizes 128, 64, and 32 bytes, respectively. Packets arrive back-to-back on a busy FQ server in the order A1, A2, A3, B1, C1, with A1 arriving first, followed by A2, and so on. A flow is said to be active if any outstanding packets of that flow are awaiting service, and inactive otherwise. For this example, assume Packet A1 arrived on an inactive flow in the FQ system. Assuming service by a byte-by-byte round-robin scheduler, an entire 128-byte packet is sent when the scheduler completes 128 rounds of service since the packet arrived. If the round number at the time Packet A1 arrived is 100, the entire packet is transmitted when the round number becomes 100 + 128 = 228. Hence, the sequence number of a packet for an inactive flow is calculated by adding the round number and the packet size in bytes. Essentially, it is the round in which the last byte of the packet is transmitted. Because, in reality, a scheduler transmits a packet and not 1 byte at a time, it services the entire packet whenever the round number becomes equal to the sequence number. When Packet A2 arrives, the flow is already active with A1 in the queue, waiting for service, with a sequence number of 228. The sequence number of Packet A2 is 228 + 128 = 356, because it needs to be transmitted after A1. 
Hence, the sequence number of a packet arriving on an active flow is the highest sequence number of a packet in the flow queue, plus its own packet size in bytes. Similarly, Packet A3 gets a sequence number of 356 + 128 = 484. Because Packets B1 and C1 arrive on inactive flows, their sequence numbers are 164 (that is, 100 + 64) and 132 (that is, 100 + 32), respectively. Sequence number (SN) assignment for a packet is summarized as follows, based on whether it arrives on an active or an inactive flow:

- Packet arrives on an inactive flow: SN = size of the packet in bytes + the round number at the time the packet arrived. (The round number is the sequence number of the last packet serviced.)

- Packet arrives on an active flow: SN = size of the packet in bytes + the highest sequence number of a packet already in the flow queue.

Packets in their flow queues, along with their computed sequence numbers, are shown in Figure 4-5 to illustrate how the FQ scheduler emulates GPS. A GPS scheduler will have completed scheduling the entire Packet A1 in the 228th round. The sequence number denotes the relative order in which the packets are served by the scheduler. The FQ scheduler serves the packets in the following order: C1, B1, A1, A2, and A3.
Figure 4-5 An Example Illustrating the Byte-by-Byte Round-Robin GPS Scheduler Simulation for FQ
Round numbers are used only for calculating sequence numbers if the arriving packet belongs to a new flow. Otherwise, the sequence number is based on the highest sequence number of a packet in that flow awaiting service. If Packet A4 arrives at any time before A3 is serviced, it has a sequence number of 484 + 128 = 612. Note that the round number is updated every time a packet is scheduled for transmission to equal the sequence number of the packet being transmitted. So if Packet D1 of size 32 bytes, belonging to a new flow, arrives when A1 is being transmitted, the round number is 228 and the sequence number of D1 is 260 (228 + 32). Because D1 has a lower sequence number than A2 and A3, it is scheduled for transmission before A2 and A3. Figure 4-6 depicts this change in scheduling order.
Figure 4-6 Illustration of FQ Scheduler Behavior; Packet D1 Arriving After Packet A1 Is Scheduled
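The sequence-number bookkeeping from the preceding example can be sketched in Python (an illustrative model, not Cisco's implementation; this sketch tags all packets at arrival, before any are served, so the round number stays at its initial value of 100):

```python
def fq_arrivals(arrivals, round_number=100):
    """Tag each (flow, name, size) packet with an FQ sequence number.
    Inactive flow: SN = round number + packet size.
    Active flow:   SN = highest SN queued for that flow + packet size."""
    last_sn = {}   # highest sequence number outstanding per flow
    tags = {}
    for flow, name, size in arrivals:
        sn = last_sn.get(flow, round_number) + size
        last_sn[flow] = sn
        tags[name] = sn
    return tags

# Packets arrive back-to-back in the order A1, A2, A3, B1, C1:
pkts = [('A', 'A1', 128), ('A', 'A2', 128), ('A', 'A3', 128),
        ('B', 'B1', 64), ('C', 'C1', 32)]
tags = fq_arrivals(pkts)
print(tags)                        # A1=228, A2=356, A3=484, B1=164, C1=132
print(sorted(tags, key=tags.get))  # service order: C1, B1, A1, A2, A3
```

Sorting the packets by sequence number yields exactly the service order given in the text: C1, B1, A1, A2, A3.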
Most often, some flows are considered more important or mission-critical than others. Such flows need to be preferred over the others by the scheduler. You can expand the FQ concept to assign weights per flow so that each flow is serviced in proportion to its weight. Such a fair queuing system is called flow-based WFQ and is discussed in the next section.
Flow-Based WFQ
In WFQ, weights are assigned to flows based on the precedence value in the Internet Protocol (IP) header of their packets. They are calculated as follows:

Weight = 4096 / (IP precedence + 1)

Note With the recent changes in the WFQ implementation, the preceding WFQ flow weight calculation formula applies only when running IOS Versions 12.0(4)T or lower. The changes were made to enable class guarantees for Class-Based Weighted Fair Queuing (CBWFQ). CBWFQ is discussed later in this chapter. For IOS Versions 12.0(5)T and higher, the WFQ flow weights discussed in this chapter are multiplied by 8. Hence, the WFQ weight for best-effort (IP precedence 0) traffic becomes 4096 × 8 = 32768 and the equation for weight calculation becomes
Weight = 32768 / (IP precedence + 1)

Table 4-1 tabulates a packet's weight based on its IP precedence and Type of Service (ToS) byte value.

Table 4-1. Weights Assigned Based on the IP Precedence Value of a Packet Belonging to an Unreserved (Non-RSVP) Flow

IP Precedence   ToS Byte Value   Weight, IOS Versions    Weight, IOS Versions
                                 Prior to 12.0(5)T       12.0(5)T and Higher
0               0 (0x00)         4096                    32768
1               32 (0x20)        2048                    16384
2               64 (0x40)        1365                    10920
3               96 (0x60)        1024                    8192
4               128 (0x80)       819                     6552
5               160 (0xA0)       682                     5456
6               192 (0xC0)       585                     4680
7               224 (0xE0)       512                     4096
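As a quick check of Table 4-1, the two weight formulas can be evaluated in Python (an illustrative sketch). Note that the 12.0(5)T-and-higher column matches the pre-12.0(5)T weight multiplied by 8 rather than 32768 / (precedence + 1), which would give 10922 instead of 10920 for precedence 2; also, the worked example later in this chapter rounds 4096 / 6 up to 683, whereas the table truncates it to 682:

```python
def wfq_weight(precedence, post_1205t=False):
    """WFQ weight from IP precedence, per the formulas in the text.
    Pre-12.0(5)T: 4096 / (precedence + 1), truncated to an integer.
    12.0(5)T and later: the pre-12.0(5)T weight multiplied by 8."""
    w = 4096 // (precedence + 1)
    return w * 8 if post_1205t else w

print([wfq_weight(p) for p in range(8)])
# [4096, 2048, 1365, 1024, 819, 682, 585, 512]
print([wfq_weight(p, post_1205t=True) for p in range(8)])
# [32768, 16384, 10920, 8192, 6552, 5456, 4680, 4096]
```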
The weight of an RSVP flow with the largest bandwidth reservation is 4 prior to IOS Version 12.0(5)T and is 6 for 12.0(5)T and higher. The weight of every other RSVP flow reservation is derived from the largest bandwidth reservation, as shown here:

Weight for an RSVP flow or conversation = weight of the highest bandwidth reservation on the link × (greatest bandwidth reservation on the link / the conversation's reserved bandwidth)

For the purpose of the discussion in the remainder of this section, weights prior to 12.0(5)T are used. It is important to note that the exact weight calculation scheme used doesn't matter for illustrating how WFQ works.
WFQ uses two packet parameters to determine a packet's sequence number. Like FQ, WFQ uses the packet's byte size. In addition, however, WFQ uses the weight assigned to the packet: the packet's byte size is multiplied by its weight when calculating the sequence number. This is the only difference between WFQ and FQ. Note that the direct correlation between a byte-by-byte round-robin scheduler and FQ is lost with WFQ, because the packet's byte count is multiplied by its weight before its sequence number is calculated. Consider a sequence number in WFQ as a number calculated to determine the relative order of a packet in a WFQ scheduler, and consider the round number as the sequence number of the last packet served in the WFQ scheduler. Using the same example discussed in the FQ section, assume that packets of Flow A have precedence 5, whereas Flows B and C have precedence 0. This results in a weight of 683 for packets in Flow A and 4096 for packets in Flows B and C. Table 4-2 shows all the flow parameters in this example. The sequence number of Packet A1 is calculated as 100 + (683 × 128) = 87524. Similarly, you can calculate sequence numbers for Packets A2, A3, B1, and C1 as 174948, 262372, 262244, and 131172, respectively. So the order in which the scheduler services them is A1, C1, A2, B1, and A3, as illustrated in Figure 4-7.
Table 4-2. Flow-Based WFQ Example

Flow   Precedence   Weight = 4096 / (Precedence + 1)
A      5            683
B      0            4096
C      0            4096
Note that with WFQ, you can prioritize Flow A while still serving Flows B and C fairly. A WFQ scheduler simulates a max-min weighted GPS. If Packets A4 and D1 (a new flow with precedence 0 and size 32 bytes) arrive after A1 is scheduled, A4 and D1 get sequence numbers of 349796 (262372 + (683 × 128)) and 218596 (87524 + (4096 × 32)), respectively. The discussion of calculating the sequence numbers for A4 and D1 with FQ still applies here. Now, the scheduling order of the remaining packets is changed to C1, A2, D1, B1, A3, and A4. This is shown in Figure 4-8.

Figure 4-8 Illustration of the Flow-Based WFQ Example (continued)
In Figure 4-8, Packet D1 arrives shortly after Packet A1 has been scheduled for transmission. Packet D1 is transmitted before Packets A3 and A4, which arrived in the queue earlier.

Note Sequence number calculation for a packet arriving on an interface occurs only when there is congestion on the outbound interface (no buffer is available in the interface hardware queue). When there is no congestion on the outbound interface, the scheduling behavior is FIFO: an arriving packet is simply queued to the outbound interface hardware queue for transmission.
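The weighted sequence-number arithmetic in this example can be verified with a short Python sketch (illustrative only; the variable names are not from the text):

```python
def wfq_sn(base, weight, size):
    # WFQ sequence number: FQ's computation, with the byte size scaled
    # by the packet's weight.
    return base + weight * size

ROUND = 100                    # round number when A1, B1, and C1 arrive
W_A, W_OTHER = 683, 4096       # precedence-5 and precedence-0 weights

a1 = wfq_sn(ROUND, W_A, 128)        # 87524
a2 = wfq_sn(a1, W_A, 128)           # 174948
a3 = wfq_sn(a2, W_A, 128)           # 262372
b1 = wfq_sn(ROUND, W_OTHER, 64)     # 262244
c1 = wfq_sn(ROUND, W_OTHER, 32)     # 131172
# A4 (active flow) and D1 (new flow) arrive once A1 is being transmitted,
# so the round number used for D1 is A1's sequence number:
a4 = wfq_sn(a3, W_A, 128)           # 349796
d1 = wfq_sn(a1, W_OTHER, 32)        # 218596

sns = {'A2': a2, 'A3': a3, 'A4': a4, 'B1': b1, 'C1': c1, 'D1': d1}
print(sorted(sns, key=sns.get))     # ['C1', 'A2', 'D1', 'B1', 'A3', 'A4']
```

Sorting by sequence number reproduces the scheduling order shown in Figure 4-8.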
The length of the interface hardware transmit queue determines the maximum queuing delay for real-time traffic in a WFQ scheduler. Real-time traffic such as voice has to wait until the packets already queued in the hardware queue are sent before it can be transmitted. Excessive queuing delays can result in jitter, a problem for real-time traffic such as voice. Typical hardware interface buffers can hold from one to five packets. In the IOS implementation, most interfaces automatically reduce their hardware transmit queues to 2 when WFQ is enabled. A network operator should be able to modify the length of the interface hardware transmit queue based on the delay requirements for the traffic in the network. This is especially true for voice and other real-time traffic. You can modify the interface transmit queue size by using the tx-queue-limit command.
WFQ Implementation
In the flow-based WFQ implementation, weights are based strictly on precedence and cannot be changed. Though FQ in itself is not available, WFQ becomes FQ for all practical purposes when all traffic arriving at the scheduler carries the same precedence value. With flow-based WFQ, packets with different IP precedence values in a single flow are not scheduled out of order. In this regard, a flow is implemented as a hash defined by the source and destination IP addresses, the IP protocol field, the Transmission Control Protocol/User Datagram Protocol (TCP/UDP) port numbers, and the 5 bits (excluding the 3 IP precedence bits) in the ToS byte. Due to this flow description, packets of the same flow, but with different precedence values, fall into the same queue. Packets within a flow queue are serviced in FIFO order. In general, WFQ limits its drops to the most active flows, whereas FIFO might drop from any flow. Therefore, WFQ should encourage the most active flows to scale back without affecting the smaller flows. Because the median flow duration in the Internet is 10 to 20 packets in length, a fairly small percentage of the flows should be taking the lion's share of the drops with WFQ, while FIFO drops should be distributed across all flows. Hence, the effects of global synchronization with FIFO are less pronounced with WFQ for traffic with adaptive flow control, such as TCP traffic. Global synchronization and proactive drop policies are detailed in Chapter 6. In general, flow-based WFQ uses a subqueue for each flow. As such, flow-based WFQ queues are referred to as conversation queues. Because memory is a finite resource, the default number of conversation queues allocated is restricted to 256. This parameter is configurable when enabling the fair-queue interface command, however. Note that increasing the number of queues increases the memory taken by the queue data structures and the amount of state information maintained by the router.
If the number of flows is greater than the number of queues, multiple flows can share a single queue. Configuring a large number of queues increases the chances of having only one flow per queue. Flow-based WFQ can also work in conjunction with Weighted Random Early Detection (WRED), a proactive packet drop policy to avoid congestion. WRED is discussed in Chapter 6. The flow-based WFQ implementation uses list sorting. The complexity is O(n), where n is the number of packets waiting for service from the WFQ scheduler. List sorting can become prohibitively expensive on high-bandwidth links, where the number of flows and the number of packets to be serviced per second are high.

Note Flow-based WFQ is available on most Cisco router platforms. One notable exception is the Cisco 12000 series routers. It is the default queuing mechanism on all interfaces with a bandwidth less than 2 Mbps.
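The idea of hashing a flow's identifiers into a fixed pool of conversation queues can be illustrated with a small Python sketch (this is not the IOS hash function; the function and its arguments are hypothetical):

```python
# The flow label is built from the fields named in the text; the hash is
# then reduced modulo the number of conversation queues, so distinct
# flows can collide onto one queue when flows outnumber queues.
NUM_QUEUES = 256   # the default number of conversation queues

def conversation_queue(src_ip, dst_ip, protocol, src_port, dst_port,
                       tos_low_bits, num_queues=NUM_QUEUES):
    flow = (src_ip, dst_ip, protocol, src_port, dst_port, tos_low_bits)
    return hash(flow) % num_queues

# The TCP conversation shown later in Listing 4-1:
q1 = conversation_queue('172.26.237.58', '172.26.237.2', 6, 1563, 4665, 0)
q2 = conversation_queue('172.26.237.58', '172.26.237.2', 6, 1563, 4665, 0)
assert q1 == q2                  # a flow always maps to the same queue
assert 0 <= q1 < NUM_QUEUES
```

Raising `num_queues` reduces the probability that two distinct flows land in the same conversation queue, at the cost of more queue state, mirroring the trade-off described above.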
After IP precedence is set at the ingress, you can deploy flow-based WFQ on the routers to achieve the preceding objectives. Packet classification based on IP precedence is discussed in case studies in Chapter 3. You can set up flow-based WFQ on an interface by using the fair-queue interface command. The show queue command, shown in Listing 4-1, describes the active flows along with the queue depth, weight, and other queue statistics, as well as the queue parameters in the WFQ system.

Listing 4-1 show queue serial0 Command Output

Router#show queue serial0
  Input queue: 0/75/0 (size/max/drops); Total output drops: 0
  Queueing strategy: weighted fair
  Output queue: 9/1000/120/0 (size/max total/threshold/drops)
     Conversations 1/4/256 (active/max active/threshold)
     Reserved Conversations 0/1 (allocated/max allocated)

  (depth/weight/discards/tail drops/interleaves) 2/4096/0/0/0
  Conversation 1044, linktype: ip, length: 1504
  source: 172.26.237.58, destination: 172.26.237.2, id: 0xC4FE, ttl: 126, TOS: 0
  prot: 6, source port 1563, destination port 4665

In Listing 4-1, max total is a per-interface, global limit on the number of buffers for WFQ, whereas threshold is the maximum number of buffers allowed per conversation. Note that max total is the same as the output hold queue, and you can change it using the hold-queue <queue length> out interface command. A conversation with a nonzero queue depth is said to be active. active and max active show the present and the maximum number of active conversations. Reserved Conversations shows RSVP flow reservations. In the second part of Listing 4-1, depth shows the number of packets in the conversation queue (which is described below this line of queue statistics) awaiting service, and weight shows the weight assigned to this flow. A discard is done by the WFQ logic when the number of packets in a conversation queue exceeds the threshold. Tail drops happen when the WFQ buffer usage surpasses max total.
At that time, any packet arriving at the tail of a full queue gets dropped. Interleaves happen when link-layer fragmentation and interleaving are configured to allow small packets to be interleaved between fragments of big packets, lowering the delay jitter seen by the small-packet flows.

Note Queues are displayed only if one or more packets are in the queue at the time the show queue command is issued. Otherwise, the conversation is inactive and no output is displayed. Hence, the show queue interface command does not give any output if the interface is lightly loaded.
The show queueing fair command is used to check the WFQ parameters on an interface. Listing 4-2 displays the output of the show queueing fair command.
Listing 4-2 show queueing fair Command Output

Router#show queueing fair
Current fair queue configuration:
  Interface   Discard threshold   Dynamic queue count   Reserved queue count
  Serial0     64                  256                   0
Case Study 4-3: WFQ Scheduling Among Voice and FTP Flow Packets
A DS3 link at an Internet service provider (ISP) carries voice traffic along with File Transfer Protocol (FTP) traffic. The voice traffic is seeing varying delays because of the relatively large and variably sized FTP packets. The network engineering group believes that replacing the default FIFO queuing with classic WFQ helps, but it isn't sure how the voice traffic will fare competing with the FTP traffic. The voice traffic is made up of 64-byte packets and is represented as V1, V2, V3, and so on; the FTP traffic consists of 1472-byte packets and is represented as F1, F2, F3, and so on. The following scenarios discuss the resource allocation for voice traffic (when run in conjunction with FTP traffic) in the FQ and WFQ situations. In the first scenario, IP precedence is not set in any packet. All traffic carries the default precedence value of 0. The voice traffic of 64-byte packets gets fair treatment competing with the FTP flow. When both flows are active, the voice traffic gets the same bandwidth as the FTP traffic, effectively scheduling one FTP packet for every 23 (= 1472 / 64) voice packets, on average. In the second scenario, voice packets carry precedence 5 and FTP traffic has a precedence of 0. The voice traffic is weighted lower than the FTP traffic. The voice traffic and the FTP traffic get 6/7 and 1/7 of the link bandwidth, respectively. Hence, the voice traffic flow gets six times the bandwidth of the FTP flow. Taking packet size differences into consideration, when both flows are active, one FTP packet is scheduled for every 138 (= 23 × 6) voice packets!

Note WFQ in itself doesn't give absolute priority to a particular traffic flow. It can give only a higher weight in the share of resources, or a certain guaranteed bandwidth, as shown in this case study. Hence, it
might not be ideal for interactive, real-time applications, such as voice, in all circumstances. WFQ can work for voice when there is little background traffic in the network. On a loaded network, however, WFQ cannot achieve the low-jitter requirements of interactive voice traffic. For applications such as voice, WFQ is modified with a priority queue (PQ). WFQ with a PQ is discussed toward the end of this chapter.
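The bandwidth shares quoted in this case study follow directly from the weights, as this Python sketch shows (bandwidth share under WFQ is inversely proportional to weight):

```python
# Pre-12.0(5)T weights used in this chapter: 683 for precedence 5 (voice)
# and 4096 for precedence 0 (FTP).
voice_w, ftp_w = 683, 4096
voice_share = (1 / voice_w) / (1 / voice_w + 1 / ftp_w)
print(round(voice_share, 3))        # about 0.857, that is, roughly 6/7

# Packet-count view: one 1472-byte FTP packet is as long as 23 voice
# packets (1472 / 64), so six times the bandwidth means one FTP packet
# per 23 * 6 = 138 voice packets.
print(1472 // 64, 1472 // 64 * 6)   # 23 138
```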
Listing 4-3 Flow-Based DWFQ Information

Router#show interface fair
POS0/0/0 queue size 0
  packets output 2, wfq drops 1, nobuffer drops 0
  WFQ: aggregate queue limit 5972, individual queue limit 2986
  max available buffers 5972

The show interface fair command displays packet statistics for the number of packets transmitted as well as packet drops due to DWFQ and buffer pool depletion.

Note The number of available buffers for DWFQ, and the individual and aggregate limits, are derived based on the VIP Static Random Access Memory (SRAM) capacity, the number of interfaces on the VIP, and the speed of those interfaces.
Class-Based WFQ
The last two sections discussed flow-based WFQ mechanisms running on the IOS router platforms' central processor, and flow-based DWFQ mechanisms running on the 7500 platform's VIP line cards. This section studies the CBWFQ mechanism, which is supported in both nondistributed and distributed operation modes. CBWFQ allocates a different subqueue for each traffic class, compared with a subqueue per flow in the flow-based versions of WFQ. So, you can use the existing flow-based implementations of WFQ to deliver CBWFQ in both nondistributed and distributed modes of operation by adding a traffic classification module in which each WFQ subqueue carries a traffic class rather than a traffic flow. Hence, CBWFQ is still based on sequence number computation when run on the router's central processor, and on a calendar queue implementation when run on the 7500 platform's VIP line cards. The CBWFQ mechanism uses the modular QoS command-line interface (CLI) framework discussed in Appendix A, "Cisco Modular QoS Command-Line Interface." As such, it supports all classes supported under this framework. You can base traffic classes on a variety of traffic parameters, such as IP precedence, Differentiated Services Code Point (DSCP), input interface, and QoS groups. Appendix A lists the possible classifications. CBWFQ enables a user to directly specify the required minimum bandwidth per traffic class. This functionality is different from flow-based WFQ, where a flow's minimum bandwidth is derived indirectly, based on the weights assigned to all active flows in the WFQ system.

Note CBWFQ also can be used to run flow-based WFQ. In CBWFQ, the default-class traffic appears as normal WFQ flows, on which you can apply flow-based WFQ by using the fair-queue command.
DWFQ and CBWFQ differ in that you can run FQ within any DWFQ class, but in the case of CBWFQ, only the default class can run WFQ.
All critical database application traffic is classified under the class gold by using a class map. Then a policy named goldservice is defined on the gold traffic class. Finally, the policy goldservice is applied on the output traffic of interface serial0 to apply a specific bandwidth allocation for the outgoing critical traffic. Listing 4-4 gives the sample configuration.

Listing 4-4 Allocating Bandwidth for Critical Traffic

class-map gold
  match access-group 101
policy-map goldservice
  class gold
  bandwidth 500
interface serial0
  service-policy output goldservice
access-list 101 permit udp any any range 1500 1600

CBWFQ can directly specify a minimum bandwidth per class by using the bandwidth command. The access-list 101 matches all critical database application traffic in the network. The show policy and show class commands display all policy and class map information on the router, respectively. Listings 4-5 and 4-6 display the output for the policy goldservice and the gold traffic class, respectively.

Listing 4-5 Information on the goldservice Policy

Router#show policy goldservice
 Policy Map goldservice
  Weighted Fair Queueing
    Class gold
      Bandwidth 500 (kbps) Max Thresh 64 (packets)

Listing 4-6 Information on the gold Traffic Class

Router#show class gold
 Class Map gold
   match access-group 101

As in Case Study 4-1, you use the show queueing fair command for the WFQ information and the show queue serial0 command to see packets waiting in the queue.
Assume here that packet classification into IP precedence based on traffic rate was already done on the network boundary routers or on the interfaces connecting to the customer traffic. See Chapter 3 for details regarding packet classification. On an HSSI interface, ToS classes 0 through 3 are given 6750 Kbps, 13500 Kbps, 18000 Kbps, and 6750 Kbps, respectively, based on the link bandwidth percentage allocations, as shown in Listing 4-8.

Listing 4-8 Enabling CBWFQ for ToS Classes

class-map match-any class0
  match ip precedence 4
  match ip precedence 0
class-map match-any class1
  match ip precedence 1
  match ip precedence 5
class-map match-any class2
  match ip precedence 2
  match ip precedence 6
class-map match-any class3
  match ip precedence 3
  match ip precedence 7
policy-map tos-based
  class class0
    bandwidth 6750
  class class1
    bandwidth 13500
  class class2
    bandwidth 18000
  class class3
    bandwidth 6750
interface hssi0/0/0
  service-policy output tos-based

Note Although bandwidth specification as a percentage of the total link bandwidth is not allowed at the time of this writing, this option will be available soon.
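The per-class rates in Listing 4-8 can be cross-checked in Python, assuming a link rate of 45000 Kbps for the HSSI interface (the 45000-Kbps figure is an assumption; it is not stated in the listing):

```python
# Per-class kilobit rates as percentages of the assumed HSSI link rate.
LINK_KBPS = 45000
percents = {'class0': 15, 'class1': 30, 'class2': 40, 'class3': 15}
kbps = {c: LINK_KBPS * p // 100 for c, p in percents.items()}
print(kbps)
# {'class0': 6750, 'class1': 13500, 'class2': 18000, 'class3': 6750}
```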
ToS Class   IP Precedence Bits
0           000, 100
1           001, 101
2           010, 110
3           011, 111
ToS-based DWFQ is enabled by using the fair-queue tos command. As an example, Case Study 4-7 is redone here using ToS-based DWFQ, without the modular QoS CLI. Listing 4-9 lists the required configuration for this functionality.

Listing 4-9 Assigning Bandwidth per ToS Class on an Interface

interface Hssi0/0/0
  fair-queue tos
  fair-queue tos 1 weight 15
  fair-queue tos 2 weight 30
  fair-queue tos 3 weight 40

Note The weight parameter is used differently in the DWFQ implementation when compared to flow-based WFQ. Weights in the DWFQ implementation indicate the percentage of link bandwidth allocated.
Note that the default weights of ToS classes 1, 2, and 3 are 20, 30, and 40, respectively. In this case, the intended weight assignments for classes 2 and 3 are their default values, and hence, they need not be configured at all. Class 1 is assigned a weight of 15. Listing 4-10 displays information on ToS-based DWFQ and its packet statistics.

Listing 4-10 ToS-Based DWFQ Information

Router#show interface fair
HSSI0/0/0 queue size 0
  packets output 20, wfq drops 1, nobuffer drops 0
  WFQ: aggregate queue limit 5972, individual queue limit 2986
  max available buffers 5972
  Class 0: weight 15 limit 2986 qsize 0 packets output 0 drops 0
  Class 1: weight 15 limit 2986 qsize 0 packets output 0 drops 0
  Class 2: weight 30 limit 2986 qsize 0 packets output 0 drops 0
  Class 3: weight 40 limit 2986 qsize 0 packets output 0 drops 0

weight indicates the percentage of the link bandwidth allocated to the given class. ToS class 3 has a weight of 40, for example, which means it is allocated 40 percent of the link bandwidth during times when the queues for all four classes (0, 1, 2, and 3) are simultaneously backlogged. The weight for ToS class 0 is always based on the weights of the other classes and changes when the weight for any one of classes 1 through 3 changes. Because the total weight is 100, the bandwidth allotted to ToS class 0 is always 100 - (the sum of the weights of classes 1 through 3). Note that the sum of the class 1 through 3 weights should not exceed 99.

QoS Group-Based DWFQ

In addition to ToS-based DWFQ, you can configure QoS group-based DWFQ without the modular QoS CLI in VIP-based 7500 series routers. The QoS group is a number assigned to a packet when that packet matches certain user-specified criteria. It is important to note that a QoS group is a label internal to the router and not a field within the IP packet, unlike IP precedence. Without using the modular QoS CLI, the QoS group-based DWFQ feature is enabled by using the fair-queue qos-group command.
Case Study 4-8: Bandwidth Allocation Based on the QoS Group Classification Without Using Modular QoS CLI
The ISP wants to allocate four times the bandwidth for traffic classified with QoS group 3 when compared to traffic classified with QoS group 0 on a router's HSSI0/0/0 interface. Assume that packets were already assigned the QoS group label by a different application, and that only packets with QoS group labels 0 and 3 are allowed in the ISP router. Listing 4-11 shows how to enable QoS group-based WFQ without using the modular QoS CLI.

Listing 4-11 Allocate 80 Percent of the Bandwidth to QoS Group 3 Traffic

interface hssi0/0/0
  fair-queue qos-group
  fair-queue qos-group 3 weight 80

The bandwidth is allocated in a ratio of 4:1 between QoS groups 3 and 0. Because the weight indicates the percentage of bandwidth, the ratio of the weights needed to get the desired bandwidth allocation is 80:20 for QoS groups 3 and 0. Listing 4-12 shows the information on DWFQ parameters and operation on the router.

Listing 4-12 DWFQ Information

Router#show interface fair
HSSI0/0/0 queue size 0
  packets output 3142, wfq drops 32, nobuffer drops 0
  WFQ: aggregate queue limit 5972, individual queue limit 2986
  max available buffers 5972
  Class 0: weight 20 limit 2986 qsize 0 packets output 11 drops 1
  Class 3: weight 80 limit 2986 qsize 0 packets output 3131 drops 31
Priority Queuing
Priority queuing maintains four output subqueues (high, medium, normal, and low) in decreasing order of priority. A network administrator can classify flows to fall into any of these four queues. Packets on the highest-priority queue are transmitted first. When that queue empties, traffic on the next-highest-priority queue is transmitted, and so on. No packets in the medium-priority queue are serviced if packets in the high-priority queue are waiting for service. Priority queuing is intended for environments where mission-critical data needs to be categorized as the highest priority, even if it means starving the lower-priority traffic at times of congestion. During congestion, mission-critical data can potentially take 100 percent of the bandwidth. If the high-priority traffic equals or exceeds the line rate for a period of time, priority queuing always lets the highest-priority traffic go before the next-highest-priority traffic and, in the worst case, drops important control traffic. Priority queuing is implemented to classify packets into any of the priority queues based on the input interface, simple and extended IP access lists, packet size, and application. Note that unclassified traffic, which isn't classified to fall into any of the four priority queues, goes to the normal queue. The packets within a priority queue follow FIFO order of service.
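A minimal Python sketch of the strict-priority service order just described (illustrative only; the queue names follow the four subqueues in the text):

```python
from collections import deque

PRIORITIES = ['high', 'medium', 'normal', 'low']

def next_packet(queues):
    # Always serve the highest-priority nonempty queue first.
    for name in PRIORITIES:
        if queues[name]:
            return queues[name].popleft()
    return None   # all queues empty

queues = {name: deque() for name in PRIORITIES}
queues['low'].extend(['L1', 'L2'])
queues['high'].append('H1')
queues['medium'].append('M1')
sent = [next_packet(queues) for _ in range(4)]
print(sent)   # ['H1', 'M1', 'L1', 'L2']: low-priority traffic waits
```

Note how the low-priority packets are served only after every higher queue has drained; if the high queue were refilled continuously, L1 and L2 would starve, which is exactly the behavior described above.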
Listings 4-14 and 4-15 show the current priority queuing configuration and interface queuing strategy information, respectively. The show queueing priority command shows the current priority queue configuration on the router. The show interface serial0 command shows the priority list configured on the interface as well as packet statistics for the four priority queues.
Listing 4-14 Information on Priority Queuing Parameters

Router#show queueing priority
Current priority queue configuration:

List   Queue    Args
1      low      default
1      high     protocol ip list
1      medium   protocol ip list
1      normal   protocol ip list
1      low      protocol ip list
Listing 4-15 Interface Queuing Strategy Information

Router#show interface serial0
<top portion deleted>
  Queueing strategy: priority-list 1
  Output queue (queue priority: size/max/drops):
     high: 2/20/0, medium: 0/40/0, normal: 0/60/0, low: 4/80/0
<bottom portion deleted>
Custom Queuing
Whereas priority queuing potentially guarantees the entire bandwidth for mission-critical data at the expense of low-priority data, custom queuing guarantees a minimum bandwidth for each traffic classification. This bandwidth reservation discipline services each nonempty queue sequentially in a round-robin fashion, transmitting a configurable percentage of traffic on each queue. Custom queuing guarantees that mission-critical data is always assigned a certain percentage of the bandwidth, while assuring predictable throughput for other traffic. You can think of custom queuing as CBWFQ with lots of configuration details. You can classify traffic into 16 queues. Apart from the 16 queues is a special queue 0, called the system queue. The system queue handles high-priority packets, such as keepalive packets and control packets. User traffic cannot be classified into this queue. Custom queuing is implemented to classify IP packets into any of the 16 queues based on the input interface, simple and extended IP access lists, packet size, and application type. A popular use of custom queuing is to guarantee a certain bandwidth to a set of hosts selected by an access list. To allocate bandwidth to the different queues, you must specify the byte count for each queue.
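The round-robin, byte-count-driven service described above can be sketched in Python (an illustrative model, not the IOS implementation; the packet names and sizes are hypothetical, chosen to mirror the byte counts computed in the next case study):

```python
from collections import deque

def custom_queue_round(queues, byte_counts):
    # One round-robin pass: each nonempty queue sends packets until its
    # configured byte count is met or exceeded, then the scheduler moves
    # on. The packet that crosses the byte count is still sent whole.
    sent = []
    for q, quota in zip(queues, byte_counts):
        transmitted = 0
        while q and transmitted < quota:
            name, size = q.popleft()
            sent.append(name)
            transmitted += size
    return sent

q1 = deque([('F1', 1086)])
q2 = deque([('G%d' % i, 291) for i in range(1, 14)])
q3 = deque([('H1', 831), ('H2', 831), ('H3', 831)])
served = custom_queue_round([q1, q2, q3], [1086, 3492, 1662])
print(served)   # one F packet, twelve G packets, two H packets
```

One pass transmits one 1086-byte packet, twelve 291-byte packets, and two 831-byte packets, which is the 1:12:2 packet ratio derived in the byte-count procedure that follows.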
or 0.01842, 0.20619, 0.02407

Step 2. Normalize the numbers by dividing each by the lowest number: 1, 11.2, 1.3. The result is the ratio of the number of packets that must be sent so that the percentage of bandwidth each protocol uses is approximately 20, 60, and 20 percent.

Step 3. A fraction in any of the ratio values means an additional packet is sent. Round up the numbers to the next whole number to obtain the actual packet count. In this example, the actual ratio is 1 packet, 12 packets, and 2 packets.

Step 4. Convert the packet-number ratio into byte counts by multiplying each packet count by the corresponding packet size. In this example, the number of packets sent is one 1086-byte packet, twelve 291-byte packets, and two 831-byte packets, or 1086, 3492, and 1662 bytes, respectively, from each queue. These are the byte counts you would specify in your custom queuing configuration.

Step 5. To determine the bandwidth distribution this ratio represents, first determine the total number of bytes sent after all three queues are serviced: (1 × 1086) + (12 × 291) + (2 × 831) = 1086 + 3492 + 1662 = 6240

Step 6. Then determine the percentage of the total number of bytes sent from each queue: 1086 / 6240, 3492 / 6240, 1662 / 6240 = 17.4%, 56%, and 26.6%. As you can see, this is close to the desired ratio of 20:60:20.

Step 7. If the actual bandwidth is not close enough to the desired bandwidth, multiply the original ratio of 1:11.2:1.3 from Step 2 by the value that brings the ratio as close to three integer values as possible. Note that the multiplier you use need not be an integer. If you multiply the ratio by 2, for example, you get 2:22.4:2.6. You would now send two 1086-byte packets, twenty-three 291-byte packets, and three 831-byte packets, or 2172:6693:2493, for a total of 11358 bytes. The resulting ratio, 19:59:22 percent, is much closer to the desired ratio of 20:60:20.
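The steps above can be collected into one Python helper (a sketch; the function name is invented):

```python
import math

def custom_queue_byte_counts(percentages, packet_sizes, multiplier=1):
    """Byte-count procedure from the text: ratio of percentage to packet
    size, normalize by the smallest, scale by an optional multiplier,
    round up to whole packets, then convert to byte counts and the
    resulting bandwidth shares (in percent)."""
    ratios = [p / s for p, s in zip(percentages, packet_sizes)]
    normalized = [r / min(ratios) * multiplier for r in ratios]
    packets = [math.ceil(n) for n in normalized]
    byte_counts = [n * s for n, s in zip(packets, packet_sizes)]
    total = sum(byte_counts)
    shares = [round(100 * b / total, 1) for b in byte_counts]
    return byte_counts, shares

# 20/60/20 percent across 1086-, 291-, and 831-byte packets:
print(custom_queue_byte_counts([20, 60, 20], [1086, 291, 831]))
# ([1086, 3492, 1662], [17.4, 56.0, 26.6])
print(custom_queue_byte_counts([20, 60, 20], [1086, 291, 831], multiplier=2))
# ([2172, 6693, 2493], [19.1, 58.9, 21.9])
```

The second call reproduces the Step 7 refinement: doubling the ratio before rounding moves the achieved shares from 17.4/56/26.6 percent to roughly 19/59/22 percent.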
Listing 4-18 is the sample configuration needed to stipulate the byte count of the three protocol queues and the assignment of each protocol's traffic to its appropriate queue. The configured custom queue list 1 is enabled on interface Serial0/0/3.

Listing 4-18 Enabling Custom Queuing

interface Serial0/0/3
 custom-queue-list 1
queue-list 1 protocol ip 1 tcp <protocolA>
queue-list 1 protocol ip 2 tcp <protocolB>
queue-list 1 protocol ip 3 tcp <protocolC>
queue-list 1 queue 1 byte-count 2172
queue-list 1 queue 2 byte-count 6693
queue-list 1 queue 3 byte-count 2493
Listings 4-19 and 4-20 show information on the custom queuing configuration and the interface queuing strategy, respectively. The show queueing custom command is used to display the custom queuing configuration.
Listing 4-19 Information on Custom Queuing Configuration

Router#show queueing custom
Current custom queue configuration:

List   Queue   Args
1      1       protocol ip tcp port <protocolA>
1      2       protocol ip tcp port <protocolB>
1      3       protocol ip tcp port <protocolC>
1      1       byte-count 2172
1      2       byte-count 6693
1      3       byte-count 2493
Listing 4-20 Interface Queuing Strategy and Queue Statistics

Router#show interface serial0/0/3
<top portion deleted>
Queueing strategy: custom-list 1
Output queues: (queue #: size/max/drops)
0: 0/20/0 1: 0/20/0 2: 0/20/0 3: 0/20/0 4: 0/20/0 5: 0/20/0
6: 0/20/0 7: 0/20/0 8: 0/20/0 9: 0/20/0 10: 0/20/0 11: 0/20/0
12: 0/20/0 13: 0/20/0 14: 0/20/0 15: 0/20/0 16: 0/20/0
<bottom portion deleted>

The show interface command gives the maximum queue size of each of the 16 queues under custom queuing, along with the instantaneous queue size and packet drop statistics per queue at the time the command is issued. Also, note that window size affects the bandwidth distribution as well. If the window size of a particular protocol is set to 1, that protocol does not place another packet into the queue until it receives an acknowledgment. The custom queuing algorithm moves to the next queue if the byte count is exceeded or if no packets are in that queue. Therefore, with a window size of 1, only one packet is sent each time the queue is serviced. If the byte count is set to 2 KB and the packet size is 256 bytes, only 256 bytes are sent each time this queue is serviced.

Note: Although custom queuing allows bandwidth reservation per traffic class, as does CBWFQ, CBWFQ has many advantages over custom queuing. Some are listed here:

- Setting up CBWFQ is far easier and more straightforward than enabling custom queuing for bandwidth allocations.
- RSVP depends on CBWFQ for bandwidth allocation.
- With CBWFQ, you can apply packet drop policies such as Random Early Detection (RED) on each traffic class, in addition to allocating a minimum bandwidth.
- You are no longer limited to 16 custom queues. CBWFQ supports 64 classes.
Listing 4-22 Enabling a Strict Priority Queue for Voice Traffic up to 640 Kbps Using CBWFQ

class-map premium
 match <premium voice traffic>
policy-map premiumpolicy
 class premium
  priority 640
interface serial0
 service-policy output premiumpolicy

On interface serial0, CBWFQ allocates a minimum bandwidth of 640 Kbps to the premium voice traffic. CBWFQ services the voice traffic on a strict priority queue based on the priority keyword in the policy-map statement.
Summary
At times of network congestion, a scheduling discipline can allocate a specific bandwidth to a certain traffic flow or packet class by determining the order in which the packets in its queue get serviced. In flow-based WFQ, all flows with the same weight are treated fairly based on the max-min fair-share algorithm, and flows with different weights get unequal bandwidth allocations based on their weights. The CBWFQ algorithm is a class-based WFQ mechanism using modular QoS CLI. It is used to allocate a minimum guaranteed bandwidth to a traffic class. Each traffic class is allocated a different subqueue and is serviced according to its bandwidth allocation. Priority and custom queuing algorithms service queues on a strict priority and round-robin basis, respectively. Voice traffic can be serviced on a strict priority queue in CBWFQ and custom queuing so that voice traffic sees low jitter.
Q: Explain the queuing mechanisms using the frequently used airline industry analogy.

A: Comparing the queuing analogy to the airline industry, a FIFO queue is analogous to an airline queuing model whereby passengers belonging to all classes (first, business, and economy) have a single queue to board the plane. No service differentiation exists in this queuing model. CBWFQ, MWRR, and MDRR are analogous to an airline queuing service that assigns a separate queue for each class of passengers, with a weight or bandwidth assigned to each queue. A passenger queue is serviced at a rate determined by its weight or bandwidth allocation. In an airline model analogous to priority queuing, a separate queue for first, business, and economy is used, and the queues are served in strict priority; in other words, those in the first-class queue are served first. An airline model serving ten passengers from first class, six from business class, and four from economy class on a round-robin basis is similar to the custom queuing model.
Q: How is the bandwidth used in CBWFQ when one class is not using its allocated bandwidth?

A: The bandwidth specified for a traffic class in CBWFQ is its minimum guaranteed bandwidth at times of congestion. If a traffic class is not using its allocated bandwidth to its fullest, the other traffic classes in the queuing system can use any leftover bandwidth in proportion to their assigned bandwidth.

Q: Are there any exceptions to the precedence-based weight assignment procedure in flow-based WFQ?

A: Yes. Weight is based on IP precedence for IP traffic unless any of the following four conditions applies: RSVP has negotiated a specific weight for a specific flow; the traffic is voice and Local Frame Interleave is on; the traffic is a locally generated packet, such as a routing update packet set with an internal packet priority flag; or a strict priority queue for voice is enabled by using the ip rtp priority command. In each of those cases, a weight specific to the application is used.

Q: How does flow-based WFQ treat non-IP traffic?

A: There are separate classification routines for non-IP flows based on the packet's data link layer type (for example, IPX, AppleTalk, DECnet, BRIDGE, RSRB). All these non-IP flows are assigned a weight similar to an IP precedence 0 packet. Each non-IP conversation has a separate flow based on the data link layer type. Thus, even though non-IP flows are treated as precedence 0 packets, fairness is provided among the non-IP protocols.

Q: I have WFQ scheduling on my interface, but the actual bandwidth usage between the various traffic classes is different from the theoretical allocation. Why?

A: Use the show interfaces fair or show queueing fair command to observe queue depths and drop counts for each traffic class. In this way, you can determine whether the load is such that you can expect achieved bandwidth ratios to equal configured ratios. If the load is not sufficient to keep all class queues nonempty, the achieved bandwidth allocation does not match the theoretical allocation because the WFQ algorithm is work-conserving and allows classes to use all available bandwidth.
References
1. Keshav, S. An Engineering Approach to Computer Networking. Reading, MA: Addison-Wesley, 1997.
2. Demers, A., S. Keshav, and S. Shenker. "Analysis and Simulation of a Fair Queueing Algorithm." SIGCOMM 1989, Austin, TX, September 1989.
Figure 5-1 WRR Queues with Their Deficit Counters Before Start of Service
The queues show the cells queued, and the cells making up a packet are marked in the same shade of black. Queue 2, for example, has a 2-cell, 3-cell, and 4-cell packet in its queue. Queue 0 is the first queue being served. The deficit counter is initialized to 2, the queue's weight. At the head of the queue is a 4-cell packet. Therefore, the deficit counter becomes 2 - 4 = -2 after serving the packet. Because the deficit counter is negative, the queue cannot be served until it accumulates to a value greater than zero, as in Figure 5-2.
Queue 1 is the next queue to be served. Its deficit counter is initialized to 3. The 3-cell packet at the head of the queue is served, which makes the deficit counter become 3 - 3 = 0. Because the counter is not greater than zero, you skip to the next queue, as in Figure 5-3. Figure 5-3 MWRR After Serving Queue 1 in the First Round
Now it is Queue 2's turn to be serviced. Its deficit counter is initialized to 4. The 2-cell packet at the head of the queue is served, which makes the deficit counter 4 - 2 = 2. The next 3-cell packet is also served, as the deficit counter is greater than zero. After the 3-cell packet is served, the deficit counter is 2 - 3 = -1, as in Figure 5-4. Figure 5-4 MWRR After Serving Queue 2 in the First Round
Queue 0 is now served in the second round. The deficit counter from the last round was -2. Incrementing the deficit counter by the queue's weight makes the counter -2 + 2 = 0. No packet can be served because the deficit counter is still not greater than zero, so you skip to the next queue, as in Figure 5-5.
Queue 1 has a deficit counter of zero from the first round. For the second round, the deficit counter is 0 + 3 = 3. The 4-cell packet at the head of the queue is served, making the deficit counter 3 - 4 = -1, as in Figure 5-6.
In the second round, Queue 2's deficit counter from the first round is incremented by the queue's weight, making it -1 + 4 = 3. The 4-cell packet at the head of Queue 2 is served, making the deficit counter 3 - 4 = -1. Because Queue 2 is now empty, the deficit counter is initialized to zero, as in Figure 5-7.
Now, it is again Queue 0's turn to be served. Its deficit counter becomes 0 + 2 = 2. The 2-cell packet at the head of the queue is served, which results in a deficit counter of 2 - 2 = 0. Now skip to Queue 1, as in Figure 5-8.
Queue 1's new deficit counter is -1 + 3 = 2. The 2-cell packet at the head of Queue 1 is served, resulting in a deficit counter of 2 - 2 = 0. The resulting Queue 1 is now empty. Because Queue 2 is already empty, skip to Queue 0, as in Figure 5-9.
Queue 0's deficit counter in the fourth round becomes 2. The 3-cell packet is served, which makes the deficit counter equal to -1. Because Queue 0 is now empty, reset the deficit counter to zero.
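The walk-through above can be reproduced with a small simulation. This is an illustrative sketch of the MWRR discipline as described, not Cisco code (mwrr is a hypothetical function); the queue contents and weights are taken from the example:

```python
from collections import deque

def mwrr(queues, weights):
    """Simulate the MWRR walk-through above: each round, a nonempty queue's
    deficit counter is increased by its weight, and packets are sent while
    the counter stays above zero; empty queues have their counter reset."""
    deficit = [0] * len(queues)
    order = []                      # (queue index, packet size in cells)
    while any(queues):
        for i, q in enumerate(queues):
            if not q:
                deficit[i] = 0      # empty queues accumulate no credit
                continue
            deficit[i] += weights[i]
            while q and deficit[i] > 0:
                cells = q.popleft()
                deficit[i] -= cells
                order.append((i, cells))
            if not q:
                deficit[i] = 0
    return order

# Queues 0-2 with weights 2, 3, and 4, packet sizes in cells as in the figures:
queues = [deque([4, 2, 3]), deque([3, 4, 2]), deque([2, 3, 4])]
print(mwrr(queues, [2, 3, 4]))
```

The printed service order matches the narrative: Queue 0's 4-cell packet, Queue 1's 3-cell packet, Queue 2's 2-cell and 3-cell packets, and so on.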
MWRR Implementation
MWRR is implemented in the Cisco Catalyst family of switches and the Cisco 8540 routers. These switches and routers differ in terms of the number of available MWRR queues and in the ways you can classify traffic into the queues. MWRR in 8540 routers offers four queues between any interface pair based on Type of Service (ToS) group bits. Table 5-2 shows the ToS class allocation based on the IP precedence bits.

Table 5-2. MWRR ToS Class Allocation

IP Precedence Bits   ToS Class
000, 001             0
010, 011             1
100, 101             2
110, 111             3
8500 ToS-based MWRR is similar to ToS-based Distributed WFQ (DWFQ), discussed in Chapter 4, "Per-Hop Behavior: Resource Allocation I," but differs in terms of which precedence bits are used to implement it. ToS-based DWFQ uses the two low-order precedence bits, whereas 8500 ToS-based MWRR uses the two high-order precedence bits. In both cases, the leftover bit can signify the drop priority: between the two IP precedence values making up a ToS class, it indicates which packets are dropped with a higher probability.
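The two mappings can be sketched as bit operations on the 3-bit precedence field. These helper names are illustrative, not Cisco APIs:

```python
def tos_class_8500(precedence):
    """8500 ToS-based MWRR class: the two high-order precedence bits."""
    return (precedence >> 1) & 0b11

def tos_class_dwfq(precedence):
    """ToS-based DWFQ class: the two low-order precedence bits."""
    return precedence & 0b11

def drop_bit_8500(precedence):
    """Leftover (low-order) bit on the 8500, usable as drop priority."""
    return precedence & 0b1
```

For example, precedence 5 (binary 101) falls into 8500 ToS class 2 with drop bit 1, but into DWFQ class 1.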
Catalyst 6000 and 6500 series switches use MWRR with two queues, Queue 1 and Queue 2, based on the Layer 2 Institute of Electrical and Electronic Engineers (IEEE) 802.1p Class of Service (CoS) field. Frames with CoS values of 0-3 go to Queue 1, and frames with CoS values of 4-7 go to Queue 2. 802.1p CoS is discussed in Chapter 8, "Layer 2 QoS: Interworking with IP QoS." 6500 series switches also implement strict priority queues as part of MWRR to support the low-latency requirements of voice and other real-time traffic.
You can enable Quality of Service (QoS)-based forwarding in an 8540 router by using the global command qos switching. The default weight allocation for ToS classes 0-3 is 1, 2, 4, and 8, respectively. Hence, ToS classes 0-3 get an effective bandwidth of 1/15, 2/15, 4/15, and 8/15 of the interface bandwidth. In this case, the desired bandwidth allocation for classes 0-3 is 15:15:30:40, configured as weights of 3:3:6:8 because a WRR scheduling weight can only be between 1 and 15. Listing 5-1 shows a sample configuration to enable ToS-based MWRR globally on an 8500 router.

Listing 5-1 Enabling ToS-Based MWRR

qos switching
qos mapping precedence 0 wrr-weight 3
qos mapping precedence 1 wrr-weight 3
qos mapping precedence 2 wrr-weight 6
qos mapping precedence 3 wrr-weight 8
The configuration to QoS-switch according to the above criteria for traffic coming into port 1 and going out of port 0 only is given in Listing 5-2.
Listing 5-2 Enabling ToS-Based MWRR on Specific Traffic

qos switching
qos mapping <incoming interface> precedence 0 wrr-weight 3
qos mapping <incoming interface> precedence 1 wrr-weight 3
qos mapping <incoming interface> precedence 2 wrr-weight 6
qos mapping <incoming interface> precedence 3 wrr-weight 8
The general DRR algorithm described in this section is modified to allow a low-latency queue. In MDRR, all queues are serviced in a round-robin fashion with the exception of the low-latency queue, which you can define to run in one of two ways: strict priority or alternate priority mode. In strict priority mode, the low-latency queue is serviced whenever the queue is nonempty, which allows the lowest possible delay for this traffic. It is conceivable, however, for the other queues to starve if the high-priority, low-latency queue is full for long periods of time, because it can potentially take 100 percent of the interface bandwidth. In alternate priority mode, servicing alternates between the low-latency queue and the remaining CoS queues. In addition to the low-latency queue, MDRR supports up to seven other queues, bringing the total number of queues to eight. Assuming that Queue 0 is the low-latency queue, the queues are served in the following order: 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7. In alternate priority mode, the largest delay for Queue 0 is equal to the largest single quantum among the other queues, rather than the sum of all their quanta, as it would be if Queue 0 were served in traditional round-robin fashion. MDRR, then, is not conventional round-robin scheduling; DRR is modified in such a way that it limits the latency on one user-configurable queue, thus providing better jitter characteristics.
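The alternate priority service order described above can be generated with a one-line sketch (alternate_priority_order is a hypothetical helper for illustration):

```python
def alternate_priority_order(llq, others):
    """Service order in MDRR alternate priority mode: the low-latency
    queue is visited before each regular round-robin queue."""
    order = []
    for q in others:
        order += [llq, q]
    return order

# With Queue 0 as the low-latency queue and Queues 1-7 as regular queues:
print(alternate_priority_order(0, range(1, 8)))
```

This reproduces the order 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7 given in the text.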
An MDRR Example
This example, which illustrates an alternate-priority low-latency queue, defines three queues (Queue 2, Queue 1, and Queue 0) with weights of 1, 2, and 1, respectively. Queue 2 is the low-latency queue running in alternate-priority mode. All the queues, along with their deficit counters, are shown in Figure 5-10. Figure 5-10 Queues 0-2, Along with Their Deficit Counters
Table 5-4 provides the weight and quantum associated with each queue. When MDRR is run on the output interface queue, the quantum is computed from the interface maximum transmission unit (MTU); when MDRR is run on the fabric queues, the corresponding MTU is used.

Table 5-4. Queues 0-2, Along with Their Associated Weights and Quantum Values

Queue Number   Weight   Quantum = Weight x MTU (MTU = 1500 Bytes)
Queue 2        1        1500
Queue 1        2        3000
Queue 0        1        1500

On the first pass, Queue 2 is served. Queue 2's deficit counter is initialized to equal its quantum value, 1500. Queue 2 is served as long as the deficit counter is greater than 0. After a packet is served, its size is subtracted from the deficit counter. The first 500-byte packet from the queue gets served because the deficit counter is 1500. Now, the deficit counter is updated to 1500 - 500 = 1000. Therefore, the next packet is served. After the 1500-byte packet is served, the deficit counter becomes -500, and Queue 2 can no longer be served. Figure 5-11 shows the three queues and the deficit counters after Queue 2 is served.
Because you are in alternate-priority mode, you alternate between serving Queue 2 and another queue. This other queue is selected in a round-robin fashion. Consider that in the round robin, it is now Queue 0's turn. The deficit counter is initialized to 1500, the quantum value for the queue. The first 1500-byte packet is served. After serving the first packet, its deficit counter is updated as 1500 - 1500 = 0. Hence, no other packet can be served in this pass. Figure 5-12 shows the three queues and their deficit counters after Queue 0 is served.
Because you alternate between the low-latency queue and the other queues served in the round robin, Queue 2 is served next. Queue 2's deficit counter is updated to -500 + 1500 = 1000. This allows the next packet in Queue 2 to be served. After sending the 500-byte packet, the deficit counter becomes 500. It could have served another packet, but Queue 2 is empty. Therefore, its deficit counter is reset to 0. An empty queue is not attended, and the deficit counter remains 0 until a packet arrives on the queue. Figure 5-13 shows the queues and the counters at this point.
Queue 1 is served next. Its deficit counter is initialized to 3000. This allows three packets to be sent, leaving the deficit counter at 3000 - 1500 - 500 - 1500 = -500. Figure 5-14 shows the queues and the deficit counters at this stage.
Queue 0 is the next queue serviced and sends two packets, making the deficit counter 1500 - 1000 - 1500 = -1000. Because the queue is now empty, the deficit counter is reset to 0. Figure 5-15 depicts the queues and counters at this stage.
Queue 1 serves the remaining packet in a similar fashion in its next pass. Because the queue becomes empty, its deficit counter is reset to 0.
MDRR Implementation
Cisco 12000 series routers support MDRR. MDRR can run on the output interface queue (transmit [TX] side) or on the input interface queue (receive [RX] side) feeding the fabric queues to the output interface. Different hardware revisions of line cards, termed Engine 0, 1, 2, 3, and so on, exist for Cisco 12000 series routers. The nature of MDRR support on a line card depends on the line card's hardware revision. Engine 0 line cards support a software implementation of MDRR; Engine 2 and later revisions support a hardware implementation.
MDRR on the RX
MDRR is implemented in either software or hardware on a line card. In a software implementation, each line card can send traffic to 16 destination slots because the 12000 series routers use a 16x16 switching fabric. For each destination slot, the switching fabric has eight CoS queues, making the total number of CoS queues 128 (16 x 8). You can configure each CoS queue independently.
In the hardware implementation, each line card has eight CoS queues per destination interface. With 16 destination slots and 16 interfaces per slot, the maximum number of CoS queues is 16 x 16 x 8 = 2048. All the interfaces on a destination slot have the same CoS parameters.
MDRR on the TX
Each interface has eight CoS queues, which you can configure independently in both hardware- and software-based MDRR implementations. Flexible mapping between IP precedence and the eight possible queues is offered in the MDRR implementation. MDRR allows a maximum of eight queues so that each IP precedence value can be made its own queue. The mapping is flexible, however: the number of queues needed and the precedence values mapped to those queues are user-configurable, and you can map one or more precedence values into a queue. MDRR also offers individualized drop policy and bandwidth allocation. Each queue has its own associated Random Early Detection (RED) parameters, which determine its drop thresholds, and its own DRR quantum, which determines how much bandwidth it gets. The quantum (in other words, the average number of bytes taken from the queue in each service round) is user-configurable.
Case Study 5-2: Bandwidth Allocation and Minimum Jitter Configuration for Voice Traffic with Congestion Avoidance Policy
Traffic is classified into different classes so that a certain minimum bandwidth can be allocated for each class depending on the need and importance of the traffic. An ISP implements five traffic classes: gold, silver, bronze, best-effort, and a voice class carrying voice traffic and requiring minimum jitter. You need four queues, 0-3, to carry the four traffic classes (best-effort, bronze, silver, gold), and a fifth low-latency queue to carry the voice traffic. This example shows three OC-3 Point-to-Point Protocol (PPP) over Synchronous Optical Network (SONET) (PoS) interfaces, one each in slots 1-3. Listing 5-3 gives a sample configuration for this purpose.

Listing 5-3 Defining Traffic Classes and Allocating Them to Appropriate Queues with a Minimum Bandwidth During Congestion

interface POS1/0
 tx-cos cos-a
interface POS2/0
 tx-cos cos-a
interface POS3/0
 tx-cos cos-a
slot-table-cos table-a
 destination-slot 0 cos-a
 destination-slot 1 cos-a
 destination-slot 2 cos-a
rx-cos-slot 1 table-a
rx-cos-slot 2 table-a
rx-cos-slot 3 table-a
cos-queue-group cos-a
 precedence all random-detect-label 0
 precedence 0 queue 0
 precedence 1 queue 1
 precedence 2 queue 2
 precedence 3 queue 3
 precedence 4 queue low-latency
 precedence 5 queue 0
 random-detect-label 0 50 200 2
 exponential-weighting-constant 8
 queue 0 10
 queue 1 20
 queue 2 30
 queue 3 40
 queue low-latency strict-priority 20

Interfaces PoS1/0, PoS2/0, and PoS3/0 are all configured with TX CoS based on the cos-queue-group cos-a command, which defines a CoS policy. Traffic is mapped into classes based on the IP precedence value in the packet. Each of the five classes is allocated its individual queue, and weights are allocated based on the bandwidth allocation for each class. The bandwidth allocation for a class is proportional to its weight. The percentage of interface bandwidth allocated to Queues 0-3 and the low-latency queue is 8.33, 16.67, 25, 33.33, and 16.67, respectively. Voice is delay-sensitive but not bandwidth-intensive. Hence, it is allocated a low-latency queue with strict priority, but it doesn't need a high-bandwidth allocation. The network supports only IP precedence values of 0-4. IP precedence 5 is allocated to Queue 0 for best-effort service. IP precedence 6 and 7 packets are control packets that a router originates. They are flagged internally and are transmitted first, regardless of the MDRR configuration. The cos-queue-group cos-a command defines a Weighted Random Early Detection (WRED) congestion avoidance policy that applies to all queues as follows:

- Minimum threshold: 50 packets
- Maximum threshold: 200 packets
- Probability of dropping packets at maximum threshold: 1/2 (50 percent)
- Exponential weighting constant to calculate average queue depth: 8
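The bandwidth percentages quoted for Listing 5-3 follow directly from the configured queue weights (10, 20, 30, 40, and 20); a quick illustrative check:

```python
# Queue weights from Listing 5-3; each queue's share of the interface
# bandwidth is its weight divided by the total weight.
weights = {"queue 0": 10, "queue 1": 20, "queue 2": 30, "queue 3": 40,
           "low-latency": 20}
total = sum(weights.values())
shares = {name: round(100 * w / total, 2) for name, w in weights.items()}
print(shares)
```

This yields 8.33, 16.67, 25, 33.33, and 16.67 percent, matching the allocation described in the text.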
WRED is discussed in Chapter 6, "Per-Hop Behavior: Congestion Avoidance and Packet Drop Policy." The MDRR algorithm can also be applied on the input interface line card on the fabric queues delivering the packet to the destination line card. The slot-table-cos command defines the CoS policy for each destination line card's CoS fabric queues on the receive line card. In the example, the slot-table-cos table-a command defines the CoS policy for destination line cards 0-2 based on the cos-queue-group cos-a command. Note that the destination line card can be the same as the receive line card because the input and output interfaces for certain traffic can exist on the same line card. The rx-cos-slot command applies the table-a slot table to a particular slot (line card). Listing 5-4 shows the CoS configuration on the router.

Listing 5-4 CoS Information

Router#show cos
 Interface    Queue Cos Group
 PO1/0        cos-a
 PO2/0        cos-a
 PO3/0        cos-a

 Rx Slot      Slot Table
 1            table-a
 2            table-a
 3            table-a
 Slot Table Name - table-a
  0 cos-a
  1 cos-a
  2 cos-a

 Cos Queue Group - cos-a
  precedence all mapped label 0
  red label 0 min thresh 100, max thresh 300 max prob 2
  exponential-weighting-constant 8
  queue 0 weight 10
  queue 1 weight 20
  queue 2 weight 40
  queue 3 weight 80
  queue 4 weight 10
  queue 5 weight 10
  queue 6 weight 10
  low latency queue weight 20, priority strict

Note that Queues 4-6 are not mapped to any IP precedence, so they are empty queues. Only Queues 0-3 and the low-latency queue are mapped to an IP precedence, and bandwidth is allocated in proportion to the queue weights during congestion. From the line card, the show controllers frfab/tofab cos-queue length/parameters/variables command shows information regarding the CoS receive and transmit queues.
Summary
In this chapter, we discussed two new scheduling algorithms, MWRR and MDRR, that are used for resource allocation. MWRR and MDRR are similar to the WFQ algorithm in their scheduling behavior. MWRR and MDRR scheduling can also support voice traffic if the voice queue is made a strict priority queue. At this time, MWRR and MDRR are used in the Catalyst family of switches and Cisco 12000 series routers, respectively.
References
1. Keshav, S. An Engineering Approach to Computer Networking. Reading, MA: Addison-Wesley, 1997.
2. Shreedhar, M. and George Varghese. "Efficient Fair Queuing Using Deficit Round Robin." SIGCOMM 1995, pp. 231-242.
Figure 6-1 TCP Congestion Window Showing Slow Start and Congestion Avoidance Operations
When packet loss occurs for reasons other than network congestion, waiting for the retransmission timer to expire can have an adverse performance impact, especially in high-speed networks. To avoid this scenario, the TCP fast retransmit and fast recovery algorithms are used.
RED is implemented by means of two different algorithms:

- Average queue size computation: This determines the degree of burstiness allowed in the queue.
- Packet drop probability: For a given average queue size, the probability that a packet is dropped determines how frequently the router drops packets.

These algorithms are discussed in the following sections.
With low values of n, the average queue size closely tracks the current queue size, resulting in the following RED behavior:

- The average queue size moves rapidly and fluctuates with changes in the traffic levels.
- The RED process responds quickly to long queues. When the queue falls below the minimum threshold, the process stops dropping packets.
- If n is too low, RED overreacts to temporary traffic bursts and drops traffic unnecessarily.
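The average queue size is an exponential weighted moving average. As a sketch, assuming the weight is w = 2^-n for exponential weighting constant n (consistent with the "Exp-weight-constant: 9 (1/512)" shown in the listings later in this chapter):

```python
def update_average(avg, current, n):
    """RED average queue size update: an exponential weighted moving
    average with weight w = 2**-n, where n is the exponential
    weighting constant."""
    w = 2 ** -n
    return (1 - w) * avg + w * current

# A low n tracks the current queue size quickly; a high n responds slowly:
print(update_average(0, 100, 1))   # large step toward the current size
print(update_average(0, 100, 9))   # small step: 100/512 of the way
```

This shows why a low n makes RED react (and possibly overreact) to bursts, while a high n smooths them out.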
When the average queue depth is above the minimum threshold, RED starts dropping packets. The packet drop rate increases linearly as the average queue size increases, until the average queue size reaches the maximum threshold. When the average queue size is above the maximum threshold, all packets are dropped. Packet drop probability is illustrated in Figure 6-3. Figure 6-3 RED Packet Drop Probability
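The drop-probability curve just described can be sketched as a small function (drop_probability is an illustrative helper, with the mark probability expressed by its denominator, as in the show command output later in the chapter):

```python
def drop_probability(avg, min_th, max_th, mark_prob_denominator):
    """RED packet drop probability: zero below the minimum threshold,
    rising linearly to 1/mark_prob_denominator at the maximum
    threshold, and 1 (drop everything) above it."""
    if avg < min_th:
        return 0.0
    if avg >= max_th:
        return 1.0
    max_p = 1.0 / mark_prob_denominator
    return max_p * (avg - min_th) / (max_th - min_th)

# With thresholds 20/40 and a mark probability of 1/10:
print(drop_probability(10, 20, 40, 10))  # below min threshold: no drops
print(drop_probability(30, 20, 40, 10))  # halfway: half the mark probability
print(drop_probability(45, 20, 40, 10))  # above max threshold: drop all
```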
WRED
WRED introduces grades of service among packets based on a packet's drop probability and allows selective RED parameters based on IP precedence. As a result, WRED drops packets of certain precedence levels more aggressively and packets of other precedence levels less aggressively.
WRED Implementation
WRED can run on the central processor of the router or in distributed mode on Cisco 7500 series routers with Versatile Interface Processors (VIPs). By default, the maximum threshold is the same for all precedence levels, but the minimum threshold varies with packet precedence. Hence, lower-precedence packets are dropped more aggressively than higher-precedence packets. The default minimum threshold value for precedence 0 traffic is half the value of the maximum threshold. You can enable WRED based on traffic classes by using the modular QoS CLI, which allows different RED parameters for each traffic class and is discussed in Appendix A, "Cisco Modular QoS Command-line Interface." In Cisco 12000 series routers, WRED is available in either hardware- or software-based implementations, depending on the hardware revision of the line card. Cisco 12000 series routers allow eight class of service (CoS) queues. You can map a CoS queue to carry packets of one or more precedence values. After the CoS queues are defined, RED parameters can be applied independently to the different CoS queues. Because this router platform uses a switch-based architecture, you can enable WRED on both the fabric queues on the receive side and the interface queues on the transmit side. For information on how to enable WRED on Cisco 12000 series routers, refer to Case Study 5-10 in Chapter 5, "Per-Hop Behavior: Resource Allocation II."
Case Study 6-1: Congestion Avoidance to Enhance Link Utilization by Using WRED
A service provider offers premium services (platinum, gold, silver, and bronze) to its customers and differentiates customer traffic on its backbone by marking the traffic with IP precedence 4, 3, 2, and 1, respectively, based on the premium service level. The service provider offers these premium services along with a best-effort service that sets traffic to a precedence of 0. The service provider's peering connections to the other top-tier Internet service providers are occasionally congested, pointing to a need for a congestion avoidance policy. Active queue management using WRED is needed on interfaces that peer with the other service providers to control congestion. On the interface connecting a peer service provider, enable WRED by using the random-detect command. Note that the WRED minimum drop threshold parameter needs to be relatively higher for higher-precedence (better service) traffic when compared to the lower-precedence traffic, so lower-precedence packets are dropped before higher-precedence traffic gets affected. Listings 6-1 and 6-2 show the WRED default parameters and packet drop statistics by using the show queueing random-detect command when WRED is running in the central processor and VIP line cards of a Cisco 7500 router, respectively.

Listing 6-1 WRED Operation on a Low-end 7200 Series Router and on a 7500 Series Router in a Nondistributed Mode

Router#show queueing random-detect
Hssi3/0/0
 Queueing strategy: random early detection (WRED)
 Exp-weight-constant: 9 (1/512)
 mean queue depth: 40
 drops:
 class  random  tail    min-th  max-th  mark-prob
 0      13783   174972  20      40      1/10
 1      14790   109428  22      40      1/10
 2      14522   119275  24      40      1/10
 3      14166   128738  26      40      1/10
 4      13384   138281  28      40      1/10
 5      12285   147148  31      40      1/10
 6      10893   156288  33      40      1/10
 7      9573    166044  35      40      1/10
 rsvp   0       0       37      40      1/10
When running on the central processor, WRED is run on the interface transmit queue, and its thresholds are determined as follows:

- min_threshold(i): the minimum threshold corresponding to traffic of precedence i, equal to (1/2 + i/18) x output hold-queue size
- max_threshold: equal to the output hold-queue size

The output of the show queueing random-detect command gives the following packet statistics:

- Random drop: WRED drops when the mean queue depth falls between the minimum and maximum WRED thresholds.
- Tail drop: WRED drops when the mean queue depth exceeds the maximum threshold.
- Mark probability: the drop probability when the mean queue depth is equal to the maximum threshold.
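As a hedged check, assuming the default minimum thresholds follow min_threshold(i) = (1/2 + i/18) x hold-queue size, truncated to an integer, the values for a 40-packet hold queue match the min-th column of Listing 6-1:

```python
# Assumed default WRED threshold formula, verified against Listing 6-1
# (hold queue = max threshold = 40 packets).
hold_queue = 40
min_thresholds = [int((0.5 + i / 18) * hold_queue) for i in range(8)]
print(min_thresholds)  # [20, 22, 24, 26, 28, 31, 33, 35]
```

The same formula with i = 8 gives the RSVP row's minimum threshold of 37.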
Listing 6-2 WRED Operation on a VIP-Based 7500 Series Router

Router#show queueing random-detect
Hssi3/0/0
 Queueing strategy: VIP-based fair queueing
 Packet drop strategy: VIP-based random early detection (DWRED)
 Exp-weight-constant: 9 (1/512)
 Mean queue depth: 0
 Queue size: 0      Maximum available buffers: 5649
 Output packets: 118483   WRED drops: 800   No buffer: 0

 Class  Random drop  Tail drop  Minimum threshold  Maximum threshold  Mark probability  Output packets
 0      23           0          1412               2824               1/10              116414
 1      0            0          1588               2824               1/10              0
 2      0            0          1764               2824               1/10              12345
 3      0            0          1940               2824               1/10              20031
 4      0            0          2116               2824               1/10              45670
 5      0            0          2292               2824               1/10              0
 6      0            0          2468               2824               1/10              2345
 7      0            0          2644               2824               1/10              0
When running WRED in distributed mode on VIP-based 7500 routers, WRED processing for an interface is done locally on the line card and not on the route processor. Hence, distributed WRED operates on the queue on the VIP and not on the interface transmit queue. Distributed WRED uses Cisco Express Forwarding (CEF)-based interprocess communication to propagate configuration and statistics information between the Route Switch Processor (RSP) and the VIP. (CEF is discussed in Appendix B, "Packet Switching Mechanisms.") Hence, CEF should be enabled when operating WRED in distributed mode. The interface drop statistics include Distributed WRED (DWRED) drops. DWRED calculates the default maximum threshold based on the pool size (loosely speaking, the VIP queue size), the interface maximum transmission unit (MTU) size, and the interface bandwidth. Pool size is, in turn, dependent on the memory size on the VIP and other factors and is therefore difficult to pin down to a fixed number. Therefore, the pool size information only helps to estimate the burst size that can be accommodated. You can configure the DWRED maximum threshold to a value different from the default, if necessary.
Case 6-2: WRED Based on Traffic Classes Using Modular QoS CLI
Case 6-2: WRED Based on Traffic Classes Using Modular QoS CLI

A service provider is interested in running WRED in its routers because the interface queues in the routers often exhibit symptoms of global synchronization. The service provider carries critical User Datagram Protocol (UDP)-based application traffic on its network; hence, it doesn't want to randomly drop any of its critical application traffic, although tail drop can be applied at the queue's maximum threshold. Listing 6-3 excludes the critical UDP-based application traffic before applying WRED on the router's interface.

Listing 6-3 Enable WRED Using Modular QoS CLI on the Noncritical Traffic Class Only

class-map non-critical
  match access-group 101
policy-map wred
  class non-critical
    random-detect exponential-weighting-constant 9
    random-detect precedence 0 112 375 1
interface Hssi0/0/0
  service-policy output wred

As shown in Listing 6-3, you first define a noncritical traffic class that includes all traffic except the critical UDP traffic. Use an access-list 101 command (not shown in Listing 6-3) to match all traffic except the critical UDP application traffic. Follow this with the match access-group 101 command, which matches all traffic permitted by the access-list 101 definition, and the policy-map wred command, which enables WRED on the noncritical traffic class. In this listing, WRED is enabled with an exponential weighting constant of 9. At the same time, the minimum threshold, maximum threshold, and mark probability denominator values for IP precedence 0 packets are set to 112, 375, and 1, respectively. The policy map wred is then applied to the router's Hssi0/0/0 interface.
Flow WRED
Only adaptive TCP flows respond to a congestion signal and slow down; nonadaptive UDP flows, which do not respond to congestion signals, don't slow down. For this reason, nonadaptive flows can send packets at a much higher rate than adaptive flows at times of congestion. Hence, greedy, nonadaptive flows tend to consume more queue resources than the adaptive flows that slow down in response to congestion signals. Flow WRED modifies WRED so that it penalizes flows taking up more than their fair share of queue resources.

To provide fairness among the active traffic flows in the queue, Flow WRED classifies all packets arriving into the queue based on their flow and precedence. It also maintains state for all active flows, or flows that have packets in the queue. This state information is used to determine the fair amount of queue resources for each flow (queue size/number of active flows), and flows taking more than their fair share are penalized more than the others are. To accommodate a flow's traffic burstiness, each flow's fair share is increased by a scaling factor before the flow gets penalized, using the following formulas:

Fair share of queue resources per active flow = queue size / number of active flows
Scaled fair share of queue resources per flow = (queue size / number of active flows) × scaling factor

A flow exceeding the scaled fair share of queue resources per flow is penalized by an increase in the non-zero drop probability for all of its newly arriving packets. As an example, consider a packet arriving on a queue running Flow WRED. Flow WRED considers both the IP precedence value in the packet and the flow state information to determine the packet's drop probability. The packet's IP precedence determines the configured (or the default) minimum and maximum WRED thresholds for the packet. If the average queue size is below the minimum threshold, the packet gets zero drop probability (in other words, it is not dropped).
If the average queue size is between the packet's minimum and maximum thresholds (as determined by the packet's IP precedence), the flow state information is taken into consideration. If the packet belongs to a flow that exceeded the scaled fair share of queue resources per flow in the queue, the packet's drop probability is increased by decreasing the WRED maximum threshold as follows:

New Maximum threshold = Minimum threshold + ((Maximum threshold - Minimum threshold) / 2)

The non-zero drop probability is then derived based on the minimum threshold and the new maximum threshold values. Because the drop probability curve is much steeper now, as shown in Figure 6-4, a higher drop probability applies to the packet. If the flow is within its fair allocation of queue resources, the packet gets a non-zero drop probability determined by the normal WRED calculation.
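The threshold adjustment described above can be sketched as follows. The scaling factor value used here is an illustrative assumption, not an IOS default:

```python
def flow_wred_max_threshold(min_th, max_th, flow_count, queue_size,
                            flow_pkts_in_queue, scaling_factor=4):
    """Sketch of the Flow WRED maximum-threshold adjustment.

    A flow whose queued packets exceed its scaled fair share gets a
    steeper drop curve: the distance between the thresholds is halved.
    Other flows keep the normal WRED maximum threshold.
    """
    fair_share = queue_size / flow_count
    scaled_share = fair_share * scaling_factor
    if flow_pkts_in_queue > scaled_share:
        # Greedy flow: reduce the maximum threshold, steepening the curve.
        return min_th + (max_th - min_th) / 2
    return max_th

# A flow with 200 packets queued, against a scaled fair share of
# 400/10 * 4 = 160 packets, is penalized with a reduced threshold.
new_max = flow_wred_max_threshold(100, 300, 10, 400, 200)
```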
If the average queue size exceeds the maximum threshold, packets continue to be dropped, using a process similar to that in the normal WRED operation. Flow WRED increases the probability of a packet getting dropped only if the packet belongs to a flow whose packets in the queue exceed the scaled per-flow limit. Otherwise, Flow WRED operates like WRED.

Note Flow WRED, as described previously, still applies a non-zero drop probability to flows with only a few packets in the queue when the average queue size is between the minimum and maximum thresholds. You can configure Flow WRED so that it doesn't drop packets of a flow with fewer than a specified number of packets in the queue by increasing the minimum threshold to be close to, or equal to, the maximum threshold. Tail drop still applies at the queue's WRED maximum threshold value.
You can use the show queueing random-detect command to verify an interface's queuing strategy.
ECN
Thus far, in active queue management using WRED, packets have been dropped to signal congestion to a TCP source. ECN provides routers the functionality to signal congestion to a TCP source by marking a packet header rather than dropping the packet[4]. In scenarios where a WRED-enabled router would have dropped a packet to signal congestion, it can instead set the ECN bits in the packet, avoiding potential delays due to retransmission of lost packets. ECN functionality requires support for a Congestion Experienced (CE) bit in the IP header and a transport protocol that understands the CE bit. A 2-bit ECN field in the IP header is provided for this purpose. The ECN-Capable Transport (ECT) bit is set by the TCP source to indicate that the transport protocol's endpoints are ECN-capable. The CE bit is set by the router to indicate congestion to the end nodes. Bits 6 and 7 in the IPv4 Type of Service (ToS) byte form the ECN field and are designated as the ECT bit and the CE bit, respectively. Bits 6 and 7 are listed in the differentiated services architecture as currently unused.

For TCP, ECN requires three new mechanisms:

ECN capability negotiation: The TCP endpoints negotiate during connection setup to determine whether they are both ECN-capable.

ECN-Echo flag in the TCP header: The TCP receiver uses the ECN-Echo flag to inform the TCP source that a CE packet has been received.

Congestion Window Reduced (CWR) flag in the TCP header: The TCP source uses the CWR flag to inform the TCP receiver that the congestion window has been reduced.

ECN functionality is still under discussion in the standards bodies.
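As an illustrative sketch of the ECN field described above: with bit 0 as the most significant bit of the ToS byte, bits 6 and 7 are the two least significant bits. The function name below is hypothetical, not from any real stack:

```python
# ECN field per the text above (and RFC 2481): in the IPv4 ToS byte,
# bit 6 is ECT and bit 7 is CE, i.e. the two least significant bits.
ECT = 0b10  # ECN-Capable Transport, set by the TCP endpoints
CE  = 0b01  # Congestion Experienced, set by a congested router

def mark_congestion(tos):
    """If the transport declared ECN capability, mark CE instead of
    dropping.  Returns (new_tos, dropped)."""
    if tos & ECT:
        return tos | CE, False   # signal congestion by marking
    return tos, True             # non-ECN flow: fall back to a WRED drop

# An ECT-marked packet is marked CE rather than dropped.
tos, dropped = mark_congestion(ECT)
```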
SPD
SPD helps prioritize important control traffic (such as routing protocol packets) over normal data traffic destined to the router. This enables the router to keep its Interior Gateway Protocol (IGP) and Border Gateway Protocol (BGP) routing information current during congestion by enqueuing routing protocol packets ahead of the normal data traffic. SPD implements a selective packet drop policy on the router's IP process queue; therefore, it applies only to process-switched traffic. Even when the router is using route-cache forwarding (also called fast switching), some of the transit data traffic still needs to be process switched in order to create a route-cache entry. When a router is using CEF, though, all transit data traffic is usually CEF switched, and the only packets that reach the IP process input queue are the important control packets such as routing updates and keepalives, normal data traffic destined to the router, and transit traffic that is not CEF supported. Process switching, route-cache forwarding, and CEF are discussed in Appendix B.

Traffic arriving at the IP process input queue is classified in three ways:

Important IP control traffic (routing protocol packets), often called priority traffic.

Normal IP traffic, such as telnet/ping packets to a router interface, IP packets with options, and any IP feature or encapsulation not supported by CEF.

Aggressively droppable packets. These are IP packets that fail the IP sanity check; that is, they might have incorrect checksums, invalid versions, an expired Time-to-Live (TTL) value, an invalid UDP/TCP
port number, an invalid IP protocol field, and so on. Most of these packets trigger an Internet Control Message Protocol (ICMP) packet to notify the sender of the bad packet. A small number of these packets are generated by normal utilities such as traceroute. Such packets in large numbers, however, can be part of a malicious smurf attack intended to cripple the router by filling up the IP process queue. It is essential to selectively drop these packets without losing the important control information.

SPD operates in the following modes:

Disabled: The SPD feature is disabled on the router.

Normal: The IP input queue is below the queue minimum threshold. No packets are dropped.

Random drop: The IP input queue is above the minimum threshold but below the maximum threshold. Normal IP packets are dropped in this mode, with a drop probability that grows as the queue length approaches the maximum threshold. Random drops are called SPD flushes. Important IP control traffic is still enqueued.

Full drop: The IP input queue is above the maximum threshold. All normal IP traffic is dropped. Important IP control traffic is still received into a special process-level queue, termed the priority queue, that is drained before the normal one.

Aggressive drop: This is a special drop mode for IP packets failing the sanity check. All bad IP packets are dropped when the input queue is above the minimum threshold. You can enable this special drop mechanism using the ip spd mode aggressive command.

Figure 6-5 illustrates SPD operation and its modes.
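The mode selection described above can be sketched as a simple threshold comparison. This is an illustrative model, not the IOS implementation:

```python
def spd_mode(queue_len, min_th, max_th):
    """Classify the SPD operating mode from the IP input queue length,
    per the thresholds described above (illustrative sketch)."""
    if queue_len < min_th:
        return "normal"        # nothing is dropped
    if queue_len < max_th:
        return "random drop"   # normal IP packets dropped probabilistically
    return "full drop"         # all normal IP traffic dropped

# With the 73/74 thresholds shown later in Listing 6-5, a normal
# queue depth of 20 packets leaves SPD in normal mode.
mode = spd_mode(20, 73, 74)
```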
Case Study 6-4: Preventing Bad IP Packet Smurf Attacks by Using SPD
Your network is seeing a smurf attack of IP packets with expired TTL values. These bad IP packets adversely affect the router because they must be handled in the process switching path, where each one triggers a TTL-exceeded ICMP packet. You should turn on the aggressive drop mechanism to drop such packets when they exceed the set SPD minimum threshold. This case study discusses SPD operation by using various show commands.

SPD can be enabled by using the ip spd enable command. In IOS version 12.0 and later releases, though, SPD is turned on by default. Listing 6-5 shows the enabled SPD parameters and the SPD mode of operation.

Listing 6-5 SPD Parameters and Current Mode of Operation

R3#sh ip spd
Current mode: normal.
Queue min/max thresholds: 73/74, Headroom: 100
IP normal queue: 20, priority queue: 20.
SPD special drop mode: none

You use the show ip spd command to show the SPD operation mode, the minimum and maximum thresholds, and the IP process and priority queue sizes. The minimum and maximum input queue SPD thresholds are user-configurable. IP normal queue and priority queue give the depth of their respective queues at the time the show ip spd command is executed. The IP normal queue is 20, which is below the minimum queue threshold of 73; hence, SPD is currently operating in the normal mode.

A special queue, termed the IP priority queue, is used to hold the important IP traffic when the IP input process queue is full. The IP input process queue is a global queue that holds all IP packets waiting to be processed by the IP input process; these packets cannot be processed on a CPU interrupt. You can configure the priority queue length using the ip spd headroom command. Any important IP traffic arriving on a full IP priority queue is dropped.

Note Note that some of the SPD command formats changed in 12.0 IOS releases when compared to the earlier 11.1 IOS versions.
While all the 11.1 IOS version SPD commands started with ip spd, some of the SPD commands in 12.0 and later versions of IOS can only be configured as spd <> instead
of ip spd <>. Refer to the appropriate Cisco documentation for clarification on the command formats.
Headroom shows the priority queue's depth; the priority queue holds the priority traffic. Presently, the priority queue has 20 packets to be processed. The default normal queue thresholds are derived from the input queue length, and the show interface command shows the queue length of the input queue.

All bad IP packets need to be process switched. You should enable the SPD aggressive drop mechanism to drop bad or illegal IP packets when the normal IP queue exceeds the minimum threshold. You can enable the SPD aggressive drop mode using the ip spd mode aggressive command. Listing 6-6 shows the SPD operation with the special SPD aggressive mode on.

Listing 6-6 Aggressive SPD Drop Mechanism for Bad or Illegal IP Packets

R3#sh ip spd
Current mode: normal.
Queue min/max thresholds: 73/74, Headroom: 100
IP normal queue: 20, priority queue: 5.
SPD special drop mode: aggressively drop bad packets

For detailed SPD packet statistics, refer to Listing 6-7.

Listing 6-7 Detailed SPD Packet Statistics

R3#show interface hssi0/0/0 switching
Hssi0/0/0
        Throttle count          10
        Drops           RP   13568        SP        0
        SPD Flushes   Fast  322684       SSE        0
        SPD Aggress   Fast  322684
        SPD Priority Inputs  91719     Drops        4
You use the show interfaces switching command to view statistics on the SPD flushes and aggressive drops. It includes information on received important (priority) IP traffic and on any drops that occur when the headroom threshold is exceeded. Route Processor (RP) drops are those made by the route processor on a full input queue; they match the input drops shown in the show interface Hssi0/0/0 command output. Fast SPD flushes are the random and full drops, as well as the special aggressive drops, done by SPD when the input queue is not physically exhausted. No SPD drops occur when the normal queue is below the minimum threshold. Fast SPD aggressive drops occur when SPD is operating in the special aggressive mode; they show the number of bad or illegal IP packets dropped by SPD. No SPD aggressive drops occur when the normal queue is below the minimum threshold. The SPD priority queue carries the priority traffic. Priority traffic is control traffic tagged with an IP precedence of 6.
Summary
This chapter focuses on congestion avoidance and packet drop policy algorithms: RED, WRED, Flow WRED, and ECN. RED is an active queue management algorithm that allows routers to detect congestion before the queue overflows. It aims to reduce the average queue size, hence reducing queuing delay and avoiding global synchronization, by adopting a probability-based packet drop strategy between certain queue thresholds. WRED, the weighted version of RED, allows different RED parameters based on a packet's IP precedence. Flow WRED extends WRED to provide fairness in the packet drop behavior among the different flow types. ECN enables notification of incipient congestion by marking packets rather than dropping them. The chapter also discusses SPD, a selective packet drop policy that is used on the IP input queue of a Cisco router. SPD differentiates between the data and control packets that are enqueued to the IP process in a Cisco router.
A: Flow WRED is not bad for voice because it doesn't penalize all UDP traffic, only greedy UDP flows.

Q: Should I be running SPD in all my routers?

A: The SPD feature is critical in routers in any large-scale IP network, especially when they do route-cache-based forwarding. With CEF-based switching, the utility of SPD is lower because SPD applies only to process-switched traffic, but it is still useful in differentiating the routing protocol packets and keepalive packets from the other normal packets destined to the router. SPD with the special aggressive drop mode is useful for minimizing potential denial-of-service attacks that target a router with bad IP packets, such as packets with expired TTL values.
References
1. "TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms," W. Stevens, RFC 2001, January 1997.
2. "Recommendations on Queue Management and Congestion Avoidance in the Internet," B. Braden, D. Clark, J. Crowcroft, B. Davie, S. Deering, D. Estrin, S. Floyd, V. Jacobson, G. Minshall, C. Partridge, L. Peterson, K. Ramakrishnan, S. Shenker, J. Wroclawski, L. Zhang, RFC 2309, April 1998.
3. "Random Early Detection Gateways for Congestion Avoidance," S. Floyd and V. Jacobson, IEEE/ACM Transactions on Networking, V.1 N.4, August 1993, pp. 397-413.
4. "A Proposal to Add Explicit Congestion Notification (ECN) to IP," K. Ramakrishnan and S. Floyd, RFC 2481, January 1999.
5. "RED in a Different Light," V. Jacobson, K. Nichols, K. Poduri, work in progress.
RSVP
The Internet Engineering Task Force (IETF) specified RSVP[1] as a signaling protocol for the intserv architecture. RSVP enables applications to signal per-flow QoS requirements to the network. Service parameters are used to specifically quantify these requirements for admission control. RSVP is used in multicast applications such as audio/video conferencing and broadcasting. Although the initial target for RSVP is multimedia traffic, there is a clear interest in reserving bandwidth for unicast traffic such as Network File System (NFS), and for Virtual Private Network (VPN) management. RSVP signals resource reservation requests along the routed path available within the network. It does not perform its own routing; instead, it is designed to use the Internet's current robust routing protocols. Like other IP traffic, it depends on the underlying routing protocol to determine the path for both its data and its control traffic. As the routing protocol information adapts to network topology changes, RSVP reservations are carried over to the new path. This modularity helps RSVP to function effectively with any underlying routing service. RSVP provides opaque transport of traffic control and policy control messages, and provides transparent operation through nonsupporting regions.
RSVP Operation
End systems use RSVP to request a specific QoS from the network on behalf of an application data stream. RSVP requests are carried through the network, visiting each node the network uses to carry the stream. At each node, RSVP attempts to make a resource reservation for the stream. RSVP-enabled routers help deliver the right flows to the right locations. Figure 7-1 gives an overview of the important modules and the data and control flow information of a client and router running RSVP.
Figure 7-1 Data and Control Flow Information of a Client and Router Running RSVP
The RSVP daemon in a router communicates with two local decision modules, admission control and policy control, before making a resource reservation[2]. Admission control determines whether the node has sufficient available resources to supply the requested QoS. Policy control determines whether the user has administrative permission to make the reservation. If either check fails, the RSVP daemon sends an error notification to the application process that originated the request. If both checks succeed, the RSVP daemon sets parameters in a packet classifier and a packet scheduler to obtain the desired QoS. The packet classifier determines the QoS class for each packet, and the packet scheduler orders packet transmission based on its QoS class. The Weighted Fair Queuing (WFQ) and Weighted Random Early Detection (WRED) disciplines provide scheduler support for QoS. WFQ and WRED are discussed in Chapters 4 and 6, respectively.

During the admission control decision process, a reservation for the requested capacity is put in place if sufficient capacity remains in the requested traffic class. Otherwise, the admission request is refused, but the traffic is still forwarded with the default service for that traffic's class. In many cases, even an admission request that failed at one or more routers can still supply acceptable quality, because it might have succeeded in installing a reservation in all the routers suffering congestion; other reservations might not be fully utilizing their reserved capacity. Reservations must follow the same unicast path, or the same multicast tree, at all times. In case of link failures, the router should inform the RSVP daemon so that RSVP messages are generated on the new route.

You can break down the process of installing a reservation into five distinct steps[3]:

1. Data senders send RSVP PATH control messages the same way they send regular data traffic. These messages describe the data they are sending or intend to send.

2. Each RSVP router intercepts the PATH messages, saves the previous hop IP address, writes its own address as the previous hop, and sends the updated message along the same route the application data is using.

3. Receiver stations select a subset of the sessions for which they are receiving PATH information and request RSVP resource reservations from the previous hop router using an RSVP RESV message. The RSVP RESV messages going from a receiver to a sender take the exact reverse of the path taken by the RSVP PATH messages.

4. The RSVP routers determine whether they can honor those RESV requests. If they can't, they refuse the reservations. If they can, they merge the reservation requests being received and request a reservation from the previous hop router.

5. The senders receive reservation requests from the next hop routers indicating that reservations are in place.

Note that the actual reservation allocation is made by the RESV messages. Figure 7-2 shows the RSVP reservation setup mechanism.
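Step 2, the previous-hop bookkeeping that later lets RESV messages retrace the sender's path, can be sketched as follows. The names and message structure here are illustrative, not an actual RSVP implementation:

```python
def propagate_path(message, router_addr, path_state):
    """Sketch of PATH processing at one router: record the previous hop
    for this session (the Path State Block), then rewrite the previous
    hop as this router's own address before forwarding downstream."""
    path_state[message["session"]] = message["prev_hop"]
    return dict(message, prev_hop=router_addr)

# A PATH message traversing R1 then R2 leaves each router knowing its
# upstream hop; that chain is the reverse route RESV messages follow.
state_r1, state_r2 = {}, {}
msg = {"session": "s1", "prev_hop": "sender"}
msg = propagate_path(msg, "R1", state_r1)
msg = propagate_path(msg, "R2", state_r2)
```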
As discussed in Chapter 3, "Network Boundary Traffic Conditioners: Packet Classifier, Marker, and Traffic Rate Management," an individual flow is made of packets going from an application on a source machine to an application on a destination machine. The FlowSpec parameterizes a flow's requirements for admission control.
RSVP Components
The operational responsibilities of the three RSVP components are as follows:

An RSVP sender is an application that originates traffic in an RSVP session. The flow specifications that RSVP senders can send across the RSVP network are:
o Average data rate
o Maximum burst size

An RSVP-enabled router network provides the path between the RSVP senders and the RSVP receivers.

An RSVP receiver is an application that receives traffic in an RSVP session. In conferencing and Voice over IP (VoIP) applications, an application can act as both an RSVP sender and receiver. The flow specifications that RSVP receivers can send across the RSVP network are:
o Average data rate
o Maximum burst size
o QoS, which is one of the following:
  Guaranteed service: PATH messages also describe the worst-case delays in the network.
  Controlled load service: The routers guarantee only that network delays will be minimized.
RSVP Messages
RSVP uses seven message types for its operation: two required message types, PATH and RESV, and five optional message types: PATH ERROR, PATH TEARDOWN, RESV ERROR, RESV CONFIRM, and RESV TEARDOWN. The RSVP routers and clients use them to create and maintain reservation states. RSVP usually runs directly over IP. As such, RSVP messages are unreliable datagrams; they create soft state within the routers, and a periodic refresh is needed.

The following are the sender message types:

PATH messages are sent periodically by senders. The senders describe the flows in terms of the source and destination IP addresses, the IP protocol, and the User Datagram Protocol (UDP) or
Transmission Control Protocol (TCP) ports, if applicable. They quantify the expected resource requirements for this data by specifying its mean rate and burst size. They are sent to the multicast group or unicast destination of the flow for which the reservation is being made; RSVP routers detect them because they are sent in UDP messages to a particular UDP port, or because they have the IP Router Alert option in their IP header. A router creates a Path State Block (PSB) when the PATH messages are received. PATH messages contain a periodic hello interval indicating how frequently the sender sends them; the default hello interval is 30 seconds. It is important to keep the hello interval small, or to have a fast retransmit scheme, because lost PATH messages delay the establishment of an RSVP reservation along the path of a VoIP call, resulting in poor performance. The PSB is discarded upon a PATH TEARDOWN or ingress link failure, or when the PSB has not been refreshed by a new PATH message after four hello intervals.

When error(s) in a PATH message are found, the optional PATH ERROR message is sent by the receiver or a router, notifying the sender of the problem. Typically, this is a fundamental format or integrity check fault.

PATH TEARDOWN messages are sent to the multicast group with the sender's source address when the PATH state must be flushed from the database, either due to a link failure or because the sender is exiting the multicast group.
The following are the receiver message types:

RESV messages are sent periodically by receivers. The receivers describe the flows and the resource guarantees they need using information derived from the PATH messages, in terms of the source and destination IP addresses, the IP protocol, and the UDP or TCP ports, if applicable. They also describe the bit rate and delay characteristics they need, using flow specifications. They traverse all RSVP routers along the routed path back to the sender for which the reservation is being made. Routers create Reservation State Blocks (RSBs) when RESV messages (FlowSpec, FilterSpec) are granted. RESV messages contain a periodic hello interval indicating how frequently the receiver sends them. The RSB is discarded upon a RESV TEARDOWN or ingress link failure, or when it has not been refreshed by a new RESV message after four hello intervals.

When error(s) in an RESV message are found, an RESV ERROR message is sent by a sender or a router informing the receiver of the problem. Typically, it is due to a fundamental format or integrity check fault, or because insufficient resources were available to make the requested guarantees.

When the effect of an RESV message applies end to end and a receiver requests notification of that fact, RESV CONFIRM messages are sent to the receivers or merge-point routers.

RESV TEARDOWN messages are sent when an RSB must be flushed from the database, either due to a link failure or because the receiver is leaving the multicast group.
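The soft-state aging described above, in which a path or reservation state block is discarded after four hello intervals pass without a refresh, can be sketched as:

```python
def state_expired(last_refresh, now, hello_interval=30, missed_limit=4):
    """Sketch of RSVP soft-state aging: a PSB or RSB is discarded once
    four hello intervals (default 30 seconds each) elapse without a
    refreshing PATH or RESV message.  Times are in seconds."""
    return (now - last_refresh) >= missed_limit * hello_interval

# 130 seconds without a refresh exceeds 4 x 30 s, so the state is flushed.
expired = state_expired(last_refresh=0, now=130)
```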
Reservation Styles
You can categorize RSVP flow reservations into two major types, distinct and shared, which are discussed in the following sections.
Distinct Reservations
Distinct reservations are appropriate for those applications in which multiple data sources are likely to transmit simultaneously. In a video application, each sender emits a distinct data stream requiring separate admission control and queue management on the routers along its path to the receiver. Such a flow, therefore, requires a separate reservation per sender on each link along the path. Distinct reservations are explicit about the sender and are installed using a Fixed Filter (FF) reservation style. Symbolically, you can represent an FF-style reservation request by FF (S,Q), where the S represents the sender selection and the Q represents the FlowSpec.
Unicast applications form the simplest case of a distinct reservation, in which there is one sender and one receiver.
Shared Reservations
Shared reservations are appropriate for those applications in which multiple data sources are unlikely to transmit simultaneously. Digitized audio applications, such as VoIP, are suitable for shared reservations: because only a few people talk at any one time, only a limited number of senders transmit at any given moment. Such a flow, therefore, does not require a separate reservation per sender; it requires a single reservation that you can apply to any sender within a set, as needed. RSVP refers to such a flow as a shared flow and installs it using a shared explicit or wildcard reservation scope. These two reservation styles are discussed below.

The Shared Explicit (SE) reservation style specifically identifies the flows that reserve network resources. Symbolically, you can represent an SE-style reservation request by SE((S1,S2){Q}), where S1, S2, . . . represent the specific senders for the reservation and Q represents the FlowSpec.

The Wildcard Filter (WF) style reserves bandwidth and delay characteristics for any sender. It does not specify particular senders; it accepts all senders, which is denoted by setting the source address and port to zero. Symbolically, you can represent a WF-style reservation request by WF(*{Q}), where the asterisk represents the wildcard sender selection and Q represents the FlowSpec.

Table 7-1 shows the different reservation filters based on the reservation styles and sender scope, and Figure 7-3 illustrates the three reservation filter styles described previously.
Table 7-1. Different Reservation Filters, Based on Style and Sender Scope

Sender Selection Scope    Reservation Style (Distinct)    Reservation Style (Shared)
Explicit                  FF                              SE
Wildcard                  None defined                    WF
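The sender-matching behavior of the three filter styles can be sketched as follows. This is an illustrative model, not an RSVP implementation:

```python
def matches_reservation(style, senders, packet_sender):
    """Which senders a reservation admits, per filter style: FF admits
    exactly one explicit sender, SE an explicit set of senders, and WF
    any sender (the wildcard treats source address and port as zero)."""
    if style == "FF":
        return packet_sender == senders[0]
    if style == "SE":
        return packet_sender in senders
    if style == "WF":
        return True
    raise ValueError("unknown reservation style: " + style)

# An SE reservation for senders S1 and S2 admits S2 but not S3.
ok = matches_reservation("SE", ["S1", "S2"], "S2")
```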
Service Types
RSVP provides two types of integrated services that receivers can request through their RSVP RESV messages: controlled load service and guaranteed bit rate service.
Controlled Load
Under controlled load service [4], the network guarantees that the reserved flow will reach its destination with a minimum of interference from the best-effort traffic. In addition, Cisco's implementation of the service offers isolation between the reserved flows. Flow isolation allows a flow reservation to operate unaffected by the presence of any other flow reservations that might exist in the network. The controlled load service is primarily intended for a broad class of applications running on the Internet today that are sensitive to overloaded conditions. These applications work well on unloaded nets, but they degrade quickly under overloaded conditions. An example of such an application is File Transfer Protocol (FTP).
In the Cisco-specific implementation, to support RSVP over ATM, RSVP creates a Variable Bit Rate (VBR) ATM switched virtual circuit (SVC) for every reservation across an ATM network. It then redirects all reserved traffic down the corresponding SVCs and relies on the ATM interface to police the traffic.
RSVP Scalability
One drawback of RSVP is that the amount of state information required increases with the number of per-flow reservations. As many hundreds of thousands of real-time unicast and multicast flows can exist in the Internet backbone at any time, state information on a per-flow granularity is considered a nonscalable solution for Internet backbones. RSVP with per-flow reservations scales well for medium-size corporate intranets with link speeds of DS3 or less. For large intranets and for Internet service provider (ISP) backbones, you can make RSVP scale well when you use it with large multicast groups, large static classes, or an aggregation of flows at the edges rather than per-flow reservations. RSVP reservation aggregation[6] proposes to aggregate several end-to-end reservations sharing common ingress and egress routers into one large, end-to-end reservation. Another approach is to use RSVP at the edges and diffserv in the network backbone to address RSVP scalability issues in the core of a large network. RSVP-to-diffserv mapping is discussed in Chapter 2, "Differentiated Services Architecture." The service provider networks and the Internet of the future are assumed to have, for the most part, sufficient capacity to carry normal telephony traffic. If a network is engineered with sufficient capacity, you can provision all telephony traffic as a single class. Depending on available network capacity, telephony traffic can require relatively modest capacity, which is given some fraction of the capacity overall, without the need for resource allocation per individual call.
Case Study 7-1: Reserving End-to-End Bandwidth for an Application Using RSVP
The sender and receiver applications need to signal over the network using RSVP so that they can reserve end-to-end network bandwidth for digital audio playback. The sender and receiver application traffic is made up of UDP packets with destination port number 1040. The network setup is shown in Figure 7-4 (note that the IP addresses in this figure show just the last octet of the 210.210.210.x network address). Assume that the sender and receiver applications are not yet RSVP-compliant and rely on the end routers for RSVP signaling. Figure 7-4 RSVP Signaling and Reservations for Traffic Flows Between Two End Hosts
Because the applications are not RSVP-compliant, the end routers should be set up to behave as though they are receiving periodic signaling information from the sender and the receiver. Another RSVP reservation should be configured for Internet Control Message Protocol (ICMP) packets so that pings between the sender and the receiver can be used to troubleshoot end-to-end network connectivity issues.
Note The use of RSVP for ICMP messages in this case study is meant primarily to illustrate RSVP operation, though the practical need might be limited.
To enable end-to-end RSVP signaling over the network, the inbound and outbound interfaces of the routers along the path from the sender to the receiver need to be configured for WFQ and RSVP. Note that WFQ is on by default on interfaces with bandwidths of 2 Mbps or less. The show interface command shows the queuing strategy being used by an interface. In this case study, RSVP reservations are made for the following traffic:
- UDP port 1040 application traffic from Host A to Host B.
- ICMP packets from Host A to Host B and vice versa, to verify and diagnose connectivity issues at times of network congestion. This enables ping (ICMP Echo Request) packets from one end host to the other, and the ICMP Echo Reply packets sent in response, to work at all times.
Listing 7-1 shows a sample configuration for Router R1 for the RSVP setup.

Listing 7-1 RSVP-Related Configuration on Router R1

interface Ethernet0
 ip address 210.210.210.1 255.255.255.224
 fair-queue 64 256 234
 ip rsvp bandwidth 7500 7500
!
interface Serial0
 ip address 210.210.210.101 255.255.255.252
 fair-queue 64 256 36
 ip rsvp bandwidth 1158 1158
 ip rsvp sender 210.210.210.60 210.210.210.30 1 0 0 210.210.210.30 Et0 1 1
 ip rsvp sender 210.210.210.60 210.210.210.30 UDP 1040 0 210.210.210.30 Et0 32 32
 ip rsvp reservation 210.210.210.60 210.210.210.30 1 0 0 210.210.210.30 Et0 ff 1 1

This configuration enables WFQ and RSVP on the router interfaces. The fair-queue command sets the queue's congestive discard threshold and the maximum number of dynamic and reservable conversation queues. The ip rsvp bandwidth command sets the maximum amount of reservable bandwidth on the interface and the maximum reservable bandwidth allowed for any single reservation. The ip rsvp sender command sets up the router as if it were receiving periodic RSVP PATH messages from the sender, and enables it to forward RSVP PATH messages downstream. Note that this command is needed only when the sender cannot send RSVP PATH messages itself. Two ip rsvp sender commands are used in the configuration; they simulate receipt of periodic RSVP PATH messages from sender 210.210.210.30 for ICMP and UDP port 1040 traffic, respectively. Router R1 forwards the received RSVP PATH messages further downstream toward the destination 210.210.210.60. The ip rsvp reservation command sets up the router as if it were receiving periodic RSVP RESV messages. The command in the configuration simulates receipt of periodic RSVP RESV messages from 210.210.210.30 for the ICMP traffic. This enables an RSVP reservation for ping (ICMP) packets from Host B to Host A. The RSVP RESV messages are forwarded upstream toward the sender, 210.210.210.60.
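The two arguments of the ip rsvp bandwidth command (total reservable bandwidth and the largest single reservation) imply a simple admission-control check on each RESV request. The sketch below models that check in Python; the class and method names are hypothetical illustrations, not Cisco code.

```python
class RsvpInterface:
    """Tracks reservable bandwidth on one interface, mirroring the two
    arguments of 'ip rsvp bandwidth <total> <per-flow>' (in kbps)."""

    def __init__(self, total_kbps, per_flow_kbps):
        self.total_kbps = total_kbps
        self.per_flow_kbps = per_flow_kbps
        self.allocated_kbps = 0

    def admit(self, request_kbps):
        """Admit a RESV request only if it fits both the per-flow cap and
        the remaining pool; otherwise reject it (the flow then receives
        only best-effort service)."""
        if request_kbps > self.per_flow_kbps:
            return False
        if self.allocated_kbps + request_kbps > self.total_kbps:
            return False
        self.allocated_kbps += request_kbps
        return True

# Serial0 in Listing 7-1: 1158 kbps reservable, up to 1158 kbps per flow.
se0 = RsvpInterface(1158, 1158)
print(se0.admit(32))    # UDP port 1040 reservation (32 kbps) -> True
print(se0.admit(1))     # ICMP reservation (1 kbps) -> True
print(se0.admit(2000))  # exceeds the per-flow cap -> False
```

The 33 kbps left allocated after these calls matches the Se0 "allocate" column in Listing 7-4.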
Listings 7-2 and 7-3 show the RSVP-related configuration for Routers R2 and R3, respectively.
Listing 7-2 RSVP-Related Configuration on Router R2

interface Serial0
 ip address 210.210.210.102 255.255.255.252
 fair-queue 64 256 36
 ip rsvp bandwidth 1158 1158
!
interface Serial1
 ip address 210.210.210.105 255.255.255.252
 fair-queue 64 256 36
 ip rsvp bandwidth 1158 1158

Note that this router doesn't need any RSVP sender or reservation statements, because they are needed only on the edge routers. The edge routers dynamically propagate the RSVP messages, which allows Router R2 to make reservations for the RSVP flows.

Listing 7-3 RSVP-Related Configuration on Router R3

interface Ethernet0
 ip address 210.210.210.33 255.255.255.224
 fair-queue 64 256 234
 ip rsvp bandwidth 7500 7500
!
interface Serial1
 ip address 210.210.210.106 255.255.255.252
 fair-queue 64 256 36
 ip rsvp bandwidth 1158 1158
 ip rsvp sender 210.210.210.30 210.210.210.60 1 0 0 210.210.210.60 Et0 1 1
 ip rsvp reservation 210.210.210.60 210.210.210.30 1 0 0 210.210.210.60 Et0 FF LOAD 1 1
 ip rsvp reservation 210.210.210.60 210.210.210.30 UDP 1040 0 210.210.210.60 Et0 FF LOAD 32 32

Listings 7-4 through 7-10 provide output of the different RSVP-related information from the router.

Listing 7-4 Interface-Related RSVP Information of Router R1

Router#show ip rsvp interface
interfac  allocate  i/f max  flow max  per/255  UDP  IP  UDP_IP  UDP M/C
Et0       1K        7500K    7500K     0 /255   0    1   0       0
Se0       33K       1158K    1158K     7 /255   0    1   0       0
The show ip rsvp interface command shows the total allocated bandwidth on an interface. By default, the maximum reservable bandwidth on an interface is 0.75 times the total interface bandwidth.

Listing 7-5 RSVP Sender Information on Router R1

Router#show ip rsvp sender
To              From            Prev Hop         I/F  BPS  Bytes
210.210.210.30  210.210.210.60  210.210.210.102  Se0  1K   1K
210.210.210.60  210.210.210.30  210.210.210.30   Et0  1K   1K
210.210.210.60  210.210.210.30  210.210.210.30   Et0  32K  32K
The show ip rsvp sender command shows that Router R1 saw three different RSVP PATH messages.

Listing 7-6 Received RSVP Reservation Requests by Router R1

Router#show ip rsvp reservation
To              From            Pro  DPort  Sport  Next Hop         I/F  Fi  Serv  BPS  Bytes
210.210.210.30  210.210.210.60  1    0      0      210.210.210.30   Et0  FF  LOAD  1K   1K
210.210.210.60  210.210.210.30  1    0      0      210.210.210.102  Se0  FF  LOAD  1K   1K
210.210.210.60  210.210.210.30  UDP  1040   0      210.210.210.102  Se0  FF  LOAD  32K  32K

The show ip rsvp reservation command shows that Router R1 received three RSVP RESV messages. As an example, for the UDP flow, Router R1 received the RSVP RESV message from Router R2 with a next hop of 210.210.210.102.

Listing 7-7 Installed RSVP Reservations on Router R1

Router#show ip rsvp installed
RSVP: Ethernet0
BPS  To              From            Protoc  DPort  Sport  Weight  Conv
1K   210.210.210.30  210.210.210.60  1       0      0      4       264
RSVP: Serial0
BPS  To              From            Protoc  DPort  Sport  Weight  Conv
1K   210.210.210.60  210.210.210.30  1       0      0      128     264
32K  210.210.210.60  210.210.210.30  UDP     1040   0      4       265
The show ip rsvp installed command shows three active RSVP reservations in Router R1. It also shows the conversation number and weight assigned by WFQ to each reservation. Note that the weight for the largest RSVP reservation is always 4. The weight for the reservation of ICMP packets on the serial0 interface is derived as 4 x (largest RSVP reservation) / (this RSVP reservation) = 4 x 32 / 1 = 128.

Listing 7-8 RSVP Reservations Sent Upstream by Router R1

Router#show ip rsvp request
To              From            Pro  DPort  Sport  Next Hop         I/F  Fi  Serv  BPS  Bytes
210.210.210.30  210.210.210.60  1    0      0      210.210.210.102  Se0  FF  LOAD  1K   1K
210.210.210.60  210.210.210.30  1    0      0      210.210.210.30   Et0  FF  LOAD  1K   1K
210.210.210.60  210.210.210.30  UDP  1040   0      210.210.210.30   Et0  FF  LOAD  32K  32K

The show ip rsvp request command shows that Router R1 passed on three RSVP RESV messages. As an example, for the UDP flow, Router R1 forwarded the RSVP RESV message toward the sender host with a next hop of 210.210.210.30.
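The weight derivation for RSVP reserved flows can be expressed directly. A small Python sketch of the arithmetic, using the figures from this case study (the function name is illustrative, not IOS code):

```python
def rsvp_wfq_weight(reservation_kbps, largest_reservation_kbps):
    """Weight assigned by WFQ to an RSVP reserved conversation, per the
    relationship in the text: the largest reservation on the interface
    always gets weight 4, and smaller reservations scale up inversely
    with their reserved rate."""
    return int(4 * largest_reservation_kbps / reservation_kbps)

# On serial0, the largest reservation is the 32-kbps UDP flow:
print(rsvp_wfq_weight(32, 32))  # UDP port 1040 flow -> 4
print(rsvp_wfq_weight(1, 32))   # 1-kbps ICMP flow -> 4 * 32 / 1 = 128
```

Smaller weights win the scheduler more often, which is how the 32-kbps flow gets proportionally more service than the 1-kbps flow.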
Listing 7-9 RSVP Neighbors of Router R1

Router#show ip rsvp neighbor
Interfac  Neighbor         Encapsulation
Et0       210.210.210.30   RSVP
Se0       210.210.210.102  RSVP

The show ip rsvp neighbor command shows the neighbor routers or hosts from which Router R1 received RSVP messages.

Listing 7-10 Serial0 Queue Information of Router R1

Router#show queue serial0
Input queue: 0/75/1071 (size/max/drops); Total output drops: 107516
Queueing strategy: weighted fair
Output queue: 41/1000/64/107516 (size/max total/threshold/drops)
Conversations 1/4/256 (active/max active/max total)
Reserved Conversations 1/1 (allocated/max allocated)

(depth/weight/discards/tail drops/interleaves) 1/4096/100054/0/0
Conversation 265, linktype: ip, length: 50
source: 210.210.210.30, destination: 210.210.210.60, id: 0x033D, ttl: 254, TOS: 0
prot: 17, source port 38427, destination port 1040

(depth/weight/discards/tail drops/interleaves) 40/4096/1131/0/0
Conversation 71, linktype: ip, length: 104
source: 210.210.210.30, destination: 210.210.210.60, id: 0x0023, ttl: 254, prot: 1

(depth/weight/discards/tail drops/interleaves) 1/128/65046/0/0
Conversation 264, linktype: ip, length: 104
source: 210.210.210.30, destination: 210.210.210.60, id: 0x0023, ttl: 254, prot: 1

The show queue serial0 command shows a snapshot of the packets in the output queue of the serial0 interface at the time the command was issued. Note that the conversation numbers of the RSVP flows in the queue match those shown in the show ip rsvp installed command output.
The req-qos guaranteed-delay command sets up guaranteed delay as the desired (requested) QoS for a dial peer. The other QoS requests can be either controlled load or best effort. By default, a voice call gets only best-effort service. Note The retransmit time for RSVP messages can be too long for VoIP. Because lost RSVP PATH messages delay the establishment of an RSVP reservation along the VoIP call's path, resulting in poor performance, it is important to keep the refresh interval for RSVP PATH messages small, or to use a fast retransmit scheme.
Summary
In intserv, RSVP is used to signal QoS information using control messages that are different from the actual data packets. RSVP signaling results in certain resource guarantees along the traffic's routed path. Like diffserv, intserv using RSVP depends on the packet scheduler with QoS support (such as WFQ and WRED) in the router to offer the desired QoS for the RSVP reserved flows.
Q: How are the weight and conversation numbers of an RSVP reservation assigned?

A: The weight and conversation numbers are assigned by the WFQ algorithm. Make sure WFQ is enabled on the interface that has the reservation installed. WFQ is discussed in detail in Chapter 4.

Q: The show ip rsvp installed command shows the conversation number 256 and weight 4 of an RSVP flow reservation on serial0. However, the show queue s0 command output shows a weight of 4096 for conversation 256. How can you explain this discrepancy in weights given by these two commands?

A: It is likely that the RSVP flow is sending more than the reserved bandwidth. If the flow is sending traffic at a rate that goes over the bandwidth reservation, the excess traffic is treated as best effort and is assigned a weight of 4096 in WFQ. Therefore, the packet with a weight of 4096 might be a nonconforming packet. Chapter 4 describes the WFQ algorithm in detail.

Q: The router cannot install the RSVP reservation for the flow I just configured on the router. The debug ip rsvp command shows the following message: RSVP RESV: no path information for 207.2.23.1. What does this debug message mean?

A: Generally, you see this message on a router that has a static RSVP reservation configured for a flow so that the router behaves as if it is receiving periodic RESV messages for that flow. The debug message indicates that the router didn't receive any PATH message for the corresponding flow. RSVP cannot send a RESV message without first receiving a corresponding PATH message for the flow. Troubleshoot along the path leading to the sender to find out why the router didn't receive any RSVP PATH message.

Q: Are RSVP messages Cisco Express Forwarding (CEF)-switched?

A: No. RSVP messages are control packets that need to be processed by the router. Hence, they are process-switched. Note, however, that the data packets belonging to an RSVP flow follow whatever switching path is configured in the router. Packet switching is discussed in Appendix B, "Packet Switching Mechanisms."
References
1. "Resource ReSerVation Protocol (RSVP), Version 1 Functional Specification," R. Braden and others, RFC 2205, 1997.
2. "Resource ReSerVation Protocol (RSVP), Version 1 Applicability Statement: Some Guidelines on Deployment," A. Mankin and others, RFC 2208, 1997.
3. RSVP home page: http://www.isi.edu/rsvp
4. "Specification of the Controlled-Load Network Element Service," J. Wroclawski, RFC 2211, 1997.
5. "Specification of Guaranteed Quality of Service," S. Shenker, C. Partridge, and R. Guerin, RFC 2212, 1997.
6. "Aggregation of RSVP for IPv4 and IPv6 Reservations," F. Baker and others, IETF draft, 1999.
ATM
ATM[1] is a fixed-size cell-switching and multiplexing technology. It is connection-oriented, and a virtual circuit (VC) must be set up across the ATM network before any user data can be transferred between two or more ATM attached devices. Primarily, ATM has two types of connections, or VCs: permanent virtual circuits (PVCs) and switched virtual circuits (SVCs). PVCs are generally static and need a manual or external configuration to set them up. SVCs are dynamic and are created based on demand. Their setup requires a signaling protocol between the ATM endpoints and ATM switches. An ATM network is composed of ATM switches and ATM end nodes, or hosts. The cell header contains the information the ATM switches use to switch ATM cells. The data link layer is broken down into two sublayers: the ATM Adaptation Layer (AAL) and the ATM layer. You map the different services to the common ATM layer through the AAL. Higher layers pass down the user information in the form of bits to the AAL. User information gets encapsulated into an AAL frame, and then the ATM layer breaks the information down into ATM cells. The reverse is done at the receiver end. This process is known as segmentation and reassembly (SAR).
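The SAR arithmetic can be illustrated for AAL5, the common adaptation layer for data: the payload plus an 8-byte trailer is padded up to a multiple of 48 bytes and carried in 53-byte cells. A small Python sketch of that calculation (illustrative only):

```python
import math

ATM_CELL_PAYLOAD = 48   # bytes of payload carried per cell
ATM_CELL_SIZE = 53      # 5-byte cell header + 48-byte payload
AAL5_TRAILER = 8        # AAL5 CPCS trailer (length field, CRC-32, etc.)

def aal5_cells(pdu_bytes):
    """Number of ATM cells needed to carry one AAL5 frame: the PDU plus
    the 8-byte trailer is padded to a multiple of 48 bytes, then the
    result is segmented into 48-byte cell payloads."""
    return math.ceil((pdu_bytes + AAL5_TRAILER) / ATM_CELL_PAYLOAD)

def cell_tax(pdu_bytes):
    """Fraction of line bandwidth consumed by cell headers, the AAL5
    trailer, and padding (the so-called 'cell tax')."""
    wire_bytes = aal5_cells(pdu_bytes) * ATM_CELL_SIZE
    return (wire_bytes - pdu_bytes) / wire_bytes

print(aal5_cells(1500))          # a 1500-byte IP packet -> 32 cells
print(round(cell_tax(1500), 3))  # roughly 12% overhead
```

The overhead fraction matters when provisioning PCR/SCR for IP traffic, since the contracted cell rate must cover this expansion.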
Generic flow control (GFC) has local significance only and is used for flow control between an end station and the network. The GFC mechanism controls traffic flow from end stations into the network. The GFC field is not present in the NNI cell format. Two modes of operation are defined: uncontrolled access and controlled access. In uncontrolled access mode, traffic enters the network without GFC-based flow control. In controlled access mode, ATM-attached end nodes shape their transmission in accordance with the value present in the GFC field. Most UNI implementations don't use this field. A virtual path (VP) consists of a bundle of VCs and is assigned a virtual path identifier (VPI). ATM switches can switch VPIs, along with all the VCs within them. VCs are the paths over which user data is sent. Each VC within a VP is assigned a virtual channel identifier (VCI). ATM switches use the VPI and VCI fields to make switching decisions. Figure 8-2 shows a VC being set up between routers R1 and R2 through a network of ATM switches. All the cells leaving R1 for R2 are tagged with a VPI of 0 and a VCI of 64. The ATM switch S1 takes the VPI and VCI pair arriving on Port 0 and looks it up in its translation table. Based on the lookup, the switch sends the cells out of Port 1 with a VPI of 1 and a VCI of 100. Similarly, Switch S4 switches cells arriving on Port 4 with a VPI of 1 and a VCI of 100 onto Port 3 with a VPI of 2 and a VCI of 100, based on its translation table.
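The per-switch translation just described amounts to a table lookup keyed on the input port and the VPI/VCI pair. A toy Python model of switch S1's table from Figure 8-2 (the data structure is a hypothetical illustration, not switch software):

```python
# Hypothetical translation table for switch S1 in Figure 8-2: cells arriving
# on port 0 with VPI 0 / VCI 64 leave port 1 rewritten to VPI 1 / VCI 100.
translation = {
    (0, 0, 64): (1, 1, 100),   # (in port, VPI, VCI) -> (out port, VPI, VCI)
}

def switch_cell(in_port, vpi, vci):
    """Look up the (port, VPI, VCI) triple and return the output port and
    rewritten header fields; a cell with no matching table entry has no
    established VC and is discarded (modeled here as None)."""
    return translation.get((in_port, vpi, vci))

print(switch_cell(0, 0, 64))   # -> (1, 1, 100)
print(switch_cell(0, 0, 99))   # unknown VC -> None
```

Because VPI/VCI values are rewritten hop by hop, they have only local (per-link) significance, much like Frame Relay DLCIs.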
The remaining fields in the ATM header are as follows:
- Payload type identifier (PTI): This 3-bit field identifies the kind of payload carried in the cell. It differentiates between operation, administration, and maintenance (OAM) information and user data.
- Cell loss priority (CLP): CLP defines a cell's discard priority. A cell with CLP not set (CLP = 0) is considered higher priority than a cell with CLP set (CLP = 1). With CLP set, the cell has a higher chance of being discarded at times of network congestion.
- Header error control (HEC): HEC is used to detect and correct errors in the ATM header.
ATM QoS
ATM offers QoS guarantees by making the ATM end system explicitly specify a traffic contract describing its intended traffic flow characteristics. The flow descriptor carries QoS parameters, such as Peak Cell Rate (PCR), Sustainable Cell Rate (SCR), and burst size. ATM end systems are responsible for making sure the transmitted traffic meets the QoS contract. The ATM end system shapes traffic by buffering data and transmitting it within the contracted QoS parameters. The ATM switches police each user's traffic characteristics and compare them to the QoS contract. If certain traffic is
found to exceed the QoS contract, a switch can set the CLP bit on the nonconforming traffic. A cell with the CLP bit set has a higher drop probability at times of congestion.
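ATM policing at the switch is standardized as the Generic Cell Rate Algorithm (GCRA). Below is a minimal Python sketch of its virtual-scheduling form, tagging nonconforming cells with CLP = 1 rather than dropping them; the function and parameter names are illustrative:

```python
def gcra_police(arrivals, increment, limit):
    """Virtual-scheduling form of ATM's Generic Cell Rate Algorithm.
    'increment' is the expected inter-cell spacing (1/PCR in time units)
    and 'limit' is the tolerance (CDVT).  Conforming cells keep CLP = 0;
    cells arriving too early are tagged CLP = 1, making them preferred
    drop candidates at times of congestion."""
    tat = 0.0            # theoretical arrival time of the next cell
    clp_bits = []
    for t in arrivals:
        if t < tat - limit:        # cell arrived too early: out of contract
            clp_bits.append(1)     # tag instead of drop (CLP = 1)
        else:
            clp_bits.append(0)
            tat = max(t, tat) + increment
    return clp_bits

# Contract allows one cell per 10 time units with a tolerance of 2:
print(gcra_police([0, 10, 12, 30], increment=10, limit=2))  # -> [0, 0, 1, 0]
```

The cell at t = 12 arrives 8 time units early, beyond the tolerance of 2, so it is the one tagged.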
VP Shaping
Similar to the ATM services for an ATM VC, an ATM service, or shaping, can be applied to a VP to impose traffic contract restrictions on an entire VP. Within a shaped VP, all VCs can still be UBR, carrying best-effort traffic without strict traffic restrictions. Figure 8-3 shows a logical diagram of how VCs are bundled within a VP. Figure 8-3 Bundling of Multiple VCs in a VP
Note A VP ATM service can support all the ATM service classes that can be supported at the VC level.
The ABR service is configured with a PCR of 10 Mbps and an MCR of 1 Mbps. The atm abr rate-factor 8 8 command configures the factors by which the cell transmission rate increases and decreases in response to RM control information from the network. Note that the PCR and MCR default to the line rate and 0, respectively. The default rate factor is 16.
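The rate-factor behavior can be sketched as a closed-loop adjustment. The step rule below (each adjustment moves the allowed cell rate by PCR divided by the configured factor) is an assumption for illustration, not the documented IOS algorithm; the clamping between MCR and PCR follows the ABR service definition.

```python
def abr_adjust(current_rate, pcr, mcr, rate_factor, congested):
    """Adjust an ABR VC's allowed cell rate in response to resource-
    management (RM) cell feedback.  Assumption: each step is
    PCR // rate_factor; the rate is always clamped to [MCR, PCR]."""
    step = pcr // rate_factor
    rate = current_rate - step if congested else current_rate + step
    return max(mcr, min(pcr, rate))

# PCR 10 Mbps, MCR 1 Mbps, rate factor 8, as in the case study:
rate = 10_000_000
rate = abr_adjust(rate, 10_000_000, 1_000_000, 8, congested=True)
print(rate)  # -> 8750000 after one congestion indication
```

A smaller rate factor (8 instead of the default 16) makes the VC react in larger steps, backing off and recovering faster.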
Each ISP connected through the CLEC is assigned an interface on the Tier-1 ISP router, and all of its VCs fall under the same VP. In the sample configuration on the Tier-1 ISP router shown in Listing 8-2, the CLEC's ISP 1 is assigned interface ATM0/0/0 and VP 2. The VP traffic shaping is configured under the main ATM interface. All VCs use the same VP, and each is assigned its own subinterface under the main interface.

Listing 8-2 VP Traffic Shaping Configuration on the ATM0/0/0 Interface of the Tier-1 ISP Router

interface ATM0/0/0
 atm pvp 2 6000000
!
interface ATM0/0/0.1 point-to-point
 ip address 212.12.12.1 255.255.255.252
 atm pvc 101 2 101 aal5snap
!
interface ATM0/0/0.2 point-to-point
 ip address 212.12.12.4 255.255.255.252
 atm pvc 102 2 102 aal5snap
!
interface ATM0/0/0.3 point-to-point
 ip address 212.12.12.8 255.255.255.252
 atm pvc 103 2 103 aal5snap

As shown in Listing 8-2, the atm pvp configuration command sets up VP traffic shaping of all the VCs of the specified VP to the stipulated peak rate. To examine the VP groups' configuration, you can use the show atm vp command, which shows the VP traffic shaping parameters of all VP groups in the router, as well as the number of VCs in each VP group. The user can create up to 256 different VP groups. Listing 8-3 shows the ATM VP traffic shaping parameters.

Listing 8-3 ATM VP Traffic Shaping Parameters

Router#show atm vp
Interface  VPI  Data VCs  CES VCs  Status
ATM0/0/0   2    3         0        ACTIVE
You can obtain detailed information on VP 2 by using the show atm vp 2 command. This command shows information about all the VCs with a VPI of 2 in the router. For each configured VP group, the router automatically creates two management virtual channels: one for the segment OAM F4 flow cells and one for the end-to-end OAM F4 flow cells. Listing 8-4 shows detailed information on the VCs with VPI 2.
Listing 8-4 Information on VP Traffic Shaping Parameters for VPI 2

Router#show atm vp 2
ATM0/0/0 VPI: 2, PeakRate: 6000000, CesRate: 0, DataVCs: 3, CesVCs: 0, Status: ACTIVE
  VCD  VCI  Type  InPkts  OutPkts  AAL/Encap  Status
  1    3    PVC   0       0        F4 OAM     ACTIVE
  2    4    PVC   0       0        F4 OAM     ACTIVE
  101  101  PVC   0       0        AAL5-SNAP  ACTIVE
  102  102  PVC   0       0        AAL5-SNAP  ACTIVE
  103  103  PVC   0       0        AAL5-SNAP  ACTIVE
TotalInPkts: 0, TotalOutPkts: 0, TotalInFast: 0, TotalOutFast: 0, TotalBroadcasts: 0

The show atm vc command shows information on all the VCs in the router without regard to the VPI information. Listing 8-5 displays information on all ATM VCs enabled on the router.

Listing 8-5 ATM VC Parameters

VCD / Name  VPI  VCI  Peak Kbps  Avg/Min Kbps  Burst Cells
1           2    3
2           2    4
101         2    101
102         2    102
103         2    103
Note that from the IP QoS perspective, the IP traffic is transported across the ATM network without any loss (because any drops in the ATM network are IP QoS-unaware) by using an ATM service that suits the traffic's IP service needs. For IP QoS, each VC in an ATM network maintains a separate queue. You can apply the WRED and WFQ IP QoS functions on each VC queue. Two scenarios are discussed in this section as a way to preserve IP QoS over an ATM network:

1. A single PVC carrying all the IP traffic to its destination. IP traffic exceeding the ATM PVC parameters and service at the ingress to the ATM network gets queued, and IP QoS techniques such as WRED and WFQ are applied on the queue as it builds up under congestion. WRED ensures that high-precedence traffic sees low loss relative to lower-precedence traffic. WFQ ensures that high-precedence traffic gets higher bandwidth relative to lower-precedence traffic, because it schedules high-precedence traffic more often. Note that when Class-Based Weighted Fair Queuing (CBWFQ) is run on a PVC, you can make bandwidth allocations based on traffic class. CBWFQ is discussed in Chapter 4.

2. A VC bundle (made up of multiple PVCs) carrying IP traffic to its destination. When carrying traffic with different QoS needs (real-time, non-real-time, best-effort) to the same destination, it is a good idea to provision multiple PVCs across the ATM network so that each IP traffic class is carried by a separate PVC. Each PVC is provisioned to an ATM service class based on the IP traffic it is mapped to carry. Figure 8-6 depicts IP-ATM QoS using a VC bundle. Some VC bundle characteristics are as follows:
- Each VC in the bundle is mapped to carry traffic with certain IP precedence value(s). You can map a VC to one or more IP precedence values. Note, however, that only one routing peering or adjacency is made per PVC bundle.
- You can monitor VC integrity by using ATM OAM or Interim Local Management Interface (ILMI). If a bundle's high-precedence VC fails, you can either "bump" its traffic to a lower-precedence VC in the bundle, or the entire bundle can be declared down.
- A separate queue exists for each VC in the bundle. You can apply the WRED and WFQ IP QoS techniques on each VC queue.
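The precedence-to-VC mapping and the "bumping" behavior on VC failure can be modeled in a few lines of Python. The bundle member names and the precedence assignments below are hypothetical illustrations:

```python
# Hypothetical bundle: each member VC carries a set of IP precedence values.
bundle = {
    "control": {"precedences": {6, 7}, "up": True},
    "gold":    {"precedences": {4, 5}, "up": True},
    "silver":  {"precedences": {2, 3}, "up": True},
    "bronze":  {"precedences": {0, 1}, "up": True},
}
ORDER = ["control", "gold", "silver", "bronze"]  # highest to lowest class

def select_vc(precedence):
    """Pick the member VC for a packet's IP precedence; if that VC is
    down, 'bump' the traffic to the next lower-precedence VC that is
    still up.  Returns None when no usable VC remains (the bundle would
    then be declared down)."""
    start = next(i for i, name in enumerate(ORDER)
                 if precedence in bundle[name]["precedences"])
    for name in ORDER[start:]:
        if bundle[name]["up"]:
            return name
    return None

print(select_vc(5))            # precedence 5 -> gold
bundle["gold"]["up"] = False
print(select_vc(5))            # gold VC failed -> bumped to silver
```

The alternative policy, declaring the whole bundle down on a member failure, would correspond to returning None as soon as the mapped VC is not up.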
ATM service for the IP traffic is expected to be above UBR (best-effort) class so that no packets (cells) are dropped as part of ATM QoS within the ATM network. Figure 8-7 illustrates the two IP-ATM QoS scenarios. Figure 8-7 IP-ATM QoS Interworking
The difference between using a single PVC or a PVC bundle for IP QoS interworking depends on the cost and traffic needs in the network. Although a PVC bundle can be more expensive than a single PVC, a PVC bundle provides traffic isolation for critical traffic classes such as voice. On the other hand, a PVC bundle requires prior traffic engineering so that all the PVCs in the bundle are utilized optimally. Otherwise, you can run into conditions in which the PVC in the bundle carrying the high-precedence traffic gets congested while the PVC carrying the lower-precedence traffic is running relatively uncongested. Note that you cannot automatically bump high-precedence traffic to a different member PVC when the PVC carrying high precedence gets congested. When using a single PVC, you can enable CBWFQ with a priority queue on it so that voice traffic is prioritized over the rest of the traffic carried by the PVC. CBWFQ with a priority queue is discussed in Chapter 4.
The ATM PVC is engineered to be lossless, because drops in the ATM network are made without regard to the IP precedence value in the packet. The ATM network simply functions as a lossless physical transport mechanism offering a service required by the underlying traffic. To begin with, the different traffic classes and the policies for each class are defined. The defined policy is then applied to the ingress interface to the ATM network. Listing 8-6 is a sample configuration for this functionality.

Listing 8-6 Configuration for IP-ATM QoS Interworking

interface ATM0/0/0.1 point-to-point
 ip address 200.200.200.1 255.255.255.252
 pvc 0/101
  abr 10000 1000
  encapsulation aal5nlpid
  service-policy output atmpolicy
!
class-map control
 match precedence 7
 match precedence 6
class-map gold
 match precedence 5
 match precedence 4
class-map silver
 match precedence 3
 match precedence 2
class-map bronze
 match precedence 1
 match precedence 0
!
policy-map atmpolicy
 class control
  bandwidth 10000
  random-detect
 class gold
  bandwidth 40000
  random-detect
 class silver
  bandwidth 30000
  random-detect
 class bronze
  bandwidth 20000
  random-detect
Note that the configuration uses the modular QoS command-line interface (CLI), which is discussed in Appendix A, "Cisco Modular QoS Command-Line Interface." The configuration shown in Listing 8-6 defines three class maps (gold, silver, and bronze) to classify data traffic based on precedence. A fourth class map, control, is defined to match all network control traffic, which can carry an IP precedence of 6 or 7. After the initial traffic classification step, the atmpolicy policy map is defined to stipulate a bandwidth and WRED policy for the four traffic classes. The gold class, for example, is allocated a bandwidth of 40 Mbps and a WRED drop policy. Finally, the atmpolicy policy is applied to the ATM subinterface carrying the ATM PVC to the destination across the ATM cloud by using the service-policy configuration command. The show policy-map command shows the policies for each class of a policy map. The atmpolicy policy map information is shown in Listing 8-7.

Listing 8-7 The atmpolicy Policy Map Information

Router#show policy-map atmpolicy
 Policy Map atmpolicy
  Weighted Fair Queueing
   Class control
    Bandwidth 10000 (kbps)
    exponential weight 9
    class  min-threshold  max-threshold  mark-probability
    -----------------------------------------------------
    6      128            512            1/10
    7      256            512            1/10
    (all other classes and rsvp)         1/10
   Class gold
    Bandwidth 40000 (kbps)
    exponential weight 9
    class  min-threshold  max-threshold  mark-probability
    -----------------------------------------------------
    4      128            512            1/10
    5      256            512            1/10
    (all other classes and rsvp)         1/10
   Class silver
    Bandwidth 30000 (kbps)
    exponential weight 9
    class  min-threshold  max-threshold  mark-probability
    -----------------------------------------------------
    2      128            512            1/10
    3      256            512            1/10
    (all other classes and rsvp)         1/10
   Class bronze
    Bandwidth 20000 (kbps)
    exponential weight 9
    class  min-threshold  max-threshold  mark-probability
    -----------------------------------------------------
    0      128            512            1/10
    1      256            512            1/10
    (all other classes and rsvp)         1/10
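The WRED behavior these listings summarize (no drops below the minimum threshold, a linear ramp up to the mark probability at the maximum threshold, tail-drop-like behavior beyond it) can be sketched in Python. The function is illustrative, not IOS code:

```python
def wred_drop_probability(avg_qlen, min_th, max_th, mark_prob_denom=10):
    """WRED drop probability for one precedence class: 0 below the
    minimum threshold, a linear ramp reaching 1/mark_prob_denom at the
    maximum threshold, and certain drop above it."""
    if avg_qlen < min_th:
        return 0.0
    if avg_qlen >= max_th:
        return 1.0
    ramp = (avg_qlen - min_th) / (max_th - min_th)
    return ramp / mark_prob_denom

# Thresholds shown for the gold class, precedence 4: min 128, max 512, 1/10
print(wred_drop_probability(100, 128, 512))   # below the min threshold -> 0.0
print(wred_drop_probability(320, 128, 512))   # halfway up the ramp -> 0.05
```

Giving higher-precedence traffic a larger minimum threshold (256 versus 128 here) is what delays its drops relative to lower-precedence traffic as the average queue grows.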
traffic it is mapped to carry. A silver PVC bundle member, for example, carries traffic requiring better service than a bronze PVC, which carries best-effort traffic. The PVC bundle member control is defined as a protect VC because the control PVC is crucial for the entire PVC bundle to operate properly. Hence, if the control PVC goes down, the entire PVC bundle is brought down. The bundle members gold, silver, and bronze form a protect group. All the member VCs in the protect group need to go down before the bundle is brought down. If a protect group's member VC goes down, its traffic is bumped to a lower-precedence VC.
Frame Relay
Frame Relay is a popular wide-area network (WAN) packet technology well suited for data traffic. It is a simple protocol that avoids link-layer flow control and error correction functions within the Frame Relay network. These functions are left for the applications in the end stations. The protocol is best suited for data traffic, because it can carry occasional bursts. Frame Relay operates using VCs. A VC offers a logical connection between two end points in a Frame Relay network. A network can use a Frame Relay VC as a replacement for a private leased line. You can use PVCs and SVCs in Frame Relay networks. A PVC is set up by a network operator through a network management station, whereas an SVC is set up dynamically on a call-by-call basis. PVCs are most commonly used, and SVC support is relatively new. A user frame is placed in a Frame Relay header to be sent on a Frame Relay network. The Frame Relay header is shown in Figure 8-8.
The 10-bit data-link connection identifier (DLCI) is the Frame Relay VC number corresponding to a logical connection to the Frame Relay network. A DLCI has only local significance. A Frame Relay switch maintains the VC mapping tables to switch a frame to its destination DLCI. The Address Extension (AE) bit indicates a 3- or 4-byte header. It is not supported under the present Cisco Frame Relay implementation. The Command and Response (C/R) bit is not used by the Frame Relay protocol and is transmitted unchanged end to end. The Frame Check Sequence (FCS) is used to verify the frame's integrity by the switches in the Frame Relay network and the destination station. A frame that fails the FCS test is dropped. A Frame Relay network doesn't attempt to perform any error correction. It is up to the higher-layer protocol at the end stations to retransmit the frame after discovering the frame might have been lost.
The DE bit is set by the router or other DTE device to indicate that the marked frame is of lesser importance relative to other frames being transmitted. It provides a basic prioritization mechanism in Frame Relay networks. Frames with the DE bit set are discarded first, before the frames without the DE bit flagged.
Frame Relay traffic shaping allows occasional bursts over the CIR on a PVC, although the rate throttles back to the CIR at times of congestion. A PVC can also be configured for a fixed data rate equal to the CIR (BE = 0).

Adaptive FRTS

At every time interval Tc, a process checks whether any BECN was received from the Frame Relay network. If a BECN was received in the last Tc interval, the transmission rate is dropped by 25 percent of the current rate. It can continue to drop until the lower limit is reached, which is the minimum CIR (MINCIR). No matter how congested the Frame Relay network is, the router does not drop its transmission rate below MINCIR. The default value for MINCIR is half of the CIR. After the traffic rate has adapted to the congestion in the Frame Relay network, it takes 16 Tc intervals without a BECN before the traffic rate starts increasing back toward the configured CIR.

FECN/BECN Integration

A UDP- or IP-based application can produce a unidirectional traffic flow, with no data flowing in the opposite direction, because neither IP nor UDP uses an acknowledgment scheme. If you enable traffic shaping on such UDP data, the only congestion notification set by the network is the FECN bit in the frames arriving at the destination router. The router sourcing all this traffic never sees any BECNs, because there is no return traffic. FECN/BECN integration adds the ability to send a Q.922 test frame with the BECN bit set in response to a frame that has the FECN bit set. Note that the DE bit is cleared in a BECN frame sent in response to a received FECN. The source router receives the test frame, which is discarded, but its BECN bit is used to throttle the data flow. For this interaction to work, traffic shaping commands need to be present on both the ingress and egress ports.
Note In case of unidirectional traffic, the traffic-shape adaptive command is needed on the router sourcing the traffic, and the traffic-shape fecn-adapt command is used to send a Q.922 test frame[4] with the BECN bit set in response to an FECN.
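The adaptive throttling rules above (a 25 percent cut per Tc interval with a BECN, a MINCIR floor, and 16 BECN-free intervals before ramping back up) can be simulated in a few lines of Python. The ramp-up step size of CIR/16 used here is an assumption for illustration; IOS does not document the step this way.

```python
def adaptive_frts(cir, mincir, becn_per_tc):
    """Simulate the adaptive shaping rate across Tc intervals.  A BECN in
    an interval cuts the rate by 25% (never below MINCIR); after 16
    consecutive BECN-free intervals the rate steps back toward CIR in
    assumed increments of CIR/16."""
    rate, clean = cir, 0
    history = []
    for becn_seen in becn_per_tc:
        if becn_seen:
            rate = max(mincir, rate * 0.75)
            clean = 0
        else:
            clean += 1
            if clean >= 16:
                rate = min(cir, rate + cir / 16)
        history.append(round(rate))
    return history

# 64-kbps CIR, with MINCIR defaulting to half the CIR:
rates = adaptive_frts(64000, 32000, [True, True, True] + [False] * 17)
print(rates[:4])  # -> [48000, 36000, 32000, 32000]
```

Three consecutive BECNs drive the rate to the MINCIR floor; only after 16 clean intervals does it begin stepping back up.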
Enabling FRTS is a prerequisite to turning on FRF.12 fragmentation. For voice, FRTS uses a flow-based Weighted Fair Queuing (WFQ) with a priority queue (PQ) on the shaping queue. Each Frame Relay PVC has its own PQ-WFQ structure that is used to schedule packets based on their IP precedence values as per the flow-based WFQ algorithm discussed in Chapter 4. FRTS uses dual FIFO queues at the interface level: the first queue for priority packets such as voice, Local Management Interface (LMI), and related high-priority packets, and the second queue for the rest of the traffic. Note that FRTS disallows any queuing other than dual FIFO at the interface level. In Figure 8-12, three different flows, two data flows and one voice flow, arrive for transmission on an interface. The flows are routed to their respective PVC structures based on the flow header information. In this case, flows 1 and 2 belong to PVC 1, and flow 3 belongs to PVC 2. Each PVC has its own shaping queue. In this case, PQ-WFQ is enabled on PVC 1 such that all voice flow packets are scheduled first, and all voice packets go to the priority interface FIFO queue. Data packets scheduled by the shaping queue are fragmented based on the fragment threshold value and are put in the normal FIFO interface queue. It is assumed that the fragment threshold is set such that none of the voice packets need to be fragmented. Though a separate shaping queue exists for each PVC, all the PVCs on the interface share the dual FIFO interface queues. Packets are transmitted from the priority FIFO queue with a strict priority over the packets in the normal FIFO interface queue. Figure 8-12 A Conceptual View of FRF.12 Operation with Multiple PVCs
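The interplay between fragmentation and the dual FIFO interface queues can be modeled as follows. This is a Python sketch, not IOS code; the fragment threshold of 80 bytes is an arbitrary example value.

```python
from collections import deque

FRAGMENT_SIZE = 80  # assumed FRF.12 fragment threshold in bytes

priority_q = deque()  # voice, LMI, and other high-priority frames
normal_q = deque()    # everything else, fragmented if needed

def enqueue(frame_len, voice=False):
    """Voice goes untouched to the priority FIFO; data frames larger
    than the fragment threshold are split before entering the normal FIFO."""
    if voice:
        priority_q.append(frame_len)
    else:
        while frame_len > FRAGMENT_SIZE:
            normal_q.append(FRAGMENT_SIZE)
            frame_len -= FRAGMENT_SIZE
        normal_q.append(frame_len)

def dequeue():
    """Strict priority: the normal FIFO is served only when the
    priority FIFO is empty."""
    if priority_q:
        return ("priority", priority_q.popleft())
    if normal_q:
        return ("normal", normal_q.popleft())
    return None
```

Even if a 1024-byte data frame arrives first, a 64-byte voice frame that arrives later is transmitted ahead of it, and the data frame occupies the wire for at most one fragment time between voice packets.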
Case Study 8-6: Frame Relay Traffic Shaping with QoS Autosense
Say the network administrator of a nationwide network wants his Frame Relay-connected routers to shape traffic according to the contracted QoS (CIR, BC, BE) information. Instead of an explicit configuration, the administrator wants to use extended LMI so that the routers get their per-VC QoS information directly from the connected switch in the Frame Relay network. For the QoS autosense to work, extended LMI must be configured on both the router and its connected switch. The router dynamically gets the QoS information from the switch for each VC.
Based on the QoS (CIR, BC, BE) information received by the router, the router shapes the traffic on the VC. Listing 8-10 shows an example configuration needed in a router to enable FRTS with QoS autosense. Only the configuration on the router is shown.

Listing 8-10 An FRTS Configuration with QoS Autosense

interface Serial0/0
 encapsulation frame-relay
 frame-relay lmi-type ansi
 frame-relay traffic-shaping
 frame-relay qos-autosense
interface Serial0/0.1 point-to-point
 ip address 202.12.12.1 255.255.255.252
 frame-relay interface-dlci 17 IETF protocol ip 202.12.12.2

The frame-relay qos-autosense command enables extended LMI to dynamically learn QoS information for each VC on the interface. The frame-relay traffic-shaping command shapes the traffic on each VC based on the received QoS information. Listings 8-11 and 8-12 show the output of a few relevant show commands for this feature. As shown in Listing 8-11, the show frame-relay qos-autosense command shows the values received from the switch. As shown in Listing 8-12, the show traffic-shape command shows the QoS parameters and the traffic descriptor parameters.

Listing 8-11 Frame Relay QoS Autosense Information

Router#show frame-relay qos-autosense
ELMI information for interface Serial1
  Connected to switch:FRSM-4T1  Platform:AXIS  Vendor:cisco
  (Time elapsed since last update 00:00:03)
 DLCI = 17
  OUT: CIR 64000  BC 9600   BE 9600   FMIF 4497
  IN:  CIR 32000  BC 30000  BE 20000  FMIF 4497
  Priority 0
  (Time elapsed since last update 00:00:03)
Listing 8-12 Frame Relay Traffic Shaping Parameters

Router#show traffic-shape
I/F    Access  Target  Byte   Sustain   Excess    Interval
       list    Rate    Limit  bits/int  bits/int  (ms)
Se0/0          64000   2400   9600      9600      150
The Byte Limit value is the size of the bucket in bytes (size of bucket = BC + BE = 9600 + 9600 = 19200 bits = 2400 bytes). Interval is the TC interval (TC = BC/CIR = 9600/64000 = 150 ms).
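The same arithmetic can be checked programmatically. This is a Python sketch of the calculation only, not of the IOS token-bucket implementation.

```python
def shaping_params(cir_bps, bc_bits, be_bits):
    """Derive the token-bucket values shown by `show traffic-shape`
    from the contracted CIR, Bc, and Be."""
    byte_limit = (bc_bits + be_bits) // 8   # bucket size in bytes
    tc_ms = bc_bits * 1000 // cir_bps       # shaping interval Tc in ms
    return byte_limit, tc_ms
```

For the values learned in Listing 8-11 (CIR 64000, BC 9600, BE 9600), this yields the Byte Limit of 2400 and the 150-ms interval shown in Listing 8-12.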
Listing 8-13 E-Business Site FRTS Configuration on the Interface Connecting to the Service Provider

interface Serial 0/0/0
 encapsulation frame-relay
interface Serial0/0/0.1 point-to-point
 ip address 202.12.12.1 255.255.255.252
 frame-relay interface-dlci 17 IETF protocol ip 202.12.12.2
 frame-relay class adaptivefrts
map-class frame-relay adaptivefrts
 frame-relay traffic-shape rate 256000
 frame-relay traffic-shape adaptive 64000

Listing 8-14 E-Business Site DTS Configuration on the Interface Connecting to the Service Provider

interface Serial 0/0/0
 encapsulation frame-relay
class-map myclass
 match any
policy-map mypolicy
 class myclass
  shape peak 256000 8000 8000
  shape adaptive 64000
interface Serial0/0/0.1 point-to-point
 ip address 202.12.12.1 255.255.255.252
 frame-relay interface-dlci 17 IETF protocol ip 202.12.12.2
 service-policy output mypolicy

The router at the e-business site is set up to send at a CIR of 256 kbps and dynamically lowers its rate as it receives BECNs. The minimum rate under continued receipt of BECNs (congestion) is 64 kbps. Listings 8-15 and 8-16 show the FRTS and DTS configurations needed on the service provider router.

Listing 8-15 FRTS Configuration on the Service Provider Router Interface

interface Serial 0/0
 encapsulation frame-relay
 frame-relay traffic-shaping
interface Serial0/0.1 point-to-point
 ip address 202.12.12.2 255.255.255.252
 frame-relay interface-dlci 20 IETF protocol ip 202.12.12.1
 frame-relay class BECNforFECN
map-class frame-relay BECNforFECN
 frame-relay traffic peak 32000
 frame-relay bc out 8000
 frame-relay be out 8000

Listing 8-16 DTS Configuration on the Service Provider Router Interface

interface Serial 0/0
 encapsulation frame-relay
class-map myclass
 match any
policy-map mypolicy
 class myclass
  shape peak 32000 8000 8000
  shape fecn-adapt
interface Serial0/0.1 point-to-point
 ip address 202.12.12.2 255.255.255.252
 frame-relay interface-dlci 17 IETF protocol ip 202.12.12.1
 service-policy output mypolicy

Because the traffic from the e-business router to the remote router is UDP-based and unidirectional, the e-business site router does not receive any BECNs from the service provider router with which to shape its traffic down to the adaptive rate during congestion. Hence, the service provider router must reflect a frame with the BECN bit set in response to each frame carrying the FECN bit so that the e-business router can shape adaptively. The configuration sets the router to send at a peak rate of 32 kbps and to respond with a BECN to any received FECN. DTS uses the Modular QoS CLI, which is discussed in Appendix A.
Case Study 8-8: Using Multiple PVCs to a Destination Based on Traffic Type
Imagine that the network architect at an e-commerce site connecting to a service provider backbone wants to use multiple PVCs to carry traffic of different priority. The network currently carries IP traffic at four precedence levels: 0 to 3. The e-commerce site orders four PVCs to carry traffic at the four precedence levels. Each PVC is provisioned according to the precedence of the traffic it carries. Listing 8-17 is the sample configuration for this application.

Listing 8-17 Configuration to Route Traffic to a PVC Based on Traffic Type

interface Serial5/0
 encapsulation frame-relay
interface Serial 5/0.1 multipoint
 frame-relay priority-dlci-group 1 203 202 201 200
 frame-relay qos-autosense
access-list 100 permit ip any any precedence routine
access-list 101 permit ip any any precedence priority
access-list 102 permit ip any any precedence immediate
access-list 103 permit ip any any precedence flash
priority-list 1 protocol ip high list 103
priority-list 1 protocol ip medium list 102
priority-list 1 protocol ip normal list 101
priority-list 1 protocol ip low list 100
The four PVCs configured to the destination have 203, 202, 201, and 200 as their DLCI numbers. The frame-relay priority-dlci-group command configures these four DLCIs to carry traffic of high, medium, normal, and low priority, respectively. Note that the command doesn't enable priority queuing; it only assigns different DLCIs to carry different traffic classes. The command uses priority-list 1 to categorize traffic into the four classes that carry traffic with IP precedence 0 to 3. The high-priority traffic is carried on the high-priority DLCI, the medium-priority traffic on the medium-priority DLCI, and so on. Listings 8-18 and 8-19 display the output of two relevant show commands. The show frame-relay pvc command in Listing 8-18 shows all the known PVC/DLCI and priority DLCI group information on the router. This command also displays each DLCI's packet statistics. The show queueing priority command in Listing 8-19 displays the configured priority lists on the router.
Listing 8-18 show frame-relay pvc Command Output

Router#show frame-relay pvc

PVC Statistics for interface Serial5/0 (Frame Relay DTE)

           Active   Inactive   Deleted   Static
Local        4         0          0        0
Switched     0         0          0        0
Unused       0         0          0        0

DLCI = 200, DLCI USAGE = LOCAL, PVC STATUS = ACTIVE, INTERFACE = Se5/0
  input pkts 0          output pkts 0         in bytes 0
  out bytes 0           dropped pkts 0        in FECN pkts 0
  in BECN pkts 0        out FECN pkts 0       out BECN pkts 0
  in DE pkts 0          out DE pkts 0
  out bcast pkts 0      out bcast bytes 0
  pvc create time 00:05:31, last time pvc status changed 00:05:31

DLCI = 201, DLCI USAGE = LOCAL, PVC STATUS = ACTIVE, INTERFACE = Se5/0
  input pkts 0          output pkts 0         in bytes 0
  out bytes 0           dropped pkts 0        in FECN pkts 0
  in BECN pkts 0        out FECN pkts 0       out BECN pkts 0
  in DE pkts 0          out DE pkts 0
  out bcast pkts 0      out bcast bytes 0
  pvc create time 00:05:55, last time pvc status changed 00:05:55

DLCI = 202, DLCI USAGE = LOCAL, PVC STATUS = ACTIVE, INTERFACE = Se5/0
  input pkts 0          output pkts 0         in bytes 0
  out bytes 0           dropped pkts 0        in FECN pkts 0
  in BECN pkts 0        out FECN pkts 0       out BECN pkts 0
  in DE pkts 0          out DE pkts 0
  out bcast pkts 0      out bcast bytes 0
  pvc create time 00:04:36, last time pvc status changed 00:04:36

DLCI = 203, DLCI USAGE = LOCAL, PVC STATUS = ACTIVE, INTERFACE = Se5/0
  input pkts 0          output pkts 0         in bytes 0
  out bytes 0           dropped pkts 0        in FECN pkts 0
  in BECN pkts 0        out FECN pkts 0       out BECN pkts 0
  in DE pkts 0          out DE pkts 0
  out bcast pkts 0      out bcast bytes 0
  pvc create time 00:04:37, last time pvc status changed 00:04:37

Priority DLCI Group 1, DLCI 203 (HIGH), DLCI 202 (MEDIUM),
DLCI 201 (NORMAL), DLCI 200 (LOW)

Listing 8-19 show queueing priority Command Output

Router#show queueing priority
Current priority queue configuration:

List   Queue    Args
1      high     protocol ip          list 103
1      medium   protocol ip          list 102
1      normal   protocol ip          list 101
1      low      protocol ip          list 100
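The forwarding decision made by the priority DLCI group amounts to a pair of lookups. The following is a hypothetical Python rendering of the mappings in Listing 8-17; the function name is illustrative.

```python
# IP precedence values 0-3 and the names IOS access lists use for them.
PRECEDENCE_NAMES = {0: "routine", 1: "priority", 2: "immediate", 3: "flash"}

# priority-list 1 maps each precedence to a queue level via its access list.
PRIORITY_BY_PRECEDENCE = {3: "high", 2: "medium", 1: "normal", 0: "low"}

# frame-relay priority-dlci-group 1 203 202 201 200
DLCI_BY_PRIORITY = {"high": 203, "medium": 202, "normal": 201, "low": 200}

def select_dlci(ip_precedence):
    """Return the DLCI that carries a packet of the given precedence."""
    level = PRIORITY_BY_PRECEDENCE[ip_precedence]
    return DLCI_BY_PRIORITY[level]
```

Precedence 3 (flash) traffic lands on DLCI 203, and precedence 0 (routine) traffic on DLCI 200, matching the priority DLCI group output in Listing 8-18.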
Case Study 8-10: Mapping Between Frame Relay DE Bits and IP Precedence Bits
Say a finance company is using a Frame Relay network as a backbone for its nationwide network. The company categorizes traffic at the edges so that priority traffic is serviced preferentially over any background, lower-priority traffic. Its traffic falls into two classes: high priority and low priority, indicated by IP precedence levels 3 and 0, respectively. The company wants to send the low-priority, background traffic with the DE bit set so that nonpriority traffic is discarded when necessary without affecting the priority traffic. The egress traffic at the Frame Relay network is mapped back to IP precedence levels of 3 and 0 based on whether the DE bit is set. Listing 8-21 shows a configuration for marking IP precedence 0 packets with the DE bit on the Frame Relay circuit.

Listing 8-21 Sample Configuration to Map Precedence 0 IP Packets with the DE Bit Set to 1 on the Frame Relay Network

frame-relay de-list 1 protocol ip list 101
interface serial 1/0/0
 encapsulation frame-relay
interface serial1/0/0.1 point-to-point
 frame-relay interface-dlci 18 broadcast
 frame-relay de-group 1 18
access-list 101 permit ip any any precedence routine

The de-list command defines the packet class that needs to be sent on the Frame Relay network with the DE bit set. The de-group command applies that packet class to the DLCI on which the DE bit mapping occurs. In the preceding example, all IP packets with precedence 0 that go out on the Frame Relay DLCI have the DE bit set. All other IP traffic goes on the Frame Relay network without the DE bit flagged (DE = 0).
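The DE-bit marking and the egress remapping described above reduce to two small mappings. This is a hypothetical Python model of the policy; the function names are illustrative.

```python
def de_bit_for(precedence):
    """Ingress: precedence 0 (routine) traffic is marked discard-eligible;
    all other traffic is sent with DE = 0."""
    return 1 if precedence == 0 else 0

def precedence_for(de_bit):
    """Egress: frames with DE set map back to precedence 0,
    frames without DE map back to precedence 3."""
    return 0 if de_bit else 3
```

A round trip through the Frame Relay cloud therefore preserves the two-class marking: precedence 0 goes out with DE = 1 and comes back as precedence 0, while precedence 3 goes out with DE = 0 and comes back as precedence 3.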
The network administrator is aware that VoIP has time constraints and uses IP precedence 5 for it. The data traffic uses IP precedence levels 0 to 4. Although the network uses WFQ, some of the VoIP traffic sees unusually high delay, causing jitter. The administrator notices that the data traffic on average is 1024 bytes, and the VoIP traffic is 64 bytes. Because the large packet sizes of the data traffic can cause delays for the VoIP traffic, it is necessary to fragment the data packets exceeding a certain size so that VoIP traffic sees only minimal delay and low jitter. In addition, PQ-WFQ is used to reduce delay and jitter for voice traffic. Listing 8-22 is a sample configuration for enabling Frame Relay fragmentation.

Listing 8-22 Enable Frame Relay Fragmentation with PQ-WFQ Shaping Queue

interface Serial 1/0
 ip address 220.200.200.2 255.255.255.252
 encapsulation frame-relay
 frame-relay traffic-shaping
 ip rtp priority 16384 16383 640
 frame-relay interface-dlci 110
  class frag
map-class frame-relay frag
 frame-relay cir 64000
 frame-relay bc 8000
 frame-relay fragment 64
 frame-relay fair-queue

Class frag is defined under the Frame Relay DLCI to configure Frame Relay fragmentation along with traffic shaping. The frame-relay fragment command defines the fragment size. According to the configuration, any packet bigger than 64 bytes is fragmented. Frame Relay traffic shaping needs to be enabled on the Frame Relay interface to enable fragmentation. A CIR of 64 kbps and a BC of 8000 bits are defined for traffic shaping. On the queue used to buffer frames for traffic shaping, WFQ scheduling is used to decide which buffered packet to transmit next. The ip rtp priority command is used to enable a priority queue within WFQ. By default, end-to-end fragmentation based on the FRF.12 specification is used. Other types of fragmentation also are supported on the Cisco router. Fragmentation based on the FRF.11 Annex C specification is used if the vofr command is used on the Frame Relay interface.
The command vofr cisco is used to enable Cisco proprietary fragmentation.
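The motivation for the small fragment size is easy to verify: the serialization delay a voice packet can suffer behind a data frame grows linearly with the frame size. A quick Python check (the function name is illustrative):

```python
def serialization_delay_ms(frame_bytes, link_bps):
    """Worst-case time in milliseconds that one frame occupies the
    wire ahead of a waiting voice packet."""
    return frame_bytes * 8 * 1000 / link_bps
```

On the 64-kbps PVC above, an unfragmented 1024-byte data frame blocks voice for 128 ms, while a 64-byte fragment blocks it for only 8 ms.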
802.1Q defines a new tagged frame type by adding a 4-byte tag, which is made up of the following:

2 bytes of Tag Protocol Identifier (TPID)
  o 0x8100 is used to indicate an 802.1Q frame
2 bytes of Tag Control Information (TCI)
  o 3-bit 802.1p priority field
  o 1-bit Canonical Format Indicator (CFI)
  o 12-bit VLAN Identifier (ID)
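The 16-bit TCI field can be packed and unpacked with simple bit operations, following the field layout listed above. A Python sketch:

```python
TPID = 0x8100  # identifies an 802.1Q tagged frame

def build_tci(user_priority, cfi, vlan_id):
    """Pack the 16-bit Tag Control Information field:
    3 bits of 802.1p priority, 1 CFI bit, 12-bit VLAN ID."""
    assert 0 <= user_priority <= 7 and cfi in (0, 1) and 0 <= vlan_id <= 0xFFF
    return (user_priority << 13) | (cfi << 12) | vlan_id

def parse_tci(tci):
    """Split a 16-bit TCI back into (priority, cfi, vlan_id)."""
    return tci >> 13, (tci >> 12) & 1, tci & 0xFFF
```

For example, priority 5 on VLAN 100 with CFI 0 packs to 0xA064, and parsing that value recovers the three fields.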
Figure 8-15 shows how the original Ethernet/802.3 frame is changed into a tagged 802.1Q frame. The FCS needs to be recalculated after introducing the 4-byte tag. Figure 8-15 An Ethernet Frame to a Tagged 802.1Q Frame
802.1p provides a way to maintain priority information across LANs. It offers eight priority levels, encoded in the three 802.1p bits. To support 802.1p, the link layer has to support multiple queues, one for each priority or traffic class. The high-priority traffic is always preferred over lower-priority traffic. A switch preserves the priority values while switching a frame. Note You can easily map the 3-bit IP precedence field to the three priority bits of IEEE 802.1p, and vice versa, to provide interworking between IP QoS and IEEE 802.1p.
Cisco's Catalyst family of switches uses the 802.1p class of service (CoS) bits to prioritize traffic with QoS features such as Weighted Round Robin (WRR) and WRED. WRR and WRED are discussed in detail in Chapter 5, "Per-Hop Behavior: Resource Allocation II," and in Chapter 6, respectively. Though 802.1p uses the tag defined by IEEE 802.1Q, a standard for VLANs, 802.1p can still be used in the absence of VLANs, as shown in Figure 8-16. A VLAN ID of 0 is reserved and is used to indicate the absence of VLANs. Figure 8-16 Use of 802.1p in the Absence of VLANs
With the addition of the 4-byte tag introduced by the 802.1Q and 802.1p specifications, an Ethernet frame can now exceed the maximum frame size of 1518 bytes. Hence, IEEE 802.3ac is tasked with modifying the 802.3 standard to extend the maximum frame size from 1518 to 1522 bytes. Note IEEE 802.1Q "Standard for Virtual Bridged Local Area Networks" defines a method of establishing VLANs.
Expedited traffic capabilities are defined as part of the 802.1p standard. 802.1p is part of the recently modified version of 802.1D[7], a standard for transparent bridging. Expedited traffic capabilities define traffic classes to allow user priorities at frame level. IETF's Integrated Services over Specific Lower Layers (ISSLL) Working Group[8] is defining a way to map Layer 3 Resource Reservation Protocol (RSVP) requests to 802.1p priorities through a Subnet Bandwidth Manager (SBM). SBM is covered in the next section.
SBM

SBM[9] is a signaling protocol that supports RSVP-based admission control over the Ethernet family of 802.3-style networks. As discussed in Chapter 7, "Integrated Services: RSVP," RSVP is an end-to-end signaling mechanism used to request specific resource reservations from the network. Across an Ethernet, guaranteed bandwidth reservation is not possible by any single station on the segment because the Ethernet segment is a shared medium. A station on the Ethernet segment has no knowledge of the reservations guaranteed by the other stations on the segment and can send traffic to the segment at a rate that might compromise the reservations existing on the other stations. SBM is a means of supporting RSVP-based reservations on 802.3-style networks. An Ethernet segment with one or more SBM-capable devices is referred to as a managed Ethernet segment. A managed segment can be either a shared segment with one or many SBM stations or a switched segment with up to two SBM stations. On a managed segment, one of the SBM-capable devices acts as a designated SBM (DSBM). You can elect a DSBM dynamically, or you can stipulate it by static configuration on SBM stations. All SBM-capable stations other than the DSBM on a managed segment act as DSBM clients. A DSBM is responsible for admission control for all resource reservation requests originating from DSBM clients in its managed segment. Cisco routers and Windows NT 5.0 are examples of stations with DSBM functionality. A station with SBM functionality is called an SBM-capable station. An SBM-capable station configured to participate in DSBM election is called a candidate DSBM station.

Initialization

SBM uses two multicast addresses for its operation:

AllSBMAddress: This address is used for DSBM election and all DSBM messages. All SBM-capable stations listen on this address. A DSBM sends its messages on this address.

DSBMLogicalAddress: This address is used for RSVP PATH messages from DSBM clients to the DSBM.
Only candidate DSBM stations listen to this address. When RSVP is enabled on an SBM-capable station on a shared segment, RSVP registers the interface on the shared segment to listen to the AllSBMAddress multicast address. If the station receives an I_AM_DSBM message on the AllSBMAddress, the interface is considered to be on a managed segment, and RSVP on a managed segment should operate according to the SBM protocol. A DSBM client on a managed segment listens on AllSBMAddress for RSVP PATH messages destined to it through the DSBM. A station configured as a candidate DSBM listens to DSBMLogicalAddress along with AllSBMAddress. A candidate DSBM sends a DSBM_WILLING message on the managed segment. All RSVP PATH messages originated by the SBM clients are sent to the DSBMLogicalAddress. A DSBM does not listen for PATH messages on the AllSBMAddress.
Broadly, an SBM-capable station adds the following functionality:

DSBM Election: A mechanism to elect a DSBM on a managed segment

RSVP Extensions: Extensions for incoming and outgoing RSVP PATH and PATH-TEAR message processing

The following two sections discuss these requirements.

DSBM Election

A new SBM station on a managed segment initially listens for a period of time to see whether a DSBM is already elected for that segment. If it receives an I_AM_DSBM message on AllSBMAddress, it doesn't participate in the DSBM election until the DSBM goes down and a new DSBM election process becomes necessary. A DSBM sends an I_AM_DSBM message every DSBMRefreshInterval seconds. If a DSBM client doesn't see an I_AM_DSBM message for DSBMDeadInterval (a multiple of DSBMRefreshInterval) seconds, it assumes the DSBM is down and starts a new DSBM election process if it is configured to be a candidate DSBM. During DSBM election, each candidate DSBM station sends a DSBM_WILLING message to DSBMLogicalAddress listing its interface address and SBM priority. SBM priority determines the precedence of a candidate DSBM to become the DSBM. A higher-priority candidate DSBM station wins the election. If the SBM priority is the same, the tie is broken by using the IP addresses of the candidate DSBMs: the candidate DSBM with the highest IP address wins the election.

RSVP Extensions

Under SBM, a DSBM intercepts all incoming and outgoing RSVP PATH and PATH-TEAR messages, adding an extra hop to the normal RSVP operation. All outgoing RSVP PATH messages from an SBM client on a managed segment are sent to the segment's DSBM device (using DSBMLogicalAddress) instead of to the RSVP session destination address. After processing, the DSBM forwards the PATH message to the RSVP session destination address. As part of the processing, the DSBM builds and updates a Path State Block (PSB) for the session and maintains the previous hop L2/L3 addresses of the PATH message.
An incoming RSVP PATH message on a DSBM requires parsing of the additional SBM objects and setting up the required SBM-related information in the PSB. An RSVP PATH-TEAR message is used to tear down an established PSB for an RSVP session. RSVP RESV messages don't need any changes as part of SBM. An SBM client wanting to make a reservation after processing an incoming RSVP PATH message follows the standard RSVP rules and sends RSVP RESV messages to the previous hop L2/L3 addresses (the segment's DSBM) of the incoming PATH message based on the information in the session's PSB. A DSBM processes the RSVP RESV message from an SBM client based on available bandwidth. If the request cannot be granted, an RSVP RESVERR message is sent to the SBM client requesting the reservation. If the reservation request can be granted, the RESV message is forwarded to the previous hop address of the incoming PATH message based on the session's PSB. Similar to standard RSVP, a DSBM can merge reservation requests when possible.
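The DSBM election rule, highest SBM priority first with the highest IP address as tiebreaker, can be expressed compactly. This is a Python sketch; the candidate representation as (priority, address) pairs is an assumption.

```python
import ipaddress

def elect_dsbm(candidates):
    """Pick the DSBM from (sbm_priority, ip_address) pairs:
    the highest SBM priority wins, and ties go to the highest IP address."""
    return max(candidates,
               key=lambda c: (c[0], int(ipaddress.IPv4Address(c[1]))))
```

With candidates (1, "10.0.0.5") and (2, "10.0.0.1"), the second wins on priority; with equal priorities, the numerically higher address wins.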
Summary
ATM provides rich QoS functionality that offers a variety of services. PPD and EPD cell discard techniques improve the effective throughput relative to a random cell discard in an ATM network carrying IP traffic. When cells need to be dropped in an ATM network, PPD discards the remaining cells of a packet after one of its cells has already been dropped, so a partial packet can still be transmitted, whereas EPD drops all the cells of the packet. For running IP QoS end to end across an ATM network, packet drops and scheduling at the ingress to the ATM network should be done based on the packet's IP precedence. For this purpose, the IP QoS technologies WFQ and WRED are used. Within the ATM cloud, the ATM VC carrying the IP traffic is provisioned to be lossless and offers a service meeting the IP traffic's service requirements. When all the traffic to a destination is carried over a single PVC, you can apply IP QoS on a per-VC basis. In the case of a PVC bundle, where multiple PVCs exist to the destination, you can map certain IP precedence values to each PVC and apply per-VC IP QoS for each VC in the bundle. Frame Relay offers a CIR-based QoS with a capability to burst above the committed rate when the network is not congested. It also offers extensive congestion control mechanisms. Frame Relay fragmentation enables real-time traffic to be carried on the same PVC as relatively large data packets. On a shared or a switched Ethernet, you can prioritize traffic by using the 802.1p bits in an 802.1Q frame. A high-priority frame gets precedence over a lower-priority frame. RSVP-based bandwidth reservation on an Ethernet is a problem because the Ethernet is a shared medium. SBM designates a single station to make RSVP-type bandwidth reservations for the entire Ethernet segment.
FRTS: Enabled based on a Frame Relay PVC or DLCI. Supports PQ, CQ, WFQ, and CBWFQ on its internal shaping queue; WFQ and CBWFQ are supported with and without a strict priority queue. Supports only a dual-FIFO queue at the interface level.

DTS: Supports only flow-based WFQ and CBWFQ on its internal shaping queue; WFQ and CBWFQ are supported with and without a strict priority queue. FRF.12 fragmentation is not supported; it works only with FRTS. Supports any queuing at the interface level.
References
1. The ATM Forum, http://www.atmforum.com/
2. "Dynamics of TCP Traffic over ATM Networks," A. Romanow and S. Floyd, IEEE JSAC, V. 13 N. 4, May 1995, pp. 633-641.
3. Early Packet Discard (EPD) Page, http://www.aciri.org/floyd/epd.html
4. "ISDN Data Link Layer Specification for Frame Mode Bearer Services," International Telegraph and Telephone Consultative Committee, CCITT Recommendation Q.922, 19 April 1991.
5. Frame Relay Forum (FRF), http://www.frforum.com/
6. "Virtual Bridged Local Area Networks," IEEE 802.1Q, http://grouper.ieee.org/groups/802/1/vlan.html
7. "MAC Bridges," IEEE 802.1D, http://grouper.ieee.org/groups/802/1/mac.html
8. "Integrated Services over Specific Link Layers (ISSLL)," IETF Working Group, http://www.ietf.org/html.charters/issll-charter.html
9. "SBM: A Protocol for RSVP-Based Admission Control over IEEE 802-Style Networks," Yavatkar and others, draft-ietf-issll-is802-sbm-08.txt, http://search.ietf.org/internet-drafts/draft-ietf-issll-is802-sbm-08.txt
MPLS
MPLS[1] is an IETF standard for label-swapping-based forwarding driven by network layer routing information. It consists of two principal components: control and forwarding. The control component uses a label distribution protocol to maintain label-forwarding information for all destinations in the MPLS network. The forwarding component switches packets by swapping labels, using the label information carried in the packet and the label-forwarding information maintained by the control component. MPLS, as the name suggests, works with different network layer protocols. As such, the forwarding component is independent of any network layer protocol. The control component has to support label distribution for different network layer protocols to enable MPLS use with multiple network layer protocols.
Forwarding Component
MPLS packet forwarding occurs by using a label-swapping technique. When a packet carrying a label arrives at a Label Switching Router (LSR), the LSR uses the label as an index into its Label Information Base (LIB). For an incoming label, the LIB carries a matching entry with the corresponding outgoing label, interface, and link-level encapsulation information needed to forward the packet. Based on the information in the LIB, the LSR swaps the incoming label with the outgoing label and transmits the packet on the outgoing interface with the appropriate link-layer encapsulation. A Label Edge Router (LER) is an edge router in the MPLS cloud. Some LER interfaces perform non-MPLS-based forwarding, and some run MPLS. An LER adds an MPLS label to all packets entering the MPLS cloud from non-MPLS interfaces. By the same token, it removes the MPLS label from a packet leaving the MPLS cloud. The forwarding behavior in an MPLS network is depicted in Figure 9-1.
The preceding procedure simplifies a normal IP router's forwarding behavior. A non-MPLS router performs destination-based routing based on the longest match from the entries in the routing-table-based forwarding table. An MPLS router, on the other hand, uses a short label, which comes before the Layer 3 header, to make a forwarding decision based on an exact match of the label in the LIB. As such, the forwarding procedure is simple enough to allow a potential hardware implementation.
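The exact-match forwarding described above can be illustrated with a dictionary acting as the LIB. This is a hypothetical Python sketch; the labels and interface names are invented for illustration.

```python
# A sketch of a Label Information Base: the incoming label is an
# exact-match index to (outgoing label, outgoing interface).
LIB = {
    17: (25, "Serial0"),
    18: (30, "Serial1"),
}

def forward(packet):
    """Swap the incoming label for the outgoing label and pick the
    egress interface: an exact match, unlike longest-prefix routing."""
    out_label, out_if = LIB[packet["label"]]
    packet["label"] = out_label
    return out_if
```

A packet arriving with label 17 leaves on Serial0 carrying label 25; no prefix comparison is involved, which is what makes the operation amenable to hardware.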
Control Component
The control component is responsible for creating label bindings and then distributing the label-binding information among LSRs. A label binding is an association between a label and either network layer reachability information or a single traffic flow, depending on the forwarding granularity. At one end, a label can be associated with a group of routes, giving MPLS good scaling capabilities. At the other end, a label can be bound to a single application flow, accommodating flexible forwarding functionality. MPLS network operation is shown in Figure 9-2.
When label binding is based on routing information, MPLS performs destination-based forwarding. Destination-based forwarding is not amenable to more granular and flexible routing policies, however. For cases involving flexible forwarding policies, the label binding might not be based on routing information. MPLS provides flexible forwarding policies at a granularity of a flow or a group of flows. You can use this aspect of MPLS to offer a new service called traffic engineering. Traffic engineering is discussed in Chapter 10. The next section discusses label-binding procedures for achieving destination-based forwarding.

Label Binding for Destination-Based Forwarding

Cisco Express Forwarding (CEF) is the recommended packet switching mechanism for IP networks today. CEF is discussed in Appendix B, "Packet Switching Mechanisms." A CEF table carries forwarding information based on the routing table; as such, it forwards packets on the basis of the destination. MPLS extends the CEF table to accommodate label allocation for each entry. The LIB binds each CEF table entry with a label. MPLS allows three methods for label allocation and distribution:

Downstream label allocation

Downstream label allocation on demand

Upstream label allocation
For all these types of label allocation, a protocol called Label Distribution Protocol (LDP) is used to distribute labels between routers. Note that the terms "downstream" and "upstream" are used with respect to the direction of the data flow.

Downstream Label Allocation

Downstream label allocation occurs in the direction opposite the actual data flow's direction. The label carried in a packet is generated and bound to a prefix by an LSR at the link's downstream end. As such, each LSR originates labels for its directly connected prefixes, binds them as incoming labels for those prefixes, and distributes the label associations to all the upstream routers. An upstream router installs the received label binding as the outgoing label for the prefix in its CEF table and, in turn, allocates an incoming label of its own and advertises it to routers further upstream.
In independent label distribution mode, each downstream router binds an incoming label for a prefix independently and advertises it as an outgoing label to all its upstream routers. It is not necessary to receive an outgoing label for a prefix before an incoming label is created and advertised. When a router has both the incoming and outgoing labels for a prefix, it can start switching packets by label swapping. The other label distribution mode is termed ordered control mode. In this mode, a router waits for the label from its downstream neighbor before sending its label upstream.

Downstream Label Allocation on Demand

This label allocation process is similar to downstream allocation, but the label is created on demand by an upstream router. The upstream router identifies the next hop for each prefix from the CEF table and issues a request to the next hop for a label binding for that route. The rest of the allocation process is similar to downstream label allocation.

Upstream Label Allocation

Upstream label allocation occurs in the direction of the actual data flow. The label carried in the data packet's header is generated and bound to the prefix by the LSR at the upstream end of the link. For each CEF entry in an LSR, an outgoing label is allocated and distributed as an incoming label to the downstream routers. In this case, incoming labels are allocated to prefixes. When an LSR has both the incoming and the outgoing labels for a prefix, it can start switching packets carrying a label by using label swapping. When an LSR creates a binding between an outgoing label and a route, the LSR, in addition to populating its LIB, also updates its CEF table with the binding information. This enables the LSR to add labels to previously unlabeled packets it is originating. Table 9-1 compares downstream and upstream label distribution methods. Table 9-1.
Table 9-1. Comparison Between Downstream and Upstream Label Distribution

Direction of Label Allocation
  Downstream allocation: Occurs in the direction opposite the data flow.
  Upstream allocation: Occurs in the direction of the data flow.

Label Allocation and Distribution
  Downstream allocation: Allocates the incoming label for all entries in the CEF table and distributes it to the upstream routers, where it is installed as the outgoing label.
  Upstream allocation: Allocates the outgoing label for all entries in the CEF table and distributes it to the downstream routers, where it is installed as the incoming label.

Label Distribution
  Downstream allocation: Distributes incoming labels.
  Upstream allocation: Distributes outgoing labels.

Applicability
  Downstream allocation: Applicable for non-ATM-based IP networks.
  Upstream allocation: Downstream label allocation on demand and upstream label allocation are most useful in Asynchronous Transfer Mode (ATM) networks.

Note Some important points to note regarding the MPLS control component: The total number of labels used in an LSR is no greater than the number of its CEF entries. In fact, in most cases you can associate a single label with a group of routes sharing the same next hop; hence, the number of labels used is much less than the number of CEF entries. This provides a scalable architecture. Label allocation is driven by topology information as reflected by CEF, which is based on routing information and not on actual data traffic. MPLS doesn't replace IP routing protocols. The MPLS control component depends on the existence of routing information in a router, but it is independent of the kind of IP routing protocol used in an MPLS network. For that matter, you can use any one routing protocol, or multiple routing protocols, in an MPLS network.
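The downstream, independent-mode allocation described in this section can be sketched in a few lines of Python. This is an illustrative model only, not LDP itself; the class, method names, and label-pool values are hypothetical (a real LSR would advertise an implicit-null label for its own connected prefix):

```python
from itertools import count

class LSR:
    def __init__(self, name, label_pool_start=16):
        self.name = name
        self._labels = count(label_pool_start)
        self.incoming = {}   # prefix -> label this LSR allocated (local binding)
        self.outgoing = {}   # prefix -> label learned from the downstream neighbor

    def allocate_local_bindings(self, prefixes):
        # Independent mode: bind an incoming label per prefix without waiting
        # for a label from downstream.
        for p in prefixes:
            self.incoming[p] = next(self._labels)

    def advertise_to(self, upstream):
        # The upstream router installs our local binding as its outgoing label.
        for prefix, label in self.incoming.items():
            upstream.outgoing[prefix] = label

# NewYork is downstream (owns the prefix); Chicago is upstream of it.
newyork = LSR("NewYork", label_pool_start=16)
chicago = LSR("Chicago", label_pool_start=26)
newyork.allocate_local_bindings(["222.222.222.3/32"])
chicago.allocate_local_bindings(["222.222.222.3/32"])
newyork.advertise_to(chicago)

# Chicago now holds both an incoming and an outgoing label for the prefix,
# so it can switch packets by label swapping.
assert chicago.incoming["222.222.222.3/32"] == 26
assert chicago.outgoing["222.222.222.3/32"] == 16
```

In ordered control mode, by contrast, `advertise_to` would run only after the router itself had received a binding from its downstream neighbor.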
Label Encapsulation
A packet can carry label information in a variety of ways:

As a 4-byte label inserted between the Layer 2 and network layer headers. This applies to Point-to-Point Protocol (PPP) links and Ethernet (all flavors) LANs. A single MPLS label or a label stack (multiple labels) can be carried in this way. Figure 9-3 shows how the label is carried over PPP links and over an Ethernet-type LAN.

Figure 9-3 MPLS Label in Ethernet and PPP Frame
As a part of the Layer 2 header. This applies to ATM, where the label information is carried in the VPI/VCI fields, as shown in Figure 9-4.

Figure 9-4 MPLS Label Carried in the VPI/VCI Fields in an ATM Header
As part of the ATM Adaptation Layer 5 (AAL5) frame before segmentation and reassembly (SAR). This occurs in an ATM environment when the label information is made up of a label stack (multiple MPLS label fields).

Note Adding an MPLS label or a label stack to a 1492-byte packet might lead to packet fragmentation. Transmission Control Protocol (TCP) path maximum transmission unit (MTU) discovery packets carrying the MPLS label, if sent, detect the need to fragment across an MPLS network. Note, however, that many Ethernet links actually support 1500- or 1508-byte packets. In addition, in most network designs, labeled packets are usually carried over ATM or PPP links and not on local-area network (LAN) segments.
An MPLS label field consists of a label header and a 20-bit label. The label header consists of three fields: CoS, S bit, and Time-to-Live (TTL). The 4-byte MPLS label field format is shown in Figure 9-5.
CoS (3 bits) This field is used to deliver differentiated services in an MPLS network. To deliver end-to-end IP QoS, you can copy the IP precedence field to the CoS field at the edge of the MPLS network. Note The CoS field in the MPLS header has only 3 bits. As such, it can carry the 3-bit IP precedence field but not the 6-bit Differentiated Services Code Point (DSCP) field. Therefore, as needed, the CoS information can be carried as one of the labels in an MPLS label stack; because the label field is 20 bits long, it can hold either the IP precedence value or the DSCP value.
S bit (1 bit) Indicates a label entry at the bottom of the label stack. It is set to 1 for the last entry in the label stack and to 0 for all other label stack entries. This allows a prefix to be bound to multiple labels, also called a label stack. In the case of a label stack, each label has its own associated CoS, S, and TTL values.
TTL (8 bits) Indicates the time to live for an MPLS packet. The TTL value, when set at the edge of the MPLS network, is decremented at each MPLS network hop. Note The IP TTL field is copied into the MPLS TTL field during label imposition by default. It enables the traceroute utility to show all the MPLS hops when the destination lies within or across the MPLS cloud. The no mpls ip propagate-ttl command is used to disallow copying of IP TTL into the MPLS TTL field at the ingress to the MPLS network. In this case, the MPLS TTL field is set to 255. Hence, the traceroute output does not show any hops within the MPLS network. It shows only one IP hop to transit the entire MPLS domain.
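The 4-byte label entry layout just described (20-bit label, 3-bit CoS, 1-bit S, 8-bit TTL) can be checked with a short Python sketch; the function names here are illustrative, not part of any MPLS implementation:

```python
import struct

def pack_mpls_shim(label, cos, s, ttl):
    """Pack one 4-byte MPLS label entry: 20-bit label, 3-bit CoS (EXP),
    1-bit bottom-of-stack (S), and 8-bit TTL, in network byte order."""
    assert 0 <= label < 2**20 and 0 <= cos < 8 and s in (0, 1) and 0 <= ttl < 256
    word = (label << 12) | (cos << 9) | (s << 8) | ttl
    return struct.pack("!I", word)

def unpack_mpls_shim(data):
    """Reverse of pack_mpls_shim: split a 4-byte entry into its fields."""
    (word,) = struct.unpack("!I", data)
    return {"label": word >> 12, "cos": (word >> 9) & 0x7,
            "s": (word >> 8) & 0x1, "ttl": word & 0xFF}

# A bottom-of-stack entry with label 26, CoS copied from IP precedence 5,
# and a TTL already decremented once to 254:
entry = pack_mpls_shim(26, 5, 1, 254)
assert len(entry) == 4
assert unpack_mpls_shim(entry) == {"label": 26, "cos": 5, "s": 1, "ttl": 254}
```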
The mpls ip command is used to enable MPLS on a router. One prerequisite for label distribution is CEF, which you can enable by using the global ip cef command. In downstream label allocation, an MPLS router allocates an incoming label for each CEF entry and distributes it on all its interfaces. An upstream router receives a label binding and uses it as an outgoing label for its associated CEF prefix. In this method of label allocation, the incoming label is allocated locally and the outgoing label is received remotely from a downstream router.
This case study concentrates on studying the behavior of downstream label distribution for one prefix, 222.222.222.3. The following listings show information on the prefix 222.222.222.3 from the CEF table, the label bindings, and the label forwarding table of each router. The show ip cef, show mpls ldp bindings, and show mpls forwarding-table commands are used to fetch information on CEF, label binding, and label-based forwarding, respectively. Listings 9-1, 9-2, and 9-3 show information on CEF, label binding, and the label distribution protocol from Router NewYork.

Listing 9-1 Information on the 222.222.222.3 Prefix in the CEF Table of Router NewYork

NewYork#show ip cef 222.222.222.3
222.222.222.3/32, version 170, connected, receive

Listing 9-2 Label Binding Information for Prefix 222.222.222.3 on Router NewYork

NewYork#show mpls ldp bindings
  LIB entry: 222.222.222.3/32, rev 118
        local binding:  label: imp-null
        remote binding: lsr: 222.222.222.2:0, label: 26
        remote binding: lsr: 222.222.222.4:0, label: 29

Listing 9-3 Label Distribution Protocol Parameters

NewYork#show mpls ldp parameters
Protocol version: 1
Downstream label pool: min label: 10; max_label: 100000; reserved labels: 16
Session hold time: 180 sec; keep alive interval: 60 sec
Discovery hello: holdtime: 15 sec; interval: 5 sec
Discovery directed hello: holdtime: 180 sec; interval: 5 sec

MPLS performs label binding for all prefixes in the CEF table and distributes the bindings to all established LDP neighbors. The show mpls ldp bindings command shows the label bindings. A local binding is advertised upstream by a router. In this case, the local binding for prefix 222.222.222.3 is NULL because it is a directly connected IP address on the router. Router NewYork advertises the null label binding to its LDP-adjacent routers, Chicago and Dallas. A router receiving a null label binding for a prefix pops the label when forwarding a packet destined to this prefix.
The remote bindings are the label bindings advertised by the respective routers. The remote LSRs 222.222.222.2 and 222.222.222.4 have local labels of 26 and 29, respectively. A local label binding is advertised to all LDP-adjacent routers. The show mpls ldp parameters command displays the LDP protocol and label binding information. Because Router NewYork receives packets destined to its directly connected 222.222.222.3 prefix without a label, the label forwarding table does not carry any information on prefix 222.222.222.3. Listings 9-4, 9-5, and 9-6 display CEF, label binding, and label forwarding information for prefix 222.222.222.3 from Router Chicago.

Listing 9-4 CEF Information on Prefix 222.222.222.3 in Router Chicago

Chicago#show ip cef 222.222.222.3
222.222.222.3/32, version 179, cached adjacency to Hssi1/0
0 packets, 0 bytes
  via 210.210.210.9, Hssi1/0, 0 dependencies
    next hop 210.210.210.9, Hssi1/0
    valid cached adjacency
Listing 9-5 Label Binding Information for Prefix 222.222.222.3 in Router Chicago

Chicago#show mpls ldp bindings
  LIB entry: 222.222.222.3/32, rev 90
        local binding:  label: 26
        remote binding: lsr: 222.222.222.3:0, label: imp-null
        remote binding: lsr: 222.222.222.1:0, label: 31

Listing 9-6 Label Forwarding Information on Prefix 222.222.222.3 in Router Chicago

Chicago#show mpls forwarding-table
Local  Outgoing     Prefix
label  label or VC  or Tunnel Id
26     Pop label    222.222.222.3/32
The local label 26 is the label binding for 222.222.222.3 on Router Chicago. This local binding is triggered by the presence of 222.222.222.3 in the CEF table, and it is distributed to all of Chicago's LDP neighbors. LSR SanFrancisco has a local binding of label 31 for 222.222.222.3, which it advertised to all its LDP neighbors. The show mpls forwarding-table command shows the information required to switch packets by label swapping. The outgoing label is the pop label because Chicago received a null label from the remote NewYork LSR on its outgoing interface (the prefix is a directly connected address on that remote LSR). In this example, any incoming packet with a label of 26 is switched to the outgoing interface Hssi1/0 after removing, or popping, the label. The label is removed because the packet is switched to its ultimate destination router. This practice of popping the label at the penultimate hop to the destination is termed penultimate hop popping (PHP). Listings 9-7, 9-8, and 9-9 give CEF, label binding, and label-based forwarding information for the 222.222.222.3 prefix in Router Dallas.
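The way a remote binding of imp-null turns into a pop action in the forwarding table can be sketched as a toy LFIB in Python. This is an illustrative model only (the function names and table layout are hypothetical), not how IOS builds its table internally:

```python
def build_lfib_entry(local_label, remote_label, out_interface):
    """Derive a forwarding action from the LDP bindings: an implicit-null
    remote binding becomes a 'pop' action, which is PHP behavior."""
    action = "pop" if remote_label == "imp-null" else ("swap", remote_label)
    return {"in_label": local_label, "action": action, "interface": out_interface}

def forward(lfib, in_label, payload):
    """Apply one LSR's forwarding decision to a labeled packet."""
    entry = lfib[in_label]
    if entry["action"] == "pop":
        return (entry["interface"], None, payload)       # label removed (PHP)
    _, out_label = entry["action"]
    return (entry["interface"], out_label, payload)      # label swapped

# Router Chicago: local label 26 for 222.222.222.3/32, remote binding imp-null
# from NewYork, outgoing interface Hssi1/0 (values from Listings 9-5 and 9-6).
lfib = {26: build_lfib_entry(26, "imp-null", "Hssi1/0")}
assert forward(lfib, 26, "packet") == ("Hssi1/0", None, "packet")

# SanFrancisco-style entry: local 31, remote binding 26, so the label is swapped.
lfib[31] = build_lfib_entry(31, 26, "Hssi1/1")
assert forward(lfib, 31, "packet") == ("Hssi1/1", 26, "packet")
```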
Listing 9-7 CEF-related Information on Prefix 222.222.222.3 in Router Dallas

Dallas#show ip cef 222.222.222.3
222.222.222.3/32, version 18, cached adjacency to Hssi1/0
0 packets, 0 bytes
  via 210.210.210.14, Hssi1/0, 0 dependencies
    next hop 210.210.210.14, Hssi1/0
    valid cached adjacency

Listing 9-8 Label Bindings for Prefix 222.222.222.3 in Router Dallas

Dallas#show mpls ldp bindings
  LIB entry: 222.222.222.3/32, rev 18
        local binding:  label: 29
        remote binding: lsr: 222.222.222.3:0, label: imp-null
        remote binding: lsr: 222.222.222.1:0, label: 31

Listing 9-9 Label-based Forwarding Information for Prefix 222.222.222.3 in Router Dallas

Dallas#show mpls forwarding-table
Local  Outgoing     Prefix            Bytes label  Outgoing   Next Hop
label  label or VC  or Tunnel Id      switched     interface
29     Pop label    222.222.222.3/32  1190         Hs1/0      point2point

Explanations for the preceding listings from Router Dallas are largely similar to the discussion of the listings from Router Chicago. Listings 9-10, 9-11, and 9-12 display CEF, label binding, and label-based forwarding information on prefix 222.222.222.3 in Router SanFrancisco.

Listing 9-10 CEF Entry for Prefix 222.222.222.3 in Router SanFrancisco
SanFrancisco#show ip cef 222.222.222.3
222.222.222.3/32, version 38, per-destination sharing
0 packets, 0 bytes
  via 210.210.210.22, Hssi1/1, 0 dependencies
    traffic share 1
    next hop 210.210.210.22, Hssi1/1
    valid adjacency
  via 210.210.210.18, Hssi1/0, 0 dependencies
    traffic share 1
    next hop 210.210.210.18, Hssi1/0
    valid adjacency
  0 packets, 0 bytes switched through the prefix

Listing 9-11 Label Binding Information on Prefix 222.222.222.3 in Router SanFrancisco

SanFrancisco#show mpls ldp bindings
  LIB entry: 222.222.222.3/32, rev 8
        local binding:  label: 31
        remote binding: lsr: 222.222.222.2:0, label: 26
        remote binding: lsr: 222.222.222.4:0, label: 29

Listing 9-12 Label-Based Forwarding Information on Prefix 222.222.222.3 in Router SanFrancisco

SanFrancisco#show mpls forwarding-table
Local  Outgoing     Prefix            Bytes label  Outgoing   Next Hop
label  label or VC  or Tunnel Id      switched     interface
31     26           222.222.222.3/32  0            Hs1/1      point2point
       29           222.222.222.3/32  0            Hs1/0      point2point

The CEF table shows two equal-cost paths to reach 222.222.222.3. The local binding for the prefix is 31, and it is distributed to all the router's LDP neighbors. The remote bindings show the local label bindings distributed by the LDP-adjacent routers. The MPLS router label-switches packets received with the local label by using an outgoing label chosen from the received remote binding information.
MPLS QoS
QoS is an important component of MPLS. In an MPLS network, QoS information is carried in the label header's MPLS CoS field. Like IP QoS, MPLS QoS is achieved in two main logical steps, as shown in Table 9-2, and uses the same associated QoS functions. Figure 9-7 depicts the QoS functions used in an MPLS network. Figure 9-7 QoS in an MPLS Network
MPLS uses the same IP QoS functions to provide differentiated QoS for traffic within an MPLS network. The only real difference is that MPLS QoS is based on the CoS bits in the MPLS label, whereas IP QoS is based on the IP precedence field in the IP header. On an ATM backbone with an MPLS-enabled ATM switch, the switch can support MPLS CoS in two ways:

Single Label Switched Path (LSP) with Available Bit Rate (ABR) service

Parallel LSPs with Label Bit Rate (LBR) service

Table 9-2. MPLS QoS

QoS Function: Committed Access Rate (CAR)
QoS Action: (Option 1) CAR polices traffic on the ingress router for all incoming IP traffic entering the MPLS cloud. It sets an IP precedence value for traffic according to the traffic profile and policies. The IP packet's IP precedence value is copied into the MPLS CoS field. (Option 2) CAR polices traffic on the ingress router for all incoming IP traffic entering the MPLS cloud. It sets an MPLS CoS value for traffic according to the traffic profile and contract. The precedence value in the IP header is left unchanged end-to-end, unlike Option 1.

QoS Function: Weighted Fair Queuing (WFQ), Weighted Random Early Detection (WRED)
QoS Action: Traffic differentiation based on the MPLS CoS field in the MPLS backbone using the IP QoS functions WFQ and WRED.
A single LSP using ATM ABR service can be established through LDP. All MPLS traffic uses the same ABR LSP, and the differentiation is made on the ingress routers to the ATM cloud by running WFQ and WRED algorithms on traffic going over an LSP. Multiple LSPs in parallel can be established through LDP to support traffic with multiple precedence values. Each established LSP is mapped to carry traffic of certain MPLS CoS values. The LSPs use the LBR ATM service. LBR is a new ATM service category that relies on scheduling and discarding in the ATM switch based on WFQ and WRED, respectively, and hence is more appropriate for IP. When an ATM switch doesn't support MPLS, you can use ATM QoS using the ATM Forum traffic class (Constant Bit Rate [CBR], Variable Bit Rate [VBR], and ABR) and its IP interworking, as discussed in Chapter 8, "Layer 2 QoS: Interworking with IP QoS."
End-to-End IP QoS
You can set the MPLS CoS bits at the edge of the network to provide traffic classification so that the QoS functions within the network can provide differentiated QoS. As such, to deliver end-to-end IP QoS across a QoS-enabled MPLS network, you map or copy the IP precedence value to the MPLS CoS bits at the edge of the MPLS network. The IP precedence value continues to be used after the packet exits the MPLS network. Table 9-3 shows the various QoS functions for delivering end-to-end IP QoS across an MPLS network.
Table 9-3. QoS Functions for Delivering End-to-End IP QoS Across an MPLS Network

Step 1 (IP network, IP QoS): Standard IP QoS policies are followed. At the network boundary, incoming traffic is policed and set with an IP precedence value based on its service level. Differentiated service is based on the precedence value in the IP network.

Step 2 (MPLS network edge, MPLS QoS): The packet's IP precedence value is copied into the MPLS CoS field. Note that the MPLS CoS field can also be set directly based on the traffic profile and service contract.

Step 3 (MPLS network, MPLS QoS): Traffic differentiation is based on the MPLS CoS field in the MPLS backbone, using the IP QoS functions WFQ and WRED.

Step 4 (IP network, IP QoS): IP precedence in the IP header continues to be the basis for traffic differentiation and network QoS.
Two sites, Seattle and Atlanta, in the IP network are connected through the MPLS network as shown previously in Figure 9-7. This case study discusses the MPLS CoS functionality needed in the MPLS network to offer a standard service class for the traffic from Seattle to Atlanta.

Table 9-4. Four Service Classes Using MPLS CoS

Class                 IP Precedence Bits (Value in Decimal)   Type of Service (ToS) Bits   Drop Priority
Class 0 (Available)   000 (0)                                 00                           0
                      100 (4)                                 00                           1
Class 1 (Standard)    001 (1)                                 01                           0
                      101 (5)                                 01                           1
Class 2 (Premium)     010 (2)                                 10                           0
                      110 (6)                                 10                           1
Class 3 (Control)     011 (3)                                 11                           0
                      111 (7)                                 11                           1
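The pattern in Table 9-4 reduces to a bit operation on the precedence value: the two low-order precedence bits select the class, and the high-order bit selects the drop priority. The sketch below is illustrative; the names for classes 0 and 1 are assumptions (the table as printed names only classes 2 and 3):

```python
# Class names: Premium and Control come from Table 9-4; Available and
# Standard are assumed names for classes 0 and 1.
CLASS_NAMES = {0: "Available", 1: "Standard", 2: "Premium", 3: "Control"}

def classify(precedence):
    """Map a 3-bit IP precedence value to (class name, drop priority)
    following the structure of Table 9-4."""
    assert 0 <= precedence <= 7
    service_class = precedence & 0b011       # the ToS bits column
    drop_priority = (precedence >> 2) & 1    # the drop priority column
    return CLASS_NAMES[service_class], drop_priority

assert classify(5) == ("Standard", 1)   # conforming traffic in Listing 9-13
assert classify(1) == ("Standard", 0)   # exceeding traffic in Listing 9-13
assert classify(7) == ("Control", 1)
```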
LER
IP traffic from the Seattle site going to the Atlanta site enters the MPLS network on the LER router. On the LER router interface that connects to the Seattle site, enable CAR to police incoming traffic according to the service contract. In Listing 9-13, the incoming traffic on the interface is contracted at standard CoS, with traffic up to 20 Mbps getting an IP precedence of 5 and any exceeding traffic getting an IP precedence of 1.

Listing 9-13 Enable CAR to Classify Traffic for Standard CoS

interface Hssi0/0/0
 ip address 221.221.221.254 255.255.255.252
 rate-limit input 20000000 2000000 8000000 conform-action set-mpls-exp-transmit 5 exceed-action set-mpls-exp-transmit 1

As part of IP/MPLS QoS interworking, IP precedence is copied into the MPLS CoS field. From the LER, the Atlanta-bound traffic goes via the LSR. On the LER POS3/0/0 interface connecting to the LSR, WRED and WFQ are enabled to differentiate traffic based on the MPLS CoS value carried in the MPLS packet.
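The policing decision configured in Listing 9-13 can be sketched as a token bucket in Python. This is a simplified illustration of the conform/exceed split only; real CAR uses both a normal and an extended burst, and the class name below is hypothetical:

```python
class Policer:
    """Single token bucket: conforming packets are marked 5, exceeding
    packets are marked 1 (the values used in Listing 9-13)."""
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0       # token refill, in bytes per second
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = 0.0

    def police(self, size_bytes, now):
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= size_bytes:
            self.tokens -= size_bytes
            return 5                     # conform-action: mark precedence/CoS 5
        return 1                         # exceed-action: mark precedence/CoS 1

# 20 Mbps contracted rate with a 2,000,000-byte burst, as in Listing 9-13.
p = Policer(rate_bps=20_000_000, burst_bytes=2_000_000)
assert p.police(1_500, now=0.0) == 5         # within the burst: conforming
assert p.police(2_500_000, now=0.0) == 1     # beyond the burst: exceeding
```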
LSR
Traffic going to the Atlanta site goes via an MPLS ATM network. On the LSR, parallel LSPs with LBR service are set up such that separate LSPs carry traffic belonging to each IP precedence. This enables traffic differentiation based on IP precedence. Listing 9-14 shows a sample configuration to enable MPLS-based parallel LSPs with LBR service. On each LSP, WRED is enabled.

Listing 9-14 Enable Parallel LSPs with LBR Service on an LSR

interface ATM1/1/0
!
interface ATM1/1/0.1 mpls
 ip unnumbered Loopback0
 mpls multi-vc
 mpls random detect
MPLS VPN
One important MPLS application is the VPN[2] service. A client with multiple remote sites can connect to a VPN service provider backbone at multiple locations. The VPN backbone offers connectivity among the different client sites. The characteristics of this connectivity make the service provider cloud look like a private network to the client. No communication is allowed between different VPNs on a VPN service provider backbone. Any device at a VPN site can communicate only with a device at a site belonging to the same VPN. In a service provider network, a Provider Edge router connects to the Customer Edge router at each VPN site. A VPN usually has multiple geographically distributed sites that connect to the service provider's local Provider Edge routers. A VPN site and its associated routes are assigned one or more VPN colors. Each VPN color defines the VPN a site belongs to. A site can communicate with another site connected to the VPN backbone only if it belongs to the same VPN color. A VPN intranet service is provided among all sites connected to the VPN backbone using the same color. A site belonging to one VPN can communicate with a different VPN or with the Internet by using a VPN extranet service. A VPN extranet service can be provided by selectively leaking some external routes of a different VPN or of the Internet into a VPN intranet. Provider Edge routers hold routing information only for the VPNs to which they are directly connected. Provider Edge routers in the VPN provider network are fully meshed using multiprotocol internal BGP (IBGP)[3] peerings. Multiple VPNs can use the same IP version 4 (IPv4) addresses. Examples of such addresses are IP private addresses and unregistered IP addresses. The Provider Edge router needs to distinguish among such addresses of different VPNs. To enable this, a new address family called VPN-IPv4 is defined.
The VPN-IPv4 address is a 12-byte value; the first eight bytes carry the route distinguisher (RD) value, and the last four bytes consist of the IPv4 address. The RD is used to make the private and unregistered IPv4 addresses in a VPN network unique in a service provider backbone. In its common format, the RD consists of a 2-byte type field, a 2-byte autonomous system (AS) number, and a 4-byte value that the provider can assign. VPN-IPv4 addresses are treated as a different address family and are carried in BGP by using the BGP multiprotocol extensions. In these BGP extensions, label-mapping information is carried as part of the Network Layer Reachability Information (NLRI)[4]. The label identifies the output interface connecting to this NLRI. The extended community attribute[5] is used to carry Route Target (RT) and Source of Origin (SOO) values. A route can be associated with multiple RT values, similar to the way the BGP community attribute can carry multiple communities for an IP prefix. RT values are used to control route distribution, because a router can decide to accept or reject a route based on its RT value. The SOO value is used to uniquely identify a VPN site. Table 9-5 lists some important MPLS VPN terminology.
Table 9-5. MPLS VPN Terminology

Customer Edge Router: A customer router that interfaces with a Provider Edge router.
Provider Edge Router: A provider router that interfaces with a Customer Edge router.
Provider Router: A router internal to the provider network. It doesn't have any knowledge of the provisioned VPNs.
VPN Routing and Forwarding (VRF) Instance: A routing and forwarding table associated with one or more directly connected customer sites. A VRF is assigned on the VPN customer interfaces. VPN customer sites sharing the same routing information can be part of the same VRF. A VRF is identified by a name, and it has local significance only.
VPN-IPv4 Address: Includes the 64-bit RD and the 32-bit IPv4 address.
SOO: Identifies the originating site.
RD: A 64-bit attribute used to uniquely identify VPN and customer address space in the provider backbone.
RT: A 64-bit identifier that indicates which routers should receive the route.
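The 12-byte VPN-IPv4 address described above can be assembled in a few lines of Python. This is an illustrative sketch of the encoding (the function name is hypothetical), assuming the common RD format with a type field of 0:

```python
import socket
import struct

def vpn_ipv4_address(asn, assigned, ipv4):
    """Build a 12-byte VPN-IPv4 address: an 8-byte route distinguisher
    (2-byte type 0, 2-byte AS number, 4-byte assigned value) followed by
    the 4-byte IPv4 address."""
    rd = struct.pack("!HHI", 0, asn, assigned)
    return rd + socket.inet_aton(ipv4)

# RD 200:100 from the case study, prepended to a customer route:
addr = vpn_ipv4_address(200, 100, "200.200.2.0")
assert len(addr) == 12
assert addr[:8] == bytes([0, 0, 0, 200, 0, 0, 0, 100])
assert socket.inet_ntoa(addr[8:]) == "200.200.2.0"
```

Because the RD differs per VPN, two customers can both announce 10.0.0.0/8 and still produce distinct VPN-IPv4 addresses in the provider backbone.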
This case study discusses how MPLS VPNs are implemented in a service provider backbone. It studies the configuration used for VPN service implementation and discusses VPN operation across a provider network using various commands on the provider backbone's routers. A VRF instance named green is defined for VPN service to the enterprise customer. An RD of 200:100 is used to distinguish this customer's routes in the service provider backbone. The VPN configuration on Router
SanFrancisco and the VPN operation between the customer San Francisco site and the New York site across the service provider network are discussed next. Listing 9-15 gives the MPLS VPN-related configuration on the SanFrancisco router.

Listing 9-15 Setting Up MPLS VPN on Router SanFrancisco

ip vrf green
 rd 200:100
 route-target export 200:1
 route-target import 200:1
 route-target import 200:2
ip cef
clns routing
!
interface Loopback0
 ip address 222.222.222.1 255.255.255.255
 ip router isis isp
!
interface Hssi0/0
 ip vrf forwarding green
 ip address 222.222.222.25 255.255.255.252
 ip router isis isp
 mpls ip
!
interface POS5/1
 ip address 222.222.222.21 255.255.255.252
 ip router isis isp
 mpls ip
!
interface POS5/3
 ip address 222.222.222.17 255.255.255.252
 ip router isis isp
 mpls ip
!
!
router isis isp
 net 50.0000.0000.0000.0001.00
 metric-style wide
 mpls traffic-eng router-id Loopback0
 mpls traffic-eng level-1
!
router bgp 200
 no synchronization
 neighbor 222.222.222.2 remote-as 200
 neighbor 222.222.222.2 update-source Loopback0
 neighbor 222.222.222.3 remote-as 200
 neighbor 222.222.222.3 update-source Loopback0
 neighbor 222.222.222.4 remote-as 200
 neighbor 222.222.222.4 update-source Loopback0
 no auto-summary
 !
 address-family ipv4 vrf green
  no auto-summary
  no synchronization
  network 200.200.1.0
  network 200.200.11.0
 exit-address-family
 !
 address-family vpnv4
  neighbor 222.222.222.2 activate
  neighbor 222.222.222.2 send-community extended
  neighbor 222.222.222.3 activate
  neighbor 222.222.222.3 send-community extended
  neighbor 222.222.222.4 activate
  neighbor 222.222.222.4 send-community extended
  no auto-summary
 exit-address-family
!
ip route vrf green 200.200.1.0 255.255.255.0 Hssi0/0
ip route vrf green 200.200.11.0 255.255.255.0 Hssi0/0
!

VRF green is enabled on interface Hssi0/0 that connects to the customer San Francisco site on Router SanFrancisco. Listing 9-16 shows the RD and the interfaces used for VRF green.

Listing 9-16 Information on VRF Instance Green

SanFrancisco#sh ip vrf green
Name       Default RD    Interfaces
green      200:100       Hssi0/0
In the first section of the BGP configuration, Router SanFrancisco peers with the other routers through BGP in the service provider backbone. The loopback0 interface addresses are used for this peering because such interfaces never go down. The next section of the BGP configuration specifies the routes belonging to VRF green. Such routes can be learned dynamically using the VRF green instance of protocols, such as BGP and Routing Information Protocol version 2 (RIPv2). A VRF green instance of a protocol runs only on the VRF green interfaces. Hence, all VRF green routes are learned dynamically through the VRF green interfaces or are specified statically as VRF green routes that point to a VRF green interface. On Router SanFrancisco, the customer routes belonging to the San Francisco site are specified using static routes into VRF green. These static routes point to the point-to-point interface connecting to the San Francisco site. You can see static VRF green routes by using the show ip route vrf green command. The last section of the BGP configuration specifies the BGP peering to exchange the VPN-IPv4 address route family. BGP's extended community attribute is used to carry the export RT values. Local VRF routes are installed in the local BGP VPN-IPv4 table if their import RT value matches their specified export RT value. A VRF in a router has to specify an import RT value equal to one of the export RT values carried by a route for the route to be installed in the BGP VPN routing table. On Router SanFrancisco, an export RT of 200:1 is specified for VRF green. All local routes belonging to VRF green on this router are exported through BGP with an RT of 200:1 in the BGP extended community attribute. A router that needs the SanFrancisco routes belonging to the VRF green should import routes with an RT of 200:1 by using the command route-target import 200:1. 
Router SanFrancisco itself should have this command because its import RT should match its export RT for the router to carry the routes through BGP. VRF green specifies an RT of 200:2 in addition to 200:1. Therefore, Router SanFrancisco can import any BGP VPN routes that have an RT of either 200:1 or 200:2 into the VRF green routing table. You can show the VRF green routes carried in BGP by using the show ip bgp vpn vrf green command. You can see the routing table for VRF green by using the show ip route vrf green command. The IP routes and BGP routes belonging to VRF green in Router SanFrancisco are shown in Listings 9-17 and 9-18, respectively.

Listing 9-17 SanFrancisco Routes in the VRF Green Routing Table

SanFrancisco#show ip route vrf green
Codes: C - connected, S - static, I - IGRP, R - RIP, M - mobile, B - BGP
       D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2, E - EGP
       i - IS-IS, L1 - IS-IS level-1, L2 - IS-IS level-2, ia - IS-IS inter area
       * - candidate default, U - per-user static route, o - ODR

Gateway of last resort is not set
     222.222.222.0/30 is subnetted, 1 subnets
C       222.222.222.24 is directly connected, Hssi0/0
B    200.200.22.0/24 [200/0] via 222.222.222.3, 00:01:19
S    200.200.1.0/24 is directly connected, Hssi0/0
B    200.200.2.0/24 [200/0] via 222.222.222.3, 00:01:19
S    200.200.11.0/24 is directly connected, Hssi0/0
The VRF green instance of the routing table shows all the dynamically learned or statically specified VRF green routes. It also shows the BGP VPN-IPv4 routes that are imported into the local VRF green routing table because they match the import RT 200:2.

Listing 9-18 SanFrancisco BGP VPN-IPv4 Routes That Belong to VRF Green

SanFrancisco#sh ip bgp vpn vrf green
BGP table version is 57, local router ID is 222.222.222.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
   Network          Next Hop          Metric LocPrf Weight Path
Route Distinguisher: 200:100 (default for vrf green)
*> 200.200.1.0      0.0.0.0                0         32768 i
*>i200.200.2.0      222.222.222.3          0    100      0 3456 i
*> 200.200.11.0     0.0.0.0                0         32768 i
*>i200.200.22.0     222.222.222.3          0    100      0 3456 i

BGP exchanges label bindings along with its VPN-IPv4 routes. A BGP label indicates the local output interface that can reach the route. Listing 9-19 shows the label bindings for the VPN-IPv4 routes carried by BGP.

Listing 9-19 Label Bindings for the VPN-IPv4 Routes Carried by BGP

SanFrancisco#sh ip bgp vpn vrf green label
   Network          Next Hop         In label/Out label
Route Distinguisher: 200:100 (green)
   200.200.1.0      0.0.0.0          26/nolabel
   200.200.2.0      222.222.222.3    nolabel/39
   200.200.11.0     0.0.0.0          27/nolabel
   200.200.22.0     222.222.222.3    nolabel/40

Listing 9-19 shows that 39 is the label the NewYork router, with router ID 222.222.222.3, advertised as a label binding for the 200.200.2.0 route. The next several listings show how an IP packet from the customer San Francisco site reaches the 200.200.2.1 address in the New York site across the service provider network. Listing 9-20 shows label forwarding information for the prefix 222.222.222.3.
Listing 9-20 Label Forwarding Information for the 222.222.222.3 Prefix

SanFrancisco#show mpls forwarding-table 222.222.222.3
Local  Outgoing     Prefix            Bytes mpls  Outgoing   Next Hop
label  label or VC  or Tunnel Id      switched    interface
36     37           222.222.222.3/32  0           Po5/3      point2point
       30           222.222.222.3/32  7161        Po5/1      point2point

Depending on the nature of load balancing over parallel paths, the router can choose either label for 222.222.222.3. In this case, it chose label 30 through interface Po5/1. After a packet for the 200.200.2.1 prefix arrives on Router SanFrancisco, the router attaches a label stack of two labels (30 and 39) and sends the packet on interface PO5/1 toward Router Chicago. The outer label takes the packet to the next hop 222.222.222.3, and the inner label switches the packet to its actual destination
200.200.2.1. Listing 9-21 shows the label forwarding information on Router Chicago for an incoming label of 30.

Listing 9-21 The Label Forwarding Table for Prefix 222.222.222.3

Chicago#sh mpls forwarding-table 222.222.222.3 32
Local  Outgoing     Prefix            Bytes label  Outgoing   Next Hop
label  label or VC  or Tunnel Id      switched     interface
30     Pop label    222.222.222.3/32  7161         Po0/0      point2point

Being the penultimate hop to the next-hop address 222.222.222.3, Router Chicago pops the outer label of 30 and sends the packet toward Router NewYork through interface POS0/0, as shown in Listing 9-22.

Listing 9-22 Debug Output Showing How the Router Chicago Label-Switched an MPLS Packet from the Customer San Francisco Site to the 200.200.2.1 Address of the Customer New York Site

MPLS: Po0/1: recvd: CoS=0, TTL=255, Label(s)=30/39
MPLS: Po0/0: xmit: CoS=0, TTL=254, Label(s)=39

On arrival at Router NewYork, the packet still carries the inner label, 39. Listing 9-23 shows the label forwarding table in Router NewYork for an incoming label of 39.

Listing 9-23 Router NewYork's Label Forwarding Table for VRF Green

NewYork#sh mpls forwarding-table vrf green
Local  Outgoing     Prefix              Bytes mpls  Outgoing   Next Hop
label  label or VC  or Tunnel Id        switched    interface
39     Unlabeled    200.200.2.0/24[V]   520         Hs0/0      point2point
40     Unlabeled    200.200.22.0/24[V]  0           Hs0/0      point2point

Based on the label forwarding information, Router NewYork removes the incoming label 39 and sends the packet out on interface Hssi0/0, which connects to the New York site. This label-switching action is depicted in Listing 9-24 by using debug commands in Router NewYork.

Listing 9-24 Debug Output Showing How Router NewYork Label-Switched an MPLS Packet from SanFrancisco to the 200.200.2.1 Address

MPLS: Po8/0: recvd: CoS=0, TTL=254, Label(s)=39
MPLS: Hs0/0: xmit: (no label)

Listing 9-25 gives the VPN-related configuration on Router NewYork.

Listing 9-25 VPN Setup on Router NewYork

!
ip vrf green
 rd 200:100
 route-target export 200:2
 route-target import 200:1
 route-target import 200:2
ip cef
clns routing
!
interface Loopback0
 ip address 222.222.222.3 255.255.255.255
 ip router isis isp
!
interface POS8/0
 ip address 222.222.222.9 255.255.255.252
 ip router isis isp
 mpls ip
!
interface Hssi0/0
 ip vrf forwarding green
 ip address 222.222.222.37 255.255.255.252
 ip router isis isp
 mpls ip
!
interface POS8/2
 ip address 222.222.222.14 255.255.255.252
 ip router isis isp
 mpls ip
!
router isis isp
 net 50.0000.0000.0000.0003.00
 metric-style wide
 mpls traffic-eng router-id Loopback0
 mpls traffic-eng level-1
!
router bgp 200
 no synchronization
 neighbor 222.222.222.1 remote-as 200
 neighbor 222.222.222.1 update-source Loopback0
 neighbor 222.222.222.2 remote-as 200
 neighbor 222.222.222.2 update-source Loopback0
 neighbor 222.222.222.4 remote-as 200
 neighbor 222.222.222.4 update-source Loopback0
 no auto-summary
 !
 address-family ipv4 vrf green
  neighbor 222.222.222.38 remote-as 3456
  neighbor 222.222.222.38 activate
  no auto-summary
  no synchronization
 exit-address-family
 !
 address-family vpnv4
  neighbor 222.222.222.1 activate
  neighbor 222.222.222.1 send-community extended
  neighbor 222.222.222.2 activate
  neighbor 222.222.222.2 send-community extended
  neighbor 222.222.222.4 activate
  neighbor 222.222.222.4 send-community extended
  no auto-summary
 exit-address-family
!

On Router NewYork, the second section of the BGP configuration shows that BGP peering is used to learn the VRF green routes from the New York site. Note that on Router SanFrancisco, discussed earlier, the San Francisco site's VRF green routes were configured statically.
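The label operations traced in Listings 9-20 through 9-24 can be sketched in a few lines of Python. This is an illustrative model only; the dictionaries stand in for each router's label forwarding table and are not IOS data structures:

```python
# Simplified model of the MPLS VPN label operations from Listings 9-20 to 9-24.
# Each router maps the packet's top (outer) label to an action: swap it,
# pop it (penultimate hop), or remove the VPN label at the egress PE.

def forward(router_tables, router, label_stack):
    """Apply one router's label operation to the packet's label stack."""
    action = router_tables[router][label_stack[0]]
    if action == "pop":       # penultimate hop pop of the outer label
        return label_stack[1:]
    if action == "untag":     # egress PE removes the VPN label, forwards as IP
        return label_stack[1:]
    raise ValueError(action)

tables = {
    # SanFrancisco imposes the stack [30, 39], so Chicago sees top label 30.
    "Chicago": {30: "pop"},    # penultimate hop for next hop 222.222.222.3
    "NewYork": {39: "untag"},  # VPN label for the VRF green route
}

stack = [30, 39]                           # outer IGP label, inner VPN label
stack = forward(tables, "Chicago", stack)  # Chicago pops 30
stack = forward(tables, "NewYork", stack)  # NewYork removes 39; plain IP leaves Hssi0/0
```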
The traffic on an MPLS VPN-enabled network's access port is said to be committed if the access port's incoming and outgoing traffic falls below the contracted CAR and CDR, respectively. Committed packets are delivered with a probability higher than that of uncommitted traffic. Because of the connectionless nature of an IP VPN service, you can send packets from any site to any site within a VPN, but you must specify the committed traffic rate for a site's outgoing and incoming traffic separately. Because Frame Relay is connection-oriented, by contrast, the same traffic rate applies to both ends of a circuit. To implement CAR and CDR service on an access port, a traffic policing function is applied to both incoming and outgoing traffic. The policing function marks committed traffic with a higher IP precedence value than uncommitted traffic. In the service provider VPN backbone, the WFQ and WRED differentiated QoS functions are applied to deliver committed traffic at a higher probability than uncommitted traffic. Table 9-6 illustrates the QoS functions applied to traffic going from one VPN site to the other through an MPLS VPN provider network.

Table 9-6. MPLS VPN QoS Functions

Step  Place of Application  QoS Function
1     Ingress router        (Option 1) CAR polices the incoming traffic from the Customer Edge router on the service provider's Provider Edge router and sets the IP precedence value according to the traffic profile and contract. The IP packet's IP precedence value is copied into the MPLS CoS field. (Option 2) CAR polices the incoming traffic from the Customer Edge router on the service provider's Provider Edge router and sets the MPLS CoS value according to the traffic profile and contract.
2     MPLS backbone         Traffic differentiation based on the MPLS CoS field by using the WFQ and WRED IP QoS functions.
3     Egress router         The CoS field of the MPLS label is copied to the IP precedence field in the IP header.
4     Egress router         CAR does outbound traffic policing, based on CDR, on the Provider Edge router interface connecting to the destination Customer Edge router.
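Step 1 of the table, policing at the committed rate and marking by precedence, can be sketched with a simple token bucket. The class, parameter values, and precedence choices below are illustrative, not the IOS CAR implementation:

```python
# Sketch of the ingress policing step in Table 9-6 (Option 1): a token bucket
# filled at the committed rate marks in-contract packets with a higher
# IP precedence than out-of-contract packets. All values are illustrative.

class Policer:
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0          # refill rate in bytes per second
        self.tokens = float(burst_bytes)    # bucket starts full
        self.burst = float(burst_bytes)
        self.last = 0.0

    def mark(self, size, now, conform_prec=4, exceed_prec=0):
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= size:             # committed (conforming) traffic
            self.tokens -= size
            return conform_prec
        return exceed_prec                  # uncommitted (exceeding) traffic

p = Policer(rate_bps=64000, burst_bytes=8000)
first = p.mark(8000, now=0.0)    # drains the full burst: conforms
second = p.mark(8000, now=0.0)   # no tokens left: exceeds
```

In the backbone, WRED then drops the low-precedence (uncommitted) packets first under congestion, which is what gives committed traffic its higher delivery probability.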
The service provider provides the CAR and CDR services for each VPN customer access port, as shown in Figure 9-9. The VPN service provider provisions its network to deliver its access ports' CAR and CDR rates. Any traffic over the committed rates is dropped with a higher probability than the committed traffic.
Committed packets in a properly provisioned MPLS VPN network are delivered with high probability, providing a service level similar to that of CIR in Frame Relay networks.
Guaranteed QoS
Guaranteed QoS requires the use of RSVP end-to-end along the path, from the source to the destination at the VPN sites connected by the VPN service provider backbone. The extent and level of guaranteed QoS depend on which part of the network makes explicit reservations through RSVP PATH messages. The three levels of guaranteed QoS deployment are discussed next. Figure 10-10 notwithstanding, Figure 9-10 depicts these three options, which vary primarily in the QoS offered by the VPN service provider.
RSVP at VPN Sites and Diff-Serv Across the Service Provider Backbone
RSVP reservations are made only on the nodes in the VPN sites. At the ingress to the VPN service provider, the guaranteed traffic is marked with a high MPLS CoS value so that it is delivered across the MPLS VPN service provider's network with a high degree of probability by IP QoS functions such as WFQ and WRED. The MPLS VPN service provider forwards RSVP packets as normal IP data packets. Thus, reserved traffic receives better service without the service provider having to keep any per-customer reservation state in the provider network.
Summary
MPLS enables packet switching based on the label information in the packet without the need to look into the packet's IP header. Packet switching is done using label swapping rather than best-match forwarding on the destination address in the IP forwarding table. An MPLS label carries three CoS bits to indicate packet precedence, similar to the precedence field in the IP header. You can use the MPLS CoS field to indicate packet precedence within the MPLS network and to deliver end-to-end IP QoS across an MPLS cloud. One important MPLS application is MPLS-based VPNs. MPLS VPNs enable a scalable VPN solution with the help of routing protocols that restrict the topology information a VPN site can learn to that of its own VPN. MPLS VPNs can deliver QoS functionality based on CAR and CDR, with differentiated QoS in the service provider core. Guaranteed Bandwidth (GB) tunnels are used to offer guaranteed VPN QoS.
References
1. "Multiprotocol Label Switching Architecture," E. Rosen, A. Viswanathan, and R. Callon, IETF Draft, Work in Progress.
2. "BGP/MPLS VPNs," E. Rosen and Y. Rekhter, RFC 2547.
3. "Multiprotocol Extensions for BGP-4," T. Bates, R. Chandra, D. Katz, and Y. Rekhter, RFC 2283.
4. "Carrying Label Information in BGP-4," Y. Rekhter and E. Rosen, IETF Draft, Work in Progress.
5. "BGP Extended Communities Attribute," S. Ramachandra and D. Tappan, draft-ramachandra-bgp-extcommunities.txt.
For the traffic from Router A to Router B, for example, the Layer 2 cloud offers three physical paths: A->1->4->B, A->1->2->3->4->B, and A->1->6->5->4->B. The actual path the IP traffic takes, however, is the path predetermined by the Layer 2 switches in the network. The use of the explicit Layer 2 transit layer gives you exact control over how traffic uses the available bandwidth, in ways not currently possible by adjusting the Layer 3 IP routing metrics. Large mesh networks mean extra infrastructure costs. They can also cause scalability concerns for the underlying IGP routing protocol, such as OSPF or IS-IS, because the normal IGP flooding mechanism is inefficient in large mesh environments.
RRR
Routing protocols such as OSPF and IS-IS route traffic using information on the network topology and the link metrics. In addition to the information supplied by a routing protocol, RRR routes an IP packet taking into consideration its traffic class, the traffic class's resource requirements, and the available network resources. Figure 10-2 shows two paths from San Francisco to New York in a service provider network: one through Dallas and another through Chicago. The service provider noticed that the traffic from San Francisco to New York usually takes the San Francisco->Chicago->New York path. This path becomes heavily congested during a certain period of the day, however. It was also noted that during this period of congestion, the alternate path from San Francisco to New York through Dallas is heavily underutilized. The need is to engineer the traffic such that it is routed across the network in a way that best utilizes all the available network resources. In this case, all the traffic between San Francisco and New York gets poor performance because the network cannot use the available alternate path between these two sites during a certain period of the day. This scenario typifies the application of, and the need for, TE. Case Study 10-1 discusses a more detailed scenario to illustrate the workings of MPLS TE.
Figure 10-2 TE Tunnel from the San Francisco Router to the New York Router
After some traffic analysis, it was clear that if all New York-bound traffic from San Francisco is carried along the path through Dallas rather than Chicago during its period of congestion, both the paths will be optimally utilized. Therefore, a TE tunnel is established between San Francisco and New York. It is called a tunnel because the path taken by the traffic is predetermined at the San Francisco router and not by a hop-by-hop routing decision. Normally, the TE tunnel takes the path through Chicago. During the period of congestion at Chicago, however, the TE path changes to the path through Dallas. In this case, TE resulted in optimal utilization of the available network resources while avoiding points of network congestion. You can set up the TE path to change back to the path through Chicago after sufficient network resources along that path become available. RRR TE requires the user to define the traffic trunk, the resource requirements and policies on the traffic tunnel, and the computation or specification of the explicit path the TE tunnel will take. In this example, the traffic trunk is New York-bound traffic from San Francisco, the resource requirements are the TE tunnel's bandwidth and other policies, and the explicit path is the path from San Francisco to New York through Dallas. An RRR operational model is shown in Figure 10-3. It depicts the various operational functional blocks of the TE in a flowchart format. The following sections discuss each function in detail.
Note Prior to deploying TE in a network, it is important to profile the traffic flow patterns and statistics at various points in the network. You can do this using offline tools and techniques that are beyond the scope of this book. The goal is to come up with a traffic model that best optimizes the network resources.
TE Trunk Definition
The TE trunk defines the class of packets carried on a TE tunnel. This policy is local to the head-end router originating the TE tunnel setup. As discussed in Chapter 3, "Network Boundary Traffic Conditioners: Packet Classifier, Marker, and Traffic Rate Management," a traffic class is defined flexibly based on the Transmission Control Protocol/Internet Protocol (TCP/IP) traffic headers. A traffic class can be based on a single parameter, such as the IP destination or the MPLS Class of Service (CoS) field, or on a number of parameters, such as all File Transfer Protocol (FTP) traffic going from a certain sender to a specific destination. In RRR, all packets of a traffic class take a common path across the network, either statically specified or dynamically determined. For this reason, RRR traffic classes are also termed traffic trunks.
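As a sketch, a traffic trunk's class can be modeled as a predicate over packet header fields. The addresses and field names below are hypothetical, chosen only to mirror the FTP example in the text:

```python
# Illustrative classifier for an RRR traffic trunk: all FTP control traffic
# (TCP port 21) from one hypothetical sender to one hypothetical destination
# maps to the same trunk and therefore takes the same path across the network.

def in_trunk(pkt):
    return (pkt.get("proto") == "tcp"
            and pkt.get("dst_port") == 21        # FTP control channel
            and pkt.get("src") == "10.1.1.1"     # hypothetical sender
            and pkt.get("dst") == "10.2.2.2")    # hypothetical destination

ftp = {"proto": "tcp", "dst_port": 21, "src": "10.1.1.1", "dst": "10.2.2.2"}
web = {"proto": "tcp", "dst_port": 80, "src": "10.1.1.1", "dst": "10.2.2.2"}
```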
TE Tunnel Attributes
A TE tunnel is given attributes to describe the traffic trunk's requirements and to specify various administrative policies. This section discusses the various tunnel attributes.
Bandwidth
The bandwidth attribute shows the end-to-end bandwidth required by a TE tunnel. You can define it based on the requirements of the traffic class being carried within the TE tunnel.
Adaptability
The adaptability attribute specifies whether an existing TE tunnel needs to be reoptimized when a path better than the current TE tunnel path comes up. Reoptimization is discussed later in this chapter.
Resilience
The resilience attribute specifies the desired behavior if the current TE tunnel path no longer exists. This typically occurs due to network failures or preemption. Restoration of a TE tunnel when the current path doesn't work is addressed later in this chapter.
Available Bandwidth
The available bandwidth attribute describes the amount of bandwidth available at each setup priority. The advertised available bandwidth might not equal the actual available bandwidth: in certain situations, a network operator can choose to oversubscribe a link by assigning a value higher than the link's actual bandwidth. Note Available bandwidth for a higher setup priority should always be more than or equal to that for a lower setup priority.
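The interaction between reservations and the per-priority available bandwidth can be sketched as follows. The pool-update rule shown is a simplified model consistent with the note's invariant, not the actual IOS accounting code:

```python
# Available bandwidth is tracked per setup priority (0 = highest, 7 = lowest).
# A reservation at priority p reduces the bandwidth available at p and at every
# numerically higher (lower-precedence) priority, which preserves the note's
# invariant: a higher priority always sees at least as much bandwidth.

def reserve(avail, prio, bw):
    """avail[p] = bandwidth reservable at setup priority p (kbps)."""
    if avail[prio] < bw:
        return False                  # admission control fails at this priority
    for p in range(prio, len(avail)):
        avail[p] -= bw                # lower priorities lose the bandwidth too
    return True

avail = [1158] * 8                    # maximum reservable, all priorities
reserve(avail, 7, 10)                 # e.g. a 10-kbps tunnel at priority 7
# avail[0..6] stay 1158; avail[7] drops to 1148, so avail[i] >= avail[i+1]
```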
Resource Class
The resource class attribute colors the link. As was discussed earlier in this chapter, a tunnel decides to include or exclude a link in its path selection computation based on the link's resource class attribute and its own resource class affinity attribute.
Step 2. Prunes links with insufficient bandwidth or links that fail the resource policy.

Step 3. Runs a separate Shortest Path First (SPF) algorithm to compute the shortest (minimum metric) path on the IGP protocol's link-state database after removing any pruned links. This instance of the SPF algorithm is specific to the TE path in question and is different from the SPF algorithm a router uses to build its routing table from the entire link-state database.

An explicit path for the TE tunnel is computed from this SPF run. The computed explicit path is expressed as a sequence of router IP addresses. Upon a request to establish a TE tunnel, an explicit path is used in establishing the TE tunnel based on the path selection order.
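Steps 2 and 3 amount to a constrained SPF: prune, then run a plain shortest-path computation on what remains. The following sketch uses an illustrative topology loosely based on Figure 10-2; it is a model of the idea, not the router's implementation:

```python
# Constrained SPF sketch: Step 2 prunes links that lack bandwidth or fail the
# resource-class (affinity) policy; Step 3 runs Dijkstra on the pruned graph.
import heapq

def cspf(links, src, dst, need_bw, affinity=0x0, mask=0xFFFF):
    # Step 2: prune links with insufficient bandwidth or failing affinity.
    usable = [(a, b, m) for (a, b, m, bw, attr) in links
              if bw >= need_bw and (attr & mask) == affinity]
    adj = {}
    for a, b, m in usable:
        adj.setdefault(a, []).append((b, m))
    # Step 3: shortest (minimum metric) path on the pruned topology.
    dist, heap, prev = {src: 0}, [(0, src)], {}
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:                      # rebuild the explicit path
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return list(reversed(path))
        if d > dist.get(u, float("inf")):
            continue
        for v, m in adj.get(u, []):
            if d + m < dist.get(v, float("inf")):
                dist[v], prev[v] = d + m, u
                heapq.heappush(heap, (d + m, v))
    return None                           # no path satisfies the constraints

# Illustrative topology: (from, to, metric, reservable_bw_kbps, attribute_flags)
links = [("SF", "Chicago", 10, 1158, 0x0), ("Chicago", "NY", 10, 1158, 0x1),
         ("SF", "Dallas", 10, 1158, 0x0), ("Dallas", "NY", 10, 1158, 0x0),
         ("Chicago", "Dallas", 10, 1158, 0x0)]
path = cspf(links, "SF", "NY", need_bw=10)  # Chicago->NY fails the affinity test
```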
TE Tunnel Setup
TE-RSVP is used to signal TE tunnels[2]. It uses the same messages as the original RSVP signaling protocol, with certain modifications and extensions to support this new application. TE-RSVP helps build an explicitly routed LSP to establish a TE tunnel. TE-RSVP adds two important capabilities that enable it to build an Explicitly Routed Label-Switched Path (ER-LSP): a way to bind labels to RSVP flows, and a way to explicitly route RSVP messages. On an ER-LSP-based TE tunnel, the only sender on the TE tunnel is the LSP's first node, and the only destination is the LSP's last node. All intermediate nodes in the LSP do normal label switching based on the incoming label. The first node in the LSP, also referred to as the head-end router, initiates ER-LSP creation. The head-end router initiates a TE tunnel setup by sending an RSVP PATH message to the tunnel destination IP address with a Source Route Object (SRO) specifying the explicit route. The SRO contains a list of IP addresses with a pointer to the next hop in the list. All nodes in the network forward the PATH message to the next-hop address based on the SRO. They also add the SRO to their path state block. When the destination receives the PATH message, it recognizes from the Label Request Object (LRO) present in the PATH message that it needs to set up an ER-LSP, and it generates an RSVP reservation request (RESV) message for the session. The destination also allocates a label and sends it as the LABEL object in the RSVP RESV message that it sends toward the sender. A node receiving the RESV message uses that label to send all traffic over the path. It then allocates a new label of its own and sends it as a LABEL object in the RSVP RESV message to the next node toward the sender. This is the label the node expects on all incoming traffic for this path.
Note The RSVP RESV message follows exactly the reverse path as the RSVP PATH message because the SRO in the path state block established at each node by the PATH message defines how the RSVP RESV message is forwarded upstream.
An ER-LSP is formed and the TE tunnel is established as a result of these operations. Note that no resource reservations are necessary if the traffic being carried is best-effort traffic. When resources need to be allocated
to an ER-LSP, the normal RSVP objects, Tspec and Rspec, are used for this purpose. The sender Tspec in the PATH message defines the traffic being sent over the path. The destination of the RSVP PATH message uses this information to construct the appropriate receiver Tspecs and Rspecs used for resource allocation at each node along the ER-LSP.
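The RESV-driven label distribution described above can be sketched as follows. The label numbering and data structures are illustrative, not the protocol encoding:

```python
# Sketch of TE-RSVP label binding: the PATH message carries the explicit route
# downstream; the RESV message then walks the route in reverse, and each node
# allocates the label it expects on incoming traffic ("in") while learning its
# outgoing label ("out") from the LABEL object sent by the downstream node.

def signal_er_lsp(route, first_label=26):
    labels = iter(range(first_label, first_label + len(route)))
    lfib = {}
    out_label = None                     # the tail node terminates the LSP
    for node in reversed(route):
        in_label = next(labels)          # label this node advertises upstream
        lfib[node] = {"in": in_label, "out": out_label}
        out_label = in_label             # carried upstream in the LABEL object
    lfib[route[0]]["in"] = None          # the head end only imposes a label
    return lfib

lfib = signal_er_lsp(["SanFrancisco", "Chicago", "NewYork"])
# Head end imposes lfib["SanFrancisco"]["out"]; Chicago swaps "in" for "out";
# NewYork recognizes its "in" label and terminates the ER-LSP.
```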
TE Path Maintenance
TE path maintenance performs path reoptimization and restoration functions. These operations are carried out after the TE tunnel is established. Path reoptimization describes the desired behavior in case a better potential TE path comes up after a TE path has already been established. With path reoptimization, a router should look for opportunities to reoptimize an existing TE path. It is indicated to the router by the TE tunnel's adaptability attribute. Path restoration describes how a TE tunnel is restored when the current path doesn't work. The TE tunnel's resilience attribute describes this behavior.
TE-RSVP
TE-RSVP extends the available RSVP protocol to support LSP path signaling. TE-RSVP uses RSVP's available signaling messages, making certain extensions to support TE. Some important extensions include the following:

Label reservation support. To use RSVP for LSP tunnel signaling, RSVP needs to support label reservations and installation. Unlike normal RSVP flows, TE-RSVP uses RSVP for label reservations for flows without any bandwidth reservations. A new type of FlowSpec object is added for this purpose. TE-RSVP also performs label management to reserve labels for flows.

Source routing support. LSP tunnels use explicit source routing. Explicit source routing is implemented in RSVP by introducing a new object, the SRO.

RSVP host support. In TE-RSVP, RSVP PATH and RESV messages are originated by the network head-end routers. This is unlike the original RSVP, in which RSVP PATH and RESV messages are generated by applications in end hosts. Hence, TE-RSVP requires RSVP host support in routers.

Support for identification of the ER-LSP-based TE tunnel. New types of Filter_Spec and Sender_Template objects are used to carry the tunnel identifier. The Session object is also allowed to carry a null IP protocol number because an LSP tunnel is likely to carry IP packets of many different protocol numbers.

Support for a new reservation removal algorithm. A new RSVP message, RESV Tear Confirm, is added to reliably tear down an established TE tunnel.

A summary of the RSVP objects that were added or modified to support TE is tabulated in Table 10-3.

Table 10-3. New or Modified RSVP Objects for TE and Their Functions

RSVP Object        RSVP Message  Purpose
Label              RESV          Performs label distribution.
Label Request      PATH          Used to request label allocation.
Source Route       PATH          Specifies the explicit source route.
Record Route       PATH, RESV    Used for diagnosis; records the path taken by the RSVP message.
Session Attribute  PATH          Specifies the holding priority and setup priority.
Session            PATH          Can carry a null IP protocol number.
Sender_Template    PATH          Can carry a tunnel identifier to enable ER-LSP identification.
Filter_Spec        RESV          Can carry a tunnel identifier to enable ER-LSP identification.
IS-IS Modifications
The IS reachability TLV is extended to carry the new data for link resource information. The extended IS reachability TLV is TLV type 22. Within this TLV, various sub-TLVs are used to carry the link attributes for TE.
OSPF Modifications
Because the baseline OSPF Router LSA is essentially nonextensible, OSPF extensions for TE use the Opaque LSA[5]. Three types of Opaque LSAs exist, each having a different flooding scope. OSPF extensions for TE use only Type 10 LSAs, which have a flooding scope of an area. The new Type 10 Opaque LSA for TE is called the TE LSA. This LSA describes routers and point-to-point links (similar to a Router LSA). For TE purposes, the existing Network LSA suffices for describing multiaccess links, so no additional LSA is defined for this purpose.
TE Approaches
The example discussed in the beginning of this chapter typifies one approach to TE: engineering traffic around points of congestion. This approach, which scopes TE to only a few paths, might not work when the rest of the traffic in the network is carried without TE, however. A commonly recommended approach for TE uses a full mesh of LSP-based TE tunnels between all the edge routers in a service provider network. Generally, these edge routers are customer Points of Presence (POP) routers, or routers with peer connections to other providers.
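A full mesh of unidirectional TE tunnels among n edge routers requires n*(n-1) tunnels, which is worth keeping in mind when sizing this approach:

```python
# Unidirectional TE tunnels in a full mesh of n edge routers: each router
# heads one tunnel to every other router, so the count grows as n*(n-1),
# roughly quadratically with the number of edge routers.

def full_mesh_tunnels(n):
    return n * (n - 1)

counts = {n: full_mesh_tunnels(n) for n in (4, 10, 50)}
# 4 routers need 12 tunnels; 50 routers already need 2450
```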
Discuss the operation of the TE tunnel and the TE extensions to IS-IS and RSVP based on the various show and debug commands on the routers. All four routers are configured with the addressing scheme and connectivity shown in Figure 10-4. Integrated IS-IS is used as the IGP. IS-IS is set up for 32-bit-wide metric values, as opposed to the normal 6-bit metrics. The Loopback0 interface IP address is enabled as the router-id for MPLS TE. To enable MPLS TE, the mpls traffic-eng tunnels command is enabled at both the global and interface levels. RSVP is enabled on all the interfaces to facilitate the RSVP signaling that establishes the TE tunnels. RSVP performs two major tasks: It distributes labels to establish a label-switched path along an explicit path for the tunnel, and it makes any necessary bandwidth reservations along the label-switched path. Weighted Fair Queuing (WFQ) is enabled on the interfaces because RSVP depends on WFQ to provide bandwidth guarantees. Router SanFrancisco initiates the TE tunnel to the NewYork router; interface tunnel0 is configured for this purpose. The tunnel mode is set to TE by the tunnel mode mpls traffic-eng command. A bandwidth reservation of 10 kbps is requested along the tunnel path. The setup and holding priorities for the tunnel are both set to 7, the numerically highest priority value.
An explicit TE tunnel path through the Chicago router is specified by an explicit path named sfny. A path option of 1 indicates that the explicitly specified path is the preferred path to establish the TE tunnel. A dynamically determined TE path is specified as the fail-over path. Note that the path option specifies a path's order of preference to be the TE tunnel path. Listings 10-1 through 10-4 list the configuration required on each router. Note that the listings show only the MPLS-related configuration, removing any extraneous information.

Listing 10-1 Configuration on Router SanFrancisco

ip cef
clns routing
mpls traffic-eng tunnels
!
interface Loopback0
 ip address 222.222.222.1 255.255.255.255
 ip router isis isp
!
interface Tunnel0
 ip unnumbered Loopback0
 tunnel destination 222.222.222.3
 tunnel mode mpls traffic-eng
 tunnel mpls traffic-eng autoroute announce
 tunnel mpls traffic-eng priority 7 7
 tunnel mpls traffic-eng bandwidth 10
 tunnel mpls traffic-eng path-option 1 explicit name sfny
 tunnel mpls traffic-eng path-option 2 dynamic
 tunnel mpls traffic-eng record-route
!
interface POS5/1
 ip address 222.222.222.21 255.255.255.252
 ip router isis isp
 mpls traffic-eng tunnels
 fair-queue 64 256 36
 ip rsvp bandwidth 1158 1158
!
interface POS5/3
 ip address 222.222.222.17 255.255.255.252
 ip router isis isp
 mpls traffic-eng tunnels
 fair-queue 64 256 36
 ip rsvp bandwidth 1158 1158
!
router isis isp
 net 50.0000.0000.0000.0001.00
 metric-style wide
 mpls traffic-eng router-id Loopback0
 mpls traffic-eng level-1
!
ip explicit-path name sfny enable
 next-address 222.222.222.22
 next-address 222.222.222.9

Listing 10-2 Configuration on Router Chicago

ip cef
clns routing
mpls traffic-eng tunnels
!
interface Loopback0
 ip address 222.222.222.2 255.255.255.255
 ip router isis isp
!
interface POS0/0
 ip address 222.222.222.10 255.255.255.252
 ip router isis isp
 mpls traffic-eng tunnels
 fair-queue 64 256 468
 ip rsvp bandwidth 1158 1158
!
interface POS0/1
 ip address 222.222.222.22 255.255.255.252
 ip router isis isp
 mpls traffic-eng tunnels
 fair-queue 64 256 36
 ip rsvp bandwidth 1158 1158
!
interface POS0/3
 ip address 222.222.222.25 255.255.255.252
 ip router isis isp
 mpls traffic-eng tunnels
 fair-queue 64 256 36
 ip rsvp bandwidth 1158 1158
!
router isis isp
 net 50.0000.0000.0000.0002.00
 metric-style wide
 mpls traffic-eng router-id Loopback0
 mpls traffic-eng level-1

Listing 10-3 Configuration on Router NewYork

interface Loopback0
 ip address 222.222.222.3 255.255.255.255
 ip router isis isp
!
interface POS8/0
 ip address 222.222.222.9 255.255.255.252
 ip router isis isp
 mpls traffic-eng tunnels
 fair-queue 64 256 36
 ip rsvp bandwidth 1158 1158
!
interface POS8/2
 ip address 222.222.222.14 255.255.255.252
 ip router isis isp
 mpls traffic-eng tunnels
 fair-queue 64 256 36
 ip rsvp bandwidth 1158 1158
!
router isis isp
 net 50.0000.0000.0000.0003.00
 metric-style wide
 mpls traffic-eng router-id Loopback0
 mpls traffic-eng level-1
!

Listing 10-4 Configuration on Router Dallas

clns routing
mpls traffic-eng tunnels
!
interface Loopback0
 ip address 222.222.222.4 255.255.255.255
 ip router isis isp
!
interface POS9/0
 ip address 222.222.222.18 255.255.255.252
 ip router isis isp
 mpls traffic-eng tunnels
 fair-queue 64 256 36
 ip rsvp bandwidth 1158 1158
!
interface POS9/1
 ip address 222.222.222.13 255.255.255.252
 ip router isis isp
 mpls traffic-eng tunnels
 fair-queue 64 256 36
 ip rsvp bandwidth 1158 1158
!
interface POS9/2
 ip address 222.222.222.26 255.255.255.252
 ip router isis isp
 mpls traffic-eng tunnels
 fair-queue 64 256 36
 ip rsvp bandwidth 1158 1158
!
router isis isp
 net 50.0000.0000.0000.0004.00
 metric-style wide
 mpls traffic-eng router-id Loopback0
 mpls traffic-eng level-1
!

After enabling the preceding configurations on the routers, check the TE tunnel's status by using the show mpls traffic-eng tunnel tunnel0 command. Listing 10-5 shows the output of this command. The output shows the tunnel's status and the path option used in setting up the tunnel. In addition, it shows the tunnel configuration and the RSVP signaling information.

Listing 10-5 The show mpls traffic-eng tunnel tunnel0 Command Output When the Tunnel Is Taking the Path of the Administratively Defined Explicit Route

SanFrancisco#sh mpls tr tun t0
Name: SanFrancisco_t0                    (Tunnel0) Destination: 222.222.222.3
  Status:
    Admin: up        Oper: up     Path: valid      Signalling: connected
    path option 1, type explicit sfny (Basis for Setup, path weight 20)
    path option 2, type dynamic
  Config Parameters:
    Bandwidth: 10    Priority: 7 7    Affinity: 0x0/0xFFFF
    AutoRoute: enabled    LockDown: disabled
  InLabel  :
  OutLabel : POS5/1, 26
  RSVP Signalling Info:
       Src 222.222.222.1, Dst 222.222.222.3, Tun_Id 0, Tun_Instance 1
    RSVP Path Info:
      My Address: 222.222.222.1
      Explicit Route: 222.222.222.22 222.222.222.9
      Record Route:
      Tspec: ave rate=10 kbits, burst=1000 bytes, peak rate=10 kbits
    RSVP Resv Info:
      Record Route: 222.222.222.9 222.222.222.22
      Fspec: ave rate=10 kbits, burst=1000 bytes, peak rate=Inf

As is evident from the output, the tunnel was set up using the user-specified explicit path through Chicago. The RSVP signaling information shows the label 26 and the outgoing interface POS5/1 used to send packets on the tunnel path. Because the Record Route option was enabled on the TE tunnel by the tunnel mpls traffic-eng record-route command, the RSVP RESV information records the route the message took. After the tunnel has been operational, assume that the link from Chicago to New York is no longer usable in the tunnel path. The Chicago-to-New York link can be made unusable by simply bringing down the link, or by changing the local link resource class attribute. In this case study, the resource class attribute on the link from the Chicago router to the NewYork router is changed to 0x1. As a result, a bit-wise logical AND operation between the link resource class (0x1) and the tunnel resource class mask (0x0000FFFF) is no longer equal to the tunnel resource affinity value (0x0000). Hence, the link cannot be used in the tunnel path. The resource class attribute on the link from the Chicago router to the NewYork router is changed to 0x1 by using the mpls traffic-eng attribute-flags 0x1 configuration command on the Chicago router's POS0/0 interface. This local link resource attribute change on the Chicago router makes the explicit path unusable because it fails the resource affinity test. Hence, the TE tunnel fails over to a dynamically selected path. While selecting a dynamic path, the router removes the Chicago-to-NewYork link from the network topology based on its local link resource policy. The router then recomputes the TE tunnel path to run through Dallas by running an instance of the SPF algorithm for the tunnel.
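The resource-affinity test described above reduces to a single bitwise comparison; here it is with the case study's values (the helper function is illustrative):

```python
# Resource-affinity test from the case study: a link is usable for a tunnel
# only if (link_attribute AND tunnel_mask) equals the tunnel's affinity value.

def link_usable(link_attr, tunnel_affinity=0x0, tunnel_mask=0xFFFF):
    return (link_attr & tunnel_mask) == tunnel_affinity

# Chicago->NewYork link after its attribute is set to 0x1:
# 0x1 & 0xFFFF = 0x1, which differs from the tunnel affinity 0x0.
blocked = link_usable(0x1)   # False: link excluded from path computation
default = link_usable(0x0)   # True: links with the default attribute pass
```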
Listing 10-6 shows the output of the show mpls traffic-eng tunnel tunnel0 command when the tunnel is taking the dynamic path through the Dallas router.

Listing 10-6 The show mpls traffic-eng tunnel tunnel0 Command Output When the Tunnel Path Is Taking the Dynamic Path Through the Dallas Router

SanFrancisco#sh mpls tr tun t0
Name: SanFrancisco_t0                    (Tunnel0) Destination: 222.222.222.3
  Status:
    Admin: up        Oper: up     Path: valid      Signalling: connected
    path option 2, type dynamic (Basis for Setup, path weight 20)
    path option 1, type explicit sfny
  Config Parameters:
    Bandwidth: 10    Priority: 7 7    Affinity: 0x0/0xFFFF
    AutoRoute: enabled    LockDown: disabled
  InLabel  :
  OutLabel : POS5/3, 26
  RSVP Signalling Info:
       Src 222.222.222.1, Dst 222.222.222.3, Tun_Id 0, Tun_Instance 2
    RSVP Path Info:
      My Address: 222.222.222.1
      Explicit Route: 222.222.222.18 222.222.222.14 222.222.222.3
      Record Route:
      Tspec: ave rate=10 kbits, burst=1000 bytes, peak rate=10 kbits
    RSVP Resv Info:
      Record Route: 222.222.222.14 222.222.222.18
      Fspec: ave rate=10 kbits, burst=1000 bytes, peak rate=Inf
After the path through Dallas is established, once again a link in the tunnel path becomes unusable. The local resource attribute on the link from the SanFrancisco router to the Dallas router is changed to 0x1, making the established TE path no longer usable. Hence, taking the new local link resource attributes into consideration, the SanFrancisco router chooses the TE path through the Chicago and Dallas routers to reach the NewYork router. Listing 10-7 shows the output of the show mpls traffic-eng tunnel tunnel0 command when the tunnel is taking this dynamically determined path through the Chicago and Dallas routers to the NewYork router.

Listing 10-7 The show mpls traffic-eng tunnel tunnel0 Command Output on the SanFrancisco Router When the Tunnel Path Is Taking the Dynamically Determined Path via Chicago and Dallas to Go to New York

SanFrancisco#sh mpls traffic-eng tunnel t0
Name: SanFrancisco_t0                    (Tunnel0) Destination: 222.222.222.3
  Status:
    Admin: up        Oper: up     Path: valid      Signalling: connected
    path option 2, type dynamic (Basis for Setup, path weight 30)
    path option 1, type explicit sfny
  Config Parameters:
    Bandwidth: 10    Priority: 7 7    Affinity: 0x0/0xFFFF
    AutoRoute: enabled    LockDown: disabled
  InLabel  :
  OutLabel : POS5/0, 26
  RSVP Signalling Info:
       Src 222.222.222.1, Dst 222.222.222.3, Tun_Id 0, Tun_Instance 2
    RSVP Path Info:
      My Address: 222.222.222.1
      Explicit Route: 222.222.222.22 222.222.222.26 222.222.222.14 222.222.222.3
      Record Route:
      Tspec: ave rate=10 kbits, burst=1000 bytes, peak rate=10 kbits
    RSVP Resv Info:
      Record Route: 222.222.222.14 222.222.222.26 222.222.222.22
      Fspec: ave rate=10 kbits, burst=1000 bytes, peak rate=Inf
SanFrancisco#

The preceding output on the SanFrancisco router shows that the TE tunnel path to NewYork is now established through the Chicago and Dallas routers. To learn more about how the MPLS TE tunnel operates, look at Listing 10-8, which shows the relevant beginning section of the show interface tunnel0 command output. The show isis mpls traffic-eng tunnel command can also be used to check on the established TE tunnels; its output is shown in Listing 10-9.

Listing 10-8 The show interface tunnel0 Command Output

SanFrancisco#sh interface tunnel 0
Tunnel0 is up, line protocol is up
  Hardware is Tunnel
  Interface is unnumbered. Using address of Loopback0 (222.222.222.1)
  MTU 1496 bytes, BW 9 Kbit, DLY 500000 usec, rely 255/255, load 1/255
  Encapsulation TUNNEL, loopback not set
  Keepalive set (10 sec)
  Tunnel source 222.222.222.1, destination 222.222.222.3
  Tunnel protocol/transport Label Switching, key disabled, sequencing disabled
  Checksumming of packets disabled, fast tunneling enabled

Listing 10-9 The show isis mpls traffic-eng tunnel Command Output
SanFrancisco# sh isis mpls traffic-eng tunnel
System Id       Tunnel Name     Bandwidth   Mode
NewYork.00      Tunnel0         100000
Listings 10-10 and 10-11 use commands that show the operation of IS-IS with TE extensions. IS-IS with TE extensions not only gives the network topology with the link metric information, but also carries TE attributes of all the links in the network. Listing 10-10 shows the output of a command to show the MPLS TE attributes of all the MPLS-enabled links on the Chicago router with a router-id of 222.222.222.2. Listing 10-10 A Display of the MPLS Traffic Engineering Attributes of All the MPLS-Enabled Links on the Chicago Router with a router-id of 222.222.222.2 4c7507=R1#sh mpls traffic-eng topology 222.222.222.2 IGP Id: 0000.0000.0002.00, MPLS TE Id:222.222.222.2 Router Node, Internal Node_id 3 link[0 ]:Nbr IGP Id: 0000.0000.0004.00, nbr_node_id:2 Intf Address:222.222.222.25, Nbr Intf Address:222.222.222.26 admin_weight:10, affinity_bits:0x0 max_link_bw:1544 max_link_reservable: 1158 allocated reservable allocated reservable ----------------------------------bw[0]: 0 1158 bw[1]: 0 1158 bw[2]: 0 1158 bw[3]: 0 1158 bw[4]: 0 1158 bw[5]: 0 1158 bw[6]: 0 1158 bw[7]: 10 1148 link[1 ]:Nbr IGP Id: 0000.0000.0001.00, nbr_node_id:1 Intf Address:222.222.222.22, Nbr Intf Address:222.222.222.21 admin_weight:10, affinity_bits:0x0 max_link_bw:1544 max_link_reservable: 1158 allocated reservable allocated reservable ----------------------------------bw[0]: 0 1158 bw[1]: 0 1158 bw[2]: 0 1158 bw[3]: 0 1158 bw[4]: 0 1158 bw[5]: 0 1158 bw[6]: 0 1158 bw[7]: 0 1158 link[2 ]:Nbr IGP Id: 0000.0000.0003.00, nbr_node_id:4 Intf Address:222.222.222.10, Nbr Intf Address:222.222.222.9 admin_weight:10, affinity_bits:0x1 max_link_bw:1544 max_link_reservable: 1158 allocated reservable allocated reservable ----------------------------------bw[0]: 0 1158 bw[1]: 0 1158 bw[2]: 0 1158 bw[3]: 0 1158 bw[4]: 0 1158 bw[5]: 0 1158 bw[6]: 0 1158 bw[7]: 0 1158 SanFrancisco# Each router in MPLS TE advertises its own link attributes to the rest of the network using IS-IS extensions. 
The show isis mpls traffic-eng advertise command shows the TE attributes being advertised by a router. Listing 10-11 shows the output of this command on the SanFrancisco router. Listing 10-11 TE Attributes Being Advertised by the SanFrancisco Router SanFrancisco#sh isis mpls traffic-eng advertise System ID: SanFrancisco.00 Router ID: 222.222.222.1 Link Count: 2 Link[1] Neighbor System ID: W1-R2c4500m=R2.00 (broadcast link)
Interface IP address: 222.222.222.21 Neighbor IP Address: 222.222.222.22 Admin. Weight: 10 Physical BW: 1544000 bits/sec Reservable BW: 1158000 bits/sec BW unreserved[0]: 1158000 bits/sec, BW unreserved[1]: BW unreserved[2]: 1158000 bits/sec, BW unreserved[3]: BW unreserved[4]: 1158000 bits/sec, BW unreserved[5]: BW unreserved[6]: 1158000 bits/sec, BW unreserved[7]: Affinity Bits: 0x00000000 Link[2] Neighbor System ID: Y1-R5c7513=R4.00 (broadcast link) Interface IP address: 222.222.222.17 Neighbor IP Address: 222.222.222.18 Admin. Weight: 10 Physical BW: 1544000 bits/sec Reservable BW: 1158000 bits/sec BW unreserved[0]: 1158000 bits/sec, BW unreserved[1]: BW unreserved[2]: 1158000 bits/sec, BW unreserved[3]: BW unreserved[4]: 1158000 bits/sec, BW unreserved[5]: BW unreserved[6]: 1158000 bits/sec, BW unreserved[7]: Affinity Bits: 0x00000001 SanFrancisco#
By default, an operational TE tunnel is not installed in the routing table or announced by IS-IS. The tunnel mpls traffic-eng autoroute announce command is used to install the MPLS tunnel in the routing table and announce it through IS-IS. Listing 10-12 shows output of the show ip route command to the tunnel destination before and after enabling the autoroute announce command. Listing 10-12 Route to the Tunnel Destination 222.222.222.3 Before and After the Tunnel Is Installed in the Routing Table SanFrancisco#show ip route 222.222.222.3 Routing entry for 222.222.222.3/32 Known via "isis", distance 115, metric 30, type level-1 Redistributing via isis Last update from 222.222.222.22 on POS5/1, 00:02:07 ago Routing Descriptor Blocks: * 222.222.222.22, from 222.222.222.3, via POS5/1 Route metric is 30, traffic share count is 1 222.222.222.18, from 222.222.222.3, via POS5/3 Route metric is 30, traffic share count is 1 SanFrancisco(config)#int t0 SanFrancisco(config-if)#tunnel mpls traffic-eng autoroute announce SanFrancisco#sh ip route 222.222.222.3 Routing entry for 222.222.222.3/32 Known via "isis", distance 115, metric 30, type level-1 Redistributing via isis Last update from 222.222.222.3 on Tunnel0, 00:00:01 ago Routing Descriptor Blocks: * 222.222.222.3, from 222.222.222.3, via Tunnel0 Route metric is 30, traffic share count is 1 Listing 10-13 shows the information on the tunnel destination 222.222.222.3 in the Cisco Express Forwarding (CEF) table by using the show ip cef 222.222.222.3 internal command. This command output shows the label imposed on the traffic to the tunnel destination. Listing 10-13 Information on the Tunnel Destination 222.222.222.3 in the CEF Table SanFrancisco#sh ip cef 222.222.222.3 internal
222.222.222.3/32, version 253 0 packets, 0 bytes has label information: local label: tunnel head fast label rewrite: Tu0, point2point, labels imposed 26 via 222.222.222.3, Tunnel0, 0 dependencies next hop 222.222.222.3, Tunnel0 valid adjacency RSVP establishes labels to create a label-switched MPLS tunnel path. Each router keeps a label forwarding table to switch packets based on their incoming label. Listing 10-14 shows the label forwarding tables on the routers in the TE tunnel path to the destination. Note that the Dallas router removes the MPLS label before sending the packet to the destination NewYork router. Hence, the NewYork router's label forwarding table is not shown in this example. Listing 10-14 Information on the Label Forwarding Table on the SanFrancisco, Chicago, and Dallas Routers SanFrancisco#sh mpls forwarding-table 222.222.222.3 32 detail Local Outgoing Prefix Bytes mpls Outgoing Next Hop label label or VC or Tunnel Id switched interface Tun hd Unlabeled 222.222.222.3/32 0 Tu0 point2point MAC/Encaps=4/8, MTU=1500, Label Stack{26}, via POS5/1 0F008847 0001A000 Chicago#sh mpls forwarding-table Local Outgoing Prefix Bytes mpls label label or VC or Tunnel Id switched 26 30 222.222.222.1 0 [1] 0 Chicago# Dallas#sh mpls forwarding-table Local Outgoing Prefix Bytes mpls label label or VC or Tunnel Id switched 30 Pop label 222.222.222.1 0 [1] 0 Outgoing interface POS0/3 Next Hop point2point
Figure 10-5 shows how packets are label-switched along the TE tunnel's path. The SanFrancisco router imposes a label of 26 and sends all traffic destined to or through the tunnel destination on interface POS5/1. The Chicago router receives the packet with an incoming label of 26. Based on its label forwarding table, the Chicago router swaps the incoming label in the packet with an outgoing label of 30 and sends it on interface POS0/3. The packet now arrives on the Dallas router with an incoming label of 30. Because Dallas is the penultimate hop before the destination, it removes the MPLS label and sends it toward the destination through interface POS9/1. Figure 10-5 LSP for the TE Tunnel from San Francisco to New York
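The swap-and-pop behavior just described can be sketched as a small model. This is an illustrative sketch only, not router code; the lfib dictionary is a hypothetical stand-in for each router's label forwarding table, with the labels, router names, and interfaces taken from Listing 10-14 and Figure 10-5.

```python
# Hypothetical label forwarding information bases (LFIBs), keyed by incoming
# label: Chicago swaps label 26 for 30, Dallas (the penultimate hop) pops 30.
lfib = {
    "Chicago": {26: ("swap", 30, "POS0/3")},
    "Dallas": {30: ("pop", None, "POS9/1")},
}

def forward(path, first_label):
    """Trace the label operations a packet undergoes along the LSP."""
    ops = [("impose", first_label)]  # the head end (SanFrancisco) imposes the label
    label = first_label
    for router in path:
        action, out_label, _interface = lfib[router][label]
        ops.append((action, out_label))
        label = out_label
    return ops

ops = forward(["Chicago", "Dallas"], 26)
```

Running the model reproduces the impose/swap/pop sequence the text walks through for the San Francisco-to-New York LSP.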
As noted previously, RSVP is used to establish TE tunnels. On the Chicago router, debug information is captured to show the contents of the RSVP messages used to establish the tunnel. Listings 10-15 and 10-16 display the contents of the RSVP PATH and RESV messages, respectively, as seen by the Chicago router. Note the new and modified RSVP objects and the information they carry.
Listing 10-15 RSVP PATH Message That the Chicago Router Receives from the SanFrancisco Router
RSVP: version:1 flags:0000 type:PATH cksum:0000 ttl:254 reserved:0 length:248
SESSION type 7 length 16: Destination 222.222.222.3, TunnelId 0, Source 222.222.222.1
HOP type 1 length 12: DEDEDE15 : 00000000
TIME_VALUES type 1 length 8 : 00007530
EXPLICIT_ROUTE type 1 length 28:
  (#1) Strict IPv4 Prefix, 8 bytes, 222.222.222.22/32
  (#2) Strict IPv4 Prefix, 8 bytes, 222.222.222.26/32
  (#3) Strict IPv4 Prefix, 8 bytes, 222.222.222.14/32
LABEL_REQUEST type 1 length 8 : 00000800
SESSION_ATTRIBUTE type 7 length 24: setup_pri: 7, reservation_pri: 7 MAY REROUTE SESSION_NAME:SanFrancisco_t0
SENDER_TEMPLATE type 7 length 12: Source 222.222.222.1, tunnel_id 1
SENDER_TSPEC type 2 length 36: version=0, length in words=7
  service id=1, service length=6
  parameter id=127, flags=0, parameter length=5
  average rate=1250 bytes/sec, burst depth=1000 bytes
  peak rate=1250 bytes/sec
  min unit=0 bytes, max unit=0 bytes
ADSPEC type 2 length 84: version=0 length in words=19
  General Parameters break bit=0 service length=8
    IS Hops:0
    Minimum Path Bandwidth (bytes/sec):2147483647
    Path Latency (microseconds):0
    Path MTU:-1
  Guaranteed Service break bit=0 service length=8
    Path Delay (microseconds):0
    Path Jitter (microseconds):0
    Path delay since shaping (microseconds):0
    Path Jitter since shaping (microseconds):0
  Controlled Load Service break bit=0 service length=0
RECORD_ROUTE type 1 length 12:
  (#1) IPv4 address, 222.222.222.21/32
Note the Explicit Route object in the RSVP PATH message that gives the exact path to be taken by the RSVP
PATH message to establish the label-switched tunnel path. It also carries a LABEL REQUEST object to request label distribution and a RECORD ROUTE object to record the path taken by the message.
Listing 10-16 RSVP RESV Message Sent by the Chicago Router to the SanFrancisco Router
RSVP: version:1 flags:0000 type:RESV cksum:D748 ttl:255 reserved:0 length:136
SESSION type 7 length 16: Destination 222.222.222.3, TunnelId 0, Source 222.222.222.1
HOP type 1 length 12: DEDEDE16 : 00000000
TIME_VALUES type 1 length 8 : 00007530
STYLE type 1 length 8 : RSVP_SE_OPTION
FLOWSPEC type 2 length 36: version = 0 length in words = 7
  service id = 5, service length = 6
  tspec parameter id = 127, tspec flags = 0, tspec length = 5
  average rate = 1250 bytes/sec, burst depth = 1000 bytes
  peak rate = 2147483647 bytes/sec
  min unit = 0 bytes, max unit = 0 bytes
FILTER_SPEC type 7 length 12: Source 222.222.222.1, tunnel_id 1
LABEL type 1 length 8 : 0000001A
RECORD_ROUTE type 1 length 28:
  (#1) IPv4 address, 222.222.222.14/32
  (#2) IPv4 address, 222.222.222.26/32
  (#3) IPv4 address, 222.222.222.22/32
Note that the RSVP RESV message carries the LABEL 0x1A (26 in decimal) and the FLOWSPEC information. Its RECORD ROUTE object recorded the path taken by the message. The RSVP session and the installed reservation are shown in Listing 10-17.
Listing 10-17 The RSVP Session and Installed Reservation Information from the SanFrancisco Router
SanFrancisco#show ip rsvp host sender
To              From            Pro  DPort  Sport  Prev Hop  I/F
222.222.222.3   222.222.222.1   0    0      1
SanFrancisco#show ip rsvp installed
RSVP: POS5/1
BPS   To              From            Protoc  DPort  Sport
10K   222.222.222.3   222.222.222.1   0       0      1
By default, the Time-to-Live (TTL) value in the IP header is copied to the TTL value in the MPLS header at the edge router to a label-switched path. Hence, a traceroute utility would normally show all the hops in a label-switched tunnel path. You can mask the label-switched hops in a network from outside users, however, by not copying the TTL value of the IP header into the MPLS header at the edge of the MPLS network. Listing 10-18 shows the output of the traceroute utility on the SanFrancisco router before and after enabling the no mpls ip propagate-ttl command.
Listing 10-18 The traceroute Utility Output With and Without Setting the IP Header TTL Value in the MPLS Label
SanFrancisco#trace 222.222.222.3
Type escape sequence to abort.
Tracing the route to 222.222.222.3
  1 222.222.222.22 12 msec 8 msec 8 msec
  2 222.222.222.26 12 msec 8 msec 12 msec
  3 222.222.222.14 12 msec 8 msec *
SanFrancisco(config)#no mpls ip propagate-ttl
SanFrancisco#trace 222.222.222.3
Type escape sequence to abort.
Tracing the route to 222.222.222.3
  1 222.222.222.14 8 msec 8 msec *
SanFrancisco#
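The effect of TTL propagation on traceroute can be modeled in a few lines. This is a sketch of the behavior just described, not actual router logic; the hop addresses are those from Listing 10-18.

```python
# Interior (label-switched) hops of the LSP and the hop where the label
# is popped, as seen in Listing 10-18.
lsp_hops = ["222.222.222.22", "222.222.222.26"]  # label-switched interior hops
exit_hop = "222.222.222.14"                      # label popped here

def traceroute_hops(propagate_ttl):
    """Return the hops that answer traceroute probes."""
    if propagate_ttl:
        # The small IP TTL is copied into the MPLS header, so each LSR
        # along the tunnel expires a probe and answers.
        return lsp_hops + [exit_hop]
    # With no mpls ip propagate-ttl, the MPLS TTL starts at 255: probes
    # never expire inside the LSP, so the interior hops stay hidden.
    return [exit_hop]
```

The model reproduces the two traceroute outputs: three hops with TTL propagation, a single hop without it.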
Summary
MPLS TE uses MPLS' circuit-switching properties to engineer TE tunnels with adequate reserved resources along the tunnel path based on the traffic carried by the tunnel. TE-RSVP is used to establish the LSP-based
TE path for a TE tunnel. TE extensions for OSPF and IS-IS enable them to carry the available link resource information all over the network. The famous quote on TE by Mike O'Dell of UUNET, "The efficacy with which one uses the available bandwidth in the transmission fabric directly drives the fundamental 'manufacturing efficiency' of the business and its cost structure," eloquently captures the reasoning for TE.
References
1. "Requirements for Traffic Engineering Over MPLS," D. Awduche et al., draft-ietf-mpls-traffic-eng-00.txt, work in progress.
2. "Extensions to RSVP for LSP Tunnels," D. Awduche et al., draft-ietf-mpls-rsvp-lsp-tunnel-04.txt, work in progress.
3. "IS-IS Extensions for Traffic Engineering," H. Smit and T. Li, draft-ietf-isis-traffic-00.txt, work in progress.
4. "Traffic Engineering Extensions to OSPF," D. Katz and D. Yeung, draft-katz-yeung-ospf-traffic-00.txt, work in progress.
5. "The OSPF Opaque LSA Option," R. Coltun, RFC 2370, July 1998.
Policy application Applying a policy to an interface or a VC in the ATM or Frame Relay QoS models. This is achieved using the service-policy configuration command.
Listing A-3 Example of match-all class-map Keyword Usage
class-map class1
 match <>
class-map class2
 match <>
class-map match-all class-all
 match class1
 match class2
Policy Definition
After traffic class configuration, the next step is defining the QoS policies on the previously defined traffic classes. Policies are defined using the policy-map command. You can implement any QoS function as a subcommand under the policy-map definition. They include all edge QoS features, such as rate-limiting, rate-shaping, and IP precedence or DSCP settings, as well as core QoS features, such as WFQ and WRED. Listings A-4 and A-5 show examples of policies meant for network boundary and core network interfaces, respectively.
Listing A-4 Examples of Policies Meant for Network Boundary Interfaces
policy-map epolicy1
 class class1
  rate-limit <>
policy-map epolicy2
 class class2
  set ip dscp EF
policy-map epolicy3
 class class3
  shape <>
  set ip precedence 4
In Listing A-4, epolicy1 defines a rate-limiting policy on the class1 traffic class, epolicy2 sets the DSCP field to EF on the class2 traffic class, and epolicy3 enables traffic shaping as well as sets the IP precedence value to 4 for all traffic belonging to the class3 class.
Listing A-5 Examples of Policies Meant for Core Network Interfaces
policy-map cpolicy1
 class class1
  bandwidth <>
  random-detect <>
In Listing A-5, cpolicy1 defines a certain minimum bandwidth and WRED policies on the class1 traffic class. An example of a multidimensional policy configuration is shown in Listing A-6. In this example, epolicy1 defines a traffic rate-limiting policy on traffic class class1 and sets the IP precedence value to 3 for traffic belonging to the class2 class.
Listing A-6 Multidimensional Policies
policy-map epolicy1
 class class1
  rate-limit <>
 class class2
set ip precedence 3
Policy Application
After configuring all relevant class-maps and policy-maps, the final step is enabling the policy on an interface by associating a policy-map command to an interface using a service-policy command. An additional keyword, input or output, specifies whether the policy applies to incoming or outgoing packets on the interface, respectively. Listing A-7 shows examples of enabling a policy on an interface. A policy1 policy is applied on all the input traffic arriving on interface HSSI0/0/0. A policy2 policy is applied on the output traffic of interface POS0/0/0.
Listing A-7 Examples of Enabling a Policy on an Interface
interface HSSI0/0/0
 service-policy input policy1
interface POS0/0/0
 service-policy output policy2
The service-policy command is also available on a per-VC basis on ATM and Frame Relay interfaces, and on a per-interface basis on logical interfaces such as Tunnel and Fast EtherChannel interfaces, provided the policies specified by the associated policy-map are supported over a VC or logical interface.
Hierarchical Policies
For certain applications, it is necessary to allow definition of hierarchical policies. Given that service policies can be attached to interfaces as well as to individual VCs on a Frame Relay or ATM interface, there is already an implied hierarchy of policies that you can configure by attaching policies at different layers of the hierarchy. (You can attach one service-policy command to a Frame Relay interface and a different service-policy command to a permanent virtual circuit [PVC] on the Frame Relay interface.) In a real sense, an interface or PVC, though not defined by means of a class-map command, represents a traffic class that shares a common attribute. Hence, you can also configure a service-policy command not only on an interface or PVC, but also on a traffic class under a policy-map command. Listing A-8 shows how you can apply a service-policy command on a traffic class to define a policy.
Listing A-8 Attaching a Service Policy Directly to a Class
policy-map policy-hierarchy
 class class1
  service-policy <>
To illustrate hierarchical policies, consider a policy that polices aggregate Transmission Control Protocol (TCP) traffic to 10 Mbps but simultaneously polices certain TCP application traffic, such as aggregate Telnet and File Transfer Protocol (FTP) traffic, each to 1 Mbps. Listing A-9 shows a hierarchical policy configuration for this application.
Listing A-9 Hierarchical Rate-Limiting Policy Example
class-map tcp
 match <all tcp traffic>
class-map telnet
 match <all telnet traffic>
class-map ftp
 match <all ftp traffic>
policy-map telnet-ftp-police
 class telnet
  rate-limit 1000000
 class ftp
  rate-limit 1000000
policy-map TCP-police-hierarchical
 class tcp
  rate-limit 10000000
  service-policy telnet-ftp-police
In the preceding listing, the classes tcp, telnet, and ftp define the traffic belonging to TCP as a whole and to the TCP-based applications Telnet and FTP, respectively. A telnet-ftp-police policy is defined to rate-limit traffic belonging to the Telnet and FTP applications to 1 Mbps each. Finally, the TCP-police-hierarchical policy is defined to enable hierarchical policy configuration. This hierarchical policy rate-limits traffic belonging to the tcp class to 10 Mbps, while using the telnet-ftp-police policy to rate-limit the individual Telnet and FTP application traffic to 1 Mbps each.
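The conform-to-both-levels semantics of such a hierarchical policy can be sketched as nested budget checks. This is a simplified single-interval model, not the token-bucket policer IOS actually uses; the class names and rates mirror the hierarchical example above.

```python
# Hypothetical per-interval bit budgets standing in for the policers:
# 10 Mbps aggregate TCP parent, 1 Mbps each for the Telnet and FTP children.
limits = {"tcp": 10_000_000, "telnet": 1_000_000, "ftp": 1_000_000}
used = {"tcp": 0, "telnet": 0, "ftp": 0}

def admit(app, bits):
    """Admit traffic for a leaf class ('telnet' or 'ftp') only if it conforms
    to both its own child policer and the aggregate 'tcp' parent policer."""
    if used[app] + bits > limits[app]:
        return False  # child class policer rejects
    if used["tcp"] + bits > limits["tcp"]:
        return False  # parent aggregate policer rejects
    used[app] += bits
    used["tcp"] += bits
    return True
```

For example, 900 kb of Telnet traffic is admitted, a further 200 kb of Telnet is rejected by the 1 Mbps child policer, and 1 Mb of FTP is still admitted because both its child policer and the parent aggregate have budget left.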
Apart from the QoS policies, the input packet accounting occurs immediately after the input access-lists are applied, and the output packet accounting occurs immediately before the packet goes on the wire.
 class class2
  rate-limit 20 input <>
In Listing A-11, the policy2 policy-map policy matches traffic against class2 before class1 (sequence number 5 before 10). It illustrates how you can change the order in which policies are executed.
Listing A-11 Intra-Policy Execution Order Changed by Modifying Sequence Number
policy-map policy1
 class class1
  rate-limit 10 input <>
 class class2
  rate-limit 5 input <>
Process Switching
In process switching, a packet arrives on an incoming interface and is enqueued on the input queue of the process that switches the packet. When the process is scheduled to run by the scheduler, the process looks in the routing table for a route to the destination address. If a route is found, the next-hop address is retrieved from the routing table. The Layer 2 Media Access Control (MAC) rewrite information for this next hop is derived from the Address Resolution Protocol (ARP) table, and the packet is now enqueued on the outbound interface for transmission. Process switching is slow, inefficient, and processor-intensive because every packet-switching decision involves lookups in the routing table and in the ARP table. Process switching does per-packet load balancing when multiple equal cost paths exist to a destination. Process-switching operation is depicted in Figure B-1.
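The per-packet lookup sequence described above can be sketched as follows. This is an illustrative model, not IOS code; the routing and ARP entries, including the MAC addresses, are hypothetical and only loosely echo the case study's addressing.

```python
import ipaddress

# Hypothetical routing table (prefix -> next hop) and ARP table
# (next hop -> MAC rewrite). Every packet consults both.
routing_table = {
    ipaddress.ip_network("131.108.0.0/16"): "200.200.200.106",
    ipaddress.ip_network("0.0.0.0/0"): "200.200.200.101",
}
arp_table = {
    "200.200.200.106": "00e0.b0e2.b843",  # hypothetical MAC addresses
    "200.200.200.101": "0000.0c31.dd8b",
}

def process_switch(dst_ip):
    """Repeat the route lookup and the ARP lookup for every single packet."""
    dst = ipaddress.ip_address(dst_ip)
    candidates = [net for net in routing_table if dst in net]
    if not candidates:
        return None  # no route: drop
    best = max(candidates, key=lambda net: net.prefixlen)  # longest match
    next_hop = routing_table[best]
    return next_hop, arp_table[next_hop]  # separate ARP lookup each time
```

The point of the sketch is the cost structure: both table lookups sit on the forwarding path of every packet, which is exactly what the faster switching methods below avoid.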
Route-Cache Forwarding
Route-cache forwarding addresses some of the problems with process switching. In this method, after the first packet to a destination is process-switched, the destination address, next-hop interface, and MAC address encapsulation to that next hop are all stored in a single table called the route cache. You can quickly switch subsequent packets to the destination by looking up the destination in this route cache. In this method, the switching decision is made on the same receive interrupt that fetched the incoming packet. Route-cache forwarding is also commonly referred to as fast switching. The route-cache forwarding mechanism for packet forwarding is depicted in Figure B-2. Figure B-1 Process Switching Packets by Doing Internet Protocol (IP) Routing Table and ARP Table Lookups
Route-cache forwarding populates a fast lookup cache on demand for destination prefixes. A route-cache entry for a destination is created only after the router receives the first packet to that destination. This first packet to a destination is process-switched, but any subsequent packets to the same destination are switched by looking them up in the faster and more efficient route cache. Route-cache entries are periodically aged out. In addition, network topology changes can immediately invalidate these entries. This demand-caching scheme is efficient for scenarios in which most of the traffic flows are associated with a subset of the destinations in the routing table. Traffic profiles in the Internet core and in large intranets, however, no longer fit this description. Traffic characteristics have changed toward an increased number of short-duration flows, typically sourced by Web-based and interactive applications. This changed pattern of Internet traffic calls for a paradigm change in the switching mechanism, one that reduces the increased cache-maintenance activity caused by a greater number of short-duration flows and network topology changes, as reflected by the routing table. When multiple equal cost paths exist to a destination subnet, route-cache forwarding caches /32 host prefixes for all traffic going to that subnet to accomplish load balancing.
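The demand-caching behavior can be sketched as a dictionary that is populated by the first, process-switched packet. A minimal sketch assuming a fixed slow-path result; the next hop and interface are illustrative, not taken from any particular device.

```python
route_cache = {}
slow_path_count = 0  # counts how often the slow path actually runs

def process_switch_slow(dst):
    """Stand-in for the process-switched slow path."""
    global slow_path_count
    slow_path_count += 1
    return ("200.200.200.106", "Hssi2/0/0")  # assumed next hop and interface

def fast_switch(dst):
    """The first packet misses the cache and is process-switched; the result
    is cached so later packets to the same destination skip the slow path."""
    if dst not in route_cache:
        route_cache[dst] = process_switch_slow(dst)
    return route_cache[dst]

fast_switch("131.108.1.1")
fast_switch("131.108.1.1")  # cache hit: the slow path runs only once
```

A topology change would simply delete entries from route_cache, forcing the next packet per destination back through the slow path, which is the cache-maintenance churn the text describes.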
CEF
CEF is a scalable, Layer 3 switching mechanism designed to accommodate the changing traffic characteristics and network dynamics of the Internet and of large intranets. CEF provides a number of improvements over the traditional route-cache switching approach.
CEF Advantages
CEF avoids the potential overhead of continuous cache activity by using a topology-driven CEF table for the destination switching decision. The CEF table mirrors the entire contents of the routing information; there is a one-to-one correspondence between CEF table entries and routing table prefixes. Any route recursion is also resolved while creating a CEF entry. This translates into significant benefits in terms of performance, scalability, network resilience, and functionality. CEF's benefits are most apparent in large, complex networks that have dynamic traffic patterns. CEF also avoids performance hits during times of network instability by dropping packets without trying to process switch a packet based on the routing table when a CEF entry is missing. Because CEF is based on routing information, a missing CEF entry automatically implies a missing route entry, which itself might be due to lost peering sessions or network instability. An adjacency table is maintained along with the CEF table. The adjacency table keeps the Layer 2 header information separate from the CEF table and is populated by any protocol (ARP, Open Shortest Path First [OSPF], BGP, and so on) that discovers an adjacency. Each adjacent node's link layer header is precomputed and stored along with the adjacency. The CEF table is populated by callbacks from the routing table. After a route is resolved, its corresponding CEF entry points to a next hop, which should be an adjacency. If an adjacency is found in the adjacency table, a pointer to the appropriate adjacency is cached in the CEF entry. Figure B-3 depicts a CEF switching operation. Figure B-3 CEF Switching Packets by Doing CEF Table Lookup
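The separation between the forwarding table and the adjacency table can be sketched as two dictionaries, with each FIB entry caching a pointer (here, a key) into the adjacency table. This is a simplified sketch: real CEF performs a longest-prefix match, while this model looks up an exact prefix; the entries echo the case study's addressing.

```python
# Adjacency table: precomputed Layer 2 rewrite per (interface, next hop).
adjacency_table = {
    ("Hssi2/0/0", "200.200.200.106"): "0F000800",  # HDLC rewrite, as in Listing B-3
}

# FIB mirroring the routing table one-to-one; recursion is already resolved,
# so the entry points straight at an adjacency.
fib = {
    "131.108.0.0/16": ("Hssi2/0/0", "200.200.200.106"),
}

def cef_switch(prefix):
    """Switch on a precomputed FIB entry and its cached adjacency."""
    adj_key = fib.get(prefix)
    if adj_key is None:
        return None  # missing FIB entry implies missing route: drop, don't punt
    return adj_key, adjacency_table[adj_key]
```

Note the drop-on-miss behavior in the sketch: unlike the route-cache model, a miss does not trigger a process-switched lookup, because the FIB is already a complete mirror of the routing table.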
In addition to regular physical adjacencies, some special handling is also required for special adjacency types. CEF entries with prefixes that need special processing are cached with the appropriate special adjacency type. Table B-1 shows examples of these adjacency types.
Table B-1. Examples of Certain Common Special Adjacency Types
Receive: Packets to these prefixes are intended for the router. CEF entries with this adjacency are /32 prefixes of the router's IP addresses, and direct and normal broadcast addresses.
Null: CEF entries with this adjacency are created for prefixes destined to the router's Null0 interface. Packets to CEF entries pointing to Null adjacency are dropped by CEF.
Glean: CEF entries with Glean adjacency are created for subnet prefixes that are directly connected to one of the router's non-point-to-point interfaces. For packets that need to be forwarded to an end station, the adjacency database is gleaned for the specific prefix.
Punt: CEF entries with Punt adjacency are created for prefixes that can't be CEF-switched because certain features might require special handling or are not yet CEF-supported.
Note The Null0 interface is a pseudo-interface that functions similarly to the null devices available on most operating systems. This interface is always up and can never forward or receive traffic. It provides an alternative method of filtering traffic.
Unlike the route-cache model, which caches /32 prefixes to accomplish load balancing, CEF uses an efficient hash function lookup to provide per-source/destination pair load balancing. A hash function points to a unique adjacency for each source and destination address. In addition, CEF can do per-packet load balancing by using a pointer that moves among the equal-cost adjacencies to a destination in a round-robin manner. Other significant advantages of CEF are extensive CEF packet accounting statistics and QoS policy propagation. Appendix C discusses QoS policy propagation.
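The per-source/destination hashing can be sketched as follows. CEF's actual hash function is internal to IOS; MD5 here is merely an illustrative stand-in that deterministically spreads flows over the two equal-cost adjacencies used in the case study.

```python
import hashlib

# Two hypothetical equal-cost adjacencies for the same destination prefix.
adjacencies = [("200.200.200.101", "Hssi2/0/1"), ("200.200.200.106", "Hssi2/0/0")]

def pick_adjacency(src, dst):
    """Hash the source/destination pair to a single adjacency, so one flow
    always follows one path while different flows spread across both."""
    digest = hashlib.md5(f"{src}->{dst}".encode()).digest()
    return adjacencies[digest[0] % len(adjacencies)]

first = pick_adjacency("10.0.0.1", "200.200.200.110")
second = pick_adjacency("10.0.0.1", "200.200.200.110")  # same flow, same path
```

Because the choice is a pure function of the address pair, packets of one flow never reorder across paths; per-packet load balancing would instead rotate a pointer round-robin over the same adjacency list.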
Note Route-cache forwarding can also operate in a distributed mode by downloading the IP cache information to the interface line cards.
Because both CEF and the route-cache model switch packets based on routing information, first look at the routing table in Router R2. Listing B-1 shows the routing information in Router R2 by using the show ip route command.
Listing B-1 Routing Table of Router R2
R2#sh ip route
Codes: C - connected, S - static, I - IGRP, R - RIP, M - mobile, B - BGP
       D - EIGRP, EX - EIGRP external, O - OSPF, IA - OSPF inter area
       N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2
       E1 - OSPF external type 1, E2 - OSPF external type 2, E - EGP
       i - IS-IS, L1 - IS-IS level-1, L2 - IS-IS level-2, * - candidate default
       U - per-user static route, o - ODR, P - periodic downloaded static route
       T - traffic engineered route
Gateway of last resort is not set
     200.200.200.0/24 is variably subnetted, 8 subnets, 4 masks
C       200.200.200.104/30 is directly connected, Hssi2/0/0
O       200.200.200.108/30 [110/20] via 200.200.200.101, Hssi2/0/1
                           [110/20] via 200.200.200.106, Hssi2/0/0
C       200.200.200.100/30 is directly connected, Hssi2/0/1
S       200.200.200.0/24 is directly connected, Null0
O       200.200.200.1/32 [110/11] via 200.200.200.101, Hssi2/0/1
C       200.200.200.2/32 is directly connected, Loopback0
O       200.200.200.3/32 [110/11] via 200.200.200.106, Hssi2/0/0
C       200.200.200.16/28 is directly connected, FastEthernet1/0/0
     191.108.0.0/16 is variably subnetted, 3 subnets, 2 masks
O       191.108.10.8/30 [110/20] via 200.200.200.106, Hssi2/0/0
O       191.108.10.12/30 [110/20] via 200.200.200.106, Hssi2/0/0
B       191.108.0.0/16 [200/0] via 191.108.10.10
     201.201.201.0/24 is variably subnetted, 2 subnets, 2 masks
O       201.201.201.8/30 [110/20] via 200.200.200.106, Hssi2/0/0
O       201.201.201.0/28 [110/20] via 200.200.200.106, Hssi2/0/0
B    194.194.194.0/24 [200/0] via 201.201.201.10
B    20.0.0.0/8 [200/0] via 191.108.10.10
B    131.108.0.0/16 [200/0] via 191.108.10.10
B    210.210.0.0/16 [200/0] via 191.108.10.10
The IP routing table is independent of the packet switching mechanism, CEF or route-cache forwarding. It is populated mainly by dynamic routing protocols, connected subnets, and static routes. Before enabling CEF on Router R2, use the show ip cache verbose command to look at the cache table used in the route-cache model.
Listing B-2 shows a snapshot of Router R2's route cache table.
Listing B-2 A Snapshot of Router R2 Doing Route-Cache Forwarding
R2#sh ip cache verbose
IP routing cache 7 entries, 1196 bytes
   1967 adds, 1960 invalidates, 0 refcounts
Minimum invalidation interval 2 seconds, maximum interval 5 seconds,
   quiet interval 3 seconds, threshold 0 requests
Invalidation rate 0 in last second, 0 in last 3 seconds
Prefix/Length         Age       Interface   Next Hop
20.0.0.0/8-8          00:01:14  Hssi2/0/0   200.200.200.106
    4   0F000800
131.108.0.0/16-16     00:00:49  Hssi2/0/0   200.200.200.106
    4   0F000800
191.108.1.0/30-30     00:01:47  Hssi2/0/0   200.200.200.106
    4   0F000800
191.108.10.8/30-30    00:03:19  Hssi2/0/0   200.200.200.106
    4   0F000800
200.200.200.1/32-32   00:03:22  Hssi2/0/1   200.200.200.101
    4   0F000800
200.200.200.3/32-32      00:03:22  Hssi2/0/0          200.200.200.106
    4   0F000800
200.200.200.101/32-30    00:03:19  Hssi2/0/1          200.200.200.101
    4   0F000800
200.200.200.20/32-28     00:00:29  FastEthernet1/0/0  200.200.200.106
    14  00000C31DD8B00E0B0E2B8430800
210.210.210.0/24-16      00:02:57  Hssi2/0/0          200.200.200.106
    4   0F000800
The cache table is always in a state of flux, as it is traffic-driven. A cache entry to a prefix is created only after the first packet to that prefix address space is process-switched. Note the following characteristics of the route-cache-based forwarding table: The cache table is classful. Its prefix length can only be 0, 8, 16, 24, or 32. A Class A, B, or C network prefix cannot have a cache prefix length of other than 8, 16, or 24, respectively. The length of a prefix in the cache table is equal to the most specific subnet of that particular prefix's class network in the IP routing table. The cache prefix length of the 200.200.200.0/24 address space, for example, is /32 because at least one route in the 200.200.200.0/24 address space exists with a /32 mask in the routing table.
The length of a prefix is shown in the format x-y in the cache table. Here, x indicates the actual length of the cache entry, and y indicates the length of the matching prefix in the routing table. A cache entry to a prefix also shows the outgoing interface, the next-hop IP address, and the Layer 2 encapsulation. CEF is enabled on Router R2 using the global command ip cef. Listings B-3 and B-4 show Router R2's CEF adjacency table and CEF forwarding table.
Listing B-3 CEF Adjacency Table of Router R2
R4506#show adjacency detail
Protocol  Interface           Address
IP        Hssi2/0/0           point2point(26)
                              0 packets, 0 bytes
                              0F000800
                              CEF expires: 00:02:47  refresh: 00:00:47
IP        Hssi2/0/1           point2point(16)
                              0 packets, 0 bytes
                              0F000800
                              CEF expires: 00:02:01  refresh: 00:00:01
IP        FastEthernet1/0/0   200.200.200.20(16)
                              0 packets, 0 bytes
                              00000C31DD8B00E0B0E2B8430800
                              ARP 00:08:00
The adjacency table shows the Layer 2 encapsulation needed to reach various directly connected routers and end-hosts. Adjacencies are discovered by means of ARP, routing protocols, or map configuration commands used typically on ATM and Frame Relay interfaces. The adjacencies are periodically refreshed to keep them current. Listing B-4 CEF Forwarding Table of Router R2 R2#sh ip cef Prefix 0.0.0.0/32 20.0.0.0/8 131.108.0.0/16 191.108.0.0/16 191.108.10.8/30
191.108.10.12/30
194.194.194.0/24
200.200.200.0/24
200.200.200.1/32
200.200.200.2/32
200.200.200.3/32
200.200.200.16/28
200.200.200.16/32
200.200.200.17/32
200.200.200.31/32
200.200.200.100/30
200.200.200.100/32
200.200.200.101/32
200.200.200.102/32
200.200.200.103/32
200.200.200.104/30
200.200.200.104/32
200.200.200.105/32
200.200.200.107/32
200.200.200.108/30
201.201.201.8/30
201.201.201.16/28
210.210.0.0/16
224.0.0.0/4
224.0.0.0/24
255.255.255.255/32

Next Hop
200.200.200.106
200.200.200.106
attached
200.200.200.101
receive
200.200.200.106
attached
receive
receive
receive
attached
receive
200.200.200.101
receive
receive
attached
receive
receive
receive
200.200.200.101
200.200.200.106
200.200.200.106
200.200.200.106
200.200.200.106
0.0.0.0
receive
receive
The CEF table is stable as long as the topology, as reflected by the routing table, stays the same. All routing table entries have a one-to-one matching entry in the CEF table. The remaining part of the case study discusses CEF entries of the various adjacency types. Listings B-5 through B-10 show examples of the various CEF entry types.

Listing B-5 An Example of a CEF Entry with Receive Adjacency
R4506#sh ip cef 200.200.200.105
200.200.200.105/32, version 6, receive

CEF, in addition to maintaining a matching CEF entry for each route in the routing table, carries CEF entries with a Receive adjacency for all connected IP addresses, for the directed broadcast addresses (the first and the last addresses of each connected subnet) of the router, and for the general broadcast addresses (0.0.0.0 and 255.255.255.255). 200.200.200.105 is the IP address of the router's Hssi2/0/0 interface.

Listing B-6 An Example of a Recursive CEF Entry with Valid Cached Adjacency
R4506#sh ip cef 131.108.0.0
131.108.0.0/16, version 20, cached adjacency to Hssi2/0/0
0 packets, 0 bytes
  via 191.108.10.10, 0 dependencies, recursive
    next hop 200.200.200.106, Hssi2/0/0 via 191.108.10.8/30
    valid cached adjacency

The routing table entry for 131.108.0.0 has a next-hop address of 191.108.10.10, which is not a directly connected next hop. A recursive lookup on 191.108.10.10 is required to arrive at a directly connected next hop of 200.200.200.106 for 131.108.0.0. CEF does this recursive lookup prior to creating the CEF entry for 131.108.0.0. Hence, 131.108.0.0 gets a connected next hop of 200.200.200.106 in the CEF table.
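The recursion CEF precomputes can be illustrated with a short Python sketch (not IOS code; the table contents mirror Listing B-6).

```python
import ipaddress

def lookup(routes, dest):
    """Longest-prefix match; routes maps ip_network -> next-hop string."""
    addr = ipaddress.ip_address(dest)
    best = max((n for n in routes if addr in n),
               key=lambda n: n.prefixlen, default=None)
    return routes[best] if best is not None else None

def resolve(routes, dest, connected):
    """Chase next hops until one is directly connected -- the work CEF
    does once, up front, rather than per first packet."""
    hop, seen = lookup(routes, dest), set()
    while hop is not None and hop not in connected:
        if hop in seen:
            return None  # next-hop resolution loop
        seen.add(hop)
        hop = lookup(routes, hop)
    return hop

routes = {ipaddress.ip_network("131.108.0.0/16"): "191.108.10.10",
          ipaddress.ip_network("191.108.10.8/30"): "200.200.200.106"}
print(resolve(routes, "131.108.10.1", {"200.200.200.106"}))  # 200.200.200.106
```

The first lookup yields the recursive next hop 191.108.10.10; a second lookup resolves it to the connected next hop 200.200.200.106, which is what gets installed in the CEF entry.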
In the route-cache forwarding model, a recursive lookup for a destination, if needed, is done in the process-switched path for the first packet to that destination, before the corresponding cache entry is created. CEF precomputes recursive lookups before packet arrival, which saves CPU cycles.

Listing B-7 An Example of a CEF Entry with Valid Cached Adjacency with Dependencies
R2#sh ip cef 191.108.10.10
191.108.10.8/30, version 17, cached adjacency to Hssi2/0/0
0 packets, 0 bytes
  via 200.200.200.106, Hssi2/0/0, 4 dependencies
    next hop 200.200.200.106, Hssi2/0/0
    valid cached adjacency

This CEF entry denotes four dependencies. It indicates that four recursive entries (20.0.0.0/8, 131.108.0.0/16, 191.108.0.0/16, and 210.210.210.0/24) depend on this CEF entry to resolve their next hop.

Listing B-8 An Example of a CEF Entry for Equal Cost Paths with Per-Destination Load Sharing
R2#sh ip cef 200.200.200.108
200.200.200.108/30, version 12, per-destination sharing
0 packets, 0 bytes
  via 200.200.200.101, Hssi2/0/1, 0 dependencies
    traffic share 1
    next hop 200.200.200.101, Hssi2/0/1
    valid adjacency
  via 200.200.200.106, Hssi2/0/0, 0 dependencies
    traffic share 1
    next hop 200.200.200.106, Hssi2/0/0
    valid adjacency
  0 packets, 0 bytes switched through the prefix

The prefix 200.200.200.108/30 has two equal-cost paths and load-shares on a per source-destination pair basis. (Note that the term "per-destination sharing" in Listing B-8 is not entirely accurate. It is actually per source-destination pair load sharing.) Load sharing on a per-packet basis is also supported under CEF.

Listing B-9 An Example of a CEF Entry with Glean Adjacency
R2#sh ip cef 200.200.200.16 255.255.255.240
200.200.200.16/28, version 26, attached, connected
0 packets, 0 bytes
  via FastEthernet1/0/0, 0 dependencies
    valid glean adjacency

200.200.200.16/28 is a directly connected subnet on the Fast Ethernet interface.
Instead of creating a CEF entry for each host on this broadcast/multiaccess subnet, the subnet is installed with a glean adjacency. When sending a packet using a glean adjacency, CEF gleans the adjacency table to get the destination host's MAC address before switching the packet.

Listing B-10 An Example of a CEF Entry with Null Adjacency
R4506#sh ip cef 200.200.200.0
200.200.200.0/24, version 14, attached
0 packets, 0 bytes
  via Null0, 0 dependencies
    valid null adjacency

Traffic matching this CEF entry is directed to the Null0 interface.
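Stepping back to Listing B-8, the per source-destination load-sharing property can be illustrated with a toy hash (the actual hash used by CEF differs; this sketch only demonstrates the behavior, not the algorithm).

```python
import hashlib

def pick_path(src, dst, paths):
    """Hash the (source, destination) pair so that every packet of a
    flow takes the same path, while different pairs spread across the
    available equal-cost paths."""
    digest = hashlib.md5(f"{src}->{dst}".encode()).hexdigest()
    return paths[int(digest, 16) % len(paths)]

paths = ["Hssi2/0/0 via 200.200.200.106", "Hssi2/0/1 via 200.200.200.101"]
# The same source-destination pair always picks the same path, so
# packets of one flow are never reordered across links:
assert pick_path("10.0.0.1", "200.200.200.110", paths) == \
       pick_path("10.0.0.1", "200.200.200.110", paths)
```

Per-packet load sharing, by contrast, would rotate among the paths regardless of the flow, at the cost of possible reordering.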
Summary
CEF is the recommended switching mechanism in today's large-scale IP networks and ISP networks. Support for some of the QoS functions can be limited to specific packet-switching modes. In particular, all distributed QoS functions, where QoS functions run on a router's individual line cards instead of on the central processor card, require distributed CEF.
QoS-Based Routing
QoS-based routing is a routing mechanism under which paths for flows are determined based on some knowledge of resource availability in the network, as well as the QoS requirements of the flows[1]. QoS-based routing calls for the following significant extensions: A routing protocol carries metrics with dynamic resource (QoS) availability information (for example, available bandwidth, packet loss, or delay). A routing protocol should calculate not only the most optimal path, but also multiple possible paths, based on their QoS availability. Each flow carries its required QoS. The required QoS information can be carried in the Type of Service (ToS) byte in the Internet Protocol (IP) header. The routing path for a flow is then chosen according to the flow's QoS requirement.
QoS-based routing also involves significant challenges. QoS availability metrics are highly dynamic in nature. This makes routing updates more frequent, consuming valuable network resources and router CPU cycles. A flow could oscillate frequently among alternate QoS paths as the path QoS metrics fluctuate. Furthermore, frequently changing routes can increase jitter, the variation in the delay experienced by end users. Unless these concerns are addressed, QoS-based routing defeats its objective of being a value add-on to a QoS-based network. Open Shortest Path First (OSPF) and Intermediate System-to-Intermediate System (IS-IS), the common Interior Gateway Protocols (IGPs) in a service provider network, could advertise a ToS byte along with a link-state advertisement, but the ToS byte is currently set to zero and is not used. QoS routing is still a topic under discussion in the standards bodies. In the meantime, OSPF and IS-IS are being extended for Multiprotocol Label Switching (MPLS) traffic engineering (TE) to carry link resource information with each route. These routing protocols remain destination-based, but each route carries extra resource information, which protocols such as MPLS can use for TE. TE-extended OSPF and IS-IS provide a practical trade-off between present-day destination-based routing protocols and QoS routing. MPLS-based TE is discussed in Chapter 10, "MPLS Traffic Engineering."
Policy-Based Routing
Routing in IP networks today is based solely on a packet's destination IP address. Routing based on other information carried in a packet's IP header, or on packet length, is not possible using present-day dynamic routing protocols. Policy routing is intended to address this need for flexible routing policies. For traffic destined to a particular server, an Internet service provider (ISP) might want to send traffic with a precedence of 3 over a faster, dedicated link than traffic with a precedence of 0. Though the destination is the same, the traffic is routed over a different dedicated link for each IP precedence. Similarly, routing can be based on packet length, source address, a flow defined by the source-destination pair and Transmission
Control Protocol (TCP)/User Datagram Protocol (UDP) ports, ToS/precedence bits, batch versus interactive traffic, and so on. This flexible routing mode is commonly referred to as policy-based routing. Policy-based routing is not based on any dynamic routing protocol; rather, it uses static configuration local to the router. It allows traffic to be routed based on the defined policy, either when specific routing information for the flow destination is unavailable, or by totally bypassing the dynamic routing information. In addition, for policy-routed traffic, you can configure a router to mark the packet's IP precedence.

Note  Some policy routing functions that require a route table lookup perform well in the Cisco Express Forwarding (CEF) switching path. CEF is discussed in Appendix B, "Packet Switching Mechanisms." Because CEF mirrors each entry in the routing table, policy routing can use the CEF table without ever needing a separate route table lookup. If NetFlow accounting is enabled on the interface to collect flow statistics, you should enable the ip route-cache flow accelerate command. In exchange for a small amount of additional memory, flow-cache entries carry state and avoid the policy route-map check for each packet of an active flow. NetFlow accounting is used to collect traffic flow statistics.
Because the policy routing configuration is static, it can potentially black-hole traffic when the configured next hop is no longer available. Policy routing can use the Cisco Discovery Protocol (CDP) to verify next-hop availability. When policy routing can no longer see the next hop in the CDP table, it stops forwarding the matching packets to the configured next hop and routes those packets using the routing table. The router reverts to policy routing when the next hop becomes available (through CDP). This functionality applies only when CDP is enabled on the interface.
Listing C-1 shows the configuration on the Internet Router (IR) of the e-commerce company that enables the router to route packets based on their IP precedence value.

Listing C-1 Configuration on the IR to Route Packets Based on Their IP Precedence Value
interface FastEthernet 2/0/1
 ip address 211.201.201.65 255.255.255.224
 ip policy route-map tasman

access-list 101 permit ip any any precedence routine
access-list 101 permit ip any any precedence priority
access-list 101 permit ip any any precedence immediate
access-list 101 permit ip any any precedence flash
access-list 102 permit ip any any precedence flash-override
access-list 102 permit ip any any precedence critical
access-list 102 permit ip any any precedence internet
access-list 102 permit ip any any precedence network

route-map tasman permit 10
 match ip address 101
 set ip next-hop 181.188.10.14
route-map tasman permit 20
 match ip address 102
 set ip next-hop 181.188.10.10

The interface FastEthernet2/0/1 is the input interface for all internal traffic, and policy routing is enabled on it. All packets arriving on this interface are policy-routed based on route-map tasman. access-list 101 and access-list 102 are used to match packets with IP precedence values of 0, 1, 2, 3 and 4, 5, 6, 7, respectively. All packets matching access-list 101 are forwarded to the next-hop IP address 181.188.10.14. All packets matching access-list 102 are forwarded to the next-hop IP address 181.188.10.10. Listing C-2 shows the relevant show commands for policy routing.

Listing C-2 show Commands for Verifying Policy Routing Configuration and Operation
IR#show ip policy
Interface          Route map
FastEthernet2/0/1  tasman

IR#show route-map tasman
route-map tasman, permit, sequence 10
  Match clauses:
    ip address (access-lists): 101
  Set clauses:
    ip next-hop 181.188.10.14
  Policy routing matches: 0 packets, 0 bytes
route-map tasman, permit, sequence 20
  Match clauses:
    ip address (access-lists): 102
  Set clauses:
    ip next-hop 181.188.10.10
  Policy routing matches: 0 packets, 0 bytes

The show ip policy command shows the interface(s) performing policy routing for incoming packets, along with the associated route map for each policy-routed interface. The show route-map tasman command shows the details of the route map tasman and the policy-routed packet statistics for each element (sequence number) of the route map.
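The route-map evaluation behind Listing C-1 can be sketched in Python. This is an illustration, not IOS code; the predicates stand in for access lists 101 and 102, and the packet is modeled as a dictionary.

```python
def policy_route(packet, route_map, reachable):
    """Statements are evaluated in sequence order.  A match clause with
    several parameters succeeds when ANY parameter matches; a statement
    fires when ALL of its match clauses succeed.  'set ip next-hop X Y Z'
    uses the first reachable candidate."""
    for stmt in route_map:
        if all(any(test(packet) for test in clause)
               for clause in stmt["match"]):
            for hop in stmt["set_next_hop"]:
                if hop in reachable:
                    return hop
            return None  # matched, but no reachable hop: route normally
    return None          # no statement fired: route normally

low  = lambda p: p["precedence"] <= 3   # stand-in for access-list 101
high = lambda p: p["precedence"] >= 4   # stand-in for access-list 102
route_map = [{"match": [[low]],  "set_next_hop": ["181.188.10.14"]},
             {"match": [[high]], "set_next_hop": ["181.188.10.10"]}]
print(policy_route({"precedence": 5}, route_map,
                   {"181.188.10.10", "181.188.10.14"}))  # 181.188.10.10
```

Returning None models the fallback to ordinary destination-based routing when no policy applies or the configured next hop is unavailable.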
Listing C-3 Router Configuration to Set Precedence Value Based on the Packet Size
interface FastEthernet 4/0/1
 ip address 201.201.201.9 255.255.255.252
 ip policy route-map tasman

route-map tasman permit 10
 match length 32 1000
 set ip precedence 5

All packets with a minimum and maximum packet size of 32 and 1000 bytes, respectively, are set with an IP precedence of 5.

Note  A few handy pieces of information on policy-routing configuration are given here: Only one policy route map is allowed per interface. You can, however, enter multiple route-map elements with different combinations of match and set commands. You can specify multiple match and set statements in a policy-routing route map; when all match conditions are true, all sets are performed. When more than one parameter is used in a match or a set statement, the match or set succeeds when any one of the parameters is a successful match or a successful set, respectively. match ip address 101 102 is true, for example, when the packet matches either IP access list 101 or 102. set ip next-hop X Y Z sets the IP next hop for the matched packet to the first reachable next hop; X, Y, and Z are the first, second, and third choices for the IP next hop. The ip policy route-map command is used to policy-route incoming traffic on a router interface. To policy-route router-generated (nontransit) traffic, use the ip local policy route-map command. At this time, policy routing matches packets only through IP access lists or packet length. Here is the evaluation order of commands defining policy routing:
This section does not go into interservice-provider QoS policy propagation, as it depends on the negotiated SLAs between service providers; instead, it concentrates on propagating QoS policy information for customer networks across the service provider network. All traffic to or from a customer gets its QoS policy (IP precedence) at the point of entry into the service provider network. As discussed earlier, an edge router of the service provider connecting to a customer can simply apply a QoS policy by writing the packet's IP precedence value based on its service level. The precedence value is used to indicate the service level to the service provider network. Because Internet traffic is asymmetrical, traffic intended for a premium customer might arrive at the service provider network on any of the service provider's edge routers. Therefore, the question here is, how do all the routers in a service provider network recognize incoming traffic to a premium customer and set the packet's IP precedence to a value based on its service level? This section studies ways you can use BGP for QoS policy propagation in such situations. QoS policy propagation using BGP[2] is a mechanism to classify packets based on IP prefix, BGP community, and BGP autonomous system (AS) path information. The supported classification policies include the IP precedence setting and the ability to tag the packet with a QoS class identifier, called a QoS group, internal to the router. After a packet is classified, you can use other QoS features such as Committed Access Rate (CAR) and Weighted Random Early Detection (WRED) to specify and enforce business policies to fit your business model. CAR and WRED are discussed in detail in Chapter 3, "Network Boundary Traffic Conditioners: Packet Classifier, Marker, and Traffic Rate Management," and in Chapter 6, "Per-Hop Behavior: Congestion Avoidance and Packet Drop Policy." QoS policy propagation using BGP requires CEF.
CEF switching is discussed in Appendix B. Any BGP QoS policy in the BGP routing table is passed to the CEF table through the IP routing table: the CEF entry for a destination prefix is tagged with the BGP QoS policy. When a packet is CEF-switched, the QoS policy in the matching CEF entry is applied to the packet.
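This propagation can be modeled roughly in Python (an illustrative sketch, not IOS code; the route attributes and the table-map representation are hypothetical).

```python
import re

def build_cef(bgp_routes, table_map):
    """table-map sketch: as each BGP route is installed in the routing
    table, the first matching route-map statement tags it with an IP
    precedence, and the tag follows the route into the CEF table."""
    cef = {}
    for prefix, attrs in bgp_routes.items():
        entry = {"next_hop": attrs["next_hop"]}
        for matches, precedence in table_map:
            if matches(attrs):
                entry["precedence"] = precedence
                break
        cef[prefix] = entry
    return cef

# Routes carrying a community ending in :4 (premium) get precedence 4:
table_map = [(lambda a: any(re.search(r":4$", c)
                            for c in a["communities"]), 4)]
bgp = {"194.194.194.0/24": {"next_hop": "201.201.201.10",
                            "communities": ["109:4"]},
       "20.0.0.0/8":       {"next_hop": "200.200.200.106",
                            "communities": []}}
cef = build_cef(bgp, table_map)
print(cef["194.194.194.0/24"].get("precedence"))  # 4
```

Untagged routes are installed without a precedence, so packets matching them are forwarded with whatever marking they arrived with.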
In Figure C-2, Router BR-3 connects to the premium customer. Therefore, all traffic coming from the premium customer connection gets premium service within the provider network. Premium service is identified by an IP precedence of 4 in the packet header. Internet traffic for the premium customer can arrive on either Router BR-1 or BR-3, because both routers peer with the rest of the Internet. All such Internet traffic on Router BR-1 and Router BR-3 going to the premium customer network needs to be given an IP precedence of 4. Listing C-4 shows how to enable Router BR-3 for a premium customer and BGP policy propagation functionality for premium service.

Listing C-4 Enable Router BR-3 with BGP Policy Propagation Functionality for Premium Service and a Premium Customer Connection
ip cef

interface loopback 0
 ip address 200.200.200.3 255.255.255.255

interface Serial4/0/1
 ip address 201.201.201.10 255.255.255.252
 bgp-policy source ip-prec-map

interface Hssi3/0/0
 bgp-policy destination ip-prec-map

interface Serial4/0/0
 bgp-policy destination ip-prec-map

ip bgp-community new-format

router bgp 109
 table-map tasman
 neighbor 200.200.200.1 remote-as 109
 neighbor 201.201.201.10 remote-as 4567
 neighbor 201.201.201.10 route-map premium in

route-map tasman permit 10
 match as-path 1
 set ip precedence 4

route-map premium permit 10
 set community 109:4
ip as-path access-list 1 permit ^4567$

The route-map tasman command on router BR-3 sets a precedence of 4 for all routes with an AS path of 4567. In this case, only route 194.194.194.0/24 belongs to AS 4567. Hence, IP precedence 4 is set on this route in the IP routing table, and it is carried over to the CEF table. In addition, routes received on this peering with the premium customer are assigned a community of 109:4 using the route-map premium command, so that routers elsewhere can use the community information to assign a policy. Note that the bgp-policy source ip-prec-map command is used on interface Serial4/0/1 so that BGP policy propagation is applied to all premium customer packets. Here, IP precedence mapping is done based on the arriving packet's source address, using the precedence value tagged to the source IP address's matching CEF entry. Internet traffic going to the premium customer can enter the service provider network on any of its edge routers with peering connections to other service providers. Hence, QoS policy information regarding a premium customer should be propagated throughout the provider network so that the edge routers can set IP precedence based on it. In this example, Internet traffic for the premium customer can arrive on either Router BR-1 or BR-3. Premium customer traffic arriving on interface Hssi3/0/0 and on Serial4/0/0 of Router BR-3 is assigned a precedence of 4. The bgp-policy destination ip-prec-map command is needed on the packets' input interface so that BGP policy propagation is applied to all incoming packets. Here, IP precedence mapping is done based on the packet's destination address, using the matching CEF entry's precedence value for the destination address. Listing C-5 shows the relevant BR-1 configuration that enables BGP policy propagation for premium service.
Listing C-5 Enable Router BR-1 with BGP Policy Propagation Functionality for Premium Service
ip cef

interface Hssi3/0/0
 bgp-policy destination ip-prec-map

ip bgp-community new-format

router bgp 109
 table-map tasman
 neighbor 200.200.200.3 remote-as 109

route-map tasman permit 10
 match community 101
 set ip precedence 4

ip community-list 101 permit :4$

In Listing C-5, the table-map command on router BR-1 uses route-map tasman to assign a precedence of 4 to all BGP routes in the routing table that have a BGP community whose last two bytes are set to 4. Because router BR-3 tags the premium customer route 194.194.194.0/24 with community 109:4 and exchanges it via IBGP with routers BR-1 and BR-2, router BR-1 tags 194.194.194.0/24 in its IP routing table and CEF table with an IP precedence value of 4. The bgp-policy destination ip-prec-map command is needed on the input interface Hssi3/0/0 of router BR-1 so that BGP policy propagation is applied to incoming packets from the Internet based on their destination IP address.
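The difference between the source and destination flavors of the ip-prec-map lookup can be sketched as follows (illustrative Python, not IOS code; the CEF table contents are hypothetical).

```python
import ipaddress

def lookup(cef, ip):
    """Longest-prefix match in a CEF table keyed by ip_network."""
    addr = ipaddress.ip_address(ip)
    best = max((n for n in cef if addr in n),
               key=lambda n: n.prefixlen, default=None)
    return cef.get(best)

def bgp_policy_mark(packet, cef, mode):
    """'bgp-policy source ip-prec-map' keys on the packet's source
    address (customer-facing interface); 'bgp-policy destination
    ip-prec-map' keys on its destination (Internet-facing interfaces)."""
    key = packet["src"] if mode == "source" else packet["dst"]
    entry = lookup(cef, key)
    if entry and "precedence" in entry:
        packet["precedence"] = entry["precedence"]
    return packet

cef = {ipaddress.ip_network("194.194.194.0/24"): {"precedence": 4}}
pkt = bgp_policy_mark({"src": "9.9.9.9", "dst": "194.194.194.7"},
                      cef, "destination")
print(pkt["precedence"])  # 4
```

The same CEF entry thus marks traffic from the premium customer on BR-3's customer-facing interface (source mode) and traffic toward the customer on the Internet-facing interfaces of BR-1 and BR-3 (destination mode).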
Summary
There is a growing need for QoS and traffic engineering in large, dynamic routing environments. The capability of policy-based routing to selectively set precedence bits and route packets based on a predefined flexible policy is becoming increasingly important. At the same time, routing protocols such as OSPF and IS-IS are
being extended to support QoS. TE extends OSPF and IS-IS to carry available-resource information along with their advertisements, a step toward full QoS routing. The viability of full QoS routing is still under discussion in the standards bodies. BGP facilitates policy propagation across the entire network. CEF gets this BGP policy information from the routing table and uses it to set a packet's policy before forwarding it.
References
1. RFC 2386, "A Framework for QoS-Based Routing in the Internet," E. Crawley et al., 1998.
2. RFC 1771, "A Border Gateway Protocol 4 (BGP-4)," Y. Rekhter and T. Li, 1995.
As you can see in Figure F-1, the real-time voice traffic is placed in the priority queue. The other, non-real-time data traffic can go into one or more normal Weighted Fair Queuing (WFQ) queues. The packets belonging to the non-real-time traffic are fragmented to bound the blocking delay seen by the voice traffic, and the data traffic fragments are placed in the WFQ queues. CBWFQ with a priority queue can then run on the voice and data queues. The maximum blocking delay seen by a voice packet is equal to a fragment's serialization delay. Table F-1 shows the fragment size for a maximum blocking delay of 10 ms based on link speed.

Table F-1. Fragment Size for a Maximum Blocking Delay of 10 ms Based on Link Speed
Link Speed (in Kbps)    Fragment Size (in Bytes)
56                      70
64                      80
128                     160
256                     320
512                     640
800                     1000
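The fragment sizes follow directly from serialization arithmetic: a fragment of at most link_speed × delay bits fits within the delay bound. A quick check in Python (the link speeds are conventional serial-link rates inferred from the listed fragment sizes):

```python
def fragment_size(link_kbps, max_delay_ms=10):
    """Largest fragment (in bytes) whose serialization time on the
    given link fits within the blocking-delay bound."""
    return link_kbps * 1000 * max_delay_ms // 1000 // 8

for kbps in (56, 64, 128, 256, 512, 800):
    print(kbps, fragment_size(kbps))  # 70, 80, 160, 320, 640, 1000 bytes
```

For example, on a 56-kbps link a 70-byte fragment takes 70 × 8 / 56000 = 10 ms to serialize, which is exactly the delay bound.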
LFI ensures that voice and similar small-size packets are not unacceptably delayed behind large data packets. It also attempts to ensure that the small packets are sent in a more regular fashion, thereby reducing jitter. This capability allows a network to carry voice and other delay-sensitive traffic along with nontime-sensitive traffic.

Note  MLPP link-layer fragmentation and interleaving are used in conjunction with CBWFQ using a priority queue to minimize the delay seen by the voice traffic. Listing F-1 shows the configuration for this purpose.

Listing F-1 MLPP Link-Layer Fragmentation Configuration
class-map premium
 match <voice packets>

policy-map premiumpolicy
 class premium
  priority 500

interface serial0
 bandwidth 128
 no fair-queue
 ppp multilink

interface serial1
 bandwidth 128
 no fair-queue
 ppp multilink

interface virtual-template 1
 service-policy output premiumpolicy
 ppp multilink
 ppp multilink fragment-delay 20
 ppp multilink interleave

In this example, an MLPP bundle configuration is added on the virtual-template interface. Interfaces Serial0 and Serial1 are made part of the MLPP bundle using the ppp multilink command. Note that CBWFQ with the priority queue is enabled on the virtual-template interface and not on the physical interfaces that are part of the MLPP bundle. The ppp multilink fragment-delay 20 command provides a maximum delay bound of 20 ms for the voice traffic. To interleave the voice packets among the fragments of larger packets on an MLPP bundle, the ppp multilink interleave command is used. The CBWFQ policy premiumpolicy provides strict-priority bandwidth for the voice traffic.
Table G-1. IP Precedence Table
IP Precedence Value   IP Precedence Bits   IP Precedence Names    ToS Byte Value
0                     000                  Routine                0 (0x00)
1                     001                  Priority               32 (0x20)
2                     010                  Immediate              64 (0x40)
3                     011                  Flash                  96 (0x60)
4                     100                  Flash Override         128 (0x80)
5                     101                  Critical               160 (0xA0)
6                     110                  Internetwork Control   192 (0xC0)
7                     111                  Network Control        224 (0xE0)
Figure G-2 Differentiated Services Code Point (DSCP) Bits in the Differentiated Services (DS) Byte
Defined DSCPs:
Default DSCP: 000 000.
Class Selector DSCPs:

Table G-2. Class Selector DSCPs
Class Selector   DSCP
CS1              001 000
CS2              010 000
CS3              011 000
CS4              100 000
CS5              101 000
CS6              110 000
CS7              111 000
Expedited Forwarding (EF) per-hop behavior (PHB) DSCP: 101110.
Assured Forwarding (AF) PHB DSCPs:

Table G-3. Assured Forwarding (AF) PHB DSCPs
Drop Precedence   Class 1         Class 2         Class 3         Class 4
Low               AF11 (001010)   AF21 (010010)   AF31 (011010)   AF41 (100010)
Medium            AF12 (001100)   AF22 (010100)   AF32 (011100)   AF42 (100100)
High              AF13 (001110)   AF23 (010110)   AF33 (011110)   AF43 (100110)
Mapping between IP precedence and DSCP: Table G-4 shows how IP precedence is mapped to DSCP values.

Table G-4. IP Precedence to DSCP Mapping
IP Precedence   DSCP
0               0
1               8
2               16
3               24
4               32
5               40
6               48
7               56

Table G-5 shows how DSCP is mapped to IP precedence values.

Table G-5. DSCP to IP Precedence Mapping
DSCP      IP Precedence
0-7       0
8-15      1
16-23     2
24-31     3
32-39     4
40-47     5
48-55     6
56-63     7
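These mappings are pure bit arithmetic: IP precedence occupies the top three bits of the six-bit DSCP, which in turn occupies the top six bits of the old ToS byte. A short Python check:

```python
def prec_to_dscp(prec):
    """IP precedence occupies the top three bits of the 6-bit DSCP."""
    return prec << 3

def dscp_to_prec(dscp):
    """The class-selector (top three) bits of a DSCP give the
    backward-compatible IP precedence."""
    return dscp >> 3

def prec_to_tos_byte(prec):
    """In the original ToS byte, precedence sits in the top three bits
    of the full eight-bit byte."""
    return prec << 5

print([prec_to_dscp(p) for p in range(8)])  # [0, 8, 16, 24, 32, 40, 48, 56]
print(dscp_to_prec(0b101110))               # 5: EF (46) maps to precedence 5
```

This is why, for example, the EF DSCP 101110 is treated as precedence 5 (Critical) by precedence-only devices.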