SIGCOMM eBook 2013, Volume 1
Introduction
Professors often rely on textbooks to teach undergraduate and graduate networking courses. While there are many good introductory textbooks, there are very few books on advanced networking topics that are suitable for graduate courses in networking. To fill this gap, the SIGCOMM Education Committee has launched a community project to develop a high-quality, open-source, edited eBook on Recent Advances in Networking. This eBook will be distributed online via the SIGCOMM website.

This eBook is composed of nine chapters chosen after a highly selective review process by the editorial board. The selected chapters cover advanced networking topics and come with accompanying teaching material (slides and exercises). All the source code of the eBook and the teaching material are kept in a version-controlled repository that will be accessible to the entire SIGCOMM community. We expect that releasing such high-quality teaching material will be beneficial for a large number of students and professors. The teaching material will be updated on a regular basis to reflect new advances in our field. We will also be adding new chapters on more emerging topics in future volumes.

We wish to thank all the authors for providing such high-quality chapters. We are also grateful to the reviewers and the editorial board for spending many hours on each chapter to ensure a coherent level amongst all the chapters. We hope you enjoy reading this eBook and find it a useful resource.

Hamed Haddadi, Olivier Bonaventure
Hamed Haddadi and Olivier Bonaventure (editors), Recent Advances in Networking, Volume 1, ACM SIGCOMM eBook, August 2013.
In the chapter on Internet Topology Research Redux we revisit some of the interesting properties of the critical infrastructure at the router, AS, and PoP levels which today underpin how the world is knit together. On top of this ever-evolving topology, traffic sources are regulated by end-to-end protocols such as TCP; these too see continual changes, and the chapter on Recent Advances in Transport Protocols covers some of the ways TCP is being enhanced to cope with multihoming and to take advantage of multiple active routes between ends, whether in the data center or in the pocket. We can put together the traffic sources and the topology under one framework, that of Internet Traffic Matrices. The next chapter introduces a primer on this topic.

One general framework for understanding dynamic traffic management in the Internet is to think of the system as one of continual optimization. The chapter on Optimizing and Modeling Dynamics in Networks revisits the goals of fairness and the control problem and its solution space. It has been said that all is fair in love and war, but the only certainties in the world are death and taxes. Life may not be fair, but we can enforce a different kind of fairness on the Internet through traffic pricing. The next chapter discusses Smart Data Pricing (SDP) and covers a range of Economic Solutions to Network Congestion, bringing game theory to bear on the problem where the previous chapter engaged with the weapons of optimization.

At a coarser grain (in time and space) we can control traffic by partitioning our network into VPNs; the next chapter looks at the practical tools available to the operator to manage a set of such disjoint systems and their capacity, by focusing on MPLS and Virtual Private Networks.

For some time, the capacity of the net has been consumed largely by people downloading or live streaming content. Content Distribution Networks are overlays that allow management of the load. A less rigorous but nearly as popular family of tools exists based on the famous peer-to-peer paradigm. The next chapter looks at how the world has moved on from fighting to embracing Collaboration Opportunities for Content Providers and Network Infrastructures.

The true explosion in Internet access in the developing world (and now the dominant form of access in the developed world too) has been through mobile devices (smart phones, tablets, and the rest). The next chapter looks at the Design Space of Network Mobility, and how seamlessness is achieved (or at least as close to seamless as we can get today). Finally, there have been sporadic attempts to build provider-less networks that are built out of mobile wireless nodes only. It is quite a challenge, and a number of breakthroughs over the last few years have led to engineering solutions for Enabling Multihop Communication in Spontaneous Wireless Networks, which are starting to look like a viable alternative to managed wireless networks in some niche areas.

This is an exciting time to be learning about communications and building and extending systems for communications. Planet Earth is not far away from being 100% connected, and the capacity and functionality that have been achieved over the last few decades are quite astounding. It does not look like we have hit any fundamental limits, nor will we for some time to come! Read this book by some of the world's leading lights in the area of communications and enjoy.

Jon Crowcroft
Cambridge, UK, August 2013
Editorial Board
Ernst Biersack (Eurecom)
Olivier Bonaventure (Université catholique de Louvain)
Jon Crowcroft (University of Cambridge)
Walid Dabbous (INRIA Sophia Antipolis)
Bruce Davie (VMware)
Anja Feldmann (Technische Universität Berlin)
Timothy Griffin (University of Cambridge)
Hamed Haddadi (Queen Mary University of London)
Ramesh Johari (Stanford University)
Srinivasan Keshav (University of Waterloo)
Jean-Yves Le Boudec (École Polytechnique Fédérale de Lausanne)
Jennifer Rexford (Princeton University)
David Wetherall (University of Washington)
Walter Willinger (AT&T Labs Research)
Review Board
Mark Allman (International Computer Science Institute)
Emmanuel Baccelli (INRIA Saclay)
Saleem Bhatti (University of St Andrews)
Olivier Bonaventure (Université catholique de Louvain)
Gonzalo Camarillo (Ericsson Research)
Dah Ming Chiu (The Chinese University of Hong Kong)
Costas Courcoubetis (Athens University of Economics and Business)
Mark Crovella (Boston University)
Jon Crowcroft (University of Cambridge)
Bruce Davie (VMware)
Benoit Donnet (Université de Liège)
Damien Fay (Bournemouth University)
Paul Francis (Max Planck Institute for Software Systems)
Hamed Haddadi (Queen Mary University of London)
Luigi Iannone (TELECOM ParisTech)
Srinivasan Keshav (University of Waterloo)
Olaf Maennel (Loughborough University)
Konstantinos Nikitopoulos (University College London)
Costin Raiciu (Universitatea Politehnica Bucuresti)
Jennifer Rexford (Princeton University)
Miguel Rio (University College London)
Matthew Roughan (University of Adelaide)
Jean-Louis Rougier (TELECOM ParisTech)
Damien Saucez (INRIA Sophia Antipolis)
Aman Shaikh (AT&T Labs Research)
Steve Uhlig (Queen Mary University of London)
List of Chapters
1. Internet Topology Research Redux
   Walter Willinger, Matthew Roughan
2. Recent Advances in Reliable Transport Protocols
   Olivier Bonaventure, Janardhan Iyengar, Costin Raiciu
3. Internet Traffic Matrices: A Primer
   Paul Tune, Matthew Roughan
4. Optimizing and Modeling Dynamics in Networks
   Ibrahim Matta
5. Smart Data Pricing (SDP): Economic Solutions to Network Congestion
   Soumya Sen, Carlee Joe-Wong, Sangtae Ha, Mung Chiang
6. MPLS Virtual Private Networks
   Luca Cittadini, Giuseppe Di Battista, Maurizio Patrignani
7. Collaboration Opportunities for Content Delivery and Network Infrastructures
   Benjamin Frank, Ingmar Poese, Georgios Smaragdakis, Anja Feldmann, Bruce M. Maggs, Steve Uhlig, Vinay Aggarwal, Fabian Schneider
8. The Design Space of Network Mobility
   Pamela Zave, Jennifer Rexford
9. Enabling Multihop Communication in Spontaneous Wireless Networks
   Juan Antonio Cordero, Jiazi Yi, Thomas Clausen, Emmanuel Baccelli
1 Introduction
Internet topology research is concerned with the study of the various types of connectivity structures that are enabled by the layered architecture of the Internet. More than a decade of Internet topology research has produced a number of high-profile "discoveries" that continue to fascinate the scientific community, even though (or, especially because) they have been simultaneously touted by different segments of that community as either seminal, controversial, seriously flawed, or simply wrong. Among these highly-popularized discoveries are the observed power-law relationships of the Internet topology, the network's scale-free nature, and its extreme vulnerability to attacks that target the highly-connected nodes in its core (i.e., the Achilles heel of the Internet). The purpose of this chapter is to bring order to the current state of Internet topology research and separate the wheat from the chaff. In particular, by relying on carefully vetted data and readily available domain knowledge, we re-examine the reported discoveries and expose them to higher standards with respect to statistical inference and model validation. In the process, we reveal the superficial nature of many of these discoveries and provide alternative solutions that reflect networking reality and do not collapse under scrutiny with high-quality data or when examined in detail by domain experts.
W. Willinger, M. Roughan, Internet Topology Research Redux, in H. Haddadi, O. Bonaventure (Eds.), Recent Advances in Networking, (2013), pp. xx-yy. Licensed under a CC-BY-SA Creative Commons license.
as Autonomous Systems (ASes). Together, these individual networks form what we now call the "public Internet" and are owned by a diverse set of organizations and companies that includes large and small Internet Service Providers (ISPs), transit providers, network service providers, Fortune 500 companies and small businesses, academic and research organizations, content providers, Content Distribution Networks (CDNs), Web hosting companies, and cloud providers.

With this transition came an increasing fascination of the research community with a largely economics-driven connectivity structure commonly referred to as the Internet's AS-graph; that is, the logical Internet topology where nodes represent individual ASes and edges reflect observed relationships among the ASes (e.g., customer-provider, peer-peer, or sibling-sibling relationships). It is important to note that the AS-graph says little about how two ASes connect with one another at the physical level; in particular, it says nothing about if or how they exchange actual traffic. Nevertheless, starting shortly after 1995, this fascination with the AS-graph has resulted in thousands of research publications covering a range of aspects related to measuring, modeling, and analyzing the AS-level topology of the Internet and its evolution over time [60, 135].

At the application layer, the emergence of the World Wide Web (WWW) in the late 1990s as a killer application generated general interest in exploring the Web-graph, where nodes represent web pages and edges denote hyperlinks [18]. While this overlay network or logical connectivity structure says nothing about how the servers hosting the web pages are connected at the physical or AS level, its scale and dynamics differ drastically from those of its physical-based or economics-driven underlays: a typical Web-graph has billions of nodes and even more edges and is highly dynamic; a large ISP's router-level topology consists of some thousands of routers, and today's AS-level Internet is made up of some 30,000-40,000 actively routed ASes and an order of magnitude more links. Other applications that give rise to their own "overlay" or logical connectivity structures and have attracted some attention among researchers include email and various P2P systems such as Gnutella, Kad, eDonkey, and BitTorrent. More recently, the enormous popularity of Online Social Networks (OSNs) has resulted in a staggering number of research papers dealing with all different aspects of measuring, modeling, analyzing, and designing OSNs. Data from large-scale crawls or, in rare circumstances, OSN-provided data have been used to examine snapshots of many real-world OSNs or OSN-type systems, where the snapshots are generally simple graphs with nodes representing individual users and edges denoting some implicit or explicit friendship relationship among the users.
technology-enabled inter-personal communication at previously unheard-of scale. In addition, mathematicians are interested in the different connectivity structures mainly because of their many novel features and properties that tend to require new and creative modeling and analysis methodologies. From the perspective of many computer scientists, the challenges posed by many of these intricate connectivity structures are algorithmic in nature and arise from trying to solve specific problems involving a particular topological structure. For yet another motivation, many physicists-turned-network-scientists see the Internet as one of many examples of large-scale complex networks that await the discovery of universal properties that do not depend on system-specific details and advance our understanding of these complex networks irrespective of the domain in which they arose in the first place.
impression that "too many cooks spoil the broth." We hope that in the not-too-distant future, this impression will be replaced by "many hands make light work," and we see this chapter as a first step towards achieving this goal.
1.4 Themes
In writing this chapter a number of themes emerged, and it is our intention to highlight them, to bring out into the open the main differences between a detail-oriented engineering approach to Internet topology modeling and an approach that has become a hallmark of network science, which aims at abstracting away as many details as possible to uncover universal laws that govern the behavior of large-scale complex networks irrespective of the domains that specify those networks in the first place.
Theme 1: When studying highly-engineered systems such as the Internet, details in the form of protocols, architecture, functionality, and purpose matter.

Theme 2: When analyzing Internet measurements, examining the hygiene of the available measurements (i.e., an in-depth recounting of the potential pitfalls associated with producing the measurements in question) is critical.

Theme 3: When validating proposed topology models, it is necessary to treat network modeling as an exercise in reverse-engineering and not as an exercise in model-fitting.

Theme 4: When modeling highly-engineered systems such as the Internet, beware of H. L. Mencken's quote "For every complex problem there is an answer that is clear, simple, and wrong."
2 Primer
We start first by defining some common ideas, motives, and problems within the scope of network topology modelling.
We can also define functions on groupings of nodes or edges, though in this case it is not as conceptually obvious why we might. However, an exemplary case is that of "on-net", where we might define a function that classifies pairs of nodes as being on the same subnet or not. Thus, such functions can ascribe meaning to groupings of nodes.

Many of the Internet graphs have symmetric links (that is, if $i \to j$ is a link, then $j \to i$ is also a link) and so these networks are undirected, but we sometimes also need to represent asymmetric links; we do so with a directed graph or digraph, and we call the links in such a digraph arcs. In the study of network topology we might come across the more generalized graph concepts of the multigraph and hypergraph:

- hypergraph: links connect more than two nodes, e.g., where you have a connective medium (rather than a wire), for instance in a wireless network.
- multigraph or pseudograph: has multiple parallel links between two nodes, e.g., it is easy to have two links between two routers.

We'll exclude these cases unless explicitly stated, but it is worth noting that each of these does apply to particular aspects of the Internet.

We say two nodes are connected if a path exists between them, and that a graph is connected if all pairs of nodes are connected. A graph is $k$-node-connected if the graph remains connected after the removal of any set of $k-1$ or fewer nodes (and corresponding links), and $k$-edge-connected if the graph remains connected after the removal of any $k-1$ edges.

For an undirected graph $G$, define the neighborhood of node $i$ by $N_i = \{ j \mid (i,j) \in E \}$, i.e., the set of nodes adjacent to $i$, and define the degree of the node to be the number of elements in the neighborhood, $k_i = |N_i|$. In a directed graph, we define two concepts: the in-degree (the number of links terminating at the node) and the out-degree (the number of links originating from it):
$$\text{in-degree}(i) = \big|\{ (j,i) \mid (j,i) \in E \}\big|, \qquad \text{out-degree}(i) = \big|\{ (i,j) \mid (i,j) \in E \}\big|.$$
We often consider statistics of the degree distribution $p_k$ (which gives the probability that a node has degree $k$), the average node degree being the most obvious such statistic. It can be easily calculated from the sum of degrees, which has the interesting property
$$\sum_{i \in N} k_i = 2|E|,$$
generally referred to as the handshake lemma.

The node-degree distribution provides a common characterization of a graph (though by no means a complete one). It is noteworthy, however, that although this distribution is frequently discussed, the concept is somewhat ill-defined. It can be directly measured for a real network, in which
case $p_k$ is the probability that a randomly selected node from the measured graph has degree $k$. However, it is often used in the context of a set of simulated graphs, where it means the probability that a node in the ensemble of networks has degree $k$. The difference is subtle, but it is worth keeping track of such discrepancies.

There are many other graph metrics. For instance, the distance² between two connected nodes in an unweighted graph is generally defined to be the number of edges in the shortest path connecting them. We can then examine quantities such as the average distance, and the diameter of the network (the maximum distance). And there are many other metrics: assortativity, clustering coefficient, centrality, and so on. They are all attempts to capture the nature of a graph in a small set of measures, and as such provide simpler, seemingly more intuitive ways to consider graphs. For other discussions of these, and comparisons in the context of Internet topologies, see [68, 79].

We must be wary though, as it should be clear that the potential for problems is immediate. No small set of numbers can truly represent graphs. For instance, consider the Hamiltonian cycle³ problem. The problem of determining if a network has such a path is well known to be NP-complete, and as such no small set of statistics of the graph will provide a characterization that is sufficient to consider this problem. Thus, these simple statistics must miss important properties of the network.

It may be useful to the reader to consider some of the tools that are available for working with graphs. They have different feature sets, but perhaps the most important is whether they are used in conjunction with a programming language, and which one, so we have listed a (no doubt incomplete) set below with some very basic information.

- MatlabBGL (http://www.stanford.edu/~dgleich/programs/matlab_bgl/): graph libraries for Matlab, using the Boost Graph Library (BGL) (http://www.boost.org/doc/libs/1_42_0/libs/graph/doc/index.html)
- igraph (http://igraph.sourceforge.net/): libraries for working with graphs in R or Python
- Graphviz (http://www.graphviz.org/): toolkit for visualization of graphs
- NetworkX (http://networkx.lanl.gov/): Python toolkit for working with graphs
- GDToolkit (http://www.dia.uniroma3.it/~gdt/gdt4/index.php): OO C++ library for handling and drawing graphs
- JUNG (http://jung.sourceforge.net/): Java universal network/graph framework
- IGen (http://informatique.umons.ac.be/networks/igen/): a toolkit for generating IP network topologies based on network design heuristics
² The graph distance has a long history. In mathematics, perhaps the best known example is the Erdős number, which is the distance of an author from Erdős in the co-authorship graph. In popular culture there is an equivalent: the Bacon number, the distance between actors in the graph of co-appearances.

³ A Hamiltonian cycle is a path (on the graph) that visits each node exactly once, and then returns to the start point.
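To make the preceding definitions concrete, the following minimal sketch (ours, not code from the chapter) uses NetworkX, one of the tools listed above, to compute node degrees, check the handshake lemma, and form the empirical degree distribution and distance metrics on a toy graph:

import networkx as nx
from collections import Counter

# Toy undirected graph standing in for a small measured topology.
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])

# Degree of each node: k_i = |N_i|.
degrees = dict(G.degree())

# Handshake lemma: the sum of degrees equals twice the number of edges.
assert sum(degrees.values()) == 2 * G.number_of_edges()

# Empirical degree distribution p_k over the measured graph.
counts = Counter(degrees.values())
p_k = {k: c / G.number_of_nodes() for k, c in counts.items()}

# Distance-based metrics (defined only for connected graphs).
print("degree distribution:", p_k)
print("average distance:", nx.average_shortest_path_length(G))
print("diameter:", nx.diameter(G))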
[Figure: a taxonomy of networks. One panel contrasts physical networks (e.g., neural networks; the Internet at layers 1-3: L1 physical, L2 links, L3 network) with virtual networks (e.g., science citations, social interactions, the Internet at layer 7: WWW, email, peer-to-peer, and online social networks such as Facebook and LinkedIn). A second panel contrasts transport networks (cars/buses/pedestrians; distribution of water and electricity) with information networks (social interactions, the Internet, WWW, email, etc.), each characterised by what they transport.]
Transport vs. information flows: A more subtle differentiator is what the network carries. Some networks physically transport some type of material (cars, water, ...) whereas the flows in other networks are (almost) pure information (the Internet, ...). The importance of this distinction for networks may be less immediately obvious, but it certainly does have implications. When physical transport is involved in a network, the constraints on that network are likely to be even more stringent, and the ability to change the network even more limited. Costs for changing the road network, for instance, are usually higher than changing the equivalent proportion of an IP network. Within this chapter, we are primarily interested in the Internet, and that includes both physical (OSI layer 1-3) networks and virtual networks (MPLS, WWW, online social networks, etc.). However, all of the
networks considered here are information transportation networks.

There are other dimensions along which networks could be classified. For instance, by the nature of the transport: does it come in discrete chunks (e.g., cars, packets, or the post) or continuously (e.g., water or electricity)? Is the transport connection-oriented (e.g., the telephone network) or packet-oriented (e.g., the Internet)? And there are other general issues we need to deal with: physical networks are embedded in geography, but logical networks often aren't, and yet the same terminology is often applied to each; and connectivity often changes over time, with the time-scale varying depending on the type of network.

The Internet is often said to be a network of networks. It is often hard to consider one network in isolation, as networks have relationships with each other, and the situation is even more complicated than often imagined:

- peers: Networks may be connected to peers, i.e., similar networks that may be competing or cooperating (or both in some cases), e.g., two ISPs operating in the same region.
- parents: Networks may have a parent-child relationship in the sense that one network controls the other, e.g., the SS7 network with respect to the traditional telephone network.
- layers: A single network may have multiple layers, each of which can be represented by a different graph, e.g., the physical vs. the network layers in the Internet.
- external: There is substantial interaction between notionally separate networks, e.g., the power grid and the Internet, both because the Internet uses electricity, but also because spikes in electricity demand could potentially be caused by network flash crowds (certainly TV programs have a very important impact on electricity usage).

That brings us naturally to the particular object of discussion here: the Internet (and its topology). The term Internet means (many) different things to (many) different people. Even within the networking community, the term is often used ambiguously, leading to misunderstandings and confusion and creating roadblocks for a genuinely scientific treatment of an engineered system that has revolutionized the way we live. While mathematics in the form of graph theory has been equally culpable in adopting the use of this vague nomenclature, the new science of networks has popularized it to the point where phrases like
"topology of the Internet" or "Internet graph" have entered the mainstream science literature, even though they are essentially meaningless without precisely-stated definitions. For one, "Internet topology" could refer to the connectivity structures encountered in any of the layers in the protocol stack, or at various levels of aggregation. Common examples are:

1. Router-level (layer 3): An often sought topology is the router level. Somewhat ambiguously, this may also be called the network level, or IP level, but "network" is a heavily overloaded term here, and the "IP level" can also be ambiguous. For instance, "IP level" could refer to the way IP addresses are connected, that is, it could refer to the interfaces of one router as separate nodes [19], but that is rarely what is useful for network operations or research. We could also add at layer 3, in addition to the interface-level topology described above, the subnet-level topology [19, 67, 81, 148, 149], describing the interconnectivity of logical subnets (often described by an IP-level prefix), but here we focus on the more commonly considered router level. The router-level graph shows a range of interesting implementation details of a network. This type of information is critical for network management applications, as much of Internet management rests at the IP layer, and it is of great importance for network adversaries. For instance, developing tools to measure network traffic requires an understanding of the router-level topology, in order to match traffic to links. Similarly, traffic engineering and reliability analyses are carried out at this level. One complication of this layer is that we sometimes wish to obtain the topology extending out to end-hosts, which are not technically routers, but we shall include these in our definition of router-level topology, unless otherwise specified.

2. Switch-level (layer 2): A single IP-layer logical link may hide several layer-2 devices (hubs and switches). The increasing prevalence of Ethernet, and the ability to provide redundancy at reasonable cost, has led to a proliferation of such devices, and most Local Area Networks (LANs) are based around them. Hence, very many networks which have trivial or simple IP-layer topologies have complex and interesting layer-2 topologies. Multi-Protocol Label Switching (MPLS) further complicates the situation by creating logical layer-2 networks without physical devices, often in the form of cliques. Measurements often see only one layer, creating misunderstandings of a network's true resilience and more general graph properties. For instance, layer-2 devices can connect large numbers of routers, making them appear to have higher degree at layer 3 [104] (for more detailed discussion see 3.2.3).

3. Physical-level (layer 1): Below the link layer (layer 2) lies the physical layer. Again, many physical devices may underlie a single logical link. Discovery of this layer is of critical importance in network management tasks associated with reliability. In particular, the concept of Shared Risk Link Groups (SRLG) requires knowledge of which links are carried on which fibers (using Wavelength Division Multiplexing), in which conduits. If a backhoe digs up a single conduit, it will cause a bundle of fibers to fail, and so connections that are in the same SRLG will all fail simultaneously. Clearly, redundant links need to be in different SRLGs, and discovery of the physical topology is important to ensure that this is the case.

4. PoP-level: A Point-of-Presence (PoP) is a loosely defined grouping of devices, often defined by a metropolitan area. PoP-level topologies are quite useful, essentially because these graphs describe the logical structure of the network as the designer intended, rather than its particular implementation in terms of individual routers. Such topologies are ideal for understanding tradeoffs between connectivity and redundancy, and also provide the most essential information to competitors or
customers (about where a network is based, or who has the best access network in a region). Network maps are often drawn at this level because it is an easy level for humans to comprehend.

5. Application layer: There has been significant interest in logical topologies of the application layer, e.g., for the Web (using HTTP and HTML), and for P2P applications.

6. AS-level: AS topologies have generated much interest in the scientific literature [5, 13, 161], because they appear to show interesting properties (such as power-laws) in common with other un-engineered networks such as biological networks. Also, much data on AS topologies is publicly available. While of interest in the scientific literature, this data's use is confused by many myths and misunderstandings [135]. The data may provide mild competitive benefits, in allowing operators to determine who peers with whom, but the measured data often comes without attributes that would make the data truly useful in this regard. Finally, it is hard to see how such data could be used in an attack, although much-publicized reports such as [161] suggest, incorrectly (see [93]), that the observed structure of the AS graph may lead to an "Achilles heel" of the Internet.

The number of possible topologies we might wish to discover highlights the complexity of this problem, and why discovery is so valuable for network management. In this chapter we will consider the router-level topology in detail, and then discuss some of the similarities and differences with respect to AS- and PoP-level topologies.

In addition to understanding the Internet network as a simple graph, there are many other features of the graph that one would also wish to know, for instance, its routing, link capacities, and geographic locations. We describe such qualities as graph attributes, and find that most can be attributed either to the edges of the graph, for instance link capacities, link length, routing weights (e.g., for shortest-path routing), link utilizations, link performance (for example, bit-error-rate, delay, loss, jitter, reordering, buffer utilization), link status (up/down), and a link's lower-layer properties (e.g., number of physical hops); or to the nodes of the graph: geographic location, type of node (e.g., brand of router, or version of software), performance measures (e.g., CPU utilization), and node status (up/down). We could further divide this list into intrinsic network properties, such as node location or link capacity (things that cannot change easily), and extrinsic properties, such as performance or traffic-related properties, which can change dramatically despite there being no change in the underlying network.
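As a hedged illustration of these graph attributes, the sketch below attaches intrinsic and extrinsic properties to nodes and edges of a NetworkX graph; the attribute names (city, role, capacity_gbps, igp_weight, utilization) are our own invented examples, not names defined in this chapter:

import networkx as nx

G = nx.Graph()

# Node attributes: intrinsic (location, router type) and extrinsic (status).
G.add_node("r1", city="Adelaide", role="backbone", up=True)
G.add_node("r2", city="Sydney", role="edge", up=True)

# Edge attributes: capacity and routing weight are intrinsic; utilization is
# extrinsic and can change with traffic despite an unchanged topology.
G.add_edge("r1", "r2", capacity_gbps=10, igp_weight=5, utilization=0.4)

# Shortest-path routing over the routing weights (as for link-state IGPs).
print(nx.shortest_path(G, "r1", "r2", weight="igp_weight"))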
3 Router-level topology
3.1 A look back
Since the early days of the ARPANET, networking researchers have been interested in drawing the types of networks they designed [71]. An early map of the ARPANET is reproduced in Figure 3 and shows the network's logical connectivity structure in April 1971, when the network as a whole consisted of some 15 nodes and a node was little more than one or two state-of-the-art computers connected via an Interface Message Processor or IMP (the modern equivalent would be a router). In this context, "logical structure" refers to a detailed drawing of the connectivity among those nodes and of node-specific details (e.g., type of machine), by and large ignoring geography. In contrast, "geographic structure" refers to a map of the US that includes the actual locations of the network's physical nodes and shows the physical connectivity among those nodes. Such accurate maps could be drawn because, at that time, each piece of equipment was expensive and needed to be accounted for, only a few groups with a small number of researchers were involved in the design and installation of the network, and the network changed relatively slowly.
Figure 3: The ARPANET in 1971 (reprinted from [25]; © 1990 ACM, Inc. Included here by permission.)
The network quickly grew in size and complexity. For instance, Figure 4 shows the geographic counterpart from 1984 of the ARPANET map depicted in Figure 3. Manually accounting for the increasing number of components quickly became prohibitive and motivated the adoption of automatic strategies for obtaining some of the available connectivity as well as traffic information. A prime example of effectively visualizing this collected information is reproduced from [55] and shown in Figure 5, which depicts a 3D rendering of the (US portion of the) NSFNET around 1991, annotated with traffic-related information. At that time, the NSFNET backbone consisted of some 14 nodes that were interconnected with T1 links as shown and, in turn, connected to a number of different campus networks (e.g., collections of interconnected LANs). However, even though the internal structure of the backbone nodes was well-known (i.e., each node was composed of nine IBM RTs linked by two token rings with an Ethernet interface to attached
Figure 4: The early ARPANET (reprinted from [25]; © 1990 ACM, Inc. Included here by permission.)
networks), nobody any longer had access to the internals of all the different campus networks, and as a result, drawing the 1991 NSFNET equivalent of the ARPANET's logical connectivity structure (Figure 3) was no longer possible. With the decommissioning of the NSFNET in 1995 and the rise of the "public Internet", the researchers' ability to obtain detailed connectivity and component information about the internals of the different networks that formed the emerging "network of networks" further diminished and generated renewed interest in the development of abstract, yet informed, models for router-topology evaluation and generation. For example, the Waxman model [155], a variation of the classical Erdős-Rényi random graph model [47], was the first popular topology generator commonly used for network simulation studies at the router level. However, it was largely abandoned in the late 1990s in favor of models that attempted to explicitly account for non-random structure as part of the network design. The arguments that favored structure over randomness were largely empirical in nature and reflected the fact that the inspection of real-world router-level ISP networks showed clear signs of non-random structure in the form of the presence of backbones, the appearance of hierarchical designs, and the importance of locality. These arguments also favored the notion that a topology generator should reflect the design principles in common use; e.g., to achieve some desired performance objectives, the physical networks must satisfy certain connectivity and redundancy requirements, properties which are generally not guaranteed in random network topologies. These principles were, for example, advanced in [23, 164, 165] and were ultimately integrated into the popular Georgia Tech Internetwork Topology Models (GT-ITM) [65]. These more structure-oriented router topology generators were viewed as the state of the art until around 2000 when, in turn, they were largely abandoned in favor of a new class of random graph models whose trademark was the ability to reproduce the newly discovered power-law relationship in the observed connectivity (i.e., node degree) of router-level graphs of the Internet. This discovery was originally reported
Figure 5: A visualization of the NSFNET circa 1991 (by Donna Cox and Robert Patterson, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign. See also http://en.wikipedia.org/wiki/File:NSFNET-traffic-visualization-1991.jpg).
in the seminal paper by Faloutsos et al. [49], who used a router-level graph constructed from data that was collected a few years earlier by Pansiot and Grad [119] for the purpose of obtaining some experimental data on the actual shape of multicast trees in the Internet. The Boston University Representative Internet Topology gEnerator (BRITE) [103] became a popular representative of this new class of models, in part also because it combined the more structure-oriented perspective of the GT-ITM generator with the new focus that emphasized the ability to reproduce certain metrics or statistics of measured router topologies (e.g., the node degree distribution).

One of the hallmarks of networks that have power-law degree distributions and that are generated according to any of a number of different probabilistic mechanisms (e.g., preferential attachment [13], random graphs with a given expected degree sequence [30], power-law random graphs [3]) is that they can be shown to have a few centrally located and highly connected hubs through which essentially most traffic must flow. When using these models to represent the router-level topology of the Internet, the presence of these highly connected central nodes has been touted the Internet's "Achilles heel" because network connectivity is highly vulnerable to attacks that target the high-degree hub nodes [5]. It has been similarly argued that these high-degree hubs are a primary reason for the epidemic spread of computer worms and viruses [112, 122]. Importantly, the presence of highly connected central nodes in a network having a power-law degree distribution is the essence of the so-called scale-free network models. They have been a highly popular theme in the study of complex networks, particularly among researchers inspired by statistical physics [4], and have fuelled the rise of a new scientific discipline that has become known as "Network Science" [12]. In the process, they have also seen widespread use among Internet topology researchers.

However, as the general fascination with and popularity of network science in general and scale-free network modeling in particular grew, so did the arguments voiced by Internet researchers questioning the appropriateness and relevance of the scale-free modeling approach for studying highly-engineered systems such as the Internet's router topology. In fact, around 2010, when the number of publications in the area of network science reached a new height, the number of papers published in the networking research literature that applied scale-free network models to describe or study router-level topologies of the Internet was close to zero. This begs the question "What happened?", and the answer provided in the next section is really a classic lesson in how errors of various forms occur and can add up to produce results and claims that create excitement among non-networking researchers, but quickly collapse when scrutinized with real data or examined by domain experts.
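The vulnerability argument dissected below can be reproduced in a few lines; the following sketch is our illustration, not code from the papers in question. It generates a preferential-attachment graph and removes its highest-degree hubs to show how quickly such a model fragments under a targeted attack:

import networkx as nx

# A scale-free network of the preferential attachment type [13].
G = nx.barabasi_albert_graph(n=1000, m=2, seed=42)

# A targeted attack: remove the ten highest-degree hubs.
hubs = sorted(G.degree(), key=lambda nd: nd[1], reverse=True)[:10]
G.remove_nodes_from(node for node, _ in hubs)

# The giant component shrinks far more than under random node removal,
# which is precisely the claimed "Achilles heel" of such models.
giant = max(nx.connected_components(G), key=len)
print(f"largest component: {len(giant)} of {G.number_of_nodes()} nodes")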
(i) "On routes and multicast trees in the Internet" by J.-J. Pansiot and D. Grad (1998) [119] described the original measurement experiment that was performed in mid-1995 and produced data on actual routes taken by packets in the Internet. This data was subsequently used to construct a router graph of the Internet.

(ii) "On power-law relationships of the Internet topology" by M. Faloutsos et al. (1999) [49] reported (among other observations) on the observed power-law relationship in the connectivity of the router-level topology of the Internet measured by Pansiot and Grad [119].

(iii) "Error and attack tolerance of complex networks" by R. Albert et al. (2000) [5] proposed a scale-free network model to describe the router topology of the Internet and argued for its validity on the basis of the latest findings by Faloutsos et al. [49]. It touted the new model's exemplary predictive power by reporting on the discovery of a fundamental weakness of the Internet (a property that became known as the Internet's "Achilles heel") that had apparently gone unnoticed by the engineers and researchers who have designed, deployed, and studied this large-scale, critical infrastructure, but followed directly from the newly proposed scale-free modeling approach.
Figure 6: A toy example of a scale-free network of the preferential attachment type (b) generated to match a power-law type node degree distribution (a). (First published in Notices of the American Mathematical Society, Volume 56, No.3 (May 2009): 586-599 [156]. Included here by permission.)
At first glance, the combination of these three papers appears to show network modeling at its best: firmly based on experimental data, following modeling practices steeped in tradition, and discovering surprising and previously unknown properties of the modeled network. An example of a toy network resulting from taking the findings of these seminal papers at face value is shown in Figure 6. However, one of the beauties of studying man-made systems such as the Internet is that because of their highly-engineered architectures, a thorough understanding of their component technologies, and the availability of extensive (but not necessarily very accurate) measurement capabilities, they provide a unique setting in which most claims about their properties, structure, and functionality can be unambiguously resolved, though perhaps not without substantial effort. In the remainder of this section, we will illustrate how, in the context of the Internet's router topology, applying readily available domain knowledge in the form of original design principles, existing technological constraints, and available measurement methodologies reveals a drastically different picture from that painted in these three seminal papers. In fact, we will
expose the specious nature of scale-free network models that may appeal to more mathematically inclined researchers because of their simplicity or generality, but besides having no bearing on the Internet's router topology also result in wrong claims about the Internet as a whole.

3.2.2 A first sanity check: Using publicly available information

A first indication of apparent inconsistencies between the proposed scale-free models for the Internet's router topology and the actual Internet comes from the inspection of the router topologies of actual networks that make the details of their network internals publicly available. For example, networks such as Internet2 [77] or GÉANT [57] show no evidence that there exist any centrally located and highly connected hubs through which essentially most traffic must flow. Instead, what they typically show is the presence of a more or less pronounced backbone network that is fed by tree-like access networks, with additional connections at various places to provide a degree of redundancy and robustness to component failures⁴. This design pattern is fully consistent with even just a cursory reading of the most recent product catalogs or white papers published by the main router vendors [32, 33, 80]. For one, the most expensive and fastest or highest-capacity pieces of equipment are explicitly marketed as backbone routers. Moreover, due to inherent technological limitations in how many packets or bytes a router can handle in a given time interval, even the latest models of backbone routers can support only a small number of very high-bandwidth connections, typically to connect to other backbone routers. At the same time, a wide range of cheaper, slower, or lower-capacity products are offered by the different router vendors and are targeted primarily at supporting network access. On the access side, a typical router will have many lower-bandwidth connections for the purpose of aggregating customer traffic from the network's edge and subsequently forwarding that traffic towards the backbone. In short, even the latest models advertised by today's router vendors are limited by existing technologies, and even for the top-of-the-line backbone routers, it is technologically infeasible to have hundreds or thousands of high-bandwidth connections. At the same time, while technically feasible, deploying some of the most expensive equipment and configuring it to support hundreds or thousands of low-bandwidth connections would be considered an overall bad engineering decision (e.g., excessively costly, highly inefficient, and causing serious bottlenecks in the network).

However, the root cause of these outward signs of a clear mismatch between the modeled and actual router topology of the Internet goes deeper and lies in the original design philosophy of the Internet. As detailed in [34], while the top-level goal for the original DARPA Internet architecture was to develop an "effective technique for multiplexed utilization of existing interconnected networks", the requirement that "Internet communication must continue despite loss of networks or gateways" topped the list of second-level goals. To survive in the face of components failing, the architecture was to mask completely any transient failure, and to achieve this goal, state information which describes an existing connection must be protected.
To this end, the architecture adopted the "fate-sharing" model that gathers this state information at the endpoints of connections, at the entities that are utilizing the service of the network. Under this model, it is acceptable to lose the state information associated with an entity if, at the same time, the entity itself is lost; that is, there no longer exists any physical path over which any sort of communication with that entity can be achieved (i.e., total partition). Ironically, these original design principles outlined in [34] favor precisely the opposite of what the scale-free modeling approach yields: no centrally located and highly connected hubs, because their removal makes partitioning the network easy.
⁴ This is not a universal phenomenon. For instance, [84] notes that some networks do exhibit hub-like structure, but it is the lack of universality that is important here, as exhibited by these and other counter-examples.
3.2.3 An in-depth look at traceroute: Examining a popular measurement technique

While the above-mentioned empirical, technological, and architectural arguments cast some serious doubts on the scale-free network modeling approach for the router topology of the Internet, they say nothing about the measurements that form the basis of this approach and have given it a sense of legitimacy among scientists in general and networking researchers in particular. To appreciate the full role that measurements play in this discussion, it is informative to revisit the original paper by Pansiot and Grad [119] that describes the measurement experiment, discusses the measurement technique used, and provides a detailed account of the quality of the data that form the basis of the scale-free approach towards modeling the Internet's router topology. In essence, [119] describes the first (at that time) large-scale traceroute campaign performed for the main purpose of constructing a router graph of the Internet from actual Internet routes. Although traceroute-based, the authors of [119] quickly point out that their purpose in using the traceroute tool (i.e., obtaining actual Internet routes to construct a router graph) differed from what V. Jacobson [78] had in mind when he originally designed the tool (i.e., tracing a route from a source to a destination for diagnostic purposes). As a result, a number of serious issues arise that highlight why using the traceroute technique for the purpose of constructing a router graph is little more than an "engineering hack" and can certainly not be called a well-understood "measurement methodology."

IP alias resolution problem: One serious problem, explained in detail in [119], with using traceroute-based data for constructing router graphs is that the traceroute tool only returns the IP addresses of the interface cards of the routers that the probe packets encountered on their route from the source to the destination. However, most routers have many interface cards, and despite many years of research efforts that have produced a series of increasingly sophisticated heuristics [15, 67, 140], the networking community still lacks rigorous and accurate methods for resolving the IP alias resolution problem; that is, determining whether two different interface IP addresses belong to or can be mapped to the same router. While the essence of this problem is illustrated in Figure 7, the impact it can have when trying to map a router topology of an actual network is shown in Figure 8.
Figure 7: The IP alias resolution problem. Paraphrasing Fig. 4 of [144], traceroute does not list routers (boxes) along paths but IP addresses of input interfaces (circles), and alias resolution refers to the correct mapping of interfaces to routers to reveal the actual topology. In the case where interfaces 1 and 2 are aliases, (b) depicts the actual topology while (a) yields an inflated topology with more routers and links. (First published in Notices of the American Mathematical Society, Volume 56, No.3 (May 2009): 586-599 [156]. Included here by permission.)
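The inflation in Figure 7 is easy to quantify with a toy example. In this sketch (ours, mirroring the figure, with invented interface names), one middle router observed on two traceroute paths appears as two distinct "routers" when its interface addresses are not merged:

import networkx as nx

# Ground truth: paths s1->R->d and s2->R->d share one middle router R.
actual = nx.Graph([("s1", "R"), ("s2", "R"), ("R", "d")])

# Unresolved aliases: each path sees a different input interface of R,
# so the inferred graph contains two "routers", R_if1 and R_if2.
inferred = nx.Graph([("s1", "R_if1"), ("R_if1", "d"),
                     ("s2", "R_if2"), ("R_if2", "d")])

print("actual:  ", actual.number_of_nodes(), "nodes,", actual.number_of_edges(), "links")
print("inferred:", inferred.number_of_nodes(), "nodes,", inferred.number_of_edges(), "links")
# Alias resolution would merge R_if1 and R_if2 back into the single router R;
# note that R's degree (3) is also misreported (2 and 2) without it.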
[Plot residue from Figure 8: a histogram titled "Actual vs Inferred Node Degrees", with x-axis "Node Degree" and y-axis "Count", comparing the "actual" and "inferred" degree distributions.]
Figure 8: The IP alias resolution problem in practice. Shown is a comparison between the Abilene/Internet2 topology inferred by Rocketfuel (left) and the actual topology (top right). Rectangles represent routers with interior ovals denoting interfaces. The histograms of the corresponding node degrees are shown in the bottom right plot. (Reprinted from [141]; © 2008 ACM, Inc. Included here by permission.)
Lesson 1: Due to the absence of accurate and rigorous methods for solving the IP alias resolution problem, the actual values of the connectivity of each router (i.e., node degrees) inferred from traceroute measurements cannot be taken at face value.

Opaque layer-2 clouds: Another serious issue with using generic traceroute-based measurements for constructing router graphs is also discussed at length in [119] and illustrated in Figure 9. Being strictly limited to IP or layer 3, the problem with traceroute is that it is incapable of tracing through opaque layer-2 clouds that feature circuit technologies such as Asynchronous Transfer Mode (ATM) or Multiprotocol Label Switching (MPLS). These technologies have the explicit and intended purpose of hiding the network's physical infrastructure from IP, so from the perspective of traceroute, a network that runs these technologies will appear to provide direct connectivity between routers that are separated by local, regional, national, or even global physical network infrastructures. An example of using traceroute to map a network that uses MPLS is depicted in Figure 9 and shows an essentially completely connected graph at layer 3 with multiple high-degree nodes, even though the physical router topology is very sparse. Similarly, if traceroute encounters an ATM cloud, it falsely discovers a high-degree node that is really a logical entity (often an entire network potentially spanning many hosts or great distances) rather than a physical node of the Internet's router-level topology. Donnet et al. [44] found that at least 30% of the paths they tested traversed an MPLS tunnel. With recent extensions of the ICMP protocol, using traceroute to trace through opaque MPLS clouds has become technically feasible [16], but operators often configure their routers to hide the MPLS tunnels by turning off this option [142]. Even then it may be possible to detect the MPLS tunnels [44], but the inference techniques come with no guarantees, and are quite particular to MPLS, which is not the only technique for creating tunnels, so there may still be some opaque networks to deal with. More to the point, even where such inferences are possible, most data sets do not contain this type of analysis, and most subsequent analyses of the data have ignored the issue.

Lesson 2: Due to the inability of the generic traceroute technique to trace through opaque layer-2 clouds, or to understand the connectivity created by layer-2 devices [104], the inferred high-degree nodes (i.e., routers with a large number of connections) are typically fictitious, an artifact of an imperfect measurement tool.

Limited vantage points: We have commented earlier that since a router is fundamentally limited in terms of the number of packets it can process in any time interval, there is an inherent tradeoff in router configuration: it can support either a few high-throughput connections or many low-throughput connections. Thus, for any given router technology, a high-connectivity router in the core reflects a poor design decision: it will either have poor performance due to its slow connections or be prohibitively expensive relative to other options. Conversely, a good design choice is to deploy cheap high-degree routers near the edge of the network and rely on the very technology that supports easy multiplexing of a large number of relatively low-bandwidth links.
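This bandwidth-degree tradeoff can be made concrete with a small feasibility check; the sketch and its numbers below are ours, invented for illustration, and do not come from any vendor catalog:

def feasible(degree, per_link_gbps, switch_capacity_gbps):
    """A router configuration is feasible only if the aggregate bandwidth
    of its connections stays within the router's switching capacity."""
    return degree * per_link_gbps <= switch_capacity_gbps

# A backbone router with a few high-bandwidth links is feasible ...
print(feasible(degree=4, per_link_gbps=100, switch_capacity_gbps=640))    # True
# ... but a high-degree hub at the same link speed is not ...
print(feasible(degree=200, per_link_gbps=100, switch_capacity_gbps=640))  # False
# ... while many low-bandwidth links work, i.e., aggregation at the edge.
print(feasible(degree=200, per_link_gbps=1, switch_capacity_gbps=640))    # True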
Unfortunately, neither the original traceroute-based study of Pansiot and Grad [119] nor any of the larger-scale campaigns that were subsequently performed by various network research groups have the ability to detect those actual high-degree nodes. The simple reason is that these campaigns lack access to a sufficient number of vantage points (i.e., sources for launching traceroute probes and targets) in any local end-system to reveal these actual high-connectivity patterns at the network's edge.

Lesson 3: If there were high-degree nodes in the network, existing router technology relegates them to the edge of the network where no generic traceroute-based measurement campaign is able to detect them because of a lack of vantage points nearby.

There are other issues with large-scale traceroute campaigns that impact the quality of the resulting measurements and have received some attention in the literature. For example, the use of traceroute has been shown to make experimental data susceptible to a type of measurement bias in which some nodes
Figure 9: How traceroute detects fictitious high-degree nodes in the network core. (a) The actual connectivity of an opaque layer-2 cloud, i.e., a router-level network running a technology such as ATM or MPLS (left), and the connectivity inferred by traceroute probes entering the network at the marked router (right). (b) The Rocketfuel-inferred backbone topology of AS3356 (Level3), a Tier-1 Internet service provider and leader in the deployment of MPLS. (Figure (b) reprinted from [144]; © 2002 ACM, Inc. Included here by permission.)
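The effect in Figure 9(a) can also be sketched in a few lines. In this toy illustration of ours (not code from the chapter), six routers each attach once to an opaque MPLS/ATM cloud, yet the layer-3 view inferred by traceroute is a full mesh of fictitious direct links:

import itertools
import networkx as nx

routers = [f"r{i}" for i in range(1, 7)]

# Actual connectivity: a sparse topology, one link per router into the cloud.
actual = nx.Graph([("cloud", r) for r in routers])

# Inferred layer-3 view: the cloud is invisible to traceroute, so every
# pair of routers appears directly connected.
inferred = nx.Graph(list(itertools.combinations(routers, 2)))

print("actual router degrees:  ", [d for _, d in actual.degree(routers)])   # all 1
print("inferred router degrees:", [d for _, d in inferred.degree(routers)]) # all 5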
of the network are oversampled, while others are undersampled. However, while this feature has received considerable attention [1, 91], in the presence of systematic errors due to an inability to perform accurate IP alias resolution or trace through opaque layer-2 clouds, this work is largely of theoretical interest and of little practical relevance for modeling the Internet's router topology.

3.2.4 Just the facts: power-law scaling and router-level topologies

When applying Lessons 1-3 to the main findings reported in the seminal papers discussed in 3.2.1, we are faced with the following facts:

Fact 1: A very typical but largely ignored fact about Internet-related measurements in general and traceroute measurements in particular is that what we can measure in an Internet-like environment is generally not the same as what we really want to measure (or what we think we actually measure). This is mainly because, as a decentralized and distributed system, the Internet lacks a central authority and does not support third-party measurements.

Fact 2: A particularly ironic fact about traceroute is that the high-degree nodes it detects in the network core are necessarily fictitious and represent entire opaque layer-2 clouds, and if there are actual high-degree nodes in the network, existing technology relegates them to the edge of the network where no generic traceroute-based measurement experiment will ever detect them.

Fact 3: In particular, due to the inherent inability of traceroute to (i) reveal unambiguously the actual connectivity (i.e., node degree) of any router, and (ii) correctly identify even the mere absence or presence of high-degree nodes (let alone their actual values), statistical statements such as those made in [49] claiming that the Internet's router connectivity is well described by a power-law distribution (or, for that matter, any other type of distribution) cannot be justified with any reasonable degree of statistical confidence.

Fact 4: Since historical traceroute-based measurements cannot be taken at face value when (mis)using them for inferring router topologies, and the inference results obtained from such data cannot be trusted, the claims that have been made about the (router-level) Internet in [5] are without substance and collapse under careful scrutiny.

In short, after almost 15 years of examining the idiosyncrasies of the traceroute tool, there exists overwhelming evidence that the sort of generic and raw traceroute measurements that have been used to date to infer the Internet's router topology are seriously flawed to the point of being essentially of no use for performing scientifically sound inferences. Yet the myth that started with [49], i.e., that the router topology of the Internet exhibits power-law degree distributions, persists and continues to be especially popular with researchers who typically work in the field of network science and show in general little interest in domain-specific "details" such as traceroute's idiosyncrasies. At the same time, it is worthwhile pointing out that most of the above-mentioned flaws and shortcomings of traceroute-based measurements are neither new nor controversial among networking researchers. In fact, when discussing the use of the traceroute tool as part of their original measurement experiment, the authors of [119] described many of the issues discussed in this section in great detail and commented on the possible implications that these inherently traceroute-related issues can have for constructing router graphs of the Internet.
In this sense, [119] is an early example of an exemplary measurement paper, but unfortunately, it has been largely ignored and essentially forgotten. For one, [49], which critically relies on the data described in [119] for its power-law claim for the Internet's router topology, fails to recognize the
relevance of these issues and does not even comment on them. Moreover, the majority of papers that have appeared in this area after the publication of [49] typically cite only [49] and don't even mention [119]. Traceroute-based measurements are not the only approach for obtaining router-level topologies, just the one most commonly presented in the research literature. Network operators can obtain measurements of their own networks using much more accurate methods: for instance, from configuration files [52], or using route monitors [137]. However, those techniques require privileged access to the network, and so haven't been used widely for research. More recently, the mrinfo tool [109] has been used to measure topologies using IGMP (the Internet Group Management Protocol) [105, 120]. IGMP has the advantage that routers that respond provide much more complete information on their interfaces than those responding to traceroute (so aliasing is less of an issue), but there are still coverage problems created by lack of support for, or deliberate filtering or rate limiting of responses to, the protocol [102].
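To make Fact 2 concrete, the following toy simulation (a hypothetical topology, not real measurement data; a minimal sketch in Python with networkx) mimics how traceroute-style inference sees a set of routers that are in reality interconnected only through an opaque layer-2 cloud: because the switches never appear as IP hops, every pair of routers appears directly connected, and the inferred graph contains fictitious high-degree nodes reminiscent of Figure 9(a).

```python
import itertools
import networkx as nx

# "Ground truth": 12 routers, each attached only to a layer-2 switch fabric
# (e.g., an MPLS/ATM cloud); no two routers share a layer-3 visible link.
routers = [f"r{i}" for i in range(12)]
true_g = nx.Graph()
for r in routers:
    true_g.add_edge(r, "sw0")  # layer-2 attachment, invisible to traceroute

def ip_hops(g, src, dst):
    # Switches do not decrement TTL, so the traceroute-style hop list
    # contains only the routers on the path.
    return [h for h in nx.shortest_path(g, src, dst) if not h.startswith("sw")]

# Inference step: add an edge between every pair of consecutive IP hops.
inferred = nx.Graph()
for src, dst in itertools.combinations(routers, 2):
    hops = ip_hops(true_g, src, dst)
    inferred.add_edges_from(zip(hops, hops[1:]))

print("max true router degree:", max(d for _, d in true_g.degree(routers)))  # 1
print("max inferred router degree:", max(d for _, d in inferred.degree()))   # 11
```

Every router in the true topology has degree 1, yet the inferred graph is a clique in which each router appears to have degree 11; the high-degree "hub" is an artifact of the measurement, not a property of the network.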
technology constraints are a significant force shaping network connectivity at the router level and, in turn, router topology design. Due to hard physical limits, even the most expensive and highest-capacity router models available on the market in any given year operate within a "feasible region" and corresponding "efficiency frontier" of possible bandwidth-degree combinations; that is, they can be configured either to have only a few high-bandwidth connections and perform at their capacity, or to have many low-bandwidth connections and tolerate a performance hit due to the overhead that results from the increased connectivity. Similarly, economic considerations also affect network connectivity and router topology design. For example, the cost of installing and operating physical links in a network can often dominate the cost of the overall router infrastructure. In essence, this observation creates enormous practical incentives to design the physical plant of an ISP so as to keep the number of links small and to avoid long-haul connections whenever possible due to their high cost. These incentives to share costs via multiplexing impact and are impacted by available router technologies and argue for a design principle for an ISP's router topology that favors aggregating traffic at all levels of the network hierarchy, from its periphery all the way to its core.

The third and final key ingredient of the proposed first-principles alternative to router topology modeling is concerned with the role that randomness plays in this approach. Recall that the traditional approach is typically graph theory-based, where randomness is explicit and appears in the form of a series of coin tosses (using potentially biased coins, as in the case of scale-free networks of the preferential attachment type) that determine whether or not two nodes (i.e., routers) are connected by a physical link, irrespective of the type of routers involved or link considered. In stark contrast, in our approach, randomness enters in a very different and less explicit manner, namely in terms of the uncertainty that exists about the environment (i.e., the traffic demand that the network is expected to carry). Moreover, irrespective of the model chosen for quantifying this uncertainty, the resulting network design is expected to exhibit strong robustness properties with respect to changes in this environment.

When combining all three ingredients to formulate an ISP's router topology design problem, the mathematical modeling language that naturally reflects the objectives of an ISP, its need to adhere to existing technology constraints and respect economic considerations, and its desire to operate effectively and efficiently in light of the uncertainty in the environment is constrained optimization. Thus, we have changed network modeling from an exercise in model fitting into an exercise in reverse-engineering, and we seek a solution to a constrained optimization problem formulation that captures by and large what the ISP can afford to build, operate, and manage (i.e., economic considerations), satisfies the hard constraints that technology imposes on the network's physical entities (i.e., routers and links), and is robust to changes in the expected traffic that it is supposed to handle.

3.3.2 Heuristically optimal router topologies

In the process of formulating the design of an ISP's router topology as a constrained optimization problem, we alluded to a synergy that exists between the technological and economic design issues with respect to the network core and the network edge.
The all-important objective to multiplex traffic is supported by the types of routers available on the market. In turn, the use of these products reinforces traffic aggregation everywhere in the network. Thus, the trade-offs that an ISP has to make between what is technologically feasible and what is economically sensible can be expected to yield router topologies in which individual link capacities tend to increase while the degree of connectivity tends to decrease as one moves from the network edge to its core. This consistent picture with regard to the forces that by and large govern the build-out and provisioning of an ISP's router topology and include aspects such as equipment constraints, link costs, and bandwidth
demands suggests that the following type of topology is a reasonably good design for a single ISP's physical plant: (i) Construct a core as a loose mesh of expensive, high-capacity, low-connectivity routers that carry heavily aggregated traffic over high-bandwidth links. (ii) Support this mesh-like core with hierarchical tree-like structures at the edge of the network for the purpose of aggregating traffic from end users via cheaper, lower-capacity, high-connectivity routers. (iii) Augment the resulting structure with additional connections at various selectively chosen places to provide a degree of redundancy and robustness to component failures. The result is a topology that has a more or less pronounced backbone, which is fed by tree-like access networks, with additional links added for redundancy and resilience. We refer to this design as heuristically optimal to reflect its consistency with real design considerations and call the resulting "solutions" heuristically optimal topologies, or HOT for short. Note that such HOT models have been discussed earlier in the context of highly organized/optimized tolerances/tradeoffs [24, 48].

An important aspect of the proposed HOT models is that even though we have formulated the design of an ISP's router topology as a constrained optimization problem that could in principle be solved optimally, we are typically not concerned with a network design that is optimal in a strictly mathematical sense (a problem that is also likely to be NP-hard). Instead, our interest is in solutions that are heuristically optimal in the sense that they result in good performance. The main reason for not pursuing optimal solutions more aggressively is the imprecise nature of essentially all ingredients of the constrained optimization problem of interest. For one, it is unrealistic to expect that an ISP's true objective for building out and provisioning its physical infrastructure can be fully expressed in mathematical terms as an objective function. Furthermore, a bewildering number of different types of routers and connections makes it practically impossible to account for the nuances of the relevant feasible regions or efficiency frontiers. Finally, any stochastic model for describing the expected traffic demand is an approximation of reality, or at best based on imprecise forecasts. Given this approximate nature of the underlying constrained optimization problem, we seek solutions that capture by and large what the ISP can afford to build, operate, and manage (i.e., economic considerations), satisfy some of the more critical hard constraints that technology imposes on the network's physical entities (i.e., routers and links), and exhibit strong robustness properties with respect to fluctuations in the expected traffic demands (i.e., insensitivity to changes in the uncertain environment).

3.3.3 A toy example of a HOT router topology

To illustrate the proposed HOT approach, we use a toy example that is rich enough to highlight the key ingredients of the outlined first-principles methodology and to demonstrate its relevance for router topology modeling as compared to the popular model-fitting approach. Its toy nature is mainly due to a number of simplifying assumptions we make that facilitate the problem formulation. For one, by simply equating throughput with revenues, we select as our objective function the maximum throughput that the network can achieve for a given traffic demand and use it as a metric for quantifying the performance of our solutions.
Second, considering an arbitrary distribution of end-user traffic demands $B_i$, we assume a gravity model for the unknown traffic demand; that is, assuming shortest-path routing, the demands are given by the traffic matrix elements $x_{i,j} = \alpha B_i B_j$ between routers $i$ and $j$, for some constant $\alpha$. Lastly, we consider only one type of router and its associated technologically feasible region, that is, the (router degree, router capacity)-pairs that are achievable with the considered router type, and we implicitly avoid long-haul connections due to their high cost. The resulting constrained optimization problem can be written in the form

\[
\max_{\alpha} \sum_{i,j} x_{i,j} \quad \text{such that} \quad A X \le C, \tag{1}
\]

where $X$ is the vector obtained by stacking all the demands $x_{i,j}$; $A$ is the routing matrix obtained by using standard shortest-path routing and defined by $A_{k,l} = 1$ or $0$, depending on whether or not demand $l$ passes through router $k$; and $C$ is the vector consisting of the router degree-bandwidth constraints imposed by the technologically feasible region of the router at hand.
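Under the gravity assumption, the demand vector is fixed up to the scale factor $\alpha$, so Eq. (1) collapses to a one-dimensional problem: $\alpha$ can grow until the first router hits its capacity, i.e., $\alpha^* = \min_k C_k / L_k$, where $L_k$ is the aggregate gravity load routed through router $k$. The following minimal sketch (Python with networkx; the topology, demands, and capacities are all made up for illustration) computes this heuristically optimal throughput for a small HOT-like graph.

```python
import itertools
import networkx as nx

g = nx.Graph()
# Toy HOT-like topology: a small mesh core (c0..c2) with tree-like edges.
core = ["c0", "c1", "c2"]
g.add_edges_from([("c0", "c1"), ("c1", "c2"), ("c0", "c2")])
edge_routers = {"e0": "c0", "e1": "c1", "e2": "c2", "e3": "c0"}
for e, c in edge_routers.items():
    g.add_edge(e, c)

# Hypothetical end-user demand per edge router and per-router capacities.
B = {"e0": 4.0, "e1": 2.0, "e2": 1.0, "e3": 1.0}
cap = {r: (100.0 if r in core else 10.0) for r in g}

# Aggregate gravity load through each router (endpoints included) under
# shortest-path routing; demands flow only between edge routers here.
load = {r: 0.0 for r in g}
for i, j in itertools.combinations(B, 2):
    for r in nx.shortest_path(g, i, j):
        load[r] += B[i] * B[j]

# Largest feasible scale factor alpha* and the resulting throughput Perf(g).
alpha = min(cap[r] / load[r] for r in g if load[r] > 0)
perf = alpha * sum(B[i] * B[j] for i, j in itertools.combinations(B, 2))
print(f"alpha* = {alpha:.3f}, Perf(g) = {perf:.2f}")
```

Note that with a single router type, the binding constraint of Eq. (1) is the router's degree-bandwidth tradeoff; the sketch abstracts it into a single capacity number per router.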
Figure 10: Generating networks using constrained optimization. (a) Engineers view network structure as the solution to a design problem that measures performance in terms of the ability to satisfy traffic demand while adhering to node and arc capacity constraints. (b) A network resulting from heuristically optimized tradeoffs (HOT). This network has very different structural and behavioral properties, even when it has the same number of nodes, links, and degree distribution as a scale-free network shown in Figure 9. (First published in Notices of the American Mathematical Society, Volume 56, No. 5 (May 2009): 586-599 [156]. Included here by permission.)
While all the simplifying assumptions can easily be relaxed to allow for more realistic objective functions, more heterogeneity in the constraints, or more accurate descriptions of the uncertainty in the environment, Figure 10 illustrates the key characteristics inherent in a heuristically optimal solution of such a problem. First, the cost-effective handling of end-user demands avoids long-haul connections (due to their high cost) and is achieved through traffic aggregation starting at the edge of the network via the use of high-degree routers that support the multiplexing of many low-bandwidth connections. Second, this aggregated traffic is then sent toward the backbone that consists of the fastest or highest-capacity routers (i.e., having a small number of very high-bandwidth connections) and that forms the network's mesh-like core. The result is a network of the form described earlier: a more or less explicit backbone representing the network core and tree-like access networks surrounding this core, with additional connections as backup in case of failures or congestion. The realism of this reverse-engineering approach to router topology modeling is demonstrated in Figure 11, which shows the router topologies of two actual networks, CENIC (circa 2004) and Abilene (circa 2003).

3.3.4 On the (ir)relevance of node degree distributions

The above description of our engineering-based first-principles approach to router topology modeling shows that node degree distributions in general and power law-type node degree distributions in particular
Figure 11: CENIC and Abilene networks. (Left): CENIC backbone. The CENIC backbone is comprised of two backbone networks in parallel: a high-performance (HPR) network supporting the University of California system and other universities, and the digital California (DC) network supporting K-12 educational initiatives and local governments. Connectivity within each POP is provided by Layer-2 technologies, and connectivity to the network edge is not shown. (Right): Abilene network. Each node represents a router, and each link represents a physical connection between Abilene and another network. End-user networks are represented in white, while peer networks (other backbones and exchange points) are represented in gray. Each router has only a few high-bandwidth connections; however, each physical connection can support many virtual connections that give the appearance of greater connectivity to higher levels of the Internet protocol stack. ESnet and GÉANT are other backbone networks. (Reprinted from [93]; © 2004 ACM, Inc. Included here by permission.)
are clearly a non-issue and play no role whatsoever in our formulation of an ISP router topology design as a constrained optimization problem. Thus, we have achieved our goal of developing a network modeling approach that does not rely in any way on the type of measurements that have informed previous network modeling approaches but have been shown earlier to be of insufficient quality to be trusted to form the basis of any scientifically rigorous modeling pursuit. However, even if the available traceroute measurements could be trusted and taken at face value, the popular approach to network modeling that views it as an exercise in model fitting is by itself seriously flawed, unless it is accompanied by a rigorous validation effort. For example, assume that the data can be trusted, so that a statistic like an inferred node degree distribution is indeed solid and reliable. In this case, who is to say that a proposed model's ability to match this or any other commonly considered statistic of the data argues for its validity? Yet this is in essence the argument advanced by traditional approaches that treat network modeling as an exercise in model fitting. It is well known in the mathematics literature that there can be many different graph realizations for any particular node degree sequence, and there are often significant structural differences between graphs having the same degree sequence. Thus, two models that match the data equally well with respect to some statistics can still be radically different in terms of other properties, their structures, or their functionality. A clear sign of the rather precarious current state of network-related modeling that is rooted in the almost exclusive focus on model fitting is that the same underlying data set can give rise to very different, but apparently equally good, models, which in turn can give rise to completely opposite scientific claims and theories concerning one and the same observed phenomenon. Clearly, network modeling and especially model validation ought to mean more than being able to match the data if we want to be confident that the results we derive from our models are relevant in practice.

To illustrate these points, Figure 12 depicts five representative toy networks, constructed explicitly to have one and the same node degree distribution. This distribution is shown in plot (a) and happens to be that of our HOT router topology example in Figure 10. While plots (b) and (c) show two scale-free networks constructed according to the preferential attachment method and the general random graph (GRG) method, respectively, plots (d)-(f) are three different HOT examples, including our earlier example in Figure 10 (plot (e)) and a sub-optimal or poorly-engineered HOT topology in (f). While the differences among these five topologies with identical node degree distributions are already apparent when comparing their connectivity structures, they can be further highlighted by considering both a performance-related and a connectivity-only topology metric. In particular, the performance-related metric Perf(g) for a given network $g$ is defined as $\mathrm{Perf}(g) = \max \sum_{i,j} x_{i,j}$ s.t. $R X \le C$, and represents the maximum throughput with gravity flows of the network $g$. In contrast, the connectivity-only topology metric S(g) is the network likelihood of $g$, defined as $S(g) = \sum_{(i,j) \in E(g)} d_i d_j / s_{\max}$, where $d_i$ denotes the degree of node $i$, $E(g)$ is the set of all edges in $g$, and $s_{\max}$ is a normalization constant.
For a justification of using the S(g) metric to differentiate between random networks having one and the same node degree sequence, we refer to [92]. While computing each of the five networks' Perf(g)-values is straightforward, evaluating their network performance requires further care to ensure that the different networks have the same total "cost", where cost is measured in the number of routers. When simultaneously plotting network performance versus network likelihood for all five network models in Figure 13, a striking contrast is observed. The well-engineered HOT networks (d) and (e) have high performance and low likelihood, while the random degree-based networks (b) and (c) have high likelihood but low performance. In contrast, network (f) has both low performance and low likelihood and is proof that networks can be designed to have poor performance. The main reason for the degree-based models' poor performance is precisely the presence of the highly connected hubs, which create low-bandwidth bottlenecks. The two HOT models' mesh-like cores, like real ISP router topologies, aggregate traffic and disperse it across multiple high-bandwidth routers.
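The grey dots in Figure 13 are produced by degree-preserving rewiring, which explores the space of graphs sharing one degree sequence. A minimal sketch of this procedure and of the (unnormalized) S(g) computation is given below; the starting graph and the use of the maximum observed S-value as a stand-in for the normalization constant $s_{\max}$ are illustrative assumptions (see [92] for the proper definition).

```python
import random
import networkx as nx

def s_metric(g):
    # Unnormalized S(g): sum of d_i * d_j over all edges (i, j).
    return sum(g.degree(i) * g.degree(j) for i, j in g.edges())

random.seed(0)
g0 = nx.barabasi_albert_graph(60, 2, seed=0)  # stand-in starting topology

samples = [s_metric(g0)]
g = g0.copy()
for _ in range(500):
    # Each double edge swap preserves every node's degree exactly,
    # so all sampled graphs share g0's degree sequence.
    nx.double_edge_swap(g, nswap=5, max_tries=500, seed=random.randint(0, 10**6))
    samples.append(s_metric(g))

s_max = max(samples)
print("relative likelihood of the starting graph:", s_metric(g0) / s_max)
print("spread across rewired graphs:", min(samples) / s_max, "to 1.0")
```

Pairing each sampled graph's relative likelihood with its Perf(g)-value (computed as in the earlier sketch) reproduces the kind of scatter shown in Figure 13.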
Figure 12: Five networks having the same node degree distribution. (a) Common node degree distribution (degree versus rank on log-log scale); (b) Network resulting from preferential attachment; (c) Network resulting from the GRG method; (d) Heuristically optimal topology; (e) Abilene-inspired topology; (f) Sub-optimally designed topology. (Reprinted from [93]; © 2004 ACM, Inc. Included here by permission.)
Figure 13: Performance vs. likelihood for each of the topologies in Figure 12, plus other networks (grey dots) having the same node degree distribution, obtained by pairwise random rewiring of links. (Reprinted from [93]; © 2004 ACM, Inc. Included here by permission.)
The interpretation of this picture is that a careful design process explicitly incorporating technological constraints can yield high-performance topologies, but these are extremely rare from a probabilistic graph point of view. In contrast, equivalent networks constructed by generic degree-based probabilistic constructions result in more likely, but poorly-performing, topologies. Consistent with this, the most likely network (included in Figure 13) also has sub-par performance. This picture can be further enhanced by considering alternative performance measures such as the distribution of end-user bandwidths and router utilization. As detailed in [93], the heuristically optimal networks (d) and (e) achieve high utilization in their core routers and support a wide range of end-user bandwidth requirements. In contrast, the random degree-based networks (b) and (c) saturate only their hub nodes and leave all other routers severely underutilized, thus providing uniformly low bandwidth and poor performance to their end users. The main lesson from this comparison of five different networks with identical node degree distributions is that for network modeling, functionality (e.g., performance) trumps structure (e.g., connectivity). That is, connectivity-only metrics are weak discriminators among all graphs of a given size with the same node degree distribution, and it requires appropriate performance-related metrics to separate "the wheat from the chaff."

We explained earlier that on the basis of currently available traceroute measurements, claims of power-law relationships have no substance as far as the Internet's router topology is concerned. However, by examining available router technologies and models, we have also shown that it is certainly conceivable that the actual node degrees of deployed routers in an actual ISP can span a range of 2-3
orders of magnitude; that is, the corresponding node degree distribution exhibits high variability, without necessarily conforming to a power law-type distribution. At the same time, Figure 12 illustrates that irrespective of the type of node degree distribution, graphs with identical node degree distributions can be very different in their structure and differ even more drastically in terms of their functionality (e.g., performance). What is also true is that the same core network design can support many different end-user bandwidth distributions and that, by and large, the variability in end-user bandwidth demands determines the variability of the node degrees in the resulting network. To illustrate, consider the simple example presented in Figure 14, where the same network core supports different types of variability in end-user bandwidths at the edge (and thus yields different overall node degree distributions). The network in Figure 14(a) provides uniformly high bandwidth to end users; the network in Figure 14(b) supports end-user bandwidth demands that are highly variable; and the network in Figure 14(c) provides uniformly low bandwidth to end users. Thus, from an engineering perspective, not only is there not necessarily any implied relationship between a network's node degree distribution and its core structure, there is also no implied relationship between a network's core structure and its overall degree distribution. Thus, the proposed engineering-based first-principles approach to modeling the Internet's router topology demystifies power law-type node degree distributions altogether by identifying their root cause in the form of high variability in end-user bandwidth demands. In view of such a simple physical explanation of the origins of node degree variability in the Internet's router-level topology, Strogatz's question, paraphrasing Shakespeare's Macbeth, "... power-law scaling, full of sound and fury, signifying nothing?" [145] has a resounding affirmative answer.
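A back-of-the-envelope sketch makes this concrete. Suppose (purely hypothetically) that every edge router has the same aggregate capacity to allocate across its access ports; then the mix of access speeds alone dictates the router degrees, and a heterogeneous user population produces degrees spanning orders of magnitude:

```python
# Hypothetical numbers for illustration only: same aggregate edge capacity,
# different end-user access-rate mixes, very different node degrees.
edge_router_capacity_bps = 1e9  # assumed 1 Gbps of access capacity per edge router

user_mixes = {
    "uniformly high (10 Mbps broadband)": [10e6],
    "highly variable (dial-up to broadband)": [56e3, 256e3, 1e6, 10e6],
    "uniformly low (56 kbps dial-up)": [56e3],
}

for label, rates in user_mixes.items():
    # One router per access rate, each filled to capacity with users of that rate.
    degrees = [int(edge_router_capacity_bps // r) for r in rates]
    print(f"{label}: edge-router degrees {degrees}")
```

The uniform mixes yield essentially one edge-router degree each (roughly 100 or roughly 17,800 here), while the variable mix yields degrees from about 100 up to about 17,800 on the same core, mirroring the three cases of Figure 14.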
Figure 14: Distribution of node degree and end-user bandwidths for several topologies having the same core structure: (a) uniformly high bandwidth end users, (b) highly variable bandwidth end users, (c) uniformly low bandwidth end users. (Reprinted from [93]; © 2004 ACM, Inc. Included here by permission.)
In summary, the traditional approach to router-level topology modeling is flawed in more than one way, and we have collected and presented this gradually accumulating evidence in this section: the underlying measurements are highly ambiguous (3.2), the inferred connectivity structures are erroneous (3.3), and the resulting models are infeasible and/or do not make sense from an engineering perspective because they are either too costly, have extremely poor performance, or cannot be built from existing technology in the first place. This section also describes and reports on an alternative design-based approach to router-level topology modeling and generation that has come into focus during the last 5-10 years and represents a clean break with tradition. The most visible sign of this break is the emergence of constrained optimization as the new modeling language, essentially replacing the traditional language of random graph theory. While the latter treats nodes and links as largely generic objects and focuses almost exclusively on structural aspects such as connectivity, the former supports a much richer treatment of topologies: nodes and links are components with their own structure, constraints, and functionality, and their assembly into a topology that is supposed to achieve a certain overall objective, and should do so efficiently and effectively within the given constraints on the individual components or the system as a whole, is the essence of the constrained optimization-based network design approach. In essence, this approach echoes what was articulated some 15 years ago in [23, 42, 165], but it goes beyond this prior work in terms of empirical evidence, problem formulation, and solution approach. As a result, the described design-based approach has by and large put an end to graph theory-based router topology modeling for the Internet.

At the same time, the design-based approach, which has been developed and reported in bits and pieces in the existing literature and is presented in this section in one piece for the benefit of the reader, also has far-reaching implications for Internet topology research in general and router-level topology modeling, analysis, and generation in particular. For one, it shows the largely superficial nature of router-level topology generators that are based on graph-theoretic models. As appealing as they may be to a user because of their simplicity (after all, all a user generally has to specify is the size of the graph), they are by and large of no use for any real application where details like traffic, routing, capacities, or functionality matter. Second, while the design-based approach yields realistic router-level topology models that are inherently generative in nature, it at the same time puts an end to the popular request for a largely generic black-box-type topology generator. Users in real need of synthetic router-level maps have to recognize that this need doesn't come for free. Instead, it comes with the responsibility to provide detailed input in terms of expected customers, their geographic dispersion, the traffic matrix (see [150] for more details), design objectives and constraints, etc. In addition, the level of detail required of a generated ISP router-level topology (e.g., POP-, router-, or interface card-level) depends critically on, and cannot be separated from, the purpose for which these generated maps will be used.
Again, this puts a considerable burden on the user of a synthetically generated map and tests her understanding of the relevant issues to a degree unheard of in Internet topology modeling in the past. Third, the explicit focus of the design-based approach on ISPs as crucial decision makers renders the commonly-expressed desire for synthetic router-level maps of the global Internet largely pointless. The Internet is a network of networks, with the sovereign entities being the autonomous systems (ASes). A subset of these ASes that are in the business of providing network service to other ASes or Internet access to end users own and operate the networks that together make up much of the physical infrastructure of the global Internet. As a result, a key first step in understanding the structure and temporal evolution of the Internet at the different physical and logical layers is to study the physical infrastructures of the service and access providers' networks and how they react in response to changes in the environment, technology, economy, etc. Finally, once we have a more-or-less complete picture of the router-level topology for the individual
ISPs, we can start interconnecting them at common locations, thereby bringing ISP router-level and AS-level topology under one umbrella. In the process, it will be critical to collapse the detailed router-level topologies into their corresponding PoP-level maps, which are essentially the geographic maps mentioned in the context of the ARPANET in 3.1 and serve as glue between the detailed router-level topologies and an appropriately defined and constructed AS-level topology of the Internet. For a faithful and realistic modeling of this combined router-, PoP-, and AS-level structure of the Internet, it will be important to account for the rich structure that exists in support of network interconnections in practice. This structure includes features such as third-party colocation facilities that house the PoPs of multiple ASes in one and the same physical building. It also includes components of the Internet's infrastructure such as Internet eXchange Points (IXPs). This existing structure is inherently non-random but is a reflection of the incentives that exist, on the one hand, for network and access providers to build, manage, and evolve their physical infrastructures and, on the other hand, for content providers, CDNs, and cloud providers to establish peering relationships with interested parties. Importantly, neither such structures nor such incentives preclude an application of the described constrained optimization-based approach to network design; they merely require being creative with respect to formulating a proper objective function, identifying the nature of the most critical constraints, and being able to pinpoint the main sources of uncertainty in the environment.
3.5 Notes
The primary sources for the material presented in this section are:

[93] L. Li, D. Alderson, J.C. Doyle, and W. Willinger. A first-principles approach to understanding the Internet's router-level topology, Proc. ACM SIGCOMM'04, ACM Computer Communication Review 34(4), 2004.

[46] J.C. Doyle, D.L. Alderson, L. Li, S. Low, M. Roughan, S. Shalunov, R. Tanaka, and W. Willinger. The "robust yet fragile" nature of the Internet, PNAS 102(41), 2005.

[156] W. Willinger, D. Alderson, and J.C. Doyle. Mathematics and the Internet: A Source of Enormous Confusion and Great Potential, Notices of the AMS 56(5), 2009.

An excellent short treatment of the discussion in 3.3 about network modeling as an exercise in reverse-engineering vs. an exercise in model fitting can be found in Chapter 10 of the recent book:

[29] M. Chiang. Networked Life: 20 Questions and Answers. Cambridge University Press, 2012.

For additional and more in-depth reading materials we point to:

[7] D. Alderson, L. Li, W. Willinger, and J.C. Doyle. Understanding Internet Topology: Principles, Models, and Validation, IEEE/ACM Transactions on Networking 13(6): 1205-1218, 2005.

[85] B. Krishnamurthy and W. Willinger. What are our standards for validation of measurement-based networking research? Computer Communications 34, 2011.

For some "food for thought" regarding topics such as power law distributions, scale-free networks, and network science, we recommend:

[157] W. Willinger, D. Alderson, J.C. Doyle, and L. Li. More normal than normal: Scaling distributions and complex systems. In: R.G. Ingalls, M.D. Rossetti, J.S. Smith, and B.A. Peters (Editors), Proc. of the 2004 Winter Simulation Conference, IEEE, Piscataway, NJ, 2004.
[106] M. Mitzenmacher. A Brief History of Generative Models for Power Law and Lognormal Distributions, Internet Mathematics 1(2):226-251, 2004.

[82] E. Fox Keller. Revisiting scale-free networks, BioEssays 27(10):1060-1068, 2005.

[11] A.-L. Barabasi. Scale-free Networks: A Decade and Beyond, Science 325, 2009.

[12] A.-L. Barabasi. The network takeover, Nature Physics 8, pp. 14-16, 2012.

[107] M. Mitzenmacher. Editorial: The Future of Power Law Research. Internet Mathematics 2(4):525-534, 2006.
4 AS-level topology
When trying to establish a precise meaning or interpretation of the phrase "Internet topology" as used in much of the existing literature, we find that it has often been taken to mean a virtual construct or graph created by the Border Gateway Protocol (BGP) routing protocol. Commonly referred to as the inter-domain or Autonomous System (AS) topology, named after the logical blocks (ASes) that are used in BGP to designate the origin and path of routing announcements, it is this particular connectivity structure that we focus on in this section, though we will see that the notion of the AS topology is more slippery than commonly imagined. In particular, we will discuss some of the main issues that arise in the context of studying the Internet's AS topology (ranging from proper definitions and interpretations of this construct to measurements) and focus less on modeling-related aspects, as they are still in their infancy, especially when compared to the advances in router-topology modeling described in 3.
connecting network, and so most such connections result in IP addresses from neighboring ASes appearing locally. Another problem arises from the fact that although an AS is often considered to correspond to a single technical administrative domain, i.e., a network run by one organization, it is common practice for a single organization to manage multiple ASes, each with its own ASN [22]. For instance, Verizon Business (formerly known as UUNET) uses ASNs 701, 702, and 703 to separate its E-BGP network into three geographic regions, but runs a single IGP instance throughout its whole network. In terms of defining the nodes of a graph, these three networks are all under the same operational administrative control and hence should be viewed as a single node. On the other hand, as far as ASNs are concerned, they are different and should be treated as three separate nodes. The situation is actually more complex, since corporations like Verizon Business own some 200+ ASNs [22] (not all of which are actually used, though). In many of these cases, a clear boundary between these multiple ASes may not really exist, thus blurring the definition of the meaning of a node in an AS graph. Similar problems can arise when a single AS is managed by multiple administrative authorities consisting of individuals from different corporations. For example, AS 2914 is run partially by NTT/America and partially by NTT/Asia. All this presumes that an AS is a uniform, contiguous entity, but that is not necessarily true [110, 111]. An AS may very well announce different sets of prefixes at different exit points of its network, or use BGP to balance traffic across overloaded links (other reasons for heterogeneous configurations are reported in [21]). Figure 15 illustrates the problem. The AS-graph simplifies, in some cases grossly, the very complicated structure of the entities involved, which are often heterogeneous and not necessarily even contiguous, either geographically or logically. For all these reasons, it should be clear that modeling an AS as a single atomic node without internal (or external) structure is overly simplistic for most practical problems. Moreover, these issues cannot simply be addressed by moving towards graph representations that can account for some internal node structure (such as in [110]), mainly because BGP is unlikely to reveal sufficient information to infer the internal structure for the purpose of faithful modeling. Furthermore, the AS-graph treats ASes as nodes with connecting edges, but the real situation is much more complex. ASes are complex networks in their own right and are connected sometimes by multiple edges (Mérindol et al. [105] found that over half of the ASes they studied were connected by multiple links), and sometimes through Internet eXchange Points (IXPs) that connect multiple ASes. In fact, the traditional approach of modeling the AS-level Internet as a simple connected di-graph is an abstraction incapable of capturing important facets of the rich semantics of real-world inter-AS relationships, including different interconnections for different policies and/or different interconnection points [110, 111]. The implications of such abstractions need to be recognized before attributing network-specific meaning to findings derived from the resulting models.
Figure 15: An illustration of the obfuscation of the AS-graph (in the vein of [61]). The graph may appear simple, but it hides heterogeneous, non-atomic, dis-contiguous entities and interconnects. At the minimum, this should illustrate the dangers of talking about "the Internet graph".
Routing between ASes is very different from routing within ASes and highlights the difference between graph representations that reflect "reachability" vs. "connectivity" information. These differences create interesting problems and opportunities for measurements, some with parallels to the router-level measurement problems and others without any such parallels.

4.2.1 Data-plane vs. control-plane measurements

As discussed in 3.2, despite all its deficiencies, traceroute has been the method-of-choice for obtaining router-level measurements. As a prime example of an active measurement tool that is confined to the data plane (i.e., probe packets take the same paths as generic data packets), traceroute has also been used to obtain information about the AS topology, but it has additional problems in this domain. Apart from the already problematic issues (e.g., load balancing, aliasing, missing data), IP addresses along traceroute paths must now be mapped to ASes. This mapping is even harder than the mapping to routers, not just because the data for doing so is inaccurate or incomplete (e.g., IP-to-organization allocations may not work because an organization does not directly correspond to an AS), but also because the border of an AS is not well-defined in terms of IP addresses. It is common for a link between two ASes to come from a subnet allocated by one of the ASes, resulting in an interface in the other network with an address that is not its own [100, 101]. The problem is further complicated by variations such as anycast or Multiple Origin ASes [167], which provide yet another set of counter-examples to a straightforward mapping between AS and address space. Some work has concentrated on trying to improve
the mapping [120], and these efforts represent technical advances, but it is important to understand that the fundamental difficulty lies in the fact that the boundaries of the business are not equivalent to the AS boundaries.

The other major alternative for obtaining information about the AS topology is to collect control-plane data in the form of directly measured routing information. The primary source of such control-plane data is BGP-derived measurements. BGP is a path-vector routing protocol, and as such each node transmits to its neighbors information about the best path that it knows to a destination. Each node then takes the information it has received about best paths, computes its own best path, and transmits it to its neighbors. A route monitor receives this information as would any router, and from the transmitted path information it can infer links between ASes. The two best-known projects that rely on BGP route monitors, Oregon RouteViews [118] and RIPE (Réseaux IP Européens)'s Routing Information Service [131], both use this approach, and each connects to a few dozen different ASes. However, by its very design, BGP is an information-hiding rather than an information-revealing routing protocol. In addition, by its very design, BGP is all about reachability, not connectivity. Using it for mapping the Internet's inter-domain topology is a hack, and so it should come as no surprise that it has its own set of problems, including the following:

- The AS-path information in the announcements is primarily included for loop detection and does not have to correspond to reality. It is easy (and not uncommon) to insert additional ASes into a path for various purposes, e.g., traffic engineering or measurement [21, 35], and moreover, the AS-path does not have to represent the data path.

- Path-vector protocols do not transmit information on every path in the network. For instance, backup paths may never appear in any routing announcements (unless there is a failure), and so may not be seen by a route monitor.

- Path-vector protocols only transmit best paths, and so there is a large loss of visibility from any one viewpoint. It is sometimes argued that a large number of viewpoints would alleviate this, but the viewpoint locations are highly biased towards larger networks, and this known "vantage-point problem" severely biases the possible views of the network [134].

The BGP measurement data provided by RIPE and RouteViews was originally intended to help debug networks, not to map them. While these data collections have been invaluable for that original purpose, it is unsurprising that they are inadequate when used for a rather different purpose such as mapping the AS-level Internet. When this aspect is carefully taken into account, good work can be done, but it requires a critical evaluation of the data; problems arise primarily when the data is used uncritically. Other useful sources of AS-level measurements, such as looking glass servers and route registries, suffer from similar problems [69, 96], and do so for similar reasons: they weren't intended to draw a map of the AS-level Internet.

4.2.2 Attribute discovery

The AS topology may be interesting to scientists in itself, but to be useful to network engineers, the routing policies that accompany it should also be known. It has been common to approximate the range of policies between ASes by a simple set of three relationships: (a) customer-provider, (b) peer-peer, and (c) siblings.
This reduction was at least in part motivated by Huston [72, 73] and has been used in various places [146, 153, 159]. While many relationships fall into these three categories, there are frequent exceptions [75, 110, 126], for instance, in the form of partial transit in a particular region [115, 163].
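Before relationships can be labeled, the AS adjacencies themselves must be extracted from the monitors' AS-path data. The minimal sketch below (Python with networkx; the AS-paths are made up for illustration) shows the basic edge-extraction step together with the de-duplication needed to cope with AS-path prepending; all the caveats above (poisoned paths, invisible backup links, vantage-point bias) still apply to whatever graph comes out.

```python
import networkx as nx

# Made-up AS_PATH strings as a route monitor might record them; real
# pipelines also filter private ASNs, AS sets, and obvious poisoning.
raw_as_paths = [
    "65001 3356 701 12345",
    "65001 3356 3356 3356 701 12345",  # prepending: same links, longer path
    "65001 1239 701 12345",
]

as_graph = nx.Graph()
for line in raw_as_paths:
    hops = line.split()
    # Collapse consecutive duplicates so AS-path prepending does not
    # create self-loops or spurious edges.
    dedup = [h for i, h in enumerate(hops) if i == 0 or h != hops[i - 1]]
    as_graph.add_edges_from(zip(dedup, dedup[1:]))

print(sorted(as_graph.edges()))
# Visibility bias: only best paths exported toward the monitor (AS 65001
# here) contribute edges; links elsewhere in the network stay invisible.
```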
Forgetting for the moment the simplification in assuming that all policies fit this model, as well as the simplifications the AS-graph itself makes, the relationships can be represented in the graph by providing simple labels for each edge. Typically, the next step after inferring network topology is to infer the policies between ASes. The most common approach to this problem is to assume the universality of the peer-peer, customer-provider, sibling-sibling model, and to infer the policies by finding an allocation of policies consistent with the observed routing [14, 41, 56, 153, 159]. Once relationships are established, a seemingly reasonable next step is to estimate the hierarchical structure as in [146]. However, the effect of large numbers of (biased) missing links has not really been considered in these algorithms. In fact, the tier structure of the Internet seems to be largely an illusion. Recent work has shown that there is little value in the model at present [58, 88]; but, in contrast to the claims of these papers, there is no strong evidence that the situation has actually changed or that the tier model was ever a good model (except maybe in the early stage of the "public" Internet in the late 20th century), particularly in light of the problems in the data. Alternatively, we can infer a generic set of policies consistent with routing observations using a more detailed set of routing measurements [98, 110] and estimate performance by comparing predicted routes to real routes (held back from the inference process).

4.2.3 The "missing link" problem: Extent and impact

Perhaps the most obvious problem that results from relying on BGP measurement data to map the AS-level Internet is that there are many missing links in the resulting AS-graph. To illustrate the extent of this problem, years of concentrated research efforts that relied on a combination of improved inference methods and additional data sources [9, 27, 28, 39, 41, 69, 70, 110, 117, 134, 136, 166] have produced a picture of the Internet's AS topology that as of 2011 consisted of some 35,000-40,000 ASes (nodes) and about 115,000-135,000 edges (AS links), with about 80,000-90,000 of them being of the customer-provider type and 35,000-45,000 of the peer-peer type. More recently, this supposedly up-to-date and most complete view of the AS-level Internet changed drastically thanks to [2], which relied on ground-truth data from one of the largest IXPs in Europe (and worldwide) that had, at the time of the study, almost 400 member ASes. The main finding of this study is that in this single location, the number of actively used AS links of the peer-peer type was more than 50,000 larger than the number of all AS links of the peer-peer type in the entire Internet known as of 2011. Moreover, even when extrapolating extremely conservatively from this IXP to the Internet as a whole, [2] shows that there are easily more than 200,000 AS links of the peer-peer type in the entire Internet, more than twice the number of all AS links of the customer-provider type Internet-wide. Importantly, the main reason for this abundance of AS links of the peer-peer type at IXPs is well understood: many IXPs, especially the larger ones, offer as a free service to their member ASes the use of their route server. This service greatly facilitates the establishment of peer-peer links between the members of an IXP and has become enormously popular with members that have an "open" (as compared to restrictive or selective) peering policy.
Especially for the larger IXPs, such networks typically constitute the vast majority of IXP member ASes. Figure 16 provides an illustration of the connectivity through this IXP and shows that a majority of its member ASes have an open peering policy (some 300+ members) and establish AS links of the peer-peer type with one another. In short, for many years, researchers have worked with AS-graphs that are typically complete in terms of nodes, but easily miss more than half the edges. Importantly, these graphs generally have a 2:1 ratio of customer-provider type vs. peer-peer type links when a 1:3 ratio is much more likely to reflect Internet reality.
Figure 16: (a) Breakdown of IXP members by business type; (b) scatter-plot of the number of peers per member, based on the classification of the member ASes in the four business categories defined above: LISP (Large ISP), SISP (Small ISP), HCDN (Hosting/service and Content Distribution Network), and AEN (Academic and Enterprise Network), and by tier; (c) fractions of HTTP/HTTPS traffic per member. (Reprinted from [2]; © 2012 ACM, Inc. Included here by permission.)
Clearly, for gaining any economics-based understanding of the AS-level Internet, getting that ratio approximately correct is paramount, because it directly impacts how money flows in the Internet: while in a customer-provider relationship the former pays the latter for bandwidth, peer-peer relationships are typically settlement-free (i.e., no money is exchanged between the involved parties). Besides their immediate economic impact, the above missing edges also cause significant problems in inferring the AS graph. For instance, it is a requirement that a network be multi-homed to obtain an ASN; that is, the AS needs to intend to connect to at least two upstream providers. In this sense, a single-homed stub-AS does not exist. Without any doubt, there are exceptions to this rule. However, the second link is often a backup link which is invisible to BGP outside of the immediate connection, because of BGP's information hiding5. Thus, it may appear as if a large number of ASes are single-homed stubs. In [117], the authors separate the missing links into hidden and invisible. Whereas the latter are links that are missing from the data for structural reasons (i.e., it is not just a question of the quantity of monitors, but of their location), the hidden links may be found with enough measurements (over time, or from multiple viewpoints). In [134] the authors extend this by dividing links into a number of classes based on their observability. The "missing link" problem in the AS context is much more serious than if those links were missing at random. In particular, the bias in the type of links that are missing [134] is critical when calculating some metrics on the graph, such as distances, precisely because such links are often designed to cut down on the number of ASes traffic must traverse. The missing data is also crucial for understanding reliability: for instance, papers such as [5] that argue that high-degree nodes create vulnerabilities in the Internet ignore the backup links that are invisible in these datasets, but obviously crucial when studying the resilience of the network.
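To make this bias concrete, the following toy sketch (our illustration, not taken from the papers cited above; it assumes the Python networkx library and an invented topology) shows how a view that misses exactly the peer-peer edges, as a BGP-derived view tends to, inflates distance metrics:

```python
# A toy sketch of biased missing links: peer-peer edges are exactly the
# shortcuts that reduce AS-path lengths, so a view that misses them
# systematically overestimates distances. The topology is invented.
import networkx as nx

# A small hierarchy: T1 provides transit to A-D, which serve stubs s1-s4.
customer_provider = [("T1", "A"), ("T1", "B"), ("T1", "C"), ("T1", "D"),
                     ("A", "s1"), ("B", "s2"), ("C", "s3"), ("D", "s4")]
peer_peer = [("s1", "s2"), ("s3", "s4")]  # e.g., established at an IXP

full = nx.Graph(customer_provider + peer_peer)
observed = nx.Graph(customer_provider)  # BGP view missing the peerings

for name, g in [("full", full), ("observed", observed)]:
    print(name, "average path length:",
          round(nx.average_shortest_path_length(g), 2))
# The observed graph needs 4 hops for s1 -> s2 instead of the real 1,
# so distance-based metrics computed on it are systematically wrong.
```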
Table 1: Different graphs associated with the Internet, their edge annotations, and their graph types.

Graph                    Edge Annotation                      Graph Type
business relationship    subsidiary, partner, customer, ...   directed graph
physical link-level      link capacity                        multi-/hyper-graph
connectivity graph       (none)                               multigraph
BGP routing graph        (none)                               undirected graph
policy graph             BGP policies                         directed multigraph
traffic graph            traffic volumes                      directed graph
to being a multigraph, as it is very common for two ASes to be connected by multiple links, in different geographic locations [94, 114, 143]. The idea is clearly illustrated by Figure 1 in [94], which shows a "pancake" diagram of the North American Internet backbone. Perhaps the reason this critical aspect of the topology is typically ignored is that it is very hard to measure: BGP monitor data is in general blind to this facet of the topology. In addition, this graph should really be a hypergraph. A single edge can connect multiple ASes, for example through an IXP [9, 75, 160]. One might argue that they are joined by a switch/router, each using point-to-point links, but in at least some cases, that switch has no place in an AS graph (i.e., it has no ASN). The graph's edges could be usefully annotated with link capacity and potentially other features such as geographic location.

Connectivity graph: this graph indicates that layer-2 connectivity exists between two ASNs. In many cases the layer-2 connectivity between ASNs would be congruent with the layer-1 connectivity, but with recent advances in network virtualization this may not hold for long [154].

BGP routing graph: the edges in this graph indicate pairs of ASes that have an active BGP session exchanging routing information (i.e., a BGP session that is in the established state [130]).

Policy graph: the edges in this graph are the same as those in the BGP routing graph, but include directed policy annotations [62]. We define this separately from the BGP routing graph because it may require a multigraph to allow for policy differences between different regions.

Traffic graph: the same as the BGP routing graph, but with the edges annotated with the amount of traffic exchanged between the corresponding ASes.

This is hardly a complete set of possibilities, but already we can see the potential complexity here. Nevertheless, it appears unusual for studies to even define precisely what graph they examine (exceptions being papers such as [60, 117] where the BGP routing graph is explicitly considered). In Table 1, we list some of the possible graphs and their basic properties. There is no clean 1:1 mapping between network, organization and AS [22, 75], and so it is highly non-trivial to map between these graphs, and they are certainly not equivalent.
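As a small illustration of why the choice of graph type matters, the following sketch (ours; it assumes the Python networkx library, and the ASNs and annotations are invented) contrasts a multigraph for the physical view with a directed multigraph for policies, and shows how a simple graph silently collapses parallel links:

```python
# A minimal sketch of representing different AS-level "views" with
# appropriate graph types; collapsing them into one simple graph loses
# information that the text above argues is essential.
import networkx as nx

# Physical link-level view: a multigraph, since two ASes commonly
# interconnect with several parallel links in different cities.
phys = nx.MultiGraph()
phys.add_edge("AS1", "AS2", city="New York", capacity_gbps=10)
phys.add_edge("AS1", "AS2", city="Los Angeles", capacity_gbps=40)

# Policy view: a directed multigraph, since import/export policies can
# differ per direction (and potentially per region).
policy = nx.MultiDiGraph()
policy.add_edge("AS1", "AS2", relationship="customer->provider")
policy.add_edge("AS2", "AS1", relationship="provider->customer")

# Converting the physical view into a simple graph merges parallel edges.
simple = nx.Graph(phys)
print(phys.number_of_edges(), "physical links vs.",
      simple.number_of_edges(), "edge in the simple-graph view")
```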
significantly, and in the majority of such successful papers there is no need to exploit the graph view of the network. Examples include:

(a) the discovery of slow convergence and persistent oscillation in routing protocols [64, 86, 87, 89, 90, 151, 152];
(b) understanding of the impacts (positive and negative) of route flap damping [97, 124];
(c) determining how much address space and how many ASNs are being actively used [74];
(d) looking for routing "Bogons", often related to Internet address hijacking [17, 40, 50, 128, 147]; and
(e) debugging network problems [20, 53, 133].

On the measurement side, there have also been many advancements towards improving our view of the AS topology. For instance:

1. As BGP routing changes, often multiple potential paths are explored, and these paths (which are unlikely to actually be used as a final choice) can show some of the alternative routes available in the network [166], and thus a more complete topology.
2. Missing edges can be found using additional datasets, e.g., RIRs and looking glasses [27, 69, 70, 166], or IXP data [9, 69, 70, 136], though care must be exercised with any additional dataset.
3. A routing beacon [21, 89, 99] is just a router that advertises and withdraws certain prefixes on a regular schedule. Examination of the observed announcements and withdrawals by various route monitors then allows estimates of protocol behavior such as convergence time (a small sketch appears at the end of this section).
4. Route poisoning prevents announcements from reaching certain parts of the Internet. As with beacons, it allows one to examine the behavior of BGP in a more controlled manner. This is perhaps the only way to see (some) backup paths, or to understand whether an ISP uses default routing [21, 35].
5. There are also attempts to not just estimate the topology but derive some quality measure for the resultant AS-graph [76, 134, 158].

There is often an unfortunate side-effect to some of these types of measurement, in the form of a Heisenberg-like uncertainty principle. That is, it is not clear whether observed changes are due to the micro-phenomenon of path exploration or macro-phenomena of link changes, new entrants, etc. The longer we make observations, the more complete they may seem, but we then do not know whether all of those links existed at the same time. Such uncertainty principles appear to be present in a number of Internet measurement contexts [132] where we trade off accuracy of the measurements against time localization. In any case, this approach does not overcome the structural bias mentioned earlier.

At the same time, the above-mentioned and other advances on the measurement side suggest that the missing link problem may be improved, providing more complete AS graphs. However, there is a profound need (illustrated by the above) for better data accuracy measurements, and a better response to data quality issues from subsequent users of the data. Obvious ways to improve are to conduct sensitivity analyses (of results) with respect to missing or incorrect input data. In addition, it is to be hoped that more controlled experiments are conducted (i.e., experiments that have a control sample against which the experimental data can be compared) in order to precisely derive which factors of interest affect which variables. Controls allow one to discriminate alternative explanations for results, and prevent the effects of one confounding factor drowning out the effects of others (see [21, 99]). This is a basic tenet of the scientific method, but seems to have been ignored in this area of research. Most studies have been observational, and while there is a valid role for such studies, for instance in epidemiology, they are intrinsically harder to interpret.

Lastly, another aspect of this richer set of AS topologies is that it should be obvious by now that economic or commercial objectives by and large determine and shape the structure and evolution of their real-world counterparts, and that these constructs are once again naturally expressed through optimization rather than random graph models, though in this case the optimization problems may come from game theory or economics rather than mathematical programming.
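As a concrete illustration of item 3 above, the following sketch (ours, with invented data; actual beacon studies such as [99] use more careful methodology) estimates convergence time from a scheduled beacon event and the timestamped updates seen by one route monitor:

```python
# A minimal sketch of beacon-based measurement: given a known beacon
# event time and a stream of update timestamps for the beacon prefix,
# estimate convergence as the interval until updates settle down.
from datetime import datetime, timedelta

QUIET = timedelta(minutes=5)  # no updates for this long => converged

def convergence_time(event_time, update_times):
    """Return the time from the beacon event until updates settle."""
    last = event_time
    for t in sorted(update_times):
        if t < event_time:
            continue          # update belongs to an earlier event
        if t - last > QUIET:
            break             # quiet period reached; later updates are noise
        last = t
    return last - event_time

event = datetime(2013, 8, 1, 12, 0, 0)   # beacon announces at noon
updates = [event + timedelta(seconds=s) for s in (2, 15, 40, 95)]
print("estimated convergence:", convergence_time(event, updates))
# -> 0:01:35 for this invented update stream
```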
4.5 Notes
The primary sources for the material presented in this section are:

[135] M. Roughan, W. Willinger, O. Maennel, D. Perouli, and R. Bush. 10 Lessons from 10 Years of Measuring and Modeling the Internet's Autonomous Systems. IEEE Journal on Selected Areas in Communications 29(9):1810-1821, 2011.

[2] B. Ager, N. Chatzis, A. Feldmann, N. Sarrar, S. Uhlig, and W. Willinger. Anatomy of a Large European IXP. In Proc. ACM SIGCOMM'12, ACM Computer Communication Review 42(4), 2012.

and they contain lengthier discussions of many of the issues touched upon here. For additional and more in-depth reading materials (in addition to the references indicated throughout) we point to:

[26] H. Chang. Modeling the Internet's Inter-Domain Topology and Traffic Demand Based on Internet Business Characterization. PhD Thesis, University of Michigan, 2006.

[43] B. Donnet and T. Friedman. Internet Topology Discovery: A Survey. IEEE Communications Surveys & Tutorials, 9(4), pp. 56-69, 2007.

[38] A. Dhamdhere. Understanding the Evolution of the AS-level Internet Ecosystem. PhD Thesis, Georgia Institute of Technology, 2008.

[68] H. Haddadi, M. Rio, G. Iannaccone, A. Moore, and R. Mortier. Network Topologies: Inference, Modeling and Generation. IEEE Communications Surveys, 10(2), 2008.

[39] A. Dhamdhere and C. Dovrolis. Twelve Years in the Evolution of the Internet Ecosystem. IEEE/ACM Transactions on Networking 19(5), 2011.
5 PoP-level topology
When designing or reconfiguring the physical infrastructure of an ISP, network operators are often guided by a design principle that emphasizes hierarchy [31, 59, 108]. There are two main reasons for implementing hierarchical network designs: scalability and simplicity. Compared to non-hierarchical designs, hierarchical networks can often be built at scale, mainly because hierarchy makes a network easier to visualize, a key feature towards making it easier to manage. The situation is analogous to modularity in programming languages: ideally, it allows consideration of network components in isolation. A common form of hierarchy in IP networks is based on the concept of the PoP (or Point of Presence). A PoP is a loosely defined term. Some providers may use the term to mean a physical building (housing a group of routers, switches and other devices), whereas others mean a metropolitan area where service is provided. However it is defined, though, it is a useful construct because it describes the logical structure of the network as the designer intended, rather than its particular implementation in terms of individual routers. Moreover, irrespective of the meaning, PoPs have an explicit geography (e.g., a street address or
city/metropolitan area). This then leads to our third major category of Internet topology: the PoP-level topology. PoP-level topologies are ideal for understanding tradeoffs between connectivity and redundancy, and also provide the most essential information to competitors or customers (about where a network is based, or who has the best access network in a region). Additional reasons why the PoP-level view of networks is interesting include:

- Network maps are often drawn at this level because it is an easy level for humans to comprehend.
- Network optimization is often conducted at this level because the problem size is generally reasonable (e.g., dozens of PoPs as compared to potentially hundreds of routers) and because inter-PoP links are much more expensive than intra-PoP links.
- The internal design of PoPs is almost completely determined by simple templates [31, 59, 108].
- Networks change less frequently at the PoP level than at the router level [138].
- The PoP level is the more interesting level for many activities because it is less dependent on the details of protocol implementations, router vendor and model, and other technological details.
The last point is subtle but important for modelling. For instance, when using a network as part of a simulation, one would like to have a network that is invariant to the method being tested. If a network designer might change his/her network in response to a new protocol, say a routing or traffic engineering algorithm, then the test will be ambiguous if it uses existing networks as models. PoP-level networks are less sensitive to these details than router-level networks, because routers impose physical and technological constraints that are almost completely dependent on the details of the router vendor, model and even the version of software running on the router. Two examples of PoP-level topologies are depicted in Figure 17, showing the structure of two of the largest research networks in the world (Abilene and GÉANT).
city, or metropolitan area, the eyeballs (i.e., end users) connected to the PoP will be spread over some area [129]. However, if the researcher is willing to diligently mine various data sources, there is hope of at least being able to geolocate PoPs, as they house potentially hundreds or thousands of IP addresses and reside in locations with known physical addresses (e.g., carrier hotels) [8].
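One heuristic that has been used in practice for this kind of mining is extracting the location hints that many operators embed in router DNS names (e.g., as exploited by Rocketfuel [144]). The sketch below is ours, and the naming pattern and hostnames are invented examples:

```python
# A minimal sketch of hostname-based geolocation: many ISPs encode a city
# or airport code in router DNS names, so a per-ISP pattern can map an
# IP's reverse-DNS name to a candidate PoP location.
import re

# Hypothetical per-ISP pattern mapping hostname substrings to cities.
PATTERNS = {
    re.compile(r"\.(?P<code>[a-z]{3})\d*\.example-isp\.net$"): {
        "nyc": "New York, NY", "lax": "Los Angeles, CA", "chi": "Chicago, IL",
    },
}

def geolocate(hostname):
    """Return a city guess for a router hostname, or None."""
    for pattern, cities in PATTERNS.items():
        m = pattern.search(hostname)
        if m and m.group("code") in cities:
            return cities[m.group("code")]
    return None

print(geolocate("ae-1.r20.nyc01.example-isp.net"))  # -> New York, NY
```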
topologies). We next discuss this property in more detail.

5.3.1 From PoP-level to router-level connectivity

Given a PoP-level network, there is an additional interesting question: can we map this network down to the router level? The GT-ITM model addressed this through random generation of its subnetworks, but in practice the design process of network engineers in this case is a text-book application of repeating patterns [31, 59, 108], and hence anything but random. The main reason for following this design process is that network designers often apply "cookie cutter" methods to design networks as a whole or the internals of PoPs, though that term unnecessarily trivializes the importance of repeated patterns. Repetition makes network operations vastly simpler: the management of two PoPs requires the same skills. Equipment can be bulk purchased, debugging is easier, and adding new PoPs is simpler. Finally, networks based on templated design lead to simple design methodologies. For instance, the inter-PoP network topology can be optimized relatively simply, as details such as redundancy will be supplied by the provision of pairs of redundant routers in each PoP, with redundant links between them. Design often refers to the graph topology of router interconnections, but templated design can extend to other details, such as physical configuration within racks, connections with external networks, or additional servers such as Domain Name Service or Network Management Systems. This type of design can be mathematically described using graph products (a toy illustration appears at the end of this section); for more details, we invite the reader to consult [121].

5.3.2 From PoP-level to AS-level connectivity: The pancake-view of the Internet

Until now we have only really discussed the PoP-level topology of a single network. However, there is considerable interest in how these networks interconnect. The most prominent and commonly-accepted view of the Internet is as a network of networks or ASes, as in the AS-graph representation discussed in Section 4. A much neglected and rarely-mentioned representation is the pancake view, where we consider each network to be a layer and where the different layers (networks) are stacked on top of one another to form a pancake-like structure [94]. To show where the different networks inter-connect, we add links across layers; intra-network connectivity is shown as links within each layer. For a set of peer networks, one advantage of this pancake view is that these networks often cover similar geographic areas and inter-connect in multiple locations, but at a limited set of cities (determined either by where private interconnects are seen as commercially viable, or where IXPs are available). Importantly, depending on the types of networks, many of them host their PoPs in one and the same commercial colocation facilities, whose street addresses are generally known6. As such, the pancake view allows one to visualize not only the connectivity inside and between such providers but also the geography of their PoP-level topologies. However, as far as we are aware, there has been no significant work studying this pancake view together with the different inter-connections, other than noting that it exists. The dearth of studies and models perhaps stems from the problems in obtaining the measurements necessary for constructing this view (see the discussion in Section 4), but it is perhaps one of the most interesting areas for future Internet topology research.
6 One problem in establishing such a view lies in the limitations of current IP geolocation services [125].
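To give the flavor of templated PoP-to-router expansion (a toy version only; the generalized graph products of [121] are considerably richer), the following sketch assumes the Python networkx library and an invented PoP-level topology, and replaces each PoP by a redundant pair of routers:

```python
# A toy sketch of templated design: each PoP becomes a small full mesh of
# routers, and each inter-PoP link is replicated between the corresponding
# routers, providing the redundancy discussed in Section 5.3.1.
import itertools
import networkx as nx

pop_graph = nx.Graph([("Sydney", "Melbourne"), ("Melbourne", "Adelaide")])

def expand(pop_graph, routers_per_pop=2):
    router_graph = nx.Graph()
    for pop in pop_graph:
        # Template: a full mesh among this PoP's routers.
        members = [f"{pop}-r{i}" for i in range(routers_per_pop)]
        router_graph.add_edges_from(itertools.combinations(members, 2))
    for a, b in pop_graph.edges():
        # Template: parallel inter-PoP links between corresponding routers.
        for i in range(routers_per_pop):
            router_graph.add_edge(f"{a}-r{i}", f"{b}-r{i}")
    return router_graph

r = expand(pop_graph)
print(r.number_of_nodes(), "routers,", r.number_of_edges(), "links")
# -> 6 routers, 7 links for this 3-PoP example
```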
5.5 Notes
The primary source for the material presented in this section (and a much lengthier discussion of many of the issues) is:

[84] S. Knight, H. Nguyen, N. Falkner, R. Bowden, and M. Roughan. The Internet Topology Zoo. IEEE Journal on Selected Areas in Communications (JSAC) 29(9):1765-1775, October 2011.

For additional and more in-depth reading materials (in addition to the references indicated throughout) we point to:

[139] Y. Shavitt and N. Zilberman. Geographical Internet PoP-level Maps. In Proceedings of the 4th International Conference on Traffic Monitoring and Analysis (TMA'12), Springer-Verlag, pp. 121-124, 2012.

[121] E. Parsonage, H. X. Nguyen, R. Bowden, S. Knight, N. Falkner, and M. Roughan. Generalized Graph Products for Network Design and Analysis. In 19th IEEE International Conference on Network Protocols (ICNP), Vancouver, October 2011.
[138] Y. Shavitt and N. Zilberman. A Structural Approach for PoP Geo-location. In IEEE INFOCOM, 2010.

[162] K. Yoshida, Y. Kikuchi, M. Yamamoto, Y. Fujii, K. Nagami, I. Nakagawa, and H. Esaki. Inferring PoP-level ISP Topology through End-to-end Delay Measurement. In PAM, pp. 35-44, 2009.
6 Conclusion
This chapter has aimed at clarifying the state of the art in Internet topology measurement and modelling, and correcting a number of clear and present flaws in reasoning. As we outlined in the introduction, we can see a number of themes recurring at multiple levels of hierarchy in topology modelling:

Theme 1: When studying highly-engineered systems such as the Internet, details in the form of protocols, architecture, functionality, and purpose matter.

Theme 2: When analyzing Internet measurements, examining the hygiene of the available measurements (i.e., an in-depth recounting of the potential pitfalls associated with producing the measurements in question) is critical.

Theme 3: When validating proposed topology models, it is necessary to treat network modeling as an exercise in reverse-engineering and not as an exercise in model-fitting.

Theme 4: When modeling highly-engineered systems such as the Internet, beware of H. L. Mencken's quote: "For every complex problem there is an answer that is clear, simple, and wrong."

We have not tried to survey the entire literature in this area, and we apologize to those whose work has not appeared here; other extant surveys are mentioned at the relevant points throughout this chapter, for specific components of the work. We also have not tried to critique every model, but rather to provide general guidance about modelling. It is intended that readers could themselves critique existing and new models based on the ideas presented here. In addition, we do not claim to have covered every type of topology associated with the Internet. Specifically, we have avoided topologies at the applications layer, for instance those associated with the WWW or online social networks. We made this choice simply because these topologies are (despite being "Internet" topologies) profoundly different from the topologies we have included. They are almost purely virtual, whereas all of the networks considered here have a physical component, which leads to the arguments for optimization as their underlying construction. An important open problem in this context is the role that societal-related factors play over more economic- or technology-based drivers in the formation and evolution of these virtual topologies. Finally, in each section, we have aimed at illuminating some of the current problems and identifying hopefully fruitful directions for future research in this area.
References
[1] Achlioptas, D., Clauset, A., Kempe, D., and Moore, C. On the bias of traceroute sampling: or, power-law degree distributions in regular graphs. In ACM Symposium on Theory of Computing (STOC) (2005), ACM, pp. 694-703.
[2] Ager, B., Chatzis, N., Feldmann, A., Sarrar, N., Uhlig, S., and Willinger, W. Anatomy of a large European IXP. In ACM SIGCOMM (2012).
[3] Aiello, W., Chung, F., and Lu, L. A random graph model for massive graphs. In ACM Symposium on Theory of Computing (STOC) (2000), ACM, pp. 171-180.
[4] Albert, R., and Barabási, A.-L. Statistical mechanics of complex networks. Reviews of Modern Physics 74 (2002), 47-97.
[5] Albert, R., Jeong, H., and Barabási, A.-L. Error and attack tolerance of complex networks. Nature 406 (2000), 378-382.
[6] Alderson, D., and Doyle, J. Contrasting views of complexity and their implications for network-centric infrastructures. IEEE Transactions on Systems, Man, and Cybernetics Part A 40, 4 (2010).
[7] Alderson, D., Li, L., Willinger, W., and Doyle, J. C. Understanding Internet topology: principles, models, and validation. IEEE/ACM Transactions on Networking 13 (December 2005), 1205-1218.
[8] Internet Atlas. http://atlas.wail.wisc.edu/about-us.jsp.
[9] Augustin, B., Krishnamurthy, B., and Willinger, W. IXPs: Mapped? In ACM SIGCOMM Internet Measurement Conference (IMC) (2009), pp. 336-349.
[10] Barabási, A.-L. Linked: How Everything Is Connected to Everything Else and What it Means for Business, Science, and Everyday Life. Perseus Publishing, Cambridge, MA, 2002.
[11] Barabási, A.-L. Scale-free networks: A decade and beyond. Science 325 (2009), 412-413.
[12] Barabási, A.-L. The network takeover. Nature Physics 8 (2012), 14-16.
[13] Barabási, A.-L., and Albert, R. Emergence of scaling in random networks. Science 286 (1999).
[14] Battista, G., Patrignani, M., and Pizzonia, M. Computing the types of the relationships between autonomous systems. In IEEE INFOCOM (2003).
[15] Bender, A., Sherwood, R., and Spring, N. Fixing Ally's growing pains with velocity modeling. In ACM SIGCOMM Internet Measurement Conference (IMC) (2008).
[16] Bonica, R., Gan, D., Tappan, D., and Pignataro, C. ICMP extensions for multiprotocol label switching. IETF, Network Working Group, Request for Comments: 4950, August 2007.
[17] Boothe, P., Hiebert, J., and Bush, R. How prevalent is prefix hijacking on the Internet? NANOG 36 (February 2006).
[18] Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A., and Wiener, J. Graph structure in the web. Computer Networks 33, 1-6 (June 2000), 309-320.
[19] Broido, A., and claffy, k. Internet topology: connectivity of IP graphs. In SPIE International Symposium on Convergence of IT and Communication (August 2001), pp. 172-187.
[20] Bush, R., Hiebert, J., Maennel, O., Roughan, M., and Uhlig, S. Testing the reachability of (new) address space. In ACM SIGCOMM Workshop on Internet Network Management (INM'07) (2007), pp. 236-241.
[21] Bush, R., Maennel, O., Roughan, M., and Uhlig, S. Internet optometry: assessing the broken glasses in Internet reachability. In ACM SIGCOMM Internet Measurement Conference (IMC) (2009), pp. 242-253.
[22] Cai, X., Heidemann, J., Krishnamurthy, B., and Willinger, W. Towards an AS-to-organization map. In ACM SIGCOMM Internet Measurement Conference (IMC) (2010).
[23] Calvert, K., Doar, M., and Zegura, E. Modeling Internet topology. IEEE Communications Magazine 35, 6 (June 1997), 160-163.
[24] Carlson, J., and Doyle, J. Complexity and robustness. Proceedings of the National Academy of Sciences of the USA (PNAS) 99, Suppl. 1 (2002), 2538-2545.
[25] Cerf, V., and Kahn, B. Selected ARPANET maps. ACM SIGCOMM Computer Communications Review (CCR) 20 (1990), 81-110.
[26] Chang, H. Modeling Internet's Inter-Domain Topology and Traffic Demand Based on Internet Business Characterization. PhD thesis, University of Michigan, 2006.
[27] Chang, H., Govindan, R., Jamin, S., Shenker, S. J., and Willinger, W. Towards capturing representative AS-level Internet topologies. Computer Networks 44, 6 (2004), 737-755.
[28] Chen, K., Choffnes, D., Potharaju, R., Chen, Y., Bustamante, F., Pai, D., and Zhao, Y. Where the sidewalk ends: Extending the Internet AS graph using traceroutes from P2P users. In ACM SIGCOMM CoNEXT (2009).
[29] Chiang, M. Networked Life: 20 Questions and Answers. Cambridge University Press, 2012.
[30] Chung, F., and Lu, L. The average distance in a random graph with given expected degrees. Internet Mathematics 1, 1 (2004), 91-113.
[31] ISP network design, Cisco Systems. ISOC ISP/IXP Workshop, 2005.
[32] Building cost effective and scalable CORE networks using an elastic architecture. Cisco white paper, 2013. http://www.cisco.com/en/US/prod/collateral/routers/ps5763/white_paper_c11-727983.pdf.
[33] Converged transport architecture: Improving scale and efficiency in service provider. Cisco white paper, 2013. http://www.cisco.com/en/US/prod/collateral/routers/ps5763/white_paper_c11-728242.pdf.
[34] Clark, D. The design principles of the DARPA Internet protocols. ACM SIGCOMM Computer Communications Review (CCR) 25, 1 (January 1995).
[35] Colitti, L. Internet Topology Discovery Using Active Probing. PhD thesis, Università di Roma Tre, 2006.
[36] BGP cost community, Cisco Systems, 2005. http://www.cisco.com/en/US/docs/ios/12_0s/feature/guide/s_bgpcc.html.
[37] Dasu, T., and Johnson, T. Exploratory Data Mining and Data Cleaning. Wiley, New York, 2003.
[38] Dhamdhere, A. Understanding the Evolution of the AS-level Internet Ecosystem. PhD thesis, Georgia Institute of Technology, 2008.
[39] Dhamdhere, A., and Dovrolis, C. Twelve years in the evolution of the Internet ecosystem. IEEE/ACM Transactions on Networking 19, 5 (2011), 1420-1433.
[40] Diaz, J. GIZMODO: China's Internet hijacking uncovered, 2010. http://gizmodo.com/5692217/chinas-secret-internet-hijacking-uncovered.
[41] Dimitropoulos, X., Krioukov, D., Fomenkov, M., Huffaker, B., Hyun, Y., claffy, kc, and Riley, G. AS relationships: Inference and validation. ACM SIGCOMM Computer Communications Review (CCR) 37, 1 (2007), 29-40.
[42] Doar, M. B. A better model for generating test networks. In IEEE GLOBECOM (1996), pp. 86-93.
[43] Donnet, B., and Friedman, T. Internet topology discovery: A survey. IEEE Communications Surveys & Tutorials 9 (2007), 56-69.
[44] Donnet, B., Luckie, M., Mérindol, P., and Pansiot, J.-J. Revealing MPLS tunnels obscured from traceroute. ACM SIGCOMM Computer Communications Review (CCR) (April 2012). http://www.sigcomm.org/ccr/papers/2012/April/2185376.2185388.
[45] Dorogovtsev, S., and Mendes, J. Evolution of Networks: From Biological Nets to the Internet and WWW. Oxford University Press, 2003.
[46] Doyle, J. C., Alderson, D. L., Li, L., Low, S., Roughan, M., Shalunov, S., Tanaka, R., and Willinger, W. The "robust yet fragile" nature of the Internet. Proceedings of the National Academy of Sciences of the USA (PNAS) 102, 41 (October 2005), 14497-14502.
[47] Erdős, P., and Rényi, A. On the evolution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences 5 (1960), 17-61.
[48] Fabrikant, A., Koutsoupias, E., and Papadimitriou, C. Heuristically optimized trade-offs: A new paradigm for power laws in the Internet. In Automata, Languages and Programming, vol. 2380 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2002.
[49] Faloutsos, M., Faloutsos, P., and Faloutsos, C. On power-law relationships of the Internet topology. In ACM SIGCOMM (1999), pp. 251-262.
[50] Feamster, N., Jung, J., and Balakrishnan, H. An empirical study of "bogon" route advertisements. ACM SIGCOMM Computer Communications Review (CCR) 35, 1 (2005), 63-70.
[51] Feldman, D., and Shavitt, Y. Automatic large scale generation of Internet PoP level maps. In IEEE GLOBECOM (2008), pp. 2426-2431.
[52] Feldmann, A., Greenberg, A., Lund, C., Reingold, N., and Rexford, J. NetScope: Traffic engineering for IP networks. IEEE Network Magazine (March/April 2000), 11-19.
[53] Feldmann, A., Maennel, O., Mao, Z. M., Berger, A., and Maggs, B. Locating Internet routing instabilities. In ACM SIGCOMM (2004).
[54] Fortz, B., and Thorup, M. Internet traffic engineering by optimizing OSPF weights. In IEEE INFOCOM (2000), pp. 519-528.
[55] Frazer, K. Merit's history, the NSFNET backbone project 1987-1995, Merit Network, Inc. http://www.livinginternet.com/doc/merit.edu/phenom.html.
[56] Gao, L. On inferring autonomous system relationships in the Internet. Global Telecommunications Internet Mini-Conference (2000).
[57] GÉANT. http://www.geant.net/Network/The_Network/Pages/Network-Topology.aspx.
[58] Gill, P., Arlitt, M., Li, Z., and Mahanti, A. The flattening Internet topology: Natural evolution, unsightly barnacles or contrived collapse? In Passive and Active Measurement Conference (PAM) (2008), pp. 1-10. http://www.springerlink.com/content/1255p8g3k6766242/fulltext.pdf.
[59] Gill, V. Analysis of design decisions in a 10G backbone. www.nanog.org/meetings/nanog34/presentations/gill.pdf.
[60] Govindan, R., and Reddy, A. An analysis of Internet inter-domain topology and route stability. In IEEE INFOCOM (1997), pp. 850-857.
[61] Griffin, T. G. Understanding the Border Gateway Protocol (BGP). ICNP Tutorial, http://www.cl.cam.ac.uk/~tgg22/talks/, 2002.
[62] Griffin, T. G. The stratified shortest-paths problem (invited paper). In International Conference on Communications Systems & Networks (COMSNETS) (January 2010).
[63] Griffin, T. G., and Huston, G. BGP wedgies. RFC 4264, 2005.
[64] Griffin, T. G., and Wilfong, G. An analysis of the MED oscillation problem in BGP. In IEEE International Conference on Network Protocols (ICNP) (2002).
[65] GT-ITM: Georgia Tech Internetwork Topology Models. http://www.cc.gatech.edu/projects/gtitm/.
[66] Modeling topology of large internetworks, 2000. http://www.cc.gatech.edu/projects/gtitm/.
[67] Gunes, M., and Sarac, K. Resolving IP aliases in building traceroute-based Internet maps. IEEE/ACM Transactions on Networking 17, 6 (2009), 1738-1751.
[68] Haddadi, H., Rio, M., Iannaccone, G., Moore, A., and Mortier, R. Network topologies: inference, modeling and generation. IEEE Communications Surveys 10, 2 (2008).
[69] He, Y., Siganos, G., Faloutsos, M., and Krishnamurthy, S. V. A systematic framework for unearthing the missing links: Measurements and impact. In USENIX Symposium on Networked Systems Design and Implementation (NSDI) (April 2007).
[70] He, Y., Siganos, G., Faloutsos, M., and Krishnamurthy, S. V. Lord of the links: A framework for discovering missing links in the Internet topology. IEEE/ACM Transactions on Networking 17, 2 (2009), 391-404.
[71] Heart, F., McKenzie, A., McQuillan, J., and Walden, D. ARPANET completion report. Tech. rep., Bolt, Beranek and Newman, Burlington, MA, 1978. http://www.cs.utexas.edu/users/chris/DIGITAL_ARCHIVE/ARPANET/DARPA4799.pdf.
[72] Huston, G. Peering and settlements - Part I. The Internet Protocol Journal 2, 1 (March 1999).
[73] Huston, G. Peering and settlements - Part II. The Internet Protocol Journal 2, 2 (June 1999).
[74] Huston, G. IPv4 address report, 2007. http://www.potaroo.net/tools/ipv4/index.html.
[75] Hyun, Y., Broido, A., and claffy, k. Traceroute and BGP AS path incongruities. Tech. rep., UCSD CAIDA, 2003. http://www.caida.org/publications/papers/2003/ASP/.
[76] Internet AS-level topology construction & analysis. http://topology.neclab.eu/.
[77] Internet2. http://www.internet2.edu/pubs/networkmap-connectors-participants.pdf.
[78] Jacobson, V. Traceroute. ftp://ftp.ee.lbl.gov/traceroute.tar.gz, 1989.
[79] Jamakovic, A., and Uhlig, S. On the relationships between topological metrics in real-world networks. Networks and Heterogeneous Media 3, 2 (June 2008), 345-359.
[80] Evolving backbone networks with an MPLS supercore. Juniper Networks white paper, 2013. http://www.juniper.net/us/en/local/pdf/whitepapers/2000392-en.pdf.
[81] Kardes, H., Oz, T., and Gunes, M. H. Cheleby: A subnet-level Internet topology mapping system. In International Conference on Communications Systems & Networks (COMSNETS) (2012).
[82] Keller, E. F. Revisiting "scale-free" networks. BioEssays 27, 1 (2005), 1060-1068.
[83] Knight, S., Falkner, N., Nguyen, H., Tune, P., and Roughan, M. I can see for miles: Re-visualizing the Internet. IEEE Network 26, 6 (2012), 26-32.
[84] Knight, S., Nguyen, H., Falkner, N., Bowden, R., and Roughan, M. The Internet topology zoo. IEEE Journal on Selected Areas in Communications (JSAC) 29, 9 (October 2011), 1765-1775.
[85] Krishnamurthy, B., and Willinger, W. What are our standards for validation of measurement-based networking research? ACM SIGMETRICS Performance Evaluation Review 36 (2008), 64-69.
[86] Labovitz, C., Ahuja, A., Bose, A., and Jahanian, F. Delayed Internet routing convergence. In ACM SIGCOMM (2000).
[87] Labovitz, C., Ahuja, A., and Jahanian, F. Experimental study of Internet stability and wide-area network failures. In International Symposium on Fault-Tolerant Computing (FTCS) (1999).
[88] Labovitz, C., Iekel-Johnson, S., McPherson, D., Oberheide, J., and Jahanian, F. Internet inter-domain traffic. In ACM SIGCOMM (2010), pp. 75-86.
[89] Labovitz, C., Malan, R., and Jahanian, F. Internet routing stability. In ACM SIGCOMM (1997).
[90] Labovitz, C., Malan, R., and Jahanian, F. Origins of Internet routing instability. In IEEE INFOCOM (1999).
[91] Lakhina, A., Byers, J., Crovella, M., and Xie, P. Sampling biases in IP topology measurements. In IEEE INFOCOM (April 2003).
[92] Li, L., Alderson, D., Doyle, J., and Willinger, W. Towards a theory of scale-free graphs: Definitions, properties, and implications. Internet Mathematics 2, 4 (2005), 431-523.
[93] Li, L., Alderson, D., Willinger, W., and Doyle, J. A first-principles approach to understanding the Internet's router-level topology. In ACM SIGCOMM (2004), pp. 3-14.
[94] Liljenstam, M., Liu, J., and Nicol, D. Development of an Internet backbone topology for large-scale network simulations. In Winter Simulation Conference (2003).
[95] Madhyastha, H. V., Isdal, T., Piatek, M., Dixon, C., Anderson, T., Krishnamurthy, A., and Venkataramani, A. iPlane: An information plane for distributed services. In USENIX Symposium on Operating Systems Design and Implementation (OSDI) (November 2006).
[96] Mahadevan, P., Krioukov, D., Fomenkov, M., Dimitropoulos, X., claffy, k. c., and Vahdat, A. The Internet AS-level topology: three data sources and one definitive metric. ACM SIGCOMM Computer Communications Review (CCR) 36 (January 2006), 17-26.
[97] Mao, Z., Govindan, R., Varghese, G., and Katz, R. Route flap damping exacerbates Internet routing convergence. In ACM SIGCOMM (2002).
[98] Mao, Z., Qiu, L., Wang, J., and Zhang, Y. On AS-level path inference. In ACM SIGMETRICS (2005).
[99] Mao, Z. M., Bush, R., Griffin, T. G., and Roughan, M. BGP beacons. In ACM SIGCOMM Internet Measurement Conference (IMC) (October 2003).
[100] Mao, Z. M., Rexford, J., Wang, J., and Katz, R. Towards an accurate AS-level traceroute tool. In ACM SIGCOMM (August 2003).
[101] Marchetta, P., de Donato, W., and Pescapé, A. Detecting third-party addresses in traceroute traces with IP timestamp option. In Passive and Active Measurement Conference (PAM) (2013).
[102] Marchetta, P., Mérindol, P., Donnet, B., Pescapé, A., and Pansiot, J. Quantifying and mitigating IGMP filtering in topology discovery. In IEEE GLOBECOM (2012), pp. 1871-1876.
[103] Medina, A., Lakhina, A., Matta, I., and Byers, J. BRITE: an approach to universal topology generation. In Ninth International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (2001), pp. 346-353.
[104] Mérindol, P., Donnet, B., Bonaventure, O., and Pansiot, J.-J. On the impact of layer-2 on node degree distribution. In ACM SIGCOMM Internet Measurement Conference (IMC) (2010), pp. 179-191.
[105] Mérindol, P., Van den Schrieck, V., Donnet, B., Bonaventure, O., and Pansiot, J.-J. Quantifying ASes multiconnectivity using multicast information. In ACM SIGCOMM Internet Measurement Conference (IMC) (2009), pp. 370-376.
[106] Mitzenmacher, M. A brief history of generative models for power law and lognormal distributions. Internet Mathematics 1 (2001), 226-251.
[107] Mitzenmacher, M. Editorial: The future of power law research. Internet Mathematics 2, 4 (2006), 525-534.
[108] Morris, M. Network design templates. www.networkworld.com/community/blog/
http://www.ripe.net/ripe/docs/
[117] Oliveira, R., Pei, D., Willinger, W., Zhang, B., and Zhang, L. The (in)completeness of the observed Internet AS-level structure. IEEE/ACM Transactions on Networking 18, 1 (2010), 109-122.
[118] University of Oregon Route Views Archive Project. www.routeviews.org.
[119] Pansiot, J.-J., and Grad, D. On routes and multicast trees in the Internet. ACM SIGCOMM Computer Communications Review (CCR) 28, 1 (1998), 41-50.
[120] Pansiot, J.-J., Mérindol, P., Donnet, B., and Bonaventure, O. Extracting intra-domain topology from mrinfo probing. In Passive and Active Measurement Conference (PAM) (2010), pp. 81-90.
[121] Parsonage, E., Nguyen, H. X., Bowden, R., Knight, S., Falkner, N. J., and Roughan, M. Generalized graph products for network design and analysis. In IEEE International Conference on Network Protocols (ICNP) (October 2011).
[122] Pastor-Satorras, R., and Vespignani, A. Epidemic spreading in scale-free networks. Physical Review Letters 86, 14 (2001), 3200-3203.
[123] Pastor-Satorras, R., and Vespignani, A. Evolution and Structure of the Internet: A Statistical Physics Approach. Cambridge University Press, 2004.
[124] Pelsser, C., Maennel, O., Mohapatra, P., Bush, R., and Patel, K. Route flap damping made useful. In Passive and Active Measurement Conference (PAM) (2011).
[125] Poese, I., Uhlig, S., Kaafar, M. A., Donnet, B., and Gueye, B. IP geolocation databases: unreliable? ACM SIGCOMM Computer Communications Review (CCR) 41, 2 (April 2011), 53-56.
[126] Qiu, S. Y., McDaniel, P. D., and Monrose, F. Toward valley-free inter-domain routing. In IEEE International Conference on Communications (2007).
[127] Quoitin, B., Van den Schrieck, V., François, P., and Bonaventure, O. IGen: generation of router-level Internet topologies through network design heuristics. In International Teletraffic Congress (2009), pp. 1-8.
[128] Ramachandran, A., and Feamster, N. Understanding the network-level behavior of spammers. In ACM SIGCOMM (2006), pp. 291-302.
[129] Rasti, A. H., Magharei, N., Rejaie, R., and Willinger, W. Eyeball ASes: From geography to connectivity. In ACM SIGCOMM Internet Measurement Conference (IMC) (2010).
[130] Rekhter, Y., and Li, T. A Border Gateway Protocol 4 (BGP-4). RFC 4271, January 2006.
[131] RIPE NCC: Routing Information Service. http://www.ripe.net/projects/ris/.
[132] Roughan, M. Fundamental bounds on the accuracy of network performance measurements. In ACM SIGMETRICS (June 2005), pp. 253-264.
[133] Roughan, M., Griffin, T., Mao, M., Greenberg, A., and Freeman, B. IP forwarding anomalies and improving their detection using multiple data sources. In ACM SIGCOMM Workshop on Network Troubleshooting (September 2004), pp. 307-312.
[134] Roughan, M., Tuke, J., and Maennel, O. Bigfoot, Sasquatch, the Yeti and other missing links: what we don't know about the AS graph. In ACM SIGCOMM Internet Measurement Conference (IMC) (October 2008).
[135] Roughan, M., Willinger, W., Maennel, O., Perouli, D., and Bush, R. 10 lessons from 10 years of measuring and modeling the Internet's autonomous systems. IEEE Journal on Selected Areas in Communications (JSAC) 29 (2011), 1810-1821.
[136] Sanchez, M., Otto, J., Bischof, Z., Choffnes, D., Bustamante, F., Krishnamurthy, B., and Willinger, W. Dasu: Pushing experiments to the Internet's edge. In 10th USENIX NSDI (2013).
[137] Shaikh, A., and Greenberg, A. Experience in black-box OSPF measurement. In ACM SIGCOMM Internet Measurement Conference (IMC) (2001), pp. 113-125.
[138] Shavitt, Y., and Zilberman, N. A structural approach for PoP geo-location. In IEEE INFOCOM (2010).
[139] Shavitt, Y., and Zilberman, N. Geographical Internet PoP-level maps. In 4th International Conference on Traffic Monitoring and Analysis (TMA'12) (2012), pp. 121-124.
[140] Sherry, J., Katz-Bassett, E., Pimenova, M., Madhyastha, H., Anderson, T., and Krishnamurthy, A. Resolving IP aliases with prespecified timestamps. In ACM SIGCOMM Internet Measurement Conference (IMC) (2010).
[141] Sherwood, R., Bender, A., and Spring, N. DisCarte: a disjunctive Internet cartographer. In ACM SIGCOMM (August 2008).
[142] Sommers, J., Eriksson, B., and Barford, P. On the prevalence and characteristics of MPLS deployments in the open Internet. In ACM SIGCOMM Internet Measurement Conference (IMC) (2011).
[143] Spring, N., Mahajan, R., and Anderson, T. Quantifying the causes of path inflation. In ACM SIGCOMM (2003).
[144] Spring, N., Mahajan, R., and Wetherall, D. Measuring ISP topologies with Rocketfuel. In ACM SIGCOMM (August 2002).
[145] Strogatz, S. Romanesque networks. Nature 433 (2005).
[146] Subramanian, L., Agarwal, S., Rexford, J., and Katz, R. Characterizing the Internet hierarchy from multiple vantage points. In IEEE INFOCOM (2002), vol. 2, pp. 618-627.
[147] The Team Cymru bogon reference page. http://www.cymru.com/Bogons/.
[148] Tozal, M. E., and Sarac, K. TraceNET: an Internet topology data collector. In ACM SIGCOMM Internet Measurement Conference (IMC) (2010), pp. 356-368.
[149] Tozal, M. E., and Sarac, K. Estimating network layer subnet characteristics via statistical sampling. In 11th International IFIP TC 6 Conference on Networking, Part I (2012), pp. 274-288.
[150] Tune, P., and Roughan, M. Internet Traffic Matrices: A Primer. In SIGCOMM eBook on Recent Advances in Networking, vol. 1. ACM, 2013.
[151] Varadhan, K., Govindan, R., and Estrin, D. Persistent route oscillations in inter-domain routing. Tech. rep. 96-631, USC/ISI, 1996.
[152] Varadhan, K., Govindan, R., and Estrin, D. Persistent route oscillations in inter-domain routing. Computer Networks (March 2000).
[153] Wang, F., and Gao, L. On inferring and characterizing Internet routing policies. In ACM SIGCOMM Internet Measurement Conference (IMC) (October 2003).
[154] Wang, Y., Keller, E., Biskeborn, B., van der Merwe, J., and Rexford, J. Virtual routers on the move: live router migration as a network-management primitive. In ACM SIGCOMM (2008), pp. 231-242.
[155] Waxman, B. M. Routing of multipoint connections. IEEE Journal on Selected Areas in Communications (JSAC) 6, 9 (December 1988), 1617-1622.
[156] Willinger, W., Alderson, D., and Doyle, J. Mathematics and the Internet: A source of enormous confusion and great potential. Notices of the AMS 56, 5 (2009), 586-599. http://www.ams.org/notices/200905/rtx090500586p.pdf.
[157] Willinger, W., Alderson, D., Doyle, J. C., and Li, L. More "normal" than normal: scaling distributions and complex systems. In 36th Winter Simulation Conference (2004), pp. 130-141.
[158] Winter, R. Modeling the Internet routing topology with a known degree of accuracy in less than 24h. In ACM/IEEE/SCS Workshop on Principles of Advanced and Distributed Simulation (PADS) (2009).
[159] Xia, J., and Gao, L. On the evaluation of AS relationship inferences. In IEEE GLOBECOM (2004).
[160] Xu, K., Duan, Z., Zhang, Z.-L., and Chandrashekar, J. On properties of Internet exchange points and their impact on AS topology and relationship. LNCS 3042 (2004), 284-295. http://www.springerlink.com/content/umu88q1eewk0e8yy/fulltext.pdf.
[161] Yook, S.-H., Jeong, H., and Barabási, A.-L. Modeling the Internet's large-scale topology. Proceedings of the National Academy of Sciences of the USA (PNAS) 99 (2002), 13382-13386.
[162] Yoshida, K., Kikuchi, Y., Yamamoto, M., Fujii, Y., Nagami, K., Nakagawa, I., and Esaki, H. Inferring PoP-level ISP topology through end-to-end delay measurement. In Passive and Active Measurement Conference (PAM) (2009), pp. 35-44.
[163] Yoshinobu, M. What makes our policy messy, 2010. http://www.attn.jp/maz/p/c/bgpworkshop200904/bgpworkshop-policy.pdf.
[164] Zegura, E., Calvert, K., and Bhattacharjee, S. How to model an internetwork. In IEEE INFOCOM (1996), vol. 2, pp. 594-602.
[165] Zegura, E. W., Calvert, K. L., and Donahoo, M. J. A quantitative comparison of graph-based models for Internet topology. IEEE/ACM Transactions on Networking 5 (December 1997), 770-783.
[166] Zhang, B., Liu, R., Massey, D., and Zhang, L. Collecting the Internet AS-level topology. ACM SIGCOMM Computer Communications Review (CCR) (January 2005).
[167] Zhao, X., Pei, D., Wang, L., Massey, D., Mankin, A., Wu, S. F., and Zhang, L. An analysis of BGP multiple origin AS (MOAS) conflicts. In ACM SIGCOMM Internet Measurement Workshop (IMW) (2001), pp. 31-35.
Abstract

Transport protocols play a critical role in today's Internet. This chapter first looks at the recent evolution of the reliable transport protocols. It then explains the growing impact of middleboxes on the evolvability of these protocols. Two recent protocol extensions, Multipath TCP and Minion, which were both designed to extend the current Transport Layer in the Internet, are then described.
1 Introduction
The first computer networks often used ad-hoc and proprietary protocols to interconnect different hosts. During the 1970s and 1980s, the architecture of many of these networks evolved towards a layered architecture. The two most popular ones are the seven-layer OSI reference model [119] and the five-layer Internet architecture [27]. In these architectures, the transport layer plays a key role. It enables applications to reliably exchange data. A transport protocol can be characterized by the service that it provides to the upper layer (usually the application). Several transport services have been defined:

- a connectionless service
- a connection-oriented bytestream service
- a connection-oriented message-oriented service
- a message-oriented request-response service
- an unreliable delivery service for multimedia applications

The connectionless service is the simplest service that can be provided by a transport layer protocol. The User Datagram Protocol (UDP) [87] is an example of a protocol that provides this service. Over the years, the connection-oriented bytestream service has proven to be the transport layer service used by most applications. This service is currently provided by the Transmission Control Protocol (TCP) [89] in the Internet. TCP is the dominant transport protocol in today's Internet, but other protocols have provided similar services [60]. Several transport protocols have been designed to support multimedia applications. The Real-Time Transport Protocol (RTP) [101] provides many features required by multimedia applications. Some of the functions provided by RTP are part of the transport layer while others correspond to the presentation layer of the OSI reference model. The Datagram Congestion Control Protocol (DCCP) [67] is another protocol that provides functions suitable for applications that do not require a fully reliable service.
Parts of the text in this chapter have appeared in the following publications by the same author(s): [17], [117], [94], [78] and [61].
C. Raiciu, J. Iyengar, O. Bonaventure, Recent Advances in Reliable Transport Protocols, in H. Haddadi, O. Bonaventure (Eds.), Recent Advances in Networking, (2013), pp. 59-106. Licensed under a CC-BY-SA Creative Commons license.
The rest of this chapter is organized as follows. We first describe the main services provided by the transport layer in section 2. This will enable us to look back at the evolution of the reliable Internet transport protocols during the past decades. Section 3 then describes the organization of today's Internet, the important role played by various types of middleboxes, and the constraints that these middleboxes impose on the evolution of transport protocols. Finally, we describe the design of two recent TCP extensions, both of which evolve the transport layer of the Internet while remaining backward compatible with middleboxes. Multipath TCP, described in section 4, enables transmission of data segments within a transport connection over multiple network paths. Minion, described in section 5, extends TCP and SSL/TLS [35] to provide richer services to the application (unordered message delivery and multi-streaming) without changing the protocols' wire format.
2.2 The connection-oriented service
The connection-oriented service is both more complex and more frequently used. TCP and SCTP are examples of current Internet protocols that provide this service. Older protocols like TP4 or XTP [108] also provide a connection-oriented service. The connection-oriented service can be divided into three phases:

- the establishment of the connection
- the data transfer
- the release of the connection
2.2.1 Connection establishment
The first objective of the transport layer is to multiplex connections initiated by different applications. This requires the ability to unambiguously identify different connections on the same host. TCP uses four fields that are present in the IP and TCP headers to uniquely identify a connection:

- the source IP address
- the destination IP address
- the source port
- the destination port

The source and destination addresses are the network layer addresses (e.g., IPv4 or IPv6 in the case of TCP) that have been allocated to the communicating hosts. When a connection is established by a client, the destination port is usually a well-known port number that is bound to the server application. On the other hand, the source port is often chosen randomly by the client [69]. This random selection of the source port by the client has some security implications, as discussed in [5]. Since a TCP connection is identified unambiguously by this four-tuple, a client can establish multiple connections to the same server by using different source ports on each of these connections.

The classical way of establishing a connection between two transport entities is the three-way handshake, which is used by TCP [89]. This three-way handshake was mainly designed to deal with host crashes. It assumes that the underlying network is able to guarantee that a packet will never remain inside the network for longer than the Maximum Segment Lifetime (MSL)1. Furthermore, a host should not immediately reuse the same port number for subsequent connections to the same host. The TCP header contains flags that specify the role of each segment. For example, the ACK flag indicates that the segment contains a valid acknowledgment number, while the SYN flag is used during the three-way handshake. To establish a connection, the original TCP specification [89] required the client to send a TCP segment with the SYN flag set, including an initial sequence number extracted from a clock. According to [89], this clock had to be implemented by using a 32-bit counter incremented at least once every 4 microseconds and after each TCP connection establishment attempt. While this solution was sufficient to deal with random host crashes, it was not acceptable from a security viewpoint [48]. When a clock is used to generate the initial sequence number for each TCP connection, an attacker that wishes to inject segments inside an established connection could easily guess the sequence number to be used. To solve this problem, modern TCP implementations generate a random initial sequence number [48].

An example of the TCP three-way handshake is presented in figure 1. The client sends a segment with the SYN flag set (also called a SYN segment). The server replies with a segment that has both the SYN and the ACK flags set (a SYN+ACK segment). This SYN+ACK segment contains a random sequence number chosen by the server, and its acknowledgment number is set to the value of the initial sequence number chosen by the client incremented by one2. The client replies to the SYN+ACK segment with an ACK segment that acknowledges the received SYN+ACK segment. This concludes the three-way handshake and the TCP connection is established. The duration of the three-way handshake is important for applications that exchange small amounts of data, such as requests for small web objects.
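The following sketch (ours, not part of any specification) mimics the handshake of figure 1, tracking only the sequence numbers, acknowledgment numbers and the SYN/ACK flags:

```python
# A minimal sketch of the TCP three-way handshake. Initial sequence
# numbers (ISNs) are chosen randomly, as modern stacks do [48].
import random

def rand_isn():
    return random.getrandbits(32)

# Client -> server: SYN carrying the client's ISN.
client_isn = rand_isn()
syn = {"flags": {"SYN"}, "seq": client_isn}

# Server -> client: SYN+ACK; the ACK covers the SYN, which consumes
# one sequence number.
server_isn = rand_isn()
syn_ack = {"flags": {"SYN", "ACK"}, "seq": server_isn,
           "ack": (syn["seq"] + 1) % 2**32}

# Client -> server: ACK completing the handshake.
ack = {"flags": {"ACK"}, "seq": (client_isn + 1) % 2**32,
       "ack": (syn_ack["seq"] + 1) % 2**32}

assert syn_ack["ack"] == (client_isn + 1) % 2**32
assert ack["ack"] == (server_isn + 1) % 2**32
print("handshake complete")
```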
¹ The default MSL duration is 2 minutes [89].
² TCP's acknowledgment number always contains the next expected sequence number, and the SYN flag consumes one sequence number.
!"#$%&'()*+,$-./012-
!"#3456$789(!"#$$%&'(:;<=$-./012456$%&'(!"#$$789($%&'(
Figure 1: TCP three-way handshake

When a client sends a SYN segment to a server, it can only rely on the initial value of its retransmission timer to recover from losses³. Most TCP/IP stacks have used an initial retransmission timer set to 3 seconds [18]. This conservative value was chosen in the 1980s and confirmed in the early 2000s [81]. However, this default implies that with many TCP/IP stacks, the loss of any of the first two segments of a three-way handshake will cause a delay of 3 seconds on a connection that may normally be shorter than that. Measurements conducted on large web farms showed that this initial timer had a severe impact on the performance perceived by the end users [24]. This convinced the IETF to decrease the recommended initial value for the retransmission timer to 1 second [82]. Some researchers have proposed to decrease the value of the retransmission timer even further, notably in datacenter environments [111].

Another role of the three-way handshake is to negotiate options that are applicable to this connection. TCP was designed to be extensible. Although it does not carry a version field, in contrast with IP for example, TCP supports the utilization of options to both negotiate parameters and extend the protocol. TCP options in the SYN segment allow the negotiation of a particular TCP extension. To enable a particular extension, the client places the corresponding option inside the SYN segment. If the server replies with a similar option in the SYN+ACK segment, the extension is enabled. Otherwise, the extension is disabled on this particular connection. This is illustrated in figure 1. Each TCP option is encoded by using a Type-Length-Value (TLV) format, which enables a receiver to silently discard the options that it does not understand. Unfortunately, there is a limit on the number of TCP options that can be placed inside the TCP header. This limit comes from the Data Offset field of the TCP header, which indicates the position of the first byte of the payload measured as an integer number of four-byte words starting from the beginning of the TCP header. Since this field is encoded in four bits, the TCP header cannot be longer than 60 bytes, including all options. This size was considered to be large enough by the designers of the TCP protocol, but is becoming a severe limitation to the extensibility of TCP.

A last point to note about the three-way handshake is that the first TCP implementations created state upon reception of a SYN segment. Many of these implementations also used a small queue to store the TCP connections that had received a SYN segment but not yet the third ACK.
³ If the client has sent packets earlier to the same server, it might have stored some information from the previous connection [109, 11] and use this information to bootstrap its initial timer. Recent Linux TCP/IP stacks preserve some state variables between connections.
For a normal TCP connection, the delay between the reception of a SYN segment and the reception of the third ACK is equivalent to a round-trip time, usually much less than a second. For this reason, most early TCP/IP implementations chose a small fixed size for this queue. Once the queue was full, these implementations dropped all incoming SYN segments. This fixed-size queue was exploited by attackers to mount denial of service attacks. They sent a stream of spoofed SYN segments⁴ to a server. Once the queue was full, the server stopped accepting SYN segments from legitimate clients [37]. To solve this problem, recent TCP/IP stacks try to avoid maintaining state upon reception of a SYN segment. This solution is often called SYN cookies. The principle behind SYN cookies is simple. To accept a TCP connection without maintaining state upon reception of the SYN segment, the server must be able to check the validity of the third ACK by using only the information stored inside this ACK. A simple way to do this is to compute the initial sequence number used by the server from a hash that includes the source and destination addresses and ports and some random secret known only by the server. The low-order bits of this hash are then sent as the initial sequence number of the returned SYN+ACK segment. When the third ACK comes back, the server can check the validity of the acknowledgment number by recomputing its initial sequence number by using the same hash [37]. Recent TCP/IP stacks use more complex techniques, notably to deal with the options that are placed inside the SYN and need to be recovered from the information contained in the third ACK, which usually does not contain any options.

At this stage, it is interesting to look at the connection establishment scheme used by the SCTP protocol [105]. SCTP was designed more than two decades after TCP and thus has benefited from several of the lessons learned from the experience with TCP. A first difference between TCP and SCTP lies in the segments that these protocols use. The SCTP header format is both simpler and more extensible than the TCP header. The first four fields of the SCTP header (Source and Destination ports, Verification tag and Checksum) are present in all SCTP segments. The source and destination ports play the same role as in TCP. The verification tag is a random number chosen when the SCTP connection is created and placed in all subsequent segments. This verification tag is used to prevent some forms of packet spoofing attacks [105]. This is an improvement compared to TCP, where the validation of a received segment must be performed by checking the sequence numbers, acknowledgment numbers and other fields of the header [47]. The SCTP checksum is a 32-bit CRC that provides stronger error detection properties than the Internet checksum used by TCP [107]. Each SCTP segment can contain a variable number of chunks, and there is no a priori limit on the number of chunks that appear inside a segment, except that a segment should not be longer than the maximum packet length of the underlying network layer. The SCTP connection establishment uses several of these chunks to specify the values of some parameters that are exchanged. A detailed discussion of all these chunks is outside the scope of this document and may be found in [105]. The SCTP four-way handshake uses four segments, as shown in figure 2. The first segment contains the INIT chunk. To establish an SCTP connection with a server, the client first creates some local state for this connection.
The most important parameter of the INIT chunk is the Initiation tag. This value is a random number that is used to identify the connection on the client host for its entire lifetime. This Initiation tag is placed as the Verification tag in all segments sent by the server. This is an important change compared to TCP, where only the source and destination ports are used to identify a given connection. The INIT chunk may also contain the other addresses owned by the client. The server responds by sending an INIT-ACK chunk. This chunk also contains an Initiation tag chosen by the server and a copy of the Initiation tag chosen by the client. The INIT and INIT-ACK chunks also contain an initial sequence number. A key difference between TCP's three-way handshake and SCTP's four-way handshake is that an SCTP server does not create any state when receiving an INIT chunk.

⁴ An IP packet is said to be spoofed if it contains a source address which is different from the IP address of the sending host. Several techniques can be used by network operators to prevent such attacks [41], but measurements show that they are not always deployed [14].
!"!#$%!"#$%&'()*
Figure 2: The four-way handshake used by SCTP

For this, the server places inside the INIT-ACK reply a State cookie chunk. This State cookie is an opaque block of data that contains information from the INIT and INIT-ACK chunks that the server would otherwise have had to store locally, some lifetime information and a signature. The format of the State cookie is flexible, and the server could in theory place almost any information inside this chunk. The only requirement is that the State cookie must be echoed back by the client to confirm the establishment of the connection. Upon reception of the COOKIE-ECHO chunk, the server verifies the signature of the State cookie. The client may provide some user data and an initial sequence number inside the COOKIE-ECHO chunk. The server then responds with a COOKIE-ACK chunk that acknowledges the COOKIE-ECHO chunk. The SCTP connection between the client and the server is now established. This four-way handshake is both more secure and more flexible than the three-way handshake used by TCP.
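Both TCP's SYN cookies and SCTP's State cookie rest on the same primitive: instead of storing per-connection state, the server derives a verifiable value from the connection parameters and a local secret, and checks it when the client echoes the value back. The following minimal sketch illustrates the idea for the SYN-cookie case; the choice of HMAC-SHA256 and the exact fields hashed are illustrative assumptions, not the algorithm of any particular stack.

import hashlib
import hmac
import os

SECRET = os.urandom(16)  # random secret known only to the server

def server_isn(src_ip, dst_ip, sport, dport):
    # Derive the server's initial sequence number from the 4-tuple and the
    # secret, so that no state needs to be kept after the SYN is received.
    msg = f"{src_ip}|{dst_ip}|{sport}|{dport}".encode()
    digest = hmac.new(SECRET, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")  # low-order 32 bits

def third_ack_is_valid(src_ip, dst_ip, sport, dport, ack_number):
    # The third ACK must acknowledge ISN+1; recompute instead of remembering.
    expected = (server_isn(src_ip, dst_ip, sport, dport) + 1) % 2**32
    return ack_number == expected

Real SYN-cookie implementations additionally fold a coarse timestamp and an encoding of the negotiated MSS into the sequence number, precisely because the options carried in the SYN cannot otherwise be recovered from the third ACK.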
2.2.2 Data transfer

Before looking at the techniques that are used by transport protocols to transfer data, it is useful to look at their service models. TCP has the simplest service model. Once a TCP connection has been established, two bytestreams are available. The first bytestream allows the client to send data to the server and the second bytestream provides data transfer in the opposite direction. TCP guarantees the reliable delivery of the data during the lifetime of the TCP connection, provided that it is gracefully released. SCTP provides a slightly different service model [79]. Once an SCTP connection has been established, the communicating hosts can access two or more message streams. A message stream is a stream of variable-length messages. Each message is composed of an integer number of bytes. The connection-oriented service provided by SCTP preserves the message boundaries. This implies that if an application sends a message of N bytes, the receiving application will also receive it as a single message of N bytes. This is in contrast with TCP, which only supports a bytestream. Furthermore, SCTP allows the applications to use multiple streams to exchange data. The number of streams that are supported on a given connection is negotiated during connection establishment. When multiple streams have been negotiated, each application can send data over any of these streams and SCTP will deliver the data from the different streams independently, without any head-of-line blocking. While most usages of SCTP may assume an in-order delivery of the data, SCTP supports unordered
delivery of messages at the receiver. Another extension to SCTP [106] supports partially-reliable delivery. With this extension, an SCTP sender can be instructed to expire data based on one of several events, such as a timeout; the sender can then signal the SCTP receiver to move on without waiting for the expired data. This partially-reliable service could be useful to provide timed delivery, for example. With this service, there is an upper limit on the time required to deliver a message to the receiver. If the transport layer cannot deliver the data within the specified delay, the data is discarded by the sender without causing any stall in the stream.

To provide a reliable delivery of the data, transport protocols rely on various mechanisms that have been well studied and discussed in the literature: sequence numbers, acknowledgments, windows, checksums and retransmission techniques. A detailed explanation of these techniques may be found in standard textbooks [16, 99, 40]. We assume that the reader is familiar with them and discuss only some recent changes.

TCP tries to pack as much data as possible inside each segment [89]. Recent TCP stacks combine this technique with Path MTU Discovery to detect the MTU to be used over a given path [73]. SCTP uses a more complex but also more flexible strategy to build its segments. It also relies on Path MTU Discovery to detect the MTU on each path. SCTP then places various chunks inside each segment. The control chunks, which are required for the correct operation of the protocol, are placed first. Data chunks are then added. SCTP can split a message into several chunks before transmission and also supports the bundling of different data chunks inside the same segment.

Acknowledgments allow the receiver to inform the sender of the correct reception of data. TCP initially relied exclusively on cumulative acknowledgments. Each TCP segment contains an acknowledgment number that indicates the next sequence number that is expected by the receiver. Selective acknowledgments were added later as an extension to TCP [74]. A selective acknowledgment can be sent by a receiver when there are gaps in the received data. A selective acknowledgment is simply a sequence of pairs of sequence numbers, each pair indicating the beginning and the end of a received block of data. SCTP also supports cumulative and selective acknowledgments. Selective acknowledgments are an integral part of SCTP and not an extension negotiated at the beginning of the connection. In SCTP, selective acknowledgments are encoded as a control chunk that may be placed inside any segment. In TCP, selective acknowledgments are encoded as TCP options. Unfortunately, given the utilization of the TCP options (notably the timestamp option [63]) and the limited space for options inside a TCP segment, a TCP segment cannot report more than three blocks of data. This adds some complexity to the handling and utilization of selective acknowledgments by TCP.

Current TCP and SCTP stacks try to detect segment losses as quickly as possible. For this, they implement various heuristics that allow a segment to be retransmitted once several duplicate acknowledgments have been received [40]. Selective acknowledgments also help to improve the retransmission heuristics. If these heuristics fail, both protocols rely on a retransmission timer whose value is computed as a function of the round-trip time measured over the connection [82].
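To make the selective acknowledgment encoding concrete, the sketch below lays out a TCP SACK option in the RFC 2018 format (kind 5, then one 32-bit left edge and one 32-bit right edge per block). With the 10-byte timestamp option also present, at most three blocks fit in the 40 bytes of TCP option space, which is the limitation mentioned above. The example values are illustrative.

import struct

SACK_KIND = 5

def build_sack_option(blocks):
    # blocks: list of (left_edge, right_edge) pairs; the left edge is the
    # first sequence number of a received block, the right edge the sequence
    # number following its last byte.
    blocks = blocks[:3]                # option space limits us to 3 blocks
    length = 2 + 8 * len(blocks)       # kind + length + 8 bytes per block
    edges = b"".join(struct.pack("!II", l, r) for l, r in blocks)
    return struct.pack("!BB", SACK_KIND, length) + edges

# A receiver whose cumulative acknowledgment is 1000 and that also holds the
# out-of-order bytes 2000-2999 would report the gap as follows:
option = build_sack_option([(2000, 3000)])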
Last but not least, the transport protocols used on the Internet perform congestion control. The original TCP congestion control scheme was proposed in [62]. Since then, it has evolved and various congestion control schemes have been proposed. Although the IETF recommends a single congestion control scheme [8], recent TCP stacks support different congestion control schemes and some allow the user to select the preferred one. A detailed discussion of the TCP congestion control schemes may be found in [3]. SCTP's congestion control scheme is largely similar to TCP's. Additional details about recent advances in SCTP may be found in [21]. [36] lists recent IETF documents that are relevant for TCP. [40] contains a detailed explanation of some of the recent changes to the TCP/IP protocol stack.
Figure 3: The four-way handshake used to close a TCP connection

2.2.3 Connection release
This phase occurs when either both hosts have exchanged all the required data or one host needs to stop the connection for any reason (application request, lack of resources, ...). TCP supports two mechanisms to release a connection. The main one is the four-way handshake. This handshake uses the FIN flag in the TCP header. Each host can release its own direction of data transfer. When the application wishes to gracefully close a connection, it requests the TCP entity to send a FIN segment. This segment marks the end of the data transfer in the outgoing direction, and the sequence number that corresponds to the FIN flag (which consumes one sequence number) is the last one to be sent over this connection. The outgoing stream is closed as soon as the sequence number corresponding to the FIN flag is acknowledged. The remote TCP entity can use the same technique to close the other direction [89]. This graceful connection release has one advantage and one drawback. On the positive side, TCP provides a reliable delivery of all the data, provided that the connection is gracefully closed. On the negative side, the utilization of the graceful release forces the TCP entity that sent the last segment on a given connection to maintain state for some time. On busy servers, a large number of connections can remain in this state for a long time [39]. To avoid maintaining such state after a connection has been closed, web servers and some browsers send a RST segment to abruptly close TCP connections. In this case, the underlying TCP connection is closed once all the data has been transferred. This is faster, but there is no guarantee about the reliable delivery of the data.

SCTP uses a different approach to terminate connections. When an application requests a shutdown of a connection, SCTP performs a three-way handshake. This handshake uses the SHUTDOWN, SHUTDOWN-ACK and SHUTDOWN-COMPLETE chunks. The SHUTDOWN chunk is sent once all outgoing data has been acknowledged. It contains the last cumulative sequence number. Upon reception of a SHUTDOWN chunk, an SCTP entity informs its application that it cannot accept any more data over this connection. It then ensures that all outstanding data has been delivered correctly. At that point, it sends a SHUTDOWN-ACK to confirm the reception of the SHUTDOWN segment. The three-way handshake completes with the transmission of the SHUTDOWN-COMPLETE chunk [105]. SCTP also provides the equivalent to TCP's RST segment. The ABORT chunk can be used to refuse a connection, react to the reception of an invalid segment or immediately close a connection (e.g. due to lack of resources).
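The per-direction release described above maps directly onto the sockets API: shutdown() sends a FIN and closes only the outgoing bytestream, while the incoming one remains readable. A minimal sketch (the peer address is a placeholder):

import socket

HOST, PORT = "192.0.2.1", 7  # hypothetical peer

with socket.create_connection((HOST, PORT)) as s:
    s.sendall(b"last piece of data")
    s.shutdown(socket.SHUT_WR)   # sends a FIN: our outgoing stream is closed
    while True:                  # the other direction stays open for reading
        data = s.recv(4096)
        if not data:             # empty read: the peer's FIN has arrived
            break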
!"#$%&'()
!"#$%&'(*+,-) !"#$%&'(*,&./01$1!
2.3 The request-response service
The request-response service has been a popular service since the 1980s. At that time, many request-response applications were built above the connectionless service, typically UDP [15]. A request-response application is very simple. The client sends a request to a server and blocks waiting for the response. The server processes the request and returns a response to the client. This paradigm is often called Remote Procedure Call (RPC), since often the client calls a procedure running on the server.

The first implementations of RPC relied almost exclusively on UDP to transmit the requests and responses. In this case, the size of the requests and responses was often restricted to one MTU. In the 1980s and the beginning of the 1990s, UDP was a suitable protocol to transport RPCs because they were mainly used in Ethernet LANs, where CSMA/CD regulated the access to the medium and there were almost no losses. Few users were considering the utilization of RPC over the WAN. Over the years, the introduction of Ethernet switches has allowed Ethernet networks to grow in size, but has also implied a growing number of packet losses. Unfortunately, RPC running over UDP does not deal efficiently with packet losses, because many implementations use large timeouts to recover from packet losses. TCP could deal with losses, but it was considered to be too costly for request-response applications. Before sending a request, the client must first initiate the connection. This requires a three-way handshake and thus wastes one round-trip time. Then, TCP can transfer the request and receive the response over the established connection. Eventually, it performs a graceful shutdown of the connection. This connection release requires the exchange of four (small) segments, but also forces the client to remain in the TIME WAIT state for a duration of 240 seconds, which limits the number of connections (and thus RPCs) that it can establish with a given server.

TCP for Transactions, or T/TCP [19], was a first attempt to enable TCP to better support request/response applications. T/TCP solved the above problem by using three TCP options. These options were mainly used to allow each host to maintain an additional state variable, the Connection Count (CC), which is incremented by one for every connection. This state variable is sent in the SYN segment and cached by the server. If a SYN received from a client contains a CC that is larger than the cached one, the new connection is immediately established and data can be exchanged directly (already in the SYN). Otherwise, a normal three-way handshake is used. The use of this state variable allowed T/TCP to reduce the duration of the TIME WAIT state. T/TCP used SYN and FIN flags in the segment sent by the client and returned by the server, which led to a two-segment connection, the best solution from a delay viewpoint for RPC applications.
!"#$%$&'($)**+,-$.-/$ !"#%012$%$&'($)**+,-3456$
Figure 5: TCP fast open

Unfortunately, T/TCP was vulnerable to spoofing attacks [32]. An attacker could observe the Connection Count by capturing packets. Since the server only checked that the value of the CC state variable contained in a SYN segment was higher than the cached one, it was easy to inject new segments. Due to this security problem, T/TCP is now deprecated.

Improving the performance of TCP for request/response applications has continued to be an objective for many researchers. However, recently the focus of the optimizations moved from the LANs that were typical for RPC applications to the global Internet. The motivation for several of the recent changes to the TCP protocol was the perceived performance of TCP with web search applications [24]. A typical web search is also a very short TCP connection during which a small HTTP request and a small HTTP response are exchanged. A first change to TCP was the increase of the initial congestion window [25]. For many years, TCP used an initial window between 2 and 4 segments [6]. This was smaller than the typical HTTP response from a web search engine [24]. Recent TCP stacks use an initial congestion window of 10 segments [25].

Another change that has been motivated by web search applications is the TCP Fast Open (TFO) extension [91]. This extension can be considered as a replacement for T/TCP. TCP Fast Open also enables a client to send data inside the SYN segment. TCP Fast Open relies on state sharing between the client and the server, but the state is more secure than the simple counter used by T/TCP. To enable the utilization of TCP Fast Open, the client must first obtain a cookie from the server. This is done by sending a SYN segment with the TFO cookie request option. The server then generates a secure cookie by encrypting the IP address of the client with a local secret [91]. The encrypted information is returned inside a TFO cookie option in the SYN+ACK segment. The client caches the cookie and associates it with the server's IP address. Subsequent connections initiated by the client will benefit from TCP Fast Open. The client includes the cached cookie and optional data inside its SYN segment. The server can validate the segment by decrypting its cookie. If the cookie is valid, the server acknowledges the SYN and the data that it contains. Otherwise,
the optional data is ignored and a normal TCP three-way handshake is used. This is illustrated in figure 5.
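On Linux, a client can request TCP Fast Open by handing the first data to sendto() with the MSG_FASTOPEN flag; the kernel transparently requests, caches and replays the cookie. A minimal sketch (the server address is a placeholder, and the feature requires a kernel with TFO enabled via the net.ipv4.tcp_fastopen sysctl):

import socket

HOST, PORT = "192.0.2.1", 80  # hypothetical server

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# On the first contact this falls back to a normal three-way handshake and
# caches the cookie from the SYN+ACK; on later connections the request below
# travels inside the SYN itself.
s.sendto(b"GET / HTTP/1.0\r\n\r\n", socket.MSG_FASTOPEN, (HOST, PORT))
reply = s.recv(4096)
s.close()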
3 Today's Internet
The TCP/IP protocol suite was designed with the end-to-end principle [100] in mind. TCP and SCTP are no exception to this rule. They both assume that the network contains relays that operate only at the physical, datalink and network layers of the reference models. In such an end-to-end Internet, the payload of an IP packet is never modified inside the network, and any transport protocol can be used above IPv4 or IPv6. Today, this behavior corresponds to only some islands in the Internet, like research backbones and some university networks. Measurements performed in enterprise, cellular and other types of commercial networks reveal that IP packets are processed differently in deployed networks [58, 115]. In addition to the classical repeaters, switches and routers, currently deployed networks contain various types of middleboxes [22]. Middleboxes were not part of the original TCP/IP architecture and they have evolved mainly during the last decade. A recent survey of enterprise networks reveals that such networks sometimes contain as many middleboxes as routers [103]. A detailed survey of all possible middleboxes is outside the scope of this chapter, but it is useful to study the operation of some important types of middleboxes, to understand their impact on transport protocols and how transport protocols have to cope with them.
3.1 Firewalls
Firewalls perform checks on the received packets and decide to accept or discard them based on configured security policies. Firewalls play an important role in delimiting network boundaries and controlling incoming and outgoing traffic in enterprise networks. In theory, firewalls should not directly affect transport protocols, but in practice, they may block the deployment of new protocols or extensions to existing ones. Firewalls can either filter packets on the basis of a white list, i.e. an explicit list of allowed communication flows, or a black list, i.e. an explicit list of all forbidden communication flows. Most enterprise firewalls use a white list approach. The network administrator defines a set of allowed communication flows, based on the high-level security policies of the enterprise, and configures the low-level filtering rules of the firewall to implement these policies. With such a whitelist, all flows that have not been explicitly defined are forbidden, and the firewall discards all packets that do not match an accepted communication flow. This unfortunately implies that a packet carrying a Protocol value different from the classical TCP, ICMP and UDP protocols will usually not be accepted by such a firewall. This is a major hurdle for the deployment of new transport protocols like SCTP.

Some firewalls can perform more detailed verification and maintain state for each established TCP connection. Some of these stateful firewalls are capable of verifying whether a packet that arrives for an accepted TCP connection contains a valid sequence number. For this, the firewall maintains state for each TCP connection that it accepts, and when a new data packet arrives, it verifies that it belongs to an established connection and that its sequence number fits inside the advertised receive window. This verification is intended to protect the hosts that reside behind the firewall from packet injection attacks, despite the fact that these hosts also need to perform the same verification.

Stateful firewalls may also limit the extensibility of protocols like TCP. To understand the problem, let us consider the large windows extension defined in [63]. This extension fixes one limitation of the original TCP specification. The TCP header [89] includes a 16-bit field that encodes the receive window in the TCP header. A consequence of this choice is that standard TCP cannot support a receive window larger
than 64 KBytes. This is not large enough for high bandwidth networks. To allow hosts to use a larger window, [63] changes the semantics of the receive window field of the TCP header on a per-connection basis. [63] defines the WScale TCP option, which can only be used inside the SYN and SYN+ACK segments. This extension allows the communicating hosts to maintain their receive window as a 32-bit field. The WScale option contains as parameter the number of bits by which the 32-bit window will be shifted to the right before its lower 16 bits are placed in the TCP header. This shift is used on a TCP connection provided that both the client and the server have included the WScale option in the SYN and SYN+ACK segments.

Unfortunately, a stateful firewall that does not understand the WScale option may cause problems. Consider for example a client and a server that use a very large window. During the three-way handshake, they indicate with the WScale option that they will shift their window by 14 bits to the right. When the connection starts, each host reserves 2^16 bytes of memory for its receive window⁵. Given the negotiated shift, each host will send in the TCP header a window field set to 0000000000000100. If the stateful firewall does not understand the WScale option used in the SYN and SYN+ACK segments, it will assume a window of 4 bytes and will discard all received segments. Unfortunately, there are still today stateful firewalls⁶ that do not understand this TCP option defined in 1992.

Stateful firewalls can perform even more detailed verification of the packets exchanged during a TCP connection. For example, intrusion detection and intrusion prevention systems are often combined with traffic normalizers [113, 53]. A traffic normalizer is a middlebox that verifies that all packets obey the protocol specification. When used upstream of an intrusion detection system, a traffic normalizer can for example buffer the packets that are received out-of-order and forward them to the IDS once they are in-sequence.
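The WScale arithmetic in the example above can be checked in a few lines. With a negotiated shift of 14, the 16-bit window field carries the real window divided by 2^14; a middlebox that ignores the option sees only the raw field value:

shift = 14                  # negotiated with the WScale option
real_window = 2 ** 16       # bytes reserved by each host

field = real_window >> shift            # value placed in the TCP header
assert field == 0b0000000000000100      # i.e. 4

window_with_scaling = field << shift    # 65536 bytes, as the hosts intend
window_seen_by_legacy_firewall = field  # just 4 bytes: segments get dropped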
3.2 Network Address Translators

When it translates a packet, a NAT rewrites the IP addresses and port numbers and, as a consequence, also the IP and TCP checksums. The NAT performs this modification transparently in both directions. It is important to note that a NAT can only change the header of a transport protocol that it supports. Most deployed NATs only support TCP [49], UDP [9] and ICMP [104]. Supporting another transport protocol on a NAT requires software changes [55], and few NAT vendors implement those changes. This often forces the users of a new transport protocol to tunnel their protocol on top of UDP to traverse NATs and other middleboxes [85, 110]. This limits the ability to innovate in the transport layer.

NATs can be used transparently by most Internet applications. Unfortunately, some applications cannot easily be used over NATs [57]. The textbook example of this problem is the File Transfer Protocol (FTP) [90]. An FTP client uses two types of TCP connections: a control connection and data connections. The control connection is used to send commands to the server. One of these is the PORT command, which allows the client to specify the IP address and the port numbers that will be used for the data connection to transfer a file. The parameters of the PORT command are sent using a special ASCII syntax [90]. To preserve the operation of the FTP protocol, a NAT must translate not only the IP addresses and ports that appear in the IP and TCP headers, but also those that appear as parameters of the PORT command exchanged over the control connection. Many deployed NATs include Application Level Gateways (ALG) [57] that implement part of the application level protocol and modify the payload of segments. For FTP, after translation a PORT command may be longer or shorter than the original one. This implies that the FTP ALG needs to maintain state and will have to modify the sequence/acknowledgment numbers of all segments sent over a connection after having translated a PORT command. This is transparent for the FTP application, but has influenced the design of Multipath TCP, although recent FTP implementations rarely use the PORT command [7].
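The bookkeeping that such an FTP ALG must perform can be sketched as follows: once a rewritten PORT command changes the payload length, every later sequence number on the connection must be shifted by the accumulated delta, and acknowledgments flowing in the opposite direction must be shifted back. This is a conceptual sketch, not the code of any real NAT:

class SeqFixup:
    # Tracks the cumulative length change introduced by payload rewrites on
    # one TCP connection (client-to-server direction in the FTP case).
    def __init__(self):
        self.delta = 0                    # bytes added (+) or removed (-)

    def on_rewrite(self, old_len, new_len):
        self.delta += new_len - old_len   # e.g. a longer PORT command

    def fix_seq(self, seq):
        # Applied to sequence numbers of segments after the rewrite point.
        return (seq + self.delta) % 2**32

    def fix_ack(self, ack):
        # Applied to acknowledgments travelling in the reverse direction.
        return (ack - self.delta) % 2**32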
3.3 Proxies
The last middlebox that we cover is the proxy. A proxy is a middlebox that resides on a path and terminates TCP connections. A proxy can be explicit or transparent. The SOCKS5 protocol [71] is an example of the utilization of an explicit proxy. SOCKS5 proxies are often used in enterprise networks to authenticate the establishment of TCP connections. A classical example of transparent proxies are the HTTP proxies that are deployed in various commercial networks to cache and speed up HTTP requests. In this case, some routers in the network are configured to intercept TCP connections and redirect them to a proxy server [75]. This redirection is transparent for the application, but from a transport viewpoint, the proxy acts as a relay between the two communicating hosts and there are two different TCP connections⁷. The first one is initiated by the client and terminates at the proxy. The second one is initiated by the proxy and terminates at the server. The data exchanged over the first connection is passed to the second one, but the TCP options are not necessarily preserved. In some deployments, the proxy can use different options than the client and/or the server.
There are two broad types of middlebox studies. The first type uses ground-truth topology information from providers and other organizations. While accurate, these studies may overestimate the influence of middleboxes on end-to-end traffic, because certain behavior is only triggered in rare corner cases (e.g. an intrusion prevention system may only affect traffic carrying known worm signatures). These studies also do not tell us exactly what operations are applied to packets by middleboxes. One recent survey study has found that the 57 enterprise networks surveyed deploy as many middleboxes as routers [103].

The second type of study uses active measurements, probing end-to-end paths to trigger known behaviors of middleboxes. Such measurements are very accurate, pinpointing exactly what middleboxes do in certain scenarios. However, they offer only a lower bound on middlebox deployments, as the probes may pass through other middleboxes without triggering them. Additionally, such studies are limited to probing a full path (they cannot probe path segments) and thus cannot tell how many middleboxes are deployed on a path exhibiting middlebox behavior. The data surveyed below comes from such studies.

Network Address Translators are easy to test for: the traffic source needs to compare its local source address with the source address of its packets reaching an external site (e.g. a what's-my-IP service). When the addresses differ, a NAT has been deployed on the path. Using this basic technique, existing studies have shown that NATs are deployed almost universally by (or for) end-users, be they home or mobile. Most cellular providers use them to cope with address space shortage and to provide some level of security: a study surveying 107 cellular networks across the globe found that 82 of them used NATs [115]. Home users receive a single public IP address (perhaps via DHCP) from their access providers, and deploy NATs to support multiple devices. The typical middlebox here is the wireless router, an access point that aggregates all home traffic onto the access link. A study using peer-to-peer clients found that only 12% of the peers have a public IP address [30].

Testing for stateless firewalls is equally simple: the client generates traffic using different transport protocols and port numbers, and the server acks it back. As long as some tests work, the client knows the endpoint is reachable, and that the failed tests are most likely due to firewall behavior. There is a chance that the failed tests are due to the stochastic packet loss inherent in the Internet; that is why tests are interleaved and run multiple times. Firewalls are equally widespread in the Internet, being deployed by most cellular operators [115]. Home routers often act as firewalls, blocking new protocols and even existing ones (e.g. UDP) [30]. Most servers also deploy firewalls to restrict inbound traffic [103]. Additionally, many modern operating systems come with default-on firewalls. For instance, the Windows 7 firewall explicitly asks users to allow incoming connections and to whitelist applications allowed to make outgoing connections.

Testing for explicit proxies can be done by comparing the segments that leave the source with the ones arriving at the destination. A proxy will change sequence numbers, perhaps segment packets differently, modify the receive window, and so forth. Volunteers from various parts of the globe ran TCPExposure, a tool that aims to detect such proxies and other middlebox behavior [58].
The tests cover 142 access networks (including cellular, DSL, public hotspots and offices) in 24 countries. Proxying behavior was seen on 10% of the tested paths [58]. Beyond basic reachability, middleboxes such as traffic normalizers and stateful firewalls have some expectations about the packets they see. What exactly do these middleboxes do, and how widespread are they? The same study sheds some light on this matter:

- Middlebox behavior depends on the ports used. Most middleboxes are active on port 80 (HTTP traffic).
- A third of paths keep TCP flow state and use it to actively correct acknowledgments for data the middlebox has not seen. To probe this behavior, TCPExposure sends a few TCP segments leaving a gap in the sequence numbers, while the server acknowledgment also covers the gap.

- 14% of paths remove unknown options from SYN packets. These paths will not allow TCP extensions to be deployed.

- 18% of paths modify sequence numbers; of these, a third seem to be proxies, and the others are most likely firewalls that randomize the initial sequence number to protect vulnerable end hosts against in-window injection attacks.
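The basic NAT test described above is easy to reproduce: compare the address the local stack would use with the source address an external site observes. In the sketch below, the echo URL is a placeholder for any what's-my-IP service:

import socket
from urllib.request import urlopen

def local_address():
    # Connecting a UDP socket does not send a packet; it merely asks the
    # routing table which local address would be used.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect(("192.0.2.1", 53))
        return s.getsockname()[0]

def observed_address():
    return urlopen("https://example.net/whats-my-ip").read().decode().strip()

if local_address() != observed_address():
    print("a NAT (or proxy) rewrites our source address on this path")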
3.5
Deploying a new IP protocol requires a lot of investment to change all the deployed hardware. The experience with IPv6, fifteen years after it was standardized, is not encouraging: only a minute fraction of the Internet has migrated to v6, and there are no signs of it becoming "the Internet" anytime soon. Changing IPv4 itself is in theory possible with IP options. Unfortunately, it has been known for a while now that "IP options are not an option" [42]. This is because existing routers implement forwarding in hardware for efficiency reasons; packets carrying unknown IP options are treated as exceptions that are processed in software. To avoid a denial of service on routers' CPUs, such packets are dropped by most routers. In a nutshell, we can't really touch IP - but have we also lost our ability to change transport protocols as well?

The high-level picture emerging from existing middlebox studies is that of a network that is highly tailored to today's traffic, to the point that it is ossified: changing existing transport protocols is challenging, as it needs to carefully consider middlebox interactions. Further, deploying new transport protocols natively is almost impossible. Luckily, changing transport protocols is still possible, albeit with great care. Firewalls will block any traffic they do not understand, so deploying new protocols must necessarily use existing ones just to get through the network. This observation has recently led to the development of Minion, a container protocol that enables basic connectivity above TCP while avoiding its in-order, reliable bytestream semantics, at the expense of slightly increased bandwidth usage. We discuss Minion in Section 5. Even changing TCP is very difficult. The semantics of TCP are embedded in the network fabric, and new extensions must function within the confines of these semantics, or they will fail. In Section 4 we discuss Multipath TCP, another recent extension to TCP, which was designed explicitly to be middlebox compatible.
4 Multipath TCP
Today's networks are multipath: mobile devices have multiple wireless interfaces, datacenters have many redundant paths between servers, and multi-homing has become the norm for big server farms. Meanwhile, TCP is essentially a single-path protocol: when a TCP connection is established, it is bound to the IP addresses of the two communicating hosts. If one of these addresses changes, for whatever reason, the connection fails. In fact, a TCP connection cannot even be load-balanced across more than one path within the network, because this results in packet reordering, and TCP misinterprets this reordering as congestion and slows down.

This mismatch between today's multipath networks and TCP's single-path design creates tangible problems. For instance, if a smartphone's WiFi interface loses signal, the TCP connections associated with it
stall - there is no way to migrate them to other working interfaces, such as 3G. This makes mobility a frustrating experience for users. Modern datacenters are another example: many paths are available between two endpoints, and equal-cost multipath routing randomly picks one for a particular TCP connection. This can cause collisions, where multiple flows get placed on the same link, hurting throughput - to such an extent that average throughput is halved in some scenarios.

Multipath TCP (MPTCP) [13] is a major modification to TCP that allows multiple paths to be used simultaneously by a single connection. Multipath TCP circumvents the issues above and several others that affect TCP. Changing TCP to use multiple paths is not a new idea: it was originally proposed more than fifteen years ago by Christian Huitema in the Internet Engineering Task Force (IETF) [59], and there have been half a dozen more proposals since then to similar effect. Multipath TCP draws on the experience gathered in previous work, and goes further to solve issues of fairness when competing with regular TCP and deployment issues due to middleboxes in today's Internet.
Figure 6: Multipath TCP handshake: multiple subflows can be added and removed after the initial connection is set up and connection identifiers are exchanged.

The initial MPTCP connection setup is shown graphically in the top part of Figure 6, where the segments are regular TCP segments carrying new Multipath TCP-related options (shown in green). At this point the Multipath TCP connection is established and the client and server can exchange TCP segments via the 3G path. How could the mobile device also send data through this Multipath TCP session over its WiFi interface? Naively, it could simply send some of the segments over the WiFi interface. However, most ISPs will drop these packets, as they would have the source address of the 3G interface. Perhaps the client could tell the server the IP address of the WiFi interface, and use that when it sends over WiFi? Unfortunately, this will rarely work: firewalls and similar stateful middleboxes on the WiFi path expect to see a SYN segment before they see data segments. The only solution that will work reliably is to perform a regular three-way handshake on the WiFi path before sending any packets that way, so this is what Multipath TCP does. This handshake carries the MP_JOIN TCP option, providing information to the server that can securely identify the correct connection to associate this additional subflow with. The server replies with MP_JOIN in the SYN+ACK, and the new subflow is established (this is shown in the bottom part of Figure 6).

An important point about Multipath TCP, especially in the context of mobile devices, is that the set of subflows that are associated with a Multipath TCP connection is not fixed. Subflows can be dynamically added and removed from a Multipath TCP connection throughout its lifetime, without affecting the bytestream transported on behalf of the application. If the mobile device moves to another WiFi network, it will receive a new IP address. At that time, it will open a new subflow using its newly allocated address and tell the server that its old address is not usable anymore. The server will now send data towards the new address.
These options allow mobile devices to move easily through different wireless connections without breaking their Multipath TCP connections [80]. Multipath TCP also implements mechanisms that allow the remote host to be informed of the addition/removal of addresses even when an endpoint operates behind a NAT, or when a subflow using a different address family is needed (e.g. IPv6). Endpoints can send an ADD_ADDR option that contains an address identifier together with an address. The address identifier is unique at the sender, and allows it to identify its addresses even when it is behind a NAT. Upon receiving an advertisement, the endpoint may initiate a new subflow to the new address. An address withdrawal mechanism is also provided via the REMOVE_ADDR option, which also carries an address identifier.

Data Transfer. Assume now that two subflows have been established over WiFi and 3G: the mobile device can send and receive data segments over both. Just like TCP, Multipath TCP provides a bytestream service to the application. In fact, standard applications can function over MPTCP without being aware of it - MPTCP provides the same socket interface as TCP. Since the two paths will often have different delay characteristics, the data segments sent over the two subflows will not be received in order. Regular TCP uses the sequence number in the TCP header to put data back into the original order. A simple solution for Multipath TCP would be to just reuse this sequence number as is. Unfortunately, this simple solution would create problems with some existing middleboxes such as firewalls. On each path, a middlebox would only see half of the packets, so it would observe many gaps in the TCP sequence space. Measurements indicate that some middleboxes react in strange ways when faced with gaps in TCP sequence numbers [58]. Some discard the out-of-sequence segments, while others try to update the TCP acknowledgments in order to recover some of these gaps. With such middleboxes on a path, Multipath TCP cannot safely send TCP segments with gaps in the TCP sequence number space. On the other hand, Multipath TCP also cannot send every data segment over all subflows: that would be a waste of resources.

To deal with this problem, Multipath TCP uses its own sequence numbering space. Each segment sent by Multipath TCP contains two sequence numbers: the subflow sequence number inside the regular TCP header and an additional Data Sequence Number (DSN). This solution ensures that the segments sent on any given subflow have consecutive sequence numbers and do not upset middleboxes. Multipath TCP can then send some data sequence numbers on one path and the remainder on the other path; the DSN will be used by the Multipath TCP receiver to reorder the bytestream before it is given to the receiving application. Before we explain the way the Data Sequence Number is encoded, we first need to discuss two other key parts of Multipath TCP that are affected by the additional sequence number space: flow control and acknowledgments.
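Before that, the role of the DSN can be made concrete with a toy receiver: segments arrive in order on each subflow, and the DSN alone dictates how the bytestream is rebuilt across subflows. A sketch under simplifying assumptions (no losses, byte-based DSNs):

import heapq

class DsnReassembler:
    def __init__(self, initial_dsn):
        self.next_dsn = initial_dsn
        self.pending = []                 # min-heap of (dsn, payload)

    def on_segment(self, dsn, payload):
        # Returns whatever can now be delivered in order to the application.
        heapq.heappush(self.pending, (dsn, payload))
        delivered = b""
        while self.pending and self.pending[0][0] == self.next_dsn:
            _, data = heapq.heappop(self.pending)
            delivered += data
            self.next_dsn += len(data)
        return delivered

r = DsnReassembler(1000)
r.on_segment(1005, b"world")   # arrived first via WiFi: buffered, returns b""
r.on_segment(1000, b"hello")   # via 3G: returns b"helloworld"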
Flow Control. TCP's receive window indicates the number of bytes beyond the sequence number from the acknowledgment field that the receiver can buffer. The sender is not permitted to send more than this amount of additional data. Multipath TCP also needs to implement flow control, although segments can arrive over multiple subflows. If we inherited TCP's interpretation of the receive window, this would imply that an MPTCP receiver maintains a pool of buffering per subflow, with the receive window indicating per-subflow buffer occupancy. Unfortunately, such an interpretation can lead to deadlocks:

1. The next segment that needs to be passed to the application was sent on subflow 1, but was lost.

2. In the meantime subflow 2 continues delivering data, and fills its receive window.
3. Subflow 1 fails silently.

4. The missing data needs to be re-sent on subflow 2, but there is no space left in the receive window, resulting in a deadlock.

The correct solution is to generalize TCP's receive window semantics to MPTCP. For each connection, a single receive buffer pool should be shared between all subflows. The receive window then indicates the maximum data sequence number that can be sent, rather than the maximum subflow sequence number. As a segment resent on a different subflow always occupies the same data sequence space, deadlocks cannot occur. The problem for an MPTCP sender is that to calculate the highest data sequence number that can be sent, the receive window needs to be added to the highest data sequence number acknowledged. However, the ACK field in the TCP header of an MPTCP subflow must, by necessity, indicate only subflow sequence numbers to cope with middleboxes. Does MPTCP need to add an extra data acknowledgment field for the receive window to be interpreted correctly?

Acknowledgments. The answer is positive: MPTCP needs and uses explicit connection-level acknowledgments, or DATA_ACKs. The alternative would be to infer connection-level acknowledgments from subflow acknowledgments, by using a scoreboard maintained by the sender that maps subflow sequence numbers to data sequence numbers. Unfortunately, MPTCP segments and their associated ACKs will be reordered as they travel on different paths, making it impossible to correctly infer the connection-level acknowledgments from subflow-level ones [94].

Encoding. We have seen that on the forward path we need to encode a mapping of subflow bytes into the data sequence space, and on the reverse path we need to encode cumulative data acknowledgments. There are two viable choices for encoding this additional data: send the additional data in TCP options, or carry the additional data within the TCP payload, using a chunked or escaped encoding to separate control data from payload data. For the forward path there aren't compelling arguments either way, but the reverse path is a different matter. Consider a hypothetical encoding that divides the payload into chunks, where each chunk has a TLV (type-length-value) header. A data acknowledgment can then be embedded into the payload using its own chunk type. Under most circumstances this works fine. However, unlike TCP's pure ACK, anything embedded in the payload must be treated as data. In particular:

- It must be subject to flow control, because the receiver must buffer data to decode the TLV encoding.

- If lost, it must be retransmitted consistently, so that middleboxes can track sequence state correctly⁸.

- If packets before it are lost, it might be necessary to wait for retransmissions before the data can be parsed, causing head-of-line blocking.

Flow control presents the most obvious problem for the chunked payload encoding. Figure 7 provides an example. Client C is pipelining requests to server S; meanwhile S's application is busy sending the large response to the first request, so it isn't yet ready to read the subsequent requests.
⁸ TCP proxies re-send the original content if they see a retransmission with different data.
Figure 7: Flow control on the path from C to S inadvertently stops the data flow from S to C

At this point, S's receive buffer fills up. S sends segment 10; C receives it and wants to send the DATA_ACK, but cannot: flow control imposed by S's receive window stops it. Because no DATA_ACKs are received from C, S cannot free its send buffer, so this fills up and blocks the sending application on S. S's application will only read when it has finished sending data to C, but it cannot do so because its send buffer is full. The send buffer can only empty when S receives the DATA_ACK from C, but C cannot send this until S's application reads. This is a classic deadlock cycle. As no DATA_ACK is received, S will eventually time out the data it sent to C and will retransmit it; after many retransmissions the whole connection will time out. The conclusion is that DATA_ACKs cannot be safely encoded in the payload. The only real alternative is to encode them in TCP options, which (on a pure ACK packet) are not subject to flow control.

The Data Sequence Mapping. If MPTCP must use options to encode DATA_ACKs, it is simplest to also encode the mapping from subflow sequence numbers to data sequence numbers in a TCP option. This is the data sequence mapping, or DSM. At first glance it seems the DSM option simply needs to carry the data sequence number corresponding to the start of the MPTCP segment. Unfortunately, middleboxes and interfaces that implement TSO or LRO make this far from simple. Middleboxes that re-segment data would cause a problem. TCP Segmentation Offload (TSO) hardware in the network interface card (NIC) also re-segments data and is commonly used to improve performance. The basic idea is that the OS sends large segments and the NIC re-segments them to match the receiver's MSS. What does TSO do with TCP options? A test of 12 NICs supporting TSO from four different vendors showed that all of them copy a TCP option sent by the OS on a large segment into all the split segments [94]. If MPTCP's DSM option only listed the data sequence number, TSO would copy the same DSM to more than one segment, breaking the mapping. Instead, the DSM option must say precisely which subflow bytes map to which data sequence numbers. But this is further complicated by middleboxes that modify the initial sequence number of TCP connections and consequently rewrite all sequence numbers (many firewalls behave like this). Instead, the DSM option must map the offset from the subflow's initial sequence number to the data sequence number, as the offset is unaffected by sequence number rewriting. The option must also contain the length of the mapping. This is robust - as long as the option is received, it does not greatly matter which packet carries it, so duplicate mappings caused by TSO are not a problem.
                     1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+---------------+-------+----------------------+
|     Kind      |    Length     |Subtype| (reserved) |F|m|M|a|A|
+---------------+---------------+-------+----------------------+
|      Data ACK (4 or 8 octets, depending on flags)            |
+--------------------------------------------------------------+
|   Data Sequence Number (4 or 8 octets, depending on flags)   |
+--------------------------------------------------------------+
|             Subflow Sequence Number (4 octets)               |
+-------------------------------+------------------------------+
| Data-level Length (2 octets)  |      Checksum (2 octets)     |
+-------------------------------+------------------------------+

Figure 8: The Data Sequence Signal option in MPTCP, which carries the Data Sequence Mapping information, the Data ACK, the Data FIN and connection-fall-back options.

Dealing with Content-Modifying Middleboxes. Multipath TCP and content-modifying middleboxes (such as application-level NATs, e.g. for FTP) have the potential to interact badly. In particular, due to FTP's ASCII encoding, re-writing an IP address in the payload can necessitate changing the length of the payload. Subsequent sequence and ACK numbers are then fixed up by the middlebox so that they are consistent from the point of view of the end systems. Such length changes break the DSM option mapping - subflow bytes can be mapped to the wrong place in the data stream. They also break every other possible mapping mechanism, including chunked payloads. There is no easy way to handle such middleboxes. That is why MPTCP includes an optional checksum in the DSM mapping to detect such content changes. If an MPTCP host receives a segment with an invalid DSM checksum, it rejects the segment and triggers a fallback process: if any other subflow exists, MPTCP terminates the subflow on which the modification occurred; if no other subflow exists, MPTCP drops back to regular TCP behavior for the remainder of the connection, allowing the middlebox to perform rewriting as it wishes. This fallback mechanism preserves connectivity in the presence of middleboxes.

For efficiency reasons, MPTCP uses the same 16-bit ones-complement checksum used in the TCP header. This allows the checksum over the payload to be calculated only once. The payload checksum is added to a checksum of an MPTCP pseudo-header covering the DSM mapping values and then inserted into the DSM option. The same payload checksum is added to the checksum of the TCP pseudo-header and then used in the TCP checksum field. MPTCP allows checksums to be disabled for high-performance environments such as datacenters, where there is no chance of encountering such an application-level gateway. The fall-back-to-TCP process, triggered by a checksum failure, can also be triggered in other circumstances. For example, if a routing change moves an MPTCP subflow to a path where a middlebox removes DSM options, this also triggers the fall-back procedure.
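A decoder for the option in Figure 8 makes the role of the flags explicit: A and a say whether a Data ACK is present and whether it is 4 or 8 octets, M and m do the same for the mapping and its DSN, and F signals a DATA_FIN. The sketch below assumes the checksum field is present, as drawn in the figure, and omits error handling:

import struct

def parse_dss(opt):
    # opt: the raw option bytes, starting at the Kind field.
    kind, length, flags = opt[0], opt[1], opt[3]
    A = flags & 0x01          # Data ACK present
    a = flags & 0x02          # Data ACK is 8 octets instead of 4
    M = flags & 0x04          # mapping (DSN, subflow seq, length) present
    m = flags & 0x08          # DSN is 8 octets instead of 4
    F = flags & 0x10          # DATA_FIN
    offset, out = 4, {"data_fin": bool(F)}
    if A:
        size = 8 if a else 4
        out["data_ack"] = int.from_bytes(opt[offset:offset + size], "big")
        offset += size
    if M:
        size = 8 if m else 4
        out["dsn"] = int.from_bytes(opt[offset:offset + size], "big")
        offset += size
        out["subflow_seq"], out["data_len"], out["checksum"] = \
            struct.unpack("!IHH", opt[offset:offset + 8])
    return out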
Connection Release. Multipath TCP must allow a connection to survive even though its subflows are coming and going. Subflows in MPTCP can be torn down by means of a four-way handshake like regular TCP flows; this ensures that MPTCP allows middleboxes to clear their state when a subflow is no longer used. MPTCP uses an explicit four-way handshake for connection tear-down, indicated by a DATA_FIN option. The DATA_FIN is MPTCP's equivalent to TCP's FIN, and it occupies one byte in the data-sequence space. A DATA_ACK is used to acknowledge the receipt of the DATA_FIN. MPTCP requires that the segment(s) carrying a DATA_FIN also have the FIN flag set - this ensures all subflows are also closed when the MPTCP connection is being closed. For reference, we show the wire format of the option used by MPTCP for data exchange in Figure 8. This option encodes the Data Sequence Mapping, Data ACK, Data FIN and the fall-back options. The flags specify which parts of the option are valid, and help reduce option space usage.
Figure 9: A scenario which shows the importance of weighting the aggressiveness of subflows. (Reprinted from [117]. Included here by permission.)

and push most of its traffic there; this will decrease the load on the congested link and increase it on the less congested one. If a large enough fraction of flows are multipath, congestion will spread out evenly across collections of links, creating resource pools: links that act together as if they were a single, larger-capacity link shared by all flows. This effect is called resource pooling [116]. Resource pooling brings two major benefits, discussed in the paragraphs below.

Increased Fairness. Consider the example shown in Figure 10: congestion balancing ensures that all flows have the same throughput, making the two links of 20 pkt/s act like a single pooled link with capacity 40 pkt/s shared fairly by the four flows. If more MPTCP flows were added, the two links would still behave as a pool, sharing capacity fairly among all flows. Conversely, if we remove the Multipath TCP flow, the links no longer form a pool, and the throughput allocation is unfair, as the TCP connection using the top path gets twice as much throughput as the TCP connections using the bottom path.

Increased Throughput. Consider the somewhat contrived scenario in Fig. 11, and suppose that the three links each have a capacity of 12 Mb/s. If each flow split its traffic evenly across its two paths, each subflow would get 4 Mb/s, hence each flow would get 8 Mb/s. But if each flow used only the one-hop shortest path, it could get 12 Mb/s: this is because two-hop paths consume double the resources of one-hop paths, and in a congested network it makes sense to only use the one-hop paths. In an idle network, however, using all available paths is much better: consider the case when only the blue connection is using the links. In this case this connection would get 24 Mb/s of throughput; using the one-hop path alone would only provide 12 Mb/s. In summary, the endpoints need to be able to dynamically decide which paths to use based on conditions in the network.

A solution has been devised in the theoretical literature on congestion control, independently by Kelly and Voice [65] and Han et al. [52]. The core idea is that a multipath flow should shift all its traffic onto the least-congested path. In a situation like Fig. 11 the two-hop paths will have a higher drop probability than the one-hop paths, so applying the core idea will yield the efficient allocation. Surprisingly, it turns out that this can be achieved by doing independent congestion control at the endpoints.

Multipath TCP Congestion Control. The theoretical work on multipath congestion control [65, 52] assumes a rate-based protocol, with exponential increases of the rate. TCP, in contrast, is a packet-based protocol, sending w packets every round-trip time (i.e. the rate is w/RTT); a new packet is sent only when an acknowledgment is received, confirming that an existing packet has left the network. This property is called ACK-clocking and is desirable because it has good stability properties: when congestion occurs, round-trip times increase (due to buffering), which automatically reduces the effective rate [62].
Figure 10: Two links, each with capacity 20 pkts/s. The top link is used by a single TCP connection, and the bottom link is used by two TCP connections. A Multipath TCP connection uses both links. Multipath TCP pushes most of its traffic onto the less congested top link, making the two links behave like a resource pool of capacity 40 pkts/s. Capacity is divided equally, with each flow having a throughput of 10 pkts/s. (Reprinted from [117]. Included here by permission.)

Converting a theoretical rate-based exponential protocol into a practical packet-based protocol that is fair to TCP turned out to be more difficult than expected. Two problems appear [117]:

- When loss rates are equal on all paths, the theoretical algorithm will place all of the window on one path or the other, not on both. This effect was termed flappiness, and it appears because of the discrete (stochastic) nature of packet losses, which is not captured by the differential equations used in the theory.
- The ideal algorithm always prefers paths with lower loss rates, but in practice these may have poor performance. Consider a mobile phone with WiFi and 3G links: 3G links have very low loss rates and huge round-trip times, resulting in poor throughput. WiFi is lossy, has shorter round-trip times and typically offers much better throughput. In this common case, a perfect controller would place all traffic on the 3G path, violating the second goal (deployability).

Figure 11: A scenario to illustrate the importance of choosing the less-congested path. (Reprinted from [117]. Included here by permission.)

The pragmatic choice is to sacrifice some load-balancing ability to ensure greater stability and to offer incentives for deployment. This is what Multipath TCP congestion control does. Multipath TCP congestion control is a series of simple changes to the standard TCP congestion control mechanism. Each subflow has its own congestion window, which is halved when packets are lost, as in standard TCP [117]. Congestion balancing is implemented in the increase phase of congestion control: here Multipath TCP allows less congested subflows to increase proportionally more than congested ones. Finally, the total increase of Multipath TCP across all of its subflows is chosen dynamically in such a way that it achieves the first and second goals above. The exact algorithm is described below, and it satisfies the goals we have discussed:

- Upon ACK on subflow r, increase the window w_r by min(a/w_total, 1/w_r).
- Upon loss on subflow r, decrease the window w_r by w_r/2.

Here w_r is the current window size on subflow r, w_total is the sum of the windows across all subflows, and

    a = w_total * max_r(w_r / RTT_r^2) / (sum_r w_r / RTT_r)^2        (1)

The algorithm biases the increase towards uncongested paths: these receive more ACKs and therefore increase accordingly. However, MPTCP does keep some traffic even on the highly congested paths; this ensures stability and allows it to quickly detect when path conditions improve. The term a is computed dynamically upon each packet drop. Its purpose is to make sure that MPTCP gets at least as much throughput as TCP would get on the best path. To achieve this goal, a is computed by estimating how much TCP would get on each MPTCP path (this is easy, as round-trip time and loss-rate estimates are known) and ensuring that MPTCP in stable state gets at least that much. A detailed discussion of the design of the MPTCP congestion control algorithm is provided in [117]. For example, in the three-path example above, the flow will put 45% of its weight on each of the less congested paths and 10% on the more congested path. This is intermediate between regular TCP (33% on each path) and a perfect load-balancing algorithm (0% on the more congested path) that is impossible to implement in practice. The window increase is capped at 1/w_r, which ensures that the multipath flow can take no more capacity on either path than a single-path TCP flow would.

In setting the a parameter, Multipath TCP congestion control uses subflow delays to compute the target rate. Using delay for congestion control is not a novel idea: TCP Vegas [20], for instance, performs congestion control only by tracking RTT values. Vegas treats increasing RTTs as a sign of congestion, instead of relying on packet losses as regular TCP does. That is why Vegas loses out to regular TCP at shared bottlenecks, and this is probably the reason for its lack of adoption. MPTCP does not treat delay as congestion: it just uses it to figure out the effective rate of a TCP connection on a given path. This allows MPTCP to compete fairly with TCP.

Alternative Congestion Controllers for Multipath TCP. The standardized Multipath TCP congestion control algorithm chooses a trade-off between load balancing, stability and the ability to quickly detect available capacity. The biggest contribution of this work is the clearly defined set of goals for what multipath congestion control should do, and an instantiation that achieves (most of) the stated goals in practice.
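To make equation (1) concrete, the following sketch implements the increase and decrease rules exactly as stated above. The dictionary-based bookkeeping and all names are ours; the real Linux implementation differs considerably (fixed-point arithmetic, slow start, and so on).

    # Coupled MPTCP congestion control (illustrative), after [93, 117].
    # Windows are in packets, RTTs in seconds.

    def alpha(subflows):
        """The aggressiveness parameter a from equation (1)."""
        w_total = sum(f["w"] for f in subflows)
        best_path = max(f["w"] / f["rtt"] ** 2 for f in subflows)
        denom = sum(f["w"] / f["rtt"] for f in subflows) ** 2
        return w_total * best_path / denom

    def on_ack(subflows, r):
        """Per-ACK increase on subflow r: min(a/w_total, 1/w_r)."""
        w_total = sum(f["w"] for f in subflows)
        subflows[r]["w"] += min(alpha(subflows) / w_total, 1.0 / subflows[r]["w"])

    def on_loss(subflows, r):
        """Halve the window of subflow r, as in standard TCP."""
        subflows[r]["w"] /= 2

    # Two subflows: a short-RTT WiFi-like path and a long-RTT 3G-like path.
    flows = [{"w": 10.0, "rtt": 0.02}, {"w": 10.0, "rtt": 0.2}]
    for _ in range(100):
        on_ack(flows, 0)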
[Figure 12 plot: goodput [Mbps] versus time around the Wi-Fi loss event, comparing Full-MPTCP with Application Handover.]
Figure 12: (Mobility) A mobile device is using both its WiFi and 3G interfaces, and then the WiFi interface fails. We plot the instantaneous throughputs of Multipath TCP and application-layer handover. (Reprinted from [80]; © 2012, ACM. Included here by permission.)

This research area is relatively new, and it is likely that more work will lead to better algorithms - if not generally applicable ones, then at least algorithms tailored to some practical use-cases. A new and interesting congestion controller called the Opportunistic Linked Increases Algorithm (OLIA) has already been proposed [66]; it offers better load balancing with seemingly few drawbacks. We expect this area to be very active in the near future; of particular interest is the design of multipath versions of the high-speed congestion control variants deployed in practice, such as Cubic or Compound TCP.
[Figure 13 plot: instantaneous and average throughput of the 3G and WiFi interfaces during the ride.]
Figure 13: (3G to WiFi Handover) A Linux distribution is downloaded by a mobile user while on the Bucharest subway. 3G coverage is ubiquitous. WiFi is available only in stations.

Figure 12 compares application-layer handover (with HTTP range requests) against Multipath TCP. The figure shows a smooth handover with Multipath TCP, as data keeps flowing despite the interface change. With application-layer handover there is a downtime of 4 seconds where the transfer stops - this is because it takes time for the application to detect the interface-down event, and it takes time for 3G to ramp up. Multipath TCP enables unmodified mobile applications to survive interface changes with little disruption. Selected apps can be modified today to support handover, but their performance is worse than with MPTCP. A more detailed discussion of the use of Multipath TCP in WiFi/3G environments may be found in [80].

We also present a real mobility trace in Figure 13, where a mobile user downloads a Linux distribution on his laptop while travelling on the Bucharest underground. The laptop uses a 3G dongle to connect to the cellular network and its WiFi NIC to connect to access points available in stations. The figure shows the download speeds of the two interfaces during a 30-minute underground ride. 3G throughput is stable: the average is 2.3 Mbps (shown with a dotted line on the graph), and the instantaneous throughput varies inversely with the distance to the 3G cell. WiFi throughput is much more bursty: in some stations the throughput soars to 40 Mbps, while in others it is zero, as the laptop does not manage to associate and obtain an IP address quickly enough. The average WiFi throughput (3.3 Mbps) is higher than the average 3G throughput. While MPTCP uses both interfaces in this experiment, using just WiFi when it is available may be preferable, as it reduces 3G data bills and the load on the cellular network. This is easy to implement with MPTCP, as it supports a simple prioritization mechanism that allows the client to inform the server to send data via the preferred subflow(s)9.
9 MPTCP sends an MP PRIO option to inform the remote end about changes in priority for the subflows.
[Figure 14 plot: per-flow throughput (Mb/s) sorted in increasing order, for TCP and MPTCP with two and four subflows.]
Figure 14: (Datacenter load-balancing) This graph compares standard TCP with MPTCP with two and four subflows, when tested on an EC2 testbed with 40 instances. Each host runs iperf sequentially to all other hosts. We plot the performance of all flows (Y axis) in increasing order of their throughputs (X axis). (Reprinted from [92]; © 2011, ACM. Included here by permission.)

Datacenters. We also show results from running Multipath TCP in a different scenario: the EC2 datacenter. Like most datacenters today, EC2 uses a redundant network topology where many paths are available between any pair of endpoints, and where connections are placed randomly onto the available paths. In EC2, 40 machines (or instances) ran the Multipath TCP kernel. A simple experiment was run where every machine measured the throughput sequentially to every other machine using first TCP, then Multipath TCP with two and with four subflows. Figure 14 shows the sorted throughputs measured over 12 hours. The results show that Multipath TCP brings significant improvements compared to TCP in this scenario. Because the EC2 network is essentially a black box, it is difficult to pinpoint the root cause of the improvements; however, a detailed analysis of the cases where Multipath TCP can help, and why, is presented in [92].
that are more robust. Of course, some changes may be needed at layer 2 to support such load balancing. Network topologies may be designed differently if MPTCP is the transport protocol. An example is GRIN [4], a proposal to change existing datacenter networks by randomly interconnecting servers in the same rack directly, using their free NIC ports. This allows a server to opportunistically send traffic at speeds higher than its access link by having an idle neighbor relay some traffic on its behalf. With a minor topology change and MPTCP, GRIN manages to better utilize the datacenter network core.

The mechanisms that allow MPTCP to use different IP addresses in the same transport connection (i.e., the connection identifier is decoupled from any single IP address) also allow connections to be migrated across physical machines with different IP addresses. One direct application is seamless virtual machine migration: an MPTCP-enabled guest virtual machine can simply resume its connections after it is migrated.

On the application side many optimizations are possible. Applications like video and audio streaming only need a certain bitrate to ensure user satisfaction - they rarely fully utilize the wireless capacity. Our mobility graph in the previous section shows that, in principle, WiFi has enough capacity to support these apps, even if it is only available in stations. A simple optimization would be to download as much of the video as possible while on WiFi, and only use 3G when the playout buffer falls below a certain threshold (a sketch of such a policy appears at the end of this section). This would ensure a smooth viewing experience while pushing as much traffic as possible over WiFi.

The examples shown in this section are only meant to be illustrative, and their potential impact or even feasibility is not clear at this point. Further, the list is not meant to be exhaustive. We have provided it only to show that the benefits of Multipath TCP go beyond increasing throughput, and could shape both the lower and the upper layers of the protocol stack in the future.
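As an illustration of the playout-buffer optimization described above, the following hedged policy enables the 3G subflow only when the buffer runs low; the thresholds, interface names and function are our own assumptions, not part of MPTCP.

    # Buffer-driven interface policy for streaming over MPTCP (illustrative).
    LOW_WATERMARK = 5.0    # seconds of buffered video: turn 3G on below this
    HIGH_WATERMARK = 15.0  # seconds of buffered video: turn 3G off above this

    def choose_subflows(buffered_seconds, wifi_available, using_3g):
        """Return (interfaces to use, new 3G state) for the next interval."""
        if buffered_seconds < LOW_WATERMARK:
            using_3g = True            # about to stall: pay for 3G
        elif buffered_seconds > HIGH_WATERMARK:
            using_3g = False           # comfortable buffer: spare the 3G link
        subflows = {"wifi"} if wifi_available else set()
        if using_3g or not wifi_available:
            subflows.add("3g")
        return subflows, using_3g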
5 Minion
As the Internet has grown and evolved over the past few decades, congestion control algorithms and extensions such as MPTCP continue to reflect an evolving TCP. However, a proliferation of middleboxes such as Network Address Translators (NATs), firewalls, and Performance Enhancing Proxies (PEPs) has arguably stretched the waist of the Internet hourglass upwards from IP to include TCP and UDP [98, 45, 86], making it increasingly difficult to deploy new transports and to use anything but TCP and UDP on the Internet.

TCP [89] was originally designed to offer applications a convenient, high-level communication abstraction with semantics emulating Unix file I/O or pipes: a reliable, ordered bytestream through an end-to-end channel (or connection). As the Internet has evolved, however, applications have come to need better abstractions from the transport. We start this section by examining how TCP's role in the network has evolved from a communication abstraction to a communication substrate, why its in-order delivery model makes TCP a poor substrate, why other OS-level transports have failed to replace TCP in this role, and why UDP is inadequate as the only alternative substrate to TCP.
Figure 15: Today's de facto transport layer is effectively split between OS and application code. (Reprinted from [78]. Included here by permission.)

Instead of building directly atop traditional OS-level transports such as TCP or UDP, today's applications frequently introduce additional transport-like protocol layers at user level, typically implemented via application-linked libraries. Examples include the ubiquitous SSL/TLS [35], media transports such as RTP [101], and experimental multi-streaming transports such as SST [44], SPDY [1], and ZeroMQ [2]. Applications increasingly use HTTP or HTTPS over TCP as a substrate [86]; this is also illustrated by the W3C's WebSocket interface [114], which offers general bidirectional communication between browser-based applications and Web servers atop HTTP and HTTPS. In this increasingly common design pattern, the transport layer as a whole has in effect become a stack of protocols straddling the OS-application boundary. Figure 15 illustrates one example stack, representing Google's experimental Chrome browser, which inserts SPDY for multi-streaming and TLS for security at application level, atop OS-level TCP. One can debate whether a given application-level protocol fits some definition of transport functionality. The important point, however, is that today's applications no longer need, or expect, the underlying OS to provide convenient communication abstractions: an application simply links in libraries, frameworks, or middleware offering the abstractions it desires. What today's applications need from the OS is not convenience, but an efficient substrate atop which application-level libraries can build the desired abstractions.
5.2
While TCP has proven to be a popular substrate for application-level transports, using TCP in this role converts its delivery model from a blessing into a curse. Application-level transports are just as capable as the kernel of sequencing and reassembling packets into a logical data unit or frame [29]. By delaying any segment's delivery to the application until all prior segments are received and delivered, however, TCP imposes a latency tax on all segments arriving within one round-trip time (RTT) after any single lost segment. This latency tax is a fundamental byproduct of TCP's in-order delivery model, and it is irreducible, in that an application-level transport cannot claw back the time a potentially useful segment has wasted in TCP's
buffers. The best the application can do is simply to expect higher latencies to be common. A conferencing application can use a longer jitter buffer, for example, at the cost of increased user-perceptible lag. Network hardware advances are unlikely to address this issue, since TCP's latency tax depends on the RTT, which is lower-bounded by the speed of light for long-distance communications.
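A back-of-the-envelope calculation shows the size of this tax; the numbers below are illustrative assumptions, not measurements.

    # One lost segment blocks delivery of everything that arrives during the
    # following round trip, until the retransmission fills the gap.
    RTT = 0.100   # seconds
    RATE = 1000   # segments per second arriving at the receiver

    blocked = int(RATE * RTT)   # segments held in TCP's reordering buffer
    extra_delay = RTT / 2       # average added latency per blocked segment
    print(f"{blocked} segments delayed by ~{extra_delay * 1000:.0f} ms each")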
5.3
All standardized OS-level transports since TCP, including UDP [87], RDP [112], DCCP [67], and SCTP [105], support out-of-order delivery. The Internet's evolution has created strong barriers against the widespread deployment of new transports other than the original TCP and UDP, however. These barriers are detailed elsewhere [98, 45, 86], but we summarize two key issues here.

First, adding or enhancing a native transport built atop IP involves modifying popular OSes, effectively raising the bar for widespread deployment and making it more difficult to evolve transport functionality below the red line representing the OS API in Figure 15.

Second, the Internet's original dumb network design, in which routers see only up to the IP layer, has evolved into a smart network in which pervasive middleboxes perform deep packet inspection and interposition in transport and higher layers. Firewalls tend to block anything unfamiliar for security reasons, and Network Address Translators (NATs) rewrite the port number in the transport header, making both incapable of passing traffic from a new transport without explicit support for that transport. Any packet content not protected by end-to-end security such as TLS - the yellow line in Figure 15 - has become fair game for middleboxes to inspect and interpose on [95], making it more difficult to evolve transport functionality anywhere below that line.
Figure 16: Minion architecture (Reprinted from [78]. Included here by permission.)
5.6
Minion is an architecture and protocol suite designed to provide efficient unordered delivery built atop existing transports. Minion itself offers no high-level abstractions: its goal is to serve applications and higher application-level transports by acting as a packhorse, carrying raw datagrams as reliably and efficiently as possible across today's diverse and change-averse Internet.
Figure 16 illustrates Minion's architecture. Applications and higher application-level transports link in and use Minion in the same way as they already use existing application-level transports such as DTLS [97], the datagram-oriented analog of SSL/TLS [35]. In contrast with DTLS's goal of layering security atop datagram transports such as UDP or DCCP, Minion's goal is to offer efficient datagram delivery atop any available OS-level substrate, including TCP.

Minion consists of several application-level transport protocols, together with a set of optional enhancements to end hosts' OS-level TCP implementations. Minion's enhanced OS-level TCP stack, called uTCP (unordered TCP), includes sender- and receiver-side API features supporting unordered delivery and prioritization, as detailed later in this section. These enhancements affect only the OS API through which application-level transports such as Minion interact with the TCP stack, and make no changes to TCP's wire protocol.

Minion's application-level protocol suite currently consists of the following main components:

- uCOBS is a protocol that implements a minimal unordered datagram delivery service atop either unmodified TCP or uTCP, using Consistent-Overhead Byte Stuffing (COBS) encoding [23] to facilitate out-of-order datagram delimiting and prioritized delivery, as described later in Section 5.8.3.
- uTLS is a modification of the traditionally stream-oriented TLS [35], offering a secure, unordered datagram delivery service atop TCP or uTCP. The wire encoding of uTLS streams is designed to be indistinguishable in the network from conventional, encrypted TLS-over-TCP streams (e.g., HTTPS), offering a maximally conservative design point that makes no network-visible changes below the yellow line in Figure 16. Section 5.8.3 describes uTLS.
- Shim layers atop OS-level datagram transports, such as UDP and DCCP, offer applications a consistent API for unordered delivery across multiple OS-level transports. Since these shims are merely wrappers for OS transports already offering unordered delivery, this chapter does not discuss them in detail.

Minion currently leaves to the application the decision of which protocol to use for a given connection: e.g., uCOBS or uTLS atop TCP/uTCP, or OS-level UDP or DCCP via Minion's shims. Other ongoing work explores negotiation protocols that search the protocol configuration space dynamically, optimizing protocol selection and configuration for the application's needs and the network's constraints [46]. Many applications already incorporate simple negotiation schemes, however - e.g., attempting a UDP connection first and falling back to TCP if that fails - and adapting these mechanisms to engage Minion's protocols according to application-defined preferences and decision criteria should be straightforward.
Minion still offers incremental performance benefits when only one endpoint is upgraded, since uTCP's sender-side and receiver-side enhancements are independent. A uCOBS or uTLS connection atop a mixed TCP/uTCP endpoint pair benefits from uTCP's sender-side enhancements for datagrams sent by the uTCP endpoint, and from uTCP's receiver-side enhancements for datagrams arriving at the uTCP host.

Addressing the challenge of network compatibility with middleboxes that filter new OS-level transports and sometimes UDP, Minion offers application-level transports a continuum of substrates representing different trade-offs between suitability to the application's needs and compatibility with the network. An application can use unordered OS-level transports such as UDP, DCCP [67], or SCTP [105] on paths where they operate, but Minion also offers an unordered delivery alternative that is wire-compatible not only with TCP, but with the ubiquitous TLS-over-TCP streams on which HTTPS (and hence Web security and E-commerce) are based - and hence likely to operate in almost any network environment purporting to offer Internet access.
A conventional TCP receiver delivers data in order to the receiving application, holding back any data that is received out of order. uTCP modifies the TCP receive path, enabling a receiving application to request immediate delivery of data that is received by uTCP, both in order and out of order. uTCP makes two modifications to a conventional TCP receiver.

First, whereas a conventional TCP stack delivers received data to the application only when prior gaps in the TCP sequence space are filled, the uTCP receiver makes data segments available to the application immediately upon receipt, skipping TCP's usual reordering queue. The data the uTCP stack delivers to the application in successive application reads may skip forward and backward in the transmitted byte stream, and uTCP may even deliver portions of the transmitted stream multiple times. uTCP guarantees only that the data returned by each application read corresponds to some contiguous sequence of bytes in the sender's transmitted stream.

Second, when servicing an application's read, the uTCP receiver also delivers the logical offset of the first returned byte in the sender's original byte stream - information that a TCP receiver must maintain anyway to arrange received segments in order.

Figure 17 illustrates uTCP's receive-side behavior in a simple scenario where three TCP segments arrive in succession: first an in-order segment, then an out-of-order segment, and finally a segment filling the gap
Figure 17: Delivery behavior of (a) standard TCP, and (b) uTCP, upon receipt of in-order and out-of-order segments. (Reprinted from [78]. Included here by permission.)

between the first two. With uTCP, the application receives each segment as soon as it arrives, along with the sequence number information it needs to reconstruct a complete internal view of whichever fragments of the TCP stream have arrived (a sketch of such an application-side view follows).
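Assuming a read call that returns an (offset, data) pair as just described, an application-level library might maintain its view of the stream as in the sketch below; the class and all names are ours, not uTCP's actual API.

    # Reassembling a view of the sender's stream from uTCP-style reads.
    class StreamView:
        def __init__(self):
            self.fragments = {}  # logical offset -> bytes (duplicates overwrite)

        def deliver(self, offset, data):
            """Record a chunk as soon as uTCP delivers it, in or out of order."""
            self.fragments[offset] = data

        def contiguous_prefix(self):
            """Return the longest in-order prefix received so far."""
            buf = bytearray()
            while True:
                chunk = self.fragments.get(len(buf))
                if chunk is None:
                    # an overlapping fragment may still cover this position
                    chunk = next((d[len(buf) - o:]
                                  for o, d in self.fragments.items()
                                  if o <= len(buf) < o + len(d)), None)
                if not chunk:
                    return bytes(buf)
                buf += chunk

    view = StreamView()
    view.deliver(0, b"hello ")      # in-order: usable immediately
    view.deliver(12, b"transport")  # out-of-order: also delivered immediately
    view.deliver(6, b"world ")      # the gap-filling segment arrives last
    assert view.contiguous_prefix() == b"hello world transport"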
5.8.2 uTCP: Sender-Side Modifications

While uTCP's receiver-side enhancements address the latency tax on segments waiting in TCP's reordering buffer, TCP's sender-side queue can also introduce latency, as segments the application has already written to a TCP socket - and hence committed to the network - wait until TCP's flow and congestion control allow their transmission. Many applications can benefit from the ability to late-bind their decision on what to send until the last possible moment, and also from being able to transmit a message of higher priority that bypasses any lower-priority messages in the sender-side queue.

A uTCP sender allows a sending application to specify a tag with each application write, which the uTCP sender currently interprets as a priority level. Instead of unconditionally placing the newly written data at the tail of the send queue as TCP normally would, uTCP inserts the newly written data into the send queue just before any lower-priority data that has not yet been transmitted (a sketch of this queue discipline follows).

With these modifications to a TCP stack, none of which require changes to the TCP wire format, uTCP offers an interface which, while not convenient for applications, is powerful. In the next section, we discuss how a userspace library can use this interface to provide a simple unordered delivery service, unordered delivery of encrypted messages, and logically separate data streams within a single uTCP connection.
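A minimal sketch of such a tagged send queue appears below, assuming integer priorities; the class is our own model of the behavior, not the kernel interface itself.

    from collections import deque

    class PrioritySendQueue:
        """Untransmitted data can be overtaken by higher-priority writes."""
        def __init__(self):
            self.queue = deque()  # (priority, bytes) not yet handed to TCP

        def write(self, data, priority=0):
            # Insert just before any lower-priority untransmitted data.
            pos = len(self.queue)
            while pos > 0 and self.queue[pos - 1][0] < priority:
                pos -= 1
            self.queue.insert(pos, (priority, data))

        def next_segment(self):
            # Called when flow/congestion control permits a transmission.
            return self.queue.popleft()[1] if self.queue else None

    q = PrioritySendQueue()
    q.write(b"bulk-1"); q.write(b"bulk-2")
    q.write(b"urgent", priority=7)  # bypasses the queued bulk data
    assert q.next_segment() == b"urgent"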
5.8.3 Datagrams atop uTCP

Applications built on datagram substrates such as UDP generally assume the underlying layer preserves datagram boundaries. TCP's stream-oriented semantics do not preserve any application-relevant frame boundaries within a stream, however. Both the TCP sender and network middleboxes can and do coalesce TCP segments or re-segment TCP streams in unpredictable ways [58]. Atop uTCP, a userspace library can reconstruct contiguous fragments in the received data stream using the sequence number metadata that uTCP passes along at the receiver. However, providing an unordered message delivery service atop uTCP requires delimiting application messages in the bytestream. While record delimiting is commonly done by application protocols such as HTTP, SIP, and many others, a key property required of a true unordered delivery service is that a receiver must be able to extract a given message independently of other messages. That is, as soon as a complete message is received, the message delimiting mechanism must allow extraction of the message from the bytestream fragment, without relying on the receipt of earlier messages. Minion implements self-delimiting messages in two ways (a sketch of the first mechanism follows this list):

1. To encode application datagrams efficiently, the userspace library employs Consistent-Overhead Byte Stuffing (COBS) [23] to delimit and extract messages. COBS is a binary encoding which eliminates exactly one byte value from a record's encoding with minimal bandwidth overhead. To encode an application record, COBS first scans the record for runs of contiguous marker-free data followed by exactly one marker byte. COBS then removes the trailing marker, instead prepending a non-marker byte indicating the run length. A special run-length value indicates a run of 254 bytes not followed by a marker in the original data, enabling COBS to divide arbitrary-length runs into 254-byte runs encoded into 255 bytes each, yielding a worst-case expansion of only 0.4%.

2. The userspace library coaxes out-of-order delivery from the existing TCP-oriented TLS wire format, producing an encrypted datagram substrate indistinguishable on the wire from standard TLS connections. TLS [35] already breaks its communication into records, encrypts and authenticates each record, and prepends a header for transmission on the underlying TCP stream. TLS was designed to decrypt records strictly in order, however, creating challenges which the userspace library overcomes [78]. Run on port 443, this encrypted stream atop uTCP is indistinguishable from HTTPS - regardless of whether the application actually uses HTTP headers, since the HTTP portion of HTTPS streams is TLS-encrypted anyway. Deployed this way, Minion effectively offers an end-to-end protected substrate in the spirit of the "HTTP as the new narrow waist" philosophy [86].
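The following is a hedged sketch of COBS framing as described in item 1 above, using 0x00 as the marker byte; it is our own illustrative code, not Minion's implementation.

    # COBS: remove every 0x00 from a record, at <= 0.4% size overhead.
    def cobs_encode(data: bytes) -> bytes:
        out, block = bytearray(), bytearray()
        for byte in data:
            if byte == 0:
                out.append(len(block) + 1)  # run length replaces the marker
                out += block
                block.clear()
            else:
                block.append(byte)
                if len(block) == 254:       # special full-length run
                    out.append(0xFF)
                    out += block
                    block.clear()
        out.append(len(block) + 1)
        out += block
        return bytes(out)

    def cobs_decode(enc: bytes) -> bytes:
        out, i = bytearray(), 0
        while i < len(enc):
            code = enc[i]
            out += enc[i + 1:i + code]
            i += code
            if code < 0xFF and i < len(enc):
                out.append(0)               # restore the implicit marker
        return bytes(out)

    record = b"a\x00bc\x00"
    assert cobs_decode(cobs_encode(record)) == record
    assert b"\x00" not in cobs_encode(record)  # safe to delimit with 0x00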
5.9
Minion's unordered delivery service benefits a number of applications; we refer the reader to the detailed discussion and experiments in [78]. Of these applications, we briefly discuss how Minion fits within ongoing efforts to develop a next-generation transport for the web, such as SPDY [1] and the Internet Engineering Task Force (IETF)'s HTTP/2.0 (httpbis) effort.

Developing a next-generation HTTP requires either submitting to TCP's latency tax for backward compatibility, as with SPDY's use of TLS/TCP, or developing and deploying new transports atop UDP - neither of which, as we discussed earlier in this section, is a satisfying alternative. Minion bridges this gap and demonstrates that it is possible to obtain unordered delivery from wire-compatible TCP and TLS streams with surprisingly small changes to TCP stacks and application-level code. These protocols offer latency-sensitive applications performance benefits comparable to UDP or DCCP, with the compatibility benefits of TCP and TLS. Without discounting the value of UDP and newer OS-level transports, Minion offers a more conservative path toward the performance benefits of unordered delivery, which we expect to be useful to applications that use TCP for a variety of pragmatic reasons.
6 Conclusion
The transport layer in the Internet evolved for nearly two decades, but it has been stuck for over a decade now. A proliferation of middleboxes in the Internet - devices in the network that look past the IP header - has shifted the waist of the Internet hourglass upward from IP to include UDP and TCP, the legacy workhorses of the Internet. While popular for many different reasons, middleboxes thus deviate from the Internet's end-to-end design, creating large deployment black holes: singularities where legacy transports get through but any new transport technology or protocol fails, severely limiting transport protocol evolution. The fallout of this ossification is that new transport protocols, such as SCTP and DCCP, that were developed to offer much-needed richer end-to-end services to applications, have had trouble getting deployed, since they require changes to extant middleboxes.

Multipath TCP is perhaps the most significant change to TCP in the past twenty years. It allows existing TCP applications to achieve better performance and robustness over today's networks, and it has been standardized at the IETF. The Linux kernel implementation shows that these benefits can be obtained in practice. However, as with any change to TCP, the deployment bar for Multipath TCP is very high: only time will tell whether the benefits it brings will outweigh the complexity it adds to end-host stacks. The design of Multipath TCP has been a lengthy, painful process that took around five years. Most of the difficulty came from the need to support existing middlebox behaviors while offering the exact same service to applications as TCP. Although the design space seemed wide open in the beginning, in the end we were only just able to evolve TCP this way: for many of the design choices there was only one viable option that could be used. When the next major TCP extension is designed in a network with even more middleboxes, will we, as a community, be as lucky?

A pragmatic answer to the inability to deploy new transport protocols is Minion. It allows deploying new transport services while remaining backward compatible with middleboxes, by encapsulating new protocols inside TCP. Minion demonstrates that it is possible to obtain unordered delivery and multistreaming from wire-compatible TCP and TLS streams with surprisingly small changes to TCP stacks and application-level code. Minion offers a path toward the performance benefits of unordered delivery, which we expect to be useful to applications that use TCP for a variety of pragmatic reasons.

Early in the Internet's history, all IP packets could travel freely through the Internet, as IP was the narrow waist of the protocol stack. Eventually, applications started using UDP and TCP exclusively, and some, such as Skype, used them adaptively, probably due to security concerns in addition to the increasing proliferation of middleboxes that allowed only UDP and TCP through. As a result, UDP and TCP over IP were then perceived to constitute the new waist of the Internet. (We note that HTTP has also recently been suggested as the new waist [86].) Our observation is that whatever the new waist is, middleboxes will embrace it and optimize for it: if MPTCP and/or Minion become popular, it is likely that middleboxes will be devised that understand these protocols, to optimize for their most successful use cases and to help protect any vulnerable applications using them.
One immediate answer from an application would be to use the encrypted communication proposed in Minion - but actively hiding information from a network operator can encourage the operator to deploy middleboxes that intercept encrypted connections, effectively mounting man-in-the-middle attacks to control traffic over its network, as is already being done in several corporate firewalls [72]. To bypass these middleboxes, new applications may encapsulate their data even deeper, leading to a vicious circle resembling an arms race for control over network use.

This arms race is a symptom of a fundamental tussle between end-hosts and the network: end-hosts will always want to deploy new applications and services, while the network will always want to allow and optimize only existing ones [28]. To break out of this vicious circle, we propose that end-hosts and the
network must cooperate, and that they must build this cooperation into their protocols. Designing and providing protocols and incentives for this cooperation may hold the key to creating a truly evolvable transport (and Internet) architecture.
Acknowledgements
We would like to thank the reviewers, whose comments have improved this chapter. We would also like to thank Adrian Vladulescu for the measurements presented in Figure 13.
References
[1] SPDY: An Experimental Protocol For a Faster Web. http://www.chromium.org/spdy/spdy-whitepaper.
[2] ZeroMQ: The intelligent transport layer. http://www.zeromq.org.
[3] AFANASYEV, A., TILLEY, N., REIHER, P., AND KLEINROCK, L. Host-to-Host Congestion Control for TCP. IEEE Communications Surveys & Tutorials 12, 3 (2012), 304-342.
[4] AGACHE, A., AND RAICIU, C. GRIN: utilizing the empty half of full bisection networks. In Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing (Berkeley, CA, USA, 2012), HotCloud '12, USENIX Association, pp. 7-7.
[5] ALLMAN, M. Comments on selecting ephemeral ports. SIGCOMM Comput. Commun. Rev. 39, 2 (Mar. 2009), 13-19.
[6] ALLMAN, M., FLOYD, S., AND PARTRIDGE, C. Increasing TCP's Initial Window. RFC 3390 (Proposed Standard), Oct. 2002.
[7] ALLMAN, M., OSTERMANN, S., AND METZ, C. FTP Extensions for IPv6 and NATs. RFC 2428 (Proposed Standard), Sept. 1998.
[8] ALLMAN, M., PAXSON, V., AND BLANTON, E. TCP Congestion Control. RFC 5681 (Draft Standard), Sept. 2009.
[9] AUDET, F., AND JENNINGS, C. Network Address Translation (NAT) Behavioral Requirements for Unicast UDP. RFC 4787 (Best Current Practice), Jan. 2007.
[10] BAKER, F. Requirements for IP Version 4 Routers. RFC 1812 (Proposed Standard), June 1995.
[11] BALAKRISHNAN, H., RAHUL, H., AND SESHAN, S. An integrated congestion management architecture for internet hosts. In Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication (New York, NY, USA, 1999), SIGCOMM '99, ACM, pp. 175-187.
[12] BASET, S. A., AND SCHULZRINNE, H. An analysis of the Skype peer-to-peer Internet telephony protocol. In IEEE INFOCOM (Apr. 2006).
[13] BEGEN, A., WING, D., AND CAENEGEM, T. V. Port Mapping between Unicast and Multicast RTP Sessions. RFC 6284 (Proposed Standard), June 2011.
[14] BEVERLY, R., BERGER, A., HYUN, Y., AND CLAFFY, K. Understanding the efficacy of deployed internet source address validation filtering. In Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference (New York, NY, USA, 2009), IMC '09, ACM, pp. 356-369.
[15] BIRRELL, A., AND NELSON, B. Implementing remote procedure calls. ACM Trans. Comput. Syst. 2, 1 (Feb. 1984), 39-59.
[16] BONAVENTURE, O. Computer Networking: Principles, Protocols and Practice. Saylor Foundation, 2012. Available from http://inl.info.ucl.ac.be/cnp3.
[17] BONAVENTURE, O., HANDLEY, M., AND RAICIU, C. An Overview of Multipath TCP. Usenix ;login: magazine 37, 5 (Oct. 2012).
[18] BRADEN, R. Requirements for Internet Hosts - Communication Layers. RFC 1122 (INTERNET STANDARD), Oct. 1989.
[19] BRADEN, R. T/TCP - TCP Extensions for Transactions Functional Specification. RFC 1644 (Historic), July 1994.
[20] BRAKMO, L., O'MALLEY, S., AND PETERSON, L. TCP Vegas: new techniques for congestion detection and avoidance. In Proceedings of the conference on Communications architectures, protocols and applications (New York, NY, USA, 1994), SIGCOMM '94, ACM, pp. 24-35.
[21] BUDZISZ, L., GARCIA, J., BRUNSTROM, A., AND FERRUS, R. A taxonomy and survey of SCTP research. ACM Computing Surveys 44, 4 (Aug. 2012), 1-36.
[22] CARPENTER, B., AND BRIM, S. Middleboxes: Taxonomy and Issues. RFC 3234 (Informational), Feb. 2002.
[23] CHESHIRE, S., AND BAKER, M. Consistent overhead byte stuffing. In Proceedings of the ACM SIGCOMM '97 conference on Applications, technologies, architectures, and protocols for computer communication (New York, NY, USA, 1997), SIGCOMM '97, ACM, pp. 209-220.
[24] CHU, J. Tuning TCP Parameters for the 21st Century. Presented at IETF 75, July 2009.
[25] CHU, J., DUKKIPATI, N., CHENG, Y., AND MATHIS, M. Increasing TCP's Initial Window. Internet draft, draft-ietf-tcpm-initcwnd, work in progress, February 2013.
[26] CISCO. Rate-Based Satellite Control Protocol, 2006.
[27] CLARK, D. The design philosophy of the DARPA internet protocols. In Symposium proceedings on Communications architectures and protocols (New York, NY, USA, 1988), SIGCOMM '88, ACM, pp. 106-114.
[28] CLARK, D., WROCLAWSKI, J., SOLLINS, K., AND BRADEN, R. Tussle in cyberspace: defining tomorrow's internet. In Proceedings of the 2002 conference on Applications, technologies, architectures, and protocols for computer communications (New York, NY, USA, 2002), SIGCOMM '02, ACM, pp. 347-356.
[29] CLARK, D. D., AND TENNENHOUSE, D. L. Architectural considerations for a new generation of protocols. In Proceedings of the ACM symposium on Communications architectures & protocols (New York, NY, USA, 1990), SIGCOMM '90, ACM, pp. 200-208.
[30] D'ACUNTO, L., POUWELSE, J., AND SIPS, H. A Measurement of NAT and Firewall Characteristics in Peer-to-Peer Systems. In Proceedings of the ASCI Conference (2009).
[31] DAVIES, J. DirectAccess and the thin edge network. Microsoft TechNet Magazine (May 2009).
[32] DE VIVO, M., DE VIVO, G., KOENEKE, R., AND ISERN, G. Internet vulnerabilities related to TCP/IP and T/TCP. SIGCOMM Comput. Commun. Rev. 29, 1 (Jan. 1999), 81-85.
[33] DETAL, G. tracebox. http://www.tracebox.org.
[34] DHAMDHERE, A., LUCKIE, M., HUFFAKER, B., CLAFFY, K., ELMOKASHFI, A., AND ABEN, E. Measuring the deployment of IPv6: topology, routing and performance. In Proceedings of the 2012 ACM conference on Internet measurement conference (New York, NY, USA, 2012), IMC '12, ACM, pp. 537-550.
[35] DIERKS, T., AND RESCORLA, E. The Transport Layer Security (TLS) Protocol Version 1.2. RFC 5246 (Proposed Standard), Aug. 2008.
[36] DUKE, M., BRADEN, R., EDDY, W., AND BLANTON, E. A Roadmap for Transmission Control Protocol (TCP) Specification Documents. RFC 4614 (Informational), Sept. 2006.
[37] EDDY, W. TCP SYN Flooding Attacks and Common Mitigations. RFC 4987 (Informational), Aug. 2007.
[38] EGEVANG, K., AND FRANCIS, P. The IP Network Address Translator (NAT). RFC 1631 (Informational), May 1994.
[39] FABER, T., TOUCH, J., AND YUE, W. The TIME-WAIT state in TCP and its effect on busy servers. In INFOCOM '99. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE (1999), vol. 3, IEEE, pp. 1573-1583.
[40] FALL, K., AND STEVENS, R. TCP/IP Illustrated, Volume 1: The Protocols, vol. 1. Addison-Wesley Professional, 2011.
[41] FERGUSON, P., AND SENIE, D. Network Ingress Filtering: Defeating Denial of Service Attacks which employ IP Source Address Spoofing. RFC 2827 (Best Current Practice), May 2000.
[42] FONSECA, R., PORTER, G., KATZ, R., SHENKER, S., AND STOICA, I. IP options are not an option. Tech. Rep. UCB/EECS-2005-24, UC Berkeley, Berkeley, CA, 2005.
[43] FORD, A., RAICIU, C., HANDLEY, M., AND BONAVENTURE, O. TCP Extensions for Multipath Operation with Multiple Addresses. RFC 6824 (Experimental), Jan. 2013.
[44] FORD, B. Structured streams: a new transport abstraction. In Proceedings of the 2007 conference on Applications, technologies, architectures, and protocols for computer communications (New York, NY, USA, 2007), SIGCOMM '07, ACM, pp. 361-372.
[45] FORD, B., AND IYENGAR, J. Breaking up the transport logjam. In 7th Workshop on Hot Topics in Networks (HotNets-VII) (Oct. 2008).
[46] FORD, B., AND IYENGAR, J. Efficient cross-layer negotiation. In 8th Workshop on Hot Topics in Networks (HotNets-VIII) (Oct. 2009).
[47] GONT, F. Survey of Security Hardening Methods for Transmission Control Protocol (TCP) Implementations. Internet draft, draft-ietf-tcpm-tcp-security, work in progress, March 2012.
[48] GONT, F., AND BELLOVIN, S. Defending against Sequence Number Attacks. RFC 6528 (Proposed Standard), Feb. 2012.
[49] GUHA, S., BISWAS, K., FORD, B., SIVAKUMAR, S., AND SRISURESH, P. NAT Behavioral Requirements for TCP. RFC 5382 (Best Current Practice), Oct. 2008.
[50] GUO, L., TAN, E., CHEN, S., XIAO, Z., SPATSCHECK, O., AND ZHANG, X. Delving into internet streaming media delivery: a quality and resource utilization perspective. In Proceedings of the 6th ACM SIGCOMM conference on Internet measurement (New York, NY, USA, 2006), IMC '06, ACM, pp. 217-230.
[51] HAIN, T. Architectural Implications of NAT. RFC 2993 (Informational), Nov. 2000.
[52] HAN, H., SHAKKOTTAI, S., HOLLOT, C., SRIKANT, R., AND TOWSLEY, D. Multi-path TCP: a joint congestion control and routing scheme to exploit path diversity in the Internet. IEEE/ACM Trans. Networking 14, 6 (2006).
[53] HANDLEY, M., PAXSON, V., AND KREIBICH, C. Network intrusion detection: Evasion, traffic normalization, and end-to-end protocol semantics. In Proc. USENIX Security Symposium (2001), pp. 9-9.
[54] HANDLEY, M., PAXSON, V., AND KREIBICH, C. Network Intrusion Detection: Evasion, Traffic Normalization, and End-to-end Protocol Semantics. In SSYM '01: Proceedings of the 10th conference on USENIX Security Symposium (Berkeley, CA, USA, 2001), USENIX Association, pp. 9-9.
[55] HAYES, D., BUT, J., AND ARMITAGE, G. Issues with network address translation for SCTP. SIGCOMM Comput. Commun. Rev. 39, 1 (Dec. 2008), 23-33.
[56] HESMANS, B. Click elements to model middleboxes. https://bitbucket.org/bhesmans/click.
[57] HOLDREGE, M., AND SRISURESH, P. Protocol Complications with the IP Network Address Translator. RFC 3027 (Informational), Jan. 2001.
[58] HONDA, M., NISHIDA, Y., RAICIU, C., GREENHALGH, A., HANDLEY, M., AND TOKUDA, H. Is it still possible to extend TCP? In Proceedings of the 2011 ACM SIGCOMM conference on Internet measurement conference (New York, NY, USA, 2011), IMC '11, ACM, pp. 181-194.
[59] HUITEMA, C. Multi-homed TCP. Internet draft, work in progress, 1995.
[60] IREN, S., AMER, P., AND CONRAD, P. The transport layer: tutorial and survey. ACM Comput. Surv. 31, 4 (Dec. 1999), 360-404.
[61] IYENGAR, J., FORD, B., AILAWADI, D., AMIN, S. O., NOWLAN, M., TIWARI, N., AND WISE, J. Minion - an All-Terrain Packet Packhorse to Jump-Start Stalled Internet Transports. In PFLDNeT 2010 (Nov. 2010).
[62] JACOBSON, V. Congestion avoidance and control. In Symposium proceedings on Communications architectures and protocols (New York, NY, USA, 1988), SIGCOMM '88, ACM, pp. 314-329.
[63] JACOBSON, V., BRADEN, R., AND BORMAN, D. TCP Extensions for High Performance. RFC 1323 (Proposed Standard), May 1992.
[64] JIANG, S., GUO, D., AND CARPENTER, B. An Incremental Carrier-Grade NAT (CGN) for IPv6 Transition. RFC 6264 (Informational), June 2011.
[65] KELLY, F., AND VOICE, T. Stability of end-to-end algorithms for joint routing and rate control. SIGCOMM Comput. Commun. Rev. 35, 2 (Apr. 2005), 5-12.
[66] KHALILI, R., GAST, N., POPOVIC, M., UPADHYAY, U., AND LE BOUDEC, J.-Y. MPTCP is not Pareto-optimal: performance issues and a possible solution. In Proceedings of the 8th international conference on Emerging networking experiments and technologies (New York, NY, USA, 2012), CoNEXT '12, ACM, pp. 1-12.
[67] KOHLER, E., HANDLEY, M., AND FLOYD, S. Datagram Congestion Control Protocol (DCCP). RFC 4340 (Proposed Standard), Mar. 2006.
[68] KOHLER, E., MORRIS, R., CHEN, B., JANNOTTI, J., AND KAASHOEK, F. The Click modular router. ACM Trans. Comput. Syst. 18, 3 (Aug. 2000), 263-297.
[69] LARSEN, M., AND GONT, F. Recommendations for Transport-Protocol Port Randomization. RFC 6056 (Best Current Practice), Jan. 2011.
[70] LARZON, L.-A., DEGERMARK, M., PINK, S., JONSSON, L.-E., AND FAIRHURST, G. The Lightweight User Datagram Protocol (UDP-Lite). RFC 3828 (Proposed Standard), July 2004.
[71] LEECH, M., GANIS, M., LEE, Y., KURIS, R., KOBLAS, D., AND JONES, L. SOCKS Protocol Version 5. RFC 1928 (Proposed Standard), Mar. 1996.
[72] MARKO, K. Using SSL Proxies To Block Unauthorized SSL VPNs. Processor Magazine, www.processor.com 32, 16 (July 2010), 23.
[73] MATHIS, M., AND HEFFNER, J. Packetization Layer Path MTU Discovery. RFC 4821 (Proposed Standard), Mar. 2007.
[74] MATHIS, M., MAHDAVI, J., FLOYD, S., AND ROMANOW, A. TCP Selective Acknowledgment Options. RFC 2018 (Proposed Standard), Oct. 1996.
[75] MCLAGGAN, D. Web Cache Communication Protocol V2, Revision 1. Internet draft, draft-mclaggan-wccp-v2rev1, work in progress, August 2012.
[76] MOGUL, J. TCP offload is a dumb idea whose time has come. In HotOS IX (May 2003).
[77] NORTHCUTT, S., ZELTSER, L., WINTERS, S., KENT, K., AND RITCHEY, R. Inside Network Perimeter Security. SAMS Publishing, 2005.
[78] NOWLAN, M., TIWARI, N., IYENGAR, J., AMIN, S. O., AND FORD, B. Fitting square pegs through round pipes: Unordered delivery wire-compatible with TCP and TLS. In NSDI (Apr. 2012), vol. 12.
[79] ONG, L., AND YOAKUM, J. An Introduction to the Stream Control Transmission Protocol (SCTP). RFC 3286 (Informational), May 2002.
[80] PAASCH, C., DETAL, G., DUCHENE, F., RAICIU, C., AND BONAVENTURE, O. Exploring mobile/WiFi handover with multipath TCP. In Proceedings of the 2012 ACM SIGCOMM workshop on Cellular networks: operations, challenges, and future design (New York, NY, USA, 2012), CellNet '12, ACM, pp. 31-36.
[81] PAXSON, V., AND ALLMAN, M. Computing TCP's Retransmission Timer. RFC 2988 (Proposed Standard), Nov. 2000.
[82] PAXSON, V., ALLMAN, M., CHU, J., AND SARGENT, M. Computing TCP's Retransmission Timer. RFC 6298 (Proposed Standard), June 2011.
[83] PERREAULT, S., YAMAGATA, I., MIYAKAWA, S., NAKAGAWA, A., AND ASHIDA, H. Common Requirements for Carrier-Grade NATs (CGNs). RFC 6888 (Best Current Practice), Apr. 2013.
[84] PFEIFFER, R. Measuring TCP Congestion Windows. Linux Gazette, March 2007.
[85] PHELAN, T., FAIRHURST, G., AND PERKINS, C. DCCP-UDP: A Datagram Congestion Control Protocol UDP Encapsulation for NAT Traversal. RFC 6773 (Proposed Standard), Nov. 2012.
[86] POPA, L., GHODSI, A., AND STOICA, I. HTTP as the narrow waist of the future internet. In Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks (New York, NY, USA, 2010), Hotnets-IX, ACM, pp. 6:1-6:6.
[87] POSTEL, J. User Datagram Protocol. RFC 768 (INTERNET STANDARD), Aug. 1980.
[88] POSTEL, J. Internet Protocol. RFC 791 (INTERNET STANDARD), Sept. 1981.
[89] POSTEL, J. Transmission Control Protocol. RFC 793 (INTERNET STANDARD), Sept. 1981.
[90] POSTEL, J., AND REYNOLDS, J. File Transfer Protocol. RFC 959 (INTERNET STANDARD), Oct. 1985.
[91] RADHAKRISHNAN, S., CHENG, Y., CHU, J., JAIN, A., AND RAGHAVAN, B. TCP Fast Open. In Proceedings of the Seventh Conference on emerging Networking EXperiments and Technologies (New York, NY, USA, 2011), CoNEXT '11, ACM, pp. 21:1-21:12.
[92] RAICIU, C., BARRE, S., PLUNTKE, C., GREENHALGH, A., WISCHIK, D., AND HANDLEY, M. Improving datacenter performance and robustness with multipath TCP. In Proceedings of the ACM SIGCOMM 2011 conference (New York, NY, USA, 2011), SIGCOMM '11, ACM, pp. 266-277.
[93] RAICIU, C., HANDLEY, M., AND WISCHIK, D. Coupled Congestion Control for Multipath Transport Protocols. RFC 6356 (Experimental), Oct. 2011.
[94] RAICIU, C., PAASCH, C., BARRE, S., FORD, A., HONDA, M., DUCHENE, F., BONAVENTURE, O., AND HANDLEY, M. How hard can it be? Designing and implementing a deployable multipath TCP. In NSDI (2012), vol. 12, pp. 29-29.
[95] REIS, C., ET AL. Detecting in-flight page changes with web tripwires. In Symposium on Networked System Design and Implementation (NSDI) (Apr. 2008).
[96] REKHTER, Y., MOSKOWITZ, B., KARRENBERG, D., DE GROOT, G. J., AND LEAR, E. Address Allocation for Private Internets. RFC 1918 (Best Current Practice), Feb. 1996.
[97] RESCORLA, E., AND MODADUGU, N. Datagram Transport Layer Security. RFC 4347 (Proposed Standard), Apr. 2006.
[98] ROSENBERG, J. UDP and TCP as the new waist of the Internet hourglass, Feb. 2008. Internet-Draft (Work in Progress).
[99] ROSS, K., AND KUROSE, J. Computer Networking: A Top-Down Approach Featuring the Internet. Addison Wesley, 2012.
[100] SALTZER, J., REED, D., AND CLARK, D. End-to-end arguments in system design. ACM Transactions on Computer Systems (TOCS) 2, 4 (1984), 277-288.
[101] SCHULZRINNE, H., CASNER, S., FREDERICK, R., AND JACOBSON, V. RTP: A Transport Protocol for Real-Time Applications. RFC 3550 (INTERNET STANDARD), July 2003.
[102] SEMKE, J., MAHDAVI, J., AND MATHIS, M. Automatic TCP buffer tuning. In Proceedings of the ACM SIGCOMM '98 conference on Applications, technologies, architectures, and protocols for computer communication (New York, NY, USA, 1998), SIGCOMM '98, ACM, pp. 315-323.
[103] SHERRY, J., HASAN, S., SCOTT, C., KRISHNAMURTHY, A., RATNASAMY, S., AND SEKAR, V. Making middleboxes someone else's problem: network processing as a cloud service. In Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication (New York, NY, USA, 2012), SIGCOMM '12, ACM, pp. 13-24.
[104] SRISURESH, P., FORD, B., SIVAKUMAR, S., AND GUHA, S. NAT Behavioral Requirements for ICMP. RFC 5508 (Best Current Practice), Apr. 2009.
[105] STEWART, R. Stream Control Transmission Protocol. RFC 4960 (Proposed Standard), Sept. 2007.
[106] STEWART, R., RAMALHO, M., XIE, Q., TUEXEN, M., AND CONRAD, P. Stream Control Transmission Protocol (SCTP) Partial Reliability Extension. RFC 3758 (Proposed Standard), May 2004.
[107] STONE, J., AND PARTRIDGE, C. When the CRC and TCP checksum disagree. In Proceedings of the conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (New York, NY, USA, 2000), SIGCOMM '00, ACM, pp. 309-319.
[108] STRAYER, T., DEMPSEY, B., AND WEAVER, A. XTP: The Xpress transfer protocol. Addison-Wesley Publishing Company, 1992.
[109] TOUCH, J. TCP Control Block Interdependence. RFC 2140 (Informational), Apr. 1997.
[110] TUEXEN, M., AND STEWART, R. UDP Encapsulation of SCTP Packets for End-Host to End-Host Communication. Internet draft, draft-ietf-tsvwg-sctp-udp-encaps, work in progress, March 2013.
[111] VASUDEVAN, V., PHANISHAYEE, A., SHAH, H., KREVAT, E., ANDERSEN, D., GANGER, G., GIBSON, G., AND MUELLER, B. Safe and effective fine-grained TCP retransmissions for datacenter communication. In Proceedings of the ACM SIGCOMM 2009 conference on Data communication (New York, NY, USA, 2009), SIGCOMM '09, ACM, pp. 303-314.
[112] VELTEN, D., HINDEN, R., AND SAX, J. Reliable Data Protocol. RFC 908 (Experimental), July 1984.
[113] VUTUKURU, M., BALAKRISHNAN, H., AND PAXSON, V. Efficient and Robust TCP Stream Normalization. In IEEE Symposium on Security and Privacy (2008), IEEE, pp. 96-110.
[114] W3C. The WebSocket API (draft), 2011. http://dev.w3.org/html5/websockets/.
[115] WANG, Z., QIAN, Z., XU, Q., MAO, Z., AND ZHANG, M. An untold story of middleboxes in cellular networks. In Proceedings of the ACM SIGCOMM 2011 conference (New York, NY, USA, 2011), SIGCOMM '11, ACM, pp. 374-385.
[116] WISCHIK, D., HANDLEY, M., AND BAGNULO, M. The resource pooling principle. SIGCOMM Comput. Commun. Rev. 38, 5 (Sept. 2008), 47-52.
[117] WISCHIK, D., RAICIU, C., GREENHALGH, A., AND HANDLEY, M. Design, implementation and evaluation of congestion control for Multipath TCP. In Proceedings of the 8th USENIX conference on Networked systems design and implementation (Berkeley, CA, USA, 2011), NSDI '11, USENIX Association, pp. 8-8.
[118] ZEC, M., MIKUC, M., AND ZAGAR, M. Estimating the impact of interrupt coalescing delays on steady state TCP throughput. In SoftCOM (2002).
[119] ZIMMERMANN, H. OSI reference model - The ISO model of architecture for open systems interconnection. Communications, IEEE Transactions on 28, 4 (1980), 425-432.
Exercises
This appendix contains a few exercises on the evolution of transport protocols and their interactions with middleboxes.
an unreliable bytestream that prevents data corruption but can deliver holes in the bytestream
5. One of the issues that limits the extensibility of TCP is the limited amount of space available to encode TCP options. This limited space is particularly penalizing in the SYN segment (the arithmetic sketch below shows how much a typical SYN already consumes). Explore two possible ways of reducing the consumption of this precious TCP option space:

- Define a new option that packs several related options into a smaller one. For example, try to combine SACK with Timestamp and Window Scale in a small option for the SYN segment.
- Define a compression scheme that packs TCP options into fewer bytes. The utilisation of this compression scheme would of course need to be negotiated during the three-way handshake.
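As a starting point for this exercise, the snippet below tallies the option space a typical SYN already consumes; the option sizes come from the respective RFCs, and the dictionary itself is our own illustration.

    # TCP options commonly found in a SYN (total encoded size in bytes).
    OPTIONS = {"MSS": 4, "SACK-Permitted": 2, "Timestamp": 10, "WScale": 3}
    MAX_OPTION_SPACE = 40  # 60-byte maximum TCP header minus the fixed 20

    used = sum(OPTIONS.values())  # 19 bytes, before any padding
    # Adding Multipath TCP's MP_CAPABLE (12 bytes in [43]) leaves little room.
    print(f"{used}/{MAX_OPTION_SPACE} bytes used; {MAX_OPTION_SPACE - used} free")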
A.2 Middleboxes
Middleboxes may perform various changes and checks on the packets that they process. Testing real middleboxes can be difficult because it involves installing complex and sometimes costly devices. However, an understanding of the interactions between middleboxes and transport protocols can be useful for protocol designers. A first approach to understanding the impact of middleboxes on transport protocols is to emulate the interference caused by middleboxes. This can be performed by using click [68] elements that emulate the operation of middleboxes [56]:

- ChangeSeqElement changes the sequence number in the TCP header of processed segments, to model a firewall that randomises sequence numbers.
- RemoveTCPOptionElement selectively removes a chosen option from processed TCP segments.
- SegSplitElement selectively splits a TCP segment in two different segments and copies the options into one or both segments.
- SegCoalElement selectively coalesces consecutive segments and uses the TCP options from the first/second segment for the coalesced one.

Using some of these click elements, perform the following tests with one TCP implementation.

1. Using a TCP implementation that supports the timestamp option defined in [63], evaluate the effect of removing this option from the SYN, SYN+ACK or regular TCP segments with the RemoveTCPOptionElement click element.

2. Using a TCP implementation that supports the selective acknowledgement option defined in [74], predict the effect of randomizing the sequence number in the TCP header without updating anything in this option, as done by some firewalls (a scapy-based sketch of this interference appears at the end of this section). Use the ChangeSeqElement click element to experimentally verify your answer. Instead of using random sequence numbers, evaluate the impact of logarithmically increasing/decreasing the sequence numbers (i.e., +10, +100, +1000, +10000, ...).

3. Recent TCP implementations support the large windows extension defined in [63]. This extension uses the WScale option in the SYN and SYN+ACK segments. Evaluate the impact of removing this option from one of these segments with the RemoveTCPOptionElement element. For the experiments, try to force the utilisation of a large receive window by configuring your TCP stack.

4. Some middleboxes split or coalesce segments. Considering Multipath TCP, discuss the impact of splitting and coalescing segments on the correct operation of the protocol. Use the Multipath TCP implementation in the Linux kernel and the SegCoalElement and SegSplitElement click elements to experimentally verify your answer.
5. The extensibility of SCTP depends on the utilisation of chunks. Consider an SCTP-aware middlebox that recognizes the standard SCTP chunks but drops the new ones. Consider for example the partial-reliability extension defined in [106]. Develop a click element that can selectively remove a chunk from processed segments, and evaluate its impact on SCTP experimentally.

Another way to evaluate middleboxes is to try to infer their presence in a network by sending probe packets. This is the approach used by Michio Honda and his colleagues in [58]. However, the TCPExposure software requires a special server and thus only allows probing the path towards this particular server. An alternative is to use tracebox [33]. tracebox is an extension to the popular traceroute tool that allows detecting middleboxes on (almost) any path. tracebox sends TCP and UDP segments inside IP packets that have different Time-To-Live values, like traceroute. When an IPv4 router receives an IPv4 packet whose TTL is about to expire, it returns an ICMPv4 Time Exceeded packet that contains the offending packet. Older routers return in the ICMP message the IP header of the original packet and the first 64 bits of its payload. When the packet contains a TCP segment, these first 64 bits correspond to the source and destination ports and the sequence number. However, recent measurements show that a large fraction of IP routers in the Internet, notably in the core, comply with [10] and thus return the complete original packet. tracebox compares the packet returned inside the ICMP message with the original one to detect any modification performed by middleboxes. All the packets sent and received by tracebox are recorded in a libpcap file that can easily be processed using tcpdump or wireshark.

1. Use tracebox to detect whether the TCP sequence numbers of the segments that your host sends are modified by intermediate firewalls or proxies.

2. Use tracebox behind a Network Address Translator to see whether tracebox is able to detect the modifications performed by the NAT. Try with TCP, UDP and regular IP packets to see whether the results vary with the protocol. Analyse the collected packet traces.

3. Some firewalls and middleboxes change the MSS option in the SYN segments that they process. Can you explain a possible reason for this change? Use tracebox to verify whether a middlebox in your network performs this change.

4. Use tracebox to detect whether the middleboxes that are deployed in your network allow new TCP options, such as the ones used by Multipath TCP, to pass through.

5. Extend tracebox so that it supports the transmission of SCTP segments containing various types of chunks.
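For a quick experiment without a click setup, the interference studied in exercise 2 of the click list above can also be approximated offline. The sketch below uses scapy to shift the TCP sequence numbers in a captured trace while leaving the SACK blocks in the options untouched, which is precisely the firewall behavior in question; a faithful emulation would also rewrite the ACK numbers of the reverse direction. The file names are placeholders.

    from scapy.all import rdpcap, wrpcap, TCP
    import random

    SHIFT = random.randint(0, 2**32 - 1)  # the firewall's sequence offset

    packets = rdpcap("input.pcap")
    for pkt in packets:
        if TCP in pkt:
            pkt[TCP].seq = (pkt[TCP].seq + SHIFT) % 2**32
            del pkt[TCP].chksum  # force scapy to recompute the checksum
    wrpcap("shifted.pcap", packets)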
http://caia.swin.edu.au/urp/newtcp/mptcp/ provides a kernel patch that enables Multipath TCP in the FreeBSD-10.x kernel. This implementation only supports a subset of [43]. The ns-3 network simulator contains two forms of support for Multipath TCP: the first is a Multipath TCP model; the second is the execution of a modified Linux kernel inside ns-3 by using Direct Code Execution. Most of the exercises below can be performed by using one of the above-mentioned simulators or implementations.
1. Several congestion control schemes have been proposed for Multipath TCP and some of them have been implemented. Compare the performance of the congestion control algorithms that your implementation supports.
2. The Multipath TCP congestion control scheme was designed to move traffic away from congested paths. TCP detects congestion through losses. Devise an experiment using one of the above-mentioned simulators/implementations to analyse the performance of Multipath TCP when losses occur.
3. The non-standard TCP_INFO socket option [84] in the Linux kernel allows one to collect information about any active TCP connection. Develop an application that uses TCP_INFO to study the evolution of the Multipath TCP congestion windows (a minimal sketch follows this list).
4. Using the Multipath TCP Mininet or netkit image, experiment with Multipath TCP's fallback mechanism by using ftp to transfer files through a NAT that includes an application level gateway. Collect the packet trace and verify that the fallback works correctly.
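As a starting point for exercise 3, the sketch below polls TCP_INFO on a Linux socket and extracts the sender congestion window. TCP_INFO is not exposed by name in all Python versions, so the Linux option number (11) is given explicitly; the field offsets assume the struct tcp_info layout of contemporary kernels and may differ on other versions, and the peer address is only an example.

    # Hedged sketch: poll the non-standard TCP_INFO socket option on Linux
    # to watch the congestion window of a live connection.
    import socket
    import struct
    import time

    TCP_INFO = 11                      # Linux option number for TCP_INFO

    def snd_cwnd(sock):
        buf = sock.getsockopt(socket.IPPROTO_TCP, TCP_INFO, 104)
        # struct tcp_info starts with 8 one-byte fields (state, ca_state, ...)
        # followed by 32-bit counters; tcpi_snd_cwnd is the 19th such counter
        # (index 18) in 2013-era kernels -- check your kernel's tcp.h.
        counters = struct.unpack_from("21I", buf, 8)
        return counters[18]            # congestion window, in segments

    s = socket.create_connection(("example.org", 80))   # illustrative peer
    s.sendall(b"GET / HTTP/1.0\r\nHost: example.org\r\n\r\n")
    for _ in range(5):
        print("cwnd (segments):", snd_cwnd(s))
        time.sleep(0.5)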
Abstract The increasing demand for various services from the Internet has led to an exponential growth of Internet traffic in the last decade, and that growth is likely to continue. With this demand comes the increasing importance of network operations management, planning, provisioning and traffic engineering. A key input into these processes is the traffic matrix, and this is the focus of this chapter. The traffic matrix represents the volumes of traffic from sources to destinations in a network. Here, we first explore the various issues involved in measuring and characterising these matrices. The insights obtained are used to develop models of the traffic, depending on the properties of traffic to be captured: temporal, spatial or spatio-temporal. The models are then used in various applications, such as the recovery of traffic matrices, network optimisation and engineering activities, anomaly detection and the synthesis of artificial traffic matrices for testing routing protocols. We conclude the chapter by summarising open questions in Internet traffic matrix research and providing a list of resources useful for the researcher and practitioner.
Introduction
In the era of the telephone, voice traffic dominated physical telecommunication lines. With the birth of the Internet, and its subsequent adoption as a key means of communication, data traffic has taken over (though some of this data traffic is actually re-badged voice traffic using Voice over IP). With the advent of new applications such as video streaming, and the rapid growth of data traffic from mobile devices, we are witnessing a global data explosion. Given the ever increasing importance of the Internet, knowledge of its traffic has become increasingly critical. The Internet, however, is just an all-encompassing term to describe the vast global collection of networks, and so it largely falls on individual network providers to determine their own traffic. This knowledge is vital for continued operations because it allows network operators to perform important tasks such as providing enough network capacity to carry the current traffic, as well as to predict and prepare for future trends. Traffic data is also important in network maintenance, which is necessary if services and content are to be provided to customers with minimal interruption. The focus of this chapter is on the traffic matrix, which, in a nutshell, is an abstract representation of the traffic volume flowing between sets of source and destination pairs. Each element in the matrix denotes the amount of traffic between a source and destination pair. There are many variants: depending on the network layer under study, sources and destinations could be routers or even whole networks; and "amount" is generally measured in the number of bytes or packets, but could refer to other quantities, such as connections. Traffic matrices, as will become clearer below, are utilised for a variety of network engineering goals, such as prediction of future traffic trends, network optimisation, protocol design and anomaly detection. Considering
P. Tune, M. Roughan, Internet Traffic Matrices: A Primer, in H. Haddadi, O. Bonaventure (Eds.), Recent Advances in Networking, (2013), pp. xx-yy. Licensed under a CC-BY-SA Creative Commons license.
the widespread use of these matrices, the first objective of this chapter is to provide an entry point for graduate students and new researchers into current research on traffic matrices. To that end, the material is organised in a tutorial-like fashion, so as to ensure the key concepts can be easily understood.
1.1 Motivation
Why study Internet traffic matrices? Simply because their implications for network operators are vast. If the traffic matrix of a network is exactly known then, armed with topology and routing information for the network, the operator knows exactly what is going on in the network, rendering network management tasks relatively easy. If the traffic matrix is completely unknown, then a network operator is blind. Subtle faults may plague their network, substantially reducing performance. Congestion may be rife, or sudden shifts in traffic may cause transient traffic losses. The issues are becoming more important. The dominant philosophy in the early days of the Internet was best effort delivery. Most applications did not have high quality of service (QoS) requirements, and had some tolerance to packet drops and link failures. In recent years, however, the landscape of the Internet has been changing quickly with the introduction of streaming content such as video, high definition television and Voice over IP (VoIP), which have more stringent QoS requirements. For example, excessive packet drops and delays produce highly noticeable artefacts in streamed video. These changes are driven primarily by user demands, with the introduction of new applications such as online social networking, entertainment services and online multi-player gaming. Furthermore, with the increasing computational power of mobile devices and increasing wireless speeds, it is evident that a significant portion of future traffic will be generated by these devices. These trends are making measurements more and more critical. For most operators, the state of their measurements is somewhere between the extremes. Most operators have some traffic data (if only link counts), but it is rare (in our experience) to find a network operator who knows everything about their traffic. As such, one of the tasks we shall consider here is how to measure these matrices1, but also how to obtain estimates of traffic matrices when presented with incomplete data. The traffic matrix inference (or completion, or recovery) problem is one of the major areas of research into these interesting creatures, and it is intricately related to modelling the matrices, both because measurements supply data to populate the models, and because models are used to perform the inference. Even those operators with extensive measurements and exact knowledge of today's traffic matrix may be interested in methods to predict their matrices for use in future planning, and this can be seen as another form of matrix completion. Traffic matrices have many uses, apart from the simple fact that this type of business intelligence is critical to understanding your network. The more direct uses include network optimisation, anomaly detection and protocol design. There are three common optimisation problems on networks. Capacity planning is needed to ensure there is adequate bandwidth for users in the present and future, but at minimal cost. There are two types of network planning: evolutionary planning and green fields planning; see §5 for details. Traffic engineering tasks include day-to-day maintenance of the network as well as predicting growth trends and anticipating traffic demands [16, 45, 51, 52, 74, 75, 79, 91, 115]. Routing involves organising traffic flow in the network, as well as modelling the routing of large networks [88]. This includes functions such as finding the shortest paths for flows but also, importantly, load balancing to ensure links remain uncongested.
In all these cases, the traffic matrix is a key input needed to perform the tasks effectively and efficiently.
1 This chapter is not really a primer on measurement tools as such, so much as on the principles that underlie those tools. We will not tell the reader how to set up NetFlow on the reader's particular router, but instead we will aim to inform the reader what could be achieved with this type of flow-level measurement.
Traffic matrices can also be used to detect sudden shifts in traffic due to anomalies. Anomalies include sudden unexpected events, such as network failures, or more malicious events, such as the September 11 World Trade Centre attack, worm infections and distributed denial of service (DDoS) attacks [99]. Regardless of their origin, these anomalies need to be detected so that appropriate measures can be developed against possible threats to the network. Traffic matrices may also be used to conduct reliability analyses, where the effect (on traffic) of network failures is considered. A basic task in most network design is to create redundant paths to carry traffic in case of failure, but if high reliability is required, then an operator should also ensure that there is sufficient capacity in the network to carry this traffic along its alternate paths. Further, the performance of many network protocols depends on the traffic they carry, and the cross-traffic which forms their background. Design of new protocols therefore requires realistic measurements, or models of such traffic. Models can be used to test protocols on artificially synthesised traffic. In this way, the limitations of a protocol may be understood in a controlled environment before running it on an actual network. These issues will be examined in depth in §5, where algorithms utilising traffic matrices to perform these tasks will be discussed.
of each other, although there are some overlaps between these layers in practice. The basic properties of traffic matrices will depend on the network level, and traffic can be measured between logical or physical sources/destinations, or at different levels of aggregation. Measurements of networks clearly serve as the foundation of any model development. The caveat, however, is that the measurements themselves are subject to errors and inconsistencies, which may lead to an incorrect model. Moreover, several hypotheses may fit a particular observation of the network, leading to several possible models explaining the same observation. To argue for the use of one model over another requires additional knowledge, from new data or from domain knowledge. And when new information becomes available, there is the question of how to incorporate it into the model. There are a variety of possible approaches, all equally valid [7]. To further compound the problem, underlying all Internet traffic is the fact that all traffic is driven by consumer demands. Unfortunately, human behaviour is inherently difficult to model, let alone understand. Furthermore, changing trends in network usage, deployment of Content Distribution Networks (CDNs), and increasing mobile traffic all have implications for traffic measurement. Although measurements could be made extremely accurate, if one overlooks the costs involved, the observations themselves may be outdated very quickly. Therefore, a complicated model based on these measurements may be accurate for today, but fail in predicting traffic demands for the next few years or so, given fluctuations in demand and unexpected changes in traffic patterns [113]. Furthermore, it is misleading to talk about a "correct" traffic matrix model. As pointed out in [7], just because a model replicates some set of properties of the observed data does not necessarily mean the model is correct. At the least, there is the dangerous possibility of over-fitting. After all, a better fit is always achievable simply by adding more parameters to the model. Several information criteria address the over-fitting problem, for instance the Akaike Information Criterion (AIC) [6], the Bayesian Information Criterion (BIC) [102], Minimum Message Length (MML) [120] and Minimum Description Length (MDL) [95]. While these criteria are beyond the scope of our discussion, the basic principle is to choose the simplest explanation (measured in some information metric) amongst all competing explanations of the data. It is for these reasons that models should be evaluated not just on their accuracy in making predictions of particular statistics, but also on their simplicity, robustness and consistency in relation to the realities of network operations. Model assumptions should make sense to an operator as well as be empirically tested on various datasets to understand their reliability and pitfalls. This is not to say we cannot learn new ideas and principles from measurements; we just need to keep in mind their scope of application, and the fact that the usefulness of these principles in practice may be limited. It is often preferable to have simple, robust models, in preference to precise but fragile ones. At a fundamental level, we need to accept that models are all unrealistic in some way. A model describing the properties of a smaller scale network, such as a Local Area Network (LAN), may be unsuitable at the backbone network level. The underlying assumptions of one may not hold in the other.
Models are simplifications of the glorious complexity that comprises humanity's primary means of telecommunication. We must, instead, reread the adage by George E. P. Box: "All models are wrong, but some are useful." Some models have been more successfully used in real networks, and it is to these that we shall devote the most time here. However, we shall endeavour to cover the majority of simple models, with the view that individuals should use the best model for their application without fear or favour. No doubt the murkiness and apparent self-contradictions of this discussion have left our readers no wiser, as yet. Modelling is a topic that could be discussed endlessly. It is our aim that, through consideration of the qualities of various traffic matrix models, we shall not only inform about these particular models, but also bring the reader to a new understanding of modelling, making these issues a little less opaque.
In this section, a formal approach to networks and traffic matrices is defined. Traffic originates from a source and is delivered to a destination (or several destinations). The traffic traverses a set of links between some set of sources and destinations. The links connecting these define the topology of the network, and the paths chosen by traffic flows determine the routing. Traffic may be split across multiple paths by load balancing, or may keep to a single path. Often, sources and destinations are identified with network devices such as switches or routers, but they can also refer to a location in a logical space attached to the network, for instance the IP addresses of prefix blocks. Let $\mathcal{N}$ denote the non-empty set of all sources and destinations in a network, and let $|\mathcal{N}| = N$. The sources and destinations often have a physical, geographic location, and so we regard their indices as spatial variables, even when they are actually logical entities, such as Autonomous Systems (ASes), or cannot be directly identified as network devices, as with IP addresses. A traffic matrix is naturally represented by a three-dimensional, nonnegative hyper-matrix $X(t)$, with $(i,j)$-th entry $X_{i,j}(t)$. Each entry represents the traffic volume, or demand, measured in terms of bytes or packets, from source i to destination j in the time interval $[t, t + \Delta t) \subseteq T$, the full measurement interval being denoted by T. As an aside, a matrix representation is useful for representing other aspects of the network, for instance delay, jitter, loss, bottleneck-bandwidth and distance [72], but throughout this chapter traffic will be the focus. Whenever the context is clear, for example when considering only the spatial structure of the matrix, the time index t is dropped. An abstract example of a traffic matrix is presented in Figure 1. A closely related concept is the demand matrix, distinct from the traffic matrix because the former is offered load, and the latter carried load [47]. They may be the same, but may differ where congestion limits the carried traffic, or where rate limiting is used on some traffic streams. In general we cannot measure offered traffic, only carried traffic, and so almost all empirical research has concentrated on traffic matrices, but it is important to note that many of the assumptions of traffic matrix models are actually motivated by intuition about demand matrices, and these may not apply where the two differ substantially2.
2 Many works make no distinction between demand and traffic matrices, leading to confusion. Here, we shall try to keep the two distinct: when we say traffic matrix, we are referring to carried traffic.
Figure 1: An example of a traffic matrix between nodes A, B, ...:
$$ X = \begin{pmatrix} x_{AA} & x_{AB} & x_{AC} & \cdots \\ x_{BA} & x_{BB} & x_{BC} & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} $$
Note that often the diagonal elements $x_{AA}, x_{BB}, \ldots$ are zero, as this traffic does not cross the network; however, in almost as many cases the diagonal is non-zero, because a node in the graph represents an aggregation of devices such as a PoP or an AS. In these cases we often do wish to measure these diagonals, even though the traffic may not cross the logical links pictured, because it affects traffic engineering within the PoP, for example.
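For concreteness, the sketch below builds a synthetic hyper-matrix of this shape in Python/numpy; the dimensions and values are purely illustrative.

    # A synthetic traffic hyper-matrix X with (i, j, t) entry giving the volume
    # from source i to destination j in time bin t (e.g., in Mbps).
    import numpy as np

    N, T = 4, 288                              # 4 nodes, one day of 5 minute bins
    X = 10 * np.random.rand(N, N, T)           # illustrative volumes
    X[np.arange(N), np.arange(N), :] = 0.0     # zero diagonals, assuming intra-node
                                               # traffic never crosses the network
    total_per_bin = X.sum(axis=(0, 1))         # total carried traffic per time bin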
Large-scale, real-time monitoring of traffic is intractable at present, thus limiting measurements to the average traffic in a discrete time interval. Shorter time intervals, i.e., small $\Delta t$, benefit anomaly detection applications, the tradeoff being possible uncertainty from traffic burstiness at shorter time scales, combined with larger potential measurement or sampling errors3. Longer time intervals result in smoother traffic, averaging out measurement errors. However, this smooths out real variability in the traffic as well, and can result in meaningless estimates in the presence of strong non-stationarity. Hence, the choice of $\Delta t$ depends on the application and the available measurements. Common choices range from 5 minutes to an hour. Further discussion of the temporal properties of traffic flows is found in §4. There are two popular definitions of traffic matrices: the origin-destination (OD) matrix and the ingress-egress (IE) matrix. (i) OD traffic matrix: this matrix measures traffic from true source to destination, i.e., from the point that generates a packet to the point that receives it. In the Internet, it is perhaps most reasonably defined in terms of Internet Protocol (IP) addresses. However, if $\mathcal{N}$ is defined over the entire IPv4 address space of $2^{32}$ addresses (with even more for IPv6), this poses storage and computational problems. Moreover, the matrix would be very sparse. What's more, mechanisms such as NAT, HTTP proxies and firewalls may obscure the true IP address mappings. One way to overcome some of these deficiencies is to aggregate the traffic matrix into blocks of IP addresses, frequently using routing prefixes. Bharti et al. [11] defined the idea of atomic aggregation to partially circumvent the problem of the size of an OD matrix. If the logical indexing chosen is atomic, then a non-zero element of an OD traffic matrix implies that all flows between the particular source and destination pair are fully visible to the network operator (see [11] for details). (ii) IE traffic matrix: any single network operator sees only a small proportion of the Internet OD matrix. Thus this matrix is not just unknown, but unmeasurable (by a single operator). Instead, many operators find that using their edge routers (or even the edge links) as sources and destinations results in a local traffic matrix of great use. We call this the IE traffic matrix, as the set includes ingress points (where traffic enters the network) and egress points (where traffic leaves the network) as proxies for sources and destinations. A single ingress or egress node may denote a router, a collection of
3 Traffic is typically sampled at the backbone network level to cope with the tremendous volumes of data that could be collected.
physically co-located routers called a Point of Presence (PoP), or some other abstract collection of traffic ingress/egress points, depending on the level of coarseness required in the modelling process. The PoP level convention is often adopted, as it provides a simple visualisation of the network for its operators. IE traffic matrices can be obtained in a number of ways. They can be formed from OD traffic matrices simply by mapping IP prefixes to ingress/egress locations in the network, but this assumes knowledge of all flows traversing the ingress/egress nodes. Traffic at egress nodes may be inferred from router data (see the next section) and measurements of ingress traffic, but typically the converse is difficult. Likewise, it is usually difficult to form an OD matrix from IE matrices. Consequently, the IE traffic matrix is frequently adopted for network optimisation applications, as it is more practical to measure, and because in aggregating traffic the OD flows are bundled together into locally meaningful groupings. A network may carry flows between billions of IP address pairs and millions of prefix pairs, but only thousands of router pairs, and hundreds of PoP pairs. In this way, the IE matrix is a more compact representation, but more importantly, the aggregation of the traffic into large bundles results in a smoothing effect on the data, reducing the number of independent parameters that may have to be estimated. At the PoP level, the aggregation of flows results in averaging out sampling error (similar to the choice of $\Delta t$ above). This is highly beneficial for the numerical iterative algorithms used to estimate traffic matrices, as aggregation leads to better conditioning of the traffic matrix. The trade-off, unfortunately, is the loss of fine-grained data, as one can no longer observe IP-level flow data, or examine application profiles. Another consideration worthy of concern in applying traffic matrices is invariance. A good representation of the traffic has to be invariant to other network aspects, such as the routing and the network topology. For instance, if the traffic matrix for a network changes in response to changes in link placement, then the matrix is not terribly useful for network design. IE matrices are subject, for example, to large changes due to routing shifts, and this means that they are less useful to operators than OD matrices. However, the practicalities of measurement mean that IE matrices may be all that is measurable.
2.1 Example
As an example, we present the Abilene (Internet2) data from 2004 that was used in several studies [7, 96, 124]. The backbone network is located in North America (see Figure 2a). The traffic data contains averages over 5 minute intervals, from March 1st to September 11th, 2004 (though there are some missing periods). The dataset is an ingress-egress, router-to-router traffic matrix. Table 1 shows one example traffic matrix from that period, along with row and column sums. We can immediately see that the matrix is not symmetric, and that there is quite a range of values. Also notable, the diagonals are not zero, as there is some measured traffic that enters Abilene and then exits at the same PoP. A timeseries view of the data is shown in Figure 2. Figure 2b shows the total traffic during each time interval for the whole dataset, and Figure 2c shows the first two weeks of the data, overlaid so as to illustrate the cyclic nature of the data. We can clearly pick out peak traffic hours, and weekdays from the weekend. However, the variation between peak and off-peak loads is perhaps smaller than is typical in commercial environments. We have endeavoured to put this data into an easily read format, and placed it on a web server at http://www.maths.adelaide.edu.au/matthew.roughan/traffic_matrices.html, where we will also place other TM datasets as far as possible, for comparison studies, or as test datasets for education.
Src\Dst     1     2      3     4      5     6     7     8     9    10    11     12   Row sum
   1     0.07  0.07   0.43  0.00   0.06  0.12  0.06  0.00  0.05  0.00  0.00   0.25      1.12
   2     0.00  4.09   6.42  0.06   7.07  4.42  1.59  0.02  3.24  0.03  0.16  11.09     38.18
   3     0.00  4.70  25.48  4.11  13.99 11.53  3.31 87.27  5.22  0.01  0.08   7.70    163.38
   4     0.00  1.93  10.25  1.68   5.63  6.11  2.59  0.01  4.11  2.60  0.04   5.92     40.88
   5     0.00  4.76   0.25  0.01  24.06  0.04  0.01  0.02  1.24  0.02  0.03  18.05     48.49
   6     0.00  2.87  23.73  1.55  13.53  4.78  2.89  0.01  9.45  0.08  0.50   7.64     67.02
   7     0.00  0.67   4.79  1.92   3.50  2.24  1.25  0.00  0.93  0.02  0.03   3.31     18.67
   8     0.00  4.18   2.58  5.80  26.35  0.17  0.16  1.41 10.88  2.11  3.64  16.67     73.97
   9     0.00  8.61  12.34  5.71  18.21 11.05  3.84  0.41 36.36  0.02  0.52  17.31    114.37
  10     0.00  0.18   0.04  1.71   1.69  0.00  0.06  5.61  0.96  1.82  8.44   0.36     20.86
  11     0.00  3.47   3.28  0.54   8.60  0.13  0.93  3.92  1.77  0.81  0.61   2.32     26.38
  12     0.00 18.20  16.04  0.83  34.03 11.18  5.64  0.09 25.57  0.08  0.80  47.02    159.47
Col sum  0.07 53.74 105.61 23.94 156.73 51.76 22.34 98.77 99.77  7.59 14.84 137.65    772.80
Table 1: An Abilene 5 minute traffic matrix from April 15th, 2004, 16:25-16:30, in Mbps. Rows are sources and columns are destinations; the final row and column give the sums.
Figure 2: The Abilene network and its traffic. (a) Map of Abilene PoPs (Seattle WA, Sunnyvale CA, Los Angeles CA, Denver CO, Kansas City MO, Houston TX, Chicago IL, Indianapolis IN, Atlanta GA, Washington DC and New York NY). Table 1 is a router-to-router traffic matrix, but there is one router per PoP except in Atlanta, where there is a second router; the router numbers in the table are alphabetical. (b) Abilene 5 minute totals of the traffic matrix (Gbytes/second) from March 1st to September 11th, 2004. (c) The weeks starting March 1st (blue) and March 8th (red), overlaid.
The dataset illustrates a number of properties of such data, perhaps most notably the fact that many datasets have anomalous spikes (these are often real, but may also be artefacts in the data) and periods of missing data. The question of how these arise leads naturally to the next section, on measuring traffic matrices.
Measurements
In theory, it is possible to collect accurate data about the Internet traffic of a network. In reality, however, many issues confound such measurements. Budget constraints driven by the cost of collection, or the massive data storage facilities required due to the sheer amount of data traversing a backbone network, limit what can be achieved. And even good measurement systems can suffer from errors and missing data. Added to this, current operator practice rarely includes any significant calibration of measurement apparatus, so often the degree of accuracy of the measurements is unknown. There are several well-known strategies for collecting traffic measurements. A packet trace is a collection of packet headers (perhaps with some payload) and timestamps. A packet trace can be collected through various means: for example, through hardware such as a splitter placed strategically in an optical fibre; by adding a monitor port on a router; or through software tools such as tcpdump executed on several hosts in a shared network. An OD traffic matrix can be constructed from such a trace by simple consideration of the IP addresses in the packet headers (with the caveats mentioned earlier). Such an approach is ideal in many respects: we have almost complete information, and the matrices may be drawn at almost any time resolution. However, collecting packet traces is expensive due to the need for dedicated hardware and the huge amount of storage required: potentially over a terabyte of data per hour on OC48 links (2.5 Gbps). It is rare for any but the smallest network to be able to completely instrument its network at this level of detail. Fortunately, constructing a traffic matrix does not require such detailed information. Perhaps the most common alternative is flow-level aggregation, where packets are aggregated according to a common flow key. One popular definition of the key is the 5-tuple comprising the IP source and destination addresses, TCP/UDP source and destination port numbers, and protocol number. A series of packets possessing a common key is called a flow, and we maintain simple statistics (byte and packet counters, and start and stop times) for each flow (a minimal sketch of this aggregation appears below). The aggregation of packets into flows reduces the number of records that need to be stored by removing redundancies in the data of a packet trace. Flow-level collection is generally performed in 15 minute time bins5 and is often an in-built function of a router. The only additional infrastructure required is the Network Management Station (NMS), and flow records are usually compressed by the router before being exported to the NMS. Despite this reduction, flows arrive at a router at rapid rates, and the formation and storage of flow records often burdens the router's CPU. To further reduce the number of flow records at a router, sampling methods are employed. The most popular sampling method is packet sampling, where incoming packets are sampled based on predetermined sampling patterns, used, for instance, by Cisco's NetFlow [1]. Such pseudo-random patterns have a similar effect to independently picking incoming packets, given sufficient mixing of traffic. The sampling rates can be adjusted depending on the capacity of the incoming links, with recommendations such as 1 in 256 packets for OC192 links (10 Gbps). Higher capacity links require aggressive data reduction, and so lower sampling rates are used.
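The flow-aggregation step described above is easy to sketch: packets sharing a 5-tuple key are folded into a single record carrying byte and packet counters and start/stop times. The Python sketch below uses made-up packet records and ignores flow timeouts, which any real collector must handle.

    # Minimal flow aggregation by 5-tuple; packet records are
    # (src, dst, sport, dport, proto, bytes, timestamp), purely illustrative.
    def aggregate_flows(packets):
        flows = {}
        for src, dst, sport, dport, proto, nbytes, ts in packets:
            key = (src, dst, sport, dport, proto)
            rec = flows.setdefault(key, {"bytes": 0, "packets": 0,
                                         "start": ts, "end": ts})
            rec["bytes"] += nbytes          # byte counter
            rec["packets"] += 1             # packet counter
            rec["end"] = max(rec["end"], ts)
        return flows

    packets = [
        ("10.0.0.1", "10.0.1.9", 5501, 80, 6, 1500, 0.00),
        ("10.0.0.1", "10.0.1.9", 5501, 80, 6, 1500, 0.01),
        ("10.0.0.2", "10.0.1.9", 6001, 53, 17, 80, 0.02),
    ]
    for key, rec in aggregate_flows(packets).items():
        print(key, rec)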
Packet sampling reduces the number of flows significantly by omitting many of them, but it is important to realise that although it may select packets in an unbiased way, it is not unbiased with respect to flows.
5 The issue of timing of flows is actually somewhat more complicated, but readers should refer to detailed descriptions of specific flow-capture protocols for information on their particular flow capture.
Packet sampling has a strong bias towards long flows, since it is more likely that packets are sampled from a long flow than from a short one. Furthermore, the sampled flow size is not the true size of the flow, and there are several works [37, 38] proposing methods to sample flows and estimate their true sizes. While the strong bias may be a problem for some applications, there is usually no problem in using sampled flows to form the IE traffic matrix. In addition to packet sampling, we may also sample a set of flows, and these sampled flows can then be used to create traffic matrices. The resulting reduction in intermediate storage and processing can be substantial, particularly if the sampling is done in a clever way [37, 38]. It must be remembered that sampling is an inherently lossy process, regardless of the underlying sampling method used. The loss of information translates into errors, or noise, in the data. The size of these errors should, in best practice, be estimated for a given setup, but most operators do not undertake such procedures due to the difficulties involved in obtaining ground-truth data with which to compare the sampled data. In many cases it is simply assumed that these errors, once the data is aggregated further into traffic matrices, are negligible, but this assumption is rarely validated. Furthermore, there is the problem of possible multiple counting of traffic flows in the aggregated flow records. The problem arises if the aggregated flow records come from internal backbone routers of the network, since a single flow may be recorded more than once by several routers if it traverses several links. One way to get around this relies on the placement of the measurement points: one simply performs traffic measurements at the ingress points. Specifically, the measurements are performed on the backbone routers connected to the access routers, or on dedicated devices placed at the links connecting the access routers to the backbone routers. In this way, the total incoming traffic of the network may be reliably measured from the flow records, and there is no longer a problem of multiple counting of flows. Trajectory sampling [39, 40] is another method. It exploits the pseudorandomness of a hash function to simulate random sampling of packets. However, if a flow is sampled at one router, it will be consistently sampled at every other router the flow traverses, by the deterministic operation of the hash function. Since the tracking of flows is then feasible, it is also straightforward to disambiguate their identities and eliminate multiple counting. The downside is that a network operator must deploy trajectory sampling throughout the whole network, and configure the hash function on each router before each measurement interval begins. A less costly alternative is easily obtainable link counts. A link count, or link load measurement, gives the volume (in bytes or packets) of traffic on a link during a particular time interval. Link counts are obtainable from measurement data via the Simple Network Management Protocol (SNMP) [25], defined as part of the IETF standards and present on many Internet devices, including most routers. SNMP data from a single router provides two measurements for each interface: the incoming and outgoing byte counts. SNMP data is obtained by an NMS by periodically polling requests through an interface, typically UDP port 161. The polling period varies from 1 minute to several minutes, but the default seems to be 5 minutes.
SNMP data is highly susceptible to error, due to the following factors: (i) missing link observations: data is transmitted via unreliable UDP packets, and there may be errors when copying the data into the observer's archive; (ii) incorrect link observations: poor router vendor implementations cause inaccuracies; and (iii) sampling coarseness: polling times are often inaccurate, due either to poor NMS or SNMP agent implementations, high loads, or network delays. As with flow-level data, link count data should be calibrated, but rarely is. There is only one experiment of which we are aware that does so [97]. The study showed that in one network SNMP errors were typically
low (90% of measurements had an error of less than 1%), but a small number of errors were very large, some as large as 100%. This type of heavy-tailed error distribution causes problems for some estimation approaches and should be considered in context. Another drawback of SNMP data is that it only provides aggregate link statistics, omitting details such as the types of traffic on the link and the traffic's source and destination. Despite all these issues, SNMP data is, at present, the easiest and perhaps most common way to obtain large-scale traffic data. The observed link counts provide some information about the traffic matrix, but only in an indirect manner. Thus, the traffic matrix has to be inferred through a process called tomography. Network tomography was first introduced by Vardi [117], with the inspiration taken from inference techniques for medical tomography, as both problems are similar in nature. Vardi's work was subsequently expanded upon by Tebaldi and West [111] and Cao et al. [24]. Various other works in network tomography measure other properties of the network, such as link delays (see [26, 30] for an overview) via active packet probing, but for traffic matrix estimation we are only concerned with the link count observations from SNMP data. Mapping traffic to links requires topological and routing data in the form of a routing matrix. The routing matrix A is defined by
$$ A_{i,r} = \begin{cases} F_{i,r}, & \text{if traffic for } r \text{ traverses link } i, \\ 0, & \text{otherwise,} \end{cases} $$
where $F_{i,r} \in (0, 1]$ is the fraction of traffic from source-destination pair $r = (s, d)$ traversing link $i$. Fractional values occur when load balancing is used, but it has often been assumed that $F_{i,r} = 1$, resulting in $A_{i,r} \in \{0, 1\}$. The size of the routing matrix depends on the network: with N nodes and L links, the routing matrix has size $L \times N(N-1)$ (traffic from a node is assumed not to be routed back to itself). Information on the routing matrix can be obtained from several different sources (router configuration files, traceroutes, or the routing protocols themselves), but the collection of such information is not the topic of this chapter; the interested reader is referred to [46, 47]. A common assumption is that the routing matrix remains stable during the measurement interval, so the temporal dependence is dropped, i.e., $A(t) = A$ for all $t \in T$. However, changes in routing may occur if there are link or router failures, necessitating traffic reroutes. Generally, it is assumed the measurement interval is chosen over a period of time (minutes to hours) when A is stable enough to be considered static, justified by observations in [85]; however, in at least one case it is proposed that such changes be deliberately created and exploited [107] for traffic matrix inference. All the link counts may be grouped into an $L \times 1$ vector y. Then, based on link observations in one measurement interval, the SNMP link counts may be expressed as
$$ y = Ax, \qquad (1) $$
where x is the $N(N-1) \times 1$ vectorised version of the traffic matrix X, with its columns stacked upon one another. Figure 3 presents an example of (1). It shows how the traffic on a single link, $y_1$, is built from the sum of the traffic routed across it, $x_1 + x_2$. By stacking each of these equations we obtain
$$ \begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}, \qquad (2) $$
which, written in matrix notation, is just (1). Note that in this case the routing matrix A is invertible, so the problem of inferring the traffic matrix from link measurements is easy, but this is rarely the case: usually L is much smaller than $N(N-1)$, and so the problem is highly underconstrained.
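The three-flow example of equation (2) can be reproduced numerically in a few lines; the numpy sketch below, with illustrative traffic values, also shows how dropping a single link observation makes the system underconstrained.

    # The observation model y = Ax for the example of equation (2).
    import numpy as np

    A = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])            # routing matrix of equation (2)
    x_true = np.array([3.0, 1.0, 2.0])         # traffic on routes 1..3 (e.g. Mbps)
    y = A @ x_true                             # SNMP-style link counts

    # Here A happens to be invertible, so x is recovered exactly:
    print(np.linalg.solve(A, y))               # [3. 1. 2.]

    # In real networks L is far smaller than N(N-1). Dropping one link
    # observation leaves infinitely many solutions; least squares returns
    # only the minimum-norm one, not necessarily x_true.
    x_ls, *_ = np.linalg.lstsq(A[:2, :], y[:2], rcond=None)
    print(x_ls)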
Figure 3: A simplified network and its traffic (the main simplifications are that we only consider a single router, and only unidirectional traffic, not bidirectional as in real networks). (a) Link labels (links 1-3). (b) Traffic labels and routes (routes 1-3).
There are two main assumptions implicit in this observation model: that the traffic matrix is stationary, i.e., its statistical properties remain stable throughout the measurement interval, and that there are no errors in the measurement. Stationarity is preserved by choosing an appropriate measurement interval, say 1 hour (see §4). Moreover, in reality errors do occur, and to account for them the model (1) is extended to
$$ y = Ax + z. \qquad (3) $$
This second model is a simple additive-noise model, often used to test the robustness of an inference technique. Each element of the additive noise term z is typically chosen to be independent and identically distributed (i.i.d.) white Gaussian noise with mean zero and variance $\sigma^2$. Often $\sigma^2$ is kept small, as large values would result in some elements of y violating the non-negativity constraint. It is for this reason that other distributions may be used, for example log-normal or gamma distributions. Additionally, because of missing link information due to poor SNMP implementations, some of the elements of y may be missing. Finally, if the given routing matrix A is incorrect, the observations y would significantly depart from the true SNMP link counts. However, most works assume an accurate A, because there are reliable methods for obtaining routing information [104]. There are usually many fewer links than the total number of IE pairs, and so the inverse problem above is highly underconstrained. Whether noise is present or not, there may be an infinite number of solutions x that fit the observations (1). Figure 4 shows a picture of such a network, where we only measure at the bottleneck. Even in this very simple network the equation $y_1 = x_1 + x_2$ is underconstrained. In order to make progress, some additional information needs to be assumed, usually in the form of a traffic matrix model, and we shall consider some of these in the following section. Before moving on to the modelling aspects of traffic matrices, note that there are other strategies for measurement. For instance, if MPLS is being used, it creates a set of tables (in many implementations) that can almost be read directly to obtain the matrix (e.g., see [12]). Alternatively, the network operator may have more accurate local traffic matrices, obtained through specific functionality at the routers. It is, in principle, easy for a router to keep counts of its decisions [118], essentially amounting to a table of the volumes of traffic between pairs of interfaces. Locality here is defined in the sense of the matrix's restriction to a single router: we essentially see an IE traffic matrix of the single router's interfaces. These local matrices from all routers in the network can be used to improve the estimation of the IE traffic matrix; see §5. In some special cases, such as a star network, a single local matrix would be equivalent to the traffic matrix,
Figure 4: A harder inference problem, where we only have one measurement ($y_1 = x_1 + x_2$) but two traffic elements to estimate. There is therefore a fundamental ambiguity in this problem.
serving to highlight the information gain from local traffic matrices. Such matrices provide greater than a two-fold increase in the accuracy of the tomography estimation schemes of [73, 125].
Models
In this section, several canonical as well as recently proposed models are presented. Modelling is divided into three categories: purely temporal modelling, spatial modelling and spatio-temporal modelling.
Second, most network traffic is human generated. Therefore, it stands to reason that traffic is influenced by human activity in a 24 hour cycle. In fact, distinct diurnal patterns have been observed, with peak traffic occurring around mid-day or in the evening and dips during the night; for example, see Figure 2c. This correlates with the daily schedule of an average human being, where mid-day traffic is generated for work or school purposes, while the lack of traffic during the night corresponds to sleeping periods. Peak traffic rate is also noticeably less on the weekends. The regularity of this behaviour can be quite strong, as shown in the figure, where two successive weeks' data are overlaid so one can see how closely they match. Most traffic measurements see some measure of cyclic behaviour, with the degree of regularity determined by the type of traffic (is it made up of many individuals or a few large flows?) and the scale. Third, and leading on from the last point, the traffic volume itself is dependent on the measurement period and the aggregation level of traffic. At very short scales, in milliseconds or seconds, the traffic distribution is highly variable and shows strong dependencies, making the use of such measures statistically non-trivial, even if such measurements were easy to collect. A common paradigm is to consider a measurement interval of minutes to an hour, where measurement is easy. Also important is the fact that at these time scales stationarity6 of the traffic volume distribution is often a reasonable assumption, though we can see the limits of this in Figure 2c (stationarity clearly doesn't hold for more than a couple of hours); however, it has been shown that cyclo-stationarity holds to a large extent [106]. Fourth, there are natural variations in traffic over time, and these are often modelled as a random (or stochastic) process. This random process could have all sorts of features, but there are some basics that
6 Stationarity refers to the concept that the statistics of the traffic (for instance, the mean and variance, but in general including all statistics) are constant with respect to the time at which they are measured. Wide-sense stationarity is when only the mean and variance of the traffic satisfy the stationarity property. In Internet traffic data it is only ever approximately true. Moreover, it is hard to test for stationarity when traffic has long-term correlations, and so we can only ever talk about the degree of stationarity.
Figure 5: Australian Internet traffic volumes (PB/quarter, log scale) based on data from the Australian Bureau of Statistics from 2000-2012. The dashed line shows a linear fit to the (log) data. Note the log y-axis, so the plot shows quite a reasonable fit to exponential growth with a doubling period of 465 days. Over the same period the growth in broadband subscribers has been almost exactly linear (soon this trend must decrease, as a large proportion of Australia's population is now connected), so most of the growth has come from growth in the amount downloaded per customer.
should be observed by all reasonable models. For instance, network operators aggregate traffic from multiple sources, which is known as multiplexing. Multiplexing is used to boost the efficiency of the links in a network by smoothing out variations in traffic. The apparent smoothness is a result of decreases in the relative variance, as predicted by the central limit theorem [23]. The more OD flows multiplexed on a link, the higher the efficiency and smoothing effect, provided the aggregated bandwidth does not exceed link capacity. Thus, any model for the large traffic rates in a network must be consistent under multiplexing; for example, when the number of flows being multiplexed is increased, the relative variance should decrease in a predictable manner. Furthermore, the statistical properties of the aggregated traffic must also be consistent with the statistical assumptions of the traffic from a single user. Finally, although rare, sometimes there may be sudden spikes in traffic. Such a component may arise from unusual traffic behaviour, such as DDoS attacks, worm propagation or Border Gateway Protocol (BGP) routing instability from misconfiguration. Flash crowds are also an example of this behaviour, which happens when there is a significant jump in the number of clients of a particular web server or CDN. Extreme unforeseen events, such as the September 11 attacks on the World Trade Centre in 2001, may instead cause a significant drop in traffic rates. In any case, a massive shift in traffic rates would be of interest to a network operator. One temporal model of OD flows traversing backbone routers was proposed by Roughan et al. [99] by generalising the Norros model [78], originally used for modelling LAN traffic. Each OD flow is assumed to be generated from an independent source. The model is characterised by the following components (at time t): (i) L(t), the long term traffic trend; (ii) S(t), the seasonal (cyclical) component; (iii) W(t), random fluctuations; and (iv) I(t), the anomaly component. These components correspond to the observations of traffic described earlier. The long term trend, L(t), depends on the observed underlying traffic growth in the data. This factor captures the overall growth of traffic over a long time period. An exponential growth model, for instance, could be found by fitting $L(t) = A \exp(ct)$ to the data, with the parameters A and c easily estimated via log-linear regression as in Figure 5. The component S(t) may be any periodic function, such that $S(t + kT_s) = S(t)$ for all integers k, where the period $T_s$ would typically be 24 hours, or one week. The component W(t) is assumed to be a stochastic process with zero mean and unit variance, modelling the small fluctuating component of observed traffic. Finally, I(t) captures the large variability of traffic from anomalies, say, a massive upsurge or downsurge in traffic. These events are captured in an individual component to separate their influence from traffic under the network's normal operating conditions. Let x(t) denote the volume of an OD flow at time t. The model takes the following form:
$$ x(t) = m(t) + \sqrt{a\, m(t)}\, W(t) + I(t), \qquad (4) $$
where $m(t) = S(t)L(t)$ is the mean of the OD flow, assumed to be a product of the seasonal and long term trends, and a is the peakedness of the traffic. The mean is modelled in this way because large OD flows have a larger range of variation in the size of their cycles. The parameter a controls the smoothness of the OD flow's volume in a way that is consistent under multiplexing of aggregated flows.
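To make the model concrete, the sketch below simulates equation (4) in numpy. The trend, seasonal shape, peakedness and anomaly placement are illustrative choices, not values fitted to any dataset.

    # Simulating x(t) = m(t) + sqrt(a m(t)) W(t) + I(t), with m(t) = S(t) L(t).
    import numpy as np

    t = np.arange(0, 7 * 24, 5 / 60.0)          # a week of 5 minute bins (hours)
    L = 100.0 * np.exp(1e-4 * t)                # slow long-term growth L(t)
    S = 1.0 + 0.5 * np.sin(2 * np.pi * t / 24)  # 24 hour seasonal component S(t)
    m = S * L                                   # mean m(t) = S(t) L(t)

    a = 2.0                                     # peakedness (illustrative)
    W = np.random.randn(t.size)                 # zero-mean, unit-variance noise
    I = np.zeros(t.size)
    I[np.random.randint(0, t.size, 3)] = 300.0  # a few large anomaly spikes

    x = m + np.sqrt(a * m) * W + I              # equation (4)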
Figure 6 shows the decomposition in action on the set of Abilene data shown in Figure 2c, extending the analysis into the section of missing data. The estimated long-term trend $\hat{L}(t)$ is not shown, as it is almost constant over this period, but we show the estimated mean $\hat{m}(t) = \hat{S}(t)\hat{L}(t)$ (derived using a 5 term seasonal moving average, further smoothed with a 13 term moving average), shown in green. Normal bounds on this, obtained by modelling the random fluctuations W(t) as a Gaussian process, are shown as dashed lines, and where the traffic falls outside these bounds we have indicated an anomaly. We have deliberately looked at this period, which is followed by a period of missing data, to illustrate one of the advantages of this approach: the ability to fill in missing gaps in the data, though more sophisticated approaches to solve that problem also exist, e.g., see [128].
Figure 6: Results of decomposing traffic into components. The blue curve shows the original Abilene traffic (Gbytes/second) from Figure 2c (note that the data is missing for the third week). The solid green line shows $\hat{m}(t) = \hat{S}(t)\hat{L}(t)$, the estimated mean, and the dashed green lines show the bounds of normal variation around this mean. The red asterisks indicate the anomalies I(t).
Another feature of this model is the preservation of its properties under linear combination, which is advantageous when looking at the aggregated behaviour of the OD flows. Consider K aggregated OD flows; then
$$ x_{\mathrm{agg}}(t) = \sum_{i=1}^{K} m_i(t) + \sum_{i=1}^{K} \sqrt{a_i\, m_i(t)}\, W_i(t) + \sum_{i=1}^{K} I_i(t). $$
The mean of $x_{\mathrm{agg}}(t)$ is simply $m_{\mathrm{agg}}(t) = \sum_{i=1}^{K} m_i(t)$, and the peakedness is the weighted average of the component peakednesses, $a_{\mathrm{agg}} = m_{\mathrm{agg}}^{-1} \sum_{i=1}^{K} a_i m_i(t)$. The linearity properties allow $x_{\mathrm{agg}}(t)$ to be expressed in the same form as (4) with the new parameters $m_{\mathrm{agg}}(t)$ and $a_{\mathrm{agg}}$. The linearity property enables the consistent computation of the variances of the aggregated traffic, which is useful for network planning and analysis. Besides this, [99] demonstrated the ease of estimating the parameters of model (4) via simple estimators and filtering. The cyclical nature of the aggregated OD flows is also amenable to Fourier analysis. The Fourier transform decomposes a periodic signal into a weighted sum of sinusoids with distinct frequencies and phases. It would be reasonable to assume the observed cycles can be represented by a small number of Fourier coefficients (this does not require that the signal be exactly sinusoidal, as any periodic signal can be approximated by a limited number of sinusoids). Indeed, it has been demonstrated that this is the case with traffic volumes [41, 106], where as few as 5 Fourier coefficients were needed to achieve low error in fitting large
OD flows of a Tier 1 network, demonstrating the relatively few significant frequencies present in a diurnal cycle of the OD flows. Figure 7 shows a similar analysis of the Abilene data from Figure 2c. It gives a simple example of the excellent degree of approximation to the traffic we can obtain using only a very small number of Fourier coefficients corresponding to daily periods. Figure 7a shows the periodogram of the data (the absolute magnitude of the Discrete Fourier Transform (DFT)). We can see that only a few peaks (with frequencies of 1, 2, 7, and 14 cycles per week, corresponding to daily and weekly cycles and their harmonics) are large enough to matter for gross features. Figure 7b shows the cumulative power contained in the largest of the Fourier coefficients, clearly showing that a small number of these represent the power of the signal well. Figure 7c shows an approximation of the original signal using only 10 coefficients (a sketch of this computation follows below). We can see that the cyclic components of the data are represented well, though the actual signal varies noisily around them. Including additional components provides a better approximation (in the sense of fitting the data more accurately), but we can see that we would simply be recreating the noise. There is a clear tradeoff here between noise and signal, with no absolute standard for the correct choice, but the underlying promise of Fourier analysis is obvious. The choice of time interval to use in Fourier analysis/approximation is interesting. A longer interval provides more data, and hence better estimates if the data is truly cyclostationary7. However, as we discussed earlier, there are noticeable trends in the overall volume, and so it is reasonable to assume that there will sometimes be significant shifts in the pattern (with respect to time of day or time of week). In these cases, extending the length of the dataset can confuse the statistical variability with non-stationary effects, resulting in biased estimates. The best tradeoff appears to depend on the dataset, but periods of perhaps a month seem to work reasonably well for estimating weekly cycles. However, there are alternative tools for such analysis, designed to provide some flexibility in the tradeoff between time and frequency. The simplest is perhaps the Short-Time Fourier Transform, in which the signal is windowed and the standard DFT is applied to the windows of data to produce a spectrogram. The technique has been applied in Figure 8 to a longer (6 week) sequence of the Abilene data. This segment of the data is more challenging, as it contains many anomalies, e.g., the frequent spikes in the traffic, or the drop in traffic on the Independence Day holiday, observed on July 5th in 20048. Figure 8a shows the set of data under consideration, along with its approximation using 10 coefficients. Figure 8b shows a spectrogram with windows of length 1 week, and here we can still clearly see the weekly and daily cycles appearing in most of the windows. Figure 8c shows a spectrogram using 1 day windows, overlapped to smooth the results. The resulting picture is poor at showing the cyclical behaviour of the traffic, but clearly highlights the large anomalous spikes in the traffic in the second week of the data.
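The low-order Fourier approximation used in Figure 7c is straightforward to reproduce: transform the series, keep only the largest coefficients, and invert. A minimal numpy sketch follows, with a synthetic diurnal series standing in for the Abilene data.

    # Approximate a traffic time series by its k largest Fourier coefficients.
    import numpy as np

    def fourier_approx(x, k=10):
        X = np.fft.rfft(x)                      # one-sided spectrum (x is real)
        keep = np.argsort(np.abs(X))[::-1][:k]  # indices of the k largest coefficients
        mask = np.zeros_like(X)
        mask[keep] = X[keep]                    # zero out all other coefficients
        return np.fft.irfft(mask, n=len(x))

    t = np.arange(7 * 24 * 12) / 12.0           # a week of 5 minute bins (hours)
    x = 1 + 0.4 * np.sin(2 * np.pi * t / 24) + 0.1 * np.random.randn(t.size)
    x_smooth = fourier_approx(x, k=10)          # recovers the cyclic component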
An even more powerful set of techniques that has been applied to the temporal analysis of traffic are the wavelet transforms [10, 83, 119], which provide a more flexible set of time/frequency tradeoffs. Wavelets have also been applied to the spatial analysis of traffic matrices [31, 34, 93, 121, 123], which we will consider in a moment. However, wavelets are a relatively complicated set of techniques, and it is outside the scope of this chapter to provide an introduction to that material; see for instance [69] for more information. Principal components analysis (PCA) has also been employed to quantify the temporal correlations of the traffic matrix. If X is a matrix whose rows each represent a measurement (for instance, an OD flow) and whose columns represent traffic volumes at time t, then temporal PCA decomposes the matrix $X^T X$ into its corresponding eigenvalues and eigenvectors. Often, each column is centred, simply by subtracting the mean vector $\bar{x}$, the average of all columns, from each column of X. In what follows, X is
7 A cyclostationary process can be thought of as one whose component processes, formed from times embedded at multiples of the fundamental period, form stationary sequences. 8 Independence Day is actually celebrated on July 4th, but in 2004 that day fell on a Sunday. Thus, the following day was declared a public holiday as well.
Figure 7: Fourier analysis of the Abilene data shown in Figure 2c. (a) The periodogram of the data: power against frequency. (b) The cumulative power captured as the number of retained Fourier coefficients grows. (c) The traffic (Gbytes/second, over the weeks of March 8th and 15th, 2004) together with its Fourier approximation (blue and red curves).
Figure 8: Short-Time Fourier analysis of six weeks of Abilene data (June 7th to July 12th, 2004). (a) The traffic (Gbytes/second) and its approximation using 10 Fourier coefficients. (b) Spectrogram with weekly windows; brighter colours indicate more power. Again we see strong power at 1 and 7 cycles per week, though the strength of these varies per week. For instance, in week 5 (when the Independence Day holiday was held) there was a week day whose traffic more closely resembled weekend traffic, breaking the weekly cycle and pushing more power into the daily cycle. (c) Spectrogram with daily windows. Note that the time-resolution in this case is actually poorer than the image would suggest, accounting for the poor resolution of the daily and weekly cycles in this figure. However, the large anomalous spikes of traffic in the second week stand out clearly in this view, as they spread power across a range of frequencies.
assumed to be centred. The matrix $X^T X$ is positive semidefinite. Visualising this geometrically, if the columns of the matrix are reinterpreted as a set of points, then they trace out an ellipsoid. Alternatively, $X^T X$ may be viewed as the empirical covariance matrix of the columns of X, in effect computing temporal correlations in traffic. PCA is used to find the directions of greatest variance of $X^T X$ by decomposing $X^T X = W D W^T$, where W is an orthonormal matrix containing the eigenvectors of $X^T X$ and D the diagonal matrix containing its eigenvalues. The eigenvectors are known collectively as the principal axes, and are ordered so that their associated eigenvalues are non-increasing, starting from the eigenvector associated with the largest eigenvalue. An equivalent view is that PCA essentially performs a Singular Value Decomposition (SVD) of the matrix X, computing only its right singular basis W. Thus, every row of X can be expressed as $x_k = a_k^T W^T$, i.e., a linear combination determined by a coefficient vector $a_k$, whose entries are called the principal components. Here, W is equivalent to a linear transform, post-multiplied to the data. Intuitively, if the set of principal axes with large principal components is small, then this is evidence of high temporal correlations between the traffic flows. In practice, it is common to focus on the few largest principal axes for the purpose of data reduction. Basically, this means choosing the first few significant columns of W to approximate each $x_k$ by $\hat{x}_k$ such that $\|x_k - \hat{x}_k\|_2 < \epsilon$, for some small error $\epsilon > 0$. As an aside, PCA may instead be performed on $X X^T$, in effect computing the spatial correlations of X. Here, we have $X X^T = V D V^T$, with each column $\hat{x}_k = V a_k$, equivalent to V pre-multiplied with the data. Spatial PCA was used in the context of anomaly detection [63-65], but there are problems with this approach. These discussions are deferred to §5.
Figure 9: Principal components analysis of the empirical covariance matrix of a two-dimensional data matrix X of 1000 centred points. Here, PC 1 and PC 2 are the principal components. Note the elliptical shape of the contours of the density, with semi-major and semi-minor axes given by the first and second principal components, respectively.
Figure 9 demonstrates an example of PCA performed on the covariance matrix of 1000 two-dimensional data points with zero mean, i.e., X has 2 rows and 1000 columns. The scatter of the data points vaguely resembles an ellipse. Here, there are two principal components, denoted PC 1 and PC 2, with the higher variance captured by PC 1. This is clear from the way the points in the figure are
distributed. The key point to take away is that the components capture the directions of highest variance and are orthogonal to each other. Moreover, the principal components matrix W has PC 1 and PC 2 as its first and second columns respectively. Each data point can be expressed as a linear combination of these two components. The concept extends easily beyond two dimensions to the larger dimensions typically encountered with traffic matrices.

PCA was performed by Lakhina et al. on empirical data from two backbone networks, showing that OD flows are a combination of no more than 35 eigenflows (the principal axes), and in fact fewer than this in general [66], out of over 100 OD flows. These eigenflows belong to one of three categories, depending on their properties:

(i) deterministic or d-eigenflows: generally the significant diurnal components of the largest OD flows. Although present in smaller OD flows, these eigenflows are less significant there. These eigenflows have a cyclostationary property, suggesting that they may be approximated by a small number of Fourier coefficients. They account for the majority of the total traffic of the OD flow.

(ii) spike or s-eigenflows: medium sized eigenflows with spiky behaviour in time, with values ranging up to 5 standard deviations from the mean of the OD flow. This suggests these contributions come from bursty processes, and they may be modelled by a wideband Fourier process.

(iii) noise or n-eigenflows: small eigenflows behaving like stationary additive white Gaussian noise. These eigenflows have small energy, and their contribution to overall traffic is negligible. The majority of eigenflows from Lakhina et al.'s datasets belong to this category.

There are several eigenflows belonging to two or more categories, but such eigenflows are rare [66]. For the most part, these categories are very distinct for almost all eigenflows. The low number of eigenflows compared to the dimension of the traffic matrices under study suggests low intrinsic dimensionality of traffic matrices, although the upper bound of 35 eigenflows indicates that the OD traffic is only approximately low rank. The results show a power law-type distribution of the principal components [66]. The decay of the distribution varies depending on the ISP, with some distributions exhibiting a very fast decay and some a much slower decay. In many senses PCA confirms the previous analysis and modelling, but it is interesting because its approach simply looks for correlations across different sets of measurements, and uses a different set of assumptions from, for instance, Fourier analysis, which can be performed on a single time series.

Finally, it is important to note that the full data needs to be available (no missing entries in X) in order to perform PCA. Furthermore, the basic flavour of PCA as described above is not a robust method: it is entirely data driven and therefore sensitive to outliers [94]. Robust variants have been proposed, but they necessarily complicate the basic version of PCA presented here, since the modifications entail constructing methods to identify and exclude outliers. Despite these disadvantages, in its purest form, PCA is a useful tool to learn the temporal structure of traffic flows.
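A minimal sketch of temporal PCA via the SVD (Python/NumPy); the synthetic data, seed and choice of five axes are illustrative assumptions, not Lakhina et al.'s datasets.

import numpy as np

rng = np.random.default_rng(0)
T = 7 * 288                                    # one week of 5-minute bins
diurnal = np.sin(2 * np.pi * np.arange(T) / 288)
# 50 synthetic OD flows: a shared diurnal cycle at different scales, plus noise.
X = np.outer(rng.uniform(1, 10, 50), diurnal) + 0.1 * rng.standard_normal((50, T))

Xc = X - X.mean(axis=1, keepdims=True)         # centre: subtract the mean vector
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
variance = s**2 / np.sum(s**2)                 # normalised eigenvalues of Xc^T Xc
print("variance captured by the first 5 principal axes:", variance[:5].sum())
eigenflows = Vt[:5]                            # leading temporal patterns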
4.2.1 Test models
We must remember that the purpose of models is not always to realistically represent a network's traffic. Their purpose is to provide inputs to other tasks. One common task is to assess the sensitivity of a network to different types of traffic, and to that end, engineers can consider the effect of various artificial test models. Three such models, sketched in code below, are the uniform traffic model, the peak load model, and the focussed overload model. They are extremely simple:

uniform — this simple model assigns the same value to all traffic matrix elements. It is used to provide a base load in some experiments, or to see the behaviour of a network under one extreme (the most uniform extreme) of traffic [18, Chapter 4.5.1].

peak load — this model is equally simple, and equally extreme. It has zero for all loads except one OD flow. It simulates the opposite extreme, where the aim is to see the effect of one dominant flow.

focussed overload — this type of traffic matrix simulates the effect of a focussed overload, or flash crowd⁹, where many users become interested in one location or resource and the traffic to this single location from all other sources is the dominant effect in the network. As a result, the focussed overload can be represented by a matrix with all elements zero, except for one row. We can likewise represent a focussed traffic load arising from a single point (say as response traffic to a focussed set of queries) by a matrix with a single non-zero column.

The advantage of each of these models lies in its simplicity. The simplicity means that the effect of the traffic is easy to interpret, and thus we can gain insights from these models where a more complex model would perhaps confound us with multiple potential causes for some results. For instance, in each of the above models we can gradually increase the traffic to see when capacity bounds are reached, and where they are reached, in order to identify potential bottlenecks in a network. Other test models based on classical distributions such as the Poisson and Gaussian distributions were proposed by Vardi [117], and Tebaldi and West [111] and Cao et al. [24], respectively. Their well-known properties make it easy to analyse results and provide insights, at the cost of a departure from real traffic properties.
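The three test models are trivial to generate; a minimal sketch (Python/NumPy; the function names are ours):

import numpy as np

def uniform_tm(n, total):
    # The same value in every element (self-traffic included for simplicity).
    return np.full((n, n), total / n**2)

def peak_load_tm(n, i, j, volume):
    # Zero everywhere except a single dominant OD flow (i, j).
    X = np.zeros((n, n))
    X[i, j] = volume
    return X

def focussed_overload_tm(n, dest, volume):
    # All sources send to one destination: a single non-zero column.
    X = np.zeros((n, n))
    X[:, dest] = volume / n
    return X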
4.2.2 Gravity model

The gravity model is perhaps the next simplest type of model, but it has a great deal to offer. Here, traffic from a source to a destination is modelled as a random process. In its simplest form, it assumes that any packet travelling from a source node to a destination node does so independently of all other packets. Depending on context, these nodes could be origin and destination, or ingress and egress nodes, respectively. Consequently, the traffic between two nodes is proportional to the product of the total traffic leaving the source node and the total traffic entering the destination node. The gravity model is amongst the most well-studied models and is considered a canonical first generation model. The name of the model derives from Newton's model of gravitation, where the gravitational force is proportional to the product of the masses of two objects divided by the square of the distance between them. The general formulation of the gravity model is defined by two forces: the repulsive force (factor) $R_i$, associated with leaving i, and the attractive force (factor) $A_j$, associated with going into j. Its general form is described by the following equation:

$X_{i,j} = \frac{R_i A_j}{f_{i,j}},$   (5)
9 See
where $f_{i,j}$ represents the friction factor, which describes the weakening of the forces (akin to distance in Newton's model), depending on the physical structure of the modelled phenomenon. The model has been used extensively in various fields, for instance the modelling of street traffic [86]. In the context of Internet traffic matrix modelling, the friction factors have typically been taken to be constant. That is, distance is assumed to have little effect on network traffic. That certainly seemed to be true even at a fairly large scale in the past, but it is unknown to what extent the deployment of CDNs (Content Distribution Networks) over the last few years has changed distance dependence, since CDNs locate traffic closer to the end user to avoid paying for transit costs across the network, or how inter-country matrices are affected by distance (for instance through language barriers). Where distance is ignored, equation (5) becomes

$X_{i,j} = \frac{X_i^{\mathrm{in}} X_j^{\mathrm{out}}}{X^{\mathrm{total}}},$   (6)
where $X_i^{\mathrm{in}}$ is the total traffic entering the network through i, $X_j^{\mathrm{out}}$ is the total traffic exiting the network through j, and $X^{\mathrm{total}}$ is the total traffic across the network [125]. The model can be expressed succinctly as the rank-1 matrix

$X = \frac{x^{\mathrm{in}} (x^{\mathrm{out}})^T}{X^{\mathrm{total}}}.$   (7)

The popularity of the model stems from the ease of estimating $X_i^{\mathrm{in}}$ and $X_j^{\mathrm{out}}$ for each node pair (i, j), especially at the PoP or backbone level, since the level of traffic aggregation mitigates errors in the estimation of these quantities from sampled traffic. The gravity model only captures the spatial structure of the traffic.

The key assumption of the gravity model is the independence between each source i and destination j. Coupled with the assumption that none of the nodes acts as a source or sink of traffic (i.e., that traffic is conserved in the network), $X^{\mathrm{total}} = \sum_{k \in I} X_k^{\mathrm{in}} = \sum_{\ell \in E} X_\ell^{\mathrm{out}}$. Under normal operating conditions in most backbone routers, where congestion is kept to a minimum, the conservation assumption appears reasonable. With this assumption,

$X_{i,j} = X^{\mathrm{total}} \, p_i^{\mathrm{in}} \, p_j^{\mathrm{out}},$   (8)

where

$p_i^{\mathrm{in}} = \frac{X_i^{\mathrm{in}}}{\sum_{k \in I} X_k^{\mathrm{in}}}$   and   $p_j^{\mathrm{out}} = \frac{X_j^{\mathrm{out}}}{\sum_{\ell \in E} X_\ell^{\mathrm{out}}}$

are the proportions of traffic entering the ingress and exiting the egress nodes respectively, called fanouts. The formulation (8) is known as the fanout formulation because it describes how a packet entering via node i is distributed to the several nodes $j \in E$. Fanouts have been demonstrated to be close to constant over several measurement intervals, compared to the traffic matrix [73], suggesting the fanout may be a better alternative to measure and use in, for instance, anomaly detection, than the raw traffic volumes. Observe the implication of independence between the source and destination in (8): $\Pr(I, E) = p_I^{\mathrm{in}} p_E^{\mathrm{out}}$. An immediate consequence is $\Pr(E \mid I) = P_d(E)$, where $P_d(E)$ is the marginal distribution of the traffic demand distribution at the destinations. The assumption of independence between the source and destination leads to two important properties of the gravity model, making it well suited to traffic matrix modelling.

Theorem 1 (Independence). Independence between the source and destination traffic holds for any randomly chosen submatrix of the model.

Proof. The independence property implies $\Pr(s, d) = p_s^{\mathrm{in}} p_d^{\mathrm{out}}$, holding for every $s \in I$ and $d \in E$. This condition also holds for any subsample of locations in I and E.
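In code, the gravity model of equations (6)-(8) is just an outer product; a minimal sketch (Python/NumPy; the function name is ours):

import numpy as np

def gravity_tm(x_in, x_out):
    # Equation (7): X = x_in (x_out)^T / X_total, assuming traffic conservation.
    x_in = np.asarray(x_in, dtype=float)
    x_out = np.asarray(x_out, dtype=float)
    total = x_in.sum()
    assert np.isclose(total, x_out.sum()), "conservation assumption violated"
    return np.outer(x_in, x_out) / total

With $x^{\mathrm{in}} = x^{\mathrm{out}} = (1, 1, 1, 3, 3)$, this reproduces the OD matrix of the toy example in equation (11) below.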
Theorem 2 (Aggregation). An aggregate of the gravity model is itself also a gravity model.

Proof. Let all nodes $\mathcal{N}$ be partitioned into N subsets $\{S_1, S_2, \dots, S_N\}$, with $S_i \cap S_j = \emptyset$ for $i \neq j$ and $\bigcup_{i=1}^{N} S_i = \mathcal{N}$. The aggregated traffic matrix is defined as

$X_{S_i, S_j} = \sum_{i' \in S_i} \sum_{j' \in S_j} X_{i', j'}.$   (9)

The independence condition implies

$X_{i', j'} = \frac{X_{i', \cdot} \, X_{\cdot, j'}}{X^{\mathrm{total}}}.$   (10)

Substituting (10) into (9),

$X_{S_i, S_j} = \sum_{i' \in S_i} \sum_{j' \in S_j} \frac{X_{i', \cdot} \, X_{\cdot, j'}}{X^{\mathrm{total}}} = \frac{\left(\sum_{i' \in S_i} X_{i', \cdot}\right) \left(\sum_{j' \in S_j} X_{\cdot, j'}\right)}{X^{\mathrm{total}}},$

which is also a gravity model.

These are not just theoretical results. Any model should be consistent in the sense that if the data to which it applies is viewed in a different way (for instance by sampling or aggregation), then the model should still apply (though its parameter values may change). It seems like an obvious requirement, and yet there are many models for which it does not hold.

The utility of the gravity model is not restricted to network measurement. It is used in various areas: teletraffic modelling [61, 67], economics and trade [87, 114], epidemiology [49, 76, 122], sociology [109], the retail industry, specifically Reilly's law of retail gravitation [32, 59, 92], and vehicular traffic modelling [42]. A more advanced discussion of the gravity model (albeit with an economics flavour) is found in [103].

The gravity model can be interpreted in terms of the principle of maximum entropy, where entropy is the Shannon entropy of information theory [33]. The principle is closely related to Occam's Razor, essentially choosing the most parsimonious explanation of the data amongst competing explanations. With little information regarding the traffic matrix besides the total traffic information, it turns out that the best one can do, according to the principle, is to describe the observations with a model promoting independence and symmetry, consistent with known constraints. In this way, the model enjoys robustness compared to other models, as the gravity model seeks to minimise deviation from what has already been observed.

The model, however, is not without its drawbacks. The main critique of the gravity model concerns its main assumption: the independence of the ingress and egress nodes. It has been pointed out in several papers [43] that this assumption does not hold in practice. Most traffic between node pairs is determined by connections, for example TCP sessions, so there exist dependencies between node pairs. The second drawback is the violation of the conservation-of-traffic assumption, e.g., when there is high congestion, causing packets to be dropped from router queues. Actual traffic matrices are generally asymmetric, violating the main assumption of gravity models. For example, forward traffic volumes of a source-destination pair of nodes do not typically match the volume of reverse traffic. Even if the OD traffic matrix matches the gravity model well, the corresponding IE traffic matrix may be vastly different, due to hot potato routing [113].
Figure 10: Traffic flow between two ASes, one in Perth and the other in Sydney. Note the asymmetry in traffic: due to the action of hot potato routing, the path taken by a traffic flow from Perth to Sydney differs from the reverse path, since by BGP's implementation, the closest external link of an AS is always chosen to route traffic out of the AS.
Hot potato routing is implemented as part of the BGP decision process [90]. BGP allows network operators to choose the egress points of traffic at the prefix level. The decision may also vary across a network, so that traffic at different points can end up being routed to different egress points. The idea of hot potato routing comes from its namesake: traffic is the hot potato in this case, and the network tries to get rid of the hot potato as quickly as possible to avoid the costs of transiting it over long distances. Therefore, traffic is sent on the shortest external route connecting an ingress to an egress point. BGP provides less control over ingress points, and this is what leads to the fundamental asymmetry in the IE traffic matrix. An example of hot potato routing is shown in Figure 10. Here, there is a clear asymmetry, since the paths taken by traffic flows from Perth to Sydney differ from those from Sydney to Perth. To further understand hot potato routing, we refer the reader to [17] for a basic understanding of BGP routing policy. Thus, although the source-destination independence assumption may hold for OD traffic matrices, it may not necessarily hold for IE traffic matrices, due to distortion by inter-domain routing.

Consider a simple toy example of a network in Figure 11 (originally from [7]). The ASes A, B and C are assumed to be connected, with A having three routers: 1, 2 and 3. The inter-domain routing protocol between these ASes uses hot potato routing, seeking the shortest path between the ASes. Suppose $X^{\mathrm{total}} = 9$. Consider an OD traffic matrix with the form of a gravity model, with an even spread of traffic over each internal router 1, 2 and 3, with $x^{\mathrm{in}} = x^{\mathrm{out}} = x$. The OD traffic matrix has the form $X_{OD} = x x^T / X^{\mathrm{total}}$, with $x = (1, 1, 1, 3, 3)^T$, and written explicitly (rows and columns ordered 1, 2, 3, B, C) as

$X_{OD} = \begin{pmatrix} 1/9 & 1/9 & 1/9 & 1/3 & 1/3 \\ 1/9 & 1/9 & 1/9 & 1/3 & 1/3 \\ 1/9 & 1/9 & 1/9 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 & 1 & 1 \\ 1/3 & 1/3 & 1/3 & 1 & 1 \end{pmatrix}.$   (11)
By Theorem 2, the gravity model for the aggregated OD matrix, comprising OD traffic volumes between ASes A, B and C, is given by

$X_{OD} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix},$   (12)
simply by summing the traffic over the internal nodes. In this case, $X_{OD} = x x^T / X^{\mathrm{total}}$, with $x = (3, 3, 3)^T$, still a gravity model. In order to construct the IE traffic matrix, the ingress and egress points of the network in A need to be determined. The following assumptions are made: (i) A, B and C are peers, (ii) the shortest AS path protocol is used for inter-domain routing, (iii) hot potato routing is used internally by A, and (iv) the Interior Gateway Protocol (IGP) weights are all equal. Suppose ingress and egress points are defined by the following routing table (* represents a wildcard character):

Origin router   Destination   Egress router
1               B             2
1               C             3
2               *             2
3               *             3
The path for each traffic flow in the network, therefore, differs depending on its source and destination. All traffic flows between the PoPs may be decomposed into four components: internal traffic within A, traffic departing A, traffic coming into A, and traffic external to A, shown in Figure 12. The internal traffic of A (Figure 12a) is just the top-left $3 \times 3$ submatrix of $X_{OD}$, which is

$X_{\mathrm{internal}} = \begin{pmatrix} 1/9 & 1/9 & 1/9 \\ 1/9 & 1/9 & 1/9 \\ 1/9 & 1/9 & 1/9 \end{pmatrix}.$   (13)
Traffic bound for A from its peers (seen in Figure 12b, specifically for router 1 in this instance) has entry points controlled by B and C, given the above routing table. Hence, from A's point of view, the traffic behaves as if randomly distributed across the ingress links. Assuming the traffic is evenly spread, this component of the traffic matrix is

$X_{\mathrm{arriving}} = \begin{pmatrix} 0 & 0 & 0 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix}.$   (14)

Traffic departing from A (seen in Figure 12c as originating from router 1), routed by hot potato routing, is described by

$X_{\mathrm{departing}} = \begin{pmatrix} 0 & 1/3 & 1/3 \\ 0 & 2/3 & 0 \\ 0 & 0 & 2/3 \end{pmatrix}.$   (15)
Since A does not provide transit for B and C, traffic external to A, i.e., between B and C, does not appear in A's matrix; it remains unseen by A (Figure 12d). Thus, the total IE traffic matrix is the sum of the component traffic above, so that the entry and exit points match, and is given by

$X_{IE} = \begin{pmatrix} 1/9 & 4/9 & 4/9 \\ 4/9 & 10/9 & 4/9 \\ 4/9 & 4/9 & 10/9 \end{pmatrix}.$   (16)
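The decomposition is easy to check numerically; the sketch below (Python/NumPy) sums the components (13)-(15) to reproduce (16):

import numpy as np

X_internal  = np.full((3, 3), 1/9)                             # equation (13)
X_arriving  = np.array([[0, 0, 0], [1, 1, 1], [1, 1, 1]]) / 3  # equation (14)
X_departing = np.array([[0, 1, 1], [0, 2, 0], [0, 0, 2]]) / 3  # equation (15)

X_IE = X_internal + X_arriving + X_departing
print(X_IE * 9)    # [[1, 4, 4], [4, 10, 4], [4, 4, 10]] / 9, matching (16)
print(X_IE.sum())  # 5, not X_total = 9: conservation no longer holds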
Figure 11: Example toy network with three ASes: A, B and C are all assumed to be peers. The routers 1, 2 and 3 are internal to A.
(a) Internal traffic within A. (b) Incoming traffic to A. (c) Outgoing traffic from A. (d) Traffic external to A.
Figure 12: Traffic flows within the network of Figure 11, classified into four components.
The matrix $X_{IE}$ is not equal to $X_{OD}$ in (12), simply due to traffic asymmetry resulting from hot potato routing. Moreover, the assumption of the conservation of traffic no longer holds, since the total traffic of $X_{IE}$ is not equal to $X^{\mathrm{total}}$. The diagonal terms, for example, are much larger than in $X_{OD}$. This example demonstrates that even if the OD traffic matrix is generated from the gravity model, the IE traffic matrix does not necessarily have a structure that conforms to the gravity model.

For large backbone networks, where large aggregates of traffic are observed, the gravity model performs admirably, as evident from the results of [125] and its use in AT&T's backbone network for traffic engineering purposes. On smaller, local area networks, however, its effectiveness is limited. The friction factor $f_{i,j}$ may not necessarily be constant in actual traffic matrices, possibly due to different time zones [98], especially for a globe-spanning network, language barriers, or the increased deployment of CDNs¹¹. There may also be a distance dependency between ingress and egress points [7]. The gravity model by itself incurs significant estimation error, as the estimates obtained typically do not match the observed link counts. Due to violations of these assumptions, the gravity model turns out to be inaccurate when used in traffic matrix estimation. For example, it was reported to have 39% accuracy when used in estimating traffic matrices [96]. Despite the flaws mentioned, the gravity model was reported to be a good initial estimate for more sophisticated methods. The model was paired with SNMP link measurements to develop the so-called tomogravity technique [125]. The gravity model is also surprisingly useful in the synthesis of traffic matrices. When proposed as a method for synthesising traffic matrices by Roughan [96], the gravity model serves as an excellent first order model for generating the cumulative distribution function of the traffic demands, closely mimicking the statistical properties of actual traffic matrices. While the basic gravity model may not necessarily be an optimal model, it is a simple and good first order model for estimation and synthesis purposes, and it can be improved to take into account the factors described above.

4.2.3 Generalised gravity model
In order to improve the efficacy of the basic gravity model and to address its deficiencies, a generalisation of the gravity model was developed [126, 127]. In a nutshell, the assumption of independent ingress and egress nodes was relaxed by dividing traffic into several classes of ingress and egress nodes, motivated by the example in the previous section. Independence only applies to traffic belonging to the same class, effectively enforcing a conditional independence criterion. Such an assumption is closer to actual conditions between ingress-egress pairs in a network. In particular, the model now accounts for asymmetry of the IE traffic matrix. To account for the effect of hot potato routing, traffic is separated into classes based on peering and access links. Consider again the network in Figure 11. From the figure, two classes can be defined: internal and external. There are then four types of source-destination links (see Figure 12): internal to internal, internal to external, external to internal and external to external.
11 The deployment of CDNs exacerbates this effect, since they are located close to the end user so as to avoid having to pay for their traffic transiting other networks.
In the generalised gravity model, independence between nodes is only assumed within the internal-to-internal class and the external-to-external class. Thus, routers 1, 2 and 3 in AS A are independent of each other, and so are ASes A, B and C of one another, but not, for instance, traffic from 2 to B. In other words, the generalised gravity model modifies the basic model by ensuring that the independence assumption still holds, but only when conditioned on each traffic class. In terms of probabilities, traffic is conditionally independent, as formulated below for the joint fanout distribution over the sets of access nodes A and peering nodes P of the network of interest:

$p_{S,D}(s,d) = \begin{cases} \dfrac{p_S(s)\, p_D(d)}{p_S(A)\, p_D(A)} \left(1 - p_S(P) - p_D(P)\right), & s \in A,\ d \in A, \\ \dfrac{p_S(s)\, p_D(d)}{p_D(A)}, & s \in P,\ d \in A, \\ \dfrac{p_S(s)\, p_D(d)}{p_S(A)}, & s \in A,\ d \in P, \\ 0, & s \in P,\ d \in P. \end{cases}$   (17)
The four probabilities correspond to the four cases in Figure 12. In particular, as per intuition, peering-to-peering traffic is set to zero, since this class does not transit the network of interest. The stratification of traffic into several classes results in an improved model. Its performance in traffic matrix estimation is significantly better than the basic gravity model [126]. Further stratification beyond separating peering and access nodes is possible. For example, the origin of the traffic, whether from a fixed location or a mobile device, or the destination of the traffic, depending on application profiles, may be defined as new classes in the model. However, further classification in this manner is only possible with more side information available.

The generalised gravity model is superior to the basic model, but gravity models in general have been somewhat tarred with the same brush. Most works benchmarking the performance of various models, for instance for estimation, compare against only the simple gravity model, but make confusing statements that could lead one to believe that all such models are faulty. In fact, the generalised gravity model is vastly superior, but rarely used outside of the company at which it was first developed, AT&T. The chief reason is that the model requires additional topological and routing data, and requires the external traffic flows to be mapped using this data [127]. This is a non-trivial task. In addition, in many external studies researchers have not had access to, in particular, knowledge of the access and peering links in the network under study. Network operators are not open to releasing information on their networks to the public; however, the Abilene dataset, used in [127], and the GEANT dataset [116] are publicly available, and contain enough information to make such comparisons.
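A sketch of the conditional fanout (17) (Python/NumPy; the function and the boolean access mask are our own constructions, and the equation itself is reconstructed above from a garbled source):

import numpy as np

def generalised_gravity_fanout(pS, pD, access):
    # Equation (17). pS, pD: marginal source/destination probabilities per node;
    # access: boolean mask, True for access nodes, False for peering nodes.
    A, P = access, ~access
    pSA, pDA = pS[A].sum(), pD[A].sum()
    pSP, pDP = pS[P].sum(), pD[P].sum()
    n = len(pS)
    out = np.zeros((n, n))
    for s in range(n):
        for d in range(n):
            if A[s] and A[d]:
                out[s, d] = pS[s] * pD[d] / (pSA * pDA) * (1 - pSP - pDP)
            elif P[s] and A[d]:
                out[s, d] = pS[s] * pD[d] / pDA
            elif A[s] and P[d]:
                out[s, d] = pS[s] * pD[d] / pSA
            # peering-to-peering entries stay zero: no transit traffic
    return out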
4.2.4 Discrete choice models

Another proposed model is the choice model, introduced by Medina et al. [73]. The basis of the discrete choice model (DCM) is the theory of choice models for decision behaviour, originally developed in psychology, and later expanded upon by researchers in other fields, more recently in economics by Daniel McFadden, for which he won the Nobel Prize in Economics in 2000 (see, for example, [71]). Choice models are popular in econometric applications, as they describe a simplified underlying mechanism of rational decision behaviour. They have been used in transportation analysis, econometrics, marketing and consumer theory. The main inspiration for their use in Internet modelling comes from [110], where a choice model is used to model the behaviour of travellers between Maceio and Sao Paulo, two cities in Brazil, as it parallels traffic traversing PoPs. The choice model is defined by four elements: (i) the decision makers,
(ii) the set of alternatives (choices), (iii) the attributes of the decision makers and of the set of alternatives, and (iv) the decision rules. All these elements play a key role in ultimately determining the decision process. The decision makers represent the agents making the decisions on which choices to take. The set of alternatives characterises the set of possible actions the agents can choose. Each decision maker executes several choices based on its own inherent properties, or attributes, as well as the attributes of the set of alternatives. These attributes predispose a decision maker to certain alternatives. Finally, the decision rules determine how choices are made. How good a choice is, is measured against a standard based on a set of criteria. The rules establish constraints on the choices of the decision makers, enforcing consistency in the entire system. All four elements of the model aim to capture how agents would naturally decide between several differing choices in a system, in a rational and consistent way, based on a set of rules.

In the context of network traffic modelling, there are two interdependent factors influencing choices. First, the network users' behaviours determine much of how traffic flows are generated, as discussed in §4.1. Second, the network design and configuration play a very important role in how traffic flows are delivered on the network. Routing protocols, policies, QoS as determined by the network operator, and the geographical location of routers and PoPs determine how traffic is transported within the network and between networks. One could visualise this as a two-level process: users generate the traffic flows, whereupon the flows are routed through the network, based on the network's design and policies, to the flows' destinations.

All four elements have direct analogues in the context of network traffic modelling. The decision makers are the set of ingress nodes, aggregating all information about the users' behaviours and the network design and policies. The set of alternatives is the set of egress nodes, which aggregates the information about the users connected to these nodes. Thus, each decision maker i has a choice set $C \subseteq E$. Each node i, a decision maker, is modelled by the equation, for all $j \in C$,

$U_j^i = V_j^i + \epsilon_j^i,$   (18)
where $U_j^i$ denotes the utility between the node pair i and j, $V_j^i$ aggregates the (deterministic) information from the user behaviour and network design, and $\epsilon_j^i$ is a random component to account for missing information from unknown factors. The term $V_j^i$ can be thought of as accounting for the level of attractivity of a destination node j. In [73], the authors proposed M attributes per decision maker-choice pair, such that

$V_j^i = \sum_{m=1}^{M} \beta_m \, \phi_j^i(m) + \gamma_j,$   (19)
where $\phi_j^i(m)$ denotes the m-th attribute, $\beta_m$ are weights to account for the relative importance of the m-th attribute, and $\gamma_j$ is a scaling term for other factors of attractivity besides the M attributes. An attribute $\phi_j^i(m)$ could be the size of the destination node PoP, since a large egress PoP is more likely to have traffic exiting from it, or the number of peering links the destination node j has. Based on the above, Medina et al. [73] proposed a traffic matrix model that assumes the decomposition
$X_{i,j} = O_i \, \alpha_{i,j}.$   (20)

The parameters $O_i$ and $\alpha_{i,j}$ denote the total outgoing traffic volume from node i and the fanout from node i to node j respectively, with $\sum_j \alpha_{i,j} = 1$ for each i. The total traffic from a node, $O_i$, is known from SNMP data.
Observe that the traffic matrix is now parameterised by the fanout distribution, which has a direct analogue in the gravity model. In inference applications, it is the fanout distribution that is estimated, thus indirectly inferring the traffic matrix, rather than directly estimating the traffic demands. Fanouts have been shown to be generally stable over a measurement period (several hours), compared to traffic demands [57], which is advantageous in traffic matrix estimation, since the stability contributes to more accurate inference. The fanout distribution is determined by a decision rule. In [73], a utility maximisation criterion was used:

$\alpha_{i,j} = \Pr\left( U_j^i = \max_{k \in C} U_k^i \right).$   (21)
Now, $\alpha_{i,j}$ depends on the random components $\epsilon_j^i$, as observed from equation (18). A natural starting point is to assume $\epsilon_j^i$ is i.i.d. Gaussian with mean 0 and variance 1. This transforms (21) into the well-known multiple normal probability unit, or m-probit, model [70]. However, there is no closed form for (21) under this assumption. Instead, by assuming $\epsilon_j^i$ is i.i.d. following the Gumbel distribution, the m-probit model can be approximated, with (21) now having a closed form. This model is popularly called the multiple logistic probability unit, or m-logit, model [70]. The closed form is simply

$\alpha_{i,j} = \frac{\exp(V_j^i)}{\sum_{k \in C} \exp(V_k^i)},$   (22)

implying that

$X_{i,j} = O_i \, \frac{\exp(V_j^i)}{\sum_{k \in C} \exp(V_k^i)}.$   (23)
The difficulty lies in determining which attributes should be included. The authors considered two models, which they empirically validated: (i) $V_j^i = \beta_1 \phi_j(1) + \gamma_j$, where $\phi_j(1)$ denotes the total incoming bytes to an egress PoP j, and (ii) $V_j^i = \beta_1 \phi_j(1) + \beta_2 \phi_i(2) + \gamma_j$, where in addition $\phi_i(2)$ denotes the total bytes leaving the ingress PoP i. In general, the second model is more accurate, owing to the additional attribute, but it is not known whether this is a case of overfitting or whether the new parameter is truly useful.

The choice model is a variation of the gravity model. In particular, looking back at equation (20), the total traffic flowing out of ingress i may be regarded as the repulsion factor, while the parameters $\alpha_{i,j}$ combine both the attractiveness factor and the friction factor. A quick comparison of the choice model to (8) highlights the strong link between the two models. The choice model, however, has a larger number of parameters, to account for the attributes of the decision makers and the set of alternatives. A minimal sketch of the m-logit computation follows.
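This sketch (Python/NumPy; the function names are ours, and the utility matrix V is an assumed input) computes the fanouts of equation (22) and the resulting traffic matrix of (23):

import numpy as np

def mlogit_fanout(V):
    # Equation (22): a softmax of the deterministic utilities V[i, j].
    e = np.exp(V - V.max(axis=1, keepdims=True))  # numerically stabilised
    return e / e.sum(axis=1, keepdims=True)

def choice_model_tm(O, V):
    # Equation (23): X[i, j] = O[i] * alpha[i, j].
    return np.asarray(O, dtype=float)[:, None] * mlogit_fanout(V)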
4.2.5 Independent connections model

The independent connections model (ICM) was introduced in [43, 44]. Unlike the gravity model, this model discards the assumption of independence between the ingress and egress nodes, and instead focuses on the connections between nodes. More specifically, the model differentiates between initiators, nodes that initiate a traffic connection, such as a TCP connection, and responders, the nodes that accept these connections. The independence assumption comes in by assuming that each initiator and responder is independent, in effect resulting in independent connections.
The inspiration for the ICM comes from traffic characterisation studies, specifically of TCP behaviour. TCP creates two-way connections in response to a SYN packet, the packet used to initialise a connection. Although it is common for the majority of traffic to flow in one direction, there is also a smaller reverse flow. Common examples include an HTTP query, which involves query packets flowing in one direction, and a much larger set of data flowing in the other as a response, or an FTP transaction, which may involve mainly data flow in one direction, while the forward packets require acknowledgement packets in the reverse direction. Therefore, the model uses the notion of a connection: a two-way exchange of packets between an initiator and a responder, corresponding to the ingress and egress nodes, without necessarily assuming symmetry of the two-way traffic.

Three parameters were defined as a product of these studies. The first parameter, the forward traffic proportion $f_{i,j}$, is the normalised proportion of forward traffic in a connection between ingress i and egress j, measured in packets or bytes, with $0 \le f_{i,j} \le 1$ for all $i \in I$ and $j \in E$. The second parameter, $A_i$, describes the activity level of the users at i (the A stands for activity). Finally, some nodes may be chosen for connection more than others, and thus $P_j$ (P stands for preference) denotes the preference for node j. The main assumption of the model is that the probability that a connection responder belongs to node j depends on j only. The values of $P_j$ for $j \in E$ are unnormalised; they are divided by the sum $\sum_k P_k$ in order to treat them as the probability that node j is a connection responder. The parameters $A_i$ and $P_j$ were shown to be uncorrelated on empirical data, providing some evidence that these parameters describe two very different underlying quantities. The model is expressed by

$X_{i,j} = \frac{f_{i,j} \, A_i \, P_j}{\sum_k P_k} + \frac{(1 - f_{j,i}) \, A_j \, P_i}{\sum_k P_k}.$   (24)
The first term captures the forward traffic of the connections between initiator i and responder j, while the second term captures their reverse traffic, generated by the users at i and j respectively. The model may be viewed as a weighted sum of two gravity models, one characterising the forward traffic and the other the reverse traffic; thus potential asymmetries in traffic can be accounted for.

The model is sufficiently flexible to accommodate variations. For example, the simple IC model modifies one parameter of model (24) by setting $f_{i,j} = f$, where f is a constant, as it has been observed that f is fairly stable from week to week (at least on the Abilene dataset [44]), simplifying the model considerably. Another variation, the time-varying IC model, includes temporal variation of the parameters, i.e.,

$X_{i,j}(t) = \frac{f(t) \, A_i(t) \, P_j(t)}{\sum_k P_k(t)} + \frac{(1 - f(t)) \, A_j(t) \, P_i(t)}{\sum_k P_k(t)},$

while the stable-fP IC model removes the time dependence of f and the preferences $\{P_j\}$, and the stable-f IC model only removes the temporal dependence of f. These variations allow trade-offs between the degrees of freedom of the model and computational complexity, especially when used for the synthesis or inference of traffic matrices. With fewer parameters (shown to be fewer than the basic gravity model), the model is easier to compute. A sketch of the simple IC variant follows.
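A sketch of the simple IC variant, with a constant forward proportion f (Python/NumPy; the function name is ours):

import numpy as np

def simple_ic_tm(A, P, f):
    # Equation (24) with f_{i,j} = f constant (the simple IC model).
    A = np.asarray(A, dtype=float)
    P = np.asarray(P, dtype=float)
    G = np.outer(A, P) / P.sum()   # gravity-like term A_i P_j / sum_k P_k
    return f * G + (1 - f) * G.T   # forward traffic plus reverse traffic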
and the stable-f P IC model removes the time dependency of f and the preferences {Pj }j I , while the stable-f IC model only removes the temporal dependence of f . These variations allows trade-offs between the degrees of freedom of the model and computational complexity, especially when used for the synthesis or inference of trafc matrices. With less parameters, which was shown to be less than the basic gravity model, the model is easier to compute. The parameters {Ai }iI and {Pj }j E were validated on actual data. Activity levels {Ai }i possess diurnal patterns, corresponding to user access patterns, and a periodic pattern on a weekly timescale. In particular, activity levels are higher on weekdays compared to the weekend, matching observations such as Figure 2c. There is also a more prominent periodic pattern when considering larger nodes, as this effect is due to aggregation, as it captures the users with higher activity levels. These observations are consistent with the temporal properties discussed of trafc matrices discussed in 2. The model was shown to be
effective in estimating traffic matrices, improving over the basic gravity model by 20%-25% for the GEANT dataset [116] and almost 10% for the Totem dataset [43, 44]. These results and observations show that average user behaviour is largely stable and predictable, a great boon to traffic modelling development.

In some ways, the ICM is similar to the DCM, in that both models include parameters to describe the underlying user behaviour, unlike the basic gravity model. For example, both models have a parameter to quantify the level of attractiveness of one node (connection) to another. The DCM is also able to incorporate features of the ICM. The differences end there, however. For one, the ICM has a slightly richer description of the flow connections between nodes, such as the forward and reverse traffic flows between nodes, whereas the DCM aggregates this information in a single parameter. The ICM seeks to capture the behaviour of each connection made, rather than merely model the relationship between nodes, emphasising a different focus compared to the DCM. Thus, the ICM may account for hot potato routing and other asymmetries in traffic flow.

4.2.6 Low-rank spatial models
A very noticeable feature of both the DCM and the ICM is that they can better represent traffic matrices, but are more highly parameterised. It is, in general, possible to fit a data set more accurately when more parameters are available, but this presents a difficulty: does one accept the more complex, more highly parameterised model, or the simpler, perhaps more robust model? In the previous cases, this was an all-or-none decision (at the least we had to decide on the type of model used, if not the exact number of choices involved), whereas the gravity model is fixed in its parameterisation. However, there are concepts that easily extend the gravity model. The low-rank model is somewhat new, made popular by its use in matrix completion problems [19, 20, 22, 89]. Low-rank models assume the traffic matrix is well represented by the low-rank approximation

$X_r = \sum_{i=1}^{r} \sigma_i u_i v_i^T,$   (25)
where $\sigma_i$ denotes the i-th singular value, with all singular values arranged in descending order, i.e., $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r$. The famous Eckart-Young theorem [108, Theorem 4.32, p. 70] states that the best rank-r approximation of a matrix, in the sense of the Frobenius norm¹², is given by retaining the largest r singular values of its Singular Value Decomposition (SVD). The theorem, however, assumes that the target matrix for approximation is already known; in low-rank matrix recovery, the target matrix is unknown. In the context of traffic matrix modelling, low-rank models are a relatively recent introduction, beginning with the work in [128], although that model was spatio-temporal (see further below). Low-rank purely spatial models were proposed and used to good effect by Bharti et al. [11]. The choice of a low-rank model has strong empirical backing from the earlier results of PCA applied to traffic data, for instance in [66, 124]. In essence, we can see (25) as expressing a traffic matrix as a weighted sum of gravity models, i.e., each rank-1 component looks exactly the same as that expressed in (7). It seems a logical approach simply because the Internet is not a homogeneous entity. In particular, there are many types of applications running across the network: from interactive sessions, to voice, to HTTP, to streamed video. We might imagine that a class of traffic, say streaming video, satisfies the gravity law, but with different row and column sums to, say, voice traffic. Given this, a weighted sum of gravity matrices is a natural extension.
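The Eckart-Young truncation is a one-liner on top of the SVD; a sketch (Python/NumPy):

import numpy as np

def best_rank_r(X, r):
    # Best rank-r approximation in the Frobenius norm, per equation (25),
    # obtained by truncating the SVD of X.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]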
12 The Frobenius norm of a matrix X is $\|X\|_F = \sqrt{\sum_{i,j} x_{i,j}^2}$.
Previous models actually turn out to be special cases of this low-rank model. The gravity and discrete choice models are spatial rank-1 models. The generalised gravity model and the ICM are spatial rank-2 models, the latter of which can be observed from the summation of the forward and reverse traffic contributions in equation (24). The low-rank model may thus be viewed as a general framework for further model development. We shall consider this idea in more detail below in the context of spatio-temporal modelling.
4.3.1 Low-rank spatio-temporal models

The main idea in spatio-temporal modelling of traffic matrices so far has been to exploit the low-rank models mentioned above, but applied to the stacked representation of a series of traffic matrices, denoted here by X. Low-rank models assume this matrix is well represented by the low-rank approximation

$X_r = \sum_{i=1}^{r} \sigma_i u_i v_i^T,$   (26)
where $\sigma_i$ denotes the i-th singular value. As before, all singular values are arranged in descending order. The difference between model (26) and the model above is that the low-rank assumption now also applies to the temporal structure of the traffic matrix. In the context of traffic matrix modelling, low-rank models are a relatively recent introduction, beginning with the work in [128]. Besides spatial correlations (exploited by the models proposed previously), traffic matrices are known to exhibit temporal correlations, resulting in a low-rank structure both spatially and temporally, justifying the rationale behind the model. The objective of that work is to approximate the time series of traffic matrices X by a rank-r model $X_r$. The model is spatio-temporal, in contrast to the models discussed previously, which are only spatial in nature.

Simply insisting on low rank, however, misses another important point, which is that traffic matrices also exhibit locality: elements that are close in time (where this might mean time of day, not absolute time) or space exhibit strong correlation. It turns out that the model (26) is greatly enhanced with additional simple constraints on the temporal and spatial structure, reflecting the smoothness property of Internet traffic under normal operating conditions. The low-rank construction also proved relatively easy to use in practical applications such as matrix completion, and [128] showed that it could be used to infer matrices from link data, impute missing data (from as little as a few percent of extant values), or predict matrices into the future.

Despite the demonstration of its effectiveness in traffic matrix estimation, low-rank models are still not well understood. Unlike the previous models, where the parameters are, by design, quantitative measures of an underlying network property, low-rankedness (in spatio-temporal matrices) does not correspond to any particular network aspect, such as user behaviour. It is just a measure of the spatio-temporal correlation between traffic flows. It does, however, hint that OD traffic flows are clustered, if one considers the allocation
of IP prefixes. A better interpretation is necessary to understand the properties of the model, and the work in [11, 66] may provide clues in the right direction. Furthermore, in the recovery of traffic matrices using the low-rank model, theoretical work on the minimum number of measurements required for recovery under structured losses of rows and columns of X, which occur frequently in the networking context, remains open. At present, the focus is on random erasures of the elements of X [19, 20, 22, 89]. Overcoming structured losses is far more important than random erasures, as such scenarios are frequently encountered in real networks. For example, a router failure may result in missing data for an entire row of the traffic matrix. The results of [128] show much promise, as the method is largely immune to structured losses. The challenge now is to construct a theory of why this is so, and of the extent to which structured losses may be recovered. Low-rank models hold much promise for the development of more sophisticated models. More work is required to understand the spatio-temporal properties of the traffic matrix, but as preliminary results indicate, there is a potentially rich structure to exploit. A minimal completion sketch is given below.
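The following is a naive alternating-least-squares completion sketch in the spirit of equation (26); it is not the actual method of [128], which adds spatio-temporal smoothness constraints, and the rank, regularisation weight and iteration count are illustrative assumptions.

import numpy as np

def complete_low_rank(X, mask, r=2, lam=0.1, iters=50):
    # Fit X ~ L R^T using only the observed entries (mask == True),
    # by ridge-regularised alternating least squares.
    m, n = X.shape
    rng = np.random.default_rng(1)
    L = rng.standard_normal((m, r))
    R = rng.standard_normal((n, r))
    ridge = lam * np.eye(r)
    for _ in range(iters):
        for i in range(m):                      # update each row factor
            o = mask[i]
            L[i] = np.linalg.solve(R[o].T @ R[o] + ridge, R[o].T @ X[i, o])
        for j in range(n):                      # update each column factor
            o = mask[:, j]
            R[j] = np.linalg.solve(L[o].T @ L[o] + ridge, L[o].T @ X[o, j])
    return L @ R.T                              # completed low-rank estimate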
4.3.2 Tensors and hyper-matrices

A time series of purely spatial traffic matrices is simply a 3-dimensional array, which is sometimes also called a hyper-matrix. Such a representation is more natural, as it would theoretically preserve spatio-temporal properties better than the stacked matrix, as well as track the evolution of the traffic demands throughout the measurement interval. A tensor representation of traffic is better still, as it is invariant to changes of basis, unlike hyper-matrices. The difficulty, however, is identifying the type of tensor decomposition that would produce low-rank structures, or a beneficial, exploitable structure. There are many proposed methods for tensor decomposition, but the two most popular are the Canonical Polyadic (CP) or PARAFAC decomposition and the Tucker or multilinear decomposition [60]. Tensor decomposition requires a large number of computations, which may be an obstacle to its adoption in traffic matrix recovery. At present, the one work exploiting the tensor structure of network traffic, to impute missing entries of a network traffic tensor, is found in [4].
5 Applications
Polling of the measurements is performed every 5 minutes (and the polling intervals may not be perfectly synchronised across a whole network). Clearly, any inference method is required to be robust against these errors and uncertainties.

The underconstrainedness of the problem may be mitigated by active measures. One is direct measurement, using dedicated monitors or in-built measurement software on routers such as NetFlow [1]. Direct measurement at even a single point of ingress results in measurement of an entire row of the traffic matrix, drastically reducing the number of missing matrix entries. Another interesting proposal is to change the IGP (Interior Gateway Protocol) link weights over several snapshots within the measurement interval to provide fresh sets of observations, thereby resulting in a system of linear equations with a unique solution (full rank) from the SNMP measurements [80, 107]. Both these techniques may be impractical, either being too costly in the case of direct measurements, or requiring direct intervention by the network operator for IGP weight changes. Most proposals simply avoid these issues by settling on a passive approach of inferring the traffic matrix straight from SNMP data.

There are two main approaches to traffic matrix inference. The first is the deterministic approach, where y is assumed to provide hard constraints, rather than statistical data. Goldschmidt [54] formulated this as a Linear Program (LP) whose objective is to find bounds on traffic matrix elements. In simple terms, the LP finds the traffic matrix with the worst case upper and lower bounds on the traffic demand, subject to constraints. Recall the vectorised traffic matrix x has size N(N-1). For the upper bound, the LP model is defined with the objective function

$\max_{x} \sum_{j=1}^{N(N-1)} \lambda_j x_j,$   (27)
where $\lambda_j$ is a weight for OD pair j, also called the coefficient of demand. There are three constraints to satisfy, namely, (i) the observation constraints

$\sum_{j=1}^{N(N-1)} A_{ij} x_j \le y_i, \quad i = 1, 2, \dots, L,$   (28)

together with (ii) and (iii), constraints tying the demand of each OD pair (i, j) to the loads on the links incident to its origin i and destination j, respectively (see [54] for their precise form).
Similarly, the lower bound is found by substituting the maximisation operation in (27) with a minimisation. The LP only produces a nontrivial solution if the lower and upper bounds on the traffic demand are greater than zero and less than the observed total link count $\sum_j y_j$. Unfortunately, the utility of the LP is restricted to small toy problems. First, two linear programs have to be solved each time to obtain the upper and lower bounds on the traffic demands, which is computationally expensive for large N. Second, the LP was shown to have terrible performance when tested on several types of traffic matrices [73]. Errors on some traffic matrix entries were in excess of 200%, with most in excess of 100%, showing that while the LP may be useful for certain small topologies, in general it
is not considered a practical estimation method. The reason for this is that the LP sets many estimated values to zero, resulting in overcompensation by the rest of the estimated values in order to meet the total traffic constraints. Third, the solution is highly sensitive to the weight choices, which implies that different solutions will be obtained depending on the chosen weights.

Instead, a more successful alternative is the use of statistical models and regularisation, i.e., treating the traffic matrix as a realisation of a random process generated from a model. Regularisation refers to the inference technique of imposing additional structural assumptions on the problem to reduce underconstrainedness. Regularisation methods are defined by four components: (i) a prior solution, generated from a model, (ii) a model deviation measure, used to compute the deviation of a feasible solution from the model, (iii) a distortion measure, used to compare the deviation of the model from the observations, and (iv) an adjustment step, to ensure that the constraints on the total traffic entering and exiting all ingress and egress nodes, as well as the non-negativity constraints, are satisfied. In terms of an optimisation procedure, solving the tomography problem is equivalent to

$\hat{x} = \operatorname*{argmin}_{x \in \mathbb{R}^{N(N-1)}} \; R(x, y) + \lambda \, d(x, x^{(0)}),$   (30)
where $R(\cdot, \cdot)$ denotes the distortion measure, $d(\cdot, \cdot)$ denotes the model deviation measure, and $\lambda \ge 0$ is the penalty constant that amplifies the penalisation of a feasible solution which strays too far from the model¹³. Typically, $R(x, y) = \|y - Ax\|_2$. Regularisation techniques are biased towards a particular prior model. Thus, if the model is inconsistent, then the estimator (30) will be inconsistent as well. However, if the prior model describes the final solution somewhat accurately, then it is expected that the final estimate will be fairly accurate. As an example, suppose the prior model $x^{(0)}$ used is the gravity model, which can be derived from link measurements by calculating the ingress and egress traffic volumes (by summing link measurements on the edge-links of the network). One proposed penalty [127] is defined as

$d(\hat{x}, x^{(0)}) = H(\hat{x}) + H(x^{(0)}) - H(\hat{x}, x^{(0)}),$   (31)

where

$H(x) = -\sum_{j=1}^{N(N-1)} \frac{x_j}{\sum_{k=1}^{N(N-1)} x_k} \log \frac{x_j}{\sum_{k=1}^{N(N-1)} x_k}$   (32)

is the empirical entropy, while

$H(x, x^{(0)}) = -\sum_{j=1}^{N(N-1)} \frac{x_j}{\sum_{k=1}^{N(N-1)} x_k} \log\!\left( \frac{x_j \, x_j^{(0)}}{\left(\sum_{k=1}^{N(N-1)} x_k\right)\left(\sum_{k=1}^{N(N-1)} x_k^{(0)}\right)} \right)$   (33)
is the joint empirical entropy between the estimate and the prior model. The penalty function (31) measures the uncertainty between the quantities $\hat{x}$ and $x^{(0)}$, and is commonly known as the mutual information [33].
13 Technically, x comprises non-negative integers, but a relaxation to real numbers is used, as it is easier to compute, especially when considering large traffic matrices.
The joint entropy term $H(\hat{x}, x^{(0)})$ quantifies the uncertainty between $\hat{x}$ and $x^{(0)}$; if the mutual information (31) is zero, then $\hat{x}$ is statistically independent of $x^{(0)}$. The approach is highly flexible: it can deal with the generalised gravity model simply by using a new prior model, with the constraints (17) added to account for the different traffic classes (access and peering traffic). The penalty can be rewritten and thought of as the Kullback-Leibler distance [33] between the estimate $\hat{x}$ and the prior model $x^{(0)}$, implying that the estimation objective seeks to preserve as much prior information from $x^{(0)}$ as possible, while minimising $R(x, y)$. This can be used directly, or approximated, for instance as a weighted quadratic [125, 126]. Using suitable models, most of the existing inference methods can be described in this framework (see [127] for details). Other penalties can also be used, such as the nuclear norm, given by
$d(X) = \|X\|_* = \sum_{i=1}^{r} \sigma_i,$   (34)
for low-rank model recovery. The solution of the optimisation procedure is often adjusted after regularisation using Iterative Proportional Fitting (IPF) [35], so as to satisfy the observed total traffic constraints and non-negativity constraints (those that were not included in the regularisation for computational reasons). In practice, IPF is a very simple algorithm, performing fast even on large traffic matrices; both steps are sketched below.
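Both steps can be sketched compactly (Python/NumPy; the weighted-quadratic penalty is a stand-in for the entropy penalty above, and all names and parameter values are ours):

import numpy as np

def regularised_estimate(A, y, x0, lam=0.1):
    # min_x ||y - A x||^2 + lam^2 ||x - x0||^2, solved in closed form:
    # (A^T A + lam^2 I) x = A^T y + lam^2 x0.
    n = A.shape[1]
    x = np.linalg.solve(A.T @ A + lam**2 * np.eye(n), A.T @ y + lam**2 * x0)
    return np.maximum(x, 0.0)        # crude non-negativity adjustment

def ipf(X, row_sums, col_sums, iters=100):
    # Iterative Proportional Fitting: rescale a strictly positive matrix
    # until its row and column sums match the observed totals.
    X = X.astype(float).copy()
    for _ in range(iters):
        X *= (row_sums / X.sum(axis=1))[:, None]
        X *= (col_sums / X.sum(axis=0))[None, :]
    return X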
Figure 13: Three example topologies where local traffic matrices provide benefits (motivated by the seminal figure in [9]). Left: centralised, or star, topology; centre: decentralised topology; right: distributed topology.
Additional information can also be used. For instance, if some rows of the matrix are known from measurements, then this reduces the number of variables to be estimated, making the problem a little simpler. Another source of potential data is the collection of local traffic matrices [118], providing information on traffic between the interfaces of routers. We can see why this is useful by considering the three network topologies in Figure 13: a centralised (star), a decentralised and a distributed topology. If the network has a star topology, then the entire traffic matrix is known once the local traffic matrix of the central router is obtained. For the other two topologies, collection of local traffic matrices at strategic places in the topology is likely to reduce the underconstrainedness of the original inference problem, though less (relative) information is provided the more distributed the topology. Local traffic matrices were demonstrated to provide a significant information boost in [126], especially if the interfaces are well connected, which is highly dependent on the underlying network topology. If direct flow measurements from dedicated monitors are available, they provide a huge boon, as an entire row of an IE traffic matrix would be revealed. In practice,
however, these are generally not available, as they are deemed expensive. An advantage of the regularisation method is that such additional information may be incorporated easily via constraints.

Another issue is computational tractability, since traffic matrices are often large. Most model deviation and distortion measures are chosen to be convex¹⁴, with linear constraints. In this way, problem (30) becomes a convex optimisation problem, for which many fast, scalable and efficient algorithms have been developed [13].

The discussion here has only considered point-to-point traffic matrices. For IE matrices, the point-to-multipoint matrix may be more useful instead. Recall from the above that an ideal traffic matrix is invariant to other network aspects in order to be useful for network design. Unlike the point-to-point traffic matrix, the point-to-multipoint matrix contains records of the amount of traffic from one ingress point to a set of egress points. These sets are chosen to preserve invariance under changes in the egress point, a property much more useful for network planning. Inference of point-to-multipoint traffic matrices may be done in a similar fashion to point-to-point IE matrices [126].
The path distances from PoP 1 to 2, 1 to 3 and 2 to 3 are 1, 2 and 4, respectively. The goal is to find the shortest path routing based on the traffic matrix and the topology. Recall that the link loads may be expressed as $y = Ax$, where $x = (1, 4, 1, 2, 3, 2)^T$ is the vectorised form of $\mathbf{X}$, without including self-traffic. The routing matrix $A$ here is of size 3 by 6, since there are 3 links and 6 flows (as we do not consider self-traffic). Each element is either 0 or 1, as the flows are assumed to be delivered whole. The capacity of each link is assumed to be infinite.
14 Convex relaxations of a non-convex objective function are often used as a substitute; e.g., the nuclear norm is used in place of the rank function to recover low rank matrices. Under these circumstances, it is crucial to note the assumptions of the model, so as to know when the recovered solution is an excellent approximation of the true solution.
Figure 14: Example of a simple network of three PoPs and path distances of each link. The shortest path routes chosen are from PoP 1 to 2, and from PoP 1 to 3, as these paths have the lowest distance (corresponding to the thick lines).
Thus, finding the routes may be cast as an optimisation problem (an integer program) with a simple cost function based on path distance:

$$A^* = \operatorname*{argmin}_{A_{i,j} \in \{0,1\},\ \forall i,j} \sum_{i,j} d_i\, A_{i,j} \qquad (35)$$

subject to each flow being carried on a single connected path from its ingress to its egress, where $d_i$ is the path distance of link $i$. For the example network, shortest path routing yields

$$A = \begin{pmatrix} 1 & 0 & 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}$$

with rows corresponding to the links (1-2), (1-3) and (2-3), and columns to the flows (1→2, 1→3, 2→1, 2→3, 3→1, 3→2): traffic between PoPs 2 and 3 is routed via PoP 1, since the two-hop path has distance 1 + 2 = 3 < 4.
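The implied link loads $y = Ax$ can be checked directly; a minimal sketch (the flow ordering follows the convention above):

```python
import numpy as np

# Links: (1-2), (1-3), (2-3); flows: 1->2, 1->3, 2->1, 2->3, 3->1, 3->2.
A = np.array([[1, 0, 1, 1, 0, 1],
              [0, 1, 0, 1, 1, 1],
              [0, 0, 0, 0, 0, 0]])
x = np.array([1, 4, 1, 2, 3, 2])
y = A @ x
print(y)   # [ 6 11  0 ]: link 2-3 carries nothing under this routing
```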
The example presented is simple, even trivial. In practice, networks have larger topologies and additional costs may be included, such as the bandwidth utilisation of each link (so as to minimise the maximum utilisation of the network), QoS constraints, and constraints on the maximum capacity of each link. For instance, if the maximum capacity of the link from PoP 1 to 3 is at most 4, i.e., $y_{1,3} \leq 4$, then the routing matrix above is no longer a feasible solution. The new solution must incorporate the link from PoP 2 to 3 to conform to the constraints (assuming the link from 2 to 3 has adequate capacity). In all of the above-mentioned tasks, one of the key ingredients is the traffic matrix. The reason these matrices are so useful comes down to invariance. The traffic on a particular link obviously varies as the links in the network change. However, an ideal traffic matrix is invariant to other aspects of the network, such as the network topology and underlying routing protocols. Invariance allows the design to be varied without the inputs to the process changing. To a certain extent, the IE traffic matrix satisfies the invariance property, but it is far from perfect for some tasks; most notably, it is highly sensitive to external routing changes and some internal changes [112, 113]. The OD matrix is in some sense preferable [7], but harder to measure in most cases. The point-to-multipoint IE matrix (discussed in the previous section) is a useful compromise. Furthermore, a network operator would require a prediction of the traffic matrix out to the level of the planning horizon for a task. Any forecast depends on the time scale involved and the underlying model used. At short time scales, say minutes, stationarity may be a reasonable approximation, and there are therefore many time series approaches to the problem. On time scales of hours to days to weeks, the cyclostationary nature of the data must be included. The temporal models presented earlier can provide such predictions. For instance, with the model (4), we can estimate the average traffic at some time in the future by simply
extrapolating the mean. Longer-term prediction often focusses purely on the large-scale trend L(t) (see §4.1), often captured using a simple growth model (linear or exponential) and regression. In all cases, historical data is needed, usually several times as long as the prediction interval. In addition, whenever performing prediction we should provide an estimate of variances, or confidence intervals, though this component of the problem has not been well studied in the specific context of traffic matrices.
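For instance, a long-term trend forecast can be as simple as a least-squares line fit and extrapolation; the sketch below uses invented weekly averages purely for illustration.

```python
import numpy as np

t = np.arange(12)                                  # past 12 weeks
traffic = np.array([10.2, 10.8, 11.1, 11.9, 12.4, 12.8,
                    13.5, 14.1, 14.3, 15.0, 15.6, 16.1])
slope, intercept = np.polyfit(t, traffic, 1)       # linear trend L(t)
forecast = intercept + slope * np.arange(12, 24)   # next 12 weeks
```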
suitable tuning parameter). However, proper assessment of an approach requires ground-truth data, which is, by the nature of anomalies, hard to obtain in the volumes required. Models themselves can be modified to account for anomalous traffic. In §4.1, the model (4) itself has a term to account for sudden spikes in traffic, which was demonstrated empirically to be useful in detecting large shifts in traffic. Low rank models [128] were shown to be highly effective in detecting anomalies as well. Many anomaly detection proposals may be broadly classified as methods to preprocess measurement data via a linear transformation, in order to separate normal traffic from anomalous traffic. This was observed in [124]. Their anomography (a portmanteau of anomaly and tomography) framework is easy to understand and is aimed at providing a framework for discussing these types of techniques. It proceeds as follows: start by assuming the routing matrix is static over the entire duration of the measurements. Given a series of SNMP measurements $Y = AX$, a new inference problem is obtained by multiplying $Y$ with a linear transform $T$ to obtain $\tilde{Y} = A\tilde{X}$, the anomalous link loads. Whether the focus is on spatial or temporal anomalies depends on whether $Y$ is pre- or post-multiplied with $T$: (i) spatial anomography: pre-multiplication, i.e., $\tilde{Y} = TY$, uses the spatial relationships between traffic at particular points in time to find traffic that is unusual with respect to other flows at the same time; and (ii) temporal anomography: post-multiplication, i.e., $\tilde{Y} = YT$, uses the relationships between traffic at different times to determine if traffic is unusual for its point in time. The two have been combined to create spatio-temporal anomaly detection [128], though the full details of this go beyond the scope of this chapter. The above assumes the routing matrix is static over the series of measurements. The models themselves have to be modified to account for possible route changes. Some models are less amenable to modification, requiring a large number of constraints that scale with the number of measurements [124], which makes them undesirable for practitioners. Anomaly detection employing SNMP data took off with a series of papers [63-65, 68], where the low intrinsic spatial dimensionality of traffic matrices was exploited via a PCA-based anomaly detector. In the spatial PCA-based method, the principal components and axes of $Y$ are computed from its columns and ordered from most to least significant component, to obtain a subspace $P = [v_1\ v_2\ \cdots\ v_m]$. The traffic space is then divided into a normal subspace and an anomalous subspace. The traffic time series is then projected on each principal axis, starting from $v_1$ and so forth, and the projection magnitude is compared to a simple hard threshold of three standard deviations from the mean. Once there exists a projection exceeding this threshold, say at some $v_K$, this component and subsequent components are classified as belonging to the anomalous subspace $P_A = [v_K\ v_{K+1}\ \cdots\ v_m]$. The anomalous traffic is identified by projecting the time series onto the anomalous subspace and projecting the traffic back to obtain $\tilde{Y}$. Lakhina et al.'s spatial PCA method fits in the framework since the last step of extracting the anomalous traffic involves the projection $\tilde{Y} = (P_A P_A^T) Y$, so the linear transformation is $T = P_A P_A^T$. However, its shortcomings have been the subject of scrutiny. Implementation of spatial PCA on network traffic is likely to be ineffective due to several drawbacks [94, 124].
Spatial PCA can be contaminated by a large anomaly15, rendering it unable to detect the anomaly. In fact, PCA has been known to flag an entire measurement interval although there is only one anomaly present [124]. Additionally, PCA is very sensitive to the underlying data: adequate measurements are required and there must be a sufficient level of traffic aggregation before
15 Though in fact this is a problem in general for anomaly detection, and has not received the attention it deserves.
underlying trends can be detected by PCA. It is not robust enough in practice, requiring much fine tuning. Finally, there is a high computational cost involved in computing the principal components of a traffic matrix. Other alternatives exist: wavelet transformations [10, 83], Fourier transformations, the autoregressive integrated moving average (ARIMA) model [124], and temporal PCA [124], where PCA is applied to the rows of $Y$ (the temporal dimension) instead. In all these techniques, the baseline traffic flows are assumed to follow the prescribed model. In the Fourier model, baseline traffic is assumed to be composed of low frequencies. High frequencies may potentially indicate the presence of anomalies, since these correspond to sudden changes in the traffic. Thus, the transformation filters out low frequencies and examines the remaining high frequencies to determine if any of these frequencies exceed a predetermined threshold. A similar rationale holds for the wavelet transform model. The ARIMA model [15] is very well known in time series analysis, providing flexibility in the choice of parameters. The model generalises popular models such as a model with built-in Holt-Winters smoothing, the random walk model and exponentially weighted moving average models. It also allows memory and long range dependency [119] to be built into the model via fractional ARIMA, as evidenced and used to great effect in [101]16. After $\tilde{Y}$ is obtained, the anomalous traffic $\tilde{X}$ has to be recovered. The choice of a particular inference algorithm depends on the model. The spatial PCA method uses a greedy algorithm to find the largest anomaly in each time bin [63]. Other methods include the use of $\ell_1$ regularisation, inspired by compressive sensing [21, 36], which was shown, when coupled with the ARIMA model, to outperform other methods, including PCA and wavelet-based anomaly detection [124]. In short, the model of the traffic matrix matters in anomaly detection. It serves as a baseline. However, it also needs to consistently allow for anomalies. One problem with approaches such as PCA is that the models implied by the approach are often left unstated (implicit) and do not allow the anomalies to be separated as part of estimation (thus they can pollute the estimation process). Good techniques, going into the future, need to be able to perform such separation consistently.
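A rough sketch of the spatial PCA detector described above is given below; the threshold rule and structure are simplified from [63, 64], and this is not the authors' reference implementation.

```python
import numpy as np

def spatial_pca_anomalies(Y, k_sigma=3.0):
    """Split link-load series Y (links x time) into normal and anomalous
    parts via the PCA subspace method (a simplified sketch)."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Yc, full_matrices=False)  # principal axes
    K = U.shape[1]
    for i in range(U.shape[1]):
        proj = U[:, i] @ Yc                 # projection on the i-th axis
        if np.any(np.abs(proj - proj.mean()) > k_sigma * proj.std()):
            K = i                           # first "anomalous" axis
            break
    P_A = U[:, K:]                          # anomalous subspace
    return P_A @ (P_A.T @ Yc)               # anomalous link traffic
```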
(iv) there are a number of conflicting goals in synthesis, e.g., to generate variability, but remain well matched to real traffic matrices. The major use of synthesis is in simulation. Artificial traffic matrices may be used in capacity planning to stress-test network topologies, to see if they stand up to heavy loads without ending up congested. Monte Carlo-type simulations may be used to produce estimates of the behaviour of networks. Synthesised traffic matrices provide an avenue to explore the limitations of a protocol in a controlled environment before running it on an actual network. A good model simplifies these simulations, as parameters of the model can be tuned to generate a variety of scenarios for testing protocols. In many cases a traffic matrix is enough for a simulation, but in others we need to translate this into packets (or at least connections). The analogue in transportation modelling is often called a micro-simulation model [8, 58]. Here, the problem becomes one of taking a demand matrix (remember, most of the work here is related to traffic, not demand), and translating this into carried load. We know how to do that (using simulation tools such as ns), but doing it efficiently is difficult. One paper [105] starts to tackle this problem, but as in the work on transportation modelling, there is considerable scope for advanced scalable micro-simulation of traffic. Another use for synthetic traffic matrices is in the further task of synthetic topology generation, but we shall leave discussion of this topic to Chapter 7 of this book. Unfortunately, there is a dearth of work on traffic matrix synthesis, apart from [81, 96] and a brief mention regarding synthesis of matrices from the independent connections model [43, 44]. The problem of synthesising traffic matrices is the inverse of the inference problem. In synthesis, the topology of the network matters, as the generated entries of the artificial traffic matrix must not exceed the capacity of the links they are mapped to. Some models, such as the gravity model, satisfy bandwidth constraints naturally [96]. However, if the generated entries do not conform to these constraints, there are algorithms to solve these problems [81], albeit with added computational complexity. Computational complexity of the model is the other important issue, dependent on the number of parameters. Hence, there is an inherent tradeoff between the descriptive power of the model and the ease of synthesising traffic matrices. A guideline is to prefer the model with as few parameters as possible but enough proven descriptive power (focussed on the traffic aspect the practitioner intends to capture), measured via an information criterion such as the AIC [6]. We here describe the simple approach of [96], motivated by works such as [50], in order to provide a starting point for future work on such synthesis. We start by taking $x^{in}$ and $x^{out}$ to be vectors of $N$ i.i.d. exponential random variables with mean one. The traffic matrix is then generated using (7). We can then adjust the total traffic to match the desired total by simple scaling. This method is extremely simple (an exponential distribution has only one parameter to estimate), and we need generate only $2N$ random variables. Yet it matches observed statistics for both Abilene and GEANT data extremely well [7]. Synthesis of traffic matrices is an open area, and is important to explore further, given the benefits of using traffic matrices in traffic planning and engineering.
There is considerable hope that progress can be made in terms of generating synthetic matrices, both for greenfields network design [61], and for simulation in general.
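A sketch of the [96] procedure follows, assuming (as described earlier in the chapter) that equation (7) is the gravity model $X_{i,j} \propto x_i^{in} x_j^{out}$; the function name is our own.

```python
import numpy as np

def synthesise_tm(N, total_traffic, rng=None):
    """Synthesise an N x N traffic matrix following [96]: i.i.d. exponential
    masses combined by the gravity model, scaled to the desired total."""
    rng = rng or np.random.default_rng()
    x_in = rng.exponential(1.0, N)     # mean-one exponential variables
    x_out = rng.exponential(1.0, N)
    X = np.outer(x_in, x_out)          # gravity model structure
    return X * (total_traffic / X.sum())
```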
Future Directions
There are some interesting tasks left for traffic matrix research. The various algorithms and techniques described here could be improved, though in many cases the improvements may be relatively incremental
given the success of existing approaches. More interest may be found in extending the ideas and techniques used here to new domains, and to evolving Internet traffic. There are a few obvious cases (and no doubt many less obvious cases that we have not thought of). For instance: multicast traffic has not, to our knowledge, been studied in this way. Multicast is interesting because it violates the traffic conservation assumption that lies underneath many techniques for estimation and modelling of traffic matrices. We could imagine modelling it by considering the flow to be the traffic on a multicast group, from say, one source, to a set of destinations, and then stacking a vector with these. The routing matrices now include elements for every link used (no longer following a single path). The traffic matrix could then be a column vector of the traffic on each of these flows. So the idea of multicast traffic can fit into the structure we have talked about here, but appropriate models for performing tasks such as inference do not seem to exist. It would also be very interesting to understand the way that CDNs are affecting network traffic. A step in this direction, although not directly on traffic, but more on the discovery of the content hosts, is [5]. A CDN's typical goal is to bring content closer to the user, thereby reducing network traffic. However, that explicitly violates the friction-free assumption in most gravity models, and introduces distance as something to be modelled. That leads naturally to the consideration of global traffic. Almost all studies of traffic have concentrated on a single network no larger than the national scale. That may still be very large: for example, several studies looked at Tier 1 providers in the USA, which for some time dominated Internet traffic. However, although large, it was still relatively homogeneous traffic between people speaking much the same language(s) from place to place in the network. When we consider the Internet globally, we may see that there are language or cultural clusters where large groupings of traffic are focussed. On a large scale, time zones also play a significant role. Traffic patterns show strong cyclic behaviour based on user activity, but such activity is strongly dependent on the local time zone. If traffic is flowing from user to user, then this can result in strong apparent locality effects, simply because people in the same time zone are more likely to be awake at the same time [53]. While language and cultural focussing may be geographic in nature, it might also be considered per network, and that leads to another topic of some interest. Very few papers have tried to consider inter-AS (also known as inter-domain) traffic in any detail. Exceptions are Chang et al. [28] (which presents suggestions for estimating traffic based on models derived from business models and resulting usage); Bharti et al. [11] (which considered inference of hidden elements of this matrix using a subset of data); Feldmann et al. [48] (which aimed to estimate a global traffic matrix, but only in the limited domain of WWW traffic); and Labovitz et al. [62] (which looked at inter-domain traffic from 110 network operators over a two-year period, though not in the form of a matrix). Study of the Internet's global traffic matrix is made difficult by the sheer scale of the project: Labovitz et al. studied 110 network operators over a two-year period17 and, to do so, collected over 200 Exabytes of data.
Many network operators do not collect or store data of the type required for such a study, and many more regard it as proprietary or covered by privacy legislation, with provisions such that no researcher is ever likely to see it. So we can see that study of the inter-domain matrix is likely to be a long-term, and rather challenging, project. In addition, we know that the traffic profile (or mix of applications) has changed fairly rapidly over time. It is likely this trend will continue, and there are bound to be effects on traffic patterns as a result. Peer-2-peer traffic significantly altered traffic patterns when it appeared, because it was more symmetric than traditional (at the time) WWW traffic. In addition, however, peer-2-peer applications have the potential to exploit locality information to download from sources closer to the destination. This could potentially change the no-friction assumption in much the same way that CDNs can, though in the early days it did not appear to
17 To put this in context, there are tens of thousands of ASes in the Internet.
be the case [53]. An example traffic matrix (drawn from [53]) showing normalised18 traffic between regions in a cable-network operator is given in Table 2. The major deviation, in this data, from a pure gravity model seemed to be time-zone differences.

From/To    R1      R2      R3      R4      R5      R6      R7      R8
R1         -       0.172   0.132   0.107   0.161   0.107   0.107   0.109
R2         0.180   -       0.120   0.111   0.180   0.108   0.106   0.111
R3         0.140   0.141   -       0.182   0.136   0.145   0.137   0.127
R4         0.126   0.126   0.189   -       0.132   0.155   0.157   0.161
R5         0.174   0.190   0.135   0.124   -       0.125   0.127   0.128
R6         0.128   0.132   0.145   0.163   0.135   -       0.182   0.178
R7         0.124   0.118   0.139   0.155   0.127   0.187   -       0.185
R8         0.127   0.120   0.140   0.158   0.129   0.173   0.184   -
Table 2: Normalised inter-regional traffic matrix from [53].

New traffic classes may change traffic matrices in the future, and modelling these will be interesting. On the other side, applications such as anomaly detection are likely to remain interesting due to their immediate benefits to operators, but the most overlooked task is traffic matrix synthesis. As mentioned in the previous section, the lack of real-world traffic matrix datasets motivates the use of artificial traffic matrices to provide some degree of approximation in network traffic planning, provisioning and engineering tasks. It is an important area to concentrate on, considering its usefulness to network operators.
Conclusion
This chapter has aimed to introduce the reader to the state-of-the-art in Internet traffic matrix measurement, modelling and applications. It is not a complete survey of all research about Internet traffic matrices, as such a survey would necessarily consume a much larger amount of space and be less digestible, and we apologise to those whose work has not been referenced. The chapter has also aimed to clarify a set of common terminology in a field which has occasionally been confounded by ambiguous or confusing terms. We have discussed how difficult a task it is to measure and model traffic matrices, due to the intricate complexities of the layering of network protocols. Moreover, these complexities also contribute to challenges in determining the focus a model should take. Models presented here were classified into three categories, depending on the traffic properties they aim to capture: purely temporal, purely spatial and spatio-temporal models. Several examples of popular models of varying complexity were discussed: Fourier and wavelet models, PCA, the modified Norros model, uniform and Gaussian priors, the gravity model and its variants, the independent connections model, the discrete choice model and low rank models. There are many more models, as the list presented here is not meant to be exhaustive. All models have their strengths and weaknesses, and they may only be used in circumstances where the assumptions from which they derive hold. Furthermore, practitioners should always keep in mind the particular application the model is being used for. Several applications of traffic matrices and their models were discussed, including their recovery, network optimisation and engineering, anomaly detection and synthesis. These tasks emphasise the practical usefulness of traffic matrices to a network operator.
18 The elements have been normalised by dividing each row by the row-sum, so that each element represents the probability that a packet entering the network at region i will depart the network at region j.
There are several more open questions, considering the shift in current traffic trends. Multicast traffic has not been studied in depth. Content delivery networks have a profound impact on traffic behaviour but, to our knowledge, have not garnered much attention with regard to studying their traffic behaviour. Other areas to understand further are the global properties of traffic matrices, application profiles of the traffic, and traffic matrix synthesis. Understanding and modelling traffic matrices is a difficult problem, exacerbated by the lack of publicly available datasets. It is our aim to also provide, as an adjunct to this chapter, links to the most commonly used datasets in this domain, and code to perform some of the commonest tasks. In this way, we hope to provide a firm foundation for future work in the area, and to help those who just want to use traffic matrices in their research.
References
[1] Cisco NetFlow. http://www.cisco.com/go/netflow.
[2] Cisco visual networking index: Forecast and methodology, 2011-2016. http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-481360_ns827_Networking_Solutions_White_Paper.html.
[3] Internet Activity, Australia. http://www.abs.gov.au/AUSSTATS/[email protected]/DOSSbyTopic/0444532C5EBD3B76CA256BD0002833B6, 2013.
[4] Acar, E., Kolda, T. G., Dunlavy, D. M., and Mørup, M. Scalable tensor factorizations for incomplete data. In SIAM International Conference on Data Mining (SDM) 2010 (April 2010), pp. 701-712.
[5] Ager, B., Mühlbauer, W., Smaragdakis, G., and Uhlig, S. Web content cartography. In ACM SIGCOMM Internet Measurement Conference (IMC) (November 2011), pp. 585-600.
[6] Akaike, H. A new look at statistical model identification. IEEE Trans. Autom. Control 19, 6 (December 1974), 716-723.
[7] Alderson, D. L., Chang, H., Roughan, M., Uhlig, S., and Willinger, W. The many facets of Internet topology and traffic. Networks and Heterogeneous Media 1, 4 (December 2006), 569-600.
[8] Algers, S., Bernauer, E., Boero, M., Breheret, L., Di Taranto, C., Dougherty, M., Fox, K., and Gabard, J.-F. Review of micro-simulation models. Tech. rep., Simulation Modelling Applied to Road Transport European Scheme Tests (SMARTEST), Leeds University, 1997. http://www.its.leeds.ac.uk/smartest.
[9] Baran, P. On distributed communications: 1. Introduction to distributed communications network. RAND Memorandum, August 1964.
[10] Barford, P., Kline, J., Plonka, D., and Ron, A. A signal analysis of network traffic anomalies. In Proceedings of the 2nd ACM SIGCOMM Workshop on Internet Measurement (2002), pp. 71-82.
[11] Bharti, V., Kankar, P., Setia, L., Gursun, G., Lakhina, A., and Crovella, M. Inferring invisible traffic. In Proceedings of the 6th International Conference (2010), Co-NEXT '10, pp. 22:1-22:12.
[12] Blili, R., and Maghbouleh, A. Best practices for determining traffic matrices in IP networks V 4.0. Tutorial in NANOG 43, http://www.nanog.org/meetings/nanog43/presentations/Blili_trafficmatrix_N43.pdf, 2008.
[13] Boyd, S., and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
[14] Brauckhoff, D., Dimitropoulos, X., Wagner, A., and Salamatian, K. Anomaly extraction in backbone networks using association rules. In ACM SIGCOMM Internet Measurement Conference (IMC) (2009), pp. 28-34.
[15] Brockwell, P. J., and Davis, R. A. Introduction to Time Series and Forecasting, 2nd ed. Springer, March 2002.
[16] Buriol, L. S., Resende, M. G. C., Ribeiro, C., and Thorup, M. A hybrid genetic algorithm for the weight setting problem in OSPF/IS-IS routing. Optimization Online (2003).
[17] Caesar, M., and Rexford, J. BGP routing policies in ISP networks. IEEE Network 19, 6 (2005), 5-11.
[18] Cahn, R. S. Wide Area Network Design. Morgan Kaufmann, 1998.
[19] Candes, E., and Plan, Y. Matrix completion with noise. Proc. IEEE 98, 6 (June 2010), 925-936.
[20] Candes, E., and Recht, B. Exact matrix completion via convex optimization. Found. of Comput. Math. 9 (2008), 717-772.
[21] Candes, E., and Tao, T. Near optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Info. Theory 52, 12 (December 2006), 5406-5425.
[22] Candes, E., and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Info. Theory 56, 5 (May 2010), 2053-2080.
[23] Cao, J., Cleveland, W. S., Lin, D., and Sun, D. X. On the nonstationarity of Internet traffic. In ACM SIGMETRICS 2001 (New York, NY, USA, 2001), pp. 102-112.
[24] Cao, J., Davis, D., Wiel, S. V., and Yu, B. Time-varying network tomography: Router link data. J. Am. Statist. Assoc. 95, 452 (December 2000), 1063-1075.
[25] Case, J. D., Fedor, M., Schoffstall, M. L., and Davin, J. R. A simple network management protocol (SNMP). Tech. Rep. RFC 1157, IETF, May 1990. http://www.ietf.org/rfc/rfc1157.txt.
[26] Castro, R., Coates, M., Liang, G., Nowak, R., and Yu, B. Network tomography: Recent developments. Statistical Science Magazine 19, 3 (August 2004), 499-517.
[27] CERT/CC. CERT Advisory CA-2001-26 Nimda Worm: An Overview. http://www.cert.org/advisories/CA-2001-26.html, September 2001.
[28] Chang, H., Jamin, S., Mao, Z., and Willinger, W. An empirical approach to modeling inter-AS traffic matrices. In ACM/SIGCOMM Internet Measurement Conference (IMC) 2005 (October 2005), pp. 139-152.
[29] Claffy, K., Braun, H.-W., and Polyzos, G. Tracking long-term growth of the NSFNET. Cooperative Association for Internet Data Analysis - CAIDA, http://www.caida.org/publications/papers/1994/tlg/, 1994.
[30] Coates, M., Hero, A., Nowak, R., and Yu, B. Internet tomography. Signal Processing Magazine 19, 3 (May 2002), 47-65.
[31] Coates, M., Pointurier, Y., and Rabbat, M. Compressed network monitoring for IP and all-optical networks. In ACM SIGCOMM Internet Measurement Conference (IMC) (2007), pp. 241-252.
[32] Converse, P. D. New laws of retail gravitation. The Journal of Marketing 14, 3 (October 1949), 379-384.
[33] Cover, T. M., and Thomas, J. A. Elements of Information Theory, 2nd ed. John Wiley and Sons, Inc., 2006.
[34] Crovella, M., and Kolaczyk, E. Graph wavelets for spatial traffic analysis. In Proceedings of IEEE Infocom (April 2003), pp. 1848-1857.
[35] Deming, W. E., and Stephan, F. F. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann. Math. Stat. 11, 4 (1940), 427-444.
[36] Donoho, D. Compressed sensing. IEEE Trans. Info. Theory 52, 4 (April 2006), 1289-1306.
[37] Duffield, N., Lund, C., and Thorup, M. Estimating flow distributions from sampled flow statistics. IEEE/ACM Trans. Networking 13, 5 (2005), 933-946.
[38] Duffield, N., Lund, C., and Thorup, M. Learn more, sample less: control of volume and variance in network measurement. IEEE Trans. Info. Theory 51, 5 (2005), 1756-1775.
[39] Duffield, N. G., and Grossglauser, M. Trajectory sampling for direct observation. IEEE/ACM Trans. Networking 9, 3 (June 2001), 280-292.
[40] Duffield, N. G., and Grossglauser, M. Trajectory sampling with unreliable reporting. In INFOCOM 2004 (2004), pp. 1570-1581.
[41] Eriksson, B., Barford, P., Bowden, R., Roughan, M., Duffield, N., and Sommers, J. BasisDetect: A model-based network event detection framework. In ACM SIGCOMM Internet Measurement Conference (Melbourne, Australia, 2010).
[42] Erlander, S., and Stewart, N. F. The gravity model in transportation analysis: Theory and extensions. Topics in Transportation. International Science, 1990.
[43] Erramilli, V., Crovella, M., and Taft, N. An independent-connection model for traffic matrices. In ACM SIGCOMM Internet Measurement Conference (IMC) (October 2006), pp. 251-256.
[44] Erramilli, V., Crovella, M., and Taft, N. An independent-connection model for traffic matrices. Tech. Rep. BUCS-TR-2006-022, Department of Computer Science, Boston University, September 2006.
[45] Feamster, N., Borkenhagen, J., and Rexford, J. Guidelines for interdomain traffic engineering. ACM SIGCOMM Computer Communications Review 33, 5 (October 2003), 19-30.
[46] Feldmann, A., Greenberg, A., Lund, C., Reingold, N., and Rexford, J. Netscope: Traffic engineering for IP networks. IEEE Net. Magazine 14 (March 2000), 11-19.
[47] Feldmann, A., Greenberg, A., Lund, C., Reingold, N., Rexford, J., and True, F. Deriving traffic demands for operational IP networks: Methodology and experience. IEEE/ACM Trans. Netw. 9 (June 2001), 265-280.
[48] Feldmann, A., Kammenhuber, N., Maennel, O., Maggs, B., De Prisco, R., and Sundaram, R. A methodology for estimating interdomain web traffic demand. In ACM SIGCOMM Internet Measurement Conference (IMC) (October 2004), pp. 322-335.
[49] Ferrari, M. J., Bjornstad, O. N., Partain, J. L., and Antonovics, J. A gravity model for the spread of a pollinator-borne plant pathogen. American Naturalist 168, 3 (September 2006), 294-303.
[50] Fortz, B., Rexford, J., and Thorup, M. Traffic engineering with traditional IP routing protocols. IEEE Communications Magazine 40, 10 (October 2002), 118-124.
[51] Fortz, B., and Thorup, M. Optimizing OSPF/IS-IS weights in a changing world. IEEE Journal on Selected Areas in Communications 20, 4 (2002), 756-767.
[52] Fortz, B., and Thorup, M. Robust optimization of OSPF/IS-IS weights. In Proc. INOC 2003 (October 2003), pp. 225-230.
[53] Gerber, A., Houle, J., Nguyen, H., Roughan, M., and Sen, S. P2P: The Gorilla in the Cable. In National Cable & Telecommunications Association (NCTA) 2003 National Show (Chicago, IL, June 2003).
[54] Goldschmidt, O. ISP backbone inference methods to support traffic engineering: Methodology and experience. In Internet Statistics and Metrics Analysis (ISMA) Workshop (December 2000).
[55] Groschwitz, N. K., and Polyzos, G. C. A time series model of long-term NSFNET backbone traffic. Cooperative Association for Internet Data Analysis - CAIDA, http://www.caida.org/publications/papers/1994/tsm/, 1994.
[56] Gu, Y., McCallum, A., and Towsley, D. Detecting anomalies in network traffic using maximum entropy estimation. In ACM SIGCOMM Internet Measurement Conference (IMC) (2005), pp. 32-32.
[57] Gunnar, A., Johansson, M., and Telkamp, T. Traffic matrix estimation: A comparison on real data. In ACM SIGCOMM Internet Measurement Conference (IMC) (October 2004), pp. 149-160.
[58] Hoogendoorn, S. P., and Bovy, P. H. L. State-of-the-art of vehicular traffic flow modelling. In Proceedings of the Institution of Mechanical Engineers, Part I: Journal of Systems and Control Engineering (2001), pp. 283-303.
[59] Jung, A. F. Is Reilly's law of retail gravitation always true? The Journal of Marketing 24, 2 (October 1959), 62-63.
[60] Kolda, T. G., and Bader, B. W. Tensor decompositions and applications. SIAM Review 51, 3 (September 2009), 455-500.
[61] Kowalski, J., and Warfield, B. Modeling traffic demand between nodes in a telecommunications network. In ATNAC '95 (1995).
[62] Labovitz, C., Iekel-Johnson, S., McPherson, D., Oberheide, J., and Jahanian, F. Internet inter-domain traffic. In ACM SIGCOMM (2010), pp. 75-86.
[63] Lakhina, A., Crovella, M., and Diot, C. Characterization of network-wide anomalies in traffic flows. In ACM SIGCOMM Internet Measurement Conference (IMC) (2004), pp. 201-206.
[64] Lakhina, A., Crovella, M., and Diot, C. Diagnosing network-wide traffic anomalies. In ACM SIGCOMM 2004 (2004), pp. 219-230.
[65] Lakhina, A., Crovella, M., and Diot, C. Mining anomalies using traffic feature distributions. In ACM SIGCOMM 2005 (2005), pp. 217-228.
[66] Lakhina, A., Papagiannaki, K., Crovella, M., Diot, C., Kolaczyk, E. D., and Taft, N. Structural analysis of network traffic flows. SIGMETRICS Perform. Eval. Rev. 32, 1 (June 2004), 61-72.
[67] Lam, D., Cox, D., and Widom, J. Teletraffic modeling for personal communications services. IEEE Communications Magazine: Special Issue on Teletraffic Modeling Engineering and Management in Wireless and Broadband Networks 35 (February 1997), 79-87.
[68] Li, X., Bian, F., Crovella, M., Diot, C., Govindan, R., Iannaccone, G., and Lakhina, A. Detection and identification of network anomalies using sketch subspaces. In ACM SIGCOMM Internet Measurement Conference (IMC) (2006), pp. 147-152.
[69] Mallat, S. A Wavelet Tour of Signal Processing. Academic Press, 1998.
[70] McCullagh, P., and Nelder, J. A. Generalized Linear Models, 2nd ed. Monographs on Statistics and Applied Probability. Chapman and Hall, 1989.
[71] McFadden, D. Modeling the choice of residential location. In Spatial Interaction Theory and Planning Models, A. Karlqvist et al., Ed. North Holland, 1978, pp. 75-96.
[72] Medina, A., Fraleigh, C., Taft, N., Bhattacharyya, S., and Diot, C. A taxonomy of IP traffic matrices. Proc. SPIE 4868 (July 2002), 200-213.
[73] Medina, A., Taft, N., Salamatian, K., Bhattacharyya, S., and Diot, C. Traffic matrix estimation: Existing techniques and new directions. In ACM SIGCOMM 2002 (2002), pp. 161-174.
[74] Mitra, D., and Wang, Q. Stochastic traffic engineering for demand uncertainty and risk-aware network revenue management. IEEE/ACM Trans. Networking 13, 2 (April 2005), 221-233.
[75] Murphy, J., Harris, R., and Nelson, R. Traffic engineering using OSPF weights and splitting ratios. In Proceedings of the 6th International Symposium on Communications Interworking of IFIP Interworking 2002 (2002), pp. 13-16.
[76] Murray, G. D., and Cliff, A. D. A stochastic model for measles epidemics in a multi-regional setting. Trans. Institute British Geographers 2 (1977), 158-174.
[77] NLANR. Abilene Trace Data. http://pma.nlanr.net/Special/ipls3.html.
[78] Norros, I. A storage model with self-similar input. Queueing Systems 16, 3-4 (1994), 387-396.
[79] Nucci, A., Bhattacharyya, S., Taft, N., and Diot, C. IGP link weight assignment for operational Tier-1 backbones. IEEE/ACM Trans. Networking 15, 4 (August 2007), 789-802.
[80] Nucci, A., Cruz, R., Taft, N., and Diot, C. Design of IGP link weight changes for estimation of traffic matrices. In INFOCOM 2004 (March 2004), vol. 4, pp. 2341-2351.
[81] Nucci, A., Sridharan, A., and Taft, N. The problem of synthetically generating IP traffic matrices: Initial recommendations. SIGCOMM Comput. Commun. Rev. 35 (July 2005), 19-32.
[82] Odlyzko, A. M. Internet traffic growth: Sources and implications. In Optical Transmission Systems and Equipment for WDM Networking II, B. B. Dingel, W. Weiershausen, A. K. Dutta, and K.-I. Sato, Eds., vol. 5247. Proc. SPIE, 2003, pp. 1-15.
[83] Papagiannaki, K., Taft, N., Zhang, Z.-L., and Diot, C. Long-term forecasting of Internet backbone traffic. IEEE Trans. Neural Netw. 16, 5 (September 2005), 1110-1124.
[84] Patrikakis, C., Masikos, M., and Zouraraki, O. Distributed denial of service attacks. The Internet Protocol Journal 7, 4 (December 2004), 13-35.
[85] Paxson, V. End-to-end routing behavior in the Internet. IEEE/ACM Trans. Networking 5, 5 (October 1997), 601-615.
[86] Potts, R. B., and Oliver, R. M. Flows in Transportation Networks. Academic Press, 1972.
[87] Pöyhönen, P. A tentative model for the volume of trade between countries. Weltwirtschaftliches Archiv 90 (1963), 93-100.
[88] Quoitin, B., and Uhlig, S. Modeling the routing of an autonomous system with C-BGP. IEEE Network 19, 6 (2005), 12-19.
[89] Recht, B., Fazel, M., and Parrilo, P. A. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review 52, 3 (2010), 471-501.
[90] Rekhter, Y., and Li, T. A Border Gateway Protocol 4 (BGP-4). RFC 1771 (1995). http://www.ietf.org/rfc/rfc1771.txt.
[91] Rexford, J. Route optimization in IP networks. In Handbook of Optimization in Telecommunications, Springer Science and Business (2006), Kluwer Academic Publishers.
[92] Reynolds, R. B. A test of the law of retail gravitation. The Journal of Marketing 17, 3 (January 1953), 273-277.
[93] Rincón, D., Roughan, M., and Willinger, W. Towards a meaningful MRA of traffic matrices. In ACM SIGCOMM Internet Measurement Conference (IMC '08) (2008), pp. 331-336.
[94] Ringberg, H., Soule, A., Rexford, J., and Diot, C. Sensitivity of PCA for traffic anomaly detection. In Proceedings of the 2007 ACM SIGMETRICS (2007), pp. 109-120.
[95] Rissanen, J. A universal prior for integers and estimation by minimum description length. Ann. Statis. 11, 2 (June 1983), 416-431.
[96] Roughan, M. Simplifying the synthesis of Internet traffic matrices. SIGCOMM Comput. Commun. Rev. 35, 5 (2005), 93-96.
[97] Roughan, M. A case-study of the accuracy of SNMP measurements. Journal of Electrical and Computer Engineering 2010 (2010). Article ID 812979.
[98] Roughan, M. Robust network planning. In The Guide to Reliable Internet Services and Applications, C. R. Kalmanek, S. Misra, and R. Yang, Eds. Springer, 2010, ch. 5, pp. 137-177.
[99] Roughan, M., Greenberg, A., Kalmanek, C., Rumsewicz, M., Yates, J., and Zhang, Y. Experience in measuring backbone traffic variability: Models, metrics, measurements and meaning. In ACM SIGCOMM Internet Measurement Workshop (2002), pp. 91-92.
[100] Roughan, M., Thorup, M., and Zhang, Y. Traffic engineering with estimated traffic matrices. In ACM SIGCOMM Internet Measurement Conference (IMC) (October 2003), pp. 248-258.
[101] Scherrer, A., Larrieu, N., Owezarski, P., Borgnat, P., and Abry, P. Non-Gaussian and long memory statistical characterizations for Internet traffic with anomalies. IEEE Trans. Depend. Secure Computing 4, 1 (January-March 2007), 56-70.
[102] Schwarz, G. Estimating the dimension of a model. Ann. Statist. 6, 2 (1978), 461-464.
[103] Sen, A. K., and Smith, T. E. Gravity models of spatial interaction behavior. Springer, 1995.
[104] Shaikh, A., Isett, C., Greenberg, A., Roughan, M., and Gottlieb, J. A case study of OSPF behavior in a large enterprise network. In Proceedings of the 2nd ACM SIGCOMM Workshop on Internet Measurement (2002), IMW '02, pp. 217-230.
[105] Sommers, J., Bowden, R. A., Eriksson, B., Barford, P., Roughan, M., and Duffield, N. G. Efficient network-wide flow record generation. In IEEE Infocom (2011), pp. 2363-2371.
[106] Soule, A., Nucci, A., Cruz, R., Leonardi, E., and Taft, N. How to identify and estimate the largest traffic matrix elements in a dynamic environment. SIGMETRICS Perform. Eval. Rev. 32, 1 (June 2004), 73-84.
[107] Soule, A., Nucci, A., Cruz, R. L., Leonardi, E., and Taft, N. Estimating dynamic traffic matrices by using viable routing changes. IEEE/ACM Transactions on Networking 15, 3 (2007), 485-498.
[108] Stewart, G. W. Matrix Algorithms Volume 1: Basic Decompositions. SIAM, 1998.
[109] Stewart, J. Q. Demographic gravitation: Evidence and applications. Sociometry 11, 1/2 (February 1948), 31-58.
[110] Swait, J. D. Probabilistic choice set information in transportation demand. Tech. rep., Department of Civil and Environmental Engineering, MIT, June 1984.
[111] Tebaldi, C., and West, M. Bayesian inference on network traffic using link count data. J. Am. Statist. Assoc. 93, 442 (June 1998).
[112] Teixeira, R., Duffield, N., Rexford, J., and Roughan, M. Traffic matrix reloaded: Impact of routing changes. In Proc. Passive and Active Measurement Workshop (April 2005), pp. 251-264.
[113] Teixeira, R., Shaikh, A., Griffin, T., and Rexford, J. Dynamics of hot-potato routing in IP networks. SIGMETRICS Perform. Eval. Rev. 32 (June 2004), 307-319.
[114] Tinbergen, J. Shaping the world economy: Suggestions for an international economic policy. The Twentieth Century Fund (1962).
[115] Uhlig, S., Bonaventure, O., Magnin, V., Rapier, C., and Deri, L. Implications of the topological properties of Internet traffic on traffic engineering. In 19th ACM Symposium on Applied Computing, Special Track on Computer Networks (March 2004), pp. 339-346.
[116] Uhlig, S., Quoitin, B., Lepropre, J., and Balon, S. Providing public intradomain traffic matrices to the research community. Computer Communication Review 36, 1 (January 2006), 83-86.
[117] Vardi, Y. Network tomography: Estimating source-destination traffic intensities from link data. J. Am. Statist. Assoc. 91 (1996), 365-377.
[118] Varghese, G., and Estan, C. The measurement manifesto. SIGCOMM Comput. Commun. Rev. 34, 1 (January 2004), 9-14.
[119] Veitch, D., and Abry, P. Wavelet analysis of long-range dependent network traffic. In Proceedings of the 9th INFORMS Applied Probability Conference (Cambridge, Massachusetts, 30 June - 1 July 1997), INFORMS Technical Section on Applied Probability, p. 57. http://appliedprob.society.informs.org/.
[120] Wallace, C. S., and Boulton, D. M. An information measure for classification. Computer Journal 11, 2 (August 1968), 185-194.
[121] Wang, J., and Rabbat, M. Wavelet-based traffic matrix modelling. In Network Measurement and Mapping Conference (2010).
[122] Xia, Y. C., Bjornstad, O. N., and Grenfell, B. T. Measles metapopulation dynamics: A gravity model for epidemiological coupling and dynamics. American Naturalist 164, 3 (2004), 267-281.
[123] Zhang, Y., Ge, Z., Diggavi, S., Mao, Z., Roughan, M., Vaishampayan, V., and Willinger, W. Internet traffic and multiresolution analysis. In Markov Processes and Related Topics: A Festschrift for Thomas G. Kurtz, vol. 4 of IMS Collections. Institute for Mathematical Statistics, 2009, pp. 215-234. Stewart N. Ethier, Jin Feng, Richard H. Stockbridge, Eds.
[124] Zhang, Y., Ge, Z., Greenberg, A., and Roughan, M. Network anomography. In ACM SIGCOMM Internet Measurement Conference (IMC) (2005), pp. 317-330.
[125] Zhang, Y., Roughan, M., Duffield, N., and Greenberg, A. Fast accurate computation of large-scale IP traffic matrices from link loads. In ACM SIGMETRICS 2003 (2003), pp. 206-217.
[126] Zhang, Y., Roughan, M., Lund, C., and Donoho, D. An information-theoretic approach to traffic matrix estimation. In ACM SIGCOMM 2003 (2003), pp. 301-312.
[127] Zhang, Y., Roughan, M., Lund, C., and Donoho, D. Estimating point-to-point and point-to-multipoint traffic matrices: An information-theoretic approach. IEEE/ACM Trans. Netw. 13, 5 (October 2005), 947-960.
[128] Zhang, Y., Roughan, M., Willinger, W., and Qiu, L. Spatio-temporal compressive sensing and Internet traffic matrices. In ACM SIGCOMM 2009 (August 2009), pp. 267-278.
Introduction
The Internet has grown very large. No one knows exactly how large, but rough estimates indicate billions of users (around 1.8B in 2009, according to eTForecasts [4]), hundreds of millions of web sites (over 200M in February 2009, according to Netcraft [19]), and hundreds of billions of web pages (over 240B, according to the Internet Archive [1]). The Internet is also very dynamic: users log in and out, new services get added, routing policies change, normal traffic gets mixed with denial-of-service (DoS) attack traffic, etc. An important question is: how do we manage such a huge and highly dynamic structure like the Internet? As a corollary, how can we build a network of the future unless we understand the steady state and dynamics of what we build? In this chapter, we resort to two mathematical frameworks: optimization theory to study optimal steady states of networks, and control theory to study the dynamic behavior of networks as they evolve toward steady state. Our emphasis will be on congestion control, using the notion of prices to model the level of congestion, such as delays and losses, observed by users or traffic sources. Expected Background: We assume basic background in calculus and algebra. We also assume basic knowledge of systems modeling, optimization theory, Laplace transforms, and control theory; Keshav's textbook [13] provides an excellent source for these mathematical foundations, in particular chapters 4, 5, and 8. Basic knowledge of the Internet's Transmission Control Protocol (TCP) [11], namely the Reno and Vegas [15] versions, as well as queue management schemes, namely Random Early Drop (RED) [6], should be helpful. This chapter briefly covers needed background material to serve as a refresher or quick reference. The material of this chapter has been used at Boston University in a second (advanced) networking course taken by senior undergraduate and graduate students. Contribution and Outline: The purpose of this chapter is to make the application of optimization and control theory to congestion control more accessible through intuitive explanations and simple control applications, using examples from the Internet's protocols. This chapter has been largely influenced by the work of Frank Kelly [12], which introduces the notion of prices and user utility, the work by R. Srikant [24], which discusses the dynamics of user (traffic source) and network adaptations, and control theory texts and notes (e.g., [20, 16]). The exposition here attempts to tie these various mathematical models and techniques together through simple running examples and illustrations, modeling the dynamics of both flow control and routing. We start by motivating the network control problem using an analogy to the problem of producing, pricing, and consuming gas/oil (Section 2). We introduce several examples of optimally allocating resources (link bandwidth) to users (traffic sources), resulting in different notions of fairness. We then introduce dynamic equations that model source and link adaptation algorithms (Section 3). Since these are generally non-linear equations, we review the technique of linearization and how classical (linear) control theory can be used to
I. Matta, Optimizing and Modeling Dynamics in Networks, in H. Haddadi, O. Bonaventure (Eds.), Recent Advances in Networking, (2013), pp. 163-220. Licensed under a CC-BY-SA Creative Commons license.
study stability and transient performance (Section 4). We use as a running example a Vegas-like system controlled using different types of well-established controllers. Using linear approximation around a certain operating point, we can only study so-called local stability. To study general (global) stability of non-linear models, we introduce the control-theoretic Lyapunov method (Section 5). We also show how the control-theoretic Nyquist stability method can be applied to the linearized model to study the impact of delay in feedback (i.e., measurements of the current state of the system). The material on the Nyquist method is a bit more advanced and can be skipped on a first reading. We generalize the notion of stability by introducing the concept of contractive mapping, and extend its application to routing dynamics (Section 6). Finally, we provide two case studies that apply the control-theoretic techniques introduced in this chapter: the first study investigates stability under class-based scheduling of rate-adaptive traffic flows (Section 7), and the second study investigates stability of data transfer over a dynamic number of rate-adaptive transport connections (Section 8). These case studies can be skipped on a first reading. The chapter concludes with a set of exercises (Section 9) and their solutions (Section 10).
In this section, we describe Frank Kelly's optimization framework [12], which models users' expectations (requirements) with utility functions, and network congestion signals (e.g., loss, delay) as prices. The network is shown to allocate transmission rates (throughputs) to users (flows) in such a way as to meet some fairness objective. The objective of a user, or what makes the user happy, can be mathematically modeled as a utility function. For example, drivers observe the price of transportation and make one of many possible decisions: drive, take the subway instead, walk, bike, or stay home. The decision may involve several factors like the price of gas, convenience, travel time, etc. For example, if it rains, you might decide to drive to work, or you might decide to walk to work to save money and can then afford to go to the movies later in the week. Of course, how much driving a person does is affected by all sorts of factors, and user priorities are unknown to the system of gas stations and oil companies. But each driver has her own utility! Figure 1 illustrates with a block diagram the closed-loop relationship between drivers (users), gas stations (where gas is sold to and consumed by users), and the market (which represents OPEC1, the government, and oil companies that collectively produce gas and set market prices based on user demand). Drivers set the total demand by observing gas prices. Notice that the gas price includes the at-the-pump gas price, and possibly other exogenous prices like tips for full service, fees for credit card payment, or additional local taxes. Observe also that prices observed by users are delayed and do not typically represent the exact current state of the market, given inherent delays in gas production, refinement, transportation, etc. This kind of block diagram is typical of many closed-loop (feedback) control systems, where the system is said to reach equilibrium if the demand (for gas by drivers) matches the supply (of gas in the market). In data networks, users drive the demand on the network and have different utilities (expectations) when downloading music, playing games, making Skype (voice/video) calls, or denying others service by launching a denial-of-service (DoS) attack! In turn, the network observes user demand and sets prices, where the price could be real money, or it could be some measure (indication) of congestion (e.g., delay, loss), or it could represent additional resources that need to be allocated to avoid congestion. An important question is: what is the goal of network design? Is it to make users happy? You hope so! Then, mathematically, we say the goal of the network is to maximize the sum of utilities for all its
1 OPEC: the Organization of the Petroleum Exporting Countries.
users [23].2 Figure 2 illustrates the data network equivalent of the gas control loop shown in Figure 1. We next consider the modeling of user utility and network behavior (resource allocation), before introducing the optimization framework to study the (optimal) steady state for the users and network.

Figure 1: The gas control loop
(see Figure 4), and the derivative $\dot{U}_r(x_r) \rightarrow \infty$ as $x_r \rightarrow 0$, while $\dot{U}_r(x_r) \rightarrow 0$ as $x_r \rightarrow \infty$. Throughout this chapter we assume strictly concave utilities.
Figure 4: Concave function. A function $f(\cdot)$ is said to be concave if $f(\lambda x_1 + (1-\lambda)x_2) \geq \lambda f(x_1) + (1-\lambda)f(x_2)$ for any $\lambda \in [0,1]$, i.e., for any two points $x_1$ and $x_2$, the straight line that connects $f(x_1)$ and $f(x_2)$ is always below or equal to the function $f(\cdot)$ itself. Note that a differentiable concave function has a maximum value at some point $x_{max}$, and that the derivative $\dot{f}(x_{max}) = 0$. A strictly concave function would have a strict inequality, whereas a convex function has a cup-like shape and has a minimum instead.
Since each user has one flow over a single path, we use the terms user, flow, and route interchangeably.
$$\max \sum_{r \in R} U_r(x_r) \quad \text{subject to } Ax \leq C, \quad \text{over } x \geq 0$$
For such an optimization problem, it is known that there exists a unique solution. This is the case because the function to optimize is strictly concave and the link capacity inequality constraints $Ax \leq C$ form a so-called convex set (see Figure 6).
Figure 6: A convex set. A convex set intuitively means that any linear combination of any two points located on the boundary of the region, which is formed by the linear inequalities, lies within the region itself.

The practical challenge in solving this problem, however, is that the network does not know the utilities of its users; moreover, its centralized nature makes it computationally expensive to solve! To address these challenges, we start by decomposing the problem into R problems, one for each user r ∈ R, and one problem for the network (we will later decompose this network problem further into individual
resource problems). The network will present each user with a price $\lambda_r$ ($/bit). Through these prices, the network attempts to infer user utilities. Specifically, observing $\lambda_r$, user $r$ will then choose an amount to pay $w_r$ ($/second) for the service (that maximizes the user's utility), which in turn determines how much rate $x_r$ (bits/second) the user would get ($x_r = w_r/\lambda_r$). The network sets its prices $\lambda_r$ based on the loads $x_r, \forall r$.
where $\lambda_r = w_r/x_r$. Given the network price $\lambda_r$ and its own private utility function $U_r$, user $r$ determines how much it is willing to pay $w_r$ so as to maximize her own utility. Knowing the vector $W = \{w_r, \forall r\}$, and its routing and capacity matrices, the network allocates user rates $\{x_r\}$ by optimizing some network function $f(x, W)$. Once the $x_r$'s are obtained, prices are obtained as $\lambda_r = w_r/x_r$.
$$\max \sum_{r \in R} f(x_r, w_r) \quad \text{subject to } Ax \leq C, \quad \text{over } x \geq 0$$

Consider, for example, the choice $f(x_r, w_r) = w_r x_r$.
Maximizing this function results in maximizing the total weighted throughput for all users. As a special case, for unit weights, the network optimization problem maximizes the total throughput through the network. This might seem to fly in the face of what we think is fair! Consider the following simple example (see Figure 7): given both links have capacities of 6 units, the total throughput allocated to all users is the total network capacity of 12 units. This can be achieved by allocating 6 units of capacity to each of the 1-link flows (users), the red user and the blue user, leaving the 2-link (green) flow with no capacity allocated to its user. That does not seem fair! A different function f would allocate rates to users differently, and so it would provide a different notion of fairness. But the big question is: how do (should) we define fairness? The research literature introduces many notions of fairness, most notably the so-called max-min fairness.
Intuitively, max-min fairness means that resources (link capacities) are allocated to users (flows) so that the allocation is: 1. fair: all users get an equal share of a link, as long as users have demand that would fully consume their share, and 2. efficient: each link is utilized to the maximum load possible. In other words, if a user cannot consume its equal share of a link, then the excess capacity must be (recursively) allocated equally among high-demanding users. So the final outcome is that low-demanding users get exactly what they need, while high-demanding users get equal allocations. Consider the following multilink network example (see Figure 8): all links have capacities of 150 units and we assume elastic traffic
sources, i.e., sources that would consume all that they can get.

Figure 8: Max-min fair capacity allocation

We start with the first (left-most) link, since it is used by most users and so is the most loaded one. Each flow using that link gets allocated an equal share of 150/3 = 50 units. Proceeding to the next loaded link, the middle one, each of its two flows should get an equal share of 75; however, flow F3 is limited by its first link to 50 units of throughput. Thus, flow F4 gets the left-over from F3, for a total allocation of 75 + 25 = 100. The right-most link, at capacity 150, does not limit the throughput of F4, which ends up using only 100 units of that link, leaving 50 unused. At the end of this process, we say that the max-min fair allocation vector is $x^T = (50, 50, 50, 100)$. Mathematically, max-min fairness is achieved when the network maximizes the following function: $f = \min_{r \in R} x_r$. Intuitively, maximizing the minimum of allocated rates results in equalizing these rates, as long as users have enough demand to consume these rates over the network.
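The allocation above can be computed mechanically by "progressive filling": grow all rates equally, freeze the flows crossing any link that saturates, and repeat. A minimal sketch (the routing matrix encodes the topology of Figure 8; function and variable names are our own):

```python
import numpy as np

def max_min_fair(A, C):
    """Progressive filling for max-min fairness: A is the links x flows
    routing matrix, C the vector of link capacities."""
    x = np.zeros(A.shape[1])
    active = np.ones(A.shape[1], dtype=bool)
    while active.any():
        n = A @ active.astype(float)             # active flows per link
        spare = C - A @ x
        inc = np.where(n > 0, spare / np.maximum(n, 1), np.inf)
        if not np.isfinite(inc.min()):
            break                                # remaining flows unconstrained
        x[active] += inc.min()
        saturated = np.isclose(A @ x, C)         # freeze flows on full links
        active &= ~(A[saturated].sum(axis=0) > 0)
    return x

A = np.array([[1, 1, 1, 0],    # left link:   F1, F2, F3
              [0, 0, 1, 1],    # middle link: F3, F4
              [0, 0, 0, 1]])   # right link:  F4
print(max_min_fair(A, np.array([150.0, 150.0, 150.0])))  # [ 50. 50. 50. 100.]
```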
2.5.2 Proportional Fairness

Another equally popular fairness definition is the so-called (weighted) proportional fairness. This notion of fairness is achieved when the network maximizes the following function:

f = Σ_{r∈R} wr log(xr)
Note that the log function is concave and strictly increasing. Thus, given an optimal rate allocation x* that is feasible, i.e. x* ≥ 0 and Ax* ≤ C, any other feasible solution x will make the aggregate proportional change less than or equal to zero:

Σ_{r∈R} wr (xr − xr*)/xr* ≤ 0

To show this, for simplicity, assume one user and unit weight, so f(x) = log(x). Expanding f(x) into its first-order (linear) Taylor approximation around x*, we obtain:

f(x) ≈ f(x*) + (x − x*) f′(x*)

Given the derivative f′(x*) = 1/x*, we have:

f(x) ≈ f(x*) + (x − x*)/x*

Since f is maximized at x*, f(x*) ≥ f(x), and so the proportional fairness condition must hold:

(x − x*)/x* ≤ 0

Note that the presence of the weight wr intuitively means that user (flow) r is equivalent to wr users with unit weight each.

2.5.3 General Parameterized Utility
If the network function f(x) is a function of the utilities of its users U(x), then the network is in fact maximizing a function of user utilities. Assuming each user r has unit weight wr, Ur(xr) can be generalized as [18]:

Ur(xr) = xr^(1−α) / (1 − α)

where α is a parameter that determines the fairness criterion of the network. More specifically, if α → 0, then a user's utility is linear in its allocated rate and the network is effectively maximizing the sum of user utilities Σ_{r∈R} Ur(xr) = Σ_{r∈R} xr, which in turn yields a greedy allocation that maximizes the total throughput over the network. On the other hand, if α → 1, then this is equivalent to a log utility, yielding the proportionally fair allocation. To see this, let's take the derivative of Ur(xr):

U̇r(xr) = (1 − α) xr^(−α) / (1 − α) = xr^(−α) → 1/xr as α → 1

By integrating U̇r(xr), we get back Ur(xr) = log(xr). Similarly, it can be shown that α → ∞ corresponds to maximizing the minimum utility, yielding the max-min fair allocation.
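One can check numerically how the allocation for the Figure 7 network morphs as α varies. The sketch below (our illustration; the solver choice and the particular α values are arbitrary) maximizes Σ_{r} Ur(xr) directly:

# Alpha-fair allocations on the Figure 7 topology (illustrative sketch).
import numpy as np
from scipy.optimize import minimize

def neg_utility(x, alpha):
    if abs(alpha - 1.0) < 1e-9:                  # alpha = 1: log utility
        return -np.sum(np.log(x))
    return -np.sum(x ** (1.0 - alpha) / (1.0 - alpha))

cons = [{"type": "ineq", "fun": lambda x: 6 - (x[0] + x[1])},   # link 1
        {"type": "ineq", "fun": lambda x: 6 - (x[0] + x[2])}]   # link 2

for alpha in [0.0, 1.0, 2.0, 10.0]:
    res = minimize(neg_utility, x0=[1.0, 1.0, 1.0], args=(alpha,),
                   bounds=[(1e-3, None)] * 3, constraints=cons, method="SLSQP")
    print(alpha, np.round(res.x, 2))

# alpha -> 0 :  ~(0, 6, 6)        greedy throughput maximization
# alpha  = 1 :  ~(2, 4, 4)        proportional fairness
# alpha  = 2 :  ~(2.5, 3.5, 3.5)  minimum potential delay (cf. the TCP Reno discussion below)
# alpha large:  ~(3, 3, 3)        approaches the max-min fair allocation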
The network can compute the (weighted proportionally fair) allocation by solving:

maximize Σ_{r∈R} wr log(xr)
subject to Ax ≤ C
over x ≥ 0
We can solve this problem using the theory of constrained convex optimization, via the Lagrangian technique. Specifically, we move the constraints into the objective function that we want to optimize, thus making the optimization problem effectively unconstrained. We do so by introducing so-called Lagrangian multipliers into the new objective (Lagrangian) function L:

max L = Σ_{r∈R} wr log(xr) + λᵀ(C − Ax)
The (row) vector λᵀ is a Lagrangian vector with a variable λj for each link j in the network. Note that L is a strictly concave function, thus a solution exists at which the derivatives of L with respect to each xr and each λj are equal to zero:

∂L/∂xr = wr/xr − Σ_{j∈r} λj    (1)
∂L/∂λj = Cj − Σ_{r∋j} xr    (2)

The notation j ∈ r indicates all links j used by user (flow/route) r, whereas r ∋ j denotes all flows r using link j, i.e. the total load on link j. By equating the first set of equations (1) to zero, we obtain the (weighted proportionally fair) solution:

xr = wr / Σ_{j∈r} λj    (3)
We obtain λj by also equating the second set of equations (2) to zero. Note that λj and (Cj − Σ_{r∋j} xr) must both be greater than or equal to zero, since negative values do not maximize the objective function L! Furthermore, (Cj − Σ_{r∋j} xr) ≥ 0 ensures that the link capacity constraints Σ_{r∋j} xr ≤ Cj are automatically satisfied. If (Cj − Σ_{r∋j} xr) = 0 then λj can be greater than zero. On the other hand, if λj = 0, then the associated link may not be fully utilized, i.e. Σ_{r∋j} xr < Cj. Intuitively, λj represents the cost associated with link j: it is zero if the link is under-utilized, and positive if the link is allocated to capacity.
Example: Consider the example in Figure 7, but now assume the network's objective is to proportionally allocate its capacity, i.e.:

max f = log(x0) + log(x1) + log(x2)
subject to:
x0 + x1 ≤ 6
x0 + x2 ≤ 6
x0, x1, x2 ≥ 0

where x0, x1, and x2 are the rates allocated to the two-link flow (user), the first-link flow, and the second-link flow, respectively.⁴ Using the Lagrangian solution method, we obtain:

max L = log(x0) + log(x1) + log(x2) + λ1(6 − (x0 + x1)) + λ2(6 − (x0 + x2))

Taking derivatives, we obtain:

∂L/∂x0 = 1/x0 − (λ1 + λ2)
∂L/∂x1 = 1/x1 − λ1
∂L/∂x2 = 1/x2 − λ2
∂L/∂λ1 = 6 − (x0 + x1)
∂L/∂λ2 = 6 − (x0 + x2)

Equating these derivatives to zero, the last two equations show full utilization of the link capacities and that x1 = x2, while the first three equations give the following values of the xi's:

x1 = x2 = 1/λ1 = 1/λ2 = 1/λ,  x0 = 1/(2λ)

Substituting into the capacity equations, we obtain the price of each link:

1/(2λ) + 1/λ = 6

Thus λ = 1/4, and so x0 = 2, and x1 = x2 = 4. Note that in this optimal case, each link is fully utilized to capacity, and the flow that traverses two links is charged once for each link it traverses, and so it gets allocated a lower rate.⁵ End Example.
⁴ Note that since the objective (log) function is strictly increasing, the xi's should be as large as possible so as to consume the total capacity of the links; the two inequalities on link capacities can thus be turned into equalities.
⁵ As we will later see, this proportional rate allocation is what TCP Vegas [15] provides.
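As a sanity check, the stationarity conditions of this example can be handed to a computer algebra system; the sketch below is our addition, with variable names following the example:

# Solve the stationarity conditions of the proportional-fairness example
# (illustrative sketch).
from sympy import symbols, solve

x0, x1, x2, l1, l2 = symbols("x0 x1 x2 l1 l2", positive=True)
eqs = [1/x0 - (l1 + l2),     # dL/dx0 = 0
       1/x1 - l1,            # dL/dx1 = 0
       1/x2 - l2,            # dL/dx2 = 0
       6 - (x0 + x1),        # link 1 fully utilized
       6 - (x0 + x2)]        # link 2 fully utilized
print(solve(eqs, [x0, x1, x2, l1, l2], dict=True))
# -> [{x0: 2, x1: 4, x2: 4, l1: 1/4, l2: 1/4}]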
If the utility of each user r is a log function of its allocated rate xr, then the (weighted proportionally fair) network solution xr = wr / Σ_{j∈r} λj is, in fact, a solution to the whole system optimization problem that includes the network as well as all users, each possibly trying to independently (in a distributed way) maximize its own log utility. However, in a distributed setting, as noted earlier, even if the network knows the form of the user utility functions, the network allocates user rates based on their willingness to pay, wr, which might be unknown to the network. This lack of knowledge can be overcome by observing the demand behavior xr of the user and the price λr = Σ_{j∈r} λj, and so wr is computed as wr = xr λr. Otherwise, the network can just assign weights wr to users based on some preference policy.

The moral of the story is that in practice, there is no central network controller that knows W and can then allocate rates to users. Each user and each resource (link) might have its own individual controller that operates independently, and so we need to study the collective behavior of such a composite system and answer questions such as: Would the system converge (stabilize) to a solution in the long term (i.e., reach steady state)? If so, is this solution unique, and how far is it from the target (optimal) operating point? In general, if the system gets perturbed, is it stable, i.e. does it converge back to steady state, and how long does it take to converge, and how smooth or rough is the convergence? In control-theoretic terminology, we refer to the response to such a perturbation until steady state is reached as the transient response of the system. We refer to how far the system is from being unstable, or the magnitude of perturbation that renders the system unstable, as the stability margin. To formally address these questions, we resort to modeling user and network dynamic behaviors in the form of differential (or difference) equations, and then use well-known control-theoretic techniques to study the overall transient and steady-state behavior of the system.
The basic control problem is to control the output of a system given a certain input. For example, we want to control the user demand (sending rate) given the observed network price (e.g., packet loss or delay). Similarly, we want to control the price advertised by a network resource given the demand (rates) of its users. There are basically two kinds of control: open-loop control and closed-loop (feedback) control. In open-loop control systems, there is no feedback about the state of the system, and the output of the system is controlled directly by the input signal. This type of control is simple, but not as common as closed-loop control. An example of an open-loop control system is a microwave that heats food for the input (specified) duration. Feedback (closed-loop) control is more interesting, and multiple controllers may be present in the same control loop; see Figure 2, where a user controller is present to control demand based on price, and a resource controller is present to control price based on demand. Feedback control makes it possible to control the system well even if we can't observe or know everything, if we make errors in our estimation (modeling) of the current state of the system, or if things change. This is because we can continually measure and correct (adapt) to what we observe (i.e., the feedback signal). For example, in a congestion control system, we do not need to know exactly the number of users, the arrival rate of connections, or the service rate of the bottleneck resource, since each user adapts its demand based on its own observed (measured, fed back) price, which reflects the current overall congestion of the bottleneck resource. Associated with feedback control is a delay to observe the feedback (measured) signal, which is referred to as feedback delay. More precisely, feedback delay refers to the time taken from the generation of a control signal (e.g., updated user demand) until the process/system reacts to it (e.g., demand is routed over
the network), this reaction takes effect at each resource (e.g., load is observed on each link), and this effect is fed back to the controller (e.g., price is observed by the user).
This can be re-written as wr − xr λr = 0. Also, we saw that the optimal solution ensures that each link l is fully utilized, i.e. the load (total input rate) on link l, denoted by yl = Σ_{s: l∈s} xs, equals the link capacity cl. The dynamics of the sources and links can then be modeled such that these steady-state user rates and link loads are achieved. Specifically, we can write the dynamic (time-dependent) source algorithm as:

ẋr(t) = k [wr − xr(t) λr(t)]    (5)
where k is a proportionality factor. Note that wr represents how much user r is willing to pay, whereas xr(t)λr(t) represents the cost (price) of sending at that rate. Intuitively, the user's sending rate increases (decreases) when the difference between these two quantities is positive (negative). In steady state, ẋr(∞) → 0, and so we obtain the steady-state solution xr = wr/λr (as expected). Given that the derivative of Ur(t) is U̇r(t) = wr/xr, the source rate adaptation algorithm can be re-written as:

ẋr(t) = k xr(t) [U̇r(t) − λr(t)]
ẋr(t) = K(t) [U̇r(t) − λr(t)]    (6)
Intuitively, the user increases its sending rate if the marginal utility (satisfaction) is higher than the price that the user would pay; otherwise the user decreases its sending rate. We can also write a dynamic equation for the adaptation of the link price λl(t), called the link pricing algorithm:

λ̇l(t) = h (yl(t) − cl)    (7)
where h is a proportionality factor, and the total price λr(t) for user r is the sum of the link prices along the user's route, i.e. λr(t) = Σ_{l: l∈r} λl(t). Intuitively, the link price increases if the link is over-utilized (i.e. yl(t) > cl); otherwise the link price decreases. Note that at steady state, λ̇l(∞) → 0, and we obtain the steady-state optimal solution yl = cl (as expected). It turns out that the source and link algorithms, Equations 6 and 7, represent general user and resource adaptation algorithms that collectively determine the transient and steady-state behavior of the whole system. In what follows, we use the form of Equation 6 to reverse engineer different versions of TCP and deduce the utility function that each TCP source implicitly tries to maximize.
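Before doing so, it helps to see Equations 5 and 7 in action. The sketch below (our addition; the gains k and h, the step size, and the non-negativity projection on prices are implementation choices) simulates both algorithms on the Figure 7 network:

# Euler simulation of source law (5) and link pricing law (7)
# on the Figure 7 topology (illustrative sketch).
w = [1.0, 1.0, 1.0]                  # willingness to pay (unit weights)
routes = [(0, 1), (0,), (1,)]        # links used by flows x0, x1, x2
cap = [6.0, 6.0]
x, lam = [1.0, 1.0, 1.0], [0.1, 0.1]
k, h, dt = 1.0, 0.1, 0.01

for _ in range(200000):
    y = [sum(x[r] for r in range(3) if l in routes[r]) for l in range(2)]
    lam = [max(0.0, lam[l] + dt * h * (y[l] - cap[l]))   # prices kept >= 0
           for l in range(2)]
    price = [sum(lam[l] for l in routes[r]) for r in range(3)]
    x = [x[r] + dt * k * (w[r] - x[r] * price[r]) for r in range(3)]

print([round(v, 2) for v in x], [round(v, 2) for v in lam])
# -> ~[2.0, 4.0, 4.0] and ~[0.25, 0.25]: the proportionally fair solution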
Figure 10: TCP Reno over RED feedback control system

Modeling TCP Reno: First, consider the modeling of TCP Reno, where the congestion window cwnd is increased by 1/cwnd for every acknowledged (non-lost) TCP segment, i.e. it is (roughly) increased by 1 every round-trip time, and cwnd is halved for every loss. Thus, we can write the following equation for the change in the congestion window of a single TCP flow, where p is the segment loss probability:

Δcwnd = (1/cwnd)(1 − p) − (cwnd/2) p

Let x denote the sending rate, and T the round-trip time, thus x = cwnd/T. Assuming acknowledgments (ACKs) come equally spaced, the time between ACKs (or the lack thereof) is given by T/cwnd. Thus, we can re-write the above equation as a rate of change over time:

d cwnd(t)/dt = [ (1/cwnd(t))(1 − p(t)) − (cwnd(t)/2) p(t) ] / (T/cwnd(t))

Dividing both sides by T, we get:

dx(t)/dt = (1/T²)(1 − p(t)) − (x(t)²/2) p(t)    (8)
Let's denote the loss probability p(t) of TCP connection r as pr(t). pr(t) depends on the current load on path r, and can be approximated by the sum of the loss probabilities experienced on the individual links j ∈ r along the connection's path. More specifically:

pr(t) = Σ_{j∈r} pj( Σ_{s: j∈s} xs(t) )

Assuming small p such that (1 − p) ≈ 1, we can re-write Equation 8 as follows:

dx(t)/dt = 1/T² − (x(t)²/2) p(t)
dx(t)/dt = [ 2/(T² x(t)²) − p(t) ] (x(t)²/2)    (9)

Comparing Equation 9 with Equation 6, we can deduce the utility function of a TCP Reno source: U̇(x) = 2/(T²x²). Integrating U̇(x), we get:

U(x) = −2/(T²x)
Observe that maximizing Reno's utility results in minimizing the quantity 1/x, which can be viewed as the potential delay since it is inversely proportional to the allocated rate x. Thus, a network allocation based on such a utility is referred to as a minimum potential delay fair allocation.
Example: Revisiting the example in Figure 7, now assume the network's objective is to allocate its capacity according to the minimum potential delay fair allocation, i.e.:

max f = −( 1/x0 + 1/x1 + 1/x2 )
subject to:
x0 + x1 ≤ 6
x0 + x2 ≤ 6
x0, x1, x2 ≥ 0

where x0, x1, and x2 are the rates allocated to the two-link flow (user), the first-link flow, and the second-link flow, respectively. Using the Lagrangian solution method, we obtain:

max L = −( 1/x0 + 1/x1 + 1/x2 ) + λ1(6 − (x0 + x1)) + λ2(6 − (x0 + x2))

Taking derivatives, we obtain:

∂L/∂x0 = 1/x0² − (λ1 + λ2)
∂L/∂x1 = 1/x1² − λ1
∂L/∂x2 = 1/x2² − λ2
∂L/∂λ1 = 6 − (x0 + x1)
∂L/∂λ2 = 6 − (x0 + x2)
Equating these derivatives to zero, the last two equations show full utilization of the link capacities and that x1 = x2, while the first three equations give the following values of the xi's:

x1 = x2 = 1/√λ1 = 1/√λ2 = 1/√λ,  x0 = 1/√(2λ)

Substituting into the capacity equations, we obtain the price of each link, λ ≈ 0.08, and so x0 ≈ 2.5 and x1 = x2 ≈ 3.5. Note that in this optimal case, each link is fully utilized to capacity, and the rate allocated to a flow is inversely proportional to the square root of the price it observes along its path. Note also that this captures the well-known steady-state relationship between the throughput of a TCP Reno source and the inverse of the square root of the loss probability observed by the TCP source [21]. A TCP Reno source adapting based on Equation 8 would converge to such a steady-state throughput value. End Example.
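This square-root law is easy to verify numerically. The sketch below (our addition; the T and p values are arbitrary) integrates the small-loss form of Equation 8 at a fixed loss probability:

# TCP Reno rate ODE dx/dt = 1/T^2 - (x^2/2) p at fixed loss probability
# (illustrative sketch).
T, p = 0.1, 0.01               # 100 ms RTT, 1% loss
x, dt = 1.0, 0.001
for _ in range(100000):
    x += dt * (1.0 / T**2 - (x**2 / 2.0) * p)
print(round(x, 1), round((1.0 / T) * (2.0 / p) ** 0.5, 1))
# -> both ~141.4: the rate settles at x* = (1/T) * sqrt(2/p)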
Modeling TCP Vegas: Now, let us consider the modeling of another version of TCP: TCP Vegas [15]. This version, unlike Reno, tries to avoid congestion, rather than induce loss and then adapt the transmission (congestion) window to it. The basic idea behind Vegas is to calculate the actual throughput of the connection as w(t)/T(t), where w(t) is the current window size and T(t) is the measured round-trip time (RTT) over the connection's path. This RTT includes queueing delay as well as propagation delay D. Ideally, with no congestion, the ideal throughput can be computed by the source as w(t)/D, where D is estimated using the minimum RTT recently observed by the source. To ensure high utilization of the network, we want some queueing, i.e. the actual throughput should be lower than the ideal one, but not too low so as to start causing congestion (i.e. buffer overflow at the bottleneck link resulting in losses). Vegas then adapts w(t) based on some target difference, α, between the ideal throughput and the actual one. More specifically, the window increases if (w(t)/D − w(t)/T(t)) < α, decreases if (w(t)/D − w(t)/T(t)) > α, and stays the same otherwise. This dynamic source behavior, i.e. the change in window over time, can be modeled as:

dw(t)/dt = k [ α − (w(t)/D − w(t)/T(t)) ]

This can be re-written as:

dw(t)/dt = (k/D) [ αD − (w(t) − (w(t)/T(t)) D) ]

Denoting the sending rate (throughput) by x(t) = w(t)/T(t), and γ = k/D, we have:

dw(t)/dt = γ [ αD − (w(t) − x(t)D) ]

Observe that (w(t) − x(t)D) represents the difference between the window size, i.e. the packets injected by the source, and the number of packets in flight/propagating along the path (the product of throughput and propagation delay). Thus, it represents the number of packets in the bottleneck queue, and αD denotes the target queue occupancy of the bottleneck link. Intuitively, Vegas tries to maintain a small number of αD packets (i.e., 1-2 packets) in the bottleneck queue, to maintain both small delay and high (100%) utilization. Section 4 uses control theory to analyze a Vegas-like transmission model.

At steady state, given that x = w/T, we get: xT − xD = αD. Denoting the queueing delay by Q, we have T = Q + D, and so xQ = αD, i.e.:

x = αD / Q

Comparing with Equation 4, we can deduce that the willingness to pay wr for a Vegas user r is αD, and that the price λr experienced by the user is the queueing delay Q. Now, to deduce the utility function that a Vegas user tries to maximize, let us write its rate adaptation equation following Equation 5:

ẋr(t) = k [ αD − xr(t)Q(t) ]
ẋr(t) = K(t) [ αD/xr(t) − Q(t) ]

Thus, comparing with Equation 6, we deduce: U̇r(t) = αD/xr(t). Integrating, we obtain:

Ur(t) = αD log(xr(t))

Recall that maximizing such user utilities results in a weighted proportionally fair allocation.

Modeling RED: Let us now consider the modeling of the buffer and the associated RED queue management algorithm [6]. Figure 11 shows how RED tries to avoid congestion by dropping (or marking) packets with probability pc as a (non-linear) function of the average queue length v. First, we model the evolution of the queue length b(t) as a function of the total input rate, y(t) = Σ_s xs(t), and the (bottleneck) link capacity, C:

ḃ(t) = y(t) − C

Figure 11: RED dropping (or marking) function

Denoting by v(t) the Exponential Weighted Moving Average (EWMA) of the queue length:

v(t + δ) = (1 − α)v(t) + α b(t)
v(t + δ) − v(t) = α (b(t) − v(t))

Given that v(t) gets updated at the link rate, i.e. δ = 1/C, and v̇(t) ≈ (v(t + δ) − v(t))/δ, we have:

v̇(t) = αC (b(t) − v(t))

This last equation represents the dynamic model of RED averaging, which in turn determines the price pc(t) that users experience. To simplify the model and gain insight, let us ignore the (hard) non-linearities of the RED function and consider only the linear region:

pc(t) = β v(t) + η = β ∫ v̇(t) dt + η = β ∫ αC (b(t) − v(t)) dt + η

where β = pm/(Bmax − Bmin), and η = −pm Bmin/(Bmax − Bmin). To gain more insight, let us further ignore the RED averaging, assuming that the price is set in proportion to the actual queue length, with Bmin = 0 and pm = 1; then we have:

pc(t) = (1/Bmax) b(t)

Differentiating both sides, we obtain:

ṗc(t) = h ḃ(t) = h (y(t) − C)

where h = 1/Bmax. Comparing with Equation 7, the packet dropping (congestion marking) probability, pc(t), represents the price, i.e. the Lagrangian multiplier, observed by the users of this buffer. Note that at steady state, ṗc(∞) → 0, and so y = C, i.e. the link is fully utilized at steady state.
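Returning to the Vegas source model, a small simulation sketch (our addition; the single-bottleneck setup and all parameter values are assumptions) shows the window settling at CD + αD packets, with a standing queue of αD:

# Vegas-like window adaptation over a single bottleneck (illustrative sketch).
C, D = 100.0, 0.1          # capacity (pkts/s) and propagation delay (s)
alpha = 20.0               # target throughput gap (pkts/s): alpha*D = 2 pkts
gamma, dt = 5.0, 0.001
w, q = 1.0, 0.0            # window (pkts) and bottleneck queue (pkts)

for _ in range(100000):
    T = D + q / C                                  # RTT = propagation + queueing
    xr = w / T                                     # sending rate
    q = max(0.0, q + dt * (xr - C))                # queue evolution
    w += dt * gamma * (alpha * D - (w - xr * D))   # Vegas law: track alpha*D backlog

print(round(w, 2), round(q, 2))   # -> ~12.0 and ~2.0 (= C*D + alpha*D, alpha*D)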
A technique called Lyapunov's method [20] allows one to study convergence (stability) by showing that the value of some positive function of the state of the system continuously decreases (or increases) as the system evolves over time. Finding such a Lyapunov function can be challenging, and transient performance can often only be obtained by solving the system equations numerically. To this end, a technique called linearization can prove more tractable, where the non-linear system is approximated by a set of linear equations around a single operating point (state); see Figure 12.

Figure 12: Linearization

With linearization, we become concerned with local stability and study perturbations around the operating point using standard (linear) control theory. By local stability, we mean that if the system is perturbed within a small region around the operating point, then the system will converge and stabilize back to that point. This is in contrast to global stability, where the original (non-linear) system is shown to converge from any starting state. To linearize the non-linear system around an operating point, the basic idea is to expand the non-linear differential equation into a Taylor series around that point and then ignore the high-order terms. In what follows, we briefly review some basics of classical control theory for linear systems, then we introduce non-linear control theory. We also show examples of control-theoretic analysis for the dynamic models introduced above. For more detailed background on control theory, we refer the reader to [20, 13, 16].
In linear control theory, we transform differential equations in the time domain into algebraic equations in the so-called frequency or Laplace domain. Once this Laplace transformation is done, we use simple algebra to study the performance of the system without needing to go back to the (complicated) time domain. Specifically, we can transform a function f(t) into an algebraic function F(s), referred to as the Laplace transform of f(t), as follows:

F(s) = ∫₀^∞ f(t) e^(−st) dt

where s is a complex variable: s = σ + jω; σ is the real part of s, denoted by Re(s), and ω is the imaginary part of s, denoted by Im(s).

Example (Unit step function): The Laplace transform of a unit step function u(t), where u(t) = 1 if t > 0, and u(t) = 0 otherwise, is given by:

U(s) = ∫₀^∞ 1 · e^(−st) dt = 1/s
Example (Impulse function): The Laplace transform of a unit impulse function δ(t), where δ(t) = 1 if t = 0, and δ(t) = 0 otherwise, is given by:

U(s) = ∫₀^∞ δ(t) e^(−st) dt = e⁰ = 1
Table 1 lists basic Laplace transforms.

Table 1: Basic Laplace transforms
Impulse input: f(t) = δ(t) — F(s) = 1
Step input: f(t) = a·1(t) — F(s) = a/s
Ramp input: f(t) = a·t — F(s) = a/s²
Exponential: f(t) = e^(at) — F(s) = 1/(s − a)
Sinusoid input: f(t) = sin(at) — F(s) = a/(s² + a²)
Table 2 lists basic composition rules, where L[f(t)] denotes the Laplace transform of f(t), i.e. F(s).

Table 2: Composition rules
Linearity: L[a f(t) + b g(t)] = aF(s) + bG(s)
Differentiation: L[df(t)/dt] = sF(s) − f(0) = sF(s) if f(0) = 0
Integration: L[∫ f(τ)dτ] = F(s)/s
Convolution: y(t) = g(t) * u(t) = ∫₀ᵗ g(t − τ)u(τ)dτ — Y(s) = G(s)U(s)
Example: Consider the following second-order linear, time-invariant differential equation, where y(t) represents the output of a system and u(t) represents the input:

a₂ ÿ(t) + a₁ ẏ(t) + a₀ y(t) = b₁ u̇(t) + b₀ u(t)

In the time domain, if we represent the system by g(t), then y(t) can be obtained by convolving u(t) with g(t), i.e. y(t) = g(t) * u(t). This involves a complicated integration over the system responses, g(t − τ), to impulse inputs of magnitude u(τ), for all 0 < τ < t. Assuming y(0) = u(0) = 0, and taking the Laplace transform of both sides, we obtain:

a₂ s² Y(s) + a₁ s Y(s) + a₀ Y(s) = b₁ s U(s) + b₀ U(s)
Y(s) = [ (b₁s + b₀) / (a₂s² + a₁s + a₀) ] U(s) = G(s) U(s)

Thus, in the Laplace domain, the output Y(s) can be obtained by simply multiplying G(s), called the transfer function of the system, by U(s). We can then take the inverse Laplace transform, L⁻¹[Y(s)], to obtain y(t), or, as we will later see, we can simply analyze the stability of the system by examining the roots of the denominator of the transfer function G(s) and their location in the complex s-plane. Note that because Y(s) = G(s) for an impulse input, i.e. U(s) = 1, the transfer function G(s) is also called the impulse response function.
Consider now a Vegas-like buffer control system (see Figure 13), where a controller adjusts the source rate x(t) so that the bottleneck queue b(t) tracks a target buffer size Br, with c(t) the service capacity.

Figure 13: Vegas-like system

Instead of solving the equations in the time domain, we transform them to the Laplace domain and analyze the stability of the system algebraically. We start by describing the buffer evolution as:

db(t)/dt = x(t) − c(t)

Then, x(t) is the output of convolving the error e(t) = Br − b(t) with the controller function Gc(t), i.e.:

x(t) = Gc(t) * e(t)

Now, taking the Laplace transforms, we obtain:

B(s) = (X(s) − C(s)) / s
X(s) = Gc(s)E(s) = Gc(s)(Br(s) − B(s))

Figure 14 shows the system using its transfer functions and their input/output flows, where G₀ = 1/s. This is called the block diagram and provides a powerful pictorial tool. From the block diagram, one can write
Figure 14: Block diagram of Vegas-like system

the algebraic equation of the output in terms of the input(s). Dropping the s parameter for convenience:

( (Br − B)Gc − C ) / s = B

Rearranging, we get:

B = [ Gc/(s + Gc) ] Br − [ 1/(s + Gc) ] C    (10)

Note that the system has two inputs, Br(s) and C(s), subjected to the two transfer functions Gc(s)/(s + Gc(s)) and 1/(s + Gc(s)), respectively; adding their responses, we obtain the output B(s).
For a proportional (P-) controller, Gc(s) = Kp, and Equation 10 becomes:

B(s) = [ Kp/(s + Kp) ] Br(s) − [ 1/(s + Kp) ] C(s)    (11)

An important question is: does the P-controller make the system stable? More precisely, if we subject the system to impulse input(s), does the system converge back to a quiescent state? Control theory gives a systematic way to answer such stability questions by examining the roots of the denominator of the system's transfer function, called the characteristic equation. In this case, the characteristic equation is:

s + Kp = 0  ⇒  s = −Kp

The system is stable if the roots (also called poles) lie in the left-hand side of the complex s-plane. Thus this system is stable if −Kp < 0, i.e. Kp > 0. Note that we did not need to go back to the time domain to analyze the stability of the system. But let's do that here to understand why poles in the left-hand side of the s-plane make the system stable. Taking the inverse Laplace transform of Equation 11, and assuming impulse inputs, i.e. Br(s) = C(s) = 1, we get:

b(t) = Kp e^(−Kp t) − e^(−Kp t)
We can then see that b(t) = (Kp − 1)e^(−Kp t) decays exponentially over time, starting from b(0) = (Kp − 1). We say that the system is stable, or that it exhibits an overdamped response. We can also analyze transient performance by noting that b(0) = (Kp − 1) represents an overshoot in response to the impulse input, and that this overshoot is lower for lower Kp (see Figure 15). So by controlling Kp, referred to as the controller gain, we can also control the system's transient response.
Now consider an integral controller, Gc(s) = Ki/s (the integral part of a PI-controller). The characteristic equation s + Gc(s) = 0 becomes:

s² + Ki = 0  ⇒  s = ±j√Ki

Given Ki > 0, the two imaginary conjugate poles lie on the imaginary axis, the boundary of the left-hand side of the complex s-plane, and so the system is stable, though critically stable, as we explain next. To convince ourselves, let us go back to the time domain by taking the inverse Laplace transform:

L⁻¹[ Ki/(s² + Ki) ] = L⁻¹[ Ki / ((s − j√Ki)(s + j√Ki)) ] = L⁻¹[ A/(s − j√Ki) + B/(s + j√Ki) ] = A e^(j√Ki t) + B e^(−j√Ki t)

Given the fact that e^(jθ) = cos θ + j sin θ, the function in the time domain oscillates in a sinusoidal fashion. Although the time function does not decay over time, it does not diverge, i.e. it is not unstable! So, we consider such a system to have bounded oscillations in response to an impulse input, and we say that it is critically (or marginally) stable, or that the system exhibits an undamped oscillatory response. Note that a higher value of Ki results in more oscillatory behavior (see Figure 16).
Figure 18: Performance Metrics

For our Vegas-like system, the controlled variable is the window size, i.e. the number of packets allowed into the system. The response is the queue length, which we measure and compare to the target buffer size. A good system is one that converges quickly to the desired target, with minimal oscillations (i.e., overshoots and undershoots) and with almost zero steady-state error.
Example (P-control of Vegas-like system, steady-state error): The steady-state error is ess = lim_{s→0} s E(s), where E(s) = Br(s) − B(s). Substituting for B(s) from Equation 11 and assuming step inputs, i.e. Br(s) = Br/s and C(s) = C/s, we have:

ess = lim_{s→0} s [ Br(s) − (Kp/(s + Kp)) Br(s) + (1/(s + Kp)) C(s) ] = C/Kp

Recall that under the P-controller, the system is (overdamped) stable, i.e. b(t) approaches the target without oscillations; however, at steady state, b(t) misses the target by C/Kp and stabilizes at a value lower than Br. Notice that the higher the service capacity C, the larger the steady-state error. So, to decrease the steady-state error, the controller gain Kp could be increased. However, increasing Kp increases the overshoot. A tradeoff clearly exists between transient performance and steady-state performance, and one has to choose Kp to balance the two and meet the desired operational goals. End Example.

Example (PI-control of Vegas-like system): E(s) = Br(s) − B(s). Substituting for B(s) from Equation 12 and using the Final Value Theorem, we obtain:

ess = lim_{s→0} s [ Br(s) − (Ki/(s² + Ki)) Br(s) + (s/(s² + Ki)) C(s) ]

Assuming step inputs, i.e. Br(s) = Br/s and C(s) = C/s, we have:

ess = lim_{s→0} s [ (s/(s² + Ki)) Br + (1/(s² + Ki)) C ] = 0
Although the steady-state error is zero under the PI-controller, recall that the system is critically stable, i.e. it converges to the target while oscillating. Decreasing the controller gain Ki decreases these oscillations, however at the expense of longer time to reach steady state. This illustrates again the inherent tradeoff between transient performance and the quality of the steady state.
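Both behaviors can be reproduced with a few lines of code (a sketch we add; the linear model does not clip the buffer at zero, and all parameter values are arbitrary):

# P- vs I-control of the buffer model db/dt = x(t) - C, with e = Br - b
# (illustrative sketch; reports (min, max) of b over the last 100 s).
Br, C, dt = 50.0, 10.0, 0.001

def run(ctrl, steps=500000):
    b, integ, tail = 0.0, 0.0, []
    for i in range(steps):
        e = Br - b
        integ += e * dt
        b += dt * (ctrl(e, integ) - C)    # db/dt = x - C
        if i >= steps - 100000:
            tail.append(b)
    return round(min(tail), 1), round(max(tail), 1)

Kp, Ki = 0.5, 0.05
print(run(lambda e, i: Kp * e))   # P: ~(30.0, 30.0), settles at Br - C/Kp = 30
print(run(lambda e, i: Ki * i))   # I: wide (min, max) straddling Br = 50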
As we have seen, linear control theory can be applied to non-linear systems if we assume a small range of operation around which the system behavior is linear or approximately linear. This linear analysis is simple to use, and the system, if stable, has a unique equilibrium point. On the other hand, most control systems are non-linear and operate over a wide range of parameters, and multiple equilibrium points may exist. In this case, non-linear control theory could be more complex to use. In what follows, we first consider a non-linear model of the adaptation of sources and network, and use a non-linear control-theoretic stability analysis method, called the Lyapunov method [20]. Then, we linearize the system and illustrate the application of linear control-theoretic analysis.
Consider the source adaptation (cf. Equation 5):

ẋr(t) = k [ wr − xr(t) pr(t) ]

where pr(t) represents the total price observed by user r along its path. Note that this differential equation is non-linear, since pr(t) is a function of the rates xs(t):

pr(t) = Σ_{l ∈ route r} pl(t) = Σ_{l∈r} pl( Σ_{s: l∈s} xs(t) )
We assume that the pricing function pl(y) is monotonically increasing in the load y. At steady state, if the system stabilizes, setting the derivatives to 0, we obtain the steady-state solution:

k [wr − xr pr] = 0  ⇒  xr = wr/pr = wr / Σ_{l∈r} pl

To prove stability, we use the non-linear method of Lyapunov. The basic idea is to find a positive scalar function V(x(t)), which we call the Lyapunov function, and show that the function monotonically increases (or decreases) over time, approaching the steady-state solution. Define V(x) as follows:

V(x) = Σ_{r∈R} wr log(xr) − Σ_{j∈J} ∫₀^(Σ_{s: j∈s} xs) pj(y) dy

Finding a suitable Lyapunov function that shows stability is tricky and more of an art! If you can't find one, it does not mean that the system is not stable. Note that this V(x) has a special meaning: the first term represents the utility gain from making users happy, while the second term represents the cost in terms of price, so V(x) represents the net gain. Also, note that since the first term is concave because of the log function, and the second term is assumed to be monotonically increasing, the resulting V(x) is concave, i.e. it has a maximum value.

To show that V(x(t)) is strictly convergent, we want to show that dV(x(t))/dt > 0, which implies that V(x(t)) strictly increases (i.e. the net gain keeps increasing over time), until the system stabilizes and reaches steady state, when dV(x(t))/dt = 0 (i.e. the net gain V(x) reaches its maximum value). First, we note:

∂V(x)/∂xr = wr/xr − Σ_{j∈r} pj( Σ_{s: j∈s} xs ) = wr/xr − pr

Then:

dV(x(t))/dt = Σ_{r∈R} (∂V(x(t))/∂xr)(dxr(t)/dt)
            = Σ_{r∈R} [ wr/xr(t) − pr(t) ] · k [ wr − xr(t) pr(t) ]
            = Σ_{r∈R} (k/xr(t)) [ wr − xr(t) pr(t) ]² ≥ 0

with equality only at the steady-state solution.
Observe that this non-linear stability analysis shows that the system is stable no matter what the initial state x(0) is. This property is referred to as global stability, in contrast to the local stability that we prove when the system is linearized around a certain operating point, as we will see next.
Linearizing the source adaptation around the operating point x* gives, for the perturbation x̃(t) = x(t) − x*, a differential equation of the form:

dx̃(t)/dt = −γ x̃(t)    (13)

for some constant γ > 0. This is now a linear differential equation, which, unlike the original non-linear differential equation, we can easily study using linear control-theoretic techniques, or, in this simple case, solve by straightforward integration:

∫₀ᵗ dx̃/x̃ = −γ ∫₀ᵗ dτ
log( x̃(t)/x̃(0) ) = −γt
x̃(t) = x̃(0) e^(−γt)

Note that from this time-domain analysis, the system is shown to be stable, i.e. the perturbation vanishes over time and the system returns to its original state x*. We also observe that the system response decays exponentially from its original perturbation x̃(0), i.e. without oscillations, and so the response is classified as overdamped. If the linearized differential equation modeling the system were more complicated, it would be much easier to transform it into the Laplace domain and analyze the system algebraically. Denoting x̃(t) by u(t) and its Laplace transform by U(s), and taking the Laplace transform of Equation 13, we get:

sU(s) − u(0) = −γU(s)
U(s) = u(0) / (s + γ)

For stability analysis, we examine the location of the poles (roots) of the characteristic equation s + γ = 0, yielding the pole s = −γ. Since the pole is strictly in the left side of the s-plane, given γ > 0, the system is stable and its response is overdamped. To evaluate the steady-state error, we take the error to be the remaining perturbation, and, applying the Final Value Theorem with an impulse perturbation of magnitude u(0), i.e. U(s) = u(0)/(s + γ), we obtain:

ess = lim_{s→0} s E(s) = lim_{s→0} s [ u(0)/(s + γ) ] = 0

So, there is no steady-state error.

5.2.1 Effect of Feedback Delay and Nyquist Stability Criterion
As we just noted above, the power of solving the linearized model in the Laplace domain shows when the model is even slightly more complicated. For example, let us consider a feedback delay T, such that Equation 13 becomes:

du(t)/dt = −γ u(t − T)
Taking the Laplace transform, and noting that the Laplace transform of a delayed signal u(t − T) is e^(−sT)U(s), we obtain:

sU(s) − u(0) = −γ e^(−sT) U(s)
U(s) = u(0) / (s + γ e^(−sT))

This yields the characteristic equation:

s + γ e^(−sT) = 0    (14)
which we need to solve to locate the poles and determine the stability of the system. To solve such a characteristic equation, we resort to another control-theoretic method, called the Nyquist stability criterion [20]. To this end, we introduce, without proof, Cauchy's principle [20], which states: given F(s), if we plot F(s) as we vary s along a certain contour (trajectory) in the s-plane (see Figure 19), and we denote the following:

Z: the number of zeros of F(s), i.e. the roots of the numerator of F(s), inside the contour.
P: the number of poles of F(s), i.e. the roots of the denominator of F(s), inside the contour.
N: the number of times the plot of F(s) encircles the origin in the F(s)-plane, where an encirclement is negative if it is in the opposite direction of the s-contour.

Then the following relationship holds: Z = P + N.
The Nyquist method applies Cauchy's principle as follows. Say we want to analyze the stability of a closed-loop control system whose forward transfer function is G(s) and whose feedback transfer function is H(s) (see Figure 20). Then the closed-loop transfer function is given by G(s)/(1 + G(s)H(s)), where G(s)H(s) is referred to as the open-loop transfer function. The characteristic equation is given by F(s) = 1 + G(s)H(s) = 0. Observe that the zeros of F(s) are the closed-loop poles, and the poles of F(s) are the poles of G(s)H(s) (the so-called open-loop poles).
Figure 20: Typical closed-loop control system

By taking the s-contour to be around the right side (i.e. the unstable side) of the s-plane (see Figure 21), and noting the number of unstable open-loop poles P and the number of encirclements N around the origin in the F(s)-plane, we determine the number of unstable zeros Z of F(s), i.e. the number of unstable closed-loop poles, using Cauchy's relationship Z = P + N. If P = 0 and N = 0, then Z = 0 implies that there are no unstable closed-loop poles, and so the closed-loop system is stable.
This process can be slightly simplified if, instead of plotting F(s), we plot the open-loop transfer function G(s)H(s) and observe its encirclements of the (−1, j0) point in the G(s)H(s)-plane, instead of the origin (0, j0) in the F(s)-plane. Given there are no poles of G(s)H(s) in the right side of the s-plane, i.e. P = 0, in order for the closed-loop system to be stable, the plot of G(s)H(s) should not encircle −1 as we vary s along the contour enclosing the right side of the s-plane. We are mostly interested in varying s along the imaginary axis, i.e. s = jω, where ω varies from 0 to ∞. This is because the plot for ω from −∞ to 0 is symmetric, and the semi-circle as s → ∞ maps to the origin in the G(s)H(s)-plane. Thus, we are interested in plotting G(jω)H(jω) as ω varies from 0 to ∞.

Example: Let's go back to the characteristic equation in Equation 14:

s + γ e^(−sT) = 0  ⇒  F(s) = 1 + γ e^(−sT)/s  ⇒  G(s)H(s) = γ e^(−sT)/s

Note that G(s)H(s) does not have any unstable poles, i.e. P = 0; in particular, s = 0 is considered a (critically) stable pole. Ignoring the constant factor γ for now, we want to plot:

e^(−jωT) / (jω)

Noting that e^(−jθ) = cos θ − j sin θ, we have:

e^(−jωT) = cos(ωT) − j sin(ωT)

Then,

e^(−jωT)/(jω) = −sin(ωT)/ω − j cos(ωT)/ω

Since we are interested in determining the intercepts with the real axis of G(jω)H(jω), and whether they occur to the right or left of −1 (see Figure 22), we want to determine the values of ω for which the imaginary part of G(jω)H(jω), i.e. −cos(ωT)/ω, is zero. Such intercepts occur when ωT = π/2, 3π/2, 5π/2, ..., where the cosine value is zero. Now, at these values of ωT, we can determine the points of interception along the real axis, i.e. the value of −sin(ωT)/ω when the plot intercepts the real axis:

−sin(ωT)/ω = −2T/π, +2T/(3π), −2T/(5π), ...

For the system to be stable, the magnitude |G(jω)H(jω)| at these intercepts must be less than 1, so that the G(s)H(s) plot does not encircle −1. This is the case if, after restoring the constant factor γ we initially ignored, the following condition holds:

γ 2T/π < 1
End Example. Observe that T is the feedback delay, so as T gets larger, it becomes harder to satisfy the stability condition. Intuitively, this makes sense since a larger feedback delay results in outdated feedback (measurements) and it becomes impossible to stabilize the system. This is the fundamental reason why TCP over long-delay paths does not work, and architecturally, control has to be broken up into smaller control loops.
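This stability boundary (γT < π/2) can be confirmed by integrating the delayed equation directly. The sketch below (our addition; Euler integration with a history buffer) tries one gain on each side of the bound:

# Stability of du/dt = -gamma * u(t - T): stable iff gamma*T < pi/2
# (illustrative sketch).
from collections import deque

def final_magnitude(gamma, T, dt=0.001, horizon=200.0):
    n = int(T / dt)
    buf = deque([1.0] * (n + 1), maxlen=n + 1)     # u == 1 for t <= 0
    for _ in range(int(horizon / dt)):
        buf.append(buf[-1] - dt * gamma * buf[0])  # buf[0] is u(t - T)
    return abs(buf[-1])

T = 1.0
print(final_magnitude(1.2, T))   # gamma*T = 1.2 < pi/2 ~ 1.57 -> decays toward 0
print(final_magnitude(2.0, T))   # gamma*T = 2.0 > pi/2 -> oscillations blow up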
Routing Dynamics
So far, we have assumed the routes taken by flows to be static. In general, routes may also be adapted based on feedback on link prices (reflecting load, delay, etc.), albeit over a longer timescale of minutes, hours or even days, compared to milliseconds for sending-rate adaptation. Figure 23 shows a block diagram that includes both route and sending-rate adaptation.

Figure 23: Block diagram with both flow and routing control

Figure 24 illustrates the general process of adaptation. Flow or routing control determines the amount of load directed to a particular link based on the link's observed price, relative to that of other possible links on alternate routes in the case of routing. We call this mapping from link price λ to link load x the response function G(λ). Given the link load, a certain price is observed for the link. We call this load-to-price mapping the pricing (feedback) function H(x). The process of adaptation is then an iterative process:

λ = H(x)
x = G(λ)
Figure 24: Convergence

We can then write:

λ = H(G(λ)) = F(λ)

where F(λ) is an iterative function whose stable (fixed) point is the intersection of the response function and the pricing function. Figure 25 illustrates convergence to a fixed point: starting from an initial λ₀, we find F(λ₀); then, projecting on the 45° line, we obtain λ₁ = F(λ₀), which we use to find F(λ₁), and this iterative process continues until we reach the fixed point.

Figure 25: Contractive mapping

In order to converge to that fixed point, F(λ) must be a so-called contractive mapping. Intuitively, F(λ) is contractive iff its slope is less than 1, i.e. |F(λ₂) − F(λ₁)| < β|λ₂ − λ₁|, β < 1. Figure 26 illustrates a mapping that results in divergence. Intuitively, the use of Lyapunov functions to prove convergence tests whether the iterative process describing the evolution of the system over time is a contractive mapping, i.e. whether the distance to the fixed point keeps shrinking at every iteration.
Figure 26: Divergent mapping

Example: Consider the adaptive routing of N > 0 unit-rate flows over two possible paths whose prices are given by monotonically increasing functions p1(x) and p2(N − x), where x represents the number of flows (or load) routed on the first path. Note that x completely defines the state of the system. Also, assume that routing to the least-loaded path is done gradually, to avoid wild oscillations, where a fraction 0 < α < 1 of the flows is re-routed. Using a discrete-time model where routes are adapted at discrete time steps, we can write the following difference equations:

x(t + 1) = x(t) + α[N − x(t)]   if p1(x(t)) ≤ p2(N − x(t))
x(t + 1) = x(t) − αx(t)         otherwise

At steady state, this system might converge to one of two possible stable (fixed) points. One possibility is obtained by substituting x(t) → x* in the first difference equation: x* = x* + α[N − x*] ⇒ x* = N, so all traffic ends up routed on the first path. A necessary condition to reach the x* = N fixed point is that p1(N) ≤ p2(0), i.e. the first path is least loaded (priced) even when all N flows are on it. Another possibility is obtained by substituting x(t) → x* in the second difference equation: x* = x* − αx* ⇒ x* = 0, so all traffic ends up routed on the second path. A necessary condition to reach the x* = 0 fixed point is that p1(0) > p2(N), i.e. the second path is least loaded (priced) even when all N flows are on it.

We can show convergence to one of these fixed points depending on which necessary condition holds. Let's assume p1(N) ≤ p2(0) holds. We want to define a Lyapunov function V(x) ≥ 0 and show that V(x(t + 1)) ≤ V(x(t)) for some or all starting states x(0), i.e. V(x) monotonically decreases toward the x* = N fixed point, where equality holds. If there are only certain values of the starting state x(0) for which the system converges, then such conditions must hold, in addition to the necessary condition, for convergence to happen; in that case, we say that the necessary condition by itself is not sufficient for convergence.

Define V(x) = N − x. Note that V(x) ≥ 0 because 0 ≤ x ≤ N, and V(x) = 0 when x = N, i.e. at the fixed point. So, under convergence, we expect V(x) to monotonically decrease toward zero. Given that the pricing functions are monotonically increasing with load, p1(N) ≤ p2(0) implies p1(x(t)) ≤ p2(N − x(t)) for all x(t), so only the first difference equation applies; substituting for x(t + 1) in V(x(t + 1)) = N − x(t + 1), we obtain:

V(x(t + 1)) = N − (x(t) + α[N − x(t)]) = (1 − α)(N − x(t)) = (1 − α)V(x(t)) ≤ V(x(t))

So, we can conclude that the system is convergent regardless of the starting state x(0), as long as 0 < α < 1. Thus, 0 < α < 1, along with the necessary condition p1(N) ≤ p2(0), represents a necessary and sufficient condition for convergence. A similar convergence proof can be obtained if, instead, the necessary condition p1(0) > p2(N) holds. End Example.
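A direct simulation of these difference equations (our sketch; the linear price functions are an assumption) shows the geometric convergence implied by V(x(t+1)) = (1 − α)V(x(t)):

# Two-path routing adaptation (illustrative sketch; linear prices assumed).
N, a = 10.0, 0.3                       # N unit-rate flows, re-route fraction a
p1 = lambda load: 1.0 + 0.1 * load     # p1(N) = 2.0 <= p2(0) = 3.0
p2 = lambda load: 3.0 + 0.1 * load

x = 1.0                                # flows currently routed on path 1
for _ in range(50):
    if p1(x) <= p2(N - x):
        x += a * (N - x)
    else:
        x -= a * x
print(round(x, 4))                     # -> ~10.0: the x* = N fixed point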
In this and the following section, we consider the modeling and control-theoretic analysis of two traffic control case studies. This first case study [17, 9] concerns the performance of elastic flows, i.e., rate-adaptive flows similar to TCP. The goal is to investigate the effect of class-based scheduling that isolates elastic flows into two classes (service queues) based on different characteristics, for example their lifetime (transfer size), or the burstiness of their arrivals/departures and sending-rate (window) dynamics. We want to show the benefits of isolation, in terms of better predictability and fairness, over traditional shared queueing systems.

We formulate two control models. In the first model (Section 7.1), each flow controls its input traffic rate based on the aggregate state of the network due to all N flows. In the second model (Section 7.2), each flow (or class of homogeneous flows) controls its rate based on its own individual state within the network. We assume that the flows use PI control for adapting their sending rate. In the aggregate control model, the packet sending rate of flow i, denoted by xi(t), is adapted based on the difference between a target total buffer space, denoted by B, and the current total number of outstanding packets, denoted by q(t). In the individual control model, xi(t) is adapted based on flow (or class) i's target, denoted by Bi, and its current number of outstanding packets, denoted by qi(t). We denote by c(t) the total packet service rate, and by ci(t) the packet service rate of flow/class i. In what follows, for each control model, we determine conditions under which the system stabilizes. We then solve for the values of the state variables at equilibrium, and show whether fairness (or a form of weighted resource sharing) can be achieved. Table 3 lists all system variables along with their meanings. The aggregate control model is:

ẋi(t) = αi (B − q(t))
q̇(t) = Σ_{j=1}^{N} xj(t) − c(t)    (15)
Table 3: Table defining system variables
N — total number of flows (or classes of homogeneous flows)
xi(t) — packet sending rate of flow/class i
qi(t) — number of outstanding packets of flow/class i
ci(t) — packet service rate of flow/class i
q(t) — total number of outstanding packets
c(t) — total packet service rate
B — target total buffer space
Bi — target buffer space allocated to flow/class i
αi — parameter controlling the increase and decrease rate of xi(t)
Stability Condition: Without loss of generality, assume a constant packet service rate (i.e. c(t) = C for all t), that all flows start with the same initial input state (i.e. xi(0) is the same for all i), and that all flows adapt at the same rate (i.e. αi = α for all i). Then, Equations (15) can be re-written as:

ẋi(t) = α (B − q(t))
q̇(t) = Σ_{j=1}^{N} xj(t) − C    (16)

Since flows adapt their xi(t) at the same rate, xi(t) = (Σ_{j=1}^{N} xj(t))/N for all i. Denote by e(t) the error at time t, i.e. e(t) = B − q(t), and let y(t) = Σ_{j=1}^{N} xj(t) − C. Equations (16) can then be re-written as:

ẏ(t) = Nα e(t)
q̇(t) = y(t)    (17)

Taking the Laplace transform of Equations (17), and assuming the buffer is initially empty (i.e. q(0) = 0), we get:

(1/(Nα)) (sY(s) − y(0)) = E(s)
Q(s) = Y(s)/s
E(s) = B(s) − Q(s)    (18)

Solving Equations (18), we obtain the closed-loop system's characteristic equation (see Figure 27 for the system's block diagram):

s² + Nα = 0  ⇒  s = ±j√(Nα)    (19)
For α > 0, this system is marginally stable. However, the magnitude of oscillations increases for higher α and/or higher N. This indicates that the existence of flows that rapidly change their sending rates, through high values of αi, can cause the system to exhibit high oscillations.
Figure 27: Block diagram of the link sharing model

This suggests that elastic flows that aggressively change their sending rates may affect the stability of other flows that change their sending rates cautiously, in a system that mixes both kinds of flows. Furthermore, in such a system, the value of N may be high enough to cause high oscillations.

We now derive the values of the state variables at equilibrium. Denote by xi* and q* the steady-state values of xi(t) and q(t), respectively. Then, at equilibrium, we have from Equations (16):

0 = α (B − q*)
0 = Σ_{j=1}^{N} xj* − C

Thus, at equilibrium, q* = B and Σ_{j=1}^{N} xj* = C. In other words, the system converges to a state where the total input rate matches the total service rate, and the target total buffer space is met.

Note that if αi = α for all i, then xi(t) changes at the same rate for every flow i. Consequently, only if we start the evolution of the system with xi(0) the same for all flows do we obtain equal sharing of the network at steady state, i.e. xi* = C/N, regardless of the initial value q(0). In general, when the xi(0) are not equal across flows, the system converges to an unfair state, more precisely, to a state where:

xi* = xi(0) + ( C − Σ_{j=1}^{N} xj(0) ) / N
To summarize, controlling several flows by observing the resulting aggregate state of the network may lead to high oscillations, due either to the existence of flows that rapidly adjust their sending rates, or to a high number of flows competing for the same shared resource. Furthermore, the system is highly likely to converge to an unfair state where flows receive unequal shares of the resource.
Under individual control, flow/class i adapts based on its own state:

ẋi(t) = αi (Bi − qi(t))
q̇i(t) = xi(t) − ci(t)    (20)
Recall that under individual control, flow/class i regulates its input, xi(t), based on its own number of outstanding packets. For simplicity, assume a constant packet service rate, i.e. ci(t) = Ci for all t. Following the same stability analysis as for aggregate control, it is easy to see that flow/class i stabilizes, and the poles of the closed-loop system are:

s = ±j√(αi)
Observe that, unlike aggregate control, flows/classes are isolated from each other. Therefore, the existence of flows/classes that rapidly change their sending rates, through high values of αi, does not affect the stability of other flows. This isolation can be implemented using, for example, a class-based queueing (CBQ) discipline [7]. In such a CBQ system, each class of homogeneous flows can be allocated its own buffer space and service capacity.

We now derive the values of the state variables of flow/class i at equilibrium. Denote by xi* and qi* the steady-state values of xi(t) and qi(t), respectively. Then, at equilibrium, we have from Equations (20):

0 = αi (Bi − qi*)
0 = xi* − Ci

Thus, at equilibrium, qi* = Bi and xi* = Ci. In other words, each flow/class i converges to a state where its input rate matches its allocated service rate, and its target buffer space is met. We note that if the allocated buffers Bi and service capacities Ci are equal, then every flow receives an equal share of the resources, regardless of the initial values xi(0) and qi(0). One can also achieve weighted resource sharing by assigning different Bi and Ci allocations. Thus, a flow/class with higher priority (e.g., short interactive TCP flows operating aggressively in slow start) can be allocated more resources, so as to receive better throughput/delay service than other flows (e.g., long TCP flows operating cautiously in congestion avoidance).

To summarize, controlling each flow (or class of homogeneous flows) by observing its own individual state within the network provides isolation between them. Thus, stability can be achieved for a flow/class regardless of the behavior and number of other flows/classes. Furthermore, the system can converge to a fair state where flows/classes receive a weighted share of the resources.
Consider n regular user connections between sending and receiving end-hosts, all passing through two gateways, call them a source gateway and a destination gateway. Our main goal is to provide a soft bandwidth-guaranteed tunnel [8] for these user flows over an Internet path of bottleneck capacity C, which is also shared by another set of x flows representing cross traffic (see Figure 28). By soft guarantee we mean that there is no explicit resource reservation. Consider that user and cross-traffic connections are all rate-adaptive connections (similar to TCP). These x cross-traffic connections present a challenge: as x keeps changing, the bandwidth allocation for the n user flows keeps changing in tandem. So an important question is whether it is possible to counter the change in x so as to ensure that the n user flows are able to maintain a desirable bandwidth.

Clearly, without the intervention of the two gateways, the answer to the above question is no. When different flows share a link, each individual flow (or aggregate of flows) affects the rest, since all are competing for a fixed amount of resources. However, if the gateways dynamically maintain a number
Figure 28: Soft bandwidth-guaranteed tunnel

m of open rate-adaptive (e.g., TCP) connections between them, they can increase m to provide a positive pressure that would equalize the pressure caused by the cross-traffic connections. Since m will be changing over time, we describe the gateway-to-gateway tunnel, made of the m connections, as elastic. Note that the source gateway can decide to reduce m (i.e. relieve pressure) if x goes down; the reason is that as long as the tunnel is achieving its target bandwidth, releasing extra bandwidth should improve the performance of cross-traffic connections, which is in the spirit of best-effort networking.

To illustrate this scenario and the issues involved, consider a gateway-to-gateway tunnel going through a single bottleneck link. Assuming long-lived TCP-like load, the behavior of the bottleneck can be approximated by Generalized Processor Sharing (GPS) [22], i.e. each connection receives the same fair share of resources [3]. Thus, each connection ends up with C/(m + x) bandwidth. This, in turn, gives the m gateway-to-gateway rate-adaptive flows, or collectively the elastic gateway-to-gateway tunnel, a bandwidth of Cm/(m + x). As the source gateway increases m by opening more rate-adaptive connections to the destination gateway, the tunnel can grab more bandwidth. If x increases, and the gateways measure a tunnel bandwidth below a target value (say B), then m is increased to push back the cross-traffic connections. If x decreases, and the gateways measure a tunnel bandwidth above B, then m is decreased for the good of the cross-traffic connections. It is important to note that the source gateway should refrain from unnecessarily increasing m, thus achieving a tunnel bandwidth above B, since an unnecessary increase in the total number of competing rate-adaptive flows reduces the share of each connection and may cause flows to time out, leading to inefficiency and unfairness.

The source gateway also has the responsibility of scheduling the user packets coming in on the n user connections over the tunnel, i.e. over the m gateway-to-gateway connections. In this case study, we do not focus on scheduling, but on the control-theoretic modeling and analysis of the tunnel's bandwidth adaptation. We study the effect of different types of controllers employed at the source gateway. Such a controller determines the degree of elasticity of the gateway-to-gateway rate-adaptive tunnel; thus, it determines the transient and steady-state behavior of the soft bandwidth-guaranteed service.

Naïve Control: This naïve controller measures the bandwidth b grabbed by the current m gateway-rate-adaptive connections. Then, it directly computes the new number m′ of gateway-rate-adaptive connections that should be open as:

m′ = (B/b) m
Clearly, this controller naïvely relies on the previously measured bandwidth b and adapts without regard to delays in measurements and possible changes in network conditions, e.g. changes in the amount of cross traffic. We thus investigate general, well-known controllers which judiciously zoom in toward the target
bandwidth value. To that end, let us develop a flow-level model of the system dynamics. The change in the bandwidth b(t) grabbed by the m(t) gateway-rate-adaptive flows (constituting the elastic gateway-to-gateway tunnel) can be described as:

ḃ(t) = γ [ (C − B) m(t) − B x(t) ]

This dynamic equation captures what we want to model: b(t) increases with m(t), and decreases as the number of cross-connections x(t) increases. γ is a constant that represents the degree of multiplexing of flows, and we choose it here to be the steady-state connection's fair-share ratio of the bottleneck capacity. At steady state, ḃ(t) equals zero, which yields (as expected):

B = C m* / (x* + m*)

where m* and x* represent the steady-state values of the number of gateway-rate-adaptive flows and cross-traffic flows, respectively. Based on the current bandwidth allocation b(t) and the target bandwidth B, an error signal e(t) can be obtained as:

e(t) = B − b(t)
P and PI Control: A controller adjusts m(t) based on the value of e(t). For a simple Proportional (P-type) controller, the adjustment is described by:

m(t) = Kp e(t)    (21)

Recall that P-type controllers are known to result in a non-zero steady-state error. To exactly achieve the target B (i.e. with zero steady-state error), a Proportional-Integral (PI-type) controller can be used:

m(t) = Kp e(t) + Ki ∫ e(t) dt    (22)
Figure 29 shows the block diagram of this elastic-tunnel model. In the Laplace domain, denoting the controller transfer function by Gc(s) and the transform of the target input by B̂(s), the output B(s) is given by:

B(s) = [ Gc(s)G1(s) / (1 + Gc(s)G1(s)) ] B̂(s) + [ G2(s) / (1 + Gc(s)G1(s)) ] X(s)    (23)

where G1(s) = ν/s, with ν = γ(C − B), and G2(s) = −ξ/s, with ξ = γB. For the P-controller, from Equation (21), Gc(s) is simply Kp. For the PI-controller, from Equation (22), Gc(s) equals Kp + Ki/s. Thus, the transfer function B(s)/B̂(s) in the presence of a P-controller is given by:

B(s)/B̂(s) = Kp ν / (s + Kp ν)
The system with the P-controller is always stable, since the root of the characteristic equation (i.e. the denominator of the transfer function), −Kp ν, is negative given ν > 0, i.e. B < C. In the presence of a PI-controller, the transfer function B(s)/B̂(s) is given by:

B(s)/B̂(s) = (Kp ν s + Ki ν) / (s² + Kp ν s + Ki ν)
One can choose the PI-controller parameters Kp and Ki to achieve a certain convergence behavior toward the target bandwidth B. We next define transient performance measures to assess such convergence behavior.
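As a closing illustration, the sketch below (our addition; all parameter values, including the cross-traffic step, are arbitrary) simulates the flow-level model under PI control and a step increase in cross traffic:

# Elastic tunnel under PI control: db/dt = gamma*((C - B)*m(t) - B*x(t))
# (illustrative sketch).
C, B, gamma = 10.0, 6.0, 0.1     # capacity, target bandwidth, multiplexing const
Kp, Ki, dt = 2.0, 1.0, 0.001

b, integ, x_cross = 0.0, 0.0, 4.0
for i in range(100000):
    if i == 50000:
        x_cross = 8.0            # cross traffic doubles at t = 50 s
    e = B - b
    integ += e * dt
    m = Kp * e + Ki * integ      # PI controller sets the number of tunnel flows
    b += dt * gamma * ((C - B) * m - B * x_cross)

print(round(b, 2), round(m, 1))  # -> b ~ 6.0 (target met), m ~ B*x/(C-B) = 12.0

The integral term is what re-absorbs the cross-traffic step: after the disturbance, the integral action grows m until b returns exactly to B.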
Figure: Step response (amplitude vs. time)