INTERNET DRAFT                                        S. Bandyopadhyay
draft-shyam-mshn-ipv6-00.txt                            November, 2009
Intended status: Proposed Standard
Expires: May, 2010

Mesh Structured Hierarchical Networking and IPv6

Status of this memo

This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Abstract

This document proposes an approach for reorganizing the entire network within a large address space. It describes how a three-tier mesh structured hierarchy can be established by dividing the entire space into regions, and each region into sub-regions. It addresses issues that could be relevant to this architecture in the context of IPv6. It also proposes an approach by which an IP switch based network can perform as well as an ATM network when processing real-time traffic.

Bandyopadhyay               Expires May, 2010                 [Page 1]
Internet Draft                MSHN and IPv6             November, 2009

1. Introduction

The transition from IPv4 to IPv6 is in progress. Work has been done to upgrade individual nodes (workstations) from IPv4 to IPv6. There are also established documents describing how routers and switches can support IPv4 and IPv6 packets simultaneously in order to make the transition possible [1].
The CIDR [2] based hierarchical architecture of the existing 32-bit system is expected to continue in IPv6 with a larger address space. There are documented concerns that BGP table entries are becoming too large in the existing system [3]. At the same time, there are proposals to extend the Autonomous System number from 16 bits to 32 bits to meet demand [4]. The challenge lies in making the transition from IPv4 to a real IP world smooth, with the fewest possible changes.

An ATM network performs faster than a network of IP switches. The difference becomes more prominent for real-time applications. ATM networks, however, are at a disadvantage compared to IP-switch based networks as far as bandwidth usage is concerned. This document describes approaches that let an IP-switch based network process real-time applications as fast as an ATM network, and a mesh structured hierarchical network with a flat address space for routing convenience.

2. A Three-tier mesh structured hierarchical network

The existing system works with an Autonomous System (AS) layer and an inter-AS layer using the CIDR approach. To meet demand within the 32-bit address space, Autonomous Systems of various sizes maintain a CIDR based hierarchical architecture. With the help of NAT [5], a stub network can maintain a user ID space as large as a class A network and can meet its need to communicate with the rest of the world with very few real IP addresses. With the combination of CIDR and NAT applied across the entire space, most of the 32-bit address space is effectively used as network ID. This is why the 16-bit 'Autonomous System Number' has come to be seen as insufficient to meet the needs of a growing number of customers. If the same approach is continued with a larger network ID, the load on the switches will become too high.
As Autonomous Systems of various sizes are supported, Autonomous Systems and the nodes inside them can be viewed graphically as lying in the same plane within the address space. If the network can be viewed as lying in different planes, routing issues become simpler. If the network is designed with a fixed prefix length for the Autonomous System everywhere, routing information for the rest is confined to the other part of the network prefix. This means that the maximum AS size is assigned to every AS irrespective of its actual size. This is made possible by taking advantage of a large address space and dividing it into a number of regions of fixed sizes. Thus the entire network can be viewed as a network of inter-AS layer nodes. Each node in the inter-AS layer can act in one of three ways: as a router in the inter-AS layer only; as a router in the inter-AS layer with an Autonomous System attached to it at a single point of attachment; or as an Autonomous System with multiple Autonomous System border routers (ASBRs) appearing like a mesh. Thus a two-tier mesh structured hierarchy is established between the AS layer and the inter-AS layer, with each AS having a fixed prefix length.

By definition, an Autonomous System is a small area within the entire network that maintains its own independent identity and communicates with the rest of the world through specific border routers. Similarly, if a larger area (say, a region or state) is considered as a network of Autonomous Systems that maintains its own identity by communicating with the rest of the world through border routers (say, state border routers), a mesh structured hierarchy can be established within the inter-AS layer. The inter-AS layer is then split into inter-AS-top and inter-AS-bottom.
To maintain this hierarchy, each node of inter-AS-top needs multiple regional or state border routers (SBRs) through which it communicates with the rest of the world, in the same way that an Autonomous System maintains ASBRs. Thus the entire network appears as a network of inter-AS-top layer nodes. To maintain the hierarchy, each node of inter-AS-top needs a fixed prefix length, i.e. each node of the inter-AS-top layer will have a fixed number of inter-AS-bottom layer nodes.

Thus, with a three-tier mesh structured hierarchy in the network layer, the network ID can be viewed as A.B.C. If pA, pB and pC are the prefix lengths of the inter-AS-top, inter-AS-bottom and AS layers respectively, there will be 2^pA nodes at the topmost layer, 2^pB nodes at the inter-AS-bottom layer of each top layer node and 2^pC nodes at the AS layer of each inter-AS-bottom node. Thus the entire space is divided into a fixed number of regions and each region is divided into a fixed number of sub-regions. This division is expected to be based on geography, population density, demand and related factors. Note that for ease of use and computation, pA and pB will most probably be chosen at the nearest octet boundaries. For the sake of openness, they may even be chosen considerably larger than the nearest value needed.

Let nMaxInterASTopNodes be the maximum possible number of nodes assigned at the topmost layer, nMaxInterASBottomNodes that at the inter-AS-bottom layer, and nMaxASNodes that at the AS layer, where nMaxInterASTopNodes <= 2^pA, nMaxInterASBottomNodes <= 2^pB and nMaxASNodes <= 2^pC.

2.1. Route propagation

With the hierarchy established, routing information established inside one node of inter-AS-top does not need to be propagated to another node of inter-AS-top. The entire routing information of the inter-AS-top layer needs to be propagated to the inter-AS-bottom layer.
So, each router of the inter-AS layer will have two tables of information: one for inter-AS-top and another for the inter-AS-bottom of the inter-AS-top node to which it belongs. BGP (with slight modification) will work very well with one trick applied at the SBRs: an SBR will not propagate the inter-AS-bottom routing information of its domain to an SBR of another domain, i.e. the SBR of one top layer node will propagate only inter-AS-top routing information to the SBR of another top layer node. Inside a node of inter-AS-top, routing information of both inter-AS-top and inter-AS-bottom needs to be propagated from one ASBR to its neighboring ASBRs.

Inside a top layer node A, the routing information for another top layer node B will have two parts: one for the list of SBRs through which a packet will traverse from top layer node A to B, and another for the list of ASBRs through which the packet will traverse from one AS to another inside A. In terms of BGP, the AS_PATH attribute will be split into two parts: one for the information of the top layer and another for the bottom layer. Within the same node A, routing information from one AS to another AS will not have any top layer information, i.e. the top layer information will be set to NULL.

Similarly, each node of the AS layer will have three tables of routing entries: one for inter-AS-top, one for inter-AS-bottom and another for the routing information inside the Autonomous System itself.

With the traditional CIDR based hierarchy, a node with a shorter prefix can be divided into a number of nodes with longer prefixes. Each of these can be further subdivided into nodes with still longer prefixes. This process can be continued until no further division is possible. The point worth noting is that at each step the network designer has to anticipate the future expansion of the network, bearing in mind that the resource must not be exhausted at any point in time.
This leads the designer to allocate far more resources than are actually needed, which leaves address space unused, and the concept of the H-D (host-density) ratio comes into play. The problem is aggravated once a resource does get exhausted. For example, a node of prefix /16 can be divided into a number of nodes of prefix /24. If any one of the /24 nodes is exhausted, the resources of the other /24 nodes cannot be used in its place even if they are available.

Introduction of hierarchy at the inter-AS layer reduces the size of the routing table substantially. If hardware resources permit a flat address space to be maintained at each layer, the problems related to CIDR can be avoided. With a flat address space, no hierarchical relationship needs to be established between any two nodes in the same layer. So, all the nodes inside each layer can be used until they are exhausted. With a flat address space (i.e. without prefix reduction), BGP tables will have nMaxInterASTopNodes + nMaxInterASBottomNodes entries.

An IGP such as OSPF has a provision to divide an AS into smaller areas. OSPF hides the topology of an area from the rest of the Autonomous System. This information hiding enables a significant reduction in routing traffic. With support for subnetting, OSPF attaches an IP address mask to indicate the range of IP addresses described by a particular route. This reduces the routing traffic, since not every node inside the range has to be described, but it introduces another level of hierarchy. If subnetting can be avoided at the AS layer (at the cost of additional computation inside the SPF tree), each area can be configured dynamically from a free pool of addresses based on its requirements. An AS can then be divided into a number of areas of heterogeneous sizes, with nodes drawn from a free pool of addresses.
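The A.B.C network ID and the per-layer routing tables described above can be illustrated with a small sketch. It is only one possible reading of the text, not part of the proposal: the names NetworkId, pack, unpack, table_to_consult and flat_bgp_entries are hypothetical, and the prefix lengths used in the usage note (8, 8 and 16 bits) are arbitrary illustrative values.

```python
from typing import NamedTuple


class NetworkId(NamedTuple):
    """Hypothetical A.B.C network ID with fixed-length fields."""
    a: int  # inter-AS-top node (region or state)
    b: int  # inter-AS-bottom node (Autonomous System) inside region a
    c: int  # AS-layer node inside Autonomous System b


def pack(nid: NetworkId, p_a: int, p_b: int, p_c: int) -> int:
    """Concatenate the three fields into one integer of p_a+p_b+p_c bits."""
    for value, bits in zip(nid, (p_a, p_b, p_c)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field does not fit its prefix length")
    return (nid.a << (p_b + p_c)) | (nid.b << p_c) | nid.c


def unpack(raw: int, p_a: int, p_b: int, p_c: int) -> NetworkId:
    """Split a packed network ID back into its A, B and C fields."""
    c = raw & ((1 << p_c) - 1)
    b = (raw >> p_c) & ((1 << p_b) - 1)
    a = (raw >> (p_b + p_c)) & ((1 << p_a) - 1)
    return NetworkId(a, b, c)


def table_to_consult(local: NetworkId, dest: NetworkId) -> str:
    """Pick which of the three tables an AS-layer node would use.

    Assumption: a different A field means the inter-AS-top table,
    the same A but a different B means the inter-AS-bottom table,
    and otherwise the table of the Autonomous System itself.
    """
    if dest.a != local.a:
        return "inter-AS-top"
    if dest.b != local.b:
        return "inter-AS-bottom"
    return "intra-AS"


def flat_bgp_entries(n_max_top: int, n_max_bottom: int) -> int:
    """BGP table entries with a flat address space at each layer."""
    return n_max_top + n_max_bottom
```

For example, with the assumed prefix lengths, unpack(pack(NetworkId(1, 2, 3), 8, 8, 16), 8, 8, 16) returns the original fields, and a destination in another region is resolved from the inter-AS-top table alone, which is why a router never needs the inter-AS-bottom entries of a foreign region.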
With this architecture, each network (i.e. a node inside an AS) acts as a leaf node, i.e. a network will not act as a transit. To make proper use of the user-ID space and to support customer networks of heterogeneous sizes, the user-ID space needs to be divided into subnet-ID and user-ID. In effect, a VLSM (variable length subnet mask, or CIDR, whichever term is more applicable) type of approach has to be adopted at each node of an AS. So, each node of the AS layer will act as the root of a tree whose leaves are independent small customer networks that act as stubs. As the routing information of the inter-AS layer and the AS layer need not be passed to any node inside the VLSM tree, each router inside the tree should maintain a default route for any address outside its network. With this approach, the load on each router of the service providers will become negligible. Protocols that support VLSM with MPLS/VPN have to be implemented inside the tree (inside the VLSM tree, all the physical ports of a switch have to be configured with the subnet mask, so MPLS on top of a static routing table should be sufficient).

To reduce complexity, packets are sent from the SBR of one inter-AS-top node to the SBR of its neighboring inter-AS-top node just by looking at the routing table entries of the inter-AS-top layer, i.e. there is no label switching within the inter-AS-top layer. This also reduces the maximum label stack depth by one (the maximum label stack depth may become a criterion for processing real-time packets, as discussed in the next section). Label switching is done at the inter-AS-bottom layer, at the AS layer and inside the VLSM tree of the user-ID space. When a packet is transported from one inter-AS-bottom node to another through a node in the middle, the packet needs to be label stacked at the middle node if that node contains multiple ASBRs.
The fundamental assumptions on which this architecture rests can be summarized as follows:

i) The entire network can be viewed as a network of regions or states, where each region or state can have its own identity by communicating with the rest of the world through state border routers. Each region or state is a network of Autonomous Systems. Each region, and each Autonomous System inside it, will have a fixed (maximum) prefix length.

ii) Hardware resources are available such that a flat address space can be maintained at the inter-AS layer.

Introduction of a mesh-structured hierarchy at the inter-AS layer will have several advantages:

o Load at each router will be reduced substantially.

o The CIDR style approach and the complexity related to prefix reduction can be avoided.

o Protection against failure will become more stable.

o A full mesh hierarchy will distribute traffic evenly.

o Physical cable connections can be optimized.

o Administrative issues will become easier.

3. Processing of real time packets (QoS issue)

This section attempts to arrive at a solution that lets Gigabit Ethernet switches (in full duplex mode) operate in the most user-friendly manner when transporting data traffic (IP) as well as real-time (RT) traffic (as RTP [6] packets) in the existing 32-bit system.

In IP routing/switching, the entire packet is collected at the intermediate router/switch and forwarded based on the forwarding table. Inside the switch/router, the variable length IP packet is fragmented into smaller frames at the ingress side. The frames are transported through the switching fabric with a suitable priority mechanism (to support QoS), then reassembled at the egress side and passed to the media for the next hop. In ATM, packets are fragmented into small cells at the ingress edge device. The entire packet is transported as a stream of cells and is collected at the egress edge device.
The speed advantage of ATM over IP routing is due to the fact that latency is reduced because the entire packet is not collected, fragmented and reassembled at the intermediate nodes.

So, in the case of Gigabit Ethernet, if RT packets can be passed through the switch without being fragmented, better performance can be expected, i.e. one RT packet needs to fit inside one internal frame of the switch fabric. Additionally, for this approach to succeed, the maximum size of the MPLS label stack has to be defined. Inside the switch, all IP packets will be assumed to carry the same number of MPLS labels, whether they actually carry one label or the maximum. In fact, to reduce overhead, this limit should be the minimum number of labels needed to support all the features of MPLS, i.e. label stacking of depth n (without limit) needs modification.

If the minimum frame size is selected to fit one RTP packet, overhead becomes too high because of the very large packet header (40 bytes: 20 bytes IP + 8 bytes UDP + 12 bytes RTP). On the other hand, if a large frame size is used, the fragmentation loss becomes too high for small packets (say, 40 byte IP packets). So, a compromise is needed that gives a better result based on the IP packet size distribution. The frame size is selected to minimize the sum of the fragmentation loss of data packets and the header overhead of RT packets.

Studies show that IP data packets of three sizes are the most common:

o ~50% of packets are 40 bytes (TCP ACK),

o ~20% of packets are 576 bytes (path MTU set by X.25), and

o ~30% of packets are 1500 bytes (path MTU set by Ethernet).

Packets of other sizes are less frequent than these three categories and are almost evenly distributed. For simplicity of calculation, only traffic in these three categories is considered.
The payload of data traffic is the actual IP packet size, whereas the payload of RT traffic is the payload inside the RTP packet.

If totBytes bytes are to be transported across the internet and dataPcnt is the percentage of data traffic, then

   data traffic = totBytes*dataPcnt/100
   RT traffic   = totBytes*(100-dataPcnt)/100

Of the data packets, 50% are 40 bytes long, 20% are 576 bytes long and 30% are 1500 bytes long. If totDataPkts is the total number of data packets,

   totDataPkts*(50*40/100 + 20*576/100 + 30*1500/100) = totBytes*dataPcnt/100
   or, totDataPkts*58520 = totBytes*dataPcnt

Let totBytes = 58520*100 for ease of calculation, i.e. totDataPkts = dataPcnt*100. Then

   40 byte packets    = 50*totDataPkts/100, i.e. 50*dataPcnt
   576 byte packets   = 20*totDataPkts/100, i.e. 20*dataPcnt
   1500 byte packets  = 30*totDataPkts/100, i.e. 30*dataPcnt
   RT traffic (bytes) = totBytes*(100-dataPcnt)/100 = 58520*(100-dataPcnt)

If n is the depth of the MPLS label stack, then inside the switch the actual sizes are

   40 byte packet   = 40+4*n bytes
   576 byte packet  = 576+4*n bytes
   1500 byte packet = 1500+4*n bytes

Let frameSize be the payload of a frame (excluding the frame header) inside the switch. If an RT packet fits exactly inside frameSize,

   RT packet payload = (frameSize-40-4*n) bytes

   Total overhead = packet header overhead (of RT packets) + fragmentation overhead (of data packets)

If the total overhead is plotted for frameSize = 40+4*n+1 to 1500+4*n for different values of dataPcnt (dataPcnt = 80 to 100), minima of the overhead are found at frameSize = 84, 101, 118, 126 and 152 for n = 3; at frameSize = 119, 127 and 152 for n = 4; and at frameSize = 118, 127 and 152 for n = 5. Actual IP traffic data has to be collected to get the best result. As dataPcnt increases, the minimum is found at a lower frameSize, while the higher frameSize values give better results for lower dataPcnt.
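The overhead calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, since the text does not spell out the exact formulas: fragmentation overhead is taken to be the unused padding in the last frame of each data packet, and header overhead is taken to be 40+4*n bytes per RT packet. The function names are hypothetical, and the sketch is not claimed to reproduce the minima quoted above.

```python
# (packet size in bytes, percentage of data packets), as given in the text
DATA_MIX = ((40, 50), (576, 20), (1500, 30))
RT_HEADER_BYTES = 40  # 20 bytes IP + 8 bytes UDP + 12 bytes RTP


def total_overhead(frame_size: int, n: int, data_pcnt: int) -> float:
    """Overhead in bytes for totBytes = 58520*100, as in the text.

    Assumptions (not stated in the text):
      - a data packet of size s occupies ceil((s + 4n) / frame_size)
        frames, and the padding of the last frame is the loss;
      - every RT packet fills exactly one frame and its whole
        (40 + 4n)-byte header counts as overhead.
    """
    label_bytes = 4 * n
    rt_payload = frame_size - RT_HEADER_BYTES - label_bytes
    if rt_payload <= 0:
        raise ValueError("frame too small to carry any RT payload")

    # Fragmentation overhead of data packets (totDataPkts = dataPcnt*100).
    frag = 0
    for size, pct in DATA_MIX:
        count = pct * data_pcnt  # pct * totDataPkts / 100
        in_switch = size + label_bytes
        frames = -(-in_switch // frame_size)  # ceiling division
        frag += count * (frames * frame_size - in_switch)

    # Header overhead of RT packets.
    rt_bytes = 58520 * (100 - data_pcnt)
    rt_packets = rt_bytes / rt_payload
    header = rt_packets * (RT_HEADER_BYTES + label_bytes)

    return frag + header


def best_frame_size(n: int, data_pcnt: int) -> int:
    """Scan frameSize = 40+4n+1 .. 1500+4n for the lowest overhead."""
    lo = RT_HEADER_BYTES + 4 * n + 1
    hi = 1500 + 4 * n
    return min(range(lo, hi + 1),
               key=lambda f: total_overhead(f, n, data_pcnt))
```

Calling best_frame_size(n, data_pcnt) for n = 3, 4, 5 and dataPcnt between 80 and 100 gives the global minimum of this particular model; the local minima discussed in the text would have to be read from a plot of total_overhead over the same range, and measured traffic would replace DATA_MIX in practice.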
With an average IP packet size of 585 bytes, switches will incur a loss of 4*(n-1) bytes for each packet that needs only one label. For this scheme to work, a standard for the maximum label stack size has to be defined. The RTP packet size also has to be standardized.

3.1. Dual mode operation

Inside the ingress card as well as the egress card, packets need to follow certain functional steps. To maximize throughput, a series of processing units work in pipeline mode for these operations. Ingress service cards need to act in dual mode to process RT packets and non-RT packets, i.e. RT packets should follow a direct path that does not need fragmentation and related complexities before they are sent to the VOQs (virtual output queues, from which packets are picked up to be sent to the switching fabric), whereas other packets need to follow a different path for fragmentation operations. This prevents an RT packet from being blocked by the fragmentation of non-RT packets that arrived at the service card before it. Merely matching the RT packet size to the frameSize of the switch fabric is therefore not enough, on its own, to achieve the speed of ATM switches.

4. Refinements over existing IPv6 specification

As IPv6 was envisioned long before some of the newer technologies, e.g. MPLS, came into the picture, some refinements can be made to the existing specification. These considerations relate to bandwidth usage and performance inside switches.

The previous section shows that a smaller packet size gives a better result for the processing of RT packets. So, it is desirable to have the IP packet header as small as possible. With this architecture, an IP address can be described as A.B.C.D, where the D part represents the user ID. The value nMaxInterASTopNodes can be determined easily and is expected to change least frequently.
The values nMaxInterASBottomNodes and nMaxASNodes and the length of the user ID will vary from place to place and country to country, and are purely geopolitical on top of the limitations imposed by hardware. Once these parameters are determined by mutual agreement, the values of pA, pB, pC and the prefix length of the user ID can be determined. If the total length comes out to be less than 128, the length of the IP header will be reduced accordingly.

The 'flow label' field of the IPv6 packet header may not be of any use when MPLS is in use. ATM used to have 4 priority classes. The first IPv6 specification, RFC 1883, used a 4-bit priority (type of service) field along with a 24-bit flow label field. These were changed to an 8-bit traffic class (type of service) field and a 20-bit flow label field in the current specification, RFC 2460. Too many priority classes may increase the complexity of processing inside switches. If the type of service field of the IPv6 header is reduced to 4 bits, as it was in RFC 1883, and the 'flow label' field is removed, another three bytes can be saved from the IPv6 header.

The 'Hop Limit' field has an 8-bit value in the existing specification. The role of this field needs to be discussed properly. For a long route-ID (say 48 bits), whether 8 bits would be sufficient, or how this field would be processed, needs to be discussed.

4.1. Distributed processing and Multicasting

With the inherent hierarchy involved in this architecture, distributed applications can also be structured in a suitable manner. Say, for a commonly used web based application, there will be a master level server at every top level node. Any change in the application has to be synchronized among these master level servers first. There might be servers at the middle layer (inside each inter-AS-bottom) inside each top level node.
Once the changes are reflected at the master node, all the servers at the middle layer need to update themselves from their master level node. This will reduce network traffic substantially. Multicasting can be looked at in a similar manner. Work on these issues can progress only after this architecture is approved.

5. Expected changes at the application layer

IP packets of size 576 mostly come from TCP layers that do not determine the maximum path MTU and instead use the default that was set for X.25. The 576 factor can be corrected very easily by setting the path MTU to 1500.

Given that label switched paths do not change very frequently between two arbitrary network points for any particular type of packet, most applications are expected to become UDP based with negative ACKs. TCP in turn might go through changes. Once this takes effect, the number of 40 byte packets will come down drastically.

The switch fabric frame size needs to be determined with these two factors in mind, along with changes in the IP packet header. With the existing 32-bit system, frame sizes (excluding the frame header) of 152 and 127 are in general the most viable solutions for label stack depths of 3, 4 and 5.

6. IANA Considerations

This is a first level draft for a proposed standard. Hence, IANA actions should come into play at a later stage, if needed.

7. Security Considerations

This document does not address any security related issues.

8. Normative References

[1] Nordmark, E. and R. Gilligan, "Basic Transition Mechanisms for IPv6 Hosts and Routers", RFC 4213, October 2005.

[2] Fuller, V. and T. Li, "Classless Inter-Domain Routing (CIDR): The Internet Address Assignment and Aggregation Plan", RFC 4632, August 2006.

[3] Huston, G., "Commentary on Inter-Domain Routing in the Internet", RFC 3221, December 2001.

[4] Vohra, Q. and E. Chen, "BGP Support for Four-octet AS Number Space", RFC 4893, May 2007.

[5] Srisuresh, P. and K. Egevang, "Traditional IP Network Address Translator (Traditional NAT)", RFC 3022, January 2001.

[6] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", RFC 3550, July 2003.

9. Informative References

[7] Postel, J., "Internet Protocol", STD 5, RFC 791, September 1981.

[8] Rekhter, Y. and T. Li, "A Border Gateway Protocol 4 (BGP-4)", RFC 1771, March 1995.

[9] Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6) Specification", RFC 1883, December 1995.

[10] Moy, J., "OSPF Version 2", STD 54, RFC 2328, April 1998.

[11] Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6) Specification", RFC 2460, December 1998.

[12] Rosen, E. and Y. Rekhter, "BGP/MPLS VPNs", RFC 2547, March 1999.

[13] Rosen, E., Viswanathan, A. and R. Callon, "Multiprotocol Label Switching Architecture", RFC 3031, January 2001.

10. Author's Address

Shyam Bandyopadhyay
HL No 205/157/7, Inda
Kharagpur 721305
India
Phone: +91 3222 225137
e-mail: shyamb66@gmail.com

Copyright and License Notice

Copyright (c) 2009 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents in effect on the date of publication of this document (http://trustee.ietf.org/license-info). Please review these documents carefully, as they describe your rights and restrictions with respect to this document.