NETWORK SURVIVABILITY

Note that this protection scheme easily handles failures of links, transmitters/receivers, or nodes. It is simple to implement and requires no signaling protocol or communication between the nodes. The capacity required for protection purposes is equal to the working capacity. This will turn out to be the case for the other ring architectures as well.

The main drawback with the UPSR is that it does not spatially reuse the fiber capacity. This is because each (bidirectional) connection uses up capacity on every link in the ring and has dedicated protection bandwidth associated with it. Thus, there is no sharing of the protection bandwidth between connections. For example, suppose each connection requires 51 Mb/s (STS-1) of bandwidth and the ring operates at 622 Mb/s (OC-12). Then the ring could support a total of twelve 51 Mb/s connections. The BLSR architectures that we will study next do incorporate spatial reuse and can support aggregate traffic capacities higher than the transmission rate.

UPSRs are popular topologies in lower-speed local exchange and access networks, particularly where the traffic is primarily hubbed from the access nodes into a hub node in the carrier's central office. In this case, we will see that the traffic-carrying capacity that a UPSR can support is the same as what the more complicated ring architectures incorporating spatial reuse can support. This makes the UPSR an attractive option for such applications due to its simplicity and, thus, lower cost.

Typical ring speeds today are OC-3 (STM-1) and OC-12 (STM-4). There is no specified limit on the number of nodes in a UPSR or on the ring length. In practice, the ring length will be limited by the fact that the clockwise and counterclockwise paths taken by a signal will have different delays associated with them, which, in turn, will affect the restoration time in the event of a failure.
A UPSR is essentially 1 + 1 protection implemented at the path layer in a ring.

10.2 Protection in SONET/SDH

10.2.4 Bidirectional Line-Switched Rings

BLSRs are much more sophisticated than UPSRs and incorporate additional protection mechanisms, as we will see below. Unlike a UPSR, they operate at the line or multiplex section layer. The BLSR equivalent in the SDH world is called a multiplex section shared protection ring (MS-SPRing).

Figure 10.5 shows a four-fiber BLSR. Two fibers are used as working fibers, and two are used for protection. Unlike a UPSR, working traffic in a BLSR can be carried in both directions along the ring. For example, on the working fiber, traffic from node A to node B is carried clockwise along the ring, whereas traffic from B to A is carried counterclockwise along the ring. Usually, traffic belonging to both directions of a connection is routed on the shortest path between the two nodes in the ring. However, in certain cases [Kha97, LC97], traffic may be routed along the longer path to reduce network congestion and make better use of the available capacity.

A BLSR can support up to 16 nodes, and this number is limited by the 4-bit addressing field used for the node identifier. The maximum ring length is limited to 1200 km (6 ms propagation delay) because of the requirements on the restoration time in the case of a failure. For longer rings, particularly for undersea applications, the 60 ms restoration time requirement has been relaxed.

A BLSR/4 employs two types of protection mechanisms: span switching and ring switching. In span switching, if a transmitter or receiver on a working fiber fails, the traffic is routed onto the protection fiber between the two nodes on the same link, as shown in Figure 10.6. (Span switching can also be used to restore traffic in the event of a working fiber cut, provided the protection fibers on that span are routed separately from the working fibers. However, this is usually not the case.)
In case of a fiber or cable cut, service is restored by ring switching, as illustrated in Figure 10.7. Suppose link AB fails. The traffic on the failed link is then rerouted by nodes A and B around the ring on the protection fibers. Ring switching is also used to protect against a node failure.

A BLSR/2, shown in Figure 10.8, can be thought of as a BLSR/4 with the protection fibers "embedded" within the working fibers. In a BLSR/2, both of the fibers are used to carry working traffic, but half the capacity on each fiber is reserved for protection purposes. Unlike a BLSR/4, span switching is not possible here, but ring switching works in much the same way as in a BLSR/4. In the event of a link failure, the traffic on the failed link is rerouted along the other part of the ring using the protection capacity available in the two fibers. As with 1:1 protection on point-to-point links, an advantage of BLSRs is that the protection bandwidth can be used to carry low-priority traffic during normal operation. This traffic is preempted if the bandwidth is needed for service restoration.

BLSRs provide spatial reuse capabilities by allowing protection bandwidth to be shared between spatially separated connections. The spatial reuse achievable in a best-case scenario is illustrated in Figure 10.9. As in the UPSR example above, consider a BLSR/2 operating at 622 Mb/s (OC-12), supporting 51 Mb/s STS-1 connections. The figure shows a ring with four nodes and STS-1 connections between each pair of adjacent nodes. Note that all four of these connections can be protected by dedicating 51 Mb/s of bandwidth around the ring that is shared by all these connections. This is because these connections do not overlap spatially and thus do not need to be restored simultaneously, as long as we are dealing with only single-failure conditions.
In this example, the 622 Mb/s ring could thus support a total of 24 such 51 Mb/s connections (6 connections per link, because only half the capacity is available for working traffic, over four links), as compared to just 12 for an equivalent UPSR. This capacity increases as the number of nodes in the ring increases. An 8-node OC-12 BLSR/2 could support 48 STS-1 connections in the example above.

Figure 10.6 Illustrating span switching in a BLSR/4. Traffic is switched from the working fiber pair to the protection fiber pair on the same span.

Thus BLSRs are more efficient than UPSRs in protecting distributed traffic patterns. Their efficiency comes from the fact that the protection capacity in the ring is shared among all the connections, as we saw above. For this reason, BLSRs are widely deployed in long-haul and interoffice networks, where the traffic pattern is more distributed than in access networks. Today, these rings operate at OC-12 (STM-4), OC-48 (STM-16), and OC-192 (STM-64) speeds. Most metro carriers have deployed BLSR/2s, while many long-haul carriers have deployed BLSR/4s.

BLSR/4s can handle more failures than BLSR/2s. For example, a BLSR/4 can simultaneously handle one transmitter failure on each span in the ring. It is also easier to service than a BLSR/2 ring because multiple spans can be serviced independently without taking down the ring. However, ring management in a BLSR/4 is more complicated than in a BLSR/2 because multiple protection mechanisms have to be coordinated.

Figure 10.7 Illustrating ring switching in a BLSR/4. Traffic is rerouted around the ring by the nodes adjacent to the failure.

BLSRs are significantly more complex to implement than UPSRs. They require extensive signaling between the nodes for many reasons, as we will see below. This signaling is done using the K1/K2 bytes in the SONET overhead (see Chapter 6).
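The capacity comparison between a UPSR and a BLSR/2 quoted earlier in this section can be checked with a short calculation. The sketch below is only an illustration of the best-case pattern of Figure 10.9, in which every connection runs between adjacent nodes; the function names are hypothetical and chosen for this example.

```python
# Best-case STS-1 connection counts for a UPSR and a BLSR/2.
# Assumption: all connections are between adjacent nodes (the Figure 10.9 pattern).

OC12_STS1 = 12  # an OC-12 line carries 12 STS-1 signals


def upsr_connections(sts1_per_fiber: int) -> int:
    """Each connection uses capacity on every link, so the count is bounded
    by the line rate regardless of the number of nodes."""
    return sts1_per_fiber


def blsr2_adjacent_connections(sts1_per_fiber: int, num_nodes: int) -> int:
    """Half of each fiber is reserved for protection; the working half is
    reused independently on every link of the ring."""
    working_per_link = sts1_per_fiber // 2
    num_links = num_nodes  # a ring with N nodes has N links
    return working_per_link * num_links


if __name__ == "__main__":
    print(upsr_connections(OC12_STS1))               # 12
    print(blsr2_adjacent_connections(OC12_STS1, 4))  # 24
    print(blsr2_adjacent_connections(OC12_STS1, 8))  # 48
```

For traffic that is hubbed into a single node rather than distributed between neighbors, this advantage disappears, which is consistent with the earlier remark that a UPSR supports the same capacity as the more complicated rings in that case.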
Figure 10.8 A two-fiber bidirectional line-switched ring (BLSR/2). The ring has two fibers, and half the bandwidth on each fiber is reserved for protection. Ring switching is used to restore service after a failure.

Handling Node Failures in BLSRs

So far, we have dealt primarily with how to handle failures of links, such as those occurring from a fiber cut. Failures of nodes are usually less likely because, in many cases, redundant configurations (such as dual power supplies and switch fabrics) are used. However, nodes may still fail because of catastrophic events or human errors. Handling node failures complicates the BLSR restoration mechanism. The failure of a node is seen by all its adjacent nodes as failures of the links that connect them to the failed node. If each of these adjacent nodes performs restoration assuming that it is a single link failure, there can be undesirable consequences. One example is shown in Figure 10.10. Here, when node 1 fails, nodes 6 and 2 assume it is a link failure and attempt to reroute the traffic around the ring (ring switching) to restore service. This causes erroneous connections, as shown in the figure. The only way to prevent such occurrences is to ensure that the nodes performing the restoration determine the type of failure before invoking their restoration mechanisms. This would require exchanging messages between the nodes in the network. In the preceding example, nodes 6 and 2 could first try to exchange messages around the ring to determine if they have both recorded link failures and, if so, invoke the appropriate restoration procedure. This restoration procedure can avoid these misconnections by not attempting to restore any traffic that originates or terminates at the failed node. This is called squelching. Thus each node in a BLSR maintains squelch tables that indicate which connections need to be squelched in the event of node failures.
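The squelching rule described above can be sketched in a few lines. This is a toy model under stated assumptions, not the actual BLSR protocol: real rings coordinate through K1/K2 signaling and provisioned squelch tables, whereas here the failure reports, the `failed_node` classifier, and the `split_for_restoration` helper are all hypothetical names introduced for illustration.

```python
from typing import List, Optional, Tuple

Connection = Tuple[int, int]  # (source node, destination node)
Report = Tuple[int, int]      # (reporting node, neighbor it can no longer reach)


def failed_node(reports: List[Report]) -> Optional[int]:
    """Return the failed node if two different nodes both lost the same
    neighbor; return None otherwise (for example, a single link failure,
    where the two reporters point at each other)."""
    if len(reports) == 2:
        (a, lost_by_a), (b, lost_by_b) = reports
        if a != b and lost_by_a == lost_by_b:
            return lost_by_a
    return None


def split_for_restoration(
    connections: List[Connection], node: int
) -> Tuple[List[Connection], List[Connection]]:
    """Separate connections that must be squelched (they originate or
    terminate at the failed node) from those still eligible for restoration."""
    squelch = [c for c in connections if node in c]
    eligible = [c for c in connections if node not in c]
    return squelch, eligible


if __name__ == "__main__":
    # Scenario of Figure 10.10 plus one hypothetical extra connection (3, 4).
    connections = [(5, 1), (1, 4), (3, 4)]
    node = failed_node([(6, 1), (2, 1)])  # nodes 6 and 2 both lost node 1
    print(node)                                      # 1
    print(split_for_restoration(connections, node))  # ([(5, 1), (1, 4)], [(3, 4)])
    print(failed_node([(1, 2), (2, 1)]))             # None: link 1-2 failed
```

Without the classification step, nodes 6 and 2 would ring-switch the connections (5, 1) and (1, 4) as well, which is what produces the erroneous connection between node 5 and node 4.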
The price paid for this coordination is a slower restoration time, because the nodes must exchange messages to determine the appropriate restoration mechanism before invoking it.

Low-Priority Traffic in BLSRs

Just as we saw with 1:1 protection earlier, BLSRs can use the protection bandwidth to carry low-priority or extra traffic under normal operation. This extra traffic is lost in the event of a failure. However, this feature requires additional signaling between the nodes in the event of a failure to indicate to the other nodes that they should operate in protection mode and throw away the low-priority traffic.

Figure 10.9 Spatial reuse in a BLSR. Multiple working connections can share protection bandwidth around the ring as long as they do not overlap on any link.

10.2.5 Ring Interconnection and Dual Homing

A single ring is only a part of the overall network. The entire network typically consists of multiple rings interconnected with each other, and a connection may have to be routed through multiple rings to get to its destination. The interconnection of these rings is thus an important aspect to be considered.

The simplest way for rings to interoperate is to connect the drop sides of two ADMs on different rings back to back, as shown in Figure 10.11. The interconnection is done using signals typically at lower bit rates than the line bit rate. For instance, two OC-12 UPSRs may be interconnected by DS3 signals. In many cases, a digital crossconnect is interspersed between the two rings to provide additional grooming and multiplexing capabilities.

Figure 10.10 Erroneous connections due to the failure of a node being treated by its adjacent nodes as link failures: (a) Normal operation, with a connection from node 5 to node 1 and another connection from node 1 to node 4. (b) After node 1 fails, nodes 6 and 2 invoke ring switching independently.
This causes a connection to be set up erroneously between node 5 and node 4. This problem can be prevented by first identifying the failed node and then not restoring any connections that originate or terminate at the failed node.

The problem with the back-to-back approach above is that if one of the ADMs fails, or there is a problem with the cabling between the two ADMs, the interconnection is broken. A way to deal with this problem is to use dual homing. Dual homing makes use of two hub nodes to perform the interconnection, as shown in Figure 10.12. For traffic going between the rings, connections are set up between the originating node on one ring and both the hub nodes. Thus if one of the hub nodes fails, the other node can take over, and the end user does not see any disruption to traffic. Similarly, if there is a cable cut between the two hub nodes, alternate protection paths are now available to restore the traffic.

Figure 10.11 Back-to-back interconnection of SONET/SDH rings. This simple interconnection is vulnerable to the failure of one of the two nodes that form the interconnect, or of the link between these two nodes.

Figure 10.12 Dual homing to handle hub node failures. Each end node is connected to two hub nodes so as to be able to recover from the failure of a hub node or the failure of any interconnection between the hub nodes. The ADMs in the nodes have a "drop-and-continue" feature, which allows them to drop a traffic stream as well as have it continue onto the next ADM.

Rather than set up two separate connections between the originating node and the two hub nodes, the architecture uses a multicasting or drop-and-continue feature present in the ADMs. Consider the connection shown between an end node and the two hub nodes (hub 1 and hub 2) in Figure 10.12. In the clockwise direction of the ring, the ADM at hub 1 drops the traffic associated with the connection but
also simultaneously allows this traffic to continue along the ring, where it is again dropped at hub 2. Likewise, along the counterclockwise direction, the ADM at hub 2 uses its drop-and-continue feature to drop traffic from this connection as well as pass it through to hub 1. Note that additional bandwidth is used up between the two hub nodes on each ring to support this capability.

Dual homing is being deployed in business access networks to interconnect access UPSRs with interoffice BLSRs as well as to interconnect multiple BLSRs. It can also be applied to interconnections between two subnetworks, not necessarily two rings (although rings are the major application). In general, for dual homing to work, the dual node interconnect itself must be a protected subnetwork, so that alternate paths are available if any of the hub nodes or the links interconnecting them fails.

10.3 Protection in IP Networks

The IP layer has historically provided best-effort services. As we studied in Section 6.3, IP, by its very nature, uses dynamic, hop-by-hop routing of packets. Each router maintains a routing table of the next-hop neighbor for each destination, and incoming packets are routed based on this table. If there is a failure in the network, the intradomain routing protocol (OSPF or IS-IS) operates in a distributed manner and updates these routing tables at each router within the domain. In practice, it can take seconds after the failure is detected before the routing tables at all the routers converge and have consistent routing information. During this process, packets continue to be routed based on the current versions of the routing tables at the routers, which can be inconsistent and incorrect. This causes packets to be routed incorrectly and possibly loop within the network. Potentially, packets could therefore be lost or undergo long delays on the order of seconds after a failure is detected.
Even if a router decides to route a packet along an alternate route following the detection of a failure, packets could still loop within the network, as shown in Figure 10.13. In this example, consider packets destined for router D. Suppose link CD fails. Router C would then attempt to route packets destined for D to router B, hoping to find an alternate path to reach router D. Router B, however, still thinks that the best way to get to router D is through router C and would route that packet back to router C. This is the case until the routing tables at the routers have all converged.

Figure 10.13 An example to illustrate routing loops in an IP network after a failure. It takes many iterations before the routing tables at the nodes converge to the correct routes. In the meantime, there can be routing loops.

The slow recovery from failures is due to the fundamental nature of IP routing: the fact that it is distributed, next-hop-based dynamic routing. Providing faster restoration times requires some way to nail down paths and have packets follow a known path through the network. This capability is provided by multi-protocol label switching (MPLS). As we studied in Section 6.3, MPLS allows label-switched paths (LSPs) to be set up between nodes. All packets belonging to an LSP are routed along the same path. This allows several protection schemes to be implemented within the MPLS layer (which can be viewed as a link layer under the IP network layer). For example, upon detecting a link failure, we could set up alternate LSPs for all the LSPs currently using that link, and reroute packets on the newly set up LSPs. This could be done locally to route around a failed link, or it could be done at the ends of the LSPs. A variety of protection schemes, such as 1 + 1, ring, or shared mesh, could be implemented using this approach and are being developed currently.
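The routing loop between routers B and C described earlier in this section can be reproduced with a small next-hop simulation. The sketch below is illustrative only: the table contents, the alternate route through a router A, and the hop limit are assumptions made for this example, since the exact topology of Figure 10.13 is not given in the text.

```python
from typing import Dict, List

# tables[router][destination] gives the neighbor that router forwards to.
RoutingTables = Dict[str, Dict[str, str]]


def forward(tables: RoutingTables, src: str, dst: str, max_hops: int = 8) -> List[str]:
    """Follow next-hop entries from src toward dst and return the routers
    visited. Forwarding stops after max_hops hops, much as a hop limit in
    the packet header would eventually discard a looping packet."""
    path = [src]
    while path[-1] != dst and len(path) <= max_hops:
        path.append(tables[path[-1]][dst])
    return path


def has_loop(path: List[str]) -> bool:
    """A path that visits any router twice contains a routing loop."""
    return len(set(path)) < len(path)


# After link C-D fails: C already forwards to B, but B's entry is stale.
stale = {"B": {"D": "C"}, "C": {"D": "B"}}

# Hypothetical converged state, assuming an alternate route B-A-D exists.
converged = {"A": {"D": "D"}, "B": {"D": "A"}, "C": {"D": "B"}}

if __name__ == "__main__":
    looping = forward(stale, "C", "D")
    print(looping[:4], has_loop(looping))   # ['C', 'B', 'C', 'B'] True
    delivered = forward(converged, "C", "D")
    print(delivered, has_loop(delivered))   # ['C', 'B', 'A', 'D'] False
```

An LSP avoids this situation because the whole path is fixed in advance, so no router makes an independent next-hop decision from a possibly stale table.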
The other aspect of protection in the IP layer has to do with the time taken by the IP layer to detect failures in the first place. In a typical implementation used in intradomain routing protocols [AJY00], adjacent routers exchange periodic "hello" packets between themselves. If a router misses a certain number of these packets, it declares the link to have failed and initiates rerouting. By default, the routers send hello packets every 10 seconds and declare the link down if they miss three successive hello packets. Thus it could take up to 30 seconds to detect a failure. The process can be sped up by exchanging hello packets more frequently; however, the minimum interval is currently specified to be 1 second. More typically, core routers detect failures in about 10 seconds.

Alternatively, a separate set of packets can be exchanged periodically for this purpose [HYCG00]. However, these packets can get queued up in buffers if there are a lot of other packets waiting and so may have to be processed at higher priority levels than regular packets.

Another option is to rely on the underlying SONET or optical layer to detect the failure and inform the IP layer. This can be done by having the line card inside a router look at the framing and communicate failure detection information up into the routing protocol. However, this is not usually architected into today's routers.
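The hello-based detection times quoted above follow from a simple product of the hello interval and the number of missed packets. The sketch below makes that arithmetic explicit; the function name is hypothetical, and the 1-second case assumes that the three-missed-hellos rule is kept unchanged when the interval is shortened.

```python
def worst_case_detection_time(hello_interval_s: float, missed_hellos: int) -> float:
    """Upper bound on the time to declare a link down: the failure may occur
    just after a hello arrives, so the router waits for `missed_hellos`
    full intervals before giving up."""
    return hello_interval_s * missed_hellos


if __name__ == "__main__":
    # Defaults from the text: hello every 10 s, link down after 3 misses.
    print(worst_case_detection_time(10, 3))  # 30
    # Minimum interval from the text (1 s), assuming the same 3-miss rule.
    print(worst_case_detection_time(1, 3))   # 3
```

Even at the minimum interval, this bound is far above the 60 ms restoration target mentioned for SONET rings, which is why relying on the SONET or optical layer for detection is attractive.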