530 CONTROL AND MANAGEMENT

emission limits for the safety class. The values chosen for r and r' depend on the link propagation delay (see Problem 9.5). Since the Class I safety standard also specifies that emission limits must be maintained during single fault conditions, the open fiber control circuitry at each node is duplicated for redundancy.

Summary

Network management is essential to operate and maintain any network. Operating costs dominate equipment costs for most telecom networks, making good network management imperative in ensuring the smooth operation of the network. The main functions of network management include configuration (of equipment and connections in the network), performance monitoring, and fault management. In addition, security and accounting are also management functions. Most functions of management are performed through a hierarchy of centralized management systems, but certain functions, such as restoration against failures, or the use of defect indicators to suppress alarms, are done in a distributed fashion. Several management protocols exist, the main ones being TL-1, SNMP, and CMIP.

It is useful to break down the optical layer into three sublayers: the optical channel layer, which deals with individual connections or lightpaths and is end to end across the network; the optical multiplex section layer, which deals with multiplexed wavelengths on a point-to-point link basis; and the optical transmission section layer, which deals with multiplexed wavelengths and the optical supervisory channel between adjacent amplifiers.

Interoperability between equipment from different vendors is a major issue facing the industry today. Initially the focus was on achieving interoperability between vendors at the WDM level, but that is now recognized as being very complex. Today the focus is on establishing interoperability by defining standard port-side single-wavelength interfaces at regenerator (or transponder) locations.
There is also significant work under way to define optical layer overheads and their functions, as well as to establish signaling and control protocol standards for controlling connections in the optical layer.

The level of transparency offered by the optical network affects the amount of management that can be performed. Key performance parameters such as the bit error rate can only be monitored in the electrical domain. Fast signaling methods need to be in place between network elements to perform some key management functions. These include the use of defect indicator signals to prevent the generation of unwanted alarms and protection-switching action, and other signaling bytes to control rapid protection switching. Optical path trace is another indicator that can be used to verify and manage connectivity in the network. Several methods exist for exchanging management information between nodes, including the optical supervisory channel, pilot tones, the use of certain overhead bytes in the SONET/SDH overhead, and the new digital wrapper overhead defined specifically for the optical layer.

Connection management in the optical network is slowly migrating from a centralized management-plane-based approach to a more distributed connection control plane approach using protocols similar to those used in IP and ATM networks.

Eye safety considerations are a unique feature of optical fiber communication systems. These considerations set an upper limit on the power that can be emitted from an open fiber, and these limits make it harder to design WDM systems, since they apply to the total power and not to the power per channel. Safety is maintained by using automated shutdown mechanisms in the network that detect failures and turn off lasers and amplifiers to prevent any laser radiation from exiting the system.
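The defect indicator mechanism summarized above can be illustrated in code. The following Python simulation is a minimal sketch under simplifying assumptions (the function name, the chain topology, and the timer values are illustrative, not from any standard): when a node detects loss of light, it forwards a defect indicator (FDI) downstream, and a node raises a local alarm only if no FDI arrives from upstream before its alarm timer expires, so that only the node adjacent to the failure alarms.

```python
# Minimal sketch of FDI-based alarm suppression on a chain of nodes.
# All names and timer values are illustrative, not taken from any standard.

def simulate(num_nodes, failed_link, detect_ms=2.0, prop_ms=3.0,
             alarm_wait_ms=2000.0):
    """For each node downstream of the failed link, compute when it detects
    loss of light, when an upstream FDI reaches it, and whether it alarms.

    A node detects loss of light `detect_ms` after the failure reaches it,
    immediately relays an FDI downstream, and raises an alarm only if no
    FDI has arrived before its alarm timer (`alarm_wait_ms`) expires.
    """
    events = {}
    for node in range(failed_link + 1, num_nodes):
        hops = node - (failed_link + 1)        # links past the first affected node
        los_time = hops * prop_ms + detect_ms  # loss of light detected here
        # FDI originated by the first affected node, relayed at propagation
        # speed (relaying itself is assumed instantaneous):
        fdi_time = detect_ms + hops * prop_ms if hops > 0 else None
        alarm_deadline = los_time + alarm_wait_ms
        alarms = fdi_time is None or fdi_time > alarm_deadline
        events[node] = {"los": los_time, "fdi": fdi_time, "alarms": alarms}
    return events

result = simulate(num_nodes=4, failed_link=0)
for node, ev in sorted(result.items()):
    print(node, ev)
```

With these numbers, only the node immediately downstream of the failure alarms; nodes farther downstream receive the FDI well within their 2-second alarm window and stay silent, which is exactly the alarm-suppression behavior the text describes.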
Further Reading

Network management is a vast subject, and several books have been written on the subject; see, for instance, [Sub00, Udu99, Bla95, AP94] for good introductions to the field, including descriptions of the various standards. [McG99, Wil00, Mae98] provide overviews of issues in optical network management.

There is currently a lot of interest in the standards bodies in standardizing many of the items we discussed in this chapter. The standards groups currently engaged in this are the International Telecommunications Union (ITU) study groups 13 and 15 (www.itu.ch), the American National Standards Institute (ANSI) T1X1.5 subcommittee (www.ansi.org), the Optical Internetworking Forum (OIF) (www.oiforum.com), the Internet Engineering Task Force (IETF) (www.ietf.org), Telcordia Technologies (www.telcordia.com), and the Network and Services Interoperability Forum (NSIF) (www.atis.org/atis/sif/sifhom.htm). The ITU defines the standards, including both SDH and the optical layer. ANSI provides the North American input to the ITU. The IETF is the standards body for the Internet and is actively involved in defining optical layer control protocols. The OIF serves as a discussion forum for data communications equipment vendors, optical networking vendors, and service providers. Telcordia defined many of the SONET standards. NSIF has defined many of the management interfaces for facilitating interoperability in SONET. We have provided a list of relevant standards documents in Appendix C.

Pilot tones have been used in optical networks for several years now. See [Hil93, HFKV96, HK97] for a sampling of papers describing implementations of pilot tones for signal tracing and monitoring. [Epw95] uses pilot tones to control the gain of optical amplifiers. ITU G.709 defines the digital wrapper, including the associated maintenance signals such as the path trace and the defect indicators.
Telcordia's GR-253 defines an equivalent set of signals for SONET. Distributed protocols for connection management are commonly used in many types of networks; examples include PNNI in ATM networks [ATM96] and RSVP/CR-LDP [BZB+97, Abo01] in IP/MPLS networks. See [CGS93] for some early work and [RS97, Wei98] for related work on optical networks. Significant activity is currently under way toward defining extensions to IP control protocols to provide optical layer connection management. Many of these are contributions to the ITU, ANSI, IETF, and OIF and may be accessed from their Web sites. See also [GR00, AR01] for a discussion of the various types of control plane models. Laser safety is covered by several standards bodies, including ANSI, the International Electrotechnical Commission (IEC), the U.S. Food and Drug Administration (FDA), and the ITU [Ame88, Int93, Int00, US86, ITU99, ITU96].

Problems

9.1 Which sublayer within the optical layer would be responsible for handling the following functions?

(a) Setting up and taking down lightpaths in the network
(b) Monitoring and changing the digital wrapper overhead in a lightpath
(c) Rerouting all wavelengths (except the optical supervisory channel) from a failed fiber link onto another fiber link
(d) Detecting a fiber cable cut in a WDM line system
(e) Detecting failure of an individual lightpath
(f) Detecting bit errors in a lightpath

9.2 Consider the SONET network operating over the optical layer shown in Figure 9.13. Trace the path of the connection through the network, and show the termination of different layers at each network element.

9.3 Consider the network shown in Figure 9.14. Suppose the link segment between OLT A and amplifier B fails.

(a) Assume that each node detects loss of light in 2 ms and waits 5 ms before it sends an FDI signal downstream. Also, each node waits for 2 s after the loss of light is detected before it triggers an alarm.
Assume that the propagation delay on each link segment (a segment being the part of the link between adjacent amplifiers, or between an OLT and an adjacent amplifier) is 3 ms. Draw a time line indicating the behavior of each node in the network after the failure, including the transmission of OCh-FDI and OMS-FDI signals.

Figure 9.13 A combined SONET/WDM optical network for Problem 9.2.

Figure 9.14 Example for Problem 9.3.

(b) Now assume that each node detects loss of light in 2 ms, immediately sends an FDI signal downstream, and waits an additional 2 s after the loss of light is detected before it triggers an alarm. Assume the same propagation delay values as before. Redraw the time line indicating the behavior of each node in the network after the failure, including the transmission of OCh-FDI and OMS-FDI signals. What do you observe as the difference between the two methods proposed above?

9.4 Consider an OXC connected to multiple OLTs.

(a) If the OXC has an electronic switch core with optical-to-electrical conversions at its ports, what overhead techniques can it use? How would it communicate with other such OXCs in the network? What performance parameters could it monitor?

(b) If the OXC is all optical, with no optical-to-electrical conversions, what overhead techniques can it use? How would it communicate with other such OXCs in the network? What performance parameters could it monitor?

9.5 Consider the open fiber control protocol in the Fibre Channel standard.

(a) How would you choose the parameters r and r' as a function of the maximum link propagation delay d_prop?

(b) What is the time taken for a node to go from the DISCONNECT state to the ACTIVE state, assuming a successful reconnection attempt, that is, it never has to go back to the DISCONNECT state?

References

[Abo01] O. Aboul-Magd et al. Constraint-Based LSP Setup Using LDP. Internet Engineering Task Force, 2001. draft-ietf-mpls-cr-ldp-05.txt.
[Ame88] American National Standards Institute. Z136.2: Safe Use of Optical Fiber Communication Systems Utilizing Laser Diodes and LED Sources, 1988.

[AP94] S. Aidarus and T. Plevyak, editors. Telecommunications Network Management into the 21st Century. IEEE Press, Los Alamitos, CA, 1994.

[AR01] D. Awduche and Y. Rekhter. Multiprotocol lambda switching: Combining MPLS traffic engineering control with optical crossconnects. IEEE Communications Magazine, 39(4):111-116, Mar. 2001.

[ATM96] ATM Forum. Private Network-Network Interface Specification: Version 1.0, 1996.

[Bla95] U. Black. Network Management Standards. McGraw-Hill, New York, 1995.

[BZB+97] R. Braden, L. Zhang, S. Berson, S. Herzog, and S. Jamin. Resource Reservation Protocol, Version 1 Functional Specification. Internet Engineering Task Force, Sept. 1997.

[CGS93] I. Cidon, I. S. Gopal, and A. Segall. Connection establishment in high-speed networks. IEEE/ACM Transactions on Networking, 1(4):469-482, Aug. 1993.

[Epw95] R. E. Epworth. Optical transmission system. U.S. Patent 5463487, 1995.

[GR00] J. Gruber and R. Ramaswami. Towards agile all-optical networks. Lightwave, Dec. 2000.

[HFKV96] F. Heismann, M. T. Fatehi, S. K. Korotky, and J. J. Veselka. Signal tracking and performance monitoring in multi-wavelength optical networks. In Proceedings of European Conference on Optical Communication, pages 3.47-3.50, 1996.

[Hil93] G. R. Hill et al. A transport network layer based on optical network elements. IEEE/OSA Journal on Lightwave Technology, 11:667-679, May/June 1993.

[HK97] Y. Hamazumi and M. Koga. Transmission capacity of optical path overhead transfer scheme using pilot tone for optical path networks. IEEE/OSA Journal on Lightwave Technology, 15(12):2197-2205, Dec. 1997.

[Int93] International Electrotechnical Commission. 60825-1: Safety of Laser Products, Part 1: Equipment Classification, Requirements and User's Guide, 1993.

[Int00] International Electrotechnical Commission.
60825-2: Safety of Laser Products, Part 2: Safety of Optical Fiber Communication Systems, 2000.

[ITU96] ITU-T SG15/WP 4. Rec. G.681: Functional Characteristics of Interoffice and Long-Haul Line Systems Using Optical Amplifiers, Including Optical Multiplexing, 1996.

[ITU99] ITU-T. Rec. G.664: Optical Safety Procedures and Requirements for Optical Transport Systems, 1999.

[Mae98] M. Maeda. Management and control of optical networks. IEEE Journal of Selected Areas in Communications, 16(6):1008-1023, Sept. 1998.

[McG99] A. McGuire. Management of optical transport networks. IEE Electronics and Communication Engineering Journal, 11(3):155-163, June 1999.

[RS97] R. Ramaswami and A. Segall. Distributed network control for optical networks. IEEE/ACM Transactions on Networking, Dec. 1997.

[Sub00] M. Subramanian. Network Management: Principles and Practice. Addison-Wesley, Reading, MA, 2000.

[Udu99] D. K. Udupa. TMN: Telecommunications Management Network. McGraw-Hill, New York, 1999.

[US86] U.S. Food and Drug Administration, Department of Radiological Health. Requirements of 21 CFR Chapter J for Class I Laser Products, Jan. 1986.

[Wei98] Y. Wei et al. Connection management for multiwavelength optical networking. IEEE Journal of Selected Areas in Communications, 16(6):1097-1108, Sept. 1998.

[Wil00] B. J. Wilson et al. Multiwavelength optical networking management and control. IEEE/OSA Journal on Lightwave Technology, 18(12):2038-2057, 2000.

Network Survivability

Providing resilience against failures is an important requirement for many high-speed networks. As these networks carry more and more data, the amount of disruption caused by a network-related outage becomes more and more significant. A single outage can disrupt millions of users and result in millions of dollars of lost revenue to users and operators of the network.
As part of the service-level agreement between a carrier and its customer leasing a connection, the carrier commits to providing a certain availability for the connection. A common requirement is that the connection be available 99.999% (five 9s) of the time. This requirement corresponds to a connection downtime of less than 5 minutes per year. A connection is routed through many nodes in the network between its source and its destination, and there are many elements along its path that can fail. The only practical way of obtaining 99.999% availability is to make the network survivable, that is, able to continue providing service in the presence of failures.

Protection switching is the key technique used to ensure survivability. These protection techniques involve providing some redundant capacity within the network and automatically rerouting traffic around the failure using this redundant capacity. A related term is restoration. Some people apply the term protection when the traffic is restored in the tens to hundreds of milliseconds, and use the term restoration for schemes where traffic is restored on a slower time scale. However, we do not distinguish between protection and restoration in this chapter. Protection is usually implemented in a distributed manner without requiring centralized control in the network. This is necessary to ensure fast restoration of service after a failure.

We will be concerned with failures of network links, nodes, and individual channels (in the case of a WDM network). In addition, the software residing in today's network elements is immensely complex, and reliability problems arising from software bugs have become a serious issue. This is something that is usually dealt with by proper software design and is hard to protect against in the network.
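The five-9s figure translates directly into allowable downtime. As a quick check, the short calculation below (the function name is illustrative) converts an availability percentage into downtime per year:

```python
# Convert an availability percentage into allowed downtime per year.
# A simple worked example; the function name is illustrative.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a (non-leap) year

def downtime_minutes_per_year(availability_percent):
    """Allowed downtime in minutes per year for a given availability."""
    unavailability = 1.0 - availability_percent / 100.0
    return unavailability * MINUTES_PER_YEAR

# Five 9s: 99.999% availability allows about 5.26 minutes of downtime
# per year, i.e., roughly the 5 minutes per year quoted in the text.
print(round(downtime_minutes_per_year(99.999), 2))
```

The same function shows how quickly the budget tightens: each additional 9 shrinks the allowed downtime by a factor of ten (four 9s permits about 53 minutes per year; six 9s only about 32 seconds).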
In most cases, failures are triggered by human error, such as a backhoe cutting through a fiber cable, or an operator pulling out the wrong connection or turning off the wrong switch. Links fail mostly because of fiber cuts; this is the most likely failure event. There were 136 such failures reported by U.S. carriers to the Federal Communications Commission in 1997. Fiber that is deployed inside oil and gas pipelines is less likely to be cut than fiber that is buried directly in the ground or strung on poles. For instance, Williams Communications, which runs fiber beside oil pipelines, has experienced only a single fiber cut since 1986.

The next most likely failure event is the failure of active components inside network equipment, such as transmitters, receivers, or controllers. In general, network equipment is designed with redundant controllers. Moreover, failure of a controller doesn't affect traffic but only impacts management visibility into the network.

Node failures are another possibility to be reckoned with. Entire central offices can fail, usually because of catastrophic events such as fires or flooding. These events are rare, but they cause widespread disruption when they occur. Examples include the fire at the Hinsdale central office of Illinois Bell in 1988 and the flooding of several central offices due to Hurricane Floyd in 1999.

Protection schemes are also used extensively to allow maintenance actions in the network. For example, in order to service a link, the traffic on the link is typically switched over to an alternate route using the protection scheme before it is serviced. The same technique is used when nodes or links are upgraded in the network. In most cases, the protection schemes are engineered to protect against a single failure event or maintenance action. If the network is large, we may need to provide the capability to deal with more than one concurrent failure or maintenance action.
One way to handle this is to break up the network into smaller subnetworks and restrict the operation of the protection scheme to within a subnetwork. This allows one failure per subnetwork at any given time. Another way to deal with this issue is to ensure that the mean time to repair a failure is much smaller than the mean time between failures. This ensures that, in most cases, the failed link will be repaired before another failure happens. Some of the protection schemes that we will study do, however, protect the network against some types of simultaneous multiple failures.

The restoration times required depend on the application and the type of data being carried. For SONET/SDH networks, the maximum allowed restoration time is 60 ms. This requirement came from the fact that some equipment in the network drops voice calls if the connection is disrupted for a period significantly longer than 60 ms. Over time, operators have gotten used to being able to achieve restoration on these time scales. However, in a world dominated by data, rather than voice traffic, the 60 ms number may not be a hard requirement, and operators may be willing to tolerate somewhat larger restoration times, particularly if they see other benefits as a result, such as higher bandwidth efficiency, which in turn would lead to lower operating costs. On the other hand, another point of view is that the restoration time requirements could become more stringent as data rates in the network increase. A downtime of 1 second at 10 Gb/s corresponds to losing over a gigabyte of data. Most IP networks today provide services on a best-effort basis and do not guarantee availability; that is, they try to route traffic in the network as best as they can, but packets can have random delays through the network and can be dropped if there is congestion.

Survivability can be addressed within many layers in the network.
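To see how restoration time translates into lost data, a short calculation multiplies the outage duration by the line rate (the rates below are chosen purely for illustration):

```python
# Data lost during an outage of a given duration at a given line rate.
# Line rates and durations below are illustrative examples.

def data_lost_gigabytes(rate_gbps, outage_seconds):
    """Gigabytes lost: rate (Gb/s) x duration (s), divided by 8 bits/byte."""
    return rate_gbps * outage_seconds / 8.0

# A 1 s outage at 10 Gb/s loses 10 Gb = 1.25 GB, i.e., over a gigabyte,
# whereas a 60 ms protection switch at the same rate loses only 0.075 GB.
print(data_lost_gigabytes(10, 1.0))    # 1.25
print(data_lost_gigabytes(10, 0.060))  # 0.075
```

This is the arithmetic behind the "over a gigabyte" remark in the text, and it makes concrete why faster line rates argue for tighter, not looser, restoration times.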
Protection can be performed at the physical layer, or layer 1, which includes the SONET/SDH and optical layers. Protection can also be performed at the link layer, or layer 2, which includes the ATM layer and the MPLS layer that is part of IP networks. Finally, protection can also be performed at the network layer, or layer 3, such as the IP layer. There are several reasons why this is the case. For instance, each layer can protect against certain types of failures, but probably cannot protect against all types of failures effectively. We will focus primarily on layer 1 restoration in this chapter, but also briefly discuss the protection techniques applicable to layers 2 and 3.

The rest of this chapter is organized as follows. We start by outlining the basic concepts behind protection schemes. Many of the protection techniques used in today's telecommunication networks were developed for use in SONET and SDH networks, and we will explore these techniques in detail. We will also look at how protection is implemented in today's IP networks. Following this, we will look at protection functions in the optical layer in detail, and then discuss how protection functions in the different layers of the network can work together.

10.1 Basic Concepts

A great variety of protection schemes are used in today's networks. We will talk about working paths and protect paths. Working paths carry traffic under normal operation; protect paths provide an alternate path to carry the traffic in case of failures. Working and protect paths are usually diversely routed so that both paths aren't lost in case of a single failure. Protection schemes are designed to operate over a range of network topologies. Some work on point-to-point links. Ring topologies are particularly popular in
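The working/protect path distinction can be modeled in a few lines of code. The sketch below is a hypothetical illustration (the class name, link labels, and paths are invented for the example): traffic stays on the working path unless that path traverses a failed link, in which case it is switched to the diversely routed protect path.

```python
# Minimal sketch of working/protect path selection.
# Class name, link labels, and paths are illustrative only.

class ProtectedConnection:
    def __init__(self, working, protect):
        # Each path is a list of links. Working and protect paths are
        # diversely routed, so no single link failure takes down both.
        self.working = working
        self.protect = protect

    def active_path(self, failed_links):
        """Carry traffic on the working path unless it crosses a failure."""
        if any(link in failed_links for link in self.working):
            return self.protect
        return self.working

conn = ProtectedConnection(working=["A-B", "B-C"], protect=["A-D", "D-C"])
print(conn.active_path(failed_links=set()))    # ['A-B', 'B-C']
print(conn.active_path(failed_links={"B-C"}))  # ['A-D', 'D-C']
```

Real protection schemes, of course, add the signaling and timing machinery discussed later in the chapter; this sketch captures only the routing decision itself.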