Mission-Critical Network Planning, Part 2


expressed as Metcalfe's Law, states that the potential value of a network is proportional to the number of active user nodes. More precisely, the usefulness of a network to new and existing users equals the square of the number of user nodes. Although networks with higher scale factors provide greater potential value per node, they are less economical to grow [13].

σ has implications for network risk. Large, scale-free networks imply distributed architectures, which, as we saw in the prior discussion, have a greater likelihood of outage. But because fewer users are served per node, the expected loss from a nodal outage is smaller. In a network of limited scale, where users are more concentrated at each node, greater damage can occur. These points can be illustrated by estimating the minimum expected loss in a network of N nodes, each with an outage potential p, equal to the probability of a failure. The probability of f failures occurring out of N possible nodes is characterized by the well-known binomial distribution [14]:

P(f) = [N! / (f! (N − f)!)] · p^f · (1 − p)^(N − f)   (2.2)

where P(f) is the probability of f failures occurring (this formula is revisited in the next chapter in the discussion on network reliability). If P(0) indicates the percentage of the time no failures occur (f = 0), then 1 − P(0) is the percentage of time that one or more failures occur. If we next assume that σ is the average nodal loss per outage, measured as a percentage of users, then on a broad network basis the minimum risk (or percent expected minimum loss) ρ for a network is given by:

ρ = σ [1 − P(0)]   (2.3)

[Figure 2.13 Network scale factor: relatively scale-free (scale = 10, scope = 5, σ = .2); exponential scale-free (scale = 10, scope = 5, σ = .2); relatively scale-limited (scale = 10, scope = 2, σ = .5); service and user nodes shown.]

Figure 2.14 graphically shows how ρ varies with network scale at different nodal outage potentials p. It shows that investing to reduce nodal outage potentials, regardless of scale, can ultimately still leave about one percent of the users at risk. Expanding the size of the network to reduce risk is more effective when the scale is limited, in other words, when users are more concentrated. Concentration will dominate risk up to a certain point, beyond which size has the greater influence on network risk.

[Figure 2.14 Network risk example: minimum risk ρ plotted against scale σ (from 1.00 down to 0.05) for p = .01, .05, and .1.]

This analysis assumes uniform, random outage potential at each node. Of course, this assumption may not hold in networks with a nonuniform user distribution and nodal outage potential. An intentional rather than random outage at a concentrated node, such as one resulting from a directed security attack, can inflict great damage. Overall, scale-free networks with less concentrated nodal functionality are less vulnerable to directed, nonrandom outages than networks of limited scale.
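As a quick numeric sketch of how (2.2) and (2.3) combine (this example is not from the original text; the node counts and σ values are illustrative, loosely based on the Figure 2.13 example), the following lines compute the minimum risk ρ for a small and a more concentrated network at several outage potentials:

```python
from math import comb

def min_risk(n_nodes: int, p: float, sigma: float) -> float:
    """Minimum expected loss (2.3): rho = sigma * (1 - P(0)),
    where P(0) = (1 - p)^N follows from the binomial distribution (2.2)."""
    p_zero = comb(n_nodes, 0) * p**0 * (1 - p) ** n_nodes  # P(0) per (2.2)
    return sigma * (1 - p_zero)

# Illustrative values only: a 5-service-node network with sigma = 0.2 versus
# a more concentrated 2-service-node network with sigma = 0.5.
for n, sigma in [(5, 0.2), (2, 0.5)]:
    for p in (0.01, 0.05, 0.10):
        print(f"N={n}, sigma={sigma}, p={p}: rho = {min_risk(n, p, sigma):.2%}")
```

For p = .01, both arrangements leave roughly one percent of users at risk, which is consistent with the observation above that reducing nodal outage potential alone still leaves residual risk.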
2.4.4 Complexity

Complexity in a network is characterized by variety. Use of many diverse network technologies, circuits, OSs, vendors, interfaces, devices, management systems, service providers, and suppliers is quite common and often done for valid reasons. Many networks grow incrementally over time and end up as à la carte designs. Such designs usually do not perform as well as a single integrated design. Even if a complex design has strong tolerance against errors, extra complexity can impede recovery activities and require more costly contingency.

The problems with complexity are attributed not only to system variety, but also to whether components are chosen correctly for their intended purpose and how well they are matched. A matched system is one in which components fit well with each other and do not encroach upon the limits of one another. Turnkey systems delivered by a single vendor or integrator are usually delivered with the intention of behaving as matched systems. Unmatched systems, on the other hand, are usually characterized by poorly integrated, piecemeal designs in which different vendor systems are simply plugged together. The cost of maintaining the same level of service increases exponentially as more complexity is used to deliver the service.

Consolidation of multiple components, interfaces, and functions into a single system or node can reduce the number of resources needed and eliminate potential points of failure and the interoperability issues associated with matching. But from the previous discussion, we know that consolidation poses greater risk because the consolidated resource can become a single point of failure with high damage potential, unless a redundant architecture is used.

2.5 Summary and Conclusions

This chapter reviewed some fundamental concepts of continuity that form the basis for much of the remaining discussion in this book. Many of these concepts can be applied to nearly all levels of networking. A basic comprehension of these principles provides the foundation for devising and applying continuity strategies that use many of the remedial techniques and technologies discussed further in this book.

Adverse events are defined as those that violate a well-defined envelope of operational performance criteria. Service disruptions in a network arise out of lack of preparedness rather than from the adverse events themselves. Preparedness requires having the appropriate capabilities in place to address such events. It includes having an early detection mechanism to recognize them, ideally before they happen; the ability to contain the effects of disruption on other systems; a failover process that can transfer service processing to unaffected working components; a recovery mechanism to restore any failed components; and the means to resume normal operation following recovery.

Redundancy is a tactic that utilizes multiple resources so that if one resource is unable to provide service, another can. In networks, redundancy can add to capital and operating costs and should therefore be carefully designed into an operation. At a minimum, it should eliminate single points of failure, particularly those that support mission-critical services. There are various ways to implement redundancy. However, if not done correctly, it can be ineffective and provide a false sense of security.

Tolerance describes the ability to withstand disruptions and is usually expressed in terms of availability. A greater level of tolerance in a network or system implies lower transaction loss and higher cost. FT, FR, and HA are tolerance categories that are widely used to classify systems and services, with FT representing the highest level of tolerance. FR and HA solutions can be cost-effective alternatives for minimizing service disruption, but they may not guarantee transaction preservation during failover to the same degree as FT.
There are several key principles of network and system design to ensure continuity. Capacity should be put in the right places; indiscriminate placement of capacity can produce bottlenecks, which can lead to other service disruptions. Networks should be designed in compartments that each represent their own failure groups, so that a disruption in one compartment does not affect another. The network architecture should be balanced so that loss of a highly interconnected node does not disrupt the entire network.

Finally, the adage "the simpler, the better" prevails. Complexity should be discouraged at all levels. The variety of technologies, devices, systems, vendors, and services should be minimized. They should also be well matched. This means that each should be optimally qualified for its intended job and work well with other components.

References

[1] Henderson, T., "Guidelines for a Fault-Tolerant Network," Network Magazine, November 1998, pp. 38–43.
[2] Slepicka, M., "Masters of Disasters—Beating the Backhoe," Network Reliability—Supplement to America's Network, June 2000.
[3] Barrett, R., "Fiber Cuts Still Plague ISPs," Interactive Week, May 31, 1999, p. 36.
[4] Campbell, R., "Cable TV: Then and Now," TV Technology, September 20, 2000, p. 34.
[5] Brumfield, R., "What It Takes to Join the Carrier Class," Internet Telephony, May 1999, pp. 80–83.
[6] Glorioso, R., "Recovery or Tolerance?" Enterprise Systems Journal, July 1999, pp. 35–37.
[7] Klein, D. S., "Addressing Disaster Tolerance in an E-World," Disaster Recovery Journal, Spring 2001, pp. 36–40.
[8] Whipple, D., "For Net & Web, Security Worries Mount," Interactive Week, October 9, 2000, pp. 1–8.
[9] Oleson, T. D., "Consolidation: How It Will Change Data Centers," Computerworld—Special Advertising Supplement, 1999, pp. 4–19.
[10] Nolle, T., "Balancing Risk," Network Magazine, December 2001, p. 96.
[11] Sanborn, S., "Spreading Out the Safety Net," Infoworld, April 1, 2002, pp. 38–41.
[12] Porter, D., "Nothing Is Unsinkable," Enterprise Systems Journal, June 1998, pp. 20–26.
[13] Rybczynski, T., "Net-Value—The New Economics of Networking," Computer Telephony Integration, April 1999, pp. 52–56.
[14] Bryant, E. C., Statistical Analysis, New York: McGraw-Hill Book Company, Inc., 1960, pp. 20–24.

CHAPTER 3
Continuity Metrics

Metrics are quantitative measures of system or network behavior. We use metrics to characterize system behavior so that decisions can be made regarding how to manage and operate systems efficiently. Good metrics are those that are easily understood in terms of what they measure and how they convey system or network behavior.

There is no single metric that can convey the adequacy of a mission-critical network's operation. Using measures that describe the behavior of a single platform or portion of a network is insufficient. One must measure many aspects of a network to arrive at a clear picture of what is happening. There is often no true mathematical way of combining metrics for a network. Unlike the stock market, use of a computed index to convey overall network status is often flawed. For one thing, many indices are the result of combining measures obtained from ordinal and cardinal scales, which is mathematically incorrect. Some measures are obtained through combination using empirically derived models. This can also be flawed because a metric is only valid within the ranges of data from which it was computed.
The best way of combining measures is through human judgment. A network operator or manager must be trained to observe different metrics and use them to make decisions. Like a pilot, operators must interpret information from various gauges to decide the next maneuver. Good, useful metrics provide a balance between data granularity and the effort required for computation. Many statistical approaches, such as experimental design, are aimed at providing the maximum amount of information with the least amount of sampling. The cost and the ability to obtain input data have improved over the years. Progress in computing and software has made it possible to conduct calculations using vast amounts of data in minimal time, something that was impossible 20 or 30 years ago. The amount of time, number of samples, complexity, and cost should all be considered when designing metrics.

Metrics should be tailored to the item being measured. No single metric is applicable to everything in a network. Furthermore, a metric should be tied to a service objective and used to express the extent to which that objective is being achieved; tying a metric to each objective conveys the degree to which it is satisfied.

Finally, computing a metric should be consistent when repeated over time; otherwise, comparing relative changes in the values would be meaningless. Repeated calculations should be based on the same type of data, the same data range, and the same sampling approach. More often than not, systems or network services are compared based on measures provided by a vendor or service provider. Comparing different vendors or providers using the measures they each supply is often difficult and sometimes fruitless, as each develops its metrics based on its own methodologies [1].

3.1 Recovery Metrics

Recovery comprises all of the activities that must occur from the time of an outage to the time service is restored. These will vary among organizations and, depending on the context of use, within a mission-critical network environment. The activities involved in recovering a component are somewhat different from those required to recover an entire data center, but in either case the general meaning remains the same. General recovery activities include declaration that an adverse event has occurred (or is about to occur); initiation of a failover process; system restoration or repair activities; and system restart, cutover, and resumption of service. Two key recovery metrics are described in the following sections.

3.1.1 Recovery Time Objective

The recovery time objective (RTO) is a target measure of the elapsed time interval between the occurrence of an adverse event and the restoration of service. RTO should be measured from the point when the disruption occurred until operation is resumed. In mission-critical environments, this means that operation is essentially in the same functional state as it was prior to the event. Some IT organizations may alter this definition by relaxing some of the operational state requirements after resumption and accepting partial operation as a resumed state. Likewise, some will define RTO based on the time of recognizing and declaring that an adverse event has occurred. This can be misleading because it does not take into account monitoring and detection time. RTO is an objective, specified in hours and minutes: a target value, determined by an organization's management, that represents an acceptable recovery time.
What value an organization assigns to "acceptable" is influenced by a variety of factors, including the importance of the service and the consequential revenue loss, the nature of the service, and the organization's internal capabilities. For systems, RTO may even be specified in milliseconds. Some will also specify RTO in transactions or a comparable measure that conveys unit throughput of an entity. This approach is only valid if that entity's throughput is constant over time.

RTOs can be applied to any network component, from an individual system to an entire data center. Organizations will define different RTOs for different aspects of their business. To define an RTO, an organization's managers must determine how much service interruption their business can tolerate. They must determine how long a functional entity, such as a business process, can be unavailable. One may often see RTOs in the range of 24 to 48 hours for large systems, but these numbers do not reflect any industry standard. Virtual storefronts are unlikely to tolerate high RTOs without significant loss of revenue. Some vertical markets, such as banking, must adhere to financial industry requirements for disruption of transactions [2].

Cost ultimately drives the determination of an RTO. A high cost is required to achieve a low RTO for a particular process or operation. Achieving RTOs close to zero requires expensive automated recovery and redundancy [3]. As the target RTO increases, the cost to achieve it decreases. An RTO of long duration invites less expensive redundancy and more manual recovery operation. However, concurrent with this is business loss. As shown in Figure 3.1, loss is directly related to RTO: the longer the RTO, the greater the loss. During recovery, business loss can be realized in many ways, including lost productivity or transactions. This topic is discussed further in this chapter. At some point, there is an RTO whose costs completely offset the losses incurred during recovery [4, 5].

[Figure 3.1 RTO versus loss and cost: business loss rises with RTO while recovery cost falls; at some RTO the recovery cost offsets the loss.]

It becomes evident that defining an RTO as a sole measure is meaningless without some idea of what level of service the recovery provides. Furthermore, different systems will have their own RTO curves. Critical systems will often have a much smaller RTO than less critical ones. They can also have comparable RTOs but with more stringent tolerance for loss. A tiered-assignment approach can be used. This involves defining levels of system criticality and then assigning an RTO value to each. So, for example, a three-level RTO target might look like this:

• Level 1: restore to the same service level;
• Level 2: restore to 75% service level;
• Level 3: restore to 50% service level.

A time interval can be associated with each level, as well as a descriptor of the level of service provided. For example, a system assigned a level 2 RTO of 1 hour must complete recovery within that time frame and leave no more than 25% of service disrupted. A system can be assigned a level 1 RTO of 1 hour as well, but must restore to the same level of service. Level 1 may require failover procedures or recovery to a secondary system. Assuming that the service level is linearly proportional to time, RTOs across different levels can be equated on the same time scale. A time-equivalent RTO, RTO_E, can thus be computed as:

RTO_E = RTO / (1 − α)   (3.1)

where α is the maximum percentage of service that may remain disrupted after recovery. In our example, a level 2 RTO of 1 hour implies an α of 25%, which equates to an RTO_E of 80 minutes.
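As an illustrative sketch (not part of the original text), the lines below evaluate (3.1) for the three hypothetical tiers above; the mapping of levels to α values is an assumption drawn from the worked example in the preceding paragraph.

```python
def time_equivalent_rto(rto_minutes: float, alpha: float) -> float:
    """Time-equivalent RTO per (3.1): RTO_E = RTO / (1 - alpha),
    where alpha is the fraction of service allowed to remain disrupted."""
    if not 0 <= alpha < 1:
        raise ValueError("alpha must be in [0, 1)")
    return rto_minutes / (1 - alpha)

# Assumed tier-to-alpha mapping based on the three-level example above.
tiers = {"Level 1": 0.00, "Level 2": 0.25, "Level 3": 0.50}
for level, alpha in tiers.items():
    rto_e = time_equivalent_rto(60, alpha)
    print(f"{level}: 60-minute RTO -> RTO_E = {rto_e:.0f} minutes")
# Level 2 reproduces the 80-minute figure quoted in the text;
# Level 3 stretches a 1-hour partial recovery to a 120-minute equivalent.
```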
3.1.1.1 Recovery Time Components

The RTO interval must incorporate all of the activities needed to restore a network or component back to service. A flaw in any of the component activities could lead to significant violation of the RTO. To this end, each component activity can be assigned an RTO as well. The sum of these component RTOs may not necessarily equal the overall RTO because activities can be conducted in parallel. Component RTOs can include time to detection and declaration of an adverse event, time to failover (sometimes referred to as a failover time objective), time to diagnose, and time to repair. The last two items are typically a function of the network or system complexity and typically pose the greatest risk. In complex networks, one can expect that the likelihood of achieving an RTO for the time to diagnose and repair is small. Failover to a redundant system is usually the most appropriate countermeasure, as it can buy time for diagnostics and repair. A system or network operating in a failed state is somewhat like a twin-engine airplane flying on one engine: its level of reliability is greatly reduced until diagnostics and repairs are made.

Figure 3.2 illustrates the continuum of recovery activity areas for a mission-critical network. Of course, these may vary, but they are applicable to most situations. The areas include the following:

• Network recovery. The time to restore voice or data communication following an adverse event. Network recovery will likely influence many other activities as well; for instance, recovery of backup data over a network could be affected until the network is restored.
• Data recovery. The time to retrieve backup data out of storage and deliver it to a recovery site, either physically or electronically. It also includes the time to load media (e.g., tape or disk) and install or reboot database applications. This is also referred to as the time to data (TTD) and is discussed further in the chapter on storage.
• Application recovery. The time to correct a malfunctioning application.
• Platform recovery. The time to restore a problematic platform to service operation.
• Service recovery. Recovery in the broadest sense: the cumulative time to restore service from an end user's perspective. It is, in essence, the amalgamation of all of the preceding recovery times.

[Figure 3.2 Continuum of recovery activities: network, data, application, and platform recovery proceed along a time scale from outage through failover to resumption, together making up service recovery.]

All of these areas are discussed at greater length in the subsequent chapters of this book.
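To illustrate why component RTOs do not simply add when activities overlap, here is a small sketch with made-up durations (an illustration, not taken from the book): once failover restores service, diagnosis and repair continue in the background, so the service-restoration time is driven by the detection-plus-failover path rather than by the full repair chain.

```python
# Hypothetical component recovery times in minutes; values are illustrative only.
detect_and_declare = 5
failover = 10          # failover time objective
diagnose = 45          # proceeds after failover, in parallel with restored service
repair = 120

# Service is restored once detection and failover complete...
service_restoration = detect_and_declare + failover
# ...while full repair of the failed component finishes later, in parallel.
full_repair = detect_and_declare + max(failover, diagnose + repair)

print(f"Service restored after {service_restoration} min")    # 15 min
print(f"Failed component repaired after {full_repair} min")   # 170 min
# The naive sum of all component times (180 min) overstates the service RTO,
# because diagnosis and repair overlap with operation on the redundant system.
```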
3.1.2 Recovery Point Objective

The recovery point objective (RPO) is used as a target metric for data recovery. It is also measured in terms of time, but it refers to the age or freshness of the data required to restore operation following an adverse event. Data, in this context, might also include information regarding transactions not recorded or captured. As with RTO, the smaller the RPO, the higher the expected data recovery cost. Reloading a daily backup tape can satisfy a tolerance for no more than 24 hours' worth of data. However, a tolerance for only one minute's worth of data or transaction loss might require more costly data transfer methods, such as mirroring, which is discussed in the chapter on storage.

Some view the RPO as the elapsed time of data recovery in relation to the adverse event; this is actually the aforementioned TTD. The RPO is the point in time to which the data must be recovered, sometimes referred to as the freshness window. It is the maximum tolerable elapsed time between the last safe backup and the point of recovery. An organization that can tolerate no data loss (i.e., RPO = 0) would have to restore data instantaneously following an adverse event and would have to employ a continuous backup system.

Figure 3.3 illustrates the relationship between TTD and RPO using a timeline. If we denote the time between the last data snapshot and an adverse event as a random variable ε, then it follows that TTD + ε must meet the RPO objective:

TTD + ε ≤ RPO   (3.2)

A target RPO should be chosen that does not exceed the snapshot interval (SI) and, at best, equals the SI. If data is not restored prior to the next scheduled snapshot, then the snapshot should be postponed or further data corruption is risked:

TTD + ε ≤ RPO ≤ SI   (3.3)

[Figure 3.3 Relationship of RPO, RTO, and TTD on a timeline: a snapshot is taken, an error occurs ε later, data is restored after TTD (within the RPO), operation is restored within the RTO, and the next snapshot follows at interval SI.]
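As a small illustrative check of (3.2) and (3.3) (not part of the original text; all durations are assumed example values), the sketch below tests whether a given backup scheme meets a target RPO:

```python
def meets_rpo(ttd_hours: float, epsilon_hours: float, rpo_hours: float,
              snapshot_interval_hours: float) -> bool:
    """Check the chained condition (3.3): TTD + epsilon <= RPO <= SI."""
    return ttd_hours + epsilon_hours <= rpo_hours <= snapshot_interval_hours

# Hypothetical daily-tape scenario: snapshots every 24 h, the error strikes
# 20 h after the last snapshot, and restoring from tape takes 3 h.
print(meets_rpo(ttd_hours=3, epsilon_hours=20,
                rpo_hours=24, snapshot_interval_hours=24))  # True: 23 <= 24 <= 24

# The same tape scheme cannot satisfy a 1-hour RPO, which points toward
# more costly continuous methods such as mirroring.
print(meets_rpo(ttd_hours=3, epsilon_hours=20,
                rpo_hours=1, snapshot_interval_hours=24))   # False
```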


