Mission-Critical Network Planning, Part 4

CHAPTER 6
Processing, Load Control, and Internetworking for Continuity

Until recent years, centralized network architectures using mainframe systems were a staple in many IT environments. They provided vast processing power and gradually acquired fault-tolerant capabilities as well. However, as distributed transaction processing requirements have heightened, mainframes were found to lack the versatility to support today's real-time and dynamic software development and processing environment. An Internet-based transaction, for instance, will often require the use of several independent processors situated at different network locations. The need for scalability in implementing high-performance computing has driven consideration of alternatives to centralized mainframe-based network architectures. This chapter reviews technologies and techniques that can be used to optimize survivability and performance within a distributed internetworking environment.

6.1 Clusters

For mission-critical networks, finding cost-effective ways of ensuring survivability is always an objective. The concept of clusters is designed with this objective in mind. A cluster is a group of interrelated computers that work together to perform various tasks. The underlying principle behind clusters is that several redundant computers working together as a single resource can do more work than a single computer and can provide greater reliability. Physically, a cluster is comprised of several computing devices that are interconnected to behave as a single system. Other computers in the network typically view and interact with a cluster as if it were a single system. The computing elements that comprise a cluster can be grouped in different ways to distribute load and eliminate single points of failure.

Because multiple devices comprise a cluster, if one device fails, another device can take over. The loss of any single device, or cluster node, does not cause the loss of data or application availability [1]. To achieve this capability, resources such as data and applications must either be replicated or pooled among the nodes so that any node can perform the functions of another if it fails. Furthermore, the transition from one node to another must be such that data loss and application disruption are minimized.

Beyond reliability, clustering solutions can be used to improve processing or balance workload so that processing bottlenecks are avoided. If high-performance processing is required, a job can be divided into many tasks and spread among the cluster nodes. If a processor or server is in overload, fails, or is taken off line for maintenance, other nodes in the cluster can provide relief. In these situations, clusters require that nodes have access to each other's data for consistency. Advances in storage technology have made sharing data among different systems easier to achieve (refer to the chapter on storage).

Clustering becomes more attractive for large, distributed applications or systems. Clusters can improve scalability because workload is spread among several machines. Individual nodes can be upgraded, or new nodes added, to increase central processor unit (CPU) capacity or memory to meet performance growth and response time requirements. This scalability also makes it more cost effective to provide the extra computing capacity needed to guard against the unpredictable nature of today's data traffic.
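Both the workload relief and the scalability just described come from the same basic move: dividing a job into tasks and spreading them over the member nodes. The sketch below is illustrative only; the node names, chunk size, and round-robin assignment are assumptions rather than anything prescribed by the text.

```python
# Illustrative sketch: dividing a job into tasks and spreading them across
# cluster nodes. In a real cluster each chunk would be dispatched over the
# network to a separate machine rather than planned locally like this.

from itertools import cycle

def partition(job_items, chunk_size):
    """Split a list of work items into fixed-size tasks."""
    for i in range(0, len(job_items), chunk_size):
        yield job_items[i:i + chunk_size]

def assign_tasks(job_items, nodes, chunk_size=100):
    """Round-robin the tasks over the available cluster nodes."""
    assignments = {node: [] for node in nodes}
    for node, task in zip(cycle(nodes), partition(job_items, chunk_size)):
        assignments[node].append(task)
    return assignments

if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c"]   # hypothetical node names
    job = list(range(1000))                  # a job of 1,000 work items
    for node, tasks in assign_tasks(job, nodes).items():
        print(node, "receives", len(tasks), "tasks")
    # Adding a fourth node to `nodes` spreads the same job more thinly,
    # which is the scaling behavior the text describes.
```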
Cluster connectivity can be achieved in numerous ways. Connecting servers over a network supporting transmission control protocol/Internet protocol (TCP/IP) is a very common approach. Another approach is to connect computer processors over a high-speed backplane. Nodes can be connected in various topologies, including star, ring, or loop. Invariably, in each approach nodes are given primary tasks and are assigned secondary nodes that automatically assume processing of those tasks upon failure of the primary node. The secondary node can be given tasks of its own so that it remains useful during normal operation, or it can be kept idle as a standby. A reciprocating arrangement can also be made between the nodes so that each performs the same tasks. Such arrangements can be achieved at several levels, including the hardware, operating system (OS), or application levels.

Clusters require special software that can make several different computers behave as one system. Cluster software is typically organized in a hierarchical fashion to provide local or global operational governance over the cluster. Software sophistication has grown to the point where it can manage a cluster's systems, storage, and communication components. An example is IBM's Parallel Sysplex technology, which is intended to provide greater availability [2, 3]. Parallel Sysplex is a technology that connects several processors over a long distance (40 km) using a special coupling facility that enables them to communicate and share data [4].

6.1.1 Cluster Types

Categorizing clusters could seem futile given the many cluster products that have flooded the market in recent years, but a few broad distinctions are still useful. A cluster in which a node failure results in that node's transactions, accounts, or data being unavailable is referred to as a static cluster. Dynamic clusters, on the other hand, can dynamically allocate resources as needed to maintain transaction processing across all users, as long as there is one surviving node [5]. These clusters provide greater availability and scalability, typically limited by data access and storage capabilities. Cluster management is often easier with dynamic clusters, as the same image is retained across all nodes. For situations involving large volumes of users, super clusters can be constructed; a super cluster is a static cluster comprised of dynamic clusters. These types of configurations are illustrated in Figure 6.1.

[Figure 6.1 Examples of cluster types: a static cluster, in which a server A failure confines the cluster to tasks N–Z; a dynamic cluster, in which server B continues all tasks A–Z upon server A failure; and a super cluster, in which dynamic cluster B continues all tasks upon cluster A failure.]
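To make the static/dynamic distinction concrete, the following sketch mirrors the task split of Figure 6.1 (tasks A–M on server A, N–Z on server B). It is illustrative only; the node names and the reassignment logic are assumptions, not the book's design.

```python
# Illustrative sketch: static vs. dynamic handling of a node failure,
# mirroring Figure 6.1 (server A owns tasks A-M, server B owns tasks N-Z).

import string

ASSIGNMENT = {
    "server-a": set(string.ascii_uppercase[:13]),   # tasks A-M
    "server-b": set(string.ascii_uppercase[13:]),   # tasks N-Z
}

def static_failure(assignment, failed_node):
    """Static cluster: the failed node's tasks simply become unavailable."""
    return {node: tasks for node, tasks in assignment.items() if node != failed_node}

def dynamic_failure(assignment, failed_node):
    """Dynamic cluster: surviving nodes absorb the failed node's tasks."""
    survivors = {n: set(t) for n, t in assignment.items() if n != failed_node}
    targets = list(survivors.values())
    for i, task in enumerate(sorted(assignment[failed_node])):
        targets[i % len(targets)].add(task)         # spread orphaned tasks round-robin
    return survivors

print(static_failure(ASSIGNMENT, "server-a"))   # only tasks N-Z remain served
print(dynamic_failure(ASSIGNMENT, "server-a"))  # server B now serves A-Z
```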
Each of these cluster types can be constructed in several ways using different technologies. The following list contains some of the most widely used technology approaches, illustrated in Figure 6.2:

[Figure 6.2 Examples of cluster technologies: multiprocessor clusters (SMP with shared memory and shared storage; MPP with per-CPU memory), fault-tolerant systems, and server clusters.]

• Multiprocessor clusters are multiple CPUs internal to a single system that can be grouped, or "clumped," together for better performance and availability. Standalone systems having this type of feature are referred to as multiprocessor or scale-up systems [6, 7]. Typically, nodes perform parallel processing and can exchange information with each other through shared memory, messaging, or storage input/output (I/O). Nodes are connected through a system area network that is typically a high-speed backplane. They often use special OSs, database management systems (DBMSs), and management software for operation. Consequently, these systems are commonly more expensive to operate and are employed for high-performance purposes.

There are two basic types of multiprocessor clusters. In symmetric multiprocessing (SMP) clusters, each node performs a different task at the same time. SMPs are best used for applications with complex information processing needs [8]. For applications requiring large numbers of the same or similar operations, such as data warehousing, massively parallel processing (MPP) systems may be a better alternative. MPPs typically use off-the-shelf CPUs, each with their own memory and sometimes their own storage. This modularity allows MPPs to be more scalable than SMPs, whose growth can be limited by memory architecture. MPP growth is nearly limitless, typically running instead into networking capacity limitations. MPPs can also be constructed from clusters of SMP systems.

• Fault-tolerant systems are a somewhat simplified hardware version of multiprocessor clusters. Fault-tolerant systems typically use two or more redundant processors and rely heavily on software to enhance performance or manage any system faults or failures. The software is often complex, and the OS and applications are custom designed for the hardware platform. These systems are often found in telecom and plant operations, where high reliability and availability are necessary. Such systems can self-correct software process failures, or automatically fail over to another processor if a hardware or software failure is catastrophic. Usually, alarms are generated to alert personnel for assistance or repair, depending on the failure. In general, these systems are often expensive, requiring significant upfront capital costs, and are less scalable than multiprocessor systems. Fault-tolerant platform technology is discussed in more depth in a later chapter of this book.

• Server clusters are a low-cost and low-risk approach to providing performance and reliability [9]. Unlike a single, expensive multiprocessor or fault-tolerant system, these clusters are comprised of two or more less expensive servers that are joined together using conventional network technology. Nodes (servers) can be added to the network as needed, providing the best scalability. Large server clusters typically operate using a shared-nothing strategy, whereby each node processor has its own exclusive storage, memory, and OS (see the sketch following this list). This avoids the memory and I/O bottlenecks that are sometimes encountered using shared strategies. However, shared-nothing strategies must rely on some form of mirroring or networked storage to establish a consistent view of transaction data upon failure.
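As promised in the server-cluster bullet above, here is a rough sketch of the shared-nothing idea. It is illustrative only: the hash-based ownership scheme and node names are assumptions rather than the book's design, and a real deployment would still need the mirroring or networked storage the text mentions to survive an owner's failure.

```python
# Illustrative shared-nothing sketch: every data key is owned by exactly one
# node, and each node keeps that data in its own private store. Nothing is
# shared, so there is no common memory or I/O bottleneck.

import hashlib

NODES = ["node-1", "node-2", "node-3"]            # hypothetical node names
PRIVATE_STORES = {node: {} for node in NODES}     # one exclusive store per node

def owner_of(key: str) -> str:
    """Map a key to its owning node with a stable hash."""
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

def write(key: str, value: str) -> None:
    PRIVATE_STORES[owner_of(key)][key] = value    # only the owner ever writes it

def read(key: str) -> str:
    return PRIVATE_STORES[owner_of(key)][key]

write("account:1001", "balance=250")
print(owner_of("account:1001"), read("account:1001"))
```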
The following are some broad classes of cluster services that are worth noting. Each can be realized using combinations or variations of the cluster configurations and technologies just discussed, and each successive class builds on the previous with regard to capabilities:

• Administrative clusters are designed to aid in administering and managing nodes running different applications, not necessarily in unison. Some go a step further by integrating different software packages across different nodes.

• High-availability clusters provide failover capabilities. Each node operates as a single server, with its own OS and applications. Each node has another node that is a replicate image, so that if it fails, the replicate can take over. Depending on the level of workload and the desired availability, several failover policies can be used. Hot and cold standby configurations can be used to ensure that a replicate node is always available to adequately assume another node's workload. Cold standby nodes require extra failover time to initialize, while hot standby nodes can assume processing with little, if any, delay. In cases where each node is processing a different application, failover can be directed to the node that is least busy.

• High-performance clusters are designed to provide extra processing power and high availability [10]. They are used quite often in high-volume and high-reliability processing, such as telecommunications or scientific applications. In such clusters, application workload is spread among the multiple nodes, either uniformly or on a task-specific basis. They are sometimes referred to as parallel application or load balancing clusters. For this reason, they are often found to be the most reliable and scalable configurations.

A prerequisite for high-availability or high-performance clusters is access to the same data, so that transactions are not lost during failover. This can be achieved through many of the storage techniques that are described in the chapter on storage. Use of mirrored disks, redundant array of independent disks (RAID), or networked storage not only enables efficient data sharing, but also eliminates single points of failure. Dynamic load balancing is also used to redistribute workload among the remaining nodes if a node fails or becomes isolated. Load balancing is discussed further in this chapter.

6.1.2 Cluster Resources

Each node in a cluster is viewed as an individual system with a single image. Clusters typically retain a list of member nodes among which resources are allocated. Nodes can take on several possible roles, including the primary, secondary, or replicate roles that were discussed earlier. Several clusters can operate in a given environment if needed, with nodes pooled into different clusters. In this case, nodes are kept aware of nodes and resources within their own cluster and within other clusters as well [11].

Many cluster frameworks use an object-oriented approach to operate clusters. Objects can be defined that are comprised of physical or logical entities called resources. A resource provides certain functions for client nodes or other resources. Resources can reside on a single node or on multiple nodes. Resources can also be grouped together in classes so that all resources in a given class respond similarly upon a failure. Resource groups can be assigned to individual nodes. Recovery configurations, sometimes referred to as recovery domains, can be specified to arrange objects in a certain way in response to certain situations. For example, if a node fails, a domain can specify to which node the resources or a resource group's work should be transferred.
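As a loose model of resources, resource groups, and recovery domains (illustrative only; the group names, node names, and preference ordering are assumptions, not any framework's actual object model), a recovery domain can be expressed as a preferred hosting order for each resource group:

```python
# Illustrative sketch: resource groups and a recovery domain. The domain
# lists, for each resource group, the nodes that may host it in order of
# preference; on a failure, the group moves to the first surviving candidate.

RESOURCE_GROUPS = {
    "web-group": ["virtual-ip", "web-server", "session-cache"],
    "db-group":  ["shared-volume", "database", "listener"],
}

RECOVERY_DOMAIN = {              # preferred hosting order per group (assumed)
    "web-group": ["node-1", "node-2", "node-3"],
    "db-group":  ["node-2", "node-3", "node-1"],
}

def place_groups(healthy_nodes):
    """Return which node should host each resource group right now."""
    placement = {}
    for group, candidates in RECOVERY_DOMAIN.items():
        placement[group] = next(
            (node for node in candidates if node in healthy_nodes), None
        )
    return placement

print(place_groups({"node-1", "node-2", "node-3"}))  # normal operation
print(place_groups({"node-2", "node-3"}))            # node-1 failed: web-group moves to node-2
```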
6.1.3 Cluster Applications

For a node to operate in a cluster, the OS must have a clustering option. Furthermore, many software applications require modifications to take advantage of clustering. Many software vendors offer special versions of their software that are cluster aware, meaning that they are specifically designed to be managed by cluster software and to operate reliably on more than one node. Cluster applications are usually those that have been modified to fail over through the use of scripts. These scripts are preconfigured procedures that identify backup application servers and convey how they should be used for different types of faults. Scripts also specify the transfer of network addresses and ownership of storage resources. Because failover times between 30 seconds and 5 minutes are often quoted, it is not uncommon to restart an application on a node for certain types of faults, versus failing over to another processor and risking transaction loss.

High-volume transaction applications, such as database or data warehousing and Web hosting, are becoming cluster aware. Clusters enable the scaling that is often required to reallocate application resources depending on traffic intensity. They have also found use in mail services, whereby one node synchronizes account access utilization by the other nodes in the cluster.

6.1.4 Cluster Design Criteria

Cluster solutions vary radically among vendors. When evaluating a clustered solution, the following design criteria should be applied (a configuration sketch follows this list):

• Operating systems. This entails which OSs can be used in conjunction with the cluster and whether different versions of the OS can operate on different nodes. This is critical because an OS upgrade may entail having different versions of an OS running in the cluster at a given moment.

• Applications. The previous discussion highlighted the importance of cluster-aware applications. In the case of custom applications, an understanding of what modifications are required needs to be developed.

• Failover. This entails to what extent failover is automated and how resources are dynamically reallocated. Expected failover duration and user transparency to failovers need to be understood. Furthermore, the expected performance and response following a failover should be known.

• Nodes. The number of nodes should be specified so as to minimize the impact of a single node outage. An N + 1 approach is often a prudent one, but it can result in the higher cost of an extra, underutilized cluster node. A single system image (SSI) approach to clustering allows the cluster nodes to appear and behave as a single system, regardless of their quantity [12].

• Storage. Cluster nodes are required to share data. Numerous storage options and architectures are available, many of which are discussed in the chapter on storage. Networked storage is fast becoming a popular solution for nodes to share data through a common mechanism.

• Networking. Cluster nodes must communicate with each other and with nodes external to the cluster. Separate dedicated links are often used for the nodes to transmit heartbeat messages to each other [13].
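The configuration sketch referenced above expresses these criteria as a simple declarative description. It is illustrative only: the field names, default values, and checks are assumptions and do not correspond to any vendor's actual configuration format.

```python
# Illustrative sketch: capturing cluster design criteria as a declarative
# description that can be sanity-checked. Field names and values are assumed
# for illustration; they do not match any particular cluster product.

from dataclasses import dataclass, field

@dataclass
class ClusterDesign:
    os_versions: list = field(default_factory=list)   # versions allowed to coexist mid-upgrade
    cluster_aware_apps: list = field(default_factory=list)
    automated_failover: bool = True
    target_failover_seconds: int = 60
    node_count: int = 3
    spare_nodes: int = 1                               # the "+1" in an N + 1 design
    shared_storage: str = "networked"                  # e.g. mirrored, RAID, networked
    dedicated_heartbeat_network: bool = True

    def check(self):
        problems = []
        if self.spare_nodes < 1:
            problems.append("no spare capacity: a single node outage reduces service")
        if not self.dedicated_heartbeat_network:
            problems.append("heartbeats share the user network and may be lost under load")
        return problems

design = ClusterDesign(os_versions=["v5.1", "v5.2"],
                       cluster_aware_apps=["orders-db", "web-frontend"])
print(design.check() or "design meets the basic criteria")
```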
6.1.5 Cluster Failover

Clusters are designed such that multiple nodes can fail without bringing down the entire cluster. Failover is a process that occurs when a logical or physical cluster component fails. Clusters can detect when a failure occurs or is about to occur, and location and isolation mechanisms can typically identify the fault. Failover is not necessarily immediate, because a sequence of events must be executed to transfer workload to other nodes in the cluster. (Manual failover is often done to permit system upgrades, software installation, and hardware maintenance with data and applications still available on another node.) To transfer load, the resources that were hosted on the failed node must transfer to another node in the cluster. Ideally, the transfer should go unnoticed by users.

During failover, an off-line recovery process is undertaken to restore the failed node back into operation. Depending on the type of failure, this can be complex. The process might involve performing additional diagnostics, restarting an application, replacing the entire node, or even manually repairing a failed component within the node. Once the failed node becomes active again, a process called failback moves the resources and workload back to the recovered node.

There are several types of cluster failover (a brief sketch comparing them follows this list), including:

• Cold failover. This is when a cluster node fails, another idle node is notified, and applications and databases are started on that node. This is typically viewed as a slow approach and can result in service interruption or transaction loss. Furthermore, the standby nodes are not fully utilized, making this a more expensive approach.

• Warm failover. This is when a node fails and the other node is already operational, but operations must still be transferred to that node.

• Hot failover. This is when a node fails and the other node is prepared to serve as the production node. The other node is already operational with application processing and access to the same data as the failed node. Often, the secondary node is also a production server and can mirror the failed server.
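The practical difference between these failover types is how much work remains to be done at the moment of failure. The sketch below is illustrative only; the step names are assumptions rather than any vendor's documented procedure.

```python
# Illustrative sketch: the work left to perform at failover time for each
# standby mode. Hot standby has the least to do, which is why it fails over
# with little, if any, delay; cold standby must build everything from scratch.

REMAINING_STEPS = {
    "cold": ["boot standby node", "mount shared storage", "start applications",
             "recover in-flight transactions", "redirect clients"],
    "warm": ["mount shared storage", "start applications", "redirect clients"],
    "hot":  ["redirect clients"],      # applications and data access already live
}

def fail_over(mode):
    for step in REMAINING_STEPS[mode]:
        print(f"[{mode}] {step}")      # in practice each step is a scripted action

fail_over("hot")
```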
Several activities occur to implement a complete failover process. The following is a general description of the types of events that take place; the process varies widely by the type of cluster, cluster vendor, applications, and OS involved (a combined detection and address-takeover sketch follows this list):

• Detection. Detection is the ability to recognize a failure. A failure that goes undetected for a period of time could result in a severe outage. A sound detection mechanism should have wide fault coverage so that faults can be detected and isolated, either within a node or among nodes, as early as possible; the ability of a system to detect all possible failures is measured by its fault coverage. Failover management applications use a heartbeat process to recognize a failure. Monitoring is achieved by sending heartbeat messages to a special monitoring application residing on another cluster node or an external system. Failure to detect consecutive heartbeats results in declaration of a failure and initiation of a failover process. Heartbeat monitoring should not only test for node failure but should also test internode communication. In addition to the network connectivity used to communicate with users, typically Ethernet, some clusters require a separate heartbeat interconnect to communicate with other nodes.

• Networking. A failover process typically requires that most or all activity be moved from the failed node to another node. Transactions entering and leaving the cluster must then be redirected to the secondary node. This may require the secondary node to assume the IP address and other relevant information in order to immediately connect users to the application and data, without reassigning server names and locations in the user hosts. If a clustering solution supports IP failover, it will automatically switch users to the new node; otherwise, the IP address needs to be reallocated to the backup system. IP failover in many systems requires that both the primary and backup nodes be on the same TCP/IP subnet. However, even with IP failover, some active transactions or sessions at the failed node might time out, requiring users to reinitiate requests.

• Data. Cluster failover assumes that the failed node's data is accessible by the backup node. This requires that data between the nodes is shared, reconstructed, or transferred to the backup node. As in the case of heartbeat monitoring, a dedicated shared disk interconnect is used to facilitate this activity. This interconnect can take many forms, including a shared disk or disk array and even networked storage (see Section 6.1.7). Each cluster node will most likely have its own private disk system as well. In either case, nodes should be provided access to the same data, but they need not necessarily share that data at any single point in time. Preloading certain data in the cache of the backup nodes can help speed the failover process.

• Application. Cluster-aware applications are usually the beneficiary of a failover process. These applications can be restarted on a backup node. They are designed so that any cluster node can resume processing upon direction of the cluster-management software. Depending on the application's state at the time of failure, users may need to reconnect or may encounter a delay between operations. Depending on the type of cluster configuration in use, performance degradation in data access or application access might also be encountered.
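The combined sketch referenced above ties the detection and networking steps together: a monitor that declares a failure after several consecutive missed heartbeats and then claims the failed node's service address. It is illustrative only; the addresses, port, interface name, and thresholds are assumptions, and it presumes a Linux backup node on the same subnet with the iproute2 `ip` and iputils `arping` utilities available, since the text notes that IP failover usually requires a shared subnet.

```python
# Illustrative sketch, not a production failover manager. Assumes UDP
# heartbeats from the primary on port 9999 arriving over a dedicated
# interconnect, and a Linux backup node that can claim the service address.

import socket
import subprocess

HEARTBEAT_PORT = 9999          # assumed heartbeat port
MISS_LIMIT = 3                 # consecutive misses before declaring failure
INTERVAL_SECONDS = 2.0
SERVICE_IP = "10.0.0.100/24"   # virtual IP that clients use (assumed)
INTERFACE = "eth0"             # interface on the backup node (assumed)

def take_over_service_address():
    """Claim the service IP and announce it so clients re-learn the MAC."""
    subprocess.run(["ip", "addr", "add", SERVICE_IP, "dev", INTERFACE], check=False)
    subprocess.run(["arping", "-U", "-c", "3", "-I", INTERFACE,
                    SERVICE_IP.split("/")[0]], check=False)   # gratuitous ARP

def monitor():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", HEARTBEAT_PORT))
    sock.settimeout(INTERVAL_SECONDS)
    misses = 0
    while misses < MISS_LIMIT:
        try:
            sock.recvfrom(1024)        # any datagram counts as a heartbeat
            misses = 0
        except socket.timeout:
            misses += 1
    print("primary declared failed; starting failover")
    take_over_service_address()
    # ...followed by mounting storage and starting cluster-aware applications.

if __name__ == "__main__":
    monitor()
```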
6.1.6 Cluster Management

Although clusters can improve availability, managing and administering a cluster can be more complex than managing a single system. Cluster vendors have addressed this issue by enabling managers to administer the entire cluster as a single system rather than as several systems. However, management complexity still persists in several areas:

• Node removal. Clustering often allows deactivating a node or changing a node's components without affecting application processing. In heavy load situations, and depending on the type of cluster configuration, removal of a cluster node could overload those nodes that assume the removed node's application processing. The main reason for this is that there are fewer nodes and resources to sustain the same level of service as before the removal. Furthermore, many users may attempt to reconnect at the same time, overwhelming a node. Mechanisms are required to ensure that only the most critical applications and users are served following the removal. Some cluster solutions provide the ability to preconnect users to the backup by creating all of the needed memory structures beforehand.

• Node addition. In most cases, nodes are added to an operational cluster to restore a failed node to service. When the returned node is operational, it must be able to rejoin the cluster without disrupting service or requiring the cluster to be momentarily taken out of operation.

• OS migration. OS and cluster software upgrades will be required over time. If a cluster permits multiple versions of the same OS and cluster software to run on different nodes, then upgrades can be made to the cluster one node at a time. This is often referred to as a rolling upgrade. This capability minimizes service disruption during the upgrade process.

• Application portability. Porting cluster applications from one node to another is often done to protect against failures. Critical applications are often spread among several nodes to remove single points of failure.

• Monitoring. Real-time monitoring usually requires polling, data collection, and measurement features to keep track of conditions and changes across nodes. Each node should maintain status on the other nodes in the cluster and should be accessible from any node. By doing so, the cluster can readily reconfigure to changes in load. Many cluster-management frameworks enable the administration of nodes, networks, interfaces, and resources as objects. Data collection and measurement are done on an object basis to characterize their status. Management is performed by manipulation and modification of the objects.

• Load balancing. In many situations, particularly clustered Web servers, traffic must be distributed among nodes in some fashion to sustain access and performance. Load balancing techniques are quite popular with clusters and are discussed further in this chapter.

6.1.7 Cluster Data

Data access can be a limiting factor in cluster implementation. Limited storage capacity as well as interconnect and I/O bottlenecks are often blamed for performance and operational issues. The most successful cluster solutions are those that combine cluster-aware databases with high-availability platforms and networked storage solutions. Shared disk cluster approaches offer great flexibility, because any node can access any block of data. However, only one node can write to a block of data at any given time. Distributed locking […]
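The single-writer rule just described is typically enforced by a distributed lock manager. The following is a minimal, illustrative sketch of the idea; the class and method names are assumptions and bear no relation to any specific product, and a real lock manager would itself be replicated and fault tolerant.

```python
# Illustrative sketch of distributed locking for a shared-disk cluster:
# any node may read a block, but only the node holding the write lock may
# modify it at a given time.

import threading

class BlockLockManager:
    def __init__(self):
        self._owners = {}                    # block id -> node currently allowed to write
        self._guard = threading.Lock()

    def acquire_write(self, block_id, node):
        with self._guard:
            if self._owners.get(block_id, node) != node:
                return False                 # another node is writing this block
            self._owners[block_id] = node
            return True

    def release_write(self, block_id, node):
        with self._guard:
            if self._owners.get(block_id) == node:
                del self._owners[block_id]

manager = BlockLockManager()
print(manager.acquire_write("block-42", "node-1"))   # True: node-1 may write
print(manager.acquire_write("block-42", "node-2"))   # False: must wait for release
manager.release_write("block-42", "node-1")
print(manager.acquire_write("block-42", "node-2"))   # True now
```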
[…] geographically distributed network of servers. A CDN is comprised of a network of servers, caches, and routing devices [41]. In the case of Web access, ISP networks will pull content to the CDN cache server closest to the users accessing the Web site (Figure 6.14). Public CDN providers offer their CDN services to enterprise customers, primarily ISPs or content providers. They can […] during routine maintenance.

There are several ways to classify load balancers. Two classifications are illustrated in Figure 6.4. One class consists of network load balancers, which distribute network traffic, most commonly TCP/IP traffic, across multiple ports or host connections using a set of predefined rules. Network load balancers originated […] ways. For example, load balancers can be used in conjunction with wide area clusters to direct requests to the cluster closest to the user.

[Figure 6.4 Network and component load balancing examples: a load balancer distributing network traffic across a cluster/server farm, and component load balancing within the cluster.]

Not only does balancing help control cluster performance, it also enables the addition […] can be configured among sites depending on organizational need. For networking, there are several options. Previously, a technique called data link switching (DLSw), a means of tunneling SNA traffic over an IP network, was used for recovery and network failures. Recently, enterprise extender functions have been introduced that convert SNA network transport to IP. The systems can use a virtual IP address […] centrally from devices that reside outside a network, which intercept requests for content prior to reaching a firewall (Figure 6.6) and direct those requests to an appropriate location. It works best when caches are distributed throughout a network, but each destination does not have a dedicated cache. Global redirection has become an integral part of mission-critical networking solutions for data-center solutions […]

Last, they can simplify network topology, especially if the server is used for other functions. The servers are typically equipped with dual network adapters: one that connects to a router and another that connects to a switch or hub that interconnects with other servers. Switch-based balancers are just that: a network switch or router […]

6.3.4 Web Site Recovery Management

Web sites are gradually being subjected to higher availability requirements. Many e-businesses, as well as established physical ones, depend on them for survival. Problems can arise with site servers, the applications, or the network. Much of this book is devoted to discussion of problems with networks and servers. From the previous […] buying or building network access from the hosting site to the POP site. As one can see, the points of failure or congestion are numerous, and an outage at a CO or POP site can be catastrophic.

[Figure 6.11 User ISP access architecture example: users reach a hosting center through ISP POPs and the Internet, via PSTN dial-up (Class 5 switch, modems, remote access server) or DSL (DSLAM).]

Access from […] each ISP POP should reside on different network backbones and connect through different Internet peering points. Some large ISP carriers will also have transit arrangements with smaller ISPs, which are charged for terminating their traffic over the larger ISP's network. Many times, local ISPs may not […] avoids having the cache device second-guess the freshness of an object. Another approach uses edge side includes (ESIs), which are Web page fragments that are cached and later included with other dynamic objects when the Web site is accessed [36].

6.4.1 Types of Caching Solutions

Caching solutions can come in many forms. Server, appliance, […]
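Tying back to the cache-freshness point above, a cache can avoid second-guessing an object's freshness by honoring an explicit time-to-live attached when the object is stored. The sketch below is illustrative only; the TTL value and the origin-fetch callable are assumptions.

```python
# Illustrative sketch: a cache that serves an object only while its
# time-to-live (TTL) is valid, then refetches from the origin server,
# so the cache never has to guess at freshness on its own.

import time

class TTLCache:
    def __init__(self, fetch_from_origin, ttl_seconds=60):
        self._fetch = fetch_from_origin          # callable returning fresh content
        self._ttl = ttl_seconds
        self._store = {}                         # key -> (content, expiry timestamp)

    def get(self, key):
        content, expires = self._store.get(key, (None, 0.0))
        if time.time() < expires:
            return content                       # still fresh: serve from cache
        content = self._fetch(key)               # stale or missing: go to origin
        self._store[key] = (content, time.time() + self._ttl)
        return content

# Hypothetical origin fetch, for demonstration only.
cache = TTLCache(lambda key: f"<html>page for {key}</html>", ttl_seconds=30)
print(cache.get("/products"))    # first request goes to the origin
print(cache.get("/products"))    # second request within 30 s is served from cache
```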

Contents

• CHAPTER 6: Processing, Load Control, and Internetworking for Continuity
  • 6.1 Clusters
    • 6.1.1 Cluster Types
    • 6.1.2 Cluster Resources
    • 6.1.3 Cluster Applications
    • 6.1.4 Cluster Design Criteria
    • 6.1.5 Cluster Failover
    • 6.1.6 Cluster Management
    • 6.1.7 Cluster Data
    • 6.1.8 Wide Area Clusters
  • 6.2 Load Balancing
    • 6.2.1 Redirection Methods
    • 6.2.2 DNS Redirection
    • 6.2.3 SSL Considerations
    • 6.2.4 Cookie Redirection
    • 6.2.5 Load Balancer Technologies
    • 6.2.6 Load Balancer Caveats
  • 6.3 Internetworking
    • 6.3.1 Web Site Performance Management
    • 6.3.2 Web Site Design
    • 6.3.3 Web Services
    • 6.3.4 Web Site Recovery Management
    • 6.3.5 Internet Access
