162 Networking: A Beginner’s Guide

Disaster Recovery Plans

A disaster recovery plan is a document that explores how a network recovers from a disaster that either imperils its data or stops its functioning. A company’s external financial auditors often require annual disaster recovery plans, because of the data’s importance to the business and the effect that a network failure would have on the company. Disaster recovery plans are also important because they force the manager of the network to think through all possible disaster scenarios. By taking these scenarios into account, the manager can make more effective plans to protect the network’s data from loss and to restore full operations of the business as quickly as possible. As mentioned at the beginning of this chapter, planning for disaster recovery and managing the company’s backup systems are a network manager’s two most important jobs.

Most companies do not have extremely long disaster recovery plans. For a single network of up to several hundred nodes and 15 or so servers, such a plan usually runs 10 to 20 pages or fewer, although its length varies depending on the complexity of the company’s network operations. Fortune 500 companies, for instance, may have disaster recovery plans that are several hundred pages long, when all sites are considered in aggregate.

One strategy to keep disaster recovery plans concise and to maximize their usefulness is to focus on problems that, while remote, are at least somewhat likely to occur. Alternatively, you can focus on disaster results (what happens) rather than trying to cover disaster causes (why it happened). Focusing your plan on disaster results means contemplating problems such as loss of a single server, loss of the entire server room, loss of all of the customer service workstation computers, and so forth, without worrying about the possible disasters that might cause those results.
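The idea of planning around results rather than causes can be sketched in a few lines of Python. This is only an illustration: the cause names, the mapping, and the function group_by_result are hypothetical and are not taken from any real plan. The point is that many possible causes collapse into a handful of results, and the plan needs one section per result.

```python
from collections import defaultdict

# Hypothetical cause -> result pairs, for illustration only.
# The three results are the ones named in the text above.
CAUSE_TO_RESULT = {
    "fire in the server room": "loss of the entire server room",
    "flood from fire sprinklers": "loss of the entire server room",
    "structural building failure": "loss of the entire server room",
    "motherboard failure": "loss of a single server",
    "two disks failing at once": "loss of a single server",
    "water leak over customer service": "loss of customer service workstations",
}


def group_by_result(cause_to_result):
    """Collapse many possible causes into the few results a plan must cover."""
    grouped = defaultdict(list)
    for cause, result in cause_to_result.items():
        grouped[result].append(cause)
    return dict(grouped)


plan_sections = group_by_result(CAUSE_TO_RESULT)
for result, causes in plan_sections.items():
    print(f"{result}: covers {len(causes)} possible cause(s)")
```

In this sketch, six causes reduce to three plan sections, which is why a result-focused plan can stay short.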
The following sections discuss the minimum key issues that a disaster recovery plan should address. Depending on your own company, your plan may need to address additional issues.

Assessing Disaster Recovery Needs

Before drafting the actual plan, you should first assess the needs that the plan must meet. These needs will vary depending on who requires input into the disaster recovery planning process and what issues these people want the plan to address. Consider these types of needs:

• Formally planning for contingencies, ensuring that all possible disasters have been considered, and defining countermeasures in the plan
• Assuring the company’s external accounting auditors that the company has considered and developed plans to handle disasters
• Informing the company’s top management about the risks that exist for the network and its data in different situations, and how much time you expect to need to resolve any problems that occur
• Soliciting input from top management of the company as to recovery priorities and acceptable minimum requirements to reestablish services

163 Chapter 12: Network Disaster Recovery

• Formally planning, together with the key areas of your company’s business (for example, manufacturing, customer service, and sales), for different types of computer-related disasters or serious problems
• Assuring customers of the firm that the firm’s data operations are safe from disaster

Identifying these needs will not only give you a clear vision of what the plan must address, but also show you which other people from the different parts of the company should be involved in the planning process.

Considering Disaster Scenarios

You should start your planning process by considering different possible disaster scenarios. For example, consider the following disasters:

• A fire in your server room, or somewhere else in the building, destroys computers and tapes.
• Flooding destroys computers and backup batteries that sit low enough to the server room floor to be affected. Remember that floods may be caused by something within the building itself, such as a bad water leak in a nearby room or a fire that activates the fire sprinklers.
• An electrical problem of some kind causes power to fail.
• Some problem causes total loss of connectivity to the outside world. For example, a critical wide area network (WAN) or Internet link may go down.
• A structural building failure of some kind affects the network or its servers.
• Any of the preceding problems affects computers elsewhere in the building that are critical to the company’s operations. For example, such an event may happen in the manufacturing areas, in the customer service center, or in the telephone system closet or room.

While none of these events is very likely, it is still important to consider them all. The whole point of disaster recovery planning is to prevent or minimize serious losses, and the process is much less useful if you consider only those disasters that you think are the most likely.

After considering disasters such as those mentioned, you should next consider serious failures that could also affect the operations of the network. Here are some examples:

• The motherboard in your main server fails, and the vendor cannot get a replacement to you for three or more days.
• Disks in one of your servers fail in such a way that data is lost. If you are running some kind of redundant array of independent disks (RAID) scheme (discussed in Chapter 13), plan for failures that are worse than the RAID system can protect against. For example, if you use RAID 1 mirrored drives, plan for both sides of the mirror to fail in the same time frame. If you are using RAID 5, plan for any two drives failing at the same time.
• Your tape backup drive fails and cannot be repaired for one to two weeks.
While this doesn’t cause a loss of data in and of itself, it certainly increases your exposure to such an event.

You should plan how you would respond to these and any other possible failures. If the motherboard in your main server fails, you may want to move its drives to a compatible computer temporarily. To address disk failure, you should design a plan under which you can rebuild the disk array and restore data from your backups as rapidly as possible. Regarding your tape backup drive, you will likely want to find out how quickly you can acquire an equivalent drive or whether the maker of the tape drive can provide reconditioned replacement drives quickly in exchange for your failed drive. For all of these failures, you will also want to consider the cost of keeping spare parts, or even entire backup servers, available so that you can restore operations as rapidly as possible.

You should consider and investigate all of the following types of possible responses:

• Should you carry a maintenance contract? If so, make sure you thoroughly understand its guarantees and procedures.
• Should you stock certain types of parts on hand so that they are readily available in case of failure?
• Are other computers available that might work as a short-term replacement for a key server? What about noncomputer components that are important, such as routers, hubs, and switches?
• If you need to take temporary measures, are the affected employees trained to do their jobs with the replacement, or with no system at all, if necessary? For example, if a restaurant’s electronic systems are down, can the restaurant (and the food servers, kitchen staff, cashiers, and so on) still operate the business manually until the system is repaired?
• Should you maintain a cold or hot recovery site? A “cold” recovery site is a facility maintained by your company and near the protected data center.
The cold site has all of the power, air conditioning, and other facility features needed to host your site should the data center experience some disaster. A “hot” site is the same as a cold site, except that it also has all of the necessary computer equipment and software to duplicate the processing of the data center. Hot sites usually synchronize their data on a real-time basis with the main processing center, so that they can take over the work of the main site in seconds. Companies with very sensitive, mission-critical data operations often maintain cold or hot recovery sites.

The process of considering possible problems, such as disasters or failures of key pieces of equipment, and then making plans for handling them is certainly the meat of disaster recovery planning. However, your written plan should also discuss or address other issues, which are covered in the following sections.

Handling Communications

An important part of any disaster recovery plan concerns how you will handle communications. Without effective communications, your attempts at handling the disaster will be hampered, and other people will not be able to do their jobs as well as they might otherwise.

Start by listing all of the different parties who may need to be notified of a problem, its progress toward resolution, and its final resolution. Your list might look something like this:

• The board of directors
• The chief executive officer or president
• The vice presidents of all areas
• The vice president or head of an affected area
• Your supervisor
• Employees affected by the problem

For each of these parties, and any others you may identify, you next need to consider what level of problem requires their notification. The board of directors, for example, might not need to know about a disaster unless it is likely that it will have a material effect on the company’s performance.
Your supervisor, on the other hand, probably wants to be notified about every problem, and certainly any affected employees need to be notified.

Once you have listed the parties to notify and what they need to be informed about, you should then decide how you will inform them. If you’re the primary person resolving the disaster, it’s best to delegate notification to someone else who is less directly involved so that you can focus on resolving the problem as quickly as possible. For example, you might delegate communication to your supervisor or to an employee in your department who is free to handle the job. Whoever has this job should be clear on the communication procedures and should have access to the necessary contact information (such as home phone numbers, pager numbers, cell phone numbers, and so forth) for situations that require notification after working hours. You may also want to consider setting up a telephone tree for rapid notification. Finally, for your environment and for different types of disasters, you may need to specify the order in which people are notified, which may not match their order in the company’s organization chart. The written disaster recovery plan should include all of the preceding information.

Planning Off-Site Storage

Off-site storage is an important way of protecting some of your backup tapes in the event that a physical disaster, such as a fire, destroys all of your on-site copies. Because off-site storage is such an important aspect of disaster protection, it should be discussed in your disaster recovery plan.

NOTE: If you do not yet have an off-site storage procedure, you should seriously consider adopting one. While fireproof file cabinets can protect tape media from small fires, they are not necessarily invulnerable to very large or hot fires.
Plus, tapes are more sensitive to smoke and heat than the papers that a fireproof file cabinet is designed to protect.

Companies that provide off-site storage of files often also offer standardized tape-storage practices. These usually work on a rotation basis, where a storage company employee comes to your office periodically (usually weekly), drops off one set of tapes, and picks up the next set. The companies typically use stainless steel boxes to hold the tapes, and the network administrator is responsible for keeping the boxes locked and safeguarding the keys.

You need to decide which tapes you should keep on-site and which ones to send off-site. One rule of thumb is always to keep the two most recent complete backups on-site (so that they’re available to restore deleted files for users) and send the older tapes off-site. This way, you keep on hand the tapes that you need on a regular basis, and you minimize your exposure to a disaster. After all, if a disaster destroys your server room and all of the tapes in it, you probably won’t be too worried about losing just a week’s worth of data.

NOTE: The amount of data that you can accept exposing to a disaster will vary widely depending on the nature of your company’s business and the nature of the data. Some operations are so sensitive that the loss of even a few minutes’ worth of data would be catastrophic. For example, a banking firm simply cannot lose any transactions.

Businesses that need to protect supersensitive data sometimes enlist a third-party vendor to provide off-site online data storage. Such a vendor replicates a business’s data onto the vendor’s servers over a high-speed connection, such as a T-1 or T-3. These vendors usually also offer failover services, where their computers can pick up the jobs of your computers should your computers fail.
Alternatively, if a business runs multiple sites, it might set up software and procedures that enable it to accomplish the same services using its own sites.

Describing Critical Components

Your plan should describe the computer equipment and software that will be required to resume operations if the entire building is lost. This list should roughly estimate the cost of the equipment and how it can be procured rapidly. By preparing such a list, you can reduce the time required to resume operations in a temporary facility. Also, if your company purchases insurance against business interruptions, you will need these estimates for that insurance policy.

Network Backup and Restore Procedures

A network disaster recovery plan is worthless without some way of recovering the data stored on the server. This is where network backup and restore procedures come in. If you’re a network administrator, or aspire to become one, you should already know about the importance of good backups of the system and of important data. If you don’t ...