Networking: A Beginner’s Guide

Chapter 12: Network Disaster Recovery

Network servers contain vital resources for a company, in the form of information, knowledge, and the invested work product of the company’s employees. Most companies, if suddenly and permanently deprived of these resources, would not be able to continue their business uninterrupted and would face losing millions of dollars, both in the form of lost data and the effects of that loss. Therefore, establishing a network disaster recovery plan and formulating and implementing the network’s backup strategy are the two most important jobs in network management.

In this chapter, you learn about the issues that you should address in a disaster recovery plan, and also about network backup strategies and systems. Before getting into these topics, however, you should read about the City of Seattle’s disaster recovery experiences.

Notes from the Field: The City of Seattle

The technical editor of the first through third editions of this book, Tony Ryan, had a personal experience with network disaster recovery. Tony worked in the IT department for the City of Seattle. On February 28, 2001, Seattle experienced an earthquake that put the city’s disaster recovery plans to the test. What follows is Tony’s discussion of Seattle’s disaster recovery operations and how the city handled the problems that occurred in the wake of the earthquake. This is an excellent example of why you need a disaster recovery plan that encompasses all possible events that could occur during a disaster.

Notes on the Seattle 2001 Earthquake and Its Disaster Recovery
By Tony Ryan

Seattle has seen some very unusual and attention-grabbing events over the past few years. Notable among them were the World Trade Organization (WTO) conference of 1999 and the violent demonstrations that accompanied it, which were broadcast worldwide on television and the Internet. Also, riots broke out during Mardi Gras celebrations in 2000.
However, nothing compared to the potential and realized damage wrought by the 6.8 magnitude earthquake that struck Wednesday, February 28, 2001.

The EOC Situation

The City of Seattle has an Emergency Operations Center, or EOC, which is activated during any event or crisis that has a potential impact on public safety, or that might otherwise affect any number of services provided by the city to its citizens. Sometimes the EOC can be activated ahead of time; for example, for the Y2K event and the anniversary of the WTO demonstrations. Looking at the preparation made for those events and comparing it to what happens during unplanned events such as the earthquake helps to illustrate some important principles about IT disaster recovery and disaster preparedness.

Never Assume

During the preparation for Y2K, members of my staff were asked to augment the staff normally assigned to support the EOC’s desktop PCs, laptop PCs, and printers. The staff members who normally support the EOC are from a different IT organization than ours, and as can be expected, their way of doing things differed from ours for a number of valid reasons. However, once my staff members had a chance to look at the EOC’s environment, they were able to share some new perspectives and methods that were welcomed and adopted by the EOC support staff, and all involved came away with a new idea of what would be expected as the “standard” way of configuring EOC PCs. Examples ranged from hard-coding certain models of PC network interface cards (NICs) to run better on the switches in their wiring closet to developing and implementing a base image for all the laptops to be deployed in the building. The Y2K event, as a result, was lauded as an example of ideal cooperation between IT groups and excellent preparation overall. It was a very calm Saturday morning!

Change Management?

Between events, however, there was a great deal of time and opportunity for things to change.
The facility might have been used for other business purposes; equipment such as laptops might have been loaned out, or customers could have come in and used the equipment; and other IT groups besides ours might have assisted the staff and made alterations to the configurations that went undocumented or were not communicated to all involved.

The Results

Whatever it was that might have happened remains unknown. What we did discover following the earthquake was that when customers who normally use the EOC in emergency situations went to use the equipment, in some cases the machines did not work as expected. Software could not be loaded on this PC; that laptop would not connect to the network anymore; some PCs were not the same or had been swapped for machines with less-powerful processors. Things had changed, and the result was that some of the emergency work that IT professionals, such as web support technicians, had to perform took more time than we had anticipated. Ironically, the Web played a crucial role in our overall communications “strategy.” The impact of that equipment not immediately working was not yet evident; however, the following events illustrate what that impact might have been.

A few minutes after the earthquake struck, several of the downtown buildings in which Seattle employees worked were evacuated due to fear of structural damage. No one was injured, and amazingly only two keyboards were broken throughout all the buildings in which we provide support. But imagine a couple thousand very frightened and concerned people streaming onto the sidewalks and streets, flooding cellular telephone networks in frantic attempts to contact loved ones, and looking for any possible focus for communication, especially managers such as myself and other supervisory staff, all possessing varying levels of training in disaster preparedness.
Luckily, the mayor’s office had sent representatives to the gathering sites designated for staff to walk to in such events, and they informed everyone in the core buildings that were directly affected that they were to go home. Along with that announcement, the CTO told everyone to “check the Web” for information, meaning the city’s internal web site. But what if the EOC PC had been swapped out (let’s say) for a Pentium 133 with 64MB of RAM, and that PC could not run Microsoft’s FrontPage 2000? If that web site had to be updated with news and official information on a routine basis, the results could have been at best inconvenient and confusing.

Contingency and Costs

Because we are a publicly funded entity, we are very careful about how we spend our customers’ money, as it is subject to great scrutiny (and rightfully so). Customers often do not have the funds to afford both modern PC equipment to run the latest version of Windows and a spare PC to sit in the closet, “just in case.” After the earthquake, a couple of buildings were temporarily unavailable for occupancy until inspectors had a chance to examine the damage to see if the buildings were safe for employees. One of those buildings actually houses a lot of our IT staff, and as a result, not only were we trying to find “spare PCs” for our customers to use (while they looked for office space), but as IT support staff, we found ourselves doing the same thing. The direct impact was that we found it difficult in a few cases to support our customers as quickly as our service-level agreements (SLAs) required, especially since we could not immediately reenter our building to gather our PCs or other necessary equipment.

Lesson Learned: Keep Spares … At Least a Few

So it seems that you either pay up front or pay later. It makes sense to keep a percentage of PCs available for these rainy-day events; 10 to 15 percent of replaceable inventory should work.
Consider that businesses of any kind are obligated in such situations to perform a kind of “triage” as to which of their business functions are most critical and which can be postponed until their entire stock of equipment can be reconnected or replaced, and 10 to 15 percent is justified.

Have a Plan for Communications and How You Will Communicate

Following the CTO’s announcement, some asked, “What about those who don’t have web access at home?” As IT staff, we asked, “What if the web servers themselves had all been destroyed?” (In fact, ceiling debris in the room in which they were housed fell very close to them, but the servers were not damaged and the service was never down.) Still others asked, “What about those who missed the message and don’t know to check the Web?” These questions, as well as “What to do in the event of …?” could be addressed with a clear, ever-ready communications plan. Ironically, such plans had been developed down to the last detail for other events, but in the case of a real “emergent” event, we as a department had not identified a plan to follow. A priority for our department now is to reexamine that situation and develop a plan, using communications plans developed for the Y2K event and the like as models.

Another point: As previously mentioned, our staff is not responsible for supporting the EOC on a routine basis. We are more than happy to be directed to assist in that support, and as evidenced, have done so on a few occasions. Almost immediately following the earthquake, I received a page indicating that I was to dispatch technicians to the EOC to support the city officials who report there during emergencies. While our team was under no agreement with the EOC to provide support even “on demand,” I immediately asked two of my senior technicians, who had worked at the EOC in the past, to respond. They reported for duty there and supported the facility until the assigned staff arrived.
There was never a doubt that we would pitch in whenever asked, but I made it a point to ask our divisional director if developing some clearer expectations, or even an SLA, between our staff and the EOC would be appropriate, and he agreed. I did find out that those in the EOC are granted power by legislation to use “all” city resources in the event of an emergency, but a clear agreement could also permit me to identify a rotating on-call staff person who could be proactive and call the EOC in such instances.

I must point out that none of these preparations can substitute for dedicated, intelligent people. The shining example is one of my technicians who supports the programmers responsible for the city’s payroll application. He had the presence of mind to come to work early the day after the quake, and he somehow persuaded the construction crew and inspectors to permit him access to the building. He walked up 13 flights of stairs, picked up a PC and peripherals, carried them back down the stairs and to another building, and configured the PC to work on the network segment in the new building. This made it possible for the programmer to run the operations necessary for the city’s payroll run that weekend, and employees received their checks on time, as expected. You cannot ask for more than that.