
Unix Backup and Recovery, part 4


Volume Verification

Another often-ignored area of backup and recovery software is its ability to verify its own backups. There are plenty of horror stories out there about people who did backups for months or years assuming that they were working just fine. Then, when they went to read the backup volumes, the backup software told them that it couldn't read them. The only way to ensure that this never happens to you is to run regular verification tests against your media. There are several different types of verification:

Reading part of the volume and comparing it
There is at least one major vendor that works this way. If you turn on media verification, it forwards to the end of the volume and reads a file or two. It compares those files against what it believes should be there. This is obviously the lowest level of verification.

Comparing table of contents to index
This is a step up from the first type of verification. This is the equivalent of doing a tar tvf. It does not verify the contents of the file; it verifies only that the backup software can read the header of the file.

Comparing contents of backup against contents of filesystem
This type of verification is common in low-end PC backup software. Basically, the backup software looks at its backup of a particular filesystem, then compares its contents against the actual contents of the filesystem. Some software packages that do this will automatically back up any files that differ from what is on the backup or that do not exist on the backup. This type of verification is very difficult, since most systems are changing constantly.

Comparing checksum to index
Some backup software products record a checksum for each file that they back up. They are then able to read the backup volume and compare the checksum of the file on the volume with the checksum that is recorded in the index for that file. This makes sure that the file on the backup volume will be readable when the time comes.
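To make the last type of verification concrete, here is a minimal Python sketch of comparing checksums against an index. It is an illustration only, not how any particular backup product works: the function names (file_checksum, build_index, verify_against_index), the choice of SHA-256, and the idea of restoring a volume into a scratch directory before checking it are all assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(root):
    """Record a checksum for every regular file under root.

    Hypothetical stand-in for the index a backup product writes at backup time.
    """
    root = Path(root)
    return {
        str(path.relative_to(root)): file_checksum(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_against_index(restored_root, index):
    """Return the sorted relative paths that are missing or whose checksum differs."""
    restored_root = Path(restored_root)
    failures = []
    for relative_path, expected in index.items():
        candidate = restored_root / relative_path
        if not candidate.is_file() or file_checksum(candidate) != expected:
            failures.append(relative_path)
    return sorted(failures)
```

An empty list from verify_against_index means every indexed file was readable and matched its recorded checksum. Unlike a table-of-contents check, this reads every byte of every file, which is why it takes longer and why it catches damage that a header-only check would miss.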
Verify, Verify, Verify!

We were using commercial backup software to back up our file servers and database servers. One day, a multimillion-dollar client wanted some files back that were archived about a year and a half ago. We got the tapes and tried a restore. Nothing. We tried other tapes. Nothing. The system administrator and her manager were both fired. The company lost the client and got sued. The root cause was never identified, but they had definitely never tried to verify their backups.
-Eugene Lee

Cost

The pricing aspect of backup software is too complex to cover in detail here, but suffice it to say that there are a number of factors that may be included in the total price, depending on which vendor you buy from:

• The number of clients that you want to back up
• The number of backup drives you wish to use
• What type of backup drives you want to use (high-speed devices often cost more)
• The number of libraries and the number of drives and slots that they have
• The size of the systems (in CPU power)
• The speed of backup that you need
• The number of database servers you have
• The number of different types of databases that you have
• The number of other special-treatment clients (MVS, Back Office) that require special interfaces
• The type of support you expect (24×7, 8×5, etc.)

Vendor

There is a lot of important information you need to know about a company from which you plan to purchase such a mission-critical product as backup software. How long have they been providing backup solutions? What kinds of resources are dedicated to the products' development? What type of support do they have? Are they open to suggestions about their product?

Get the name of at least one reference site and talk to them. Be aware that it is very hard for companies to come up with references. A lot of clients do not want to be a reference, for political and legal reasons. Be flexible here.
Don't require the salesperson to come up with a reference site that is exactly like your environment. If you do get a reference site, make sure that you get in touch with them. The number one complaint of salespeople is that they go through the trouble of obtaining reference sites, only to have the customer never call them.

The Internet is also a wonderful asset at a time like this. Search for the product's name in the Usenet archives at http://www.deja.com. (Make sure you search the complete archives.) Look for names of people who say good and bad things, and then email them. Also, do general searches on all search engines. Try one of the megasites like http://www.dogpile.com that can search several other sites for you. A really good product will have references on consultants' web sites. A really bad product might even have an "I hate product X" web site. Read everything with a grain of salt, and recognize that every single vendor of every single product has a group of people somewhere who hate it and chose someone else's product. Some clients have been through three backup products and are looking for a fourth.

Conclusions

Picking a commercial backup utility is a hard job. The only one that's harder is writing one. The data provided here covers a lot of areas and can be confusing at times. Be sure to compare the headings here with the questions that are in the RFI at http://www.backupcentral.com. The questions may help to explain some of the finer points.

The RFI that I use is extensive; it has more than 300 questions. Its main purpose is to put all vendors on a level playing field, and I have used it many times to evaluate backup software companies. Although it is extremely difficult to word a single RFI to cover the entire backup product industry, this is my best attempt at doing so. Most of the questions are worded in such a way that a "Yes" answer is considered to be a good answer, based on my opinion, of course.
Even though you may not agree with me on every point of the RFI, you should find it very useful in evaluating backup software companies. This RFI is not biased toward any particular product. It is biased toward how I believe backups should work. You may find that some of the questions are looking for features that you believe you may not need, especially the more advanced enterprise-level features like dynamic parallelism. I would submit that you might need them at some point. You never know how your environment will grow. For example, my first commercial backup software setup was designed to handle around 20 machines with a total of 200 GB. Within two years, that became 250 machines and several terabytes. You just never know how big your environment is going to grow.

However, if you are simply looking for a backup solution for a small environment, and you are sure that it will never grow much larger, feel free to ignore questions about such features. The same goes for any other features that you know you will never need. If you know that you're never going to have an MVS mainframe, then don't worry about a company's response to that question. If you know that you're never going to have connectivity between your company and the Internet, then don't worry about how a product deals with firewalls.

I also should mention that there are special-use backup products that serve a particular market very well but may not have some of the other enterprise-level features that I consider to be important. For example, there is a product that does a great job of backing up Macintoshes. They do that job better than anybody else, because they've done it longer than anybody else. (There are other products that back up Macintoshes.) This product does just that. For a purely Macintosh environment, that product might be just the product for you. (Of course, if you have a purely Macintosh environment, you probably wouldn't be reading this book.)
The RFI is available at http://backupcentral.com.

6 High Availability

Good backup and recovery strategies are key to any organization in protecting its valuable data. However, many environments are starting to realize that while a system is being recovered, it is not available for general use. With a little planning and financial backing, you can design and implement logical schemes to make systems more accessible, seemingly all the time. The concept of high availability encompasses several solutions that target different parts of this problem.

This chapter was written by Gustavo Vegas of Collective Technologies, with input from Josh Newcomb of Motorola. Gustavo may be reached at gustavo@colltech.com, and Josh may be reached at jnewcomb@paging.mot.com.

What Is High Availability?

High availability (HA) is defined as the ability of a system to perform its function without interruption for an extended length of time. This functionality is accomplished through special-purpose software and redundant system and network hardware. Technologies such as volume management, RAID, and journaling filesystems provide the essential building blocks of any HA system.

Some would consider that an HA system doesn't need to be backed up, but such an assumption can leave your operation at significant risk. HA systems are not immune to data loss resulting from user carelessness, hostile intrusions, or applications corrupting data. Instead, HA systems are designed in such a way that they can survive hardware failures and some software failures. If a disk drive or CPU fails, or even if the system needs routine maintenance, an HA system may remain available to the users; thus it is viewed as being more highly available than other systems. That does not mean that its data will be forever available. Make sure you are backing up your HA systems.

Overview

Systems are becoming more critical every day.
The wrong system in an organization going down could cost millions of dollars, and somebody's job. What if there were software tools that could detect system failures and then try to recover from them? If the system could not recover from a hardware failure, it would relinquish its functionality (or "fail over") to another system and restart all of its critical applications. This is exactly what HA software can do.

Consider the example of two servers in the highly available configuration depicted in Figure 6-1. This is an illustration of what is called an asymmetric configuration. This kind of configuration contains a primary server and a takeover server. A primary server is the host that provides a network service or services by default. A takeover server is the host that would provide such services when the primary server is not available to perform its function. In another type of configuration, called symmetric, the two servers would provide separate and different services and would act as each other's takeover server for their corresponding services. One of the services best suited to an HA system is a network file access service, like the Network Filesystem, or NFS.

In the example in Figure 6-1, each server has an onboard 100-megabit Ethernet interface (hme0) and two Ethernet ports on two quad fast Ethernet cards (qfe0 and qfe1). These network card names could be different for your system, depending upon your hardware and operating system. qfe0 and hme0 are being used as the heartbeat links. These links monitor the health of the HA servers and are connected to each system via a private network, which could be implemented with a minihub or with a crossover twisted-pair cable. There are two of these for redundancy. qfe1 is used as the system's physical connection to the service network, which is the network for which services are being provided by the HA system.
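The reason for having two heartbeat links can be shown with a small Python model. This is a hedged sketch, not the logic of any real HA product: the HeartbeatMonitor class, its method names, and the 5-second timeout are assumptions chosen for illustration, and only the link names hme0 and qfe0 come from the example above. The takeover server declares the primary dead only when both links have gone silent, so a single failed cable or interface does not trigger a failover.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen on each redundant link (hypothetical model)."""

    links: tuple = ("hme0", "qfe0")
    timeout: float = 5.0  # seconds of silence before a link counts as dead (assumed value)
    last_seen: dict = field(default_factory=dict)

    def record(self, link, now):
        """Note that a heartbeat arrived on the given link at time now."""
        if link not in self.links:
            raise ValueError(f"unknown heartbeat link: {link}")
        self.last_seen[link] = now

    def link_alive(self, link, now):
        """A link is alive if a heartbeat arrived within the timeout window."""
        seen = self.last_seen.get(link)
        return seen is not None and (now - seen) <= self.timeout

    def peer_alive(self, now):
        """The peer is considered alive while at least one link still carries heartbeats."""
        return any(self.link_alive(link, now) for link in self.links)

    def should_take_over(self, now):
        """Fail over only when every heartbeat link has gone silent."""
        return not self.peer_alive(now)


monitor = HeartbeatMonitor()
monitor.record("hme0", now=0.0)
monitor.record("qfe0", now=0.0)
monitor.record("qfe0", now=6.0)       # hme0 has stopped, qfe0 still reports
print(monitor.should_take_over(8.0))   # False: one link is still alive
print(monitor.should_take_over(12.0))  # True: both links silent for more than 5 s
```

Note that in this toy model a link that has never delivered a heartbeat counts as dead, so a real implementation would need a startup grace period before acting on should_take_over.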
The two shared disk arrays are connected via a fiber-channel connection and are under volume management control. These disk arrays contain the critical data. Such a design allows for immediate recovery from a number of problems. If Server A lost its connectivity to the network, the HA software would notice this via the heartbeat network. Server A could shut down its applications and its database automatically. Server B could then assume the identity of Server A, import the database, and start the necessary applications by becoming the primary server. Also, if Server A was not able to complete a task due to an application problem, the HA software could then fail over the primary system to the takeover server.

Figure 6-1. Asymmetric configuration

The takeover server would be a system that absorbs the applications and identity of the primary server.

Highly available systems depend on good hardware, good software, and proper implementation. The configuration in Figure 6-1 is an example of a simple configuration and may not necessarily be suitable for your organization, but it may help to get you started.

How Is HA Different from Fault-Tolerant Solutions?

A fault-tolerant system uses a more robust and hardware-oriented configuration than does a high-availability system. Fault-tolerant systems usually include more than two systems and use specific-purpose hardware that is geared to withstand massive failures. They also are designed around a voting system in which the principle of quorum is used to make decisions, and all processing units involved in computations run the same processes in parallel.

On the other hand, highly available solutions typically are software oriented. They combine duplication of hardware on the involved systems with various configuration techniques to cope with failures. Functions usually are run in only one of the systems, and when a failure is detected, the takeover system is signaled to start a duplicate function.
The asymmetric configuration is possible in this scenario.

Good examples of fault-tolerant systems are found in military and space applications. Companies like Tandem (now Compaq) and Stratus market this type of system. Sun Microsystems has a division that specializes in providing fault-tolerant systems.

How Is HA Different from Mirroring?

Mirroring is really the process of making a simultaneous copy of your critical data that is instantly available online. Having data written redundantly to two disks, two disk groups, or even two different disk arrays can be highly beneficial if an emergency occurs. Mirroring is a primary ingredient in the recipe for data recovery and should be part of your total backup and disaster recovery plan. However, having a system highly available is much more than just installing software; it is creating an environment in which failures can be tolerated because they can be recovered from quickly. Not only can the system's primary data storage fail, but the system itself can fail as well. When this happens, HA systems fail over to the takeover system with the mirrored data (if necessary) and continue to run with minimal downtime.

Can HA Be Handled Across a LAN/WAN?

In general terms, high availability can be handled even across a local or wide area network. There are some caveats to the extent, feasibility, and configuration that can be considered in either case, however. Currently available commercial solutions are geared more to local area network (LAN) environments; our example depicted in Figure 6-2 is a classical example of HA over a LAN.

Conversely, a wide area network (WAN) environment presents some restrictions on the configuration that may be used. For instance, it would be cumbersome and costly to implement a private network connection for the heartbeat links. Additionally, a WAN environment would require more support from the network devices, especially the routers.
As an illustration, we will show you how to use routers to allow the activation of a takeover server in such a way that it uses the same IP address as the primary server. Naturally, only one host can use a particular IP address at one time, so the routers will be set up as a kind of automatic A/B switch. In this way, only one of the two HA systems can be reached at that IP address at any one time. For the switchover to be automatic, the routers must be able to update their routing databases automatically. When the switchover is to occur, Router R3 will be told not to route packets to its server, and Router R4 will be told to start routing those packets to the takeover server. Because the traffic is turned off completely, only the HA server and its takeover counterpart should be behind Routers R3 and R4, respectively. All other hosts should be "in front" of R3 or R4 so that they continue to receive packets when the router A/B switch is thrown.

Figure 6-2. HA over a LAN

In order to accomplish this feat, the routing protocol must be a dynamic protocol, one capable of updating routing tables within the routers without human intervention (please refer to Figure 6-2). We have laid out a structure in such a way that the primary server resides behind Router R3, a layer-3 routing device. Local traffic flows between R1 and R3. R1 is the gateway to the WAN. The takeover server is located behind R4, which remains inactive as long as no failures of the primary server are detected. R2 is the WAN gateway for the takeover server. If a failure is detected on the primary server, R3 would be disabled and R4 would be enabled, in much the fashion of the switch described earlier. At this point, the routers will begin to restructure their routing tables with new information. R4 will now pass to R2 routing information about the takeover server's segment, and R3 will announce to R1 the loss of its route, which R1 will in turn announce to the WAN.
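The A/B switch behavior can be modeled in a few lines of Python. This is a toy model under stated assumptions, not router configuration and not any routing protocol's real behavior: the RouterABSwitch class and its methods are invented for this sketch, the network 192.0.2.0/24 is a documentation-only placeholder, and only the router names R1 through R4 come from the text. The model captures just one property, namely that exactly one WAN gateway learns the service network at any time.

```python
class RouterABSwitch:
    """Toy model of the R3/R4 switch: exactly one edge router announces the service network."""

    def __init__(self, service_network):
        self.service_network = service_network
        # R3 fronts the primary server and R4 fronts the takeover server.
        self.enabled = {"R3": True, "R4": False}

    def announcements(self):
        """Map each WAN gateway to the networks it learns from its edge router."""
        return {
            "R1": [self.service_network] if self.enabled["R3"] else [],
            "R2": [self.service_network] if self.enabled["R4"] else [],
        }

    def fail_over(self):
        """Disable R3 (its route is withdrawn via R1) and enable R4 (announced via R2)."""
        self.enabled["R3"] = False
        self.enabled["R4"] = True

    def fail_back(self):
        """Return service to the primary side once it has been repaired."""
        self.enabled["R3"] = True
        self.enabled["R4"] = False

    def reachable_via(self):
        """Return the single WAN gateway through which the service address is reachable."""
        active = [gw for gw, networks in self.announcements().items() if networks]
        if len(active) != 1:
            raise RuntimeError("service network must be announced by exactly one gateway")
        return active[0]


switch = RouterABSwitch("192.0.2.0/24")  # placeholder network, assumed for the example
print(switch.reachable_via())  # R1
switch.fail_over()
print(switch.reachable_via())  # R2
```

In a real deployment the equivalent of fail_over would be the dynamic routing protocol withdrawing and announcing routes on its own, with the convergence delay that implies.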
Some protocols that deal with routing may require users to delete the primary server's network from a router's tables and to add the takeover server's network to the router, which will now support the takeover server's segment. By using R1 as a default gateway for the primary server's segment, the routing switchover should happen more quickly.

In order to get the mission-critical data from one point to another, a product such as Qualix DataStar, which can provide remote mirroring for disaster recovery purposes, should be used. Its use will enable an offsite copy at the fail-over location.

A more sophisticated solution for a WAN environment could be implemented using two servers mirroring each other's services across the network by running duplication software such as Auspex's ServerGuard. Network routers could be configured in such a way that they would manage the switching of IP addresses by using the same philosophy as in regular HA. Cisco has developed just such a protocol, called Hot Standby Routing Protocol (HSRP).

Another possibility is to use SAN technology to share highly available peripherals. Since SAN peripherals can be attached via Fibre Channel, they could be placed several miles away from the system that is using them.

Why Would I Need an HA Solution?

As organizations grow, so does the need to be more proactive in setting up systems and procedures to handle possible problems. Because an organization's data becomes more critical every day, having systems in place to protect against data loss becomes more desirable every day. Highly available designs are reliable and cost-effective solutions to help make the critical systems in any environment more robust. HA designs can guarantee that business-critical applications will run with few, if any, interruptions. Although fault-tolerant systems are even more robust, HA solutions are often the best strategy because they are more cost-effective.
HA Building Blocks

Many people begin adding availability to their systems in levels. They start by increasing the availability of their disks by using volume management software to place their disks in some sort of RAID* configuration. They also begin increasing filesystem availability by using a journaling filesystem. Here is an overview of these concepts.

* Redundant Array of Independent Disks. I believe that the original definition of this was Redundant Array of Inexpensive Disks, as opposed to one large very expensive disk. However, this seems to be the commonly held definition today. Based on the prices of today's RAID systems, "Independent" seems much more appropriate than "Inexpensive."

Volume Management

There are two ways to increase the availability of your disk drives. The first is to buy a hardware-based RAID box, and the second is to use volume management software to add RAID functionality to "regular" disks.

The storage industry uses the term "volume management" when talking about managing multiple disks, especially when striping or mirroring them with software. Please don't confuse this with managing backup volumes (i.e., tapes, CDs, optical platters, etc.).

The amount of availability that you add will be based on the level of RAID that you choose. Common numbered examples of RAID are RAID-0, RAID-1, RAID-0+1, RAID-1+0, RAID-10 (1+0 and 10 refer to the same thing), RAID-2, RAID-3, RAID-4, RAID-5, and RAID-6. See Table 6-1 for a brief description of each RAID level. A more detailed description of each level follows.

Table 6-1. RAID Definitions

Level     Description
RAID      A disk array in which part of the physical storage capacity is used to store redundant information about user data stored on the remainder of the storage capacity. The redundant information enables regeneration of user data in the event that one of the array's member disks or the access data path to it fails.
Level 0   Disk striping without data protection. (Since the "R" in RAID means redundant, this is not really RAID.)
Level 1   Mirroring. All data is replicated on a number of separate disks.
Level 2   Data is protected by Hamming code. Uses extra drives to detect 2-bit errors and correct 1-bit errors on the fly. Interleaves by bit or block.
Level 3   Each virtual disk block is distributed across all array members but one, with parity check information stored on a separate disk.
Level 4   Data blocks are distributed as with disk striping. Parity check is stored in one disk.
Level 5   Data blocks are distributed as with disk striping. Parity check data is distributed across all members of the array.
Level 6   Like RAID-5, but with additional independently computed check data.

The RAID "hierarchy" begins with RAID-0 (striping) and RAID-1 (mirroring). Combining RAID-0 and RAID-1 is called RAID-0+1 or RAID-1+0, depending on how you combine them. (RAID-0+1 is also called RAID-01, and RAID-1+0 is also called RAID-10.) The performance of RAID-10 and RAID-01 is identical, but they have different levels of data integrity.

RAID-01 (or RAID-0+1) is a mirrored pair (RAID-1) made from two stripe sets (RAID-0), hence the name RAID-0+1, because it is created by first creating two RAID-0 sets and adding RAID-1. If you lose a drive on one side of a RAID-01 array, then lose another drive on the other side of that array before the first side is recovered, you will suffer complete data loss. It also is important to note that all drives in the surviving mirror are involved in rebuilding the entire damaged stripe set, even if only a single drive was damaged. Performance during recovery is severely degraded unless the RAID subsystem allows adjusting the priority of recovery. However, shifting the priority toward production will lengthen recovery time and increase the risk of the kind of catastrophic data loss mentioned earlier.
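The difference in data integrity between the two layouts can be put into numbers. The following Python sketch is an illustration under simplifying assumptions that are mine, not the authors': the array has 2n identical drives, one drive has already failed, and the next failure is equally likely to hit any of the 2n - 1 remaining drives. The function names are hypothetical. In RAID-01 (described above), any of the n drives in the surviving stripe set is fatal; in RAID-10 (a stripe of n mirrored pairs), only the failed drive's mirror partner is fatal.

```python
from fractions import Fraction


def fatal_second_failure_raid01(n):
    """Chance that a second drive failure destroys a RAID-01 array of two n-drive stripe sets.

    After the first failure, one whole stripe set is offline, so losing any of the
    n drives in the other stripe set loses all data.
    """
    return Fraction(n, 2 * n - 1)


def fatal_second_failure_raid10(n):
    """Chance that a second drive failure destroys a RAID-10 array of n mirrored pairs.

    Only the mirror partner of the already-failed drive is fatal.
    """
    return Fraction(1, 2 * n - 1)


n = 4  # eight drives in total
print(fatal_second_failure_raid01(n))  # 4/7
print(fatal_second_failure_raid10(n))  # 1/7
print(fatal_second_failure_raid10(n) / fatal_second_failure_raid01(n))  # 1/4
```

The ratio between the two results is always 1/n under these assumptions. The sketch ignores rebuild time, which further favors RAID-10 because its exposure window after a failure is shorter.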
RAID-10 (or RAID-1+0) is a stripe set made up from n mirrored pairs. Only the loss of both drives in the same mirrored pair can result in any data loss, and the loss of that particular drive is 1/nth as likely as the loss of some drive on the opposite mirror in RAID-01. Recovery involves only the replacement drive and its mirror, so the rest of the array performs at 100 percent capacity during recovery. Also, since only the single drive needs recovery, bandwidth requirements during recovery are lower and recovery takes far less time, reducing the risk of catastrophic data loss.

RAID-2 is a parity layout that uses a Hamming code* that detects errors that occur and determines which part is in error by computing parity for distinct overlapping sets of disk blocks. (RAID-2 is not used in practice; the redundant computations of a Hamming code are not required, since disk controllers can detect the failure of a single disk.)

* A Hamming code is a basic mathematical Error Correction Code (ECC).

RAID-3 is used to accelerate applications that are single-stream bandwidth oriented. All I/O operations will access all disks, since each logical block is distributed across the disks that comprise the array. The heads of all disks move in unison to service each I/O request. RAID-3 is very effective for very large file transfers, but it would not be a good choice for a database [...]

... 620.8MB in 44 cyl groups (16 c/g, 14.25MB/g, 6848 i/g)
super-block backups (for fsck -F ufs -o b=#) at:
32, 29312, 58592, 87872, 117152, 146432, 175712, 204992, 234272, 263552, 292832, 322112, 351392, 380672, 409952, 439232, 468512, 497792, 527072, 556352, 585632, 614912, 644192, 673472, 702752, 732032, 761312, 790592, 819872, 849152, 878432, 907712, 933920, 963200, 992480, 1021760, 1051040, 1080320, ...
=================================================================
/dev/dsk/c0t3d0s0     592438   474356    58839   89%   /
/dev/dsk/c0t3d0s1     123231    62943    47965   57%   /var
/dev/dsk/c0t0d0s6    1854801  1268851   400470   77%   /db
/dev/dsk/c0t1d0s4      48479     1466    42166    4%   /vol2
/dev/dsk/c0t1d0s5    1767031     2923  1711098    1%   /junk
/dev/dsk/c0t1d0s3      48479     1441    42191    4%   /vol1
/dev/dsk/c0t3d0s3     525160 swap
Solaris data
============
sar(1M) status
Veritas Volume Mgr : : ...

... bare-metal in general and describes how to use Native Unix utilities to recover SunOS/Solaris systems
• Chapter 9, Compaq True-64 Unix, describes Digital's recovery system and shows how to develop a custom recovery plan using Native Unix utilities
• Chapter 10, HP-UX, discusses bare-metal recovery using tools provided by Hewlett-Packard in combination with Native Unix utilities
• Chapter ...

... CD-ROM

Platform                                                             Boot Command
4/110, 4/2xx, 4/3xx, 4/4xx                                           b sd(0,3,1) -s
Sparc 1, Sparc 1+, Sparc SLC, Sparc IPC                              boot sd(0,6,2) -s
SPARC 1E                                                             boot sd(0,6,5) -s
SPARC ELC, IPX, LX, classic 2, 10, LX, 6xxMP, 1000, 2000, 4x00,
6x00, 10000 (and probably any newer architectures)                   boot cdrom -s

3. Partition the new drive to look like the old drive

To do this on Solaris, we use the format command:

SunOS# format sd0 ...

... disk failure, you will need to recover the root disk from some type of backup. Recovering the root disk is called a bare-metal recovery, and there are many platform-specific, bare-metal recovery utilities. The earliest example of such a utility on a Unix platform is AIX's mksysb command. mksysb is still in use today and makes a special backup tape that stores all of the root volume-group information. The ...

... Estimated 991718 blocks (484.24MB) on 0.01 tapes
DUMP: Dumping (Pass III) [directories]
DUMP: Dumping (Pass IV) [regular files]
DUMP: 38.81% done, finished in 0:16
DUMP: 78.06% done, finished in 0:05
DUMP: 991678 blocks (484.22MB) on 1 volume at 319 KB/sec
DUMP: DUMP IS DONE
DUMP IS DONE
DUMP: Level 0 dump on Mon Jan 04 16:08:20 1999
/usr/sbin/ufsdump 0 Done Mon 01/04/99 16:34:21 from curtis / on curtis:/dev/rmt/0n ...

... II) [directories]
DUMP: Estimated 130134 blocks (63.54MB) on 0.00 tapes
DUMP: Dumping (Pass III) [directories]
DUMP: Dumping (Pass IV) [regular files]
DUMP: 130110 blocks (63.53MB) on 1 volume at 345 KB/sec
DUMP: DUMP IS DONE
DUMP: Level 0 dump on Mon Jan 04 16:34:25 1999
/usr/sbin/ufsdump 0 Done Mon 01/04/99 16:37:37 from curtis /var on curtis:/dev/rmt/0n
2 Backups complete
Beginning reading of tape ...

... Disk data: sun4m Unknown 72304ecb SPARCstation 10MP (2 X 390Z55) 5.6 Generic_105181-07 1.15 (bundled) 128 Megabytes 358 Megabytes oracle
=================================================================
Filesystem            kbytes     used    avail capacity  Mounted on
=================================================================
/dev/dsk/c0t3d0s0     592438   474356    58839   89%   /
/dev/dsk/c0t3d0s1     123231    62943    47965   57%   /var ...

... steps and the logic behind them were covered earlier. Following is an example of how such a recovery would look on a Solaris system. This example covers a Sparc 20. Its operating system is Solaris 2.6, and it has two filesystems, / and /var.

Preparing for Disaster

First, we will back up the system using hostdump.sh. This utility is covered in Chapter 3, Native Backup & Recovery Utilities, and will ...

... /iommu@f,e0000000/sbus@f,e0001000/espdma@f,400000/esp@f,800000/sd@0,0
1 c0t1d0
/iommu@f,e0000000/sbus@f,e0001000/espdma@f,400000/esp@f,800000/sd@1,0
2 c0t3d0
/iommu@f,e0000000/sbus@f,e0001000/espdma@f,400000/esp@f,800000/sd@3,0
Specify disk (enter its number): 2

* Yes, I know this is a fake example, and a real "blank" disk never ...

Date posted: 13/08/2014, 04:21