The Practice of System and Network Administration, Second Edition (part 6)

Chapter 20 Maintenance Windows

20.1.7.2 KVM and Console Service

Two data center elements that make management easier are KVM switches and serial console servers. Both can be instrumental in making maintenance windows easier to run by making it possible to remotely access the console of a machine.

A KVM switch permits multiple computers to share the same keyboard, video display, and mouse. A KVM switch saves space in a data center (monitors and keyboards take up a lot of space) and makes access more convenient; indeed, more sophisticated console access systems can be accessed from anywhere in the network.

A serial console server connects devices with serial consoles (systems without video output, such as network routers, switches, and many UNIX servers) to one central device with many serial inputs. By connecting to the console server, a user can then connect to the serial console of the other devices. All the computer room equipment that is capable of supporting a serial console should have its serial console connected to some kind of console concentrator, such as a networked terminal server.

Much work during a maintenance window requires direct console access. Using console access devices permits people to work from their own desks rather than having to coordinate access for many people to the very limited number of monitors in the computer room or having to waste computer room space, power, and cooling on more monitors. It is also more convenient for the individual SAs to work in their own workspace with their preparatory notes and reference materials around them.

20.1.7.3 Radios

Because of the tightly scheduled maintenance window, the high number of dependencies, and the occasional unpredictability of system administration work, all the SAs have to let the flight director know when they are finished with a task, and before they start a new task, to make sure that the prerequisite tasks have all been completed. We recommend using handheld radios to communicate
within the group. Rather than seeking out the flight director, an SA can simply call over the radio. Likewise, the flight director can contact the SAs to find out status, and team members and team leaders can find one another and coordinate over the radio. If SAs need extra help, they can also ask for it over the radio. There are multiple radio channels, and long conversations can move to another channel to keep the primary one free. The radios are also essential for systemwide testing at the end of the maintenance window (see Section 20.1.9).

It is useful to use radios, cellphones, or some other effective form of two-way communication for campuswide instant communication between SAs. We recommend radios because they are not billed by the minute and typically work better in data center environments than cellphones. Remember, anything transmitted on the airwaves can be overheard by others, so sensitive information, such as passwords, should not be communicated over radios, cellphones, or pagers.

Several options exist for selecting the radios, and what you choose depends on the coverage area that you need, the type of terrain in that area, availability, and your skill level. It is useful to have multiple channels, or frequencies, available on the handheld radios, so that long conversations can switch to another channel and leave the primary hailing channel open for others (see Table 20.3).

Line-of-sight radio communications are the most common and typically have a range of around 15 miles, depending on the surrounding terrain and buildings. Your retailer should be able to set you up with one or more frequencies and a set of radios that use those frequencies. Make sure that the retailer knows that you need the radios to work through buildings and knows the coverage that you need.

Table 20.3 Comparison of Radio Technologies

Type: Line of sight
  Requirements: Frequency license
  Advantages: Simple; transmits through walls
  Disadvantages: Limited range; doesn’t transmit through mountains

Type: Repeater
  Requirements: Frequency license; radio operator license; skill qualifications
  Advantages: Better range; repeater on mountain enables communication over mountain
  Disadvantages: More complex to run

Type: Cellular
  Requirements: Service availability
  Advantages: Simple; wide range; unaffected by terrain; less to carry
  Disadvantages: Higher cost; available only in cellphone providers’ coverage area; company contracts may limit options; multiple channels may not be available

Repeaters can be used to extend the range of a radio signal and are particularly useful if a mountain between campus buildings would block line-of-sight communication. It can be useful to have a repeater and an antenna on top of one of the campus buildings in any case, for additional range, with at least the primary hailing channel using the repeater. This configuration usually requires that someone with a ham radio license set up and operate the equipment. Check your local laws.

Some cellphone companies offer push-to-talk features on cellphones so that phones work more like walkie-talkies. This option will work wherever the telephones operate. The provider should be able to provide maps of the coverage areas. The company should supply all SAs with a cellphone with this service. This has the advantage that the SAs have to carry only the phone, not a phone and a radio. This can be a quick and convenient way to get a new group established with radios but may not be feasible if it requires everyone to change to the same cellphone provider.

If radios won’t work or work badly in your data center because of radio frequency (RF) shielding, put an internal phone extension with a long cord at the end of every row, as shown in Figure 6.14. That way, SAs in the data center can still communicate with other SAs while working in the data center. At worst, they can go outside the data center, contact someone on the radio, and arrange to talk to that person on a specific telephone inside the data center.

Setting up a conference call bridge
for everyone to dial in to can provide the benefits of radio communication, with the added benefit that people can dial in globally to participate. Having a permanent bridge number assigned to the group makes it easier to memorize and can save critical minutes when needed for emergencies.

Communication During an Emergency

A major news web site was flooded by users during the attacks of September 11, 2001. It took a long time to request and receive a conference call bridge and even longer for all the key players to receive dialing instructions.

20.1.8 Deadlines for Change Completion

A critical role of the flight director is tracking how the various tasks are progressing and deciding when a particular change should be aborted and the back-out plan for that change executed. For a general task with no other dependencies and for which those involved had no other remaining tasks, that time would be 11 PM on Saturday evening, minus the time required to implement the back-out plan, in the case of a weekend maintenance window. The flight director should also consider the performance level of the SA team. If the members are exhausted and frustrated, the flight director may decide to tell them to take a break or to start the back-out process early if they won’t be able to implement it as efficiently as they would when they were fresh.

If other tasks depend on that system or service being operational, it is particularly critical to predefine a cut-off point for task completion. For example, if a console server upgrade is going badly, it can run into the time regularly allotted for moving large data files. Once you have overrun one time boundary, the dependencies can cascade into a full catastrophe, which can be fixed only at the next scheduled downtime, perhaps another week away. Make note of what other tasks are regularly scheduled near or during your maintenance window, so you can plan when to start backing out of a problem.

20.1.9 Comprehensive System Testing

The final stage
of a maintenance window is comprehensive system testing. If the window has been short, you may need to test only the few components that you worked on. However, if you have spent your weekend-long maintenance window taking apart various complicated pieces of machinery and then putting them back together, all under a time constraint, you should plan on spending all day Sunday doing system testing.

Sunday system testing begins with shutting down all of the machines in the data center, so that you can then step through your ordered boot sequence. Assign an individual to each machine on the reboot list. The flight director announces the stages of the shutdown sequence over the radio, and each individual responds when the machine under their responsibility has completely shut down. When all the machines at the current stage have shut down, the flight director announces the next stage. When everything is down, the order is reversed, and the flight director steps everyone through the boot stages. If any problems occur with any machine at any stage, the entire sequence is halted until they are debugged and fixed. Each person assigned to a machine is responsible for ensuring that it shut down completely before responding and that all services have started correctly before calling it in as booted and operational.

Finally, when all the machines in the data center have been successfully booted in the correct order, the flight director splits the SA team into groups. Each group has a team leader and is assigned an area in one of the campus buildings. The teams are given instructions about which machines they are responsible for and which tests to perform on them. The instructions always include rebooting every desktop machine to make sure that it comes up cleanly. The tests could also include logging in, checking for a particular service, or trying to run a particular application, for example.

Each person in the group has a stack of colored sticky tabs
used for marking offices and cubicles that have been completed and verified as working. The SAs also have a stack of sticky tabs of a different color to mark cubicles that have a problem. When SAs run across a problem, they spend a short time trying to fix it before calling it in to the central core of people assigned to stay in the main building to help debug problems. As it finishes its area, a team is assigned to a new area or to help another team to complete an area, until the whole campus has been covered.

Meanwhile, the flight director and the senior SA troubleshooters keep track of problems on a whiteboard and decide who should tackle each problem, based on the likely cause and who is available. By the end of testing, all offices and cubicles should have tags, preferably all indicating success. If any offices or cubicles still have tags indicating a problem, a note should be left for that customer, explaining the problem; someone should be assigned to meet with that person to try to resolve it first thing in the morning.

This systematic approach helps to find problems before people come in to work the next day. If there is a bad network segment connection, a failed software depot push, or problems with a service, you’ll have a good chance to fix it before anyone else is inconvenienced. Be warned, however, that some machines may not have been working in the first place. The reboot teams should always make sure to note when a machine did not look operational before they rebooted it. They can still take time to try to fix it, but it is lower on the priority list and does not have to happen before the end of the maintenance window.

Ideally, the system testing and sitewide rebooting should be completed sometime on Sunday afternoon. This gives the SA team time to rest after a stressful weekend before coming into work the next day.

20.1.10 Post-maintenance Communication

Once the maintenance work and system testing have been completed, the flight director sends out a message to
the company, informing everyone that service should now be fully restored. The message briefly outlines the main successes of the maintenance window and briefly lists any services that are known not to be functioning and when they will be fixed.

This message should be in a fixed format and written largely in advance, because the flight director will be too tired to write a very coherent or upbeat message at the end of a long weekend. There is also little chance that anyone who proofreads the message at that point is going to be able to help, either.

Hidden Infrastructure

Sometimes, customers depend on a server or a service but neglect to inform us, perhaps because they implemented it on their own. This is what we call hidden infrastructure.

A site had a planned outage, and all power to the building was shut off. Servers were taken down in an orderly manner and brought back successfully. The following morning, the following email exchange took place:

From: IT
To: Everyone in the company
All servers in the Burlington office are up and running. Should you have any issues accessing servers, please open a helpweb ticket.

From: A Developer
To: IT
Devwin8 is down.

From: IT
To: Everyone in the company
Whoever has devwin8 under their desk, turn it on, please.

20.1.11 Re-enable Remote Access

The final act before leaving the building should be to re-enable remote access and restore the voicemail on the helpdesk phone to normal. Make sure that this appears on the master plan and the individual plans of those responsible. It can be very easily forgotten after an exhausting weekend, but it is a very visible, inconvenient, and embarrassing thing to forget, especially because it can’t be fixed remotely if all remote access was turned off successfully.

20.1.12 Be Visible the Next Morning

It is very important for the entire SA group to be in early and to be visible to the company the morning after a maintenance window, no matter how hard they have worked during the
outage. If everyone has company or group shirts, coordinate in advance of the maintenance window so that all the SAs wear those shirts on the day after the outage. Have the people who look after particular departments roam the corridors of those departments, keeping eyes and ears open for problems.

Have the flight director and some of the senior SAs from the central core-services group, if there is one, sit in the helpdesk area to monitor incoming calls and listen for problems that may be related to the maintenance window. These people should be able to detect and fix them sooner than the regular helpdesk staff, who won’t have such an extensive overview of what has happened.

A large visible presence when the company returns to work sends the message: “We care, and we are here to make sure that nothing we did disrupts your working hours.” It also means that any undetected problems can be handled quickly and efficiently, with all the relevant staff on-site and not having to be paged out of their beds. Both of these factors are important in the overall satisfaction of the company with the maintenance window. If the company is not satisfied with how the maintenance windows are handled, the windows will be discontinued, which will make preventive maintenance more difficult.

20.1.13 Postmortem

By about lunchtime of the day after the maintenance window, most of the remaining problems should have been found. At that point, if it is sufficiently quiet, the flight director and some of the senior SAs should sit down and talk about what went wrong, why, and what can be done differently. That should all be noted and discussed with the whole group later in the week. Over time, with the postmortem process, the maintenance windows will become smoother and easier. Common mistakes early on are taking on too much, not doing enough work ahead of time, and underestimating how long something will take.

20.2 The Icing

Although a lot of basics must be implemented for
a successful large-scale maintenance window, a few more things are nice to have. After completion of some successful maintenance windows, you should start thinking about the icing that will make your maintenance windows more successful.

20.2.1 Mentoring a New Flight Director

It can be useful to mentor new flight directors for future maintenance windows. Therefore, flight directors must be selected far enough in advance so that the one for the next maintenance window can work with the current flight director.

The trainee flight director can produce the first draft of the master plan, using the change requests that were submitted, adding in any dependencies that are missing, and tagging those additions. The flight director then goes over the plan with the trainee, adds or subtracts dependencies, and reorganizes the tasks and personnel assignments as appropriate, explaining why. Alternatively, the flight director can create the first draft along with the trainee, explaining the process while doing so.

The trainee flight director can also help out during the maintenance window, time permitting, by coordinating with the flight director to track status of certain projects and suggesting reallocation of resources where appropriate. The trainee can also help out before the downtime by discussing projects with some of the SAs if the flight director has questions about the project and by ensuring that the prerequisites listed in the change proposal are met in advance of the maintenance window.

20.2.2 Trending of Historical Data

It is useful to track how long particular tasks take and then analyze the data later and improve on the estimates in the task submission and planning process. For example, if you find that moving a certain amount of data between two machines took hours and you have a large data move between two similar machines on similar networks another time, you can more accurately predict how long it will take. If a particular software package is always
difficult to upgrade and takes far longer than anticipated, that will be tracked, anticipated, allowed for in the schedule, and watched closely during the maintenance interval.

Trending is particularly useful in passing along historical knowledge. When someone who used to perform a particular function has left the group, the person who takes over that function can look back at data from previous maintenance windows to see what sorts of tasks are typically performed in this area and how long they take. This data can give people new to the group and to planning a maintenance window a valuable head start so that they don’t waste a maintenance opportunity and fall behind.

For each change request, record actual time to completion for use when calculating time estimates next time around. Also record any other notes that will help improve the process next time.

20.2.3 Providing Limited Availability

It is highly likely that at some point, you will be asked to keep service available for a particular group during a maintenance window. It may be something unforeseen, such as a newly discovered bug that engineering needs to work on all weekend, or it may be a new mode of operation for a division, such as customer support switching to 24/7 service and needing continuous access to its systems to meet its contracts. Internet services, remote access, global networks, and new-business pressure reduce the likelihood that a full and complete outage will be permitted.

Planning for this requirement could involve rearchitecting some services or introducing added layers of redundancy to the system. It may involve making groups more autonomous or distinct from one another. Making these changes to your network can be significant tasks by themselves, likely requiring their own maintenance window; it is best to be prepared for these requests before they arrive, or you may be left without time to prepare.

To approach this task, find out what the customers will need to
be able to do during the maintenance window. Ask a lot of questions, and use your knowledge of the systems to translate these needs into a set of service-availability requirements. For example, customers will almost certainly need name service and authentication service. They may need to be able to print to specific printers and to exchange email within the company or with customers. They may require access to services across wide-area connections or across the Internet. They may need to use particular databases; find out what those machines depend on. Look at ways to make the database machines redundant so that they can also be properly maintained without loss of service. Make sure that the services they depend on are redundant. Identify what pieces of the network must be available for the services to work. Look at ways to reduce the number of networks that must be available by reducing the number of networks that the group uses and locating redundant name servers, authentication servers, and print servers on the group’s networks. Find out whether small outages are acceptable, such as a couple of 10-minute outages for reloading network equipment. If not, the company needs to invest in redundant network equipment.

Devise a detailed availability plan that describes exactly what services and components must be available to that group. Try to simplify it by consolidating the network topology and introducing redundant systems for those networks. Incorporate availability planning into the master plan by ensuring that redundant servers are not down simultaneously.

20.2.4 High-Availability Sites

By the very nature of their business, high-availability sites cannot afford to have large planned outages.[2] This also means that they cannot afford not to make the large investment necessary to provide high availability. Sites that have high-availability requirements need to have lots of hot redundant systems that continue providing service when any one component fails. The higher
the availability requirement, the more redundant systems are required to achieve it.[3] These sites still need to perform maintenance on the systems in service. Although the availability guarantees that these sites make to their customers typically exclude maintenance windows, they will lose customers if they have large planned outages.

20.2.4.1 The Similarities

Most of the principles described here for maintenance windows at a corporate site apply at high-availability sites.

• They need to schedule the maintenance window so that it has the least impact on their customers. For example, ISPs often choose the early morning (local time) midweek; e-commerce sites need to choose a time when they do the least business. These windows will typically be quite frequent, such as once a week, and shorter, perhaps only hours in duration.

• They need to let their customers know when maintenance windows are scheduled. For ISPs, this means sending an email to the customers. For an e-commerce site, this means having a banner on the site. In both cases, it should be sent only to those customers who may be affected and should contain a warning that small outages or degraded service may occur during the maintenance window and give the times of that window. There should be only a single message about the window.

[2] High availability is anything above 99.9 percent. Typically, sites will be aiming for three nines (99.9 percent) (9 hours downtime per year), four nines (99.99 percent) (1 hour per year), or five nines (99.999 percent) (5 minutes per year). Six nines (99.9999 percent) (less than 1 minute a year) is more expensive than most sites can afford.

[3] Recall that n + 1 redundancy is used for services such that any one component can fail without bringing the service down, n + 2 means any two components can fail, and so on.

Chapter 24 Print Service

Customers will recycle when it is easy and not when it is impossible. If there are no recycling bins, people won’t recycle. Putting a recycling bin next to each trash can and near
every printer makes it easy for customers to recycle. As SAs, it is our responsibility to institute such a program if one doesn’t exist, but creation of such a system may be the direct responsibility of others. We should all take responsibility for coordinating with the facilities department to create such a system or, if one is in place, to make sure that proper paper-only wastebaskets are located near each printer, instructions are posted, and so on. The same can be said for recycling used toner cartridges. This is an opportunity to collaborate with the people in charge of the photocopying equipment in your organization because many of the issues overlap.

Some issues are the responsibility of the customers, but SAs can facilitate them by giving customers the right tools. SAs must also avoid becoming roadblocks that encourage their customers to adopt bad habits. For example, making sure that preview tools for common formats, such as PostScript, are available to all customers helps them print less. Printers and printing utilities should default to duplex (double-sided) printing. Don’t print a burst page before each printout. Replacing paper forms with web-based, paperless processes also saves paper, though making them as easy to use is a challenge.

Case Study: Creating a Recycling Program

A large company didn’t recycle toner cartridges, even though new ones came with shipping tags and instructions on how to return old ones. The vendor even included a financial incentive for returning old cartridges. The company’s excuse for not taking advantage of this system was simply that nobody had created a procedure within the company to do it!
Years of missed cost savings went by until someone finally decided to take the initiative to create a process. Once the new system was in place, the company saved thousands of dollars per year. Lesson learned: These things don’t create themselves. Take the initiative. Everyone will use the process if someone else creates it.

24.2 The Icing

SAs can build some interesting add-ons into their print service to make it more of a Rolls-Royce quality of service.

24.2.1 Automatic Failover and Load Balancing

We discussed redundant print servers; often, the failover for such systems is manual. SAs can build automatic failover systems to reduce the downtime associated with printer problems. If the printing service deals with high volumes, the SAs might consider using the redundant systems to provide load balancing. For example, there may be two print spoolers, with each handling half of the printers to balance the load and substituting for each other if one dies, to mitigate downtime.

Automated failover has two components: detection of a problem and the cutover to the other spooler. Detecting that a service is down is difficult to do properly. A spooler may be unable to print even if it responds to pings, accepts new connections, answers requests for status, and permits new jobs to be submitted. It is best if the server can give a more detailed diagnostic without generating printout. If no jobs are in the queue, you can make sure that the server is accepting new jobs. If jobs are in the queue, you can make sure that the current job has not been in the process of being printed for an inordinate amount of time, which may indicate a problem. You must be careful, because small PostScript jobs can generate many pages or can run for a long time without generating any pages. It is important to devise a way to avoid false positives. You must also differentiate between the server being out of service and the printer being out of service.

Outside of server problems, we find most printing
problems to be on the PC side, particularly the drivers and the application software. At least having reliable print servers to print documents to can reduce the chance of multiple simultaneous failures.

Outsourced Printing

If you need a fancy print feature only occasionally, such as binding, three-hole punch, high-volume printing, or extra-wide formats, many print shops accept PDF files. Walking a PDF file to a store once in a while is much more cost-effective than buying a fancy printer that is rarely used. Nearly all print shops will accept PDFs via email or on their web site. Print shop chains have created printer drivers that “print to the web.” These drivers submit the printout to their printing facility directly from applications. You can then log in to their web site to preview the output, change options, and pay. Most will ship the resulting output to you, which means that you never have to leave your office.

24.2.2 Dedicated Clerical Support

Printers are mechanical devices, and often their reliability can be increased by keeping untrained people away from their maintenance tasks. We don’t mean to disparage our customers’ technical abilities, but we’ve seen otherwise brilliant people break printers by trying to change the toner cartridge. Therefore, it can be advantageous to have a dedicated clerk or operator service them.

Many sites have enough printers that it can be a part-time job for someone to simply visit every printer every few days to verify that it is printing properly, has enough paper, and so on. This person can also be responsible for getting printers repaired. Although a company may have a service contract to “take care of that kind of thing,” someone still has to contact, schedule, and babysit the repair person. Arranging service and describing the problem can be very time consuming, and it’s less expensive to have a clerk do this task than an SA. Often, one can recruit a secretary or office manager to do this.

Ordering supplies can
often be delegated to secretaries or office clerks if you first get their manager’s permission. Write up a document with the exact order codes for toner cartridges so there is no guesswork involved.

24.2.3 Shredding

People print the darnedest things: private email, confidential corporate information, credit card numbers, and so on. We’ve even met someone who tested printers by printing UNIX’s /etc/passwd file, not knowing that the encrypted second field could be cracked.

Some sites shred very little, some have specific shredding policies, and others shred just about everything. We don’t have much to say about shredding except to note that it is good to shred anything that you wouldn’t want on the front page of the New York Times and that you should err on the side of caution about what you wouldn’t want to see there. Something that isn’t printed is even more secure than something shredded.

The other thing to note is that on-site shredding services bring a huge shredder on a truck to your site to perform the shredding services, and off-site shredding services take your papers back to their shredding station for treatment. Now and then, we hear stories that an off-site shredding service was found to not actually shred the paper as they had promised. We aren’t sure whether these are true or merely urban legends, but we highly recommend regular spot-checks of your off-site shredding service if you choose one. Shredding services are typically quite expensive, so you should make sure that you are getting what you pay for. We prefer on-site shredding, and we designate a person to watch the process to make sure that it is done properly.

24.2.4 Dealing with Printer Abuse

According to an old Usenet saying, “You can’t solve social problems using technology,” and this applies to printing as well. You can’t write a program to detect nonbusiness use of printers, and you can’t write a program to detect wasteful printing. However, the right peer pressure and policy enforcement
can go a long way. Your acceptable-use policy (Section 11.1.2) should include what constitutes printer abuse.

Billing on a per-page basis can create a business reason for conserving paper. If the goal is to control printing costs rather than to recover funds spent by the SA team on supplies, you might give each person a certain amount of “free” printing per month or let departments pool their “free” allotment. A lot of psychology is involved in a scheme like that. You wouldn’t want to create a situation in which people waste time doing other things because they fear that the boss will punish them for going over their allotment.

One site simply announced the top-ten page generators each month as a way to shame people into printing less. The theory was that people would print less if they learned that they were one of the largest consumers of printing services. However, a certain set of employees took this as a challenge and competed to appear in the listing for the most consecutive months. This technique might have been more effective if the list had been shown only to management or if an SA had personally visited the people and politely let them know of their status. Shame can work in certain situations and not in others.

We suggest a balanced, nontechnical approach. Place the printer in a very visible, public place. This will discourage more personal use than any policy or accounting system. If someone does something particularly wasteful, address the person. Don’t get petty about every incident; it’s insulting.

The 500-Page Printout

SAs were perturbed to find a 500-page printout of “game cheats” (tricks for winning various computer games) at the printer one day. When this was brought to the attention of the director, he dealt with it in a very smart way. Rather than scolding anyone, he sent email to all employees, saying that he had found this printout by the printer without a cover page to indicate who made the printout. He reminded people that a small
amount of nonbusiness printing was reasonable and that he didn’t want this obviously valuable document to accidentally go to the wrong person. Therefore, he asked the owner of the document to stop by his office to pick it up. After a week, the printout was recycled, unclaimed. Nothing more needed to be said.

24.3 Conclusion

Printing is a utility. Customers expect it to always work. The basis of a solid print system is well-defined policies on where printers will be deployed (desktop, centralized, or both), what kinds of printers will be used, how they will be named, and what protocols and standards will be used to communicate with them. Print system architectures can run from very decentralized (peer to peer) to very centralized. It is important to include redundancy and failover provisions in the architecture. The system must be monitored to ensure quality of service.

Users of the print system require a certain amount of documentation. How to print and the location of the printers they have access to should be documented. The printers themselves must be labeled. Printing has an environmental impact; therefore, SAs have a responsibility not only to work with other departments to create and sustain a recycling program but also to provide the right tools so that customers can avoid printing whenever possible.

The best print systems also have automated failover and load balancing rather than manual failover. They have clerks who maintain and refill supplies rather than SAs spending time on these tasks or inflicting untrained users on the printers’ delicate components. The best print systems provide shredding services for sensitive documents and recognize that many printing issues are social problems and therefore can’t be solved purely with technology.

Exercises

Describe the nontechnical print policies in your environment.
Describe the print architecture in your environment. Is it centralized, decentralized, or a mixture?
How reliable is your print system? How do you quantify that?
When there is an outage in your print system, what happens, and who is notified?
When new users arrive, how do they know how to print? How do they know your policies about acceptable use?
How do you deal with the environmental issues associated with printing at your location? List both the policies and processes you have, in addition to the social controls and incentives.
What methods to avoid printing are provided to your customers?

Chapter 25 Data Storage

The systems we manage store information. The capacity for computers to store information has doubled every year or two. The first home computers could store 120 kilobytes on a floppy disk. Now petabytes (millions of millions of kilobytes) are commonly bandied about. Every evolutionary jump in capacity has required a radical shift in techniques to manage the data.

You need to know two things about storage. The first is that it keeps getting cheaper, unbelievably so. The second is that it keeps getting more expensive, unbelievably so. This paradox will become very clear to you after you have been involved in data storage for even a short time.

The price of an individual disk keeps getting lower. The price per megabyte has become so low that people now talk about price per gigabyte. When systems are low on disk space, customers complain that they can go to the local computer store and buy a disk for next to nothing. Why would anyone ever be short on space?
Unfortunately, the cost of connecting and managing all these disks seems to grow without bound. Previously, disks were connected with a ribbon cable or two, which cost a dollar each. Now fiber-optic cables connected to massive storage array controllers cost thousands. Data is stored multiple times, and complicated protocols are used to access the data from multiple simultaneous hosts. Massive growth requires radical shifts in disaster-recovery systems, or backups. Compared to what it takes to manage data, the disks themselves are essentially free.

The shift in emphasis from having storage to managing the data through its life cycle is enormous. Now the discussion is no longer about price per gigabyte but price per gigabyte-month. A study published in early 2006 by a major IT research firm illustrated the variability of storage costs. For array-based simple mirrored storage, the report found two orders of magnitude difference between low-end and high-end offerings.

Storage is a huge topic, with many fine books written on it. Therefore, we focus on basic terminology, some storage-management philosophy, and key techniques. Each of these is a tool in your toolbox, ready to be pulled out as needed.

25.1 The Basics

Rather than storage being a war between consumers and providers, we promote the radical notion that storage should be managed as a community resource. This reframes storage management in a way that lets everyone work toward common goals for space, uptime, performance, and cost.

Storage should be managed like any other service, and we have advice in this area. Performance, troubleshooting, and evaluating new technologies are all responsibilities of a storage service team. But first, we begin with a whirlwind tour of common storage terminology and technology.

25.1.1 Terminology

As a system administrator, you may already be familiar with a lot of storage terminology. Therefore, we briefly highlight the terminology and key concepts used later in the
chapter.

25.1.1.1 Key Individual Disk Components

In order to understand the performance issues of various storage systems, it is best to have an understanding of the underlying media and how basic disk operations work. Understanding the bottlenecks of the individual components gives a basis for understanding the bottlenecks and improvements that appear in the more complex systems.

• Spindle, platters, and heads: A disk is made up of several platters on which data is stored. The platters are all mounted on a single spindle and rotate as a unit. Data is stored on the platters in tracks. Each track is a circle with the spindle as its center, and each track is at a different radius from the center. A cylinder is all the tracks at a given radius on all the platters. Data is stored in sectors, or blocks, within the track, and tracks have different numbers of blocks based on how far from the center they are. Tracks farther from the center are longer and therefore have more blocks. The heads read and write the data on the disk by hovering over the appropriate track. There is one head per platter, but they are all mounted on the same robotic arm and move as a unit. Generally, an entire track or an entire cylinder will be read at once, and the data cached, as the time it takes to move the heads to the right place (seek time) is longer than it takes for the disk to rotate 360 degrees.

• Drive controller: The electronics on the hard drive, the drive controller implements the drive protocol, such as SCSI or ATA. The drive controller communicates with the host to which the disk is attached. Drive controllers are important for their level of standards compliance and for any performance enhancements that they implement, such as buffering and caching.

• Host bus adapter (HBA): The HBA is in the host and manages communication between the disk drive(s) and the server. The HBA uses the data access protocol to communicate with the drive controller. A smart HBA can also be a source of
performance enhancements. It is usually on the motherboard of the computer or an add-on card.

25.1.1.2 RAID: A Redundant Array of Independent Disks

RAID is an umbrella category for techniques that use multiple independent hard drives to provide storage that is larger, more reliable, or faster than a single drive can provide. Each RAID technique is called a level (see Table 25.1).

Table 25.1 Commonly Used RAID Levels

RAID Level   Methods              Characteristics
0            Stripes              Faster reads and writes; poor reliability
1            Mirrors              Faster reads; good reliability; very expensive
5            Distributed parity   Faster reads; slower writes; more economical
10           Mirrored stripes     Faster reads; best reliability; most expensive

• RAID 0, also known as striping, spreads data across multiple disks in such a way that they can still act as one large disk. A RAID 0 virtual disk is faster than a single disk; multiple read and write operations can be executed in parallel on different disks in the RAID set. RAID 0 is less reliable than a single disk, because the whole set is useless if a single disk fails. With more disks, failures are statistically more likely to happen.

• RAID 1, also known as mirroring, uses two or more disks to store the same data. The disks should be chosen with identical specifications. Each write operation is done to both (or all) disks, and the data is stored identically on both (all). Read operations can be shared between the disks, speeding up read access. Writes are as slow as the slowest disk. RAID 1 increases reliability. If one disk fails, the system keeps working.

Remembering RAID

RAID 0 and RAID 1 are two of the most commonly used RAID strategies. People often find it difficult to remember which is which. Here’s our mnemonic: “RAID 0 provides zero help when a disk dies.”

• RAID 2 and 3 are rarely used strategies that are similar enough to RAID 5 that we explain the general concept there. However, it should be noted that RAID 3 gives particularly good performance for sequential reads. Therefore,
large graphics files, streaming media, and video applications often use RAID 3. If your organization is hosting such files, you may wish to consider a RAID 3 implementation for that particular storage server, especially if files tend to be archived and are not changed frequently.

• RAID 4 is also similar to RAID 5 but used rarely because it is usually slower. RAID 4 is faster than RAID 5 only when a file system is specifically designed for it. One example is Network Appliance’s file server, with its highly tuned WAFL file system.

• RAID 5, also known as distributed parity, seeks to gain reliability, like mirroring, with lower cost. RAID 5 is like RAID 0 (striping, to get a larger volume) but with the equivalent of one additional disk used to store recovery information, which is spread across the disks in the set. If a single disk fails, the RAID set continues to work. When the failed disk is replaced, the data on that disk is rebuilt, using the recovery information and the data on the surviving disks. Performance is reduced during the rebuild. RAID 5 gives increased read speed, as with RAID 0. However, writes can take longer, as creating and writing the recovery information requires reading information from other disks in the set.

• RAID 6 adds a second set of recovery information to RAID 5 so that the set survives two simultaneous disk failures. RAIDs 7–9 either don’t exist or are marketing hype for variations on the preceding. Really.

• RAID 10, originally called RAID 1 + 0, uses striping for increased size and speed and mirroring for reliability. RAID 10 is a RAID 0 group that has been mirrored onto another group. Each individual disk in the RAID 0 group is mirrored. Since mirroring is RAID 1, the joke is that this is 1 + 0, or 10. Rebuilds on a RAID 10 system are not as disruptive to performance as rebuilds on a RAID 5 system. As with RAID 1, multiple mirrors are possible and are commonly used.

RAID systems often allow for a hot spare, an extra unused disk in the chassis. When a disk fails, the system automatically rebuilds the data onto a hot spare. (This is not applicable to RAID 0, where the lost data cannot be rebuilt.)
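As a rough illustration of how recovery information lets a parity-based RAID set rebuild a failed disk, the following Python sketch uses byte-wise XOR parity, the usual textbook scheme. It is a simplification under stated assumptions: each "disk" is a single equal-sized block, and the parity is held in one separate value rather than distributed across the disks as a real RAID 5 controller would do. The function names are hypothetical and do not come from any RAID product.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recompute the one missing block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three data "disks" (one block each) and the parity computed from them.
data = [b"disk", b"fail", b"test"]
parity = xor_blocks(data)

# Simulate losing the second disk, then rebuild it from what is left.
recovered = rebuild_missing([data[0], data[2]], parity)
print(recovered == data[1])
```

Note that the rebuild has to read every surviving block, which is one way to see why performance drops while a parity-based set is rebuilding.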
Some RAID systems can have multiple RAID sets but a single hot spare. The first RAID set to require a new disk takes the spare, which saves the cost of multiple spares.

25.1.1.3 Volumes and Filesystems

A volume is a chunk of storage as seen by the server. Originally, a volume was a disk, and every disk was one volume. However, with partitioning, RAID systems, and other techniques, a volume can be any kind of storage provided to the server as a whole. The server sees a volume as one logical disk, even if behind the scenes it is made up of more complicated parts.

Each volume is formatted with a filesystem. Each of the many filesystem types was invented for a different purpose or to solve a different performance problem. Some common Windows filesystems are FAT, DOS FAT32, and NTFS. UNIX/Linux systems have UFS, UFS2, EXT2/EXT3, ReiserFS, and many experimental ones. Some filesystems do journaling: keeping a list of changes requested to the filesystem and applying them in batch. This improves write speed and makes recovery faster after a system crash.

25.1.1.4 DAS: Directly Attached Storage

DAS is simply the conventional hard disk connected to the server. DAS describes any storage solution whereby the storage is connected to the server with cabling rather than with a network. This includes a RAID array that is directly attached to a server.

25.1.1.5 NAS: Network-Attached Storage

NAS is a new term for something that’s been around for quite a while: clients accessing the storage attached to a server. Examples are UNIX clients that use NFS to access files on a server and Microsoft Windows systems that use CIFS to access files on a Windows server. Many vendors package turnkey network file servers that work out of the box with several file-sharing protocols. Network Appliance and EMC make such systems for large storage needs; Linksys and other companies make smaller systems for consumers and small business.

25.1.1.6 SANs: Storage-Area Networks

A SAN is a
system in which disk subsystems and servers both plug into a dedicated network, a special low-latency, high-speed network optimized for storage protocols. Any server can attach to any storage system, at least as access controls permit. What servers can attach to is a storage volume that has been defined and is referred to by its logical unit number (LUN). A LUN might be a disk, a slice of a RAID group, an entire storage chassis, or anything the storage systems make available. Servers access LUNs on the block level, not the file system level. Normally, only one server can attach to a particular LUN at a time; otherwise, servers will get confused as one system updates blocks without the others realizing it. Some SAN systems provide cluster file systems, which elect one server to arbitrate access so that multiple servers can access the same volume simultaneously. Tape backup units can also be attached to the network and shared, with the benefit that many servers share a single expensive tape drive.

Another benefit of SANs is that they reduce isolated storage. With DAS, some servers may be lacking free disk space while others have plenty. The free space is not available to servers that need it. With SAN technology, each server can be allocated volumes as big as it needs, and no disk space is isolated from being used.

25.1.2 Managing Storage

Management techniques for storage rely on a combination of process and technology. The most successful solutions enlist the customer as an ally, instead of making the SA into the “storage police.”

It is not uncommon for customers to come to an SA with an urgent request for more storage for a particular application. There is no magic answer, but applying these principles can greatly reduce the number of so-called emergency storage requests that you receive. It is always best to be proactive rather than reactive, and that is certainly the case for storage.

25.1.2.1 Reframe Storage as a Community Resource

Storage allocation becomes less political and
customers become more self-managing when storage servers are allocated on a group-by-group basis. It is particularly effective if the cost of the storage service comes from a given group’s own budget. That way, the customers and their management chain feel that they have more control and responsibility.

Studies show that roughly 80 percent of the cost of storage is spent on overhead (primarily support and backups) rather than on the price of the hard drives. It should be possible to work with management to also pass on at least some of the overhead costs to each group. This approach is best for the company because the managers whose budgets are affected by increasing storage needs are also the ones who can ask their groups to clean up obsolete work and save space.

However, it is sometimes not possible to dedicate a storage server to a single group. When a storage server must serve many groups, it is always best to start with a storage-needs assessment with your customer base; when the assessment is complete, you will know how much storage a group needs currently, whether the existing storage is meeting that need, and how much future capacity that group anticipates.

By combining data from various groups’ needs assessments, you will be able to build an overall picture of the storage needs of the organization. In many cases, reshuffling some existing storage may suffice to meet the needs identified. In other cases, an acquisition plan must be created to bring in additional storage.

Doing a storage assessment of departments and groups begins the process of creating a storage community. As part of the assessment, groups will be considering their storage needs from a business and work-oriented perspective rather than simply saying, “The more, the better!” One great benefit of bringing this change in attitude to your customers is that the SAs are no longer “the bad guy” and instead become people who are helping the customers implement their desired goals. Customers
feel empowered to pursue their storage agendas within their groups, and a whole set of common customer complaints disappears from the support staff radar.

❖ Raw versus Usable Disk Capacity

When purchasing storage, it is important to remember that raw storage is significantly different from usable storage. A site needed a networked storage array to hold several terabytes of existing data, projected to grow within a few years. The customer told the vendor, who cheerfully sent a multiterabyte array system. The customer began configuring the array, and a significant amount of space was consumed for file system overhead plus disks used for RAID redundancy, snapshots, and hot spares. Soon, the customer discovered that current usage amounted to 100 percent of what was remaining. The system would not support any growth.

Luckily, the customer was able to work with the vendor to replace the disks with ones twice as large. Because this was done before the customer took official delivery, the disks were not considered “used.” Although the site’s future capacity problem was now solved, the problems continued. More disk required more CPU overhead. The application was extremely sluggish for weeks while an emergency upgrade to the array processor controller was arranged. The finance people were not very happy about this, having approved the original system, then the drive upgrades, and, finally, needing to write a big check to upgrade the processor.

25.1.2.2 Conduct a Storage-Needs Assessment

You might think that the first step in storage-needs assessment is to find out who is using what storage systems. That is actually the second step. The first step is to talk to the departments and groups you support to find out their needs. Starting with a discovery process based on current usage will often make people worry that you are going to be taking resources away from them or redistributing resources without their input.

By going directly to your customers and asking what
does and doesn’t work about their current storage environment, you will be building a bridge and establishing trust. If you can show graphs of their individual storage growth over the last year and use them to educate rather than scold, it can help you understand their real needs. You may be surprised at what you discover, both positive and negative. Some groups may be misrepresenting their needs, for fear of being affected by scarcity. Other groups may be struggling with too few resources but not complaining because they assume that everyone is in the same situation.

What kinds of questions should you ask in the storage assessment? Ask about total disk usage, both current and projected, for up to the next 18 months. It’s a good idea to use familiar units, if applicable, rather than simply “months.” It may be easier for customers to specify growth in percentages rather than gigabytes. Ask about the next several quarters, for instance, in a company environment, and the upcoming terms in an academic environment. You should also inquire about what kinds of applications are being run and any problems that are being encountered in day-to-day usage.

You may feel that you are adequately monitoring a group’s usage and seeing a pattern emerge, such that you are comfortable predicting the group’s needs, but some aspects of your storage infrastructure may already be stressed in ways that ... be unable to meet the demands placed on it by some of the departments in the company. Before abandoning the centralized service, try to understand the reason for the failures of the central group... questions, the SAs realize that they need to monitor the systems in question and gather utilization data over a period of time in order to see the trends and the peaks in usage. There are many other... Ideally, the amount of condensing that the system does and the expiration time of the data should be configurable. You also need to consider how the monitoring system gathers its data. Typically, a system
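The gap between raw and usable capacity described in the sidebar above, and the percentage-based growth questions suggested for the assessment, can be turned into a quick back-of-the-envelope estimate. The Python sketch below is only an illustration: the overhead fractions, the disk counts, and the function names are assumptions made up for the example, not figures from the text or from any vendor. It covers only the RAID levels listed in Table 25.1 and assumes two-way mirroring for RAID 10.

```python
import math


def raid_data_disks(level, disks):
    """Number of disks' worth of space left for data at a given RAID level."""
    if level == 0 and disks >= 1:
        return disks              # striping: no redundancy
    if level == 1 and disks >= 2:
        return 1                  # every disk holds the same data
    if level == 5 and disks >= 3:
        return disks - 1          # one disk's worth of parity
    if level == 10 and disks >= 4 and disks % 2 == 0:
        return disks // 2         # assumed two-way mirrors of stripes
    raise ValueError("unsupported RAID level or disk count")


def usable_tb(disks, disk_tb, level, hot_spares=0,
              fs_overhead=0.05, snapshot_reserve=0.20):
    """Estimate usable terabytes; the overhead fractions are assumptions."""
    active = disks - hot_spares   # hot spares hold no data until a failure
    data_tb = raid_data_disks(level, active) * disk_tb
    return data_tb * (1 - fs_overhead) * (1 - snapshot_reserve)


def months_until_full(current_tb, capacity_tb, monthly_growth):
    """Months of compound growth before current usage reaches capacity."""
    if current_tb <= 0 or monthly_growth <= 0:
        raise ValueError("usage and growth rate must be positive")
    if current_tb >= capacity_tb:
        return 0.0
    return math.log(capacity_tb / current_tb) / math.log(1 + monthly_growth)


# Hypothetical array: 16 disks of 0.5 TB, RAID 5, one hot spare.
raw = 16 * 0.5
usable = usable_tb(16, 0.5, level=5, hot_spares=1)
print(f"raw {raw:.2f} TB, usable about {usable:.2f} TB")
print(f"months of headroom at 3% monthly growth from 4 TB: "
      f"{months_until_full(4.0, usable, 0.03):.1f}")
```

Running an estimate like this before talking to a vendor makes it easier to state requirements in usable rather than raw terms.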
