The practice of system and network administration (second edition) part 1

The Practice of System and Network Administration
Second Edition

Thomas A. Limoncelli
Christina J. Hogan
Strata R. Chalup

Upper Saddle River, NJ • Boston • Indianapolis • San Francisco • New York • Toronto • Montreal • London • Munich • Paris • Madrid • Capetown • Sydney • Tokyo • Singapore • Mexico City

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

The publisher offers excellent discounts on this book when ordered in quantity for bulk purchases or special sales, which may include electronic versions and/or custom covers and content particular to your business, training goals, marketing focus, and branding interests. For more information, please contact: U.S. Corporate and Government Sales, (800) 382-3419, corpsales@pearsontechgroup.com

For sales outside the United States, please contact: International Sales, international@pearsoned.com

Visit us on the Web: www.awprofessional.com

Library of Congress Cataloging-in-Publication Data
Limoncelli, Tom.
The practice of system and network administration / Thomas A. Limoncelli, Christina J. Hogan, Strata R. Chalup. 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN-13: 978-0-321-49266-1 (pbk. : alk. paper)
Computer networks--Management. Computer systems. I. Hogan, Christine. II. Chalup, Strata R. III. Title.
TK5105.5.L53 2007
004.6068--dc22 2007014507

Copyright © 2007 Christine Hogan, Thomas A. Limoncelli, Virtual.NET Inc., and Lumeta Corporation

All rights reserved. Printed in the United States of America. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, write to:

Pearson Education, Inc.
Rights and Contracts Department
75 Arlington Street, Suite 300
Boston, MA 02116
Fax: (617) 848-7047

ISBN 13: 978-0-321-49266-1
ISBN 10: 0-321-49266-8

Text printed in the United States on recycled paper at RR Donnelley in Crawfordsville, Indiana.
First printing, June 2007

Contents at a Glance

Part I Getting Started
Chapter 1 What to Do When
Chapter 2 Climb Out of the Hole

Part II Foundation Elements
Chapter 3 Workstations
Chapter 4 Servers
Chapter 5 Services
Chapter 6 Data Centers
Chapter 7 Networks
Chapter 8 Namespaces
Chapter 9 Documentation
Chapter 10 Disaster Recovery and Data Integrity
Chapter 11 Security Policy
Chapter 12 Ethics
Chapter 13 Helpdesks
Chapter 14 Customer Care

Part III Change Processes
Chapter 15 Debugging
Chapter 16 Fixing Things Once
Chapter 17 Change Management
Chapter 18 Server Upgrades
Chapter 19 Service Conversions
Chapter 20 Maintenance Windows
Chapter 21 Centralization and Decentralization

Part IV Providing Services
Chapter 22 Service Monitoring
Chapter 23 Email Service
Chapter 24 Print Service
Chapter 25 Data Storage
Chapter 26 Backup and Restore
Chapter 27 Remote Access Service
Chapter 28 Software Depot Service
Chapter 29 Web Services

Part V Management Practices
Chapter 30 Organizational Structures
Chapter 31 Perception and Visibility
Chapter 32 Being Happy
Chapter 33 A Guide for Technical Managers
Chapter 34 A Guide for Nontechnical Managers
Chapter 35 Hiring System Administrators
Chapter 36 Firing System Administrators

Epilogue

Appendixes
Appendix A The Many Roles of a System Administrator
Appendix B Acronyms
Bibliography
Index

Contents

Preface
Acknowledgments
About the Authors

Part I Getting Started

1 What to Do When
1.1 Building a Site from Scratch
1.2 Growing a Small Site
1.3 Going Global
1.4 Replacing Services
1.5 Moving a Data Center
1.6 Moving to/Opening a New Building
1.7 Handling a High Rate of Office Moves
1.8 Assessing a Site (Due Diligence)
1.9 Dealing with Mergers and Acquisitions
1.10 Coping with Machine Crashes
1.11 Surviving a Major Outage or Work Stoppage
1.12 What Tools Should Every Team Member Have?
1.13 Ensuring the Return of Tools
1.14 Why Document Systems and Procedures?
1.15 Why Document Policies?
1.16 Identifying the Fundamental Problems in the Environment
1.17 Getting More Money for Projects
1.18 Getting Projects Done
1.19 Keeping Customers Happy
1.20 Keeping Management Happy
1.21 Keeping SAs Happy
1.22 Keeping Systems from Being Too Slow
1.23 Coping with a Big Influx of Computers
1.24 Coping with a Big Influx of New Users
1.25 Coping with a Big Influx of New SAs
1.26 Handling a High SA Team Attrition Rate
1.27 Handling a High User-Base Attrition Rate
1.28 Being New to a Group
1.29 Being the New Manager of a Group
1.30 Looking for a New Job
1.31 Hiring Many New SAs Quickly
1.32 Increasing Total System Reliability
1.33 Decreasing Costs
1.34 Adding Features
1.35 Stopping the Hurt When Doing “This”
1.36 Building Customer Confidence
1.37 Building the Team’s Self-Confidence
1.38 Improving the Team’s Follow-Through
1.39 Handling Ethics Issues
1.40 My Dishwasher Leaves Spots on My Glasses
1.41 Protecting Your Job
1.42 Getting More Training
1.43 Setting Your Priorities
1.44 Getting All the Work Done
1.45 Avoiding Stress
1.46 What Should SAs Expect from Their Managers?
1.47 What Should SA Managers Expect from Their SAs?
1.48 What Should SA Managers Provide to Their Boss?

2 Climb Out of the Hole
2.1 Tips for Improving System Administration
2.1.1 Use a Trouble-Ticket System
2.1.2 Manage Quick Requests Right
2.1.3 Adopt Three Time-Saving Policies
2.1.4 Start Every New Host in a Known State
2.1.5 Follow Our Other Tips
2.2 Conclusion

Part II Foundation Elements

3 Workstations
3.1 The Basics
3.1.1 Loading the OS
3.1.2 Updating the System Software and Applications
3.1.3 Network Configuration
3.1.4 Avoid Using Dynamic DNS with DHCP
3.2 The Icing
3.2.1 High Confidence in Completion
3.2.2 Involve Customers in the Standardization Process
3.2.3 A Variety of Standard Configurations
3.3 Conclusion

4 Servers
4.1 The Basics
4.1.1 Buy Server Hardware for Servers
4.1.2 Choose Vendors Known for Reliable Products
4.1.3 Understand the Cost of Server Hardware
4.1.4 Consider Maintenance Contracts and Spare Parts
4.1.5 Maintaining Data Integrity
4.1.6 Put Servers in the Data Center
4.1.7 Client Server OS Configuration
4.1.8 Provide Remote Console Access
4.1.9 Mirror Boot Disks
4.2 The Icing
4.2.1 Enhancing Reliability and Service Ability
4.2.2 An Alternative: Many Inexpensive Servers
4.3 Conclusion

5 Services
5.1 The Basics
5.1.1 Customer Requirements
5.1.2 Operational Requirements
5.1.3 Open Architecture
5.1.4 Simplicity
5.1.5 Vendor Relations

Chapter 18 Server Upgrades

test might be useful, go for it. Sometimes, the human eyeball catches things that the best automation can’t.

❖ Test-Driven Development (TDD) TDD is a relatively new trend in the industry. Previously, developers wrote code and then wrote tests to verify the code. (Well, not really; rarely did anyone have time to write the tests.)
TDD is the reverse. The tests are written first and then the code. This ensures that the tests get written for all new code. Since the tests are executed in an automated fashion, you build up a body of tests that stay with the project. As code evolves, there is less risk that a change will break functionality without being noticed. Developers are free to rewrite, or refactor, big or small parts of the code, knowing that if they break something, it will be noticed right away. As a result, software has fewer bugs.

Tests are better than comments in code (documentation) because comments often become out of date without anyone noticing. Tests that cover all the edge cases are more thorough than documentation could ever be. Tests do not get out of date, because they can be triggered as part of the build process to alert developers of bugs when they are introduced into the code.

We would like to see the field of system administration learn from TDD and adopt these practices.

Keeping Tests Around for Later

A large business wanted to test 400 UNIX servers just after midnight of Y2K to ensure that the core functionality of the operating system and associated infrastructure were working correctly. A series of noninvasive tests was created, each with a PASS/FAIL response: Is the box up, can we log in, can it see the NIS servers, is the time correct, can it resolve DNS, can it mount from the NFS servers and read a file, is the automounter working, and so on. Using a central administration point, the tests could be fired off on multiple boxes at a time and the results collected centrally. All 400 boxes were tested within 20 minutes, and the team was able to report their PASS to the Y2K tracking-management team well in advance of other, smaller, units.

So popular did the tests become with the SA team that they became part of the daily monitoring of the environment. The tests found other uses. An obscure bug in the Solaris 2.5.1 and 2.6 automounters could be triggered after a major network outage but only on a few random machines. By running this test suite, the affected machines were quickly identified after any outage.

18.1.4 Step 4: Write a Back-Out Plan

If something goes wrong during the upgrade, how will you revert to the former state? How will you “undo”? How long will that take? Obviously, if something small goes wrong, the usual debugging process will try to fix it. However, you can use up the entire maintenance window (the time allocated for the outage) trying just one more thing to make an upgrade work. It is therefore important to have a particular time at which the back-out plan will be activated. Take the agreed-on end time and subtract the back-out time, as well as the time it would take to test that the back-out is complete. When you reach that time, you must either declare success or begin your back-out plan. It is useful to have the clock watcher be someone outside the group directly performing the upgrade, such as a manager. The back-out plan might also be triggered by one or more key tests failing, or by unexpected behavior related to the upgrade.

Small- to medium-size systems can be backed up completely before an upgrade begins. It can be even easier to clone the disks and perform the upgrade on the clones. If there are serious problems, the original disks can be reinstalled. Larger systems are more difficult to replicate. Replicating the system disks and doing incremental backups of the data disks may be sufficient in this case.

❖ Upgrade the Clone
Q: If you are going to clone a hard disk before the server is upgraded, should you perform the upgrade on the clone or on the original?
A: Upgrade the clone. If you upgrade the original and the upgrade fails, you don’t want to discover only then that the clone wasn’t properly made: you’ve just destroyed the original. We’ve seen this happen many times. Cloning a disk is easy to get wrong. Sometimes, the data was copied but the boot block has a problem and the disk was not bootable; sometimes, the data wasn’t copied completely; sometimes, the data wasn’t copied at all. To avoid this situation, boot up on the clone. Make sure that the clone works. Then perform the upgrade on the clone.

18.1.5 Step 5: Select a Maintenance Window

The next step is a test of your technical and nontechnical skills. You must come to agreement with your customers on a maintenance window, that is, when the upgrade will happen. To do that, you must know how long the process will take and have a plan if the upgrade fails. That is more of a technical issue.

• When? Your SLA should include provisions for when maintenance can be done. Customers usually have a good idea of when they can withstand an outage. Most business systems are not needed at night or on the weekend. However, SAs might not want to work those hours, and the vendor support might not be available at certain times. A balance must be found. Sites that are required to be up 24/7 have a maintenance plan engineered into the entire operation, perhaps including fall-back systems.

• How long? The length of the maintenance window equals the time the upgrade should take, plus the time testing should take, plus the time it will take to fix problems, plus the time it takes to execute the back-out plan, plus the time it takes to ensure that the back-out worked. Initially, it is best to double or triple your estimates to adjust for hubris. As time goes on, your estimates will become more accurate. Whatever length of time you have calculated, announce the window to be much longer. Sometimes, you may get started late. Sometimes, things take longer than you expect for technical reasons (hardware, software, or unrelated or unexpected events) or nontechnical reasons (weather or car problems). The flip side to calling a longer time window is that if you complete the upgrade and testing early, you should always notify the customers.

• What time is the back-out plan initiated? It is a good idea to clearly document the exact time that the back-out plan will be initiated, for reasons described in step 4.

❖ Scotty Always Exaggerated In the Star Trek: The Next Generation episode “Relics,” James Doohan made a cameo appearance as Scotty from the original series. Among Scotty’s interesting revelations was that he always exaggerated when giving estimates to Captain James T. Kirk. Thus, he always looked like a miracle worker when problems were solved more quickly than expected. Now we know why the warp drive was always working sooner than predicted and the environmental systems lasted longer than indicated. Follow Scotty’s advice! Exaggerate your estimates!
But also follow Scotty’s practice of letting people know as soon as the work is tested and complete.

Case Study: The Monday Night Carte Blanche

When Tom worked at a division of Mentor Graphics, the SA staff had the luxury of a weekly maintenance window. Monday night was SA Carte Blanche Night. Users were expected to be logged out at PM, and the SA staff could use that evening to perform any kind of major upgrades that would require bringing down services. Every Monday by PM, the customers were informed of what changes would be happening and when the systems should be usable again. Customers eventually developed a habit of planning non-work-related activities on Monday nights. Rumor has it that some spent the time with their family.

Although it required a big political investment to get the practice approved through management, it was an important factor in creating high reliability in the division’s network. There was rarely a reason to put off timely system upgrades. Problems during the week could be taken care of with quick fixes, but long-term fixes were done efficiently on Monday night. Unlike some environments in which the long-term fixes were never implemented, those were always put in relatively soon.

When there wasn’t much to be done, one supervisor believed it was important to reboot some critical servers at PM to “encourage” users to go home for the night. He believed that this helped the users maintain their habit of not planning anything critical for Monday night.

Of course, the SAs were flexible. When the customers were up against a critical deadline and would be working around the clock, the SAs would cancel the Monday night maintenance or collaborate with the customers to determine which outages could happen without interfering with their work.

18.1.6 Step 6: Announce the Upgrade as Appropriate

Now announce the upgrade to the customers. Use the same format for all announcements so that customers get used to them. Depending on the culture of your environment, the message may best be distributed by email, voicemail, desk-to-desk paper memo, newsgroup posting, web page, note on door, or smoke signals. No matter what format, the message should be brief and to the point. Many people read only the Subject line, so make it a good one, as shown in Figure 18.1.

It is better to have a blank template that is filled out each time than to edit previous announcements to include new information. This prevents the form from mutating over time. It also prevents the common problem of forgetting to change some parts. For example, when creating Figure 18.1, we initially used a real announcement that referred to a router reboot. We changed it to be about servers instead but forgot to change the Subject: line. The example went through four rounds of proofreading before anyone noticed this. This wouldn’t have happened if we had started with a blank template instead.

    To: all-users
    Subject: SERVER REBOOT: PM TODAY
    From: System Administration Group
    Reply-To: tom@example.com
    Date: Thu, 16 Jun 2001 10:32:13 -0500

    WHO IS AFFECTED:
    All hosts on DEVELOPER-NET, TOWNVILLE-NET, and BROCCOLI-NET

    WHAT WILL HAPPEN:
    All servers will be rebooted

    WHEN?
    Today between 6-8 PM (should take hour)

    WHY?
    We are in the process of rolling out new kernel tuning parameters to all servers. This requires a reboot. The risk is minimal. For more information please visit: http://portal.example.com/sa/news0005

    I OBJECT!
    Send mail to "help" and we will try to reschedule. Please name the server you want us to keep up today.

Figure 18.1 Sample upgrade message

18.1.7 Step 7: Execute the Tests

Right before the upgrade begins, perform the tests. This last-minute check ensures that you won’t be chasing problems after the upgrade that existed before the upgrade. Imagine the horror of executing the back-out plan, only to discover that the failing test is still failing.

18.1.8 Step 8: Lock out Customers

It is generally better to let customers log out gracefully than to let them be kicked out by a reboot or disconnection of service. Different services have different ways to do this. Use the facilities available in the OS to prevent new logins from occurring during the maintenance window.

Many customers use an attempt to log in or to access a resource as their own test of an upgrade. If the attempt succeeds, the customer believes that the system is available for normal use, even if no announcement has been made. Thus it is important to lock out customers during a maintenance window.

18.1.9 Step 9: Do the Upgrade with Someone Watching

This is where most SA books begin. Aren’t you glad you bought this book instead?
Now, the moment you’ve all been waiting for: Perform the upgrade as your local procedures dictate. Insert the DVD, reboot, whatever.

System upgrades are too critical to do alone. First of all, we all make mistakes, and a second set of eyes is always useful. Upgrades aren’t done every day, so everyone is always a little out of practice. Second, a unique kind of mentoring goes on when two people do a system upgrade together. System upgrades often involve extremes of our technical knowledge. We use commands, knowledge, and possibly parts of our brains that aren’t used at other times. You can learn a lot by watching and understanding the techniques that someone else uses at these times. The increasingly popular practice of codevelopment or so-called peer programming has developers working in pairs and taking turns being the one typing. This is another development practice that SAs can benefit from using.

If the upgrade isn’t going well, it is rarely too early to escalate to a colleague or senior member of your team. A second set of eyes often does wonders, and no one should feel ashamed about asking for help.

18.1.10 Step 10: Test Your Work

Now repeat all the tests developed earlier. Follow the usual debugging process if they fail. The tests can be repeated time and time again as the problem is debugged. It is natural to run a failing test over again each time a fix is attempted. However, since many server processes are interrelated, be sure to run the full suite before declaring the upgrade a success. The fix for the test that failed may have broken a previously successful test!
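The advice to rerun the full suite, rather than only the test that failed, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a tool the book describes: the check names and the `run_suite` and `upgrade_succeeded` helpers are hypothetical, and the stand-in checks return fixed values so that the example runs anywhere. Real checks would probe DNS, NFS mounts, logins, and so on.

```python
from typing import Callable, Dict

# Hypothetical stand-in checks: each returns True for PASS, False for FAIL.
# Real checks would probe the upgraded server (DNS lookup, NFS mount, login).
def check_dns() -> bool:
    return True

def check_nfs_mount() -> bool:
    return True

def check_login() -> bool:
    raise TimeoutError("login timed out")  # simulate a broken service

def run_suite(checks: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run every check, even after a failure, and map each name to PASS/FAIL."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "PASS" if check() else "FAIL"
        except Exception:
            # A check that raises counts as a failure, not a crash of the suite.
            results[name] = "FAIL"
    return results

def upgrade_succeeded(results: Dict[str, str]) -> bool:
    """Declare success only when every test in the suite passes."""
    return all(status == "PASS" for status in results.values())

SUITE = {"dns": check_dns, "nfs": check_nfs_mount, "login": check_login}

results = run_suite(SUITE)
for name, status in results.items():
    print(f"{name}: {status}")
print("declare success" if upgrade_succeeded(results) else "keep debugging")
```

Because `run_suite` never stops at the first failure, a fix that repairs one test while breaking another shows up in the same run.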
Customers should be involved here. As with the helpdesk model in Chapter 14, the job isn’t done until customers have verified that everything is complete. This may mean having the customer called at a prearranged time, or the customer may agree to report back the next day, after the maintenance window has elapsed. In that case, getting the automated tests right is even more critical.

18.1.11 Step 11: If All Else Fails, Rely on the Back-Out Plan

If the clock watcher announces that it is time to begin the back-out plan, you have to begin the back-out plan. This may happen if the upgrade is taking longer than expected or if it is complete but the tests continue to fail. The decision is driven entirely by the clock; it is not about you or the team. It can be disappointing and frustrating to back out of a complex upgrade, but maintaining the integrity of the server is the priority.

Reverting the system back to its previous state should not be the only component of the back-out plan. Customers might agree that if only certain tests fail, they may be able to survive without that service for a day or two while it is repaired. Decide in advance the action plan for each potential failure.

After the back-out plan is executed, the services should be tested again. At this point it is important to record in your checklist the results of your changes. This is useful in reporting status back to management, record keeping for improving the process next time, or recalling what happened during a postmortem. Record specifics such as “implemented according to plan,” “implemented but exceeded change window,” “partial implementation; more work to be done,” “failed; change backed out,” “failed; service unusable; end of world predicted.” If possible, capture the output of the test suite and archive it along with the status information. This will help immensely in trying to remember what happened next week, month, or year.

18.1.12 Step 12: Restore Access to Customers

Now it is safe to let customers start using the system again. Different services have different ways to permit this. It is often difficult to test without letting all users in, but there are some ways to do this. For example, when upgrading an email server, you can configure other email servers to not relay email to the server being upgraded. While those servers are holding email, you can manually test the upgraded server and then enable the surrounding servers one at a time, keeping a mindful eye on the newly upgraded server.

18.1.13 Step 13: Communicate Completion/Back-Out

At this point, the customers are notified that the upgrade is complete or, if the back-out plan was initiated, what was accomplished, what didn’t get accomplished, and the fact that the systems are usable again. This has three goals. First, it tells people that the services they have been denied access to are now usable. Second, it reminds the customers what has changed. Finally, if they find problems that were not discovered during your own testing, it lets them know how to report problems they have found. If the back-out plan was initiated, customers should be informed that the system should be operating as it had before the upgrade attempt.

Just as there are many ways to announce the maintenance window, there are many ways to communicate the completion. There is a catch-22 here. Customers cannot read an email announcement if the email service is affected by the outage. However, if you keep to your maintenance window, then email, for example, will be working and customers can read the email announcement. If customers hear nothing, they will assume that at the end of the announced maintenance window, everything is complete.

Announcements should be short. Simply list which systems or services are functioning again, and provide a URL that people can refer to for more information and a phone number to call if a failed return to service might prevent people from being able to send email. One or two sentences should be fine. The easiest way to keep the message short is to forward the original email that said that services were going down, and add a sentence to the top, saying that services are re-enabled and how to report problems. This gives people the context for what is being announced in a very efficient way.

Big Red Signs

Customers tend to ignore messages from SAs. Josh Simon reports that at one client site, he tried leaving notes (black text on bright red paper taped to the monitors) saying “DO NOT LOG IN - CONTACT YOUR SYSTEM ADMINISTRATOR AT [phone number] FIRST!” in huge type. More than 75 percent of the customers ripped the paper off and proceeded to log in rather than call the phone number. The lesson to be learned here is that it is often better to actually disable a service than to ask customers not to use it.

18.2 The Icing

Once you have mastered the basics of upgrading a server, what can you do to expand on the process?

18.2.1 Add and Remove Services at the Same Time

During an upgrade, you must sometimes add or remove services simultaneously. This complicates matters because more than one change is being made at a time. Debugging a system with two changes is much more difficult because it affects the tests that are being executed. Adding services has all the same problems as bringing up a new service on a new host, but you are now in a new and possibly unstable environment and cannot prepare by creating appropriate tests. However, if the new service is also available on a different host, tests can be developed and run against that host.

Removing a service can be both easy and difficult at the same time. It can be easy for the same reason that it is easier to tear down a building than to build one. However, you must make sure that all the residents are out of the building first. Sometimes, we set up a network sniffer to watch for packets indicating that someone is still trying to use that service on the host. That information can be useful to find stragglers.

We prefer to disable a service in a way that makes it easy to reenable quickly if forgotten dependencies are discovered later. For example, the service can be halted without removing the software. It is usually safe to assume that if no forgotten dependencies are discovered in the next month or year, it is safe to remove the software. Some services may be used only once a quarter or once a year, especially certain financial reports. Don’t forget to come back to clean up! Create a ticket in your helpdesk system, send yourself email, or create an at job that emails you a reminder sometime in the future.

If multiple SA groups or privileged customers have access to the box, it can be a good idea to add a comment to the configuration file, or rename it to include ‘OFF’ or ‘DISABLED’. Otherwise, another SA might assume the service is supposed to be up and turn it back on.

18.2.2 Fresh Installs

Sometimes, it is much better to do a fresh install than an upgrade. Doing upgrade after upgrade can lead to a system with a lot of damage. It can result in files left over from old patches, fragmented file systems, and a lot of “history” from years of entropy.

Earlier, we mentioned the luxury of cloning the appropriate disks and doing the upgrade on the clone. An even more luxurious method is to perform the upgrade as a fresh install on a different system because it doesn’t require an outage of the old system. You can do the fresh install on a temporary machine at a leisurely pace, make sure that all services are working, and then move the disks into the upgrade machine and adjust network configuration settings as appropriate. Note that the machine on which the rebuild takes place must be almost identical to the machine that is to be upgraded, to ensure that the new OS disks have all the appropriate hardware support and configurations.

18.2.3 Reuse of Tests

If the tests are properly scripted, they can be integrated into a real-time monitoring system.
In fact, if your monitoring system is already doing all the right tests, you shouldn’t need anything else during your upgrade. (See Chapter 22 for more discussion about service monitoring.)

It is rare that all tests can be automated and added to the monitoring system. For example, load testing (determining how the system performs under simulated amounts of work) often cannot be done on a live system. However, being able to run these tests during otherwise low-usage hours or on demand when debugging a problem can make it easy to track down problems.

18.2.4 Logging System Changes

Building the service checklist is much easier if you’ve kept a log of what’s been added to the machine as it was added. For example, on a UNIX system, simply keep a record of changes in a file called /var/adm/CHANGES. The easier it is to edit the file, the more likely people are to update it, so consider creating a shell alias or short script that simply brings up that file in a text editor.

Of course, if the machine goes down, the change log may be inaccessible. Keeping the change log on a wiki or shared file server solves that problem, but may lead to confusion if someone tries to start a new change log on the host. Set a policy on where the change log will be kept and follow it.

18.2.5 A Dress Rehearsal

Take a lesson from the theater world: Practice makes perfect. Why not perform a dress rehearsal on a different machine before you perform the upgrade? Doing so might reveal unexpected roadblocks, as well as give you an indication of how long the process will take.

A dress rehearsal requires a lot of resources. However, if you are about to perform the first upgrade of many, this can be a valuable tool to estimate the time the upgrades will require. An absolutely complete dress rehearsal results in a new machine that can simply replace the old machine. If you have those resources, why not just do that?
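The timings from a dress rehearsal can feed the window arithmetic described in Sections 18.1.4 and 18.1.5. The following is a minimal sketch, assuming hypothetical durations and function names (none of these figures come from the book); the padding factor of 2 reflects the earlier advice to double or triple early estimates.

```python
from datetime import datetime, timedelta

def window_length(upgrade, testing, fixing, backout, backout_test, padding=2):
    """Sum the five components listed in Section 18.1.5, then pad the total."""
    return (upgrade + testing + fixing + backout + backout_test) * padding

def backout_deadline(window_end, backout, backout_test):
    """Latest time to start backing out: end time minus back-out and its test."""
    return window_end - backout - backout_test

# Hypothetical timings measured during a dress rehearsal.
upgrade = timedelta(minutes=40)
testing = timedelta(minutes=15)
fixing = timedelta(minutes=20)
backout = timedelta(minutes=25)
backout_test = timedelta(minutes=10)

length = window_length(upgrade, testing, fixing, backout, backout_test)
start = datetime(2001, 6, 18, 18, 0)  # an illustrative start time
end = start + length
deadline = backout_deadline(end, backout, backout_test)

print("window length:", length)
print("window ends:", end)
print("declare success or begin back-out by:", deadline)
```

Writing the deadline down as a computed value, rather than deciding it during the upgrade, gives the clock watcher an unambiguous time to enforce.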
The theater also has what’s referred to as the tech rehearsal, a rehearsal for the lighting and sound people more than for the actors The actors run through their lines with the right blocking as the lighting and sound directions are put through their paces The SA equivalent is to have all the involved parties walk through the tasks We also borrow from theater the fine art of pantomime Sometimes, a major system change involves a lot of physical cables to be changed Why not walk though all the steps, looking for such problem areas as cable lengths, crossover/straight-through mismatches, male/female connector mismatches, incorrect connectors, and conflicting plans? Pantomime the change exactly how it will be done It can be helpful to have someone else with you and explain the tasks as you act them out Verify to the other person that each connector is correct, and so on It may seem silly and embarrassing at first, but the problems you prevent will be worth it 18.2.6 Installation of Old and New Versions on the Same Machine Sometimes, one is simply upgrading a single service on a machine, not the entire OS In that situation, it is helpful if the vendor permits the old versions of the software to remain on the machine in a dormant state while the new software is installed and certified The web server Apache on UNIX is one such product We usually install it in /opt/apache-x.y.z, where x.y.z is the version number, but place a symbolic link from /opt/apache to the release we want to be using All configurations and scripts refer to /opt/apache exclusively When the new version is loaded, the /opt/apache link is changed to point to the new version If we find problems with the new release, we revert the symbolic link and restart the daemon It is a very simple back-out plan (Using symbolic links in a software depot is discussed in Section 28.1.6) In some situations, the old and new software can run simultaneously If a lot of debugging is required, we can run the new version of 
Apache on a different port while retaining the old version.

18.2.7 Minimal Changes from the Base

Upgrades become easier when there is little work to do. With a little planning, all add-on packages for UNIX can be loaded in a separate partition, thus leaving the system partitions as generic as possible. Such additions to the system can be documented in a CHANGELOG file. Most changes will be in /etc, which is small enough to be copied before any upgrades begin and used as a reference. That is preferable to the laborious process of restoring files from tape. In a dataless UNIX environment (all machines have a local OS but otherwise get all data from a server), usually only /var needs to be preserved between upgrades, and then only the crontabs and at jobs, the mail spool, and, for systems such as Solaris, the calendar manager files. A version control system, such as RCS, is good for tracking changes to configuration files.

Case Study: Upgrading a Critical DNS Server

This case study combines many of the techniques discussed in this chapter. During the rush to fix Y2K bugs before January 1, 2000, Tom found a critical DNS server that was running on non-Y2K-compliant hardware that the vendor had announced would not be fixed. Also, the OS was not Y2K compliant. This was an excellent opportunity to perform a fresh load of the OS on entirely new hardware.

Tom developed a service checklist. Although he thought that the host provided only two services, by using netstat -a and listing all the running processes, he found many other services running on the machine. He discovered that some of those extra services were no longer in use and found one service that nobody could identify!
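The kind of cross-check Tom did by hand, comparing what the host is believed to provide against what is actually listening, can be approximated with a short script. The following is a minimal sketch in Python under stated assumptions: the function names are hypothetical, the parser assumes netstat prints the word LISTEN on each listening TCP socket with the local address ending in `:port` or `.port`, and it ignores UDP services (such as DNS over UDP), which have no LISTEN state.

```python
import re
import subprocess

# Matches a trailing ":22" (Linux style) or ".22" (BSD/Solaris style).
PORT_AT_END = re.compile(r"[.:](\d+)$")


def listening_ports(netstat_output):
    """Extract the set of listening TCP port numbers from netstat text."""
    ports = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        if "LISTEN" not in fields:
            continue
        # The local address is the first field that ends in a port number.
        for field in fields:
            match = PORT_AT_END.search(field)
            if match:
                ports.add(int(match.group(1)))
                break
    return ports


def cross_check(documented, found):
    """Compare the documented port set against what is really listening."""
    return {
        "undocumented": sorted(found - documented),
        "not_listening": sorted(documented - found),
    }


def read_netstat():
    """Run netstat on this host; -n (numeric output) is added for parsing."""
    result = subprocess.run(
        ["netstat", "-an"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

A typical use would be `cross_check({25, 53}, listening_ports(read_netstat()))`, where the first set is whatever the service checklist currently claims. Anything reported as undocumented is a candidate for the checklist or for removal; the process list is still needed to identify what owns each port.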
People knew that most of the software packages involved would work on the new OS because they were in use on other machines with the newer OS. However, many of the services were homegrown, and there was panic when it was thought that the author of a homegrown package was no longer at the company and the source code couldn’t be found immediately. Luckily, the code was found.

Tom built the new machine and replicated all the services onto it. The original host had many configuration files that were edited on a regular basis. He needed to copy these data files to the new system to verify that the scripts that processed them worked properly on the new machine. However, because the upgrade was going to take a couple of weeks, those files would be modified many times before the new host would be ready. The tests would be done on aging data. When the new system was cut in, Tom stopped all changes on the old host, recopied the files to the new system, and verified that the new system accepted the new files.

The tests that were developed were not run only once before the cutover but were run over and over as various services on the new system became usable. However, Tom did leave most services disabled when they weren’t being tested because of concern that the old and new machines might conflict with each other.

The cutover worked as follows: The old machine was disconnected from the network but left running. The new machine’s IP address was changed to that of the old one. After five minutes, the ARP caches on the local network timed out, and the new host was recognized. If problems appeared, he could unplug the new machine from the network and reconnect the network cable of the legacy machine. The legacy machine was left running so that not even a reboot would be required to bring it back into service: Just halt the new server and plug in the old server’s network cable.

The actual maintenance window could have been quite short: a minimum of five minutes if
everything went right and the machine could be reconnected instantly. However, a 30-minute window was announced.

Tom decided to have two people looking over his shoulder during the upgrade because he wasn’t as familiar with this version of UNIX as he is with others and didn’t get much sleep the night before. It turned out that having an extra pair of hands helped with unplugging and plugging wires.

The group pantomimed the upgrade hours before the maintenance window. Without changing anything, they walked through exactly what was planned. They made sure that every cable would be long enough and that all the connectors were the right type. This process cleared up any confusion that anyone on the team might have had.

The upgrade went well. Some tests failed, but the group was soon able to fix the problems. One unexpected problem resulted in certain database updates not happening until a script could be fixed. The customers who depended on that data being updated were willing to live with slightly stale data until the script could be rewritten the next day.

18.3 Conclusion

We have described a fairly complete process for upgrading the OS of a computer, yet we have not mentioned a particular vendor’s OS, particular commands to type, or buttons to click. The important parts of the process are not the technology, which is a matter of reading manuals, but rather communication, attention to detail, and testing.

The basic tool we used is a checklist. We began by developing the checklist, which we then used to determine which services required upgrading, how long the upgrade would take, and when we could do it. The checklist drives what tests we develop, and those tests are used over and over again. We use the tests before and after the upgrade to ensure quality. If the upgrade fails, we activate the back-out plans included in the checklist. When the process is complete, we announce this to the list of concerned customers on the checklist.

A checklist is a simple tool. It is a single place where all
the information is maintained. Whether you use paper, a spreadsheet, or a web page, the checklist is the focal point. It keeps the team on the same page, figuratively speaking, keeps the individuals focused, lets the customers understand the process, helps management understand the status, and brings new team members up to speed quickly.

Like many SA processes, this requires communication skills. Negotiation is a communication process, and we use it to determine when the upgrade will happen, what needs to happen, and what the priorities are if things go wrong. We give the customers a feeling of closure by communicating to them when we are finished. This helps the customer/SA relationship.

We cannot stress enough the importance of putting the checklist on a web page. The more eyes that can review the information, the better.

When the tests are automated, we can repeat them with accuracy and ensure completeness. These tests should be general enough that they can be reused not only for future upgrades on the same host but also on other similar hosts. In fact, the tests should be integrated into your real-time monitoring system. Why perform these tests only after upgrades?
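A reusable, automated test can start out very small. The following is a minimal sketch in Python, assuming that "the service works" can be approximated by "the TCP port accepts a connection"; real tests would go further and exercise the service itself. The function names and the checklist entries in the usage note are hypothetical.

```python
import socket


def tcp_service_answers(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


def run_checklist(checks):
    """Run every check and return a dict of service name -> pass/fail.

    `checks` maps a service name to a (host, port) pair, mirroring the
    services listed on the upgrade checklist.
    """
    return {
        name: tcp_service_answers(host, port)
        for name, (host, port) in checks.items()
    }
```

For example, `run_checklist({"dns": ("ns1.example.com", 53), "web": ("www.example.com", 80)})` could be run before the upgrade, after it, and on a schedule from the monitoring system. Because the host is a parameter, the same checks apply to other similar hosts.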
This simple process can be easily understood and practiced. This is one of the basic processes that an SA must master before moving on to more complicated upgrades. The real-world examples we used all required some kind of deviation from the basic process yet still encompassed the essential points.

Some OS distributions make upgrading almost risk-free and painless, and some are much more risky. Although there are no guarantees, it is much better when an operating system has a way to do upgrades reliably, repeatably, and with the ability to easily revert. Minimizing the number of commands or mouse clicks reduces the possibility of human error. Being able to upgrade many machines in a repeatable way has many benefits; especially important is that it helps maintain consistent systems. Any ability to revert to a previous state gives a level of undo that is like an insurance policy: You hope you never need it but are glad it exists when you do.

Exercises

Select a server in your environment and figure out what services it provides. If you maintain a documented list of services, what system commands would you use to cross-check the list? If you do not have the services documented, what are all the resources you might use to build a complete list?

In your environment, how do you know who depends on which services?

Select a location that should be easy to walk to from your machine room or office, such as a nearby store, bank, or someplace at the other end of your building, if it is very large. Have three or four fellow students, coworkers, or friends estimate how long it will take to walk there and back. Now, all of you should walk there and back as a group, recording how long it takes. (Do this right now, before you read the rest of the question. Really!) How long did it take? Did you start walking right away, or were you delayed?
How many unexpected events along the way (running into customers, people who wanted to know what you were doing, and so on) extended your trip’s time? Calculate how close each of you was to being accurate, the average of these, and the standard deviation. What did you learn from this exercise? If you repeat it, how much better do you think your estimate will be if you select the same location? A different location? Would bringing more people have affected the time? Relate what you learned to the process of planning a maintenance window.

In Section 18.1.3, the claim is made that the tests that are developed will be executed at least three times; more if there are problems. What are the three minimum times? What are some additional times the tests may be run?

Section 18.2.7 includes a case study in which the source code to a homegrown service almost couldn’t be found. What would you do in that situation if the source code couldn’t be found?

How do you announce planned outages and maintenance windows in your environment? What are the benefits and problems with this method? What percentage of your customers ignore these announcements?

Customers often ignore announcements from SAs. What can be done to improve this situation?

Select a host in your environment and upgrade it. (Ask permission first!)

What steps would you take if you had to replace the only restroom in your building?