25.1 The Basics

…are not readily visible from simple metrics. These can become evident when customers provide clues.

Can't Get There from Here

A midsize chip-development firm working closely with a new partner ordered high-end vendor equipment for a new cluster that would be compatible with the partner's requirements. Performance, affordability, reliability, and similar factors were analyzed and hotly debated. After the new hardware showed up on the dock, however, it was discovered that one small detail had been overlooked. The site ordering the new hardware was working with different data sets than the partner site, and the storage solution ordered was not scalable using the same hardware. Over half of the more expensive components (chassis, controller) would have to be replaced in order to make a larger cluster. Instead of serving the company's storage needs for its entire engineering group for a year, it would work for one department for a matter of months.

It is also possible that upcoming events outside the current norm may affect storage needs. For example, the department you support may be planning to host a visiting scholar next term, someone who might be bringing a large quantity of research data. Or the engineering group could be working on adding another product to its release schedule or additional use cases to its automated testing: each of these, of course, requiring a significant increase in storage allocation. Often, the systems staff are the last to know about these things, as your customers may not be thinking of their plans in terms of the IS requirements needed to implement them. Thus, it is very useful to maintain good communication and explicitly ask about customers' plans.

Work Together to Balance System Stress

At one of Strata's sites, the build engineers were becoming frustrated trying to track down issues in automated late-night builds. Some builds would fail mysteriously on missing files, but the problem was not reproducible by hand. When the engineers brought their
problem to the systems staff, the SAs were able to check the server logs and load graphs for the affected hosts. It turned out that a change in the build schedule, combined with new tests implemented in the build, had caused the build and the backups to overlap in time. Even though they were running on different servers, the simultaneous load from the build and the nightly backups was causing the load on one file server to skyrocket to several times normal, resulting in some remote file requests timing out. The missing files would cause sections of the build to fail, thus affecting the entire build at the end of its run when it tried to merge everything. Since this build generally required 12–18 hours to run, the failures were seriously affecting engineering's schedule. Since backups are also critical, they couldn't be shut off during engineering's crunch time. A compromise was negotiated, involving changing the times at which both the builds and the backups were done, to minimize the chances of overlap. This solved the immediate problem. A storage reorganization, to solve the underlying problem, was begun so that the next production builds would not encounter similar problems.

25.1.2.3 Map Groups onto Storage Infrastructure

Having gleaned the necessary information about your customers' current and projected storage needs, the next step is to map groups and subgroups onto the storage infrastructure. At this point, you may have to decide whether to group customers with similar needs by their application usage or by their reporting structure and work group. If at all possible, arrange customers by department or group rather than by usage. Most storage-resource difficulties are political and/or financial. Restricting customers of a particular server or storage volume to one work group provides a natural path of escalation, entirely within that work group, for any disagreements about resource usage. Use group-write permissions to enforce the prohibition
against nongroup members using that storage.

Some customers scattered across multiple departments or work groups may have similar but unusual requirements. In that case, a shared storage solution matching those requirements may be necessary. That storage server should be partitioned to isolate each work group on its own volume. This removes at least one element of possible resource contention. The need for the systems staff to become involved in mediating storage-space contention is also removed, as each group can self-manage its allocated volume. If your environment supports quotas and your customers are not resistant to using them, individual quotas within a group can be set up on that group's storage areas. When trying to retrofit this type of storage arrangement onto an existing set of storage systems, it may be helpful to temporarily impose group quotas while rearranging storage allocations.

Many people will resist the use of quotas, and with good reason. Quotas can hamper productivity at critical times. An engineer who is trying to build or test part of a new product but runs into the quota limit either has to spend time trying to find or free up enough space or has to get in touch with an SA and argue for a quota increase. If the engineer is near a deadline, this time loss could result in the whole product schedule slipping. If your customers are resistant to quotas, listen to their rationale, and see whether there is common ground that you can both feel comfortable with, such as emergency increase requests with a guaranteed turnaround time. Although you need to understand each individual's needs, you also need to look at the big picture. Implementing quotas on a server in a way that prevents another person from doing her job is not a good idea.

25.1.2.4 Develop an Inventory and Spares Policy

Most sites have some kind of inventory of common parts. We discuss spares in general in Section 4.1.4, but storage deserves a bit of extra attention. There used to
be large differences between the types of drives used in storage systems and the ones in desktop systems. This meant that it was much easier for SAs to dedicate a particular pool of spare drives to infrastructure use. Now that many storage arrays and workgroup servers are built from off-the-shelf parts, the drives on the shelf might be spares that could be used in either a desktop workstation or a workgroup storage array.

A common spares pool is usually considered a good thing. However, it may seem arbitrary to a customer who is denied a new disk but sees one sitting on a shelf unused, reserved for the next server failure. How can SAs make sure to reserve enough drives as spares for vital shared storage while not hoarding drives that are also needed for new desktop systems or individual customer needs? It is something of a balancing act, and an important component is a policy that addresses how spares will be distributed. Few SAs are able to stock as many spares as they would like to have around, so having a system for allocating them is crucial.

It's best to separate general storage spares from infrastructure storage spares. You can make projections for either type, based on failures observed in the past on similar equipment. If you are tracking shared storage usage—and you should be, to avoid surprises—you can make some estimates of how often drives fail, so that you have adequate spares. For storage growth, include not only the number of drives required to extend your existing storage but also whatever server upgrades, such as CPU and memory, might also be needed. If you have planned to expand by acquiring whole new systems, such as stand-alone network storage arrays, be sure to include spares for those systems through the end of the fiscal year in which they will be acquired.

25.1.2.5 Plan for Future Storage

The particularly tricky aspect of storage spares is that a customer asking for a drive almost every time needs something more than simply a drive. A customer whose
system disk has failed really needs a new drive along with a standardized OS install. A customer who is running out of shared disk space and wants to install a private drive really needs more shared space, or a drive plus backup services. And so on. We don't encourage SAs to have a prove-that-you-need-this mentality; SAs strive to be enablers, not gatekeepers. That said, you should be aware that every time a drive goes out the door of your storage closet, it is likely that something more is required. Another way to think of it is that either a problem you know how to solve is happening now, or a problem that you might have to diagnose later is being created. Which one would you rather deal with?

Fortunately, as we show in many places in this book, it's possible to structure the environment so that such problems are more easily solved by default. If your site chooses to back up individual desktops, some backup software lets you configure it to automatically detect a new local partition and begin backing it up unless specifically prevented. Make network boot disks available for customers, along with instructions on how to use them to load your site's default supported installation onto the new drive. This approach lets customers replace their own drives and still get a standardized OS image. Have a planned quarterly maintenance window to give you the opportunity to upgrade shared storage to meet projected demands before customers start being impacted by lack of space. Thinking about storage services can be a good way to become aware of the features of your environment and the places where you can improve service for your customers.

25.1.2.6 Establish Storage Standards

Standards help you to say no when someone shows up with random equipment and says, "Please install this for me." If you set storage standards, people are less likely to be able to push a purchase order for nonstandard gear through accounting and then expect you to support
whatever they got. The wide range of maturity among various storage solutions means that finding one that works for you is a much better strategy than trying to support anything and everything out there. Having a standard in place helps to keep one-off equipment out of your shop. A standard can be as simple as a note from a manager saying, "We buy only IBM," or as complex as a lengthy document detailing requirements that a vendor and that vendor's solution must meet to be considered for purchase. The goal of standards is to ensure consistency by specifying a process, a set of characteristics, or both. Standardization has many benefits, ranging from keeping a common spares pool to minimizing the number of different systems that an SA must cope with during systems integration. As you progress to having a storage plan that accounts for both current and future storage needs, it is important to address standardization.

Some organizations can be very difficult places to implement standards control, but it is always worth the attempt. Since the life cycle of many systems is relatively short, a heterogeneous shop full of differing systems can become a unified environment in a relatively short period of time by setting a standard and bringing in only equipment that is consistent with the standard.

If your organization already has a standards process in place for some kinds of requests or purchases, start by learning that system and how to add standards to it. There may be sets of procedures that must be followed, such as meetings with potential stakeholders, creation of written specifications, and so on. If your organization does not have a standards process, you may be able to get the ball rolling for your department. Often, you will find allies in the purchasing or finance departments, as standards tend to make their jobs easier. Having a standard in place gives them something to refer to when unfamiliar items show up on purchase orders. It also gives them a way to redirect
people who start to argue with them about purchasing equipment, namely, to refer those people to either the standard itself or to the people who created it. Start by discussing, in the general case, the need for standards and a unified spares pool with your manager and/or the folks in finance. Request that they route all purchase orders for new types of equipment through the IT department before placing orders with the vendor. Be proactive in working with department stakeholders to establish hardware standards for storage and file servers. Make yourself available to recommend systems and to work with your customers to identify potential candidates for standards. This strategy can prevent the frustration of dealing with a one-off storage array that won't interoperate with your storage network switch, or some new interface card that turns out to be unsupported under the version of Linux that your developers are using. The worst way to deal with attempts to bring in unsupported systems is to ignore customers and become a bottleneck for requests. Your customers will become frustrated and feel the need to route around you to address their storage needs directly.

Upgrading to a larger server often results in old disks or storage subsystems that are no longer used. If they are old enough to be discarded, we highly recommend fully erasing them first. Often, we hear stories of used disks purchased on eBay and then found to be full of credit card numbers or proprietary company information. Financial decision makers usually prefer to see the equipment reused internally. Here are some suggested uses:

• Use the equipment as spares for the new storage array or for building new servers.
• Configure the old disks as local scratch disks for write-intensive applications, such as software compilation.
• Increase the reliability of key servers by installing a duplicate OS to reboot from if the system drive fails.
• Convert some portion to swap space, if your OS uses swap space.
• Create a build-it-yourself RAID for nonessential applications or temporary data storage.
• Create a global temp space, accessible to everyone, called /home/not_backed_up. People will find many productivity-enhancing uses for such a service. The name is important: people need a constant reminder if they are using disk space that has no reliability guarantee.

25.1.3 Storage as a Service

Rather than considering storage an object, think of it as one of the many services you provide. Then you can apply all the standard service basics. To consider something a service, it needs to have an SLA and to be monitored to see that its availability adheres to that SLA.

25.1.3.1 A Storage SLA

What should go into a storage SLA? An engineering group might need certain amounts of storage to ensure that automated release builds have enough space to run daily. A finance division might have minimal day-to-day storage needs but require a certain amount of storage quarterly for generating reports. A QA group or a group administering timed exams to students might express its needs in response time as well as raw disk space. SLAs are typically expressed in terms of availability and response time. Availability for storage can be thought of as both reachability and usable space. Response time is usually measured as latency—the time it takes to complete a response—at a given load. An SLA should also specify MTTR expectations. Use standard benchmarking tools to measure these metrics; this has the advantage of repeatability as you change platforms. The system should still be tested in your own environment with your own applications to make sure that the system will behave as advertised, but at least you can insist on a particular minimum benchmark result before considering the system for an in-house evaluation, which will involve more work and commitment on the part of you and the vendor.

25.1.3.2 Reliability

Everything fails eventually. You can't prevent a hard drive from failing. You can give it perfect,
vendor-recommended cooling and power, and it will still fail eventually. You can't stop an HBA from failing. Now and then, a bit being transmitted down a cable gets hit by a gamma ray and is flipped. If you have eight hard drives, the likelihood that one of them will fail tomorrow is eight times greater than if you had only one. The more hardware you have, the more likely a failure.

This sounds depressing, but there is good news: there are techniques to manage failures and achieve any reliability level required. The key is to decouple a component failure from an outage. If you have one hard drive, its failure results in an outage: a 1:1 ratio of failures to outages. However, if you have eight hard drives in a RAID configuration, a single failure does not result in an outage. Two failures, the second happening before a hot spare can be activated, are required to cause an outage. We have successfully decoupled component failure from service outages. (A similar strategy can be applied to networks, computing, and other aspects of system administration.)
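The arithmetic behind this decoupling can be sketched in a few lines of probability. The per-drive failure probability below is an illustrative assumption, not a vendor figure:

```python
# Probability that at least one of n drives fails in a given period,
# assuming independent failures with per-drive failure probability p.
def p_any_failure(n: int, p: float) -> float:
    return 1.0 - (1.0 - p) ** n

p = 0.001  # assumed per-drive daily failure probability (illustrative)

# One drive, no redundancy: every failure is an outage.
single = p_any_failure(1, p)

# Eight independent drives, no redundancy: roughly eight times as
# likely that *something* fails -- more hardware, more failures.
eight = p_any_failure(8, p)

# A mirrored pair decouples failure from outage: an outage requires
# both drives to fail before the hot spare finishes rebuilding,
# which is roughly p squared.
mirrored_outage = p ** 2

print(f"single drive : {single:.6f}")
print(f"eight drives : {eight:.6f}")
print(f"mirrored pair: {mirrored_outage:.9f}")
```

The point of the sketch is the last line: redundancy turns a probability on the order of p into one on the order of p², which is why more hardware can mean fewer outages even though it means more failures.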
The configuration of a storage service can increase its reliability. In particular, certain RAID levels increase reliability, and NAS servers can also be configured to increase overall reliability. The benefit of centralized storage (NAS or SAN) is that the extra cost of reliability is amortized over all users of the service.

• RAID and reliability: All RAID levels except RAID 0 increase reliability. The data on a redundant RAID set continues to be available even when a disk fails. In combination with an available hot spare, a redundant RAID configuration can greatly improve reliability. It is important to monitor the RAID system for disk failures, however, and to keep in stock some replacement disks that can be quickly swapped in to replace a failed disk. Every experienced SA can tell a horror story of a RAID system that was unmonitored and had a failed disk go unreplaced for days. Finally, a second disk dies, and all data on the system is lost. Many RAID systems can be configured to shut down after 24 hours of running in degraded mode; it can be safer to have a system halt cleanly than to run unmonitored for days.

• NAS and reliability: NAS servers generally support some form of RAID to protect data, but NAS reliability also depends on network reliability. Most NAS systems have multiple network interfaces; for even better reliability, connect each interface to a different network switch.

• Choose how much reliability to afford: When asked, most customers ask for 100 percent reliability. Realistically, however, few managers want to spend what it takes to get the kind of reliability that their employees say they would like. Additional reliability is exponentially more expensive: a little extra reliability costs a bit, and perfect reliability costs more than most people can imagine. The result is sticker shock when researching various storage uptime requirements. Providers of large-scale reliability solutions stress the uptime and ease of recovery when using their
systems and encourage you to calculate the cost of every minute of downtime that their systems could potentially prevent. Although their points are generally correct, these savings must be weighed against the level of duplicated resources and their attendant cost. That single important disk or partition will have a solution requiring multiple sets of disks. In an industry application involving live service databases, such as financial, health, or e-commerce, one typically finds at least two mirrors: one local to the data center and another at a remote data center. Continuous data protection (CDP), discussed later, is the most expensive way to protect data and is therefore used only in extreme situations. High-availability data service is expensive. It is the SA's job to make management aware of the costs associated with storage uptime requirements, work them into return on investment (ROI) calculations, and leave the business decision to management. Requirements may be altered or refocused in order to get the best possible trade-off between expense and reliability.

25.1.3.3 Backups

One of the most fundamental components of a storage service is the backup strategy. Chapter 26 is dedicated to backups; here, we simply point out some important issues related to RAID, NAS, and SAN systems.

• RAID is not a backup strategy: Although RAID can be used to improve reliability, it is important to realize that RAID is not a substitute for a backup strategy. For most RAID configurations, if two disks fail, all the data is lost. Fires, earthquakes, floods, and other disasters will result in all data being lost. A brownout can damage multiple disks or even the RAID controller. Buggy vendor implementations and hardware problems could also result in complete data loss. Your customers can, and will, delete critical files; when they do, their mistake will be copied to the mirror or parity disk. Some RAID systems include the ability to take file snapshots, that is, the ability to view the
filesystem as it was days ago. This is also not a backup solution. It is simply an improvement to the customer-support process whereby customers request individual file restores when they accidentally delete a file. If those snapshots are stored on the same RAID system as the rest of the data, a fire or double-disk failure will wipe out all data. Backups to some other medium, be it tape or even another disk, are still required when you have a RAID system, even if it provides snapshot capabilities. A snapshot will not help recover a RAID set after a fire in your data center. It is a very common mistake to believe that acquiring a RAID system means that you no longer have to follow basic principles for data protection. Don't let it happen to you!

Whither Backups?

Once, Strata sourced a RAID system for a client without explicitly checking how backups would be done. She was shocked and dismayed to find that the vendor claimed that backups were unnecessary! The vendor did plan—eventually—to support a tape device for the system, but that would not be for at least a year. Adding a high-speed interface card to the box—to keep backups off the main computing network—was an acceptable workaround for the client. When purchasing a storage system, ask about backup and restore options.

• RAID mirrors as backups: Rather than using a mirror to protect data all the time, some systems break, or disconnect, the mirrored disks so that they have a static, unchanging copy of the data on which to perform backups. This is done in coordination with database systems and the OS to make sure that the data mirror is in a consistent state from an application point of view. Once the backups are complete, the mirror set is reattached and rebuilt to provide protection until the next backup process begins. The benefit is that backups do not slow down normal data use, since they affect only disks that are otherwise unused. The downside is that the data is not protected during the backup
operation, and the production system runs much slower while the mirror is being rebuilt. Many SAs use such mirroring capabilities to make an occasional backup of an important disk, such as a server boot disk, in case of drive failure, OS corruption, security compromise, or other issues. Since any error or compromise would be faithfully mirrored onto the other disk, the system is not run in true RAID mirror mode: the mirror is established and then broken so that updates will not occur to it. After configuration changes, such as OS patches, are made and tested, the mirror can be refreshed and then broken again to preserve the new copy. This is better than restoring from a tape, because it is faster. It is also more accurate, since some tape backup systems are unable to properly restore boot blocks and other metadata.

• RAID mirrors to speed backups: A RAID set with two mirrors can be used to make backups faster. Initially, the system has identical data on three sets of disks, known as a triple-mirror configuration. When it is time to do backups, one mirror set is broken off, again in coordination with database systems and the OS to make sure that the data mirror is in a consistent state. Now the backup can be done on the mirror that has been separated. Done this way, backups will not slow down the system. When the backup is complete, the mirror is reattached, the rebuild happens, and the system is soon back to its normal state. The rebuild does not affect performance of the production system as much, because the read requests can be distributed between the two primary mirrors.

• NAS and backups: In a NAS configuration, it is typical that no unique data is stored on client machines; if data is stored there, it is well advertised that it is not backed up. This introduces simplicity and clarity into that site, especially in the area of backups. It is clear where all the shared customer data is located, and as such, the backup process is simpler. In addition, by placing shared customer data
onto NAS servers, the load for backing up this data is shared primarily by the NAS server itself and the server responsible for backups, and is thus isolated from application servers and departmental servers. In this configuration, clients become interchangeable: if someone's desktop PC dies, the person should be able to use any other PC instead.

• SANs and backups: As mentioned previously, SANs make backups easier in two ways. First, a tape drive can be a SAN-attached device. Thus, all

28.1 The Basics

ACL-restricted directory that only the SAs can access. The directory structure inside this area may include standard, preinstalled, disk images, and experimental directories. One way of handling licensing issues is to require that software without site licenses be installed by an SA or someone trusted to follow all the license procedures. For example, the person might verify that the software license was acquired for this particular host before it is installed. If software is prepurchased in bulk, as described in Section 21.2.1, the SA might simply record this installation as consuming one of the prepurchased licenses.

Each directory may contain subdirectories for various flavors of Windows where version-specific packages are stored. There may be a directory named Obsolete where old packages that are no longer supported are moved. Although such old packages shouldn't be needed any more, emergencies occasionally arise in which having the old packages around is beneficial. It's a balance of risk management versus disk space.

It is useful to create a directory for each package rather than having a single directory full of packages. Package names are not always understandable to the casual observer, whereas directory names can be very specific: FooSoft Accounting Client 4.0 is clearer than FSAC40.ZIP. Most packages include a separate README file, which should be in that directory. If packages are all stored in one directory, there is a good chance that the names of the READMEs will
conflict. For every directory, it is useful to create a document with the same name—with the addition of .txt, of course—that includes notes about the software, what the license restrictions are, and whether there are any special installation tricks or tips. You should adopt a standard format for the first part of the file, with free-form text following.

If the software depot is extremely successful, you might choose to replicate it in various parts of the company. This can save network bandwidth, improve installation speed, and offer more reliability. The point at which such replication becomes useful is much different from that with the UNIX depot. Windows software depots are relatively light on networks compared with the network activity required for a UNIX depot that accesses the network for every execution of a program. A Windows software depot is also less real-time than UNIX depots, because installation happens once, and the server is not accessed again. An occasional slow install is painful but not a showstopper. However, slow NFS access to a UNIX depot can destroy productivity. For these reasons, you may choose not to replicate a Windows software depot but instead simply locate it somewhere with excellent network connectivity.

In this section, we have described a simple yet powerful software depot for a Windows environment. It takes into account the unique culture of Windows systems with respect to software installation. It requires no software, unless replication is required, at which time one of many fine directory-replication systems can be used. It leverages the access controls of Windows' CIFS file access protocol to restrict access as needed. It is self-documenting, because the directory hierarchy describes what software is available, and local installation notes and policy documents can be colocated with the packages for easy access.

28.2 The Icing

Although the purpose of a software depot is to provide the same software to all
hosts, the icing is to be able to provide customizations for various hosts. Here, we include suggestions for handling a few frequently requested customizations: slightly different configurations, locally replicated packages, commercially licensed software, and smaller depots for OSs that do not receive full support.

28.2.1 Different Configurations for Different Hosts

A commonly requested feature of UNIX software depots is the ability to have slightly different configurations for certain hosts or clusters of hosts. If a package's configuration must vary wildly from host to host, it may be useful to have the configuration file in the depot simply be a symbolic link to a local file. For example, /sw/megasoft/lib/megasoft.conf might be a symbolic link to /etc/megasoft.conf, which could contain specific contents for that particular host. If you might want to choose from a couple of different standard configurations, they could be included in the packages. For example, /etc/megasoft.conf might itself be a symbolic link to one of many configuration files in /sw/megasoft/lib. You might have standard server and client configurations (megasoft.conf-server and megasoft.conf-client) or configurations for particular customer groups (megasoft.conf-mktg, megasoft.conf-eng, megasoft.conf-dev, and megasoft.conf-default). Because the series of symbolic links bounces from the depot to the local disk and back to the depot, they are often referred to as bounce links.

28.2.2 Local Replication

If your UNIX depot is accessed over the network, it can be good to have commonly used packages replicated on the local disk. For example, a developer's workstation that has disk capacity to spare could store the most recent edition of the development tools locally. Local replication reduces network utilization and improves performance. You should make sure that access to a local disk is faster than access to a network file server, which is not always the case. The problem becomes managing which
machines have which packages stored locally, so that updates can be managed. Depot management software should make all this easy. It should provide statistics to help select which packages should be cached, or at least permit SAs and customers to indicate which packages should be replicated locally. In our UNIX example, if a sophisticated automounter is used, it can specify that for a particular machine, a package can be found on a local disk.

New releases of packages require special handling when local replication is being performed manually. In some systems, a new, uncached release overrides the local copy. If the intention was that customers always see the most recent release of a package, the right thing would happen, though performance would suffer if nobody remembered to copy the new release to this machine's local disk. It's better to have the right thing happen slowly than the wrong thing happen with excellent performance. On the other hand, if someone directly changed the master copy of a package without changing the release number, the SA must remember to also update any copies of the package that may be on other clients. This can become a management nightmare if tracking all these local copies is done manually.

Replicating software depot files locally may have an impact on backup policies. Either the backup system will need to be tuned to avoid backing up the local repositories, which, after all, can be restored via the depot management software, or you have to plan on handling the additional data storage requirements. The benefit of backing up the locally replicated software is that a full restore reflects the true last state of the machine, software and all.

The general solution to this is to use an NFS cache, whereby files accessed via NFS are cached to the local disk. NFS caches, such as Solaris's cachefs, work best on read-only data, such as a software depot. Such a system has a huge potential for performance improvement. Most important, it is adaptive, automatically caching
what is used rather than requiring SAs to manually try to determine what should be cached and when It is a set-it-and-forget-it system 684 Chapter 28 Software Depot Service 28.2.3 Commercial Software in the Depot Including commercial software in the depot is as difficult as the software’s license is complex If there is a site license, the software packages can be included in the depot like anything else If the software automatically contacts a particular license server, which in turn makes a licensed/not licensed decision, the software can be made available to everyone because it will be useless to anyone not authorized If, however, software may be accessed only by particular customers, and the software doesn’t include a mechanism for verifying that the customer is authorized, it is the SA’s responsibility to make sure that the license is enforced In our Windows depot example, we discussed gating licensed software installation by requiring SAs to perform the installation A UNIX environment has additional options If the software is licensed for all users of a particular host, the software can be installed on that host only You should install it in the usual depot nomenclature, if possible (In other words, depot management software shouldn’t panic when it detects locally installed software in its namespace.) 
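When licensed software has no authorization mechanism of its own, one common UNIX approach is to restrict it to members of a designated group. The sketch below illustrates the membership check such a gate relies on; the group name and function names here are hypothetical, invented for this example rather than taken from any depot tool.

```python
import grp
import pwd

# Hypothetical UNIX group holding the users covered by the license.
LICENSED_GROUP = "cadtools"

def is_member(username, group_members, group_gid, user_primary_gid):
    """Pure membership test: the user is listed as a supplementary
    member, or the group is the user's primary group."""
    return username in group_members or user_primary_gid == group_gid

def may_run(username, groupname=LICENSED_GROUP):
    """Consult the system user/group databases and decide whether
    username is authorized to run the licensed package."""
    try:
        g = grp.getgrnam(groupname)
        u = pwd.getpwnam(username)
    except KeyError:
        return False  # unknown group or user: deny by default
    return is_member(username, list(g.gr_mem), g.gr_gid, u.pw_gid)
```

In practice, the same gate is usually achieved without any code at all: chgrp the key executables to the licensed group and chmod them to mode 750 so that only group members can execute them.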
Alternatively, a UNIX group can be created for people who are authorized to use the software, and key files in the package can be made executable only by members of that group. UNIX environments often face a situation in which various small groups all use the same software package, but a different license server must be accessed for each group.[3] This is another problem that can be solved using the bounce-link technique to point different clients to different license files. Having complex requirements for software distribution is a common problem (Hemmerich 2000). It is an area that people are continually trying to improve. Before trying to solve the problem yourself, look at past work in conference proceedings and journals for inspiration. Someone may have solved your exact problem already.

[3] We recommend centralizing license servers, but we include this example because we often find that for political reasons, they are not centralized.

28.2.4 Second-Class Citizens

Software depots also need to handle off-beat OSs that may exist on the network. These hosts, often referred to as second-class citizens, are OSs that do not receive the full support prescribed for first-class citizens (see Section 3.1) but exist on your network and require minimal support. The support that second-class citizens receive might simply be the allocation of an IP address and a few other simple configuration parameters to get the device up and running. The software depot required for second-class-citizen OSs tends to be minimal: applications required for the special purpose of that machine, possibly compilers, and the tools required by the SAs. We explicitly recommend against trying to provide every package available in the full depots. It would be a huge effort for little gain.

It is important to have a written policy on second-class-citizen OSs. Indicate the level of support to be expected and areas in which customers are expected to provide their own support. The policy for a small software depot should specify a minimal set of packages customers should expect to find in the depot.

We recommend a few groups of tools that should be included in a small depot. Install the tools needed for the purpose of the machine. For example, if this machine is for porting software to that OS, install the appropriate compiler tool chain. If the host is for running a particular application or service, install the required software for that. The tools required for SA processes—automated and otherwise—should also be installed. These include inventory collectors, log rotators, backups, software depot updates, debugging tools, and so on. Finally, every company has a short list of convenience tools that can be provided for your own benefit, including the command-line version of the corporate phone-number lookup software, minimal email ("send only") configurations, and so on.

❖ A Tool Chain

A developer's tool chain is the specific software required to build software. In some environments, this might involve software from various vendors or sources. This usually includes build tools, such as make and autoconf; compilers, assemblers, linkers, interpreters, and debuggers; source code control systems, such as Subversion, CVS, Perforce, ClearCase, and SourceSafe; and various homegrown utilities that are involved in the development process. The term chain refers to the fact that one tool often leads to the next, such as the compile/assemble/link sequence.

28.3 Conclusion

In this chapter, we have discussed software depots. Software depots are an organized way of providing software packages to many hosts, though having a good organization is useful even for a single host. A good software depot provides a ubiquitous set of tools that become as much a part of the culture of your customers as the network itself.

Windows systems usually handle software depots differently for historical, technical, and cultural reasons. Windows depots tend to be repositories of software to be installed; UNIX depots, repositories of software used in real time from the depot.

Sites should have a written policy regarding various depot issues: how and by whom packages get installed, which systems receive the services of the depot, how requests and support issues are handled, and so on. We described simple depots for both UNIX and Windows environments, showing that policy and organization are key and that an extremely powerful depot can be created with very little software. There are many open source packages for maintaining depots, and thus we recommend against creating one from scratch. Find one that you like, and modify it to suit your needs. Although the purpose of a software depot is to provide the same software to many hosts, you eventually will receive requests for customizations for particular hosts or groups of hosts. We discussed the most common kinds of requests and some simple solutions.

Exercises

1. Describe the software depot used in your environment, its benefits, and its deficiencies. If you do not have a depot, describe where software is installed and the pros and cons of adopting one.

2. What are the policies for your software depot? If you do not have any policies or do not have a depot, develop a set of policies for a depot at your site. Justify the policy decisions made in your depot policy.

3. In the UNIX example (Section 28.1.6), what happens to perl modules and other add-on packages? How would you handle this situation?

4. Compare your software depot to one of the samples described in this chapter. How is it better or worse?

5. A good software depot management system might also track usage. How might usage statistics be used to better manage the depot?

6. If you are at a small site that does not require a complicated software depot, describe what kind of simple software depot you have. At what point will you have grown to require a more complicated depot? What might that look like? How will you convert to this new system?
7. Develop a set of codes for each of the OSs and the various versions of each OS in use at your location, akin to what was used in Section 28.1.6. Explain and justify your decisions.

Chapter 29 Web Services

Managing web systems has become such a large and important part of system administration that specific techniques and specialties have developed beyond simply loading Apache or IIS and providing content. In theory, we have already covered all the material required to run a web service or web-based application successfully. Simply buy server hardware; install the right services; put it on a fast network, with carefully considered namespaces, in a data center; document everything; create a solid disaster-recovery plan; be mindful of the security issues; debug problems as they happen; adhere to strict change-management controls; upgrade it as required using a good maintenance-window strategy; monitor it for problems and capacity planning; provide plenty of data storage; and make sure that you have a good data backup and restore system. It's all right there in Chapters 4–10, 11, 15, 17, 18, 20, 22, 25, and 26. However, some specific techniques and issues are still underserved by those chapters.

A web site is a way of presenting information and applications to users using a client/server model. Web content is usually accessed by a client program called a browser but can be accessed by any program that speaks HTTP. The web server delivers documents called pages, formatted in HTML, as well as any content specified in the HTML page, such as embedded graphics, audio files, and so on. Sometimes, the content includes programs written in languages, such as JavaScript, that will be run by the client browser.

The web is based on open standards, which means that they are developed by an international committee, not a single corporation. They can be used without paying royalty or licensing fees. Web standards are defined by the World Wide Web Consortium (W3C), and the underlying Internet protocols are defined by the IETF.

The benefit of web-based applications is that one browser can access many web applications. The web browser is a universal client. Web applications and small web servers are also present in firmware in many devices, such as small routers and switches, smart drive arrays, and network components.

❖ The Power of Open Standards

Imagine if, instead of using open standards, one company set the standards and charged royalties to use them on a web site or in a browser. The web would never have been as successful. In fact, just such a thing was attempted shortly before the web became popular: Network Notes was a collaboration of AT&T, Lotus, and Novell. One could use only the Novell/Lotus software and only the AT&T network, and could communicate only with other people using Network Notes. People weren't willing to pay money for the software, because there were no sites to visit; nobody was willing to build a site, because nobody had the software to access it. The vendors weren't willing to give the software away for free, because they had to recoup their development costs. Even if the service had become popular, innovation would have been stifled in other ways. Without the ability for vendors to make their own client software, nobody would be able to experiment with building browsers for other devices. Cellphone-based browsing might never have happened. This is relevant because, even with the success of the Internet, companies continually, bullheadedly repeat these mistakes. There are continual attempts to extend web browsers in ways that lock customers in; rather, they are locking potential customers out.

29.1 The Basics

In this section, we explain the building blocks that make up web service, the typical architectures used, and the measures needed to provide service that is secure, scalable, monitored, and easy to manage.

29.1.1 Web Service Building Blocks

A uniform resource locator (URL) is the address of information on the web. A URL consists of a host name and the path to the resource; for example, http://www.EverythingSysadmin.com/

A web server receives HTTP requests and replies with data. Some typical web server software packages are Apache HTTP, AOLserver, and Microsoft IIS. The reply is usually a static file. However, the reply is sometimes generated on demand, using the Common Gateway Interface (CGI). The generated, or dynamic, content is often the result of a database query.

There are two variations of CGI scripts: GET and POST. GET takes inputs—values assigned to variable names—and uses them to read a result. For example, http://google.com/finance?q=aapl is a GET request. POST takes inputs and uses them to do some mutation, such as update a shopping cart, post to a blog, or delete a file. POST does not put the input in the URL but inside the HTTP request itself. Therefore, the input can be much longer, as it does not have to abide by the length limits of URLs. The important point here is that GET is read-only and that POST is an update. When it crawls the web to build its search corpus, a search engine does not follow POSTs. There are famous cases in which programmers accidentally made an application in which a DELETE button was a GET instead of a POST; every time a web crawler found that site, all the data was deleted.

Other technologies for dynamically generated web pages include in-line interpreted code using systems such as PHP (www.php.net) and Microsoft's Active Server Pages. An otherwise normal HTML web page is stored on the server, but embedded in it are special instructions that the web server interprets prior to serving the page. An instruction might be to perform a database query and turn the results into a table. What the user sees is a web page with a dynamically generated table in the middle.

The web client receives the page, interprets it, and displays it for the user. The web browser is a universal client. Previously, every server application required the deployment of specific client software, which stalled progress. What makes the web revolutionary is that new services can be created without requiring new client software. The browser interprets and displays the page that was served. HTML is interpreted and displayed. Sometimes, an embedded language is included and is interpreted on the client, such as ECMAScript, commonly known as JavaScript. Other file formats require interpretation, such as photo formats, video, audio, and so on.

AJAX is not a protocol but a technique for creating interactive web pages. They are often as interactive as traditional PC applications. The server is contacted only at appropriate times for updates and major state transitions, which reduces server load and improves responsiveness. The term refers to its two building blocks: asynchronous JavaScript and XML. The user interface is implemented mostly in JavaScript, which is able to asynchronously contact the server for additional information as required, rather than wait for a user to click a SUBMIT button, as in HTTP GET and POST operations.

Web clients exist outside of computers and are now in mobile phones, televisions, and kiosks. Even home appliances configure themselves by using HTTP to draw a configuration file from a central server, or using HTTP POST requests to connect to the manufacturer to request service. In 2007, nearly a billion people in the world access the Internet exclusively from mobile phones. Millions have never seen the Internet used from a computer.

Many data formats are used on the web. A web server generally outputs web pages, multimedia files, or errors. Web pages are generally in HTML or an HTML derivative, such as XHTML, DHTML, XML, and so on. Some of these formats are old and discouraged; others are new and evolving. Multimedia files include pictures, audio, and video. New multimedia formats arise all the time. Usually, data cannot be used until the last byte is received, though web browsers do an excellent job of displaying partial results as data is received to make the viewing experience feel faster. However, some multimedia uses streaming formats, which are intended to be displayed in real time and provide pause, rewind, and fast-forward features. Streaming is particularly important for live audio and video. It wouldn't make sense to download a day's worth of live radio content and then listen to it after it has completely downloaded. With an audio streaming format, live radio can be listened to live.

Many special-purpose formats exist and are often built on top of other formats. XML is an excellent format for making other formats, or microformats. One popular microformat is an RSS feed, a format that lists a table of contents of a resource, such as a blog or news site. Wiki sites (see Chapter 9) often present a list of the most recently modified pages as an RSS feed. Special RSS-reading software scans many RSS feeds and displays new content to the user.

HTTP requests that fail have a standard set of error and status codes. Their meaning is usually not important to users but is very important to SAs. The codes are three-digit numbers. The first digit indicates the class of response: 1 for informational; 2 indicates success, 3 indicates redirection, and 4 indicates an error. Following are the most common codes:

• 200 (OK); request completed
• 301 (Moved permanently); implication: we wish you'd remember the new URL and go there directly next time
• 302 (Redirect to specified URL)
• 307 (Redirect to specified URL as a temporary measure)
• 401 (Please try again with authentication)
• 403 (Unauthorized, wrong password, or other issue)
• 404 (No such page found)

SAs often need to know these error codes to debug problems or to provide better service. For example, when debugging authentication problems, it is important to know that a page that requires authentication is generally first requested without authentication, and the browser receives code 401. The page is then rerequested with authentication information attached: for example, a username and password. Error 404, page not found, is important because although one still receives a page, it is an error message, not to be confused with a true web page. This error page can be customized.

29.1.2 The Webmaster Role

The webmaster is the person who manages the content of a web site, much as an editor manages the content of a newspaper. She is responsible for setting web site policy. This role is often confused with that of the web system administrator, who sets up the server, software, and so on. A webmaster is involved in content; a web system administrator maintains the technical machinery—physical and virtual—that makes up the web site. This confusion is understandable because at small sites, the same person may do both jobs, but even large sites confuse the two roles when management is not technical. Reinforce the difference by clearly explaining it to managers in terms that they understand, putting the two roles in different departments or two different budgets.

For SAs who find themselves forced into webmaster duties, we make the following recommendation. Focus on enabling people to do their own updates. Create the structure, but use software that makes the web site self-service. If you are required to update web pages, get agreement on a policy so that such requests are not a surprise. There are plenty of stories about SAs being ready to leave for the weekend when surprised with a request to make a series of updates that will take hours. Set up an SLA specifying that changes must be requested a certain number of hours or days in advance, or outlining a schedule: for example, major updates on Monday, minor updates done within hours, and a process for handling emergency requests. If at all possible, be involved in the processes that lead to changes to the web site so that there are fewer surprises. More examples are listed in Section 29.1.8.2.

29.1.3 Service-Level Agreements

Like any other service, a web service needs an SLA and monitoring to ensure compliance. Many customers are used to thinking of the web as a 24/7 critical service, but the SLA of an individual web service might be quite different. Most internal web services will have the same SLA as other office services, such as printing or storage. If in doubt about the appropriate SLA for a web service, ask the customer group using it. Ideally, as with any SLA, the service level should be set by collaborating with the customer community. We suggest that you resist setting any SLA that does not allow for periodic maintenance, unless the service is built on redundant infrastructure. If the service is provided by a single host or a shared web host and is required to be available around the clock, it is time to have a discussion about increasing the redundancy of the service.

Metrics that are part of a web SLA should include the latency for a certain level of queries per second (QPS). That is, how long should a typical query take when the system is under a particular load?
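To make this concrete, here is a minimal sketch of checking measured latencies against a hypothetical SLA clause of the form "95% of requests complete within 0.5 seconds when load is at least 100 QPS"; the function names and thresholds are invented for illustration, not drawn from any standard.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (seconds)."""
    ranked = sorted(samples)
    k = max(0, math.ceil(pct / 100 * len(ranked)) - 1)
    return ranked[k]

def sla_met(latencies_s, observed_qps, target_qps=100, p95_limit_s=0.5):
    """True if the p95 latency is within the limit; the latency clause
    applies only once the system is at or above the stated load."""
    if observed_qps < target_qps:
        return True
    return percentile(latencies_s, 95) <= p95_limit_s
```

Quoting latency together with a load level matters because latency usually degrades nonlinearly as load approaches capacity; an SLA that promises a latency without stating the query rate it applies at is effectively untestable.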
Latency is usually measured as the time between receipt of the first byte of the request and the sending of the last byte of the answer.

29.1.4 Web Service Architectures

Different types of content require different serving infrastructures. A single web server serving an unchanging document has different needs from one serving dynamic pages. Web servers that will be accessible from the Internet, rather than only within an organization, require particular attention to security.

29.1.4.1 Static Web Server

A static web server is one that serves only documents that don't change or change rarely. The documents are static in that they are read directly from disk and are not altered in the process of being delivered by the web server. The documents themselves are normal documents of any type, such as web pages, images, word-processing documents, spreadsheets, and so on. The documents are served out of the document root. If it is a shared volume, subdirectories with appropriate permissions can be created for various customer groups. Each group is then able to publish information simply by creating and updating files. Web pages can be edited with normal office applications or with special-purpose web editing packages available for most desktop systems. At high QPS rates, it is important that such a server have enough RAM to cache the most commonly requested files.

29.1.4.2 CGI Servers

CGI servers generate pages dynamically, as described before. Because a large amount of work is done for each page, these servers often cannot serve as many pages per second as a static server can. At high QPS rates, such a server must have enough CPU resources to keep up with demand. The software may also have other dependencies, such as RAM and network.

29.1.4.3 Database-Driven Web Sites

One of the most common types of web application is accessing and modifying information in a database. Examples of database-driven web applications are updating your preferences on a social-networking site, browsing an online shopping catalog, or choosing which classes you want to take next semester. Web applications are replacing paper forms and allowing people to interact directly with a database.

Database-driven web sites create a template for a particular type of data rather than individual web pages. For example, a web-based bookstore does not need to create a web page for every book it sells. Instead, its catalog is stored in a database, and any particular item can be displayed by filling in the template with information from the database. The information may come from multiple sources, with pricing and availability coming from separate databases, for example. To make a global format change, one simply updates the template. Imagine if a menu had to be changed and required every page on the site to be manually edited. The site would be unmanageable. Yet we are continually surprised to find web sites that started out manually updating every page
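The template idea can be sketched in a few lines of Python. Here a dict stands in for the catalog database and a single string template stands in for the page design; the item fields, SKU, and title are hypothetical, invented purely for this example.

```python
from string import Template

# One template shared by every product page; changing it restyles
# the whole catalog at once.
PAGE = Template("<h1>$title</h1><p>Price: $$$price</p><p>$stock</p>")

# Stand-in for the catalog database.
CATALOG = {
    "bk-101": {"title": "UNIX Backup Tools", "price": "49.99"},
}

def render_item(sku, in_stock=True):
    """Fill the shared template with one item's database record."""
    item = CATALOG[sku]
    return PAGE.substitute(
        title=item["title"],
        price=item["price"],
        stock="In stock" if in_stock else "Back-ordered",
    )
```

A real site would pull the record from a database query and might merge fields from several sources (pricing, inventory), but the shape is the same: one template, many records.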