Service Level and Performance Monitoring

Windows 2000 Server is being widely considered as an alternative to mainframe-type systems for high-end computing requirements. This places a tremendous burden and responsibility on Windows 2000 administrators to ensure maximum availability of their systems. This chapter thus discusses service level and provides an introduction to Windows 2000 Server performance monitoring.
What Is Service Level?
If there is anything you have learned in this book, it is this: Windows 2000 is a major-league operating system. In our opinion, it is the most powerful operating system in existence . . . for the majority of needs of all enterprises. Only time and service packs will tell if Windows 2000 can go up against the big irons such as AS/400, Solaris, S/390, and the like.
Microsoft has aimed Windows 2000 Server squarely at all levels of business and industry and at all business sizes. You will no doubt feel the rush of diatribe in the industry: 99.9 this, 10,000 concurrent hits that, clustering and load balancing, and more. But every system, server or OS, has its meltdown point, weak links, single point of failure (SPOF), "tensile strength," and so on. Knowing, or at least predicting, the meltdown "event horizon" is more important than availability claims. Trust us, poor management will turn any system or service into a service level nightmare.
In This Chapter
✦ Service Level Management
✦ Windows 2000 Service Level Tools
✦ Task Manager
✦ The Performance Console
✦ Performance Monitoring Guidelines
One of the first things you need to ignore in the press from the get-go is the crazy comparisons of Windows 2000 to $75 operating systems, and so on. If your business is worth your life to you and your staff, you need to invest in performance and monitoring tools, disaster recovery, Quality of Service tools, service level tools, and more. Take a survey of what these tools can cost you. Windows 2000 Server out of the box has more built into it than anything else, as this chapter will illustrate.

On our calculators, Windows 2000 Server is the cheapest system going on performance-monitoring tools alone.
Windows 2000 is no doubt going to be adopted by many organizations; it will certainly replace Windows NT over the next few years and will probably become the leading server operating system on the Internet. With application service providing (ASP), thin clients, Quality of Service, e-commerce, distributed networking architecture (DNA), and the like becoming real implementations everywhere as opposed to new buzzwords, you, the server or network administrator, are going to find yourself dealing with a new animal in your server room. This animal is known as the service level agreement (SLA).

Before we discuss the SLA further, we should first define service level and, second, describe how Windows 2000 addresses it.
Service Level (SL) is simply the ability of IT management or MIS to maintain a consistent, maximum level of system uptime and availability. Many companies may understand SL as quality assurance and quality control (QA/QC). The following examples explain it better.
Service Level: Example 1
Management comes to MIS with a business plan for application service providing (ASP). If certain customers can lease applications online, over reliable Internet connections, for x rate per month, they will forgo expensive in-house IT budgets and outsource instead. An ASP can, therefore, make its highly advanced network operations center and a farm of servers available to these businesses. If enough customers lease applications, the ASP will make a profit.

The business plan flies only if ASP servers and applications are available to customers all the time, at least from 7 a.m. to 9 p.m. The business plan will tolerate no more than 0.09 percent downtime during the day. Any more and customers will lose respect for the business and bring resources back in house instead. This means that IT or MIS must support the business plan by ensuring that systems are never offline for more than 0.09 percent of the business day. Response, as opposed to availability, is also a critical factor; Quality of Service, or QoS, addresses this aspect of SL and is discussed shortly in this chapter.
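To put that figure in perspective (our arithmetic, not the business plan's): the 7 a.m. to 9 p.m. window is 14 hours, or 50,400 seconds, so a 0.09 percent downtime budget allows roughly 45 seconds of unavailability per business day. By comparison, the often-quoted 99.9 percent availability target allows about 86 seconds per 24-hour day, or roughly 8.8 hours over a full year.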
Service Level: Example 2

Management asks MIS to take its order-placing system, typically fax-based and processed by representatives in the field, to the extranet. Current practice involves a representative going to a customer, taking an order for stock, and then faxing the order to the company's fax system, where the orders are manually entered into the system. The new system proposes that customers be equipped with an inexpensive terminal or terminal software and place the orders directly against their accounts on a Web server.

MIS has to ensure that the Web servers and the back-end systems, SQL Server 2000, Windows 2000 Server, the WAN, and so on, are available all the time. If customers find the systems offline, they will swamp the phones and fax machines, or simply place their orders with the competition. The system must also be reliable, informative, and responsive to the customers' needs.
The Service Level Agreement

The first example may require a formal service level agreement. In other words, the SLA will be a written contract signed between the client and the provider. The customer demands that the ASP provide written—signed—guarantees that the systems will be available 99.9 percent of the time. The customer demands such an SLA because it cannot afford to be in the middle of an order-processing application, or a sales letter, and then have the ASP suddenly disappear.
The customer may be able to tolerate a certain level of unavailability, but if SL drops beyond what's tolerable, the customer needs a way to obtain redress from the ASP. This redress could be the ability to cancel the contract, or the ability to hold the ASP accountable with penalties, such as fines, discounts on service costs, waiver of monthly fees, and so on. Whatever the terms of the SLA, if the ASP cannot meet them, then MIS gets the blame.
In the second example, there is unlikely to be a formal SLA between a customer and the supplier. Service level agreements will instead take the form of memos between MIS and other areas of management. MIS will agree to provide a certain level of availability to the business model or plan. These SLAs are put in writing and are usually favored by MIS, which can take the SLA to budget meetings and request money for systems and software to meet it.
However, the SLA can work to the disadvantage of MIS, too. If SL is not met, the MIS
staff or CTO may get fired, demoted, or reassigned. The CEO may also decide to
outsource or force MIS to bring in expensive consultants (which may help or hurt
MIS).
In IT shops that now support SL for mission-critical applications, there is no margin for error. Engineers who cannot help MIS meet SL do not survive long. Education and experience are likely to be high on the list of employment requirements.
Service Level Management
Understanding Service Level Management (SLM) is an essential requirement for MIS
in almost all companies today. This section examines critical SLM factors that have
to be addressed.
Problem Detection
This factor requires IT to monitor systems constantly for advance warnings of system failure. You use whatever tools you can obtain to monitor systems and focus on all the possible points of failure. For example, you will need to monitor storage, networks, memory, processors, power, and so on.
Problem detection is a lot like earthquake detection. You spend all of your time listening to the earth, and the quake comes when you least expect it and where you least expect it. Then, 100 percent of your effort is spent on disaster recovery (DR). Your DR systems then need to kick in to recover. According to research from the likes of Forrester Research, close to 40 percent of IT management resources are spent on problem detection.
Performance Management
Performance Management accounts for about 20 percent of MIS resources. This factor is closely related to problem detection. You can hope that poor performance in areas such as networking, access times, transfer rates, restore or recovery performance, and so on, will point to problems that can be fixed before they turn into disasters. Most of the time, however, a failure is caused by failures in another part of the system. For example, if you get a flood of continuous writes to a hard disk that does not let up until the hard disk crashes, is the hard disk at fault, or should you be looking for better firewall software?

The right answer is a combination of both factors. The fault is caused by poor-quality firewall software that gives passage to a denial-of-service attack. But in the event this happens again, we need hard disks that can withstand the attack a lot longer.
Availability
Availability, for the most part, is a post-operative factor. In other words, availability
management covers redundancy, mirrored or duplexed systems, fail-overs, and so
on. Note that fail-over is emphasized because the term itself denotes taking over
from a system that has failed.
Clustering of systems or load balancing, on the other hand, is as much disaster prevention as it is a performance-level maintenance practice. Using performance management, you take systems to a performance point that is nearing a threshold or maximum level, and then you switch additional requests for service to other resources. A fail-over, by contrast, is a machine or process that picks up the users and processes that were on a system that has just failed, and it is supposed to allow the workload to continue uninterrupted on the fail-over systems. A good example of fail-over is a mirrored disk or a RAID-5 storage set: The failure of one disk does not interrupt the processing, which carries on, oblivious to the failure, on the remaining disks, giving management time to replace the defective components.
There are several other SL-related areas that IT spends time on and which impact
SLM. These include change management and control, software distribution, and
systems management. See Chapter 11 for an extensive discussion of Change
Management.
SLM by Design
SLM combines tools and metrics or analysis to meet the objectives of SL and
Service Level Agreements. The SLM model is a three-legged stool, as illustrated
in Figure 20-1.
The availability leg supports the model by guaranteeing availability of critical systems. The administration leg ensures 24×7 operations and administrative housekeeping. The performance leg supports the model by assuring that systems are able to service the business and keep systems operating at threshold points considered safely below bottleneck and failure levels. If one of the legs fails or becomes weak, the stool may falter or collapse, which puts the business at risk.
When managing for availability, the enterprise will ensure it has the resources to recover from disasters as soon as possible. This usually means hiring gurus or experts to be available on-site to fix problems as quickly as possible. Often, management will pay a guru who does nothing for 95 percent of his or her time, which seems a waste. But if they can fix a problem in record time, they will have earned their keep several times over.
Figure 20-1: The SLM model is a three-legged stool.
Often, a guru will restore a system that, had it stayed offline a few days longer, would have cost the company much more than the guru's salary. However, it goes without saying that the enterprise will save a lot of money and effort if it can obtain gurus who are also qualified to monitor for performance and problems, and who do not just excel at recovery. This should be worth 50 percent more salary to the guru.
Administration is the effort of technicians to keep systems backed up, keep power supplies online, monitor servers for error messages, ensure server rooms remain at safe temperatures with adequate air circulation, and so on. The administrative leg manages the SL budget, hires and fires, maintains and reports on service level achievement, and reports to management or the CEO.
The performance leg is usually carried out by analysts who know what to look for in a system. These analysts get paid the big bucks to help management decide how to support business initiatives and how to exploit opportunity. They need to know everything there is to know about the technology and its capabilities. For example, they need to know which databases should be used, how RAID works and which level is required, and so on. They are able to collect data, interpret data, and forecast needs.
SLM and Windows 2000 Server
Key to meeting the objective of SLM is the acquisition of SL tools and technology.
This is where Windows 2000 Server comes in. While clustering and load balancing are included in Advanced Server and Datacenter Server, the performance and system monitoring tools and disaster recovery tools are available in all versions of the OS.

These tools are essential to SL. Acquired independently of the operating system, they can cost an arm and a leg, and they might not integrate at the same level. These tools were seriously lacking on Windows NT 4.0. On Windows 2000, however, they raise the bar for all operating systems. Many competitive products unfortunately just do not compete when it comes to SLM. The costs of third-party tools and integration for some operating systems are so prohibitive that they cannot be considered of any use to SLM whatsoever.
The Windows 2000 monitoring tools are complex, and continued ignorance of them will not be tolerated by management as more and more customers demand SL compliance and service level agreements. The monitoring and performance tools on Windows 2000 include the following:
✦ System Monitor
✦ Task Manager
✦ Event Viewer
✦ Quality of Service
✦ Windows Management Instrumentation (WMI)
✦ SNMP
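As a quick orientation, and assuming a default Windows 2000 installation: typing perfmon in the Run dialog on the Start menu opens the Performance console (System Monitor plus Performance Logs and Alerts), taskmgr (or pressing Ctrl+Shift+Esc) opens Task Manager, and eventvwr opens Event Viewer. Quality of Service, WMI, and SNMP are services and infrastructure rather than consoles; they are enabled and configured through their own snap-ins and setup options.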
We are not going to provide an exhaustive investigation into the SLM tools that ship with Windows 2000, or how to use each and every one. Such an advanced level of analysis would take several hundred pages and is thus beyond the scope of this book. Performance monitoring is also one of the services and support infrastructures that ships with Windows 2000 but takes some effort to get to know and master. However, the information that follows will be sufficient to get you started.
Windows 2000 System Monitoring Architecture
Windows 2000 monitors or analyzes storage, memory, networks, and processing.
This does not sound like a big deal, but the data analysis is not done on these areas
per se. In other words, you do not monitor memory itself, or disk usage itself, but
rather how software components and functionality use these resources. In short, it
is not sufficient to just report that 56MB of RAM was used between time x and time
y. Your investigations need to find out what used the RAM at a certain time and why
so much was used.
If a system continues to run out of memory, there is a strong possibility, for example, that an application is stealing the RAM somewhere. In other words, the application or process has a bug and is leaking memory. When we refer to memory leaks, this means that software that has used memory has not released it after it is done. Software developers are able to watch their applications on servers to be sure they release all the memory they use.
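As a deliberately simplified sketch of what a leak looks like in code (hypothetical, and not drawn from the Winsock case described next), the following C fragment allocates a buffer for every request it handles but never frees it, so the process slowly consumes all available RAM under sustained load:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define REQUEST_BUFFER_SIZE (64 * 1024)    /* 64 KB per request */

/* Hypothetical request handler that leaks memory: each call
 * allocates a buffer and never releases it.                  */
static void handle_request(const char *payload)
{
    char *buffer = malloc(REQUEST_BUFFER_SIZE);
    if (buffer == NULL)
        return;                            /* allocation failed */

    strncpy(buffer, payload, REQUEST_BUFFER_SIZE - 1);
    buffer[REQUEST_BUFFER_SIZE - 1] = '\0';

    /* ... process the request ... */

    /* BUG: free(buffer) is missing, so 64 KB is lost per call. */
}

int main(void)
{
    int i;
    for (i = 0; i < 10000; i++)            /* simulate sustained load */
        handle_request("GET /index.html");

    printf("Handled 10000 requests and leaked roughly 640MB of RAM.\n");
    return 0;
}

While a program like this runs, a counter such as the Process object's Private Bytes (or the Memory object's Available Bytes) would trend steadily in one direction, which is exactly the signature administrators look for.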
What if you are losing memory and you do not know which application is responsible? Not too long ago, Windows NT servers used on the Internet and in high-end mail applications (no fewer than 100,000 e-mails per hour) would simply run out of RAM. After extensive system monitoring, we were able to determine that the leak was in the latest release of the Winsock libraries responsible for Internet communications on NT. Another company in Europe found the leak at about the same time. Microsoft later released a patch. It turned out that the Winsock functions responsible for releasing memory were not able to cope with the rapid demand on the sockets. They were simply being opened at a rate faster than the Winsock libraries could cope with.
The number of software components, services, and threads of functionality in Windows 2000 is so great that it is practically impossible to monitor tens of thousands of instances of storage, memory, network, or processor usage.
To achieve such detailed and varied analysis, Windows 2000 includes built-in software objects, associated with services and applications, which are able to collect data in these critical areas. So when you collect data, the focus of your data collection is on the software components, in various services of the operating system, that are associated with these areas. When you perform data collection, the system collects data from the targeted object managers in each monitoring area.
There are two methods of data collection supported in Windows 2000. The first one involves accessing registry pointers to functions in the performance counter DLLs in the operating system. The second supports collecting data through Windows Management Instrumentation (WMI). WMI is an object-oriented framework that allows you to instantiate (create instances of) performance objects that wrap the performance functionality in the operating system.
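To make the first method a little more concrete, here is a minimal sketch in C of reading the raw performance data through the registry interface. It queries the special HKEY_PERFORMANCE_DATA key for the "Global" value, which returns a PERF_DATA_BLOCK (defined in winperf.h) describing the base performance objects; real code would then walk the object and counter structures that follow the header, which is exactly the drudgery that WMI and the higher-level tools hide from you. Error handling is trimmed for brevity.

#include <windows.h>
#include <winperf.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD  bufSize = 64 * 1024;            /* starting buffer size     */
    BYTE  *data    = malloc(bufSize);
    DWORD  size;
    LONG   rc;

    /* The "Global" value asks the performance DLLs for all base objects. */
    for (;;)
    {
        size = bufSize;
        rc = RegQueryValueEx(HKEY_PERFORMANCE_DATA, TEXT("Global"),
                             NULL, NULL, data, &size);
        if (rc != ERROR_MORE_DATA)
            break;
        bufSize += 64 * 1024;              /* buffer too small: grow   */
        data = realloc(data, bufSize);
    }

    if (rc == ERROR_SUCCESS)
    {
        PERF_DATA_BLOCK *pdb = (PERF_DATA_BLOCK *)data;
        printf("Collected %lu performance object types in %lu bytes.\n",
               pdb->NumObjectTypes, pdb->TotalByteLength);
    }

    RegCloseKey(HKEY_PERFORMANCE_DATA);    /* releases the collection  */
    free(data);
    return 0;
}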
The OS installs a new technology for recovering data through WMI, using Managed Object Format (MOF) files. These MOF files correspond to, or are associated with, resources in a system. The objects that are the subject of performance monitoring are too numerous to list here, but they can be looked up in the Windows 2000 Performance Counters Reference, which is on the Windows 2000
Resource Kit CD (see Appendix B). However, they include the operating system's base services, such as the services that report on RAM, Paging File functionality, and Physical Disk usage, and the operating system's advanced services, such as Active Directory, Active Server Pages, the FTP service, DNS, WINS, and so on.
To understand the scope and usage of the objects, it helps to first understand some performance data and analysis terms. There are three essential concepts in performance monitoring: throughput, queues, and response time. Once you fully understand these terms, you can broaden your scope of analysis and perform calculations to report transfer rate, access time, latency, tolerance, thresholds, bottlenecks, and so on.
What Is Rate and Throughput?
Throughput is the amount of work done in a unit of time. If your child is able to assemble 100 Lego bricks per hour, you could say that his or her assemblage rate is 100 pieces per hour, assessed over a period of x hours, as long as the rate remains constant. However, if the rate of assemblage varies, through fatigue, hunger, thirst, and so on, we can still calculate the throughput as the total work done divided by the total time taken.

Throughput increases as the number of components assembled increases, or as the available time to complete the job is reduced. Throughput depends on resources; time and space are examples of resources. The slowest point in the system sets the throughput for the system as a whole. Throughput is the true indicator of performance.

Memory is a resource, the space in which to carry out instructions. It makes little sense to rate a system by millions of instructions per second when sufficient memory is not available to hold the instruction information.
What Is a Queue?
If you give your child too many Lego bricks to assemble, or reduce the available time in which he or she has to perform the assemblage, the number of pieces will begin to pile up. This happens too in software and IS terms, where the number of threads can begin to back up, one behind the other, in a queue. When a queue develops, we say that a bottleneck has occurred. Looking for bottlenecks in the system is key to monitoring for performance and to troubleshooting or problem detection. If there are no bottlenecks, the system might be considered healthy, but a bottleneck might soon start to develop.

Queues can also form if requests for resources are not evenly spread over the unit of time. If your child assembles one piece per minute, evenly, every minute, he or she will get through 60 pieces in an hour. But if the child does nothing for 45 minutes and then suddenly gets inspired, a bottleneck will occur in the final 15 minutes because there are more pieces than the child can process in the remaining time. On
computer systems, when queues and bottlenecks develop, systems become unresponsive. Additional requests for processor or disk resources are stalled. When requesting services are not satisfied, the system begins to break down. In this respect, we reference the response time of a system.
What Is Response Time?
Response time is the measure of how much time elapses between the firing of a computer event, such as a read request, and the system's response to the request. Response time will increase as the load increases, because the system is still responding to other events and does not have enough resources to handle new requests. A system that has insufficient memory and/or processing ability will process a huge database sort a lot more slowly than a better-endowed system with faster hard disks and CPUs. If response time is not satisfactory, you will either have to work with less data or increase the resources.

Response time is typically estimated by dividing the queue length by the resource throughput. Response time, queues, and throughput are reported and calculated by the Windows 2000 reporting tools.
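To tie the three concepts together, here is a small worked calculation in C. The figures are invented purely for illustration; the point is the relationship described above: throughput is work divided by time, and response time can be approximated as queue length divided by throughput.

#include <stdio.h>

int main(void)
{
    /* Invented sample figures for one disk over a 10-second interval. */
    double requests_completed = 1200.0;  /* work done in the interval  */
    double interval_seconds   = 10.0;    /* the unit of time           */
    double queue_length       = 18.0;    /* average requests waiting   */

    /* Throughput: amount of work done in a unit of time.              */
    double throughput = requests_completed / interval_seconds;

    /* Response time approximated as queue length / throughput.        */
    double response_time = queue_length / throughput;

    printf("Throughput:    %.0f requests per second\n", throughput);
    printf("Response time: %.3f seconds per request\n", response_time);
    return 0;
}

This prints a throughput of 120 requests per second and a response time of 0.150 seconds. If the queue length climbs while throughput stays flat, response time climbs with it; that is why a steadily growing queue counter is the classic signature of a bottleneck.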
How Performance Objects Work
Windows 2000 performance monitoring objects contain functionality known as performance counters. These so-called counters perform the actual analysis. For example, a hard disk object is able to calculate transfer rate, while a processor-associated object is able to calculate processor time.

To gain access to the data or to start the data collection, you first have to create the object and gain access to its functionality. This is done by calling a create function from a user interface or other process. As soon as the object is created and its data collection functionality invoked, it begins the data-collection process and stores the data in various properties. Data can be streamed out to disk files, RAM, or other components that assess the data and present it in some meaningful way.

Depending on the object, your analysis software can create at least one copy of the performance object and analyze the counter information it generates. You need to consult the Microsoft documentation that "exposes" the objects to determine whether an object can be created more than once concurrently. If it can, you will have to associate your application with the data the object collects by referencing the object's instance counter. Windows 2000 allows you to instantiate an object for a local computer's services, or you can create an object that collects data from a remote computer.
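As a concrete illustration of this create-and-collect cycle, the following C sketch uses the Performance Data Helper (PDH) library that ships with Windows 2000 (pdh.h, linked against pdh.lib), which wraps the counter plumbing described above. It creates a query, attaches the % Processor Time counter for the _Total instance of the Processor object, collects two samples (rate counters need two), and reads the formatted value. Treat it as an outline rather than production code; every call returns a status that a real application would check.

#include <windows.h>
#include <stdio.h>
#include <pdh.h>

int main(void)
{
    PDH_HQUERY           query;
    PDH_HCOUNTER         counter;
    PDH_FMT_COUNTERVALUE value;

    /* "Create" the query object that owns the data collection.       */
    PdhOpenQuery(NULL, 0, &query);

    /* Attach a counter: the _Total instance of the Processor object. */
    PdhAddCounter(query, TEXT("\\Processor(_Total)\\% Processor Time"),
                  0, &counter);

    /* Rate counters need two samples before a value can be computed. */
    PdhCollectQueryData(query);
    Sleep(1000);                        /* sample interval: 1 second  */
    PdhCollectQueryData(query);

    PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, NULL, &value);
    printf("Processor usage over the last second: %.1f%%\n",
           value.doubleValue);

    PdhCloseQuery(query);               /* releases query and counter */
    return 0;
}

Collecting the same counter from a remote machine is mostly a matter of the counter path: prefixing it with a computer name (for example, \\SERVER7\Processor(_Total)\% Processor Time, where SERVER7 is a placeholder) points the query at that computer, which is the remote-collection capability mentioned above.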
[...] increase network bandwidth, consider saving the remote data to log files on the remote servers and then either copy the data to the local computer or view it remotely. [...]

Summary

This chapter introduced Service Level and Service Level Management. More and more companies and business plans are demanding that MIS maintain SL standards. To ensure that MIS or IT and IS managers adhere to the performance requirements [...]

[...] monitor by the server role:

✦ Application Servers: These include standard application servers and Terminal Services (or application) servers. Terminal Services are more demanding and require constant performance monitoring. The heaviest resource usage on these servers is memory and CPU. Objects to monitor include Cache, Memory, Processors, and System. [...]

Monitoring performance requires resources, which can adversely affect the data you're trying to gather. Therefore, you need to decrease the impact of your performance monitoring activities. There are several techniques you can use to ensure that performance monitoring overhead is kept to a minimum on any server you are monitoring. [...]

[...] types of performance-related logs: counter logs and trace logs. These logs are useful for advanced performance analysis and record-keeping that can be done over a period of time. There is also an alerting mechanism. The Performance Logs and Alerts tree is shown in Figure 20-6. The tool is part of the Performance console snap-in and is thus started as described earlier.

Figure 20-6: The Performance Logs and Alerts [...]

[...] problems, and maintain server and service health. These tools will also allow you to plan capacity and provide feedback to management to ensure that IT continues to support the business models and marketing plans being adopted. We have discussed the Performance Console, System Monitor, Logs and Alerts, and Task Manager in very loose terms. Our definitions have also been very broad. The number of monitoring [...]

[...] resources and system services based on the performance objects described earlier. It works with counters in the same manner as System Monitor. The Performance Logs and Alerts service obtains data from the operating system when the update interval has elapsed. Trace logs collect event traces. With trace logs, you can measure performance associated with events related to memory, storage file I/O, and so on. [...]

[...] compute the data it receives and just reports it. On the other hand, average counting computes the data for you. For example, it is able to compute bits per second, or pages per second, and so on. Other counters are able to report percentages, differences, and so on.

System Monitoring Tools

Before you rush out and buy a software development environment to access the performance monitoring routines, you should [...]

[...] disks, and memory the heaviest. You can monitor the memory collection, Cache, Processor, System, PhysicalDisk, and LogicalDisk objects. Exchange also ships with specialized counters.

✦ Web/Internet Information Server: These servers consume extensive disk, cache, and network components. Consider monitoring the Cache, Network Segment, PhysicalDisk, and LogicalDisk objects.

Performance Monitoring Overhead

Monitoring [...]
[...] ready-to-go monitoring tools: the Performance Console and Task Manager. Task Manager provides an instant view of system activity such as memory usage, processor activity, process activity, and resource consumption. Task Manager is very helpful for immediate detection of system problems. The Performance Console is used to provide performance analysis and information that can be used for troubleshooting and [...]

[...] Right-clicking the pane and saving the display as an HTML file does this, and it is the default Save As format. Alternately, you can save the log file in comma-separated (.csv) or tab-separated (.tsv) format and then import the data into a spreadsheet, database, or report program such as Crystal Reports. [...]