Service Level and Performance Monitoring

Windows 2000 Server is being widely considered as an alternative to mainframe-type systems for high-end computing requirements. This places a tremendous burden and responsibility on Windows 2000 administrators to ensure maximum availability of their systems. This chapter thus discusses service level and provides an introduction to Windows 2000 Server performance monitoring.
What Is Service Level?
If there is anything you have learned in this book, it is this: Windows 2000 is a major-league operating system. In our opinion, it is the most powerful operating system in existence . . . for the majority of needs of all enterprises. Only time and service packs will tell if Windows 2000 can go up against the big irons such as AS/400, Solaris, S/390, and the like.
Microsoft has aimed Windows 2000 Server squarely at all levels of business and industry and at all business sizes. You will no doubt feel the rush of diatribe in the industry: 99.9 this, 10,000 concurrent hits that, clustering and load balancing, and more. But every system, server or OS, has its meltdown point, weak links, single point of failure (SPOF), "tensile strength," and so on. Knowing, or at least predicting, the meltdown "event horizon" is more important than availability claims. Trust us, poor management will turn any system or service into a service level nightmare.
In This Chapter
✦ Service Level Management
✦ Windows 2000 Service Level Tools
✦ Task Manager
✦ The Performance Console
✦ Performance Monitoring Guidelines
One of the first things you need to ignore in the press from the get-go is the crazy comparisons of Windows 2000 to $75 operating systems, and so on. If your business is worth your life to you and your staff, you need to invest in performance and monitoring tools, disaster recovery, Quality of Service tools, service level tools, and more. Take a survey of what these tools can cost you. Windows 2000 Server out of the box has more built into it than anything else, as this chapter will illustrate.

On our calculators, Windows 2000 Server is the cheapest system going on performance-monitoring tools alone.
Windows 2000 is no doubt going to be adopted by many organizations; it will certainly replace Windows NT over the next few years and will probably become the leading server operating system on the Internet. With application service providing (ASP), thin clients, Quality of Service, e-commerce, distributed networking architecture (DNA), and the like becoming real implementations everywhere as opposed to new buzzwords, you, the server or network administrator, are going to find yourself dealing with a new animal in your server room. This animal is known as the service level agreement (SLA).

Before we discuss the SLA further, we should first define service level and, second, describe how Windows 2000 addresses it.
Service Level (SL) is simply the ability of IT management or MIS to maintain a consistent, maximum level of system uptime and availability. Many companies may understand SL as quality assurance and quality control (QA/QC). The following examples explain it better.
Service Level: Example 1
Management comes to MIS with a business plan for application service providing (ASP). If certain customers can lease applications online, over reliable Internet connections, for x rate per month, they will forgo expensive in-house IT budgets and outsource instead. An ASP can, therefore, make its highly advanced network operations center and a farm of servers available to these businesses. If enough customers lease applications, the ASP will make a profit.

The business plan flies only if ASP servers and applications are available to customers all the time, at least from 7 a.m. to 9 p.m. The business plan will tolerate no more than 0.09 percent downtime during the day. Any more and customers will lose respect for the business and bring resources back in house instead. This means that IT or MIS must support the business plan by ensuring that systems are never offline for more than 0.09 percent of the business day. Response, as opposed to availability, is also a critical factor; Quality of Service, or QoS, addresses this aspect of SL and is discussed shortly in this chapter.
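To put that figure in perspective (our arithmetic, not the business plan's): the 7 a.m. to 9 p.m. window is 14 hours, or 50,400 seconds, so a 0.09 percent downtime budget allows roughly 45 seconds of unavailability per business day. By comparison, the often-quoted 99.9 percent availability target allows about 86 seconds per 24-hour day, or roughly 8.8 hours over a full year.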
Service Level: Example 2

Management asks MIS to take its order-placing system, typically fax-based and processed by representatives in the field, to the extranet. Current practice involves a representative going to a customer, taking an order for stock, and then faxing the order to the company's fax system, where the orders are manually entered into the system. The new system proposes that customers be equipped with an inexpensive terminal or terminal software and place the orders directly against their accounts on a Web server.

MIS has to ensure that the Web servers and the back-end systems, SQL Server 2000, Windows 2000 Server, the WAN, and so on, are available all the time. If customers find the systems offline, they will swamp the phones and fax machines, or simply place their orders with the competition. The system must also be reliable, informative, and responsive to the customers' needs.
The Service Level Agreement

The first example may require a formal service level agreement. In other words, the SLA will be a written contract signed between the client and the provider. The customer demands that the ASP provide written—signed—guarantees that the systems will be available 99.9 percent of the time. The customer demands such an SLA because it cannot afford to be in the middle of an order-processing application, or a sales letter, and then have the ASP suddenly disappear.
The customer may be able to tolerate a certain level of unavailability, but if SL drops beyond what's tolerable, the customer needs a way to obtain redress from the ASP. This redress could be the ability to cancel the contract, or the ability to hold the ASP accountable with penalties, such as fines, discounts on service costs, waiver of monthly fees, and so on. Whatever the terms of the SLA, if the ASP cannot meet them, then MIS gets the blame.
In the second example, there is unlikely to be a formal SLA between a customer and the supplier. Service level agreements will instead take the form of memos between MIS and other areas of management. MIS will agree to provide a certain level of availability to the business model or plan. These SLAs are put in writing and are usually favored by MIS, which can take the SLA to budget meetings and request money for systems and software to meet it.
However, the SLA can work to the disadvantage of MIS, too. If SL is not met, the MIS
staff or CTO may get fired, demoted, or reassigned. The CEO may also decide to
outsource or force MIS to bring in expensive consultants (which may help or hurt
MIS).
In IT shops that now support SL for mission-critical applications, there is no margin for error. Engineers who cannot help MIS meet SL do not survive long. Education and experience are likely to be high on the list of employment requirements.
Service Level Management
Understanding Service Level Management (SLM) is an essential requirement for MIS
in almost all companies today. This section examines critical SLM factors that have
to be addressed.
Problem Detection
This factor requires IT to monitor systems constantly for advance warnings of system failure. You use whatever tools you can obtain to monitor systems and focus on all the possible points of failure. For example, you will need to monitor storage, networks, memory, processors, power, and so on.
Problem detection is a lot like earthquake detection. You spend all of your time listening to the earth, and the quake comes when you least expect it and where you least expect it. Then, 100 percent of your effort is spent on disaster recovery (DR). Your DR systems then need to kick in to recover. According to research from the likes of Forrester Research, close to 40 percent of IT management resources are spent on problem detection.
Performance Management
Performance Management accounts for about 20 percent of MIS resources. This factor is closely related to problem detection. You can hope that poor performance in areas such as networking, access times, transfer rates, restore or recovery performance, and so on, will point to problems that can be fixed before they turn into disasters. Most of the time, however, a failure is caused by failures in another part of the system. For example, if you get a flood of continuous writes to a hard disk that does not let up until the hard disk crashes, is the hard disk at fault, or should you be looking for better firewall software?

The right answer is a combination of both factors. The fault is caused by poor-quality firewall software that gives passage to a denial-of-service attack. But in the event this happens again, we need hard disks that can withstand the attack a lot longer.
Availability
Availability, for the most part, is a post-operative factor. In other words, availability
management covers redundancy, mirrored or duplexed systems, fail-overs, and so
on. Note that fail-over is emphasized because the term itself denotes taking over
from a system that has failed.
Clustering of systems or load balancing, on the other hand, is as much disaster prevention as it is a performance-level maintenance practice. Using performance management, you take systems to a performance point that is nearing a threshold or maximum level, and then you switch additional requests for service to other resources. A fail-over, by contrast, is a machine or process that picks up the users and processes that were on a system that has just failed, and it is supposed to allow the workload to continue uninterrupted on the fail-over systems. A good example of fail-over is a mirrored disk or a RAID-5 storage set: The failure of one disk does not interrupt the processing, which carries on, oblivious to the failure, on the remaining disks, giving management time to replace the defective components.
There are several other SL-related areas that IT spends time on and which impact
SLM. These include change management and control, software distribution, and
systems management. See Chapter 11 for an extensive discussion of Change
Management.
SLM by Design
SLM combines tools and metrics or analysis to meet the objectives of SL and
Service Level Agreements. The SLM model is a three-legged stool, as illustrated
in Figure 20-1.
The availability leg supports the model by guaranteeing availability of critical systems. The administration leg ensures 24×7 operations and administrative housekeeping. The performance leg supports the model by assuring that systems are able to service the business and keep systems operating at threshold points considered safely below bottleneck and failure levels. If one of the legs fails or becomes weak, the stool may falter or collapse, which puts the business at risk.
When managing for availability, the enterprise will ensure it has the resources to recover from disasters as soon as possible. This usually means hiring gurus or experts to be available on-site to fix problems as quickly as possible. Often, management will pay a guru who does nothing for 95 percent of his or her time, which seems a waste. But if they can fix a problem in record time, they will have earned their keep several times over.
Figure 20-1: The SLM model is a three-legged stool.
Often, a guru will restore a system that, had it stayed offline a few days longer, would have cost the company much more than the guru's salary. However, it goes without saying that the enterprise will save a lot of money and effort if it can obtain gurus who are also qualified to monitor for performance and problems, and who do not just excel at recovery. This should be worth 50 percent more salary to the guru.
Administration is the effort of technicians to keep systems backed up, keep power supplies online, monitor servers for error messages, ensure server rooms remain at safe temperatures with adequate air circulation, and so on. The administrative leg manages the SL budget, hires and fires, maintains and reports on service level achievement, and reports to management or the CEO.
The performance leg is usually carried out by analysts who know what to look for in a system. These analysts get paid the big bucks to help management decide how to support business initiatives and how to exploit opportunity. They need to know everything there is to know about the technology and its capabilities. For example, they need to know which databases should be used, how RAID works and which level is required, and so on. They are able to collect data, interpret data, and forecast needs.
SLM and Windows 2000 Server
Key to meeting the objective of SLM is the acquisition of SL tools and technology.
This is where Windows 2000 Server comes in. While clustering and load balancing are included in Advanced Server and Datacenter Server, the performance and system monitoring tools and disaster recovery tools are available in all versions of the OS.

These tools are essential to SL. Acquired independently of the operating system, they can cost an arm and a leg, and they might not integrate at the same level. These tools were seriously lacking on Windows NT 4.0. On Windows 2000, however, they raise the bar for all operating systems. Many competitive products unfortunately just do not compete when it comes to SLM. The costs of third-party tools and integration for some operating systems are so prohibitive that they cannot be considered of any use to SLM whatsoever.
The Windows 2000 monitoring tools are complex, and continued ignorance of them will not be tolerated by management as more and more customers demand SL compliance and service level agreements. The monitoring and performance tools on Windows 2000 include the following:
✦ System Monitor
✦ Task Manager
✦ Event Viewer
✦ Quality of Service
✦ Windows Management Instrumentation (WMI)
✦ SNMP
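As a quick orientation, and assuming a default Windows 2000 installation: typing perfmon in the Run dialog on the Start menu opens the Performance console (System Monitor plus Performance Logs and Alerts), taskmgr (or pressing Ctrl+Shift+Esc) opens Task Manager, and eventvwr opens Event Viewer. Quality of Service, WMI, and SNMP are services and infrastructure rather than consoles; they are enabled and configured through their own snap-ins and setup options.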
We are not going to provide an exhaustive investigation into the SLM tools that ship with Windows 2000, or how to use each and every one. Such an advanced level of analysis would take several hundred pages and is thus beyond the scope of this book. Performance monitoring is also one of the services and support infrastructures that ships with Windows 2000 but takes some effort to get to know and master. However, the information that follows will be sufficient to get you started.
Windows 2000 System Monitoring Architecture
Windows 2000 monitors or analyzes storage, memory, networks, and processing.
This does not sound like a big deal, but the data analysis is not done on these areas
per se. In other words, you do not monitor memory itself, or disk usage itself, but
rather how software components and functionality use these resources. In short, it
is not sufficient to just report that 56MB of RAM was used between time x and time
y. Your investigations need to find out what used the RAM at a certain time and why
so much was used.
If a system continues to run out of memory, there is a strong possibility, for example, that an application is stealing the RAM somewhere. In other words, the application or process has a bug and is leaking memory. When we refer to memory leaks, this means that software that has used memory has not released it after it is done. Software developers are able to watch their applications on servers to be sure they release all the memory they use.
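As a deliberately simplified sketch of what a leak looks like in code (hypothetical, and not drawn from the Winsock case described next), the following C fragment allocates a buffer for every request it handles but never frees it, so the process slowly consumes all available RAM under sustained load:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define REQUEST_BUFFER_SIZE (64 * 1024)    /* 64 KB per request */

/* Hypothetical request handler that leaks memory: each call
 * allocates a buffer and never releases it.                  */
static void handle_request(const char *payload)
{
    char *buffer = malloc(REQUEST_BUFFER_SIZE);
    if (buffer == NULL)
        return;                            /* allocation failed */

    strncpy(buffer, payload, REQUEST_BUFFER_SIZE - 1);
    buffer[REQUEST_BUFFER_SIZE - 1] = '\0';

    /* ... process the request ... */

    /* BUG: free(buffer) is missing, so 64 KB is lost per call. */
}

int main(void)
{
    int i;
    for (i = 0; i < 10000; i++)            /* simulate sustained load */
        handle_request("GET /index.html");

    printf("Handled 10000 requests and leaked roughly 640MB of RAM.\n");
    return 0;
}

While a program like this runs, a counter such as the Process object's Private Bytes (or the Memory object's Available Bytes) would trend steadily in one direction, which is exactly the signature administrators look for.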
What if you are losing memory and you do not know which application is responsible? Not too long ago, Windows NT servers used on the Internet and in high-end mail applications (no fewer than 100,000 e-mails per hour) would simply run out of RAM. After extensive system monitoring, we were able to determine that the leak was in the latest release of the Winsock libraries responsible for Internet communications on NT. Another company in Europe found the leak at about the same time. Microsoft later released a patch. It turned out that the Winsock functions responsible for releasing memory were not able to cope with the rapid demand on the sockets. They were simply being opened at a rate faster than the Winsock libraries could cope with.
The number of software components, services, and threads of functionality in Windows 2000 is so great that it is practically impossible to monitor tens of thousands of instances of storage, memory, network, or processor usage.
To achieve such detailed and varied analysis, Windows 2000 includes built-in software objects, associated with services and applications, which are able to collect data in these critical areas. So when you collect data, the focus of your data collection is on the software components, in various services of the operating system, that are associated with these areas. When you perform data collection, the system collects data from the targeted object managers in each monitoring area.
There are two methods of data collection supported in Windows 2000. The first one involves accessing registry pointers to functions in the performance counter DLLs in the operating system. The second supports collecting data through Windows Management Instrumentation (WMI). WMI is an object-oriented framework that allows you to instantiate (create instances of) performance objects that wrap the performance functionality in the operating system.
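To make the first method a little more concrete, here is a minimal sketch in C of reading the raw performance data through the registry interface. It queries the special HKEY_PERFORMANCE_DATA key for the "Global" value, which returns a PERF_DATA_BLOCK (defined in winperf.h) describing the base performance objects; real code would then walk the object and counter structures that follow the header, which is exactly the drudgery that WMI and the higher-level tools hide from you. Error handling is trimmed for brevity.

#include <windows.h>
#include <winperf.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD  bufSize = 64 * 1024;            /* starting buffer size     */
    BYTE  *data    = malloc(bufSize);
    DWORD  size;
    LONG   rc;

    /* The "Global" value asks the performance DLLs for all base objects. */
    for (;;)
    {
        size = bufSize;
        rc = RegQueryValueEx(HKEY_PERFORMANCE_DATA, TEXT("Global"),
                             NULL, NULL, data, &size);
        if (rc != ERROR_MORE_DATA)
            break;
        bufSize += 64 * 1024;              /* buffer too small: grow   */
        data = realloc(data, bufSize);
    }

    if (rc == ERROR_SUCCESS)
    {
        PERF_DATA_BLOCK *pdb = (PERF_DATA_BLOCK *)data;
        printf("Collected %lu performance object types in %lu bytes.\n",
               pdb->NumObjectTypes, pdb->TotalByteLength);
    }

    RegCloseKey(HKEY_PERFORMANCE_DATA);    /* releases the collection  */
    free(data);
    return 0;
}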
The OS installs a new technology for recovering data through WMI, using Managed Object Format (MOF) files. These MOF files correspond to, or are associated with, resources in a system. The objects that are the subject of performance monitoring are too numerous to list here, but they can be looked up in the Windows 2000 Performance Counters Reference, which is on the Windows 2000
Resource Kit CD (see Appendix B). However, they include the operating system's base services, such as the services that report on RAM, Paging File functionality, and Physical Disk usage, and the operating system's advanced services, such as Active Directory, Active Server Pages, the FTP service, DNS, WINS, and so on.
To understand the scope and usage of the objects, it helps to first understand some performance data and analysis terms. There are three essential concepts in performance monitoring: throughput, queues, and response time. Once you fully understand these terms, you can broaden your scope of analysis and perform calculations to report transfer rate, access time, latency, tolerance, thresholds, bottlenecks, and so on.
What Is Rate and Throughput?
Throughput is the amount of work done in a unit of time. If your child is able to assemble 100 Lego bricks per hour, you could say that his or her assemblage rate is 100 pieces per hour, assessed over a period of x hours, as long as the rate remains constant. However, if the rate of assemblage varies, through fatigue, hunger, thirst, and so on, we can still calculate the throughput as the total work done divided by the total time taken.

Throughput increases as the number of components assembled increases, or as the available time to complete the job is reduced. Throughput depends on resources; time and space are examples of resources. The slowest point in the system sets the throughput for the system as a whole. Throughput is the true indicator of performance.

Memory is a resource, the space in which to carry out instructions. It makes little sense to rate a system by millions of instructions per second when sufficient memory is not available to hold the instruction information.
What Is a Queue?
If you give your child too many Lego bricks to assemble, or reduce the available time in which he or she has to perform the assemblage, the number of pieces will begin to pile up. This happens too in software and IS terms, where the number of threads can begin to back up, one behind the other, in a queue. When a queue develops, we say that a bottleneck has occurred. Looking for bottlenecks in the system is key to monitoring for performance and to troubleshooting or problem detection. If there are no bottlenecks, the system might be considered healthy, but a bottleneck might soon start to develop.

Queues can also form if requests for resources are not evenly spread over the unit of time. If your child assembles one piece per minute, evenly, every minute, he or she will get through 60 pieces in an hour. But if the child does nothing for 45 minutes and then suddenly gets inspired, a bottleneck will occur in the final 15 minutes because there are more pieces than the child can process in the remaining time. On
computer systems, when queues and bottlenecks develop, systems become unresponsive. Additional requests for processor or disk resources are stalled. When requesting services are not satisfied, the system begins to break down. In this respect, we reference the response time of a system.
What Is Response Time?
Response time is the measure of how much time elapses between the firing of a computer event, such as a read request, and the system's response to the request. Response time will increase as the load increases, because the system is still responding to other events and does not have enough resources to handle new requests. A system that has insufficient memory and/or processing ability will process a huge database sort a lot more slowly than a better-endowed system with faster hard disks and CPUs. If response time is not satisfactory, you will either have to work with less data or increase the resources.

Response time is typically estimated by dividing the queue length by the resource throughput. Response time, queues, and throughput are reported and calculated by the Windows 2000 reporting tools.
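To tie the three concepts together, here is a small worked calculation in C. The figures are invented purely for illustration; the point is the relationship described above: throughput is work divided by time, and response time can be approximated as queue length divided by throughput.

#include <stdio.h>

int main(void)
{
    /* Invented sample figures for one disk over a 10-second interval. */
    double requests_completed = 1200.0;  /* work done in the interval  */
    double interval_seconds   = 10.0;    /* the unit of time           */
    double queue_length       = 18.0;    /* average requests waiting   */

    /* Throughput: amount of work done in a unit of time.              */
    double throughput = requests_completed / interval_seconds;

    /* Response time approximated as queue length / throughput.        */
    double response_time = queue_length / throughput;

    printf("Throughput:    %.0f requests per second\n", throughput);
    printf("Response time: %.3f seconds per request\n", response_time);
    return 0;
}

This prints a throughput of 120 requests per second and a response time of 0.150 seconds. If the queue length climbs while throughput stays flat, response time climbs with it; that is why a steadily growing queue counter is the classic signature of a bottleneck.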
How Performance Objects Work
Windows 2000 performance monitoring objects contain functionality known as performance counters. These so-called counters perform the actual analysis. For example, a hard disk object is able to calculate transfer rate, while a processor-associated object is able to calculate processor time.

To gain access to the data or to start the data collection, you first have to create the object and gain access to its functionality. This is done by calling a create function from a user interface or other process. As soon as the object is created and its data collection functionality invoked, it begins the data-collection process and stores the data in various properties. Data can be streamed out to disk files, RAM, or other components that assess the data and present it in some meaningful way.

Depending on the object, your analysis software can create at least one copy of the performance object and analyze the counter information it generates. You need to consult the Microsoft documentation that "exposes" the objects to determine whether an object can be created more than once concurrently. If it can, you will have to associate your application with the data the object collects by referencing the object's instance counter. Windows 2000 allows you to instantiate an object for a local computer's services, or you can create an object that collects data from a remote computer.
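As a concrete illustration of this create-and-collect cycle, the following C sketch uses the Performance Data Helper (PDH) library that ships with Windows 2000 (pdh.h, linked against pdh.lib), which wraps the counter plumbing described above. It creates a query, attaches the % Processor Time counter for the _Total instance of the Processor object, collects two samples (rate counters need two), and reads the formatted value. Treat it as an outline rather than production code; every call returns a status that a real application would check.

#include <windows.h>
#include <stdio.h>
#include <pdh.h>

int main(void)
{
    PDH_HQUERY           query;
    PDH_HCOUNTER         counter;
    PDH_FMT_COUNTERVALUE value;

    /* "Create" the query object that owns the data collection.       */
    PdhOpenQuery(NULL, 0, &query);

    /* Attach a counter: the _Total instance of the Processor object. */
    PdhAddCounter(query, TEXT("\\Processor(_Total)\\% Processor Time"),
                  0, &counter);

    /* Rate counters need two samples before a value can be computed. */
    PdhCollectQueryData(query);
    Sleep(1000);                        /* sample interval: 1 second  */
    PdhCollectQueryData(query);

    PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, NULL, &value);
    printf("Processor usage over the last second: %.1f%%\n",
           value.doubleValue);

    PdhCloseQuery(query);               /* releases query and counter */
    return 0;
}

Collecting the same counter from a remote machine is mostly a matter of the counter path: prefixing it with a computer name (for example, \\SERVER7\Processor(_Total)\% Processor Time, where SERVER7 is a placeholder) points the query at that computer, which is the remote-collection capability mentioned above.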
[...] increase network bandwidth, consider saving the remote data to log files on the remote servers and then either copy the data to the local computer or view it remotely. [...]

Summary

This chapter introduced Service Level and Service Level Management. More and more companies and business plans are demanding that MIS maintain SL standards. To ensure that MIS or IT and IS managers adhere to the performance requirements [...]

[...] monitor by the server role:

✦ Application Servers: These include standard application servers and Terminal Services (or application) servers. Terminal Services are more demanding and require constant performance monitoring. The heaviest resource usage on these servers is memory and CPU. Objects to monitor include Cache, Memory, Processors, and System. [...]

Monitoring performance requires resources, which can adversely affect the data you're trying to gather. Therefore, you need to decrease the impact of your performance monitoring activities. There are several techniques you can use to ensure that performance monitoring overhead is kept to a minimum on any server you are monitoring. [...]

[...] types of performance-related logs: counter logs and trace logs. These logs are useful for advanced performance analysis and record-keeping that can be done over a period of time. There is also an alerting mechanism. The Performance Logs and Alerts tree is shown in Figure 20-6. The tool is part of the Performance console snap-in and is thus started as described earlier.

Figure 20-6: The Performance Logs and Alerts [...]

[...] problems, and maintain server and service health. These tools will also allow you to plan capacity and provide feedback to management to ensure that IT continues to support the business models and marketing plans being adopted. We have discussed the Performance Console, System Monitor, Logs and Alerts, and Task Manager in very loose terms. Our definitions have also been very broad. The number of monitoring [...]

[...] resources and system services based on the performance objects described earlier. It works with counters in the same manner as System Monitor. The Performance Logs and Alerts service obtains data from the operating system when the update interval has elapsed. Trace logs collect event traces. With trace logs, you can measure performance associated with events related to memory, storage file I/O, and so on. [...]

[...] compute the data it receives and just reports it. On the other hand, average counting computes the data for you. For example, it is able to compute bits per second, or pages per second, and so on. Other counters are able to report percentages, differences, and so on.

System Monitoring Tools

Before you rush out and buy a software development environment to access the performance monitoring routines, you should [...]

[...] disks, and memory the heaviest. You can monitor the memory collection, Cache, Processor, System, PhysicalDisk, and LogicalDisk objects. Exchange also ships with specialized counters.

✦ Web/Internet Information Server: These servers consume extensive disk, cache, and network components. Consider monitoring the Cache, Network Segment, PhysicalDisk, and LogicalDisk objects.

Performance Monitoring Overhead

Monitoring [...]
[...] ready-to-go monitoring tools: the Performance Console and Task Manager. Task Manager provides an instant view of system activity such as memory usage, processor activity, process activity, and resource consumption. Task Manager is very helpful for immediate detection of system problems. The Performance Console is used to provide performance analysis and information that can be used for troubleshooting and [...]

[...] Right-clicking the pane and saving the display as an HTML file does this, and it is the default Save As format. Alternately, you can save the log file in comma-separated (.csv) or tab-separated (.tsv) format and then import the data into a spreadsheet, database, or report program such as Crystal Reports. [...]