A User Gets an “Insufficient Disk Space” Message When Adding Files to a Volume

The insufficient disk space message (see Figure 5.107) is to be expected for any of your users who are over their quota limit. Usually this is a good thing because it means that disk quotas are working. The only way around it is to increase your users’ quota limit or to stop denying disk space to users who exceed their quota. If this is happening unexpectedly, verify that your users’ limits are set correctly. A common error is to forget to change the quota measurement from KB to MB. You may think that your users have 150MB of available space when they only have 150KB of space.

Figure 5.107 Exceeding Your Quota Limit

Troubleshooting Remote Storage

Remember when you are troubleshooting Remote Storage that you are writing data to backup media. This is going to be slower than writing to disks. This is not to say that your performance should be terrible; just be realistic with your expectations. Here are some common Remote Storage troubleshooting issues:

■ Remote Storage will not install.
■ Remote Storage is not finding a valid media type.
■ Files can no longer be recalled from Remote Storage.

Remote Storage Will Not Install

Remote Storage is not installed by default. You must add it through Control Panel | Add or Remove Programs. You must have administrative rights on the machine on which you are installing Remote Storage. Without administrative rights, setup will not continue.

Remote Storage Is Not Finding a Valid Media Type

During initial setup, Remote Storage searches for an available media type. If Remote Storage is not finding one on your machine, you either have not waited long enough for Remote Storage to finish searching or you do not have a compatible library.

Files Can No Longer Be Recalled from Remote Storage

Remote Storage has a runaway recall limit that denies recalls of files from storage once they exceed a specified number of times in a row. It is possible that you have an application that is making too many recalls. Once this threshold is crossed, future recalls are denied. If the recalls are legitimate, you can increase the threshold for the runaway recall limit. If they are not valid, then you need to terminate the application making the requests.

Troubleshooting RAID

When troubleshooting RAID volumes, you must first troubleshoot the disk itself, so always start with the basic disk and dynamic disk checklist. However, there are times when the problem is with the RAID volume itself and not the underlying disk. This section covers the following:

■ Mirrored or RAID-5 volume’s status is Data Not Redundant.
■ Mirrored or RAID-5 volume’s status is Failed Redundancy.
■ Mirrored or RAID-5 volume’s status is Stale Data.

Mirrored or RAID-5 Volume’s Status is Data Not Redundant

A Data Not Redundant status indicates that your volume is not intact. This is due to moving disks from one machine to another without moving all the disks in the volume. Wait to import your disks until you have all the disks in the volume physically connected to the server. Then, when you import them, Windows will see them as a complete volume and retain their configuration.
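Once all the disks are attached, the disk group can be imported either from the Disk Management console or from the DiskPart command line. The following is a minimal sketch of the command-line route; the disk number is hypothetical, and the parenthetical notes are annotations rather than part of the commands:

    C:\> diskpart
    DISKPART> rescan              (re-enumerate disks after attaching them)
    DISKPART> list disk           (confirm that all of the moved disks are listed)
    DISKPART> select disk 2
    DISKPART> import              (imports the foreign disk group that contains the selected disk)
    DISKPART> exit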
Mirrored or RAID-5 Volume’s Status is Failed Redundancy

Failed Redundancy, as shown in Figures 5.108 and 5.109, occurs when one of the disks in a fault-tolerant volume fails. Your volume will continue to work, but it is no longer fault tolerant. If another disk fails, you will lose all your data on that volume. You should repair the failed disk as quickly as possible.

Your mirrored volume will need to be re-created after replacing the disk. Right-click the defective disk and select Remove Mirror. Then right-click the working disk and select Add Mirror, selecting the new disk as the mirror. To repair a RAID-5 volume, install the replacement disk, then right-click the volume and choose Repair RAID-5 Volume.

Figure 5.108 Recovering a Failed Mirrored Volume

Figure 5.109 Recovering a Failed RAID-5 Volume

Mirrored or RAID-5 Volume’s Status is Stale Data

Stale data occurs when a volume’s fault-tolerant information is not completely up to date. This happens in a mirrored volume if something has been written to the primary disk but, for whatever reason, it hasn’t made it to the mirror disk yet. This occurs in a RAID-5 volume when the parity information isn’t up to date. If you try to move a volume while it contains stale information, you will get a status of Stale Data when you try to import the disk. Move the disk back to the machine it was originally in and rescan the machine for new disks. After all the disks are discovered, wait until their status is Online and Healthy before you try to move them again.
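The RAID-5 repair and the rescan described above can also be performed with DiskPart (the mirror operations have command-line counterparts as well, break and add, whose exact parameters are best checked in DiskPart’s built-in help). A minimal sketch for the RAID-5 case follows; the volume and disk numbers are hypothetical, and the parenthetical notes are annotations rather than part of the commands:

    C:\> diskpart
    DISKPART> rescan                  (detect the replacement disk)
    DISKPART> list volume             (a volume with failed redundancy is flagged in the Status column)
    DISKPART> select volume 5         (the RAID-5 volume to repair)
    DISKPART> repair disk=4           (rebuild the failed member onto disk 4, the replacement)
    DISKPART> exit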
Chapter 6: Implementing Windows Cluster Services and Network Load Balancing

In this chapter:

■ Making Server Clustering Part of Your High-Availability Plan
■ Making Network Load Balancing Part of Your High-Availability Plan

Introduction

Fault tolerance generally involves redundancy; for example, in the case of disk fault tolerance, multiple disks are used. The ultimate in fault tolerance is the use of multiple servers, configured to take over for one another in case of failure or to share the processing load. Windows Server 2003 provides network administrators with two powerful tools to enhance fault tolerance and high availability: server clustering (only in the Enterprise and Datacenter Editions) and Network Load Balancing (included in all editions).

This chapter looks first at server clustering and shows you how to make clustering services part of your enterprise-level organization’s high-availability plan. We’ll start by introducing you to the terminology and concepts involved in understanding clustering. You’ll learn about cluster nodes, cluster groups, failover and failback, name resolution as it pertains to cluster services, and how server clustering works. We’ll discuss three cluster models: single-node, single quorum device, and majority node set. Then we’ll talk about cluster deployment options, including N-node failover pairs, hot standby server/N+1, failover ring, and random. You’ll learn about cluster administration, and we’ll show you how to use the Cluster Administrator tool as well as command-line tools.

Next, we’ll discuss best practices for deploying server clusters. You’ll learn about hardware issues, especially those related to network interface controllers, storage devices, power-saving features, and general compatibility issues. We’ll discuss cluster network configuration, and you’ll learn about multiple interconnections and node-to-node communications. We’ll talk about the importance of binding order, adapter settings, and TCP/IP settings. We’ll also discuss the default cluster group.

Next, we’ll move on to the subject of security for server clusters. This includes physical security, public/mixed networks, private networks, secure remote administration of cluster nodes, security issues involving the cluster service account, and how to limit client access. We’ll also talk about how to secure data in a cluster, how to secure disk resources, and how to secure cluster configuration log files.

The next section addresses how to make Network Load Balancing (NLB) part of your high-availability plan. We’ll introduce you to NLB concepts such as hosts/default host, load weight, traffic distribution, convergence, and heartbeats. You’ll learn about how NLB works and the relationship of NLB to clustering. We’ll show you how to manage NLB clusters using the NLB Manager tool, remote-management tools, and command-line tools. We’ll also discuss NLB error detection and handling. Next, we’ll move on to monitoring NLB using the NLB Monitor Microsoft Management Console (MMC) snap-in or the Windows Load Balancing Service (WLBS) cluster control utility. We discuss best practices for implementing and managing NLB, including issues such as multiple network adapters, protocols and IP addressing, and NLB Manager logging. Finally, we’ll address NLB security.

Making Server Clustering Part of Your High-Availability Plan

Certain circumstances require an application to be operational more of the time than standard hardware alone would allow. Databases and mail servers often have this need. Using server clustering, it is possible to have more than one server ready to run critical applications. Server clustering also provides the capability to automatically manage the operation of the application, so that if one server experiences a failure, another server automatically takes over and keeps the application running. Server clustering is a critical component in a high-availability plan. We’ll discuss high-availability strategies in the next chapter.

The basic idea of server clustering has been around for many years on other computing platforms. Microsoft initially released its server cluster technology as part of Windows NT 4.0 Enterprise Edition. It supported two nodes and a limited number of applications. Server clustering was further refined with the release of the Windows 2000 Advanced and Datacenter Server Editions. Server clusters were simpler to create, and more applications were available. In addition, some publishers began to make their applications “cluster-aware,” so that their applications installed and operated more easily on a server cluster. Now, with the release of Windows Server 2003, we see another level of improvement in the server clustering technology. Server clusters now support much larger clusters and more robust configurations. Server clusters are easier to create and manage. Features that were available only in the Datacenter Edition of Windows 2000 have now been made available in the Enterprise Edition of Windows Server 2003.

Terminology and Concepts

Although the term has been used previously, a more formal definition of a server cluster is needed. For our purposes, a server cluster is a group of independent servers that work together to increase application availability to client systems and appear to clients under one common name. The independent servers that make up a server cluster are individually called nodes.
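As a quick point of reference, you can list the nodes of an existing server cluster and their current states from the command line with the cluster.exe tool. This is a minimal sketch; the cluster name CLUSTER1 is hypothetical:

    C:\> cluster /cluster:CLUSTER1 node /status
    C:\> cluster /cluster:CLUSTER1 group /status

The first command reports each node and its state (Up, Down, or Paused); the second lists each cluster group and the node that currently owns it.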
Nodes in a server cluster monitor each other’s status through a communication mechanism called a heartbeat. The heartbeat is a series of messages that allow the server cluster nodes to detect communication failures and, if necessary, perform a failover operation. A failover is the process by which resources are stopped on one node and started on another.

Cluster Nodes

A server cluster node is an independent server. This server must be running Windows 2000 Advanced Server, Windows 2000 Datacenter Server, Windows Server 2003 Enterprise Edition, or Windows Server 2003 Datacenter Edition. The two editions of Windows Server 2003 cannot be used in the same server cluster, but either can exist in a server cluster with a Windows 2000 Advanced Server node. Since Windows Server 2003 Datacenter Edition is available only through original equipment manufacturers (OEMs), this chapter deals with server clusters constructed with the Enterprise Edition of Windows Server 2003 unless specifically stated otherwise.

A server cluster node should be a robust system. When designing your server cluster, do not overlook applying fault-tolerant concepts to the individual nodes. Using individual fault-tolerant components to build fault-tolerant nodes to build fault-tolerant server clusters can be described as “fault tolerance in depth.” This approach will increase overall reliability and make your life easier.

A server cluster consists of anywhere between one and eight nodes. These nodes do not necessarily need to have identical configurations, although that is a frequent design element. Each node in a server cluster can be configured to have a primary role that is different from the other nodes in the server cluster. This allows you to have overall better utilization of the server cluster if each node is actively providing services. A node is connected to one or more storage devices, which contain disks that house information about the server cluster. Each node also contains one or more separate network interfaces that provide client communications and support heartbeat communications.

Cluster Groups

The smallest unit of service that a server cluster can provide is a resource. A resource is a physical or logical component that can be managed on an individual basis and can be independently activated or deactivated (called bringing the resource online or offline). A resource can be owned by only one node at a time.

There are several predefined (called “standard”) types of resources known to Windows Server 2003. Each type is used for a specific purpose. The following are some of the most common standard resource types (a command-line sketch of creating resources of these types follows the list):

■ Physical Disk Represents and manages disks present on a shared cluster storage device. Can be partitioned like a regular disk. Can be assigned a drive letter or used as an NTFS mounted drive.
■ IP Address Manages an IP address.
■ Network Name Manages a unique NetBIOS name on the network, separate from the NetBIOS name of the node on which the resource is running.
■ Generic Service Manages a Windows operating system service as a cluster resource. Helps ensure that the service operates in one place at one time.
■ Generic Script Manages a script as a cluster resource (new to Windows Server 2003).
■ File Share Creates and manages a Windows file share as a cluster resource.
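Resources of these standard types can be created in Cluster Administrator or with the cluster.exe command-line tool. The following sketch creates and brings online an IP Address resource in an existing group; the resource name, group name, and address values are hypothetical, and the private property names should be verified against cluster.exe’s built-in help:

    C:\> cluster res "SQL IP Address" /create /group:"SQL Group" /type:"IP Address"
    C:\> cluster res "SQL IP Address" /priv Address=192.168.10.50 SubnetMask=255.255.255.0 Network="Public"
    C:\> cluster res "SQL IP Address" /online

The same pattern applies to the other resource types; only the /type value and the private properties change.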
Other standard resource types allow you to manage clustered print servers, Dynamic Host Configuration Protocol (DHCP) servers, Windows Internet Name Service (WINS) servers, and generic noncluster-aware applications. (It is also possible to create new resource types through the use of dynamic link library files.)

Individual resources are combined to form cluster groups. A cluster group is a collection of server resources that defines the relationships of the resources within the group to each other and defines the unit of failover, so that if one resource moves between nodes, all resources in the group also move. As with individual resources, a cluster group can be owned by only one node at a time. To use an analogy from chemistry, resources are atoms and groups are compounds. The cluster group is the primary unit of administration in a server cluster. Similar or interdependent resources are combined into the same group. A resource cannot be dependent on another resource that is not in the same cluster group. Most cluster groups are designed around either an application or a storage unit. It is in this way that individual applications or disks in a server cluster are controlled independently of other applications or disks.

Failover and Failback

If a resource on a node fails, the cluster service will first attempt to reactivate the resource on the same node. If unable to do so, the cluster service will move the cluster group to another node in the server cluster. This process is called a failover. A failover can be triggered manually by the administrator or automatically by a node failure. A failover can involve multiple nodes, if the server cluster is configured this way, and each group can have different failover policies defined.

A failback is the corollary of a failover. When the original node that hosted the failed-over resource(s) comes back online, the cluster service can return the cluster group to operation on the original node. This failback policy can be defined individually for a cluster group or disabled entirely. Failback is usually performed at times of low utilization to avoid impacting clients, and it can be set to follow specific schedules.

Cluster Services and Name Resolution

A server cluster appears to clients as one common name, regardless of the number of nodes in the server cluster. It is for this reason that the server cluster name must be unique on your network. Ensure that the server cluster name is different from the names of other server clusters, domain names, servers, and workstations on your network. The server cluster will register its name with the WINS and DNS servers configured on the node running the default cluster group.

Individual applications that run on a server cluster can (and should) be configured to run in separate cluster groups. The applications must also have unique names on the network and will also automatically register with WINS and DNS. Do not use static WINS entries for your resources. Doing so will prevent an update to the WINS-registered address in the event of a failover.
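A manual failover of the kind described earlier can be triggered from the command line as well as from Cluster Administrator. The following is a hedged sketch using cluster.exe; the group and node names are hypothetical:

    C:\> cluster group "SQL Group" /moveto:NODE2
    C:\> cluster group "SQL Group" /status

The first command moves the group, and every resource in it, to NODE2; the second confirms which node now owns the group.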
How Clustering Works

Each node in a server cluster is connected to one or more storage devices. These storage devices contain one or more disks. If the server cluster contains two nodes, you can use either a SCSI interface to the storage devices or a Fibre Channel interface. For server clusters of three or more nodes, Fibre Channel is recommended. If you are using a 64-bit edition of Windows Server 2003, Fibre Channel is the required interface, regardless of the number of nodes.

Fibre Channel has many benefits over SCSI. Fibre Channel is faster and expands easily beyond two nodes; its cabling is simpler, and it configures itself automatically. However, Fibre Channel is also more expensive than SCSI, requires more components, and can be more complicated to design and manage.

On any server cluster, there is something called the quorum resource. The quorum resource is used to determine the state of the server cluster. The node that controls the quorum resource controls the server cluster, and only one node at a time can own the quorum resource. This prevents a situation called split-brain, which occurs when more than one node believes it controls the server cluster and behaves accordingly. Split-brain was a problem that occurred in the early development of server cluster technologies. The introduction of the quorum resource solved this problem.

Cluster Models

There are three basic server cluster design models available to choose from: single node, single quorum device, and majority node set. Each is designed to fit a specific set of circumstances. Before you begin designing your server cluster, make sure you have a thorough understanding of these models.

Single Node

A single-node server cluster model is primarily used for development and testing purposes. As its name implies, it consists of one node. An external disk resource may or may not be present. If an external disk resource is not present, the local disk is configured as the cluster storage device, and the server cluster configuration is kept there.

Failover is not possible with this server cluster model, because there is only one node. However, as with any server cluster model, it is possible to create multiple virtual servers. (A virtual server is a cluster group that contains its own dedicated IP address, network name, and services and is indistinguishable from other servers from a client’s perspective.) Figure 6.1 illustrates the structure of a single-node server cluster.

Figure 6.1 Single Node Server Cluster

If a resource fails, the cluster service will attempt to automatically restart any applications and dependent resources. This can be useful when applied to applications that do not have built-in restart capabilities but would benefit from that capability. Some applications that are designed for use on server clusters will not work on a single-node cluster model; Microsoft SQL Server and Microsoft Exchange Server are two examples. Applications like these require the use of one of the other two server cluster models.

Single Quorum Device

The single quorum device server cluster model is the most common and will likely continue to be the most heavily used. It has been around since Microsoft first introduced its server clustering technology.
This type of server cluster contains two or more nodes, and each node is connected to the cluster storage devices. There is a single quorum device (a physical disk) that resides on the cluster storage device. There is a single copy of the cluster configuration and operational state, which is stored on the quorum resource. Each node in the server cluster can be configured to run different applications or to act simply as a hot-standby device waiting for a failover to occur. Figure 6.2 illustrates the structure of a single quorum device server cluster with two nodes.

Figure 6.2 Single Quorum Device Server Cluster

Majority Node Set

The majority node set (MNS) model is new in Windows Server 2003. Each node in the server cluster may or may not be connected to a shared cluster storage device. Each node maintains its own copy of the server cluster configuration data, and the cluster service is responsible for ensuring that this configuration data remains consistent across all nodes. Synchronization of quorum data occurs over Server Message Block (SMB) file shares. This communication is unencrypted. Figure 6.3 illustrates the structure of the MNS model.

Figure 6.3 A Majority Node Set Server Cluster

This model is normally used as part of an OEM pre-designed or pre-configured solution. It has the ability to support geographically distributed server clusters. When used in geographically dispersed configurations, network latency becomes an issue. You must ensure that the round-trip network latency is a maximum of 500 milliseconds (ms), or you will experience availability problems.

The behavior of an MNS server cluster differs from that of a single quorum device server cluster. In a single quorum device server cluster, one node can fail and the server cluster can still function. This is not necessarily the case in an MNS cluster. To avoid split-brain, a majority of the nodes must be active and available for the server cluster to function. In essence, this means that 50 percent plus 1 of the nodes must be operational at all times for the server cluster to remain operational. Table 6.1 illustrates this relationship.

Table 6.1 Majority Node Set Server Cluster Failure Tolerance

    Number of Nodes in      Maximum Node Failures before    Nodes Required to Continue
    MNS Server Cluster      Complete Cluster Failure        Cluster Operations
    1                       0                               1
    2                       0                               2
    3                       1                               2
    4                       1                               3
    5                       2                               3
    6                       2                               4
    7                       3                               4
    8                       3                               5
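The figures in Table 6.1 follow directly from the “50 percent plus 1” rule. As a quick illustration, the arithmetic can be expressed as a small batch script; the script name and variable names are purely illustrative:

    @echo off
    rem Sketch: quorum arithmetic for a majority node set cluster.
    rem Usage: mns-quorum.cmd <number-of-nodes>   (script name is hypothetical)
    set /a NODES=%1
    set /a REQUIRED=NODES/2+1
    set /a TOLERATED=NODES-REQUIRED
    echo Nodes in MNS server cluster    : %NODES%
    echo Nodes required to keep running : %REQUIRED%
    echo Node failures tolerated        : %TOLERATED%

Running the script with a node count of 5, for example, reports that 3 nodes are required and 2 failures can be tolerated, which matches the table.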