Windows Azure IaaS enables virtual machines to run in the Windows Azure cloud service, which can include a domain-joined file server offering a file share, making it seem a plausible option for hosting the witness for a cluster.
Technically, the answer is yes: the file share for a cluster could be hosted in a Windows Azure IaaS VM, and the Windows Azure virtual network can be connected to your on-premises infrastructure using its site-to-site gateway functionality. In most cases, however, it would not be practical, because most likely the desire to use Windows Azure is that you have two datacenters hosting nodes and wish to use Windows Azure as the "third site." The problem is, at the time of this writing, a Windows Azure virtual network supports only a single instance of the site-to-site gateway, which means it could be connected to only one of the datacenters. If the datacenter that the virtual network was connected to failed, the other datacenter would have no access to Windows Azure and therefore would not be able to see the file share witness, use its vote, and make quorum, making it fairly useless. Once Windows Azure supports multiple site-to-site gateways, using it for the file share witness will become a more practical solution.
The other option is a manual failover, where services are manually activated on the disaster recovery site. In this scenario, it is common to remove votes from the disaster recovery site so it does not affect quorum at the primary location. In the event of a failover to the disaster recovery location, the disaster recovery site would be started in Force Quorum mode.
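As a rough sketch of what forcing quorum looks like in practice, the following PowerShell uses the FailoverClusters module; the node name is hypothetical, and the exact recovery steps should follow your own disaster recovery runbook:

# Minimal sketch, assuming the FailoverClusters module is available on the
# disaster recovery node and the primary site (holding the majority of votes)
# is unreachable.
Import-Module FailoverClusters

# Start the cluster service on the DR node and force quorum, making this
# node's view of the cluster authoritative.
Start-ClusterNode -Name "DRNODE01" -FixQuorum

# Equivalent legacy approach using the service control manager:
# net start clussvc /forcequorum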
In reality, it is not that common to see stretched clusters for Hyper-V virtual machines because of the difficulty and high expense of replicating the storage. Additionally, if virtual machines moved between locations, most likely their IP configuration would require reconfiguration unless network virtualization was being used or VLANs were stretched between locations, which again is rare and can be very expensive. In the next chapter, I will cover Hyper-V Replica as a solution for disaster recovery, which solves the problems of moving virtual machines between sites. Multisite clusters are commonly used for application workloads such as SQL and Exchange instead of for Hyper-V virtual machines.
Why Use Clustering with Hyper-V?
In the previous sections I went into a lot of detail about quorum and how clusters work. The key point is this: clusters help keep the workloads available with a minimal amount of downtime, even in unplanned outages. For Hyper-V servers that are running many virtual machines, keeping the virtual machines as available as possible is critical.
When looking at high availability, there are two types of outage: planned and unplanned. A planned outage is a known and controlled outage, such as rebooting a host to apply patches, performing hardware maintenance, or even powering down a complete datacenter. In a planned outage scenario, it is possible to avoid any downtime to the virtual machines by performing a Live Migration of the virtual machines on one node to another node. When Live Migration is used, the virtual machine is always available to clients.
An unplanned outage is not foreseen or planned, such as a server crash or hardware failure. In an unplanned outage, there is no opportunity to perform Live Migration of virtual machines between nodes, which means there will be a period of unavailability for the virtual machines. In an unplanned outage scenario, the cluster will detect that a node has failed, and the resources that were running on the failed node will be redistributed among the remaining nodes in the cluster and then started. Because the virtual machines were effectively just powered off without a clean shutdown of the guest OS inside them, the guest OS will start in what is known as a "crash consistent state," which means that when the guest OS and its applications start, some consistency checking and repair actions may be required.
In Windows Server 2008 R2, the Live Migration feature for moving virtual machines with no downtime between servers was available only between nodes in a cluster because the storage had to be available to both the source and target node. In Windows Server 2012, the ability to live migrate between any two Hyper-V 2012 hosts was introduced. It's known as Shared Nothing Live Migration, and it migrates the storage in addition to the memory and state of the virtual machine.
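As an illustration of what a Shared Nothing Live Migration looks like from PowerShell, here is a minimal sketch; the host, virtual machine, and path names are hypothetical:

# Assumes the Hyper-V module is available and both hosts are Windows Server
# 2012 (or later) with live migration enabled and configured.
# Moves the VM's memory, device state, and storage to a standalone host.
Move-VM -Name "SQLVM01" -DestinationHost "HYPERV02" `
    -IncludeStorage -DestinationStoragePath "D:\VMs\SQLVM01"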
One traditional feature of clustering was the ability to smoothly move storage between nodes in a cluster. It was enhanced greatly with Windows Server 2008 R2 to actually allow storage to be shared between the nodes in a cluster simultaneously; the feature is known as Cluster Shared Volumes (CSV). With CSV, an NTFS volume can be accessed by all the nodes at the same time, allowing virtual machines to be stored on a single NTFS-formatted LUN and run on different nodes in the cluster. The sharing of storage is a huge feature of clusters and makes the migration of virtual machines between nodes a much more efficient process because only the memory and state of the virtual machine need to be migrated and not the actual storage. Of course, in Windows Server 2012, nodes not in a cluster can share storage by accessing a common SMB 3 file share, but many environments do not have the infrastructure to utilize SMB 3 at a datacenter level or already have large SAN investments.
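For reference, converting a clustered disk into a CSV is a one-line operation with the FailoverClusters module; the disk resource name below is hypothetical:

# Minimal sketch, assuming the disk has already been added as a cluster resource.
Import-Module FailoverClusters

# Convert the disk into a Cluster Shared Volume; it then appears on every node
# under C:\ClusterStorage\.
Add-ClusterSharedVolume -Name "Cluster Disk 2"

# Verify the CSV and see which node currently coordinates it.
Get-ClusterSharedVolume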
As can be seen, some of the features of clustering for Hyper-V are now available outside of a cluster at some level, but not with the same level of efficiency and typically only in planned scenarios. Additionally, a cluster provides a boundary of host membership, which can be used for other purposes, such as virtual machine rebalancing, placement optimization, and even automation processes such as cluster patching. I will be covering migration, CSV, and the other technologies briefly mentioned in detail later in this chapter.
Clustering brings high availability solutions to unplanned scenarios, but it also brings some other features to virtual machine workloads. It is because of some of these features that you will occasionally see a single-node cluster hosting virtual machines. Hyper-V has a number of great availability features, but they are no substitute for clustering to maintain availability during unplanned outages and to simplify maintenance options, so don't overlook clustering.
Service Monitoring
Failover clustering provides high availability to the virtual machine in the event of a host failure, but it does not provide protection or assistance if a service within the virtual machine fails. Clustering is strictly making sure the virtual machine is running; it offers no assistance to the operating system running within the virtual machine.
Windows Server 2012 clustering changed this by introducing a new clustering feature, service monitoring, which allows clustering to communicate with the guest OS running within the virtual machine and check for service failures. If you examine the properties of a service within Windows, there are actions available if the service fails, as shown in Figure 7.8. Note that in the Recovery tab, Windows allows actions to be taken on the first failure, the second failure, and then subsequent failures. These actions are as follows:
◆ Take No Action
◆ Restart The Service
◆ Run A Program
◆ Restart The Computer
Figure 7.8 Service retry actions
Consider a service that fails three times consecutively; it's unlikely that restarting it a third time would result in a different outcome. Clustering can instead be configured to perform the action that is known to fix any problem: reboot the virtual machine on the existing host. If the virtual machine is rebooted by clustering and the service fails a subsequent time inside the virtual machine, then clustering will move the virtual machine to another host in the cluster and reboot it.
For this feature to work, the following must be configured:
◆ Both the Hyper-V servers must be Windows Server 2012 and the guest OS running in the VM must be Windows Server 2012.
◆ The host and guest OSs are in the same or at least trusting domains.
◆ The failover cluster administrator must be a member of the local Administrators group inside the VM.
◆ Ensure that the service being monitored is set to Take No Action (see Figure 7.8) within the guest VM for subsequent failures (which is used after the first and second failures); this is set via the Recovery tab of the service properties within the Services application (services.msc).
◆ Within the guest VM, ensure that the Virtual Machine Monitoring firewall exception is enabled for the Domain network by using the Windows Firewall with Advanced Security application or by using the following Windows PowerShell command:
Set-NetFirewallRule -DisplayGroup "Virtual Machine Monitoring" -Enabled True

After everything in the preceding list is configured, enabling the monitoring is a simple process:
1. Launch the Failover Cluster Manager tool.
2. Navigate to the cluster and select Roles.
3. Right-click the virtual machine role you wish to enable monitoring for, and under More Actions, select Configure Monitoring.
4. The services running inside the VM will be gathered by the cluster service communicating with the guest OS inside the virtual machine. Check the box for the services that should be monitored, as shown in Figure 7.9, and click OK.
Figure 7.9 Enabling monitoring of a service
Monitoring can also be enabled using the Add-ClusterVMMonitoredItem cmdlet with the -VirtualMachine and -Service parameters, as in this example:
PS C:\> Add-ClusterVMMonitoredItem -VirtualMachine savdaltst01 -Service spooler

After two service failures, an event ID 1250 is logged in the system log. At this point, the VM will be restarted, initially on the same host; on subsequent failures it will be restarted on another node in the cluster. This process can be seen in a video at http://youtu.be/H1EghdniZ1I.
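To review or undo the configuration later, the companion cmdlets in the FailoverClusters module can be used; a brief sketch, reusing the virtual machine name from the previous example:

# List the services currently monitored inside the virtual machine.
Get-ClusterVMMonitoredItem -VirtualMachine savdaltst01

# Stop monitoring the Print Spooler service for that virtual machine.
Remove-ClusterVMMonitoredItem -VirtualMachine savdaltst01 -Service spooler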
This is a very rudimentary capability, but it may help in some scenarios. As mentioned in the previous chapter, for a complete monitoring solution, leverage System Center Operations Manager, which can run monitoring with deep OS and application knowledge that can be used to generate alerts. Those alerts can be used to trigger automated actions for remediation or simply to generate incidents in a ticketing system.
Protected Network
While the operating system and applications within virtual machines perform certain tasks, those tasks are generally useful only when they can communicate with clients and services via the network.
If the network that a virtual machine uses becomes unavailable on the Hyper-V host, traditionally clustering would take no action, which has been a huge weakness. As far as clustering is aware, the virtual machine is still fine; it's running with no problems. Windows Server 2012 R2 introduces the concept of a protected network to close this final gap in the high availability of virtual machines and their connectivity.
The Protected Network setting allows specific virtual network adapters to be configured as protected, as shown in Figure 7.10, via the Settings option of a virtual machine and the Advanced Features options of the specific network adapter. If the Hyper-V host loses network connectivity for a network used by virtual machine network adapters configured as protected, the virtual machines will be live migrated to another host in the cluster that does have connectivity for that network. This does require that the Hyper-V hosts still have network connectivity between them to allow Live Migration, but typically clusters use different networks for virtual machine connectivity than for Live Migration, which means Live Migration should still be possible.
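The same setting can be scripted. The following sketch assumes the Hyper-V module's Set-VMNetworkAdapter cmdlet exposes the -NotMonitoredInCluster parameter for this setting (the VM name is hypothetical); setting it to $false leaves the Protected Network box checked, while $true excludes the adapter from monitoring:

# Minimal sketch; mark all network adapters of the VM as protected.
Get-VMNetworkAdapter -VMName "WEBVM01" |
    Set-VMNetworkAdapter -NotMonitoredInCluster $false

# Review the cluster monitoring setting on the adapters.
Get-VMNetworkAdapter -VMName "WEBVM01" | Select-Object Name, *Monitor*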
It is important to try to provide as much resiliency as possible for network communications, which means using NIC teaming on the hosts as described in Chapter 3, "Virtual Networking," but the Protected Network feature provides an additional layer of resiliency against network failures.
Cluster-Aware Updating
Windows Server 2012 placed a huge focus on running the Server Core configuration level, which reduced the amount of patching and therefore reboots required for a system. There will still be patches that need to be installed and therefore reboots, but the key point is to reduce (or ideally, eliminate) any impact to the virtual machines when hosts have to be rebooted.
In a typical cluster, any impact to virtual machines is removed by Live Migrating virtual machines off of a node, patching and rebooting that node, moving the virtual machines back, and repeating for the other nodes in the cluster. This sounds simple, but for a 64-node cluster, this is a lot of work.
SCVMM 2012 introduced the ability to automate the entire cluster patching process with a single click, and this capability was made a core part of failover clustering in Windows Server 2012. It's called Cluster-Aware Updating. With Cluster-Aware Updating, updates are obtained from Microsoft Update or an on-premises Windows Server Update Services (WSUS) implementation, and the entire cluster is patched with no impact to the availability of virtual machines.
I walk through the entire Cluster-Aware Updating configuration and usage at the following location:
http://windowsitpro.com/windows-server-2012/cluster-aware-updating-windows-server-2012
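As a rough sketch of what an on-demand updating run looks like from PowerShell, the following uses the ClusterAwareUpdating module; the cluster name is hypothetical, and the thresholds should reflect your own patching policy:

# Assumes the Failover Clustering tools (including the ClusterAwareUpdating
# module) are installed on the machine performing the run.
Import-Module ClusterAwareUpdating

# One-off updating run: each node is drained (VMs live migrated off), patched
# via Windows Update, rebooted if required, and resumed in turn.
Invoke-CauRun -ClusterName "HVCLUSTER1" `
    -CauPluginName "Microsoft.WindowsUpdatePlugin" `
    -MaxFailedNodes 1 -MaxRetriesPerNode 2 -RequireAllNodesOnline -Force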
Figure 7.10 Configuring a protected network on a virtual machine network adapter
Where to Implement High Availability
With the great features available with Hyper-V clustering, it can be easy to think that clustering the Hyper-V hosts and therefore providing high availability for all the virtual machines is the only solution you need. Clustering the Hyper-V hosts definitely provides great mobility, storage sharing, and high availability services for virtual machines, but that doesn't mean it's always the best solution.
Consider an application such as SQL Server or Exchange. If clustering is performed only at the Hyper-V host level, then when the Hyper-V host fails, the virtual machine resource is moved to another host and started in a crash consistent state, which means the service would be unavailable for a period of time and likely some amount of consistency checking and repair would be required. Additionally, host-level clustering will not protect from a crash within the virtual machine where the actual service is no longer running but the guest OS is still functioning, because from the host's perspective no action is needed. If instead guest clustering is leveraged, which means a cluster is created within the guest operating systems running in the virtual machines, the full cluster-aware application capabilities will be available, such as detecting that the application service is not responding on one guest OS and allowing another instance of the application to take over. Guest clustering is fully supported in Hyper-V virtual machines, and as covered in Chapter 4, "Storage Configurations," there are numerous options to provide shared storage to guest clusters, such as iSCSI, Virtual Fibre Channel, and shared VHDX.
The guidance I give is as follows:
◆ If the application running inside the virtual machine is cluster aware, then create multiple virtual machines, each with the application installed, and create a guest cluster between them. This will likely mean enabling some kind of shared storage for those virtual machines.
◆ If the application is not cluster aware but works with technologies such as Network Load Balancing (NLB), for example IIS, then deploy multiple virtual machines, each running the service, and then use NLB to load balance between the instances.
◆ If the application running inside the virtual machine is not cluster aware or NLB supported but multiple instances of the application are supported and the application has its own methods of distributing load and HA (for example, Active Directory Domain Services), then deploy multiple instances over multiple virtual machines.
◆ Finally, if there is no application-native high availability option, rely on the Hyper-V cluster, which is better than nothing.
It is important to check whether applications support not only running inside a virtual machine (nearly all applications do today) but also running on a Hyper-V cluster, and extending that, whether they support being live migrated between hosts. Some applications initially did not support being live migrated for technical reasons, or they were licensed by physical processors, which meant it was expensive if you wanted to move the virtual machine between hosts because all processors on all possible hosts would have to be licensed. Most applications have now moved beyond restrictions of physical processor instance licensing, but still check!
There is another configuration you should perform on your Hyper-V cluster for virtual machines that contain multiple instances of an application (for example, multiple SQL Server VMs, multiple IIS VMs, multiple domain controllers, and so on). The goal of using multiple instances of applications is to provide protection from the VM failing or the host that is running the virtual machines failing. Having multiple instances of an application across multiple virtual machines is not useful if all the virtual machines are running on the same host. Fortunately, failover clustering has an anti-affinity capability, which ensures where possible that virtual machines in the same anti-affinity group are not placed on the same Hyper-V host. To set the anti-affinity group for a virtual machine, use cluster.exe or PowerShell:
◆ (Get-ClusterGroup "<VM>").AntiAffinityClassNames = "<AntiAffinityGroupName>"
◆ cluster.exe group "<VM>" /prop AntiAffinityClassNames="<AntiAffinityGroupName>"
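For example, here is a brief sketch that tags several hypothetical SQL Server VM roles with the same anti-affinity group and then verifies the result:

# Minimal sketch; the VM role names and group name are hypothetical.
Import-Module FailoverClusters

foreach ($vm in "SQLVM01", "SQLVM02", "SQLVM03") {
    (Get-ClusterGroup $vm).AntiAffinityClassNames = "SQLServers"
}

# Verify which anti-affinity group each role belongs to.
Get-ClusterGroup | Select-Object Name, AntiAffinityClassNames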
The cluster anti-affinity can also be set graphically by using SCVMM, as shown in Figure 7.11. SCVMM uses availability set as the nomenclature instead of anti-affinity group. Open the properties of the virtual machine in SCVMM, navigate to the Hardware Configuration tab, and select Availability under the Advanced section. Use the Manage Availability Sets button to create new sets and then add them to the virtual machine. A single virtual machine can be a member of multiple availability sets.
Figure 7.11 Setting affinity using SCVMM
By default, this anti-affinity solution is a soft enforcement, which means clustering will do its best to keep virtual machines in the same anti-affinity group on separate hosts, but if it has no choice, it will place instances on the same host. This enforcement can be set to hard by setting the cluster ClusterEnforcedAntiAffinity attribute to 1, but this may mean some virtual machines cannot be started.
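A brief sketch of switching the enforcement from soft (the default, 0) to hard (1) from PowerShell:

# With hard enforcement, a VM that cannot be placed on a separate host may
# simply fail to start, so use this with care.
(Get-Cluster).ClusterEnforcedAntiAffinity = 1

# Check the current value.
(Get-Cluster).ClusterEnforcedAntiAffinity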
For virtual machines that are clustered, it is possible to set the preferred owners for each virtual machine and set the order of their preference. However, it’s important to realize that just because a host is not set as a preferred owner for a resource (virtual machine), that doesn’t mean the host can’t still run that resource if none of the preferred owners are available. To set the preferred owners, right-click on a VM resource and select Properties, and in the General tab, set the preferred owners and the order as required.
If you want to ensure that a resource never runs on specific hosts, you can set the possible owners; when a resource is restricted to possible owners, it cannot run on hosts that are not possible owners. This should be used with care because if none of the configured possible owners are available, the resource cannot start, which may be worse than it running on a nonoptimal host. To set the possible owners, you need to modify the virtual machine resource, which is in the bottom pane of Failover Cluster Manager. Right-click the virtual machine resource and select Properties. Under the Advanced Policies tab, the possible owners are shown. If you unselect servers, then that specific virtual machine cannot run on the unselected servers.
The same PowerShell cmdlet, Set-ClusterOwnerNode, is used to set both the preferred and possible owners. If the cmdlet is used against a cluster group (the virtual machine role), it sets the preferred owners. If it is used against a cluster resource, it sets the possible owners.
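A short sketch of both usages follows; the group, resource, and node names are hypothetical, and the resource name format assumes the default "Virtual Machine <name>" naming:

# Preferred owners: set on the VM's cluster group (role), in order of preference.
Set-ClusterOwnerNode -Group "SQLVM01" -Owners HV01, HV02

# Possible owners: set on the virtual machine resource itself; the VM can never
# run on nodes omitted from this list.
Set-ClusterOwnerNode -Resource "Virtual Machine SQLVM01" -Owners HV01, HV02, HV03

# Review the current configuration.
Get-ClusterOwnerNode -Group "SQLVM01"
Get-ClusterOwnerNode -Resource "Virtual Machine SQLVM01"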