Avoid Network Split Brains on Virtual Servers with Storage Foundation HA 5.1 SP1's New Non-SCSI-3 Fencing Feature

Configuring high availability for virtual machines (VMs) in VMware environments presents new challenges that are not always easy to spot when applications are virtualized. One prime example is handling the failover of applications on VMs that are part of a cluster without causing a network split brain. It is this exact issue that Service Pack 1 (SP1) for Storage Foundation HA 5.1 addresses with its new non-SCSI-3 fencing feature.

As enterprise organizations virtualize applications that are part of a cluster, they must account for how VMware virtualizes and controls physical resources assigned to it. Once virtualized, VMs no longer have direct access or visibility into the properties of resources attached to the VMware host.

This is problematic for clustering software because it can no longer see the SCSI-3 persistent group reservation (PGR) bit on the LUN associated with the application on the VM; VMware hides it. Without access to this LUN, the clustering software cannot rule out a network split brain.

Servers (physical or virtual) that are included in a cluster use a heartbeat to communicate the status of their availability. When a server in the cluster fails, its heartbeat stops, which indicates to the other servers in the cluster that one of them should take over processing for that application. This configuration works only as long as a stopped heartbeat actually means that the server has failed.
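To make the heartbeat idea concrete, here is a minimal Python sketch of timeout-based failure detection. It is an illustration only, not how VCS implements heartbeats; the `HeartbeatMonitor` class, the node names and the 5-second timeout are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat time seen from each peer node (hypothetical)."""
    timeout: float  # seconds of silence before a peer is presumed failed
    last_seen: dict = field(default_factory=dict)

    def record(self, node, now):
        # Remember when this node was last heard from.
        self.last_seen[node] = now

    def presumed_failed(self, now):
        # A peer is presumed failed once it has been silent longer than the timeout.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


monitor = HeartbeatMonitor(timeout=5.0)
monitor.record("node-a", now=0.0)
monitor.record("node-b", now=0.0)
monitor.record("node-a", now=4.0)  # node-a keeps sending heartbeats
failed = monitor.presumed_failed(now=6.0)
print(failed)  # node-b has been silent for 6 seconds
```

Note that the monitor only knows that node-b went quiet. It has no way to tell a crashed server from a server that is alive but unreachable.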

A network split brain situation occurs when a server heartbeat stops for reasons other than a server failure. The cause could be something as simple as a network disruption from a failing network card or switch.

In this circumstance, the network disruption splits the cluster into two sub clusters, with the nodes in each sub cluster thinking that the nodes in the other sub cluster are dead. However, all the nodes in both sub clusters are alive. This is the network split brain: the application can run at the same time on the production and failover servers, which can lead to data corruption.
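The split can be pictured as a graph problem: nodes that can still reach each other over working heartbeat links form one sub cluster. The Python sketch below groups nodes that way. The function name, VM names and link lists are hypothetical and exist only to illustrate the scenario.

```python
def sub_clusters(nodes, links):
    """Group nodes into sub clusters joined by working heartbeat links."""
    neighbours = {n: set() for n in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, groups = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Walk every node reachable from this starting node.
        group, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in group:
                continue
            group.add(n)
            stack.extend(neighbours[n] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


nodes = ["vm1", "vm2", "vm3", "vm4"]
healthy_links = [("vm1", "vm2"), ("vm2", "vm3"), ("vm3", "vm4")]
links_after_switch_failure = [("vm1", "vm2"), ("vm3", "vm4")]

before = sub_clusters(nodes, healthy_links)
after = sub_clusters(nodes, links_after_switch_failure)
print(before)
print(after)
```

After the simulated switch failure, every VM is still running, but each pair sees only itself and would conclude that the other pair is dead.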

To avoid this scenario, Veritas Cluster Server (VCS) uses the concept of I/O fencing to provide network arbitration. VCS has traditionally relied on the SCSI-3 PGR feature on LUNs to do this.

The servers in the cluster all share access to this LUN with the SCSI-3 PGR feature. When a network disruption divides the cluster into two sub clusters, each sub cluster elects one node to "race" for control of the LUN. The "winner" is the first sub cluster to gain control of the LUN. It then sets the SCSI-3 PGR bit to indicate that it now controls the application, so the losing sub cluster does not also run it, which protects data consistency and correctness.
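The essential property of the race is that the shared resource grants control to exactly one racer, no matter how close the timing. The Python sketch below models that property with a lock. It is a deliberately simplified stand-in, not real SCSI-3 PGR handling and not the VCS implementation; `CoordinatorLun`, `try_reserve` and the sub cluster names are hypothetical.

```python
import threading


class CoordinatorLun:
    """Toy stand-in for a shared LUN that accepts one reservation."""

    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None  # name of the sub cluster holding the reservation

    def try_reserve(self, sub_cluster):
        # The check and the write happen atomically, so only one racer can win.
        with self._lock:
            if self.owner is None:
                self.owner = sub_cluster
                return True
            return False


def race(lun, sub_clusters):
    """Let one racer per sub cluster try to reserve the LUN at the same time."""
    results = {}

    def racer(name):
        results[name] = lun.try_reserve(name)

    threads = [threading.Thread(target=racer, args=(name,)) for name in sub_clusters]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


lun = CoordinatorLun()
results = race(lun, ["sub-cluster-1", "sub-cluster-2"])
winners = [name for name, won in results.items() if won]
print("winner:", lun.owner)
```

Which sub cluster wins depends on thread scheduling and can vary between runs. What does not vary is that there is exactly one winner.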

However, in a virtualized environment, VMs that are part of a cluster cannot access this LUN with the SCSI-3 PGR bit because the LUN is virtualized by VMware, so this fencing technique cannot be used to provide network arbitration. If a network disruption occurs for reasons other than the failure of a production application, the possibility of a network split brain returns.

This is what the new non-SCSI-3 fencing feature in Storage Foundation HA 5.1 SP1 resolves. To address this issue, SP1 takes advantage of the coordination point server (CPS) feature available in Symantec's Veritas Storage Foundation HA 5.1.

CPS acts like a LUN with the SCSI-3 PGR feature, but it operates in a LAN as opposed to a SAN environment. The original purpose of CPS was to coordinate the recovery of a cluster at another site should an entire site go down, as a means to prevent clusters from operating in a network split brain.

In SP1, the functionality of CPS is extended so that servers (physical or virtual) no longer need to access a LUN and set its SCSI-3 PGR bit to indicate that they have control of the application. This is relevant in virtual environments because VMware virtual machines do not have access to SCSI-3 based fencing support. CPS may also be used in physical environments if an organization does not have storage arrays that support the SCSI-3 feature.

With the non-SCSI-3 fencing feature, the servers in the cluster all access the CPS instead. When a network disruption divides the cluster into two sub clusters, each sub cluster elects one node to "race" to communicate with the CPS. The CPS determines the "winner" and awards that sub cluster control of the application, so the other sub cluster does not also run it, which protects data consistency and correctness.
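The flow described above can be sketched from both sides: the arbiter picks one winner, and every node acts on the answer its racer brings back. The Python below is a toy model under stated assumptions, not the CPS protocol or API. The class and function names, the "lowest node name races" election rule, and the two action strings are invented for illustration.

```python
class CoordinationPointServer:
    """Toy arbiter: the first sub cluster to ask is recorded as the winner."""

    def __init__(self, members):
        self.members = set(members)  # nodes allowed to take part in the race
        self.winner = None

    def arbitrate(self, racer, sub_cluster):
        if racer not in self.members:
            raise PermissionError(f"{racer} is not a known cluster member")
        if self.winner is None:
            self.winner = frozenset(sub_cluster)
        return racer in self.winner


def resolve_split(cps, sub_clusters_in_arrival_order):
    """Return the action each node takes once arbitration is complete."""
    actions = {}
    for sub in sub_clusters_in_arrival_order:
        racer = min(sub)  # hypothetical election rule: lowest name races
        won = cps.arbitrate(racer, sub)
        for node in sub:
            actions[node] = "run application" if won else "stop application"
    return actions


cps = CoordinationPointServer(["vm1", "vm2", "vm3", "vm4"])
# Assume the racer for vm3/vm4 reaches the arbiter first.
actions = resolve_split(cps, [["vm3", "vm4"], ["vm1", "vm2"]])
for node in sorted(actions):
    print(node, "->", actions[node])
```

The point of the model is that both sub clusters receive a consistent answer from a single arbiter, so only one of them keeps the application running.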

While the introduction of this non-SCSI-3 fencing feature in SP1 eliminates the dependency on LUNs with SCSI-3 PGR bits to provide network split brain arbitration in environments where SCSI-3 PGR support is not available, organizations do need to proceed with some caution when implementing this solution.

Notably, communication needs to occur over an IP network from the servers (physical or virtual) in the cluster to the CPS in order for it to do the network arbitration and determine which sub cluster will assume application processing. This communication over a LAN will take longer than on a SAN, so fencing arbitration with CPS takes longer than SCSI-3 based fencing.

Virtualizing business application servers that are part of clustered configurations is a growing priority for enterprise organizations. But as they do so, they need to make sure they are taking the proper steps to avoid network split brains. The new non-SCSI-3 fencing in Storage Foundation HA 5.1 SP1 now makes it possible for organizations to more confidently virtualize these application servers and do so in such a way that they do not re-introduce the possibility of creating network split brain scenarios.


Entry Sponsorship

This entry is sponsored by Symantec Corp.

About Symantec Corp.

    Symantec is a global leader in infrastructure software, enabling businesses and consumers to have confidence in a connected world. The company helps customers protect their infrastructure, information and interactions by delivering software and services that address risks to security, availability, compliance and performance. Headquartered in Cupertino, Calif., Symantec has operations in more than 40 countries. More information is available at www.symantec.com.

    DCIG is paid a fee by Symantec Corp. in connection with this blog. Symantec undertakes no obligation to update, correct or modify any statements contained in this blog; these statements represent the views and opinions of DCIG only.