Deep Dive: vSphere Network Link Failure Settings

In this series of posts I will be tackling different topics around vSphere and attempting to explain what they mean and how to use them. This article will discuss the link failover detection methods.

Link Failover Detection

Link failover detection is a critical component in any infrastructure. It is the method ESXi uses to determine if a link has failed and should not be used for traffic. ESXi provides two options:

Link Status
Beacon Probing

Link Status

Link status is just as it sounds. The link is either up or down. This method can detect switch failure or cable failure on the next hop. For example, if switch A were to lose power, ESXi would move all possible traffic from NIC1 and NIC2 to NIC3 and NIC4.

Link status does have some drawbacks:

It cannot detect misconfiguration on the switches or upstream.
It cannot detect upstream failures (for example, the router attached to each switch).
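To make these drawbacks concrete, here is a minimal Python sketch of what link status can and cannot see. It is a toy model, not how ESXi is implemented: the Uplink class, the upstream_ok flag, and the helper names are all made up for illustration, and it assumes NIC1 and NIC2 plug into switch A while NIC3 and NIC4 plug into switch B.

```python
from dataclasses import dataclass


@dataclass
class Uplink:
    name: str
    switch: str
    link_up: bool = True       # the only thing link status detection looks at
    upstream_ok: bool = True   # hypothetical: is the path beyond the switch healthy?


def make_team():
    """Assumed layout: NIC1/NIC2 on switch A, NIC3/NIC4 on switch B."""
    return [
        Uplink("NIC1", "A"), Uplink("NIC2", "A"),
        Uplink("NIC3", "B"), Uplink("NIC4", "B"),
    ]


def used_by_link_status(team):
    """Uplinks that link status detection would keep sending traffic on."""
    return [u.name for u in team if u.link_up]


def really_working(team):
    """Uplinks whose traffic can actually get past the switch."""
    return [u.name for u in team if u.link_up and u.upstream_ok]


# Case 1: switch A loses power, so the links on NIC1 and NIC2 go down.
team = make_team()
for u in team:
    if u.switch == "A":
        u.link_up = False
print(used_by_link_status(team))   # traffic moves to the switch B uplinks

# Case 2: the router behind switch A fails, but the links stay up.
team = make_team()
for u in team:
    if u.switch == "A":
        u.upstream_ok = False
print(used_by_link_status(team))   # all four are still considered good
print(really_working(team))        # only the switch B uplinks pass traffic
```

In the second case link status happily keeps using NIC1 and NIC2 even though nothing sent on them gets anywhere, which is exactly the gap described in the list above.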

For these reasons it is critical that you implement some type of link state tracking on your network gear. A common setup is to configure ports to shut down when their uplink ports fail. This type of link state tracking is a function of the switch gear, and it’s critical that it be configured all the way down to the ESXi-facing ports so ESXi sees a link failure. It still cannot overcome misconfiguration, which is really bad in situations where MTU is misconfigured upstream. For this reason VMware implemented a network health check on the distributed switch that can help identify MTU mismatches and VLAN issues. I would 100% recommend turning it on. It’s a free health check that can save you hours.
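Going back to link state tracking on the switch side, the idea can be sketched in a few lines of Python. This is only an illustration of the behavior, not any vendor's configuration or API; the function name and its parameters are hypothetical.

```python
def host_facing_link_states(uplink_up, host_ports, tracking_enabled):
    """Return {port_name: link_up} for the switch ports facing the ESXi host.

    With tracking enabled, losing the uplink shuts the host-facing ports,
    so the failure becomes visible to link status detection on the host.
    """
    shut_down = tracking_enabled and not uplink_up
    return {port: not shut_down for port in host_ports}


ports = ["NIC1", "NIC2"]

# Uplink is dead, no tracking: the host still sees both links as up.
print(host_facing_link_states(uplink_up=False, host_ports=ports, tracking_enabled=False))

# Uplink is dead, tracking on: the host sees both links go down and fails over.
print(host_facing_link_states(uplink_up=False, host_ports=ports, tracking_enabled=True))
```

The point of the sketch is that the switch has to turn an upstream problem into a link-down event, because a link-down event is the only thing link status understands.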

Beacon probing

Beacon probing is a really simple process. It needs at least three network cards to be useful. Each network card sends out a broadcast message. As each NIC receives the other network cards’ broadcasts, it knows it is not isolated from the others and assumes a good link state. This process has a number of advantages:

Can detect upstream failures
Can detect some misconfigurations

It does have some downsides:

Requires at least three network cards for a quorum (2 would vote each other out)
Can lead to false positives

I would like to explain the false positives. There are a number of situations where a broadcast message might not reach its destination. During these times, all links determined to be isolated would be shut down. You could put your host into an isolation event very quickly, all at once.
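Here is a small Python sketch of the voting logic as I described it above. It is a simplified model and not the actual ESXi algorithm: the isolated_nics function and the idea of passing in "segments" (groups of NICs that can hear each other's beacons) are my own assumptions for the example.

```python
def isolated_nics(segments):
    """Return the NICs that hear no beacon from any teammate.

    segments: list of sets; NICs in the same set can hear each other's
    beacons (a hypothetical way to describe the broadcast domain).
    """
    return sorted(nic for segment in segments if len(segment) == 1 for nic in segment)


# Three NICs, one cut off upstream: the odd one out is easy to spot.
print(isolated_nics([{"NIC1", "NIC2"}, {"NIC3"}]))

# Two NICs that cannot hear each other: each one looks isolated,
# so there is no way to tell which side is actually broken.
print(isolated_nics([{"NIC1"}, {"NIC2"}]))

# Broadcasts dropped everywhere for a moment: every NIC looks isolated
# at once, which is the false positive described above.
print(isolated_nics([{"NIC1"}, {"NIC2"}, {"NIC3"}]))
```

The last case is the scary one. Nothing is wrong with any single link, but the model marks all of them as failed at the same time.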

Link Failure Detection Choice

This one is 100% up to you. If you have two or fewer network cards, use link status. If you have three or more, then you might want to use beacon probing. Either way, test every possible failure scenario before deploying in production.

Notify Switch of failure

Should you notify the switch of a failure? I would think this is a good idea. Without going into a full discussion of ARP, this setting tells ESXi to send out notification frames (RARP packets, which do a job similar to gratuitous ARP) after a failover event. These messages allow physical switches to quickly update their MAC address tables. Without these updates, traffic destined for moved virtual machines may keep going to the old port for up to five minutes, until the stale entry ages out. This is unlikely but possible in complex network configurations. My vote is always yes… I cannot think of a downside, but suggest one if you know it.
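To show why the notification matters, here is a toy Python model of a switch address table with aging. Everything in it is an assumption for illustration: the UpstreamSwitch class is not a real switch API, the 300 second aging time is picked to match the five minutes mentioned above, the MAC address is made up, and the model assumes the virtual machine sends no traffic of its own after the move.

```python
class UpstreamSwitch:
    """Toy address table with entry aging (not a real switch API)."""

    def __init__(self, aging_seconds=300):
        # 300 seconds is an assumed aging time; real switches are configurable.
        self.aging_seconds = aging_seconds
        self.table = {}  # mac -> (port, time_learned)

    def learn(self, mac, port, now):
        """Record that frames from `mac` were last seen on `port`."""
        self.table[mac] = (port, now)

    def lookup(self, mac, now):
        """Return the port for `mac`, or None if unknown (switch would flood)."""
        entry = self.table.get(mac)
        if entry is None:
            return None
        port, learned = entry
        if now - learned >= self.aging_seconds:
            del self.table[mac]
            return None
        return port


VM_MAC = "00:50:56:aa:bb:cc"  # made-up address for the example

# Without notification: the VM moved to port 2 at t=10, but the switch never heard.
quiet = UpstreamSwitch()
quiet.learn(VM_MAC, port=1, now=0)
print(quiet.lookup(VM_MAC, now=60))     # still the old port
print(quiet.lookup(VM_MAC, now=300))    # entry aged out, switch has to flood

# With notification: the host announces the VM's MAC on the new port right away.
notified = UpstreamSwitch()
notified.learn(VM_MAC, port=1, now=0)
notified.learn(VM_MAC, port=2, now=10)
print(notified.lookup(VM_MAC, now=60))  # the new port
```

In the quiet case the switch keeps forwarding to the old port until the entry expires. In the notified case the table is correct immediately after the failover.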

Failback

When set to yes, this setting moves traffic back to a link once it recovers from a link state failure. If set to no, you have to manually move the traffic flow back. There are two schools of thought on this matter. Failback yes creates an automated failback when outages occur. Less work is good. But it’s possible that a link starts flapping and traffic keeps moving back and forth all night between working and failed links, causing availability problems in your environment. It’s really up to your requirements, but I suggest that if you use failback no, you enable a vSphere alarm to let you know about the failure so you can re-add the link after it is resolved.
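The flapping problem is easy to see with a quick Python sketch. This is a simplified model under my own assumptions, not ESXi's logic: the count_moves function is hypothetical, there is one primary and one standby uplink, and the standby is assumed to stay healthy the whole time.

```python
def count_moves(primary_states, failback):
    """Count how many times traffic changes uplink.

    primary_states: list of booleans, True when the primary link is up
    during that interval. The standby link is assumed to be always up.
    """
    active = "primary"
    moves = 0
    for primary_up in primary_states:
        if active == "primary" and not primary_up:
            active = "standby"   # failover: primary went down
            moves += 1
        elif active == "standby" and primary_up and failback:
            active = "primary"   # failback: primary came back
            moves += 1
    return moves


# A primary link that flaps: up, down, up, down, up, down, up.
flapping = [True, False, True, False, True, False, True]

print(count_moves(flapping, failback=True))    # traffic bounces on every flap
print(count_moves(flapping, failback=False))   # traffic moves once and stays put
```

With failback on, every flap moves traffic twice. With failback off, traffic moves once and stays on the standby until someone moves it back, which is why the alarm matters.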