VMware Network Failover Policies

I have been reading up on these policies, and I find it odd that almost everything I read covers the load balancing policies but says little about the failover policies. So here we go.

Network Failover Detection

This is the method used to detect when a failure occurs on a physical network.   There are two options:

  • Link Status only: This relies on the link state the NIC reports for its connection to the next-hop switch port. It catches a pulled cable or a dead switch port, but it does not detect misconfigurations (a wrong VLAN, for example) or outages further upstream on the switches. This is essentially a port-down condition.
  • Beacon Probing: This sends a probe packet out of every NIC in a team, destined for the other NICs in the team. A NIC that cannot talk to its team members has an upstream problem even if its link is up. Beacon probing also uses link status. The method has a problem if the team has fewer than three members, which can leave every NIC isolated. It's the standard quorum issue in clusters. Assume you have two NICs and one of them is misconfigured on an upstream switch. Both send beacons, neither hears the other, and each looks just as failed as the other, so the host cannot tell which one to take out of service. With a third NIC, the two healthy ones still hear each other and the odd one out is obvious.
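The quorum problem is easier to see in a toy model. This is only a sketch of the reasoning above, not how ESXi implements beacon probing: the function name `suspects`, the `vmnicN` labels, and the rule "a NIC is suspect if it hears no teammate" are my own assumptions for illustration.

```python
def suspects(nics, isolated):
    """Return the NICs that hear no beacon from any teammate.

    nics     -- uplink names in the team (hypothetical labels)
    isolated -- NICs cut off upstream, e.g. by a bad switch config
    """
    def can_talk(a, b):
        # A beacon only gets through if neither end is isolated.
        return a not in isolated and b not in isolated

    result = []
    for nic in nics:
        heard = [peer for peer in nics if peer != nic and can_talk(peer, nic)]
        if not heard:
            result.append(nic)
    return result


# Two NICs, vmnic1 misconfigured upstream: both look failed.
two = suspects(["vmnic0", "vmnic1"], isolated={"vmnic1"})

# Three NICs, same fault: only the bad one stands out.
three = suspects(["vmnic0", "vmnic1", "vmnic2"], isolated={"vmnic1"})

print(two)    # ['vmnic0', 'vmnic1']
print(three)  # ['vmnic1']
```

With two members the healthy NIC is indistinguishable from the broken one, which is why three or more uplinks are needed before beacon probing is useful.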

Notify Switches

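This setting determines whether the virtual switch notifies the physical switches when a failover occurs, so that they relearn which port each VM's MAC address now sits behind. The notification is often described as a gratuitous ARP; ESXi actually sends a RARP frame, but the effect on the switch tables is the same. In almost all situations you want the default of Yes. You can choose No, but then your VMs may be unreachable until the stale MAC (CAM) table entries on the switches age out, commonly around 5 minutes, or until the VM sends traffic of its own. (This is not an issue with HA, because operating systems send out a gratuitous ARP when they boot.)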
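To make the stale-table problem concrete, here is a small sketch of a physical switch's MAC table. Everything in it is an assumption for illustration: the `Switch` class, the port names, the MAC address, and the 300-second aging time (a common default, but check your own switches).

```python
AGING_SECONDS = 300  # assumption: a common default MAC aging time


class Switch:
    """Toy MAC (CAM) table: maps a MAC address to (port, time learned)."""

    def __init__(self):
        self.table = {}

    def learn(self, mac, port, now):
        # Any frame sourced from `mac` arriving on `port` refreshes the entry.
        self.table[mac] = (port, now)

    def lookup(self, mac, now):
        entry = self.table.get(mac)
        if entry is None or now - entry[1] >= AGING_SECONDS:
            return None  # unknown MAC: a real switch would flood the frame
        return entry[0]


vm_mac = "00:50:56:aa:bb:cc"  # hypothetical VM MAC

# Notify Switches = No: the VM moved to port2 at t=10, nobody told the switch.
quiet = Switch()
quiet.learn(vm_mac, "port1", now=0)
stale = quiet.lookup(vm_mac, now=11)     # still points at the dead uplink
expired = quiet.lookup(vm_mac, now=310)  # entry aged out, switch floods

# Notify Switches = Yes: the notification frame arrives on port2 at t=10.
notified = Switch()
notified.learn(vm_mac, "port1", now=0)
notified.learn(vm_mac, "port2", now=10)
fresh = notified.lookup(vm_mac, now=11)

print(stale, expired, fresh)  # port1 None port2
```

In the quiet case, frames for the VM keep going to the old port until the entry expires or the VM itself transmits something on the new uplink.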

Failback

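This setting determines how a NIC is returned to use after it recovers from a failure.

  • Yes – Return the adapter to active use right away when it becomes available
  • No – Leave the adapter inactive even after it recovers, until another active adapter fails and it is needed again (or an administrator changes the failover order manually)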

This requires some discussion. Assume that you have an adapter that is failing due to hardware. When it fails, everything moves to your second adapter. You don't know yet whether it's a hardware or a software issue. When the adapter comes back (after a reboot, for example), with Yes it goes straight back to servicing your loads until it fails again; with No it stays out of the mix. An even worse scenario is a NIC that is flapping up and down, or a switch going up and down. Both of these scenarios create interruptions you may not want. Then again, changing it to No means all your traffic stays on a single link, perhaps causing contention.
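A rough way to compare the two settings is to count how often traffic has to move between uplinks while a link flaps. The sketch below is my own simplification, not VMware's logic: `count_moves` is a hypothetical helper, there are only two uplinks, and the secondary is assumed to stay healthy the whole time.

```python
def count_moves(primary_up, failback):
    """Count how many times traffic switches uplinks.

    primary_up -- sequence of booleans, the primary link state at each step
    failback   -- True models Failback: Yes, False models Failback: No
    """
    active = "primary"
    moves = 0
    for up in primary_up:
        if active == "primary" and not up:
            active = "secondary"  # failover
            moves += 1
        elif active == "secondary" and up and failback:
            active = "primary"    # failback as soon as the link returns
            moves += 1
    return moves


# A primary uplink that flaps three times.
flapping = [True, False, True, False, True, False, True]

with_failback = count_moves(flapping, failback=True)
without_failback = count_moves(flapping, failback=False)

print(with_failback, without_failback)  # 6 1
```

Each move is a potential interruption, so the flapping link costs six of them with failback on and one with it off. The price of the quieter option is that everything rides on the secondary link for the duration.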
