Deep Dive: Standard Virtual Switch vs Distributed Virtual Switch

Let the wars begin. This article will discuss the current state of affairs between the virtual switches in ESXi. I have not included any third-party switches because I believe they are quickly falling out of the picture with NSX.

 

What's the deal with these virtual switches?

Virtual switches are a lot like Ethernet layer 2 switches and share many of the same features. Both switch types feature the following configurable items:

  • Uplinks – connections from the virtual switch to the outside world – physical network cards
  • Port Groups – groups of virtual ports with similar configuration

In addition both switch types support:

  • Layer 2 traffic handling
  • VLAN segmentation
  • 802.1Q tagging
  • NIC teaming
  • Outbound traffic shaping

So the first question everyone asks: if two virtual machines are in the same VLAN and on the same server, does their communication leave the server?

No. Two VMs in the same VLAN on the same ESXi host can communicate without the traffic ever leaving the virtual switch.

 

Port groups: what are they?

Much like the name suggests, port groups are groups of ports. They are best described as a number of virtual ports (think physical ports 1-10) that share the same configuration. Port groups have a defined number of ports and can be expanded at will (like going from a 24-port switch to a 48-port switch). There are two generic types of port groups:

  • Virtual machine
  • VMkernel

Virtual machine port groups are for guest virtual machines. VMkernel port groups are for ESXi management functions and storage. The following are valid uses for VMkernel ports:

  • Management Traffic
  • Fault Tolerance Traffic
  • IP based storage
  • vMotion traffic

You can have one or many VMkernel port groups, but each requires a valid IP address that can reach the other VMkernel ports in the cluster.
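As a quick illustration, here is a minimal PowerCLI sketch of creating a vMotion VMkernel port with its own IP address; the vCenter, host, switch, and addressing values below are assumptions for the example.

# Connect and find the host and standard switch (names are examples)
Connect-VIServer -Server vcenter.lab.local
$vmhost  = Get-VMHost -Name "esxi01.lab.local"
$vswitch = Get-VirtualSwitch -VMHost $vmhost -Name "vSwitch0"

# Each VMkernel port needs an IP that can reach its peers in the cluster
New-VMHostNetworkAdapter -VMHost $vmhost -VirtualSwitch $vswitch -PortGroup "vMotion" -IP 10.10.10.11 -SubnetMask 255.255.255.0 -VMotionEnabled $true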

At the time of writing (vSphere 5.5) the following maximums apply:

  • Total switch ports per host: 4096
  • Maximum active ports: 1016
  • Port groups per standard switch: 512
  • Port groups per distributed switch: 6500
  • VSS port groups per host: 1000

So as you can see, the vDS scales a lot higher.

Standard Virtual Switch

The standard switch has one real advantage: it does not require Enterprise Plus licensing. It has far fewer features and some drawbacks, including:

  • No configuration sync – you have to create all port groups exactly the same on each host or lots of things (vMotion in particular) will fail (even an upper case vs. lower case mismatch will break it)

Where do standard switches make sense? In small shops with a single port group they make a lot of sense. If you need to host 10 virtual machines on the same subnet, standard switches will work fine.

Advice

  • Use scripts to deploy switches and keep them in sync to avoid manual errors (see the sketch after this list)
  • Always try vMotions between all hosts after each change to ensure nothing is broken
  • Don’t go complex on your networking design – it will not pay off
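Here is a minimal PowerCLI sketch of keeping standard switch port groups consistent across a cluster; the cluster, switch, port group names, and VLAN IDs are assumptions for the example.

# Define the desired port groups once so every host gets an identical copy
Connect-VIServer -Server vcenter.lab.local
$portGroups = @(
    @{ Name = "Production"; VLan = 10 },
    @{ Name = "Management"; VLan = 20 }
)

foreach ($vmhost in Get-Cluster "Cluster01" | Get-VMHost) {
    $vswitch = Get-VirtualSwitch -VMHost $vmhost -Name "vSwitch0"
    foreach ($pg in $portGroups) {
        # Create the port group only if it is missing (names must match exactly, including case)
        if (-not (Get-VirtualPortGroup -VirtualSwitch $vswitch -Name $pg.Name -ErrorAction SilentlyContinue)) {
            New-VirtualPortGroup -VirtualSwitch $vswitch -Name $pg.Name -VLanId $pg.VLan
        }
    }
}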

Distributed Virtual Switch

The distributed virtual switch is a different animal. It is configured in vCenter and deployed to each ESXi host, so the configuration stays in sync. It has the following additional features:

  • Inbound traffic shaping – throttle incoming traffic to the switch, useful to slow down traffic to a bad neighbor
  • VM network port block – block individual virtual ports
  • Private VLANs – requires physical switches that support PVLANs; lets you create secondary VLANs inside a primary VLAN
  • Load-based teaming – the best possible load balancing (another article on this topic later)
  • Network vMotion – because the dVS is owned by vCenter, port state and traffic statistics move between hosts when a virtual machine moves; on a standard switch that information is lost with a vMotion
  • Per-port policy – the dVS allows you to define policy at the port level instead of the port group level
  • Link Layer Discovery Protocol – LLDP enables virtual-to-physical port discovery (your network admins can see info on your virtual switches and you can see network port info, great for troubleshooting and documentation)
  • User-defined Network I/O Control – you can shape outgoing traffic to help avoid starvation
  • NetFlow – the dVS can export NetFlow data
  • Port mirroring – ports can be configured to mirror for diagnostic and security purposes

As you can see there are a lot of features on the vDS, with two drawbacks:

  • Requires enterprise plus licensing
  • Requires vCenter to make any changes

The last drawback has driven a number of hybrid solutions over the years. At this point VMware has created a workaround with the ephemeral port binding type and the network recovery features of the console.

Advice for using it:

  • Back up your switch with PowerCLI (there are a number of good scripts out there; a minimal example follows this list)
  • Don't go crazy just because you can; if you don't need a feature, don't use it
  • Test your vCenter to confirm you can recover from a failure
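A minimal PowerCLI sketch of backing up a distributed switch with the built-in export cmdlet; the switch name and destination path are assumptions for the example.

# Export the dVS configuration, including port groups, to a backup file
Connect-VIServer -Server vcenter.lab.local
Get-VDSwitch -Name "dvSwitch01" | Export-VDSwitch -Destination "C:\Backups\dvSwitch01.zip" -Description "Weekly dVS backup"

The exported file can later be used with New-VDSwitch's restore options to rebuild or clone the switch.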

So, get to the point: which one should I use?

Well, to take the VCDX model, here are the elements of design:

Availability

  • VSS – deployed and defined on each ESXi host with no external requirements (+ for availability)
  • dVS – deployed and defined by vCenter, which is required to provision new ports/port groups (– for availability)

Manageability

  • VSS – a pain to manage in most environments and does not scale with lots of port groups or complex solutions (– for manageability)
  • dVS – central management that can be deployed to multiple hosts or clusters in the same datacenter (+ for manageability)

Performance

  • VSS – performance is fine; no effect on quality
  • dVS – performance is fine; no effect on quality, other than it can scale up a lot larger

Recoverability

  • VSS – is deployed to and stored on each host; if you lose it you have to rebuild from scratch and manually add VMs to the new switch (– for recoverability)
  • dVS – is deployed from vCenter, and you always have it as long as you have vCenter. If you lose vCenter you have to start from scratch and cannot add new hosts (don't remove your vCenter, it's a very bad idea). (+ for recoverability, as long as you have a way to never lose your vCenter, which does not exist yet)

Security

  • VSS – offers basic security features, not much more
  • dVS – a wider range of security features (+ for security)

 

End Result:

The dVS is better in most ways but costs more money. If you want to use the dVS, it might be best to host vCenter on another cluster or otherwise ensure its availability.

 

 

Deep Dive: vSphere Network Link Failure Settings

In this series of posts I will tackle different topics around vSphere and attempt to explain what they mean and how to use them. This article will discuss the link failover detection methods.

 

Link Failover Detection

Link failover detection is a critical component of any infrastructure: it is the method ESXi uses to determine whether a link has failed and should not be used for traffic. ESXi provides two options:

  • Link Status
  • Beacon Probing

 

Link Status

Link status is just what it sounds like: the link is either up or down. This method can detect a switch or cable failure on the next hop. For example, if switch A were to lose power, ESXi would move all possible traffic from NIC1 and NIC2 to NIC3 and NIC4.


 

Link status does have some drawbacks:

  • It cannot detect misconfiguration on the switches or upstream
  • It cannot detect upstream failures (for example, the router attached to each switch)

For these reasons it is critical that you implement some type of link state tracking on your network gear. A common setup is to configure ports to shut down when their uplink ports fail. This type of link state tracking is a function of the switch gear, and it is critical that it be configured all the way down to the ESXi ports so ESXi sees a link failure. It still cannot overcome misconfiguration, which is really bad in situations where MTU is misconfigured upstream. For this reason VMware implemented a network health check that can help identify MTU mismatches and VLAN issues. I would 100% recommend turning it on. It's a free health check that can save you hours.

Beacon probing

Beacon probing is a really simple process. It requires an odd number of network cards. Each network card sends out a broadcast beacon; as each NIC receives the other cards' beacons, it knows it is not isolated from the others and assumes a good link state. This process has a number of advantages:

  • Can detect upstream failures
  • Can detect some misconfigurations

It does have some downsides:

  • Requires at least three network cards for a quorum (2 would vote each other out)
  • Can lead to false positives

I would like to explain the false positives. There are a number of situations where a beacon might not reach its destination; during those times, all links determined to be isolated would be shut down. You could put your host into an isolation event very quickly, all at once.

 

Link State Tracking Choice

This one is 100% up to you. If you have two or fewer network cards, use link status. If you have three or more, you might want to use beacon probing. Either way, test every possible failure scenario before deploying in production.

 

Notify Switch of failure

Should you notify the switch of a failure? I would think this is a good idea. Without going into a full discussion of ARP: this setting sends out gratuitous ARP messages after a failover event. These messages allow switches to quickly update their ARP/CAM tables. Without these updates, messages destined for moved virtual machines may take up to five minutes to reach them. This is unlikely, but possible in complex network configurations. My vote is always yes; I cannot think of a downside, but suggest one if you know it.

 

Failback:

When set to yes, this setting allows traffic to be moved back to a link after a link failure is resolved. If set to no, you have to manually move the traffic back. There are two schools of thought on this matter. Failback: yes creates an automated failback when outages occur, and less work is good. But it's possible that a link starts flapping and traffic keeps moving back and forth all night between working and failed links, causing availability problems in your environment. It's really up to your requirements, but I suggest that if you use failback: no, you enable a vSphere alarm to let you know so you can re-add the link after the failure is resolved. A quick PowerCLI sketch of these teaming settings follows.
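Here is a minimal PowerCLI sketch of setting failover detection, switch notification, and failback on a standard switch teaming policy; the vCenter, host, and switch names are assumptions for the example.

# Set the teaming/failover policy on a standard switch (names are examples)
Connect-VIServer -Server vcenter.lab.local
Get-VMHost -Name "esxi01.lab.local" |
    Get-VirtualSwitch -Name "vSwitch0" |
    Get-NicTeamingPolicy |
    Set-NicTeamingPolicy -NetworkFailoverDetectionPolicy LinkStatus -NotifySwitches $true -FailbackEnabled $false

Setting FailbackEnabled to $false means a recovered link stays out of service until you bring it back, so pair it with a vSphere alarm as suggested above.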

 

Network IO Control failing to shape traffic with multi-nic vMotion

I have always used Network IO Control (NIOC) to shape my traffic on a virtual switch (Enterprise Plus required). It does a great job of balancing traffic when contention comes into play. Unfortunately, it cannot shape traffic as it comes into the virtual switch; it can only shape traffic going out. A new friend (@VMPrime) pointed this out to me at the perfect time. He was knowledgeable and encouraging, an all-around great guy. He pointed out that NIOC only has an effect on traffic flows exiting the machine; when traffic arrives from another machine, NIOC has no effect. I remember reading about this, but the terminology was a little fuzzy from a VMware perspective.

Joe provided a simple scenario where that lack of control could be a problem. Assume we have a two-host cluster, each host running two 10GB NICs. We have a VLAN for management, one for virtual machines, and we have set up multi-NIC vMotion as shown in diagram 1 below. We have NIOC set up with shares to protect each traffic type during contention. Assume the network utilization of host A is 2GB while the network utilization of host B is 15GB.

Assuming host B has capacity for all of host A's virtual workloads, I put host A into maintenance mode. Host A now uses up to 18GB of network to transfer the running state of its virtual machines to host B. Host A's NIOC kicks in, preserving 2GB for virtual machines and allocating 18GB to vMotion toward host B. We are now shoving 18GB into host B, whose virtual machines need 15GB. Now both sides are contending for bandwidth, we might have availability issues on the host, and the vMotion might fail.

How do we solve this issue?

This is exactly why we have network limits. Unlike CPU and memory limits, NIOC limits can really help with this exact issue. Putting a limit on vMotion of, for example, 2.5GB per link would create a scenario where it could never use more than 5GB per host (two links x 2.5GB). Will this still have an effect? Maybe; it's a cost-benefit analysis. You have to weigh your options, and you might have to adjust your limit lower. One way to set such a limit is sketched below.
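There is no simple one-liner for NIOC limits in older PowerCLI releases, so this is a rough sketch that drops to the vSphere API through Get-View. The switch name, the "vmotion" resource pool key, the Mbps unit, and whether the limit applies per host or per uplink are all assumptions to verify against your environment before using it.

# Cap the vMotion network resource pool on a dVS (rough sketch, verify key/units first)
$vds     = Get-VDSwitch -Name "dvSwitch01"
$vdsView = Get-View -Id $vds.Id
$pool    = $vdsView.NetworkResourcePool | Where-Object { $_.Key -eq "vmotion" }

$spec = New-Object VMware.Vim.DVSNetworkResourcePoolConfigSpec
$spec.Key            = $pool.Key
$spec.ConfigVersion  = $pool.ConfigVersion
$spec.AllocationInfo = New-Object VMware.Vim.DVSNetworkResourcePoolAllocationInfo
$spec.AllocationInfo.Limit = 2500   # assumed to be Mbps; -1 means unlimited

$vdsView.UpdateNetworkResourcePool(@($spec))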

 

[Diagram 1: two-host cluster, two 10GB NICs per host, multi-NIC vMotion]

Testing MTU in vSphere

Well, I have been playing around with VXLAN more than I care to admit. It's a painful process. One key component of VXLAN is an increased MTU of 1600 to support the encapsulation. You can verify that you don't have an MTU issue the following way:

Log in to your ESXi host (I like SSH, but it's up to you).

Identify the vmknic with your MTU settings:

esxcfg-vmknic -l

You should see a list of vmknics and their MTU settings. Then check to make sure your local switch also has an MTU setting >= the vmknic setting:

esxcfg-vswitch -l

Check the MTU of the switch. If everything looks OK, you can use vmkping to send a packet. Test basic connectivity first:

vmkping IP_Address_of_local_interface
vmkping IP_address_of_remote_interface

This should return pings unless you are using 5.5 (see below for more 5.5 specifics). If this fails, you have basic connectivity issues such as a firewall, subnet, or some other layer 2 problem. Now test for a 1600-byte packet. The ICMP and IP headers add 28 bytes that the size option does not take into account, so use a payload of 1572 (1572 + 28 = 1600):

5.0 (-d is do not fragment -s is size)

vmkping -d -s 1572 IP_Address_of_local_interface
vmkping -d -s 1572 IP_address_of_remote_interface

5.1 (-I allows you to identify the vmknic to use)
vmkping -I vmknic# -d -s 1572 IP_Address_of_local_interface
vmkping -I vmknic# -d -s 1572 IP_address_of_remote_interface

5.5 (this one is different: it actually sends the packet through the VXLAN network stack rather than just an MTU 1572 packet, making it a true test of VXLAN):

vmkping ++netstack=vxlan vmknic_IP -d -s 1572

or

esxcli network diag ping --netstack=vxlan --host vmknic_IP --df --size=1572
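If you would rather check the configured MTUs from vCenter instead of host by host, here is a minimal PowerCLI sketch; the vCenter name is an assumption for the example.

# List every VMkernel adapter and its MTU across all hosts
Connect-VIServer -Server vcenter.lab.local
Get-VMHost | Get-VMHostNetworkAdapter -VMKernel |
    Select-Object VMHost, Name, IP, Mtu |
    Sort-Object VMHost, Name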

Enjoy your testing and remember the 1572 rule.

VMware predefined NIOC settings: what do they mean?

Recently I was setting up a new 5.5 cluster with NIOC and I noticed all the new predefined NIOC categories:

[Screenshot: predefined NIOC traffic categories]

 

Some are obvious but others are a little more questionable.  After a great discussion with VMware support I found out the following:

  • NFS traffic – traffic using the NFS client in ESXi (not guest NFS traffic), ESXi NFS datastore traffic only
  • Management traffic – ESXi management traffic only, i.e. connections between vCenter and ESXi
  • vMotion traffic – vMotion and heartbeats
  • vSphere storage area network traffic – I had a lot of questions on this one, but it turned out to be simple: VSAN traffic only
  • vSphere replication traffic – traffic coming from the vSphere Replication appliance only, no other replication traffic
  • iSCSI traffic – as expected, iSCSI traffic to and from ESXi using the hardware or software initiator
  • Virtual Machine traffic – Traffic out of guest virtual machines
  • Fault Tolerance Traffic – Traffic specific to vmware FT

Those are all the predefined ones. But what if I create a user-defined category and assign it to my NFS port group; which set of shares applies? Simple: the one with the larger share.

VMware Network IO Control

Network IO Control has been around since 4.1, but I mostly ignored it. We would mostly use HP's Virtual Connect to divide 10GB connections into smaller NICs. As I spend more time designing solutions, I have found Network IO Control to be my best friend. It allows you to get the most out of a 10GB connection at all times. The concept is simple: each type of network traffic is allocated a share (a number between 1 and 100). Network IO Control comes with some predefined traffic classes that are automatically assigned by vSphere:

  • vMotion
  • iSCSI
  • FT Logging
  • Management
  • NFS
  • Virtual machine traffic

In addition, you can create your own user-defined classes of traffic, which can be assigned at the port group level. You need to keep a few things in mind when working with Network IO Control:

  • Network IO Control is evaluated at the dvUplink level (shares are per network uplink)
  • Network IO Control requires a vNetwork Distributed Switch (vDS)
  • Network IO Control shares are only used when contention is present (in other words, each type of traffic gets 100% of its requested bandwidth unless there is contention)
  • When evaluating shares, only active traffic is taken into account (for example, if you have NFS at 30, virtual machine at 100, and vMotion at 50 but you don't use NFS, then only 150 shares divide your 10GB: under contention virtual machine traffic gets 100/150, roughly 6.7GB, and vMotion gets 50/150, roughly 3.3GB)
  • Network IO Control only applies to outbound flows – it cannot do anything about shaping inbound flows

In addition, Network IO Control offers two more features:

  • Limits – just like CPU or memory limits (they allow you to make the customer think they have 10Gb for their virtual machine while never allowing them more than 1Gb); I would avoid limits unless you have an odd use case
  • Load-based teaming – this feature moves traffic flows to an additional uplink once an uplink reaches 75% of capacity over a 30-second period. This is 100% the best load balancing option with VMware (see the sketch after this list).
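Here is a minimal PowerCLI sketch of switching a distributed port group to load-based teaming; the vCenter, switch, and port group names are assumptions for the example.

# Enable "Route based on physical NIC load" on a distributed port group
Connect-VIServer -Server vcenter.lab.local
Get-VDSwitch -Name "dvSwitch01" |
    Get-VDPortgroup -Name "Production" |
    Get-VDUplinkTeamingPolicy |
    Set-VDUplinkTeamingPolicy -LoadBalancingPolicy LoadBalanceLoadBased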

Network IO Control provides some awesome features that you should play with. You can read an older but still valid white paper here.

ESXi purple screen of death with errors E1000PollRxRing and E1000DevRx

Everyone loves the purple screen of death right?

[Image: example ESX purple screen of death]

This is a nasty problem that I ran into. It happens with Windows virtual machines using the E1000 adapter and is a known issue. Since this has been a known issue since the 5.0 days, I doubt VMware is going to resolve it. The only fix is to replace the E1000 adapters with VMXNET3 adapters. You really should be using VMXNET3 adapters anyway, but replacing them requires downtime. It's a pain. You can read the official VMware article here:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2059053

vCloud Director: my virtual machine cannot get HTTP to work unless it's on the same node as vShield Edge

The title says most of it. You have an Org vApp that cannot get HTTP to work. DNS works, telnet works, but real HTTP connections will not work. I spent a bunch of time troubleshooting this issue with VMware support, and the end result: somewhere upstream my MTU was not set to 1600, breaking VXLAN. You can test this by moving the VM to the same ESXi host as the vShield Edge. If it works while on that node, it's an MTU issue on VXLAN, trust me.

How do you prove it? Log in to the ESXi host, locate the VXLAN VMkernel IP on another node (assume it's 192.168.10.31), and use this command:

vmkping -s 1547 192.168.10.31 -d

If it hangs, then you have MTU issues somewhere.

vSphere Issue: Guest Unable to collect IPv4 routing table

Yesterday morning I showed up for work and found a virtual machine all buggered up due to some application errors in Java. My only recourse was a timely reboot. During the reboot I found it stuck on the following:

Guest Unable to collect IPv4 routing table

This was a first for me. I originally assumed that it was caused by an operating system issue (which it is; see RHBA-2013:1290), but there are really only workarounds at this time. You can follow the official route, which is documented here, or a nice workaround that works every time here. I have tested both and they work, but I hate modifying the .vmx file. This is a really nasty bug that needs to be fixed, and I hope VMware does so soon.

VMware Network Failover Policies

I have been reading up on these policies, and I find it odd that almost everything I read talks about the load balancing policies but not much about the failover policies, so here we go.

Network Failover Detection

This is the method used to detect when a failure occurs on a physical network.   There are two options:

  • Link Status only: this uses the link state of the next-hop port to detect an outage. The problem is that this does not detect misconfigurations or upstream outages on switches. It is essentially a port-down condition.
  • Beacon Probing: this is a process of sending out a packet on all NICs in a team, destined for the other NICs in the team. This way a NIC can detect upstream issues if it cannot talk to its team members. It also uses link status. This method has a problem if you don't have at least three members, which can create isolation of all NICs; it's the standard quorum issue in clusters. Assume you have two NICs and one of them is misconfigured in an upstream switch. They both beacon probe, and when they are unable to reach each other they both determine they have failed, and all NICs go down.

Notify Switches

This setting determines whether, when a failure occurs, the virtual switch should notify the physical switches using a gratuitous ARP. In almost all situations you want the default of yes, but you can choose no and wait up to five minutes for the ARP/CAM tables to expire before your VMs work again. (This is not an issue with HA, because operating systems send out a gratuitous ARP when booting.)

Failback

This setting determines how a nic is returned to use after a failure.

  • Yes – Return the adapter to active use right away when available
  • No – Adapter is set to inactive until an administrator changes it manually

This requires some discussion. Assume you have an adapter failing due to hardware. When it fails, everything goes to your second adapter. You don't know if it's a hardware or software issue. If failback is set to yes, after a reboot the adapter is back servicing your loads until it fails again; if set to no, it stays out until you allow it back into the mix. An even worse scenario is a NIC that is flapping up and down, or a switch going up and down. Both of these scenarios create interruptions you may not want. Then again, changing it to no means you have all your traffic on a single link, perhaps causing contention.