Edit: Thanks to Ron Flax and Todd Craw for helping me correct some errors.

I have been blessed of late to be involved in some VMware NSX deployments, and I am really excited about the technology. I am by no means a master of NSX, but I will post about my understanding as a way to spread information and assist with my personal learning. In this post I will cover only the switching capabilities of NSX.

 

Traditional Switches

The key element of a layer 2 Ethernet switch is the MAC address. This is a (supposedly) unique identifier on a network card; each network adapter should have a unique address. A traditional physical switch learns the MAC addresses connected on each port when the network device first tries to communicate. For example:


When you power on the physical Windows server, the physical switch learns that MAC 00:00:00:00:01:01 is connected to port 1. Any message destined for 00:00:00:00:01:01 should be sent to port 1. This allows the switch to create logical connections between ports and limit the amount of wasted traffic. This entry in the switch's MAC table (sometimes called a CAM table) stays present for 5 minutes (user configurable) and is refreshed whenever the server uses its network card. The Linux server on port 2 is discovered exactly the same way: once it talks on the port, the table is updated for port 2. If Windows wants to talk to Linux, their communication never leaves the switch as long as they are in the same subnet. If the destination MAC address is unknown to the switch, it floods the frame out all ports except the one it arrived on, in the hope that something will respond.
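The learn, forward and flood behavior described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class name `LearningSwitch`, its method names and the three-port layout are made up for illustration, and real switches do this in hardware.

```python
import time


class LearningSwitch:
    """Toy layer 2 switch: learns source MACs per port and ages entries out."""

    def __init__(self, ports, aging_seconds=300):
        self.ports = list(ports)
        self.aging_seconds = aging_seconds  # 5 minutes, as in the text
        self.mac_table = {}  # MAC -> (port, time last seen)

    def receive(self, in_port, src_mac, dst_mac, now=None):
        """Process one frame and return the list of ports it is sent out of."""
        now = time.monotonic() if now is None else now
        # Learn (or refresh) where the sender lives.
        self.mac_table[src_mac] = (in_port, now)
        # Drop entries that were not refreshed within the aging window.
        self.mac_table = {
            mac: (port, seen)
            for mac, (port, seen) in self.mac_table.items()
            if now - seen <= self.aging_seconds
        }
        entry = self.mac_table.get(dst_mac)
        if entry is None:
            # Unknown destination: flood out every port except the ingress port.
            return [p for p in self.ports if p != in_port]
        out_port = entry[0]
        # A frame whose destination sits on the ingress port is not sent back.
        return [] if out_port == in_port else [out_port]


switch = LearningSwitch(ports=[1, 2, 3])
# Windows (port 1) talks first; Linux is still unknown, so the frame is flooded.
print(switch.receive(1, "00:00:00:00:01:01", "00:00:00:00:02:02", now=0))  # [2, 3]
# Linux (port 2) answers; Windows was already learned on port 1.
print(switch.receive(2, "00:00:00:00:02:02", "00:00:00:00:01:01", now=1))  # [1]
```

Passing `now` explicitly is just a convenience so the aging behavior can be exercised without waiting five minutes.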

Address Resolution Protocol (ARP)

ARP is a protocol used to resolve IP addresses to their MAC addresses. It is critical to understand that ARP does not return the MAC address of the final destination; it only returns the MAC address of the next hop, unless the final destination is on the same subnet. This is because Ethernet is only concerned with the next hop via MAC, not the end destination.


You can follow the ARP communication between each layer of the diagram. The key point is that if the destination IP is not local, the sender ARPs for its default gateway, the gateway returns its own MAC, and the traffic is forwarded through the gateway.
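The "same subnet or next hop" decision can be illustrated with Python's standard `ipaddress` module. This is a minimal sketch, not how any particular operating system implements it; the function name `arp_target`, the source address 192.168.10.10 and the gateway 192.168.10.1 are assumptions chosen for the example.

```python
import ipaddress


def arp_target(src_ip, prefix_len, dst_ip, default_gateway):
    """Return the IP address a host would ARP for when sending to dst_ip."""
    local_net = ipaddress.ip_interface(f"{src_ip}/{prefix_len}").network
    if ipaddress.ip_address(dst_ip) in local_net:
        # Same subnet: ARP for the destination itself.
        return dst_ip
    # Different subnet: ARP for the next hop, i.e. the default gateway.
    return default_gateway


# Same subnet: the host resolves the MAC of the final destination.
print(arp_target("192.168.10.10", 24, "192.168.10.11", "192.168.10.1"))
# Different subnet: the host resolves the MAC of the gateway instead.
print(arp_target("192.168.10.10", 24, "10.0.0.5", "192.168.10.1"))
```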

Traditional Virtual Switches

In order to understand the NSX vSwitch it is critical that you understand how the traditional virtual switch works. In a traditional virtual switch (VSS and dVS) the switch learns the MAC addresses of virtual machines when they are powered on. As soon as a virtual machine is assigned a switch port, its MAC becomes hard-coded in the MAC table for that virtual switch. Anything that is local to that switch in the same VLAN or segment will be delivered locally. Otherwise the virtual switch just forwards the message out its uplink and allows the physical switches to resolve the connection.
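The contrast with a learning physical switch can be sketched as follows: the table is filled in when a port is assigned rather than learned from traffic, and anything not local goes to the uplink instead of being flooded. The class `TraditionalVSwitch`, the port names and the VLAN number are hypothetical and exist only for this illustration.

```python
class TraditionalVSwitch:
    """Toy virtual switch: MAC entries are set at port assignment, not learned."""

    def __init__(self, uplink="uplink0"):
        self.uplink = uplink
        self.ports = {}  # (vlan, mac) -> virtual port name

    def assign_port(self, vlan, mac, port):
        # Happens when the VM is connected; no traffic is needed to learn it.
        self.ports[(vlan, mac)] = port

    def forward(self, vlan, dst_mac):
        # Local destination in the same VLAN: deliver locally.
        # Anything else: hand it to the physical network via the uplink.
        return self.ports.get((vlan, dst_mac), self.uplink)


vswitch = TraditionalVSwitch()
vswitch.assign_port(10, "00:00:00:00:01:01", "vport1")
print(vswitch.forward(10, "00:00:00:00:01:01"))  # vport1
print(vswitch.forward(10, "00:00:00:00:02:02"))  # uplink0
```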

NSX Virtual Switch

The NSX virtual switch adds functionality to the traditional virtual switch. The key feature is the ability to use VXLAN to span layer 2 segments between hosts without the use of multiple stretched VLANs. VXLAN also allows layer 2 to be stretched to distant datacenters and supports up to 16 million segments versus the current limit of 4096 VLANs. There are some common components that need to be understood:

  • VTEP (VXLAN Tunnel End Point) – this is an ESXi virtual adapter that has its own VLAN and IP address, including a gateway. This interface must be set for 1600 MTU, and all physical switches/routers that handle this traffic must allow at least 1600 MTU.
  • NSX virtual switch (also called logical switch) – this is a software, kernel-based construct that does the heavy lifting. It is deployed to a dVS and works as an extension to the dVS.
  • NSX Manager – this is the management plane for NSX; it acts as a central point for communication, scripting and control. It is only required when making changes, as part of the management plane.
  • NSX Control cluster – this is a series of virtual machines that are clustered via software. Each node (there should be an odd number and at least three) contains all required information, and load is distributed between all nodes. (Best practice: create a DRS rule to keep these on separate hosts; future releases may do this for you.)
  • VNI (VXLAN Network Identifier) – this is an identifier used by VXLAN to separate networks (think VLAN tag). In NSX they start at 5000 and go up to roughly 16 million. It is easiest for people to think of VLAN tags when working with VNIs.
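To make the VNI and MTU numbers a little more concrete, here is a small sketch of the 8-byte VXLAN header as laid out in RFC 7348, plus the arithmetic behind the extra MTU. The function names are made up for the example, and the MTU calculation assumes an IPv4 underlay and an untagged inner frame.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag from RFC 7348
VNI_BITS = 24


def build_vxlan_header(vni):
    """Pack the 8-byte VXLAN header: flags, 24 reserved bits, VNI, 8 reserved bits."""
    if not 0 <= vni < 2 ** VNI_BITS:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def parse_vni(header):
    """Extract the VNI from the first 8 bytes of a VXLAN header."""
    _flags_word, vni_word = struct.unpack("!II", header[:8])
    return vni_word >> 8


def transport_mtu_needed(guest_mtu=1500):
    """IP MTU the VTEP interface needs for a given guest MTU (IPv4, no inner VLAN tag)."""
    inner_ethernet = 14  # the guest's Ethernet header travels inside the tunnel
    outer_ipv4, outer_udp, vxlan = 20, 8, 8
    return guest_mtu + inner_ethernet + vxlan + outer_udp + outer_ipv4


print(2 ** VNI_BITS)                          # 16777216 possible segments
print(parse_vni(build_vxlan_header(5000)))    # 5000
print(transport_mtu_needed())                 # 1550
```

Under these assumptions a standard 1500 byte guest MTU needs 1550 bytes on the transport network, so the 1600 MTU requirement leaves some headroom.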

With all the terminology out of the way, it's time to get down to the path. The NSX virtual switch includes one key capability: the ability to switch packets between nodes or clusters without having layer 2 stretched between the clusters. For my networking friends, this means a reduction in spanning tree issues.

So let me lay it out below:


We have a three-node NSX control cluster that has been deployed. We have two ESXi hosts running dVSs with the NSX virtual switch. VXLAN has been enabled and a virtual network, VNI:5000, has been created. The VTEPs have been configured. We have created two virtual machines as shown in green. Neither has been connected to the VNI network yet.

 

Time to learn our first MAC:

  • We connect the Windows server to VNI:5000 as shown below
  • The MAC table on our local switch is updated (learns), then passes its learned information to the control cluster
  • The control cluster passes it to all members of the logical switch (there are three methods to pass the information: unicast, multicast and hybrid, which I will cover in another post)


 

This syncing of the MAC table ensures that each member of the VNI knows how to handle switching, creating a distributed switch (like a switch stack that has multiple switches that act as one).
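The learn-then-distribute flow can be modeled roughly like this. It is a conceptual sketch only and says nothing about the real NSX controller protocol; the class names `ControlCluster` and `HostSwitch`, their methods, and the VTEP addresses 172.16.0.21 and 172.16.0.22 are all assumptions made for the illustration.

```python
class ControlCluster:
    """Toy stand-in for the control cluster: one MAC table per VNI."""

    def __init__(self):
        self.tables = {}   # vni -> {mac: vtep_ip}
        self.members = {}  # vni -> list of HostSwitch objects

    def join(self, vni, host):
        """Register a host on a logical switch and replay what is already known."""
        self.members.setdefault(vni, []).append(host)
        for mac, vtep_ip in self.tables.get(vni, {}).items():
            host.install(vni, mac, vtep_ip)

    def report(self, vni, mac, vtep_ip):
        """Record a learned MAC and push it to every member of the logical switch."""
        self.tables.setdefault(vni, {})[mac] = vtep_ip
        for host in self.members.get(vni, []):
            host.install(vni, mac, vtep_ip)


class HostSwitch:
    """Toy NSX virtual switch on one ESXi host."""

    def __init__(self, name, vtep_ip, controller):
        self.name = name
        self.vtep_ip = vtep_ip
        self.controller = controller
        self.mac_table = {}  # (vni, mac) -> IP of the VTEP that hosts the MAC

    def install(self, vni, mac, vtep_ip):
        self.mac_table[(vni, mac)] = vtep_ip

    def connect_vm(self, vni, mac):
        # Learn locally first, then tell the control cluster.
        self.install(vni, mac, self.vtep_ip)
        self.controller.report(vni, mac, self.vtep_ip)


controller = ControlCluster()
esxi1 = HostSwitch("ESXi1", "172.16.0.21", controller)
esxi2 = HostSwitch("ESXi2", "172.16.0.22", controller)
controller.join(5000, esxi1)
controller.join(5000, esxi2)

# Connecting the Windows server on ESXi1 makes its MAC known on ESXi2 as well.
esxi1.connect_vm(5000, "00:00:00:00:01:01")
print(esxi2.mac_table)
```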

When we power on the Linux server the same method is used:

  • We connect the Linux server to VNI:5000 as shown below
  • The MAC table on our local switch is updated (learns), then passes its learned information to the control cluster
  • The control cluster passes it to all members of the logical switch (there are three methods to pass the information: unicast, multicast and hybrid, which I will cover in another post)


Now we have an ARP table available on each switch, and that works great. Let's follow the flow of communication. Assume the Windows server wants to open a web page on the Linux server on port 80:

  • A user on the Windows server brings up Internet Explorer and types in 192.168.10.11
  • The Windows server sends out an ARP request for 192.168.10.11
  • ESXi1's virtual switch returns the MAC address 00:00:00:00:02:02
  • The Windows server sends out an IP packet with the destination MAC address 00:00:00:00:02:02
  • ESXi1's virtual switch forwards the packet out VTEP1 by encapsulating it, destined for the IP of VTEP2
  • VTEP2 opens the packet, removes the VXLAN encapsulation and forwards the packet to the ESXi2 virtual switch on VNI:5000
  • The switch on ESXi2 sends the packet to the virtual port that the Linux server's network card is connected to.
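The encapsulate, tunnel and decapsulate steps above can be traced with a small model. This is a sketch under stated assumptions rather than the actual NSX data path: the dataclasses, function names, VTEP addresses (172.16.0.21 and 172.16.0.22) and the port name `vport-linux` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    src_mac: str
    dst_mac: str
    payload: bytes


@dataclass(frozen=True)
class VxlanPacket:
    outer_src_ip: str  # sending VTEP
    outer_dst_ip: str  # receiving VTEP
    vni: int
    inner: Frame       # the original frame, carried unchanged


def encapsulate(frame, vni, local_vtep, mac_to_vtep):
    """Sending host: look up the VTEP that hosts the destination MAC and wrap the frame."""
    remote_vtep = mac_to_vtep[(vni, frame.dst_mac)]
    return VxlanPacket(local_vtep, remote_vtep, vni, frame)


def decapsulate(packet, local_vtep, local_ports):
    """Receiving host: strip the outer headers and pick the local virtual port."""
    if packet.outer_dst_ip != local_vtep:
        raise ValueError("packet was not addressed to this VTEP")
    port = local_ports[(packet.vni, packet.inner.dst_mac)]
    return port, packet.inner


# State on ESXi1 after the MAC tables have been synced.
esxi1_mac_to_vtep = {(5000, "00:00:00:00:02:02"): "172.16.0.22"}
# State on ESXi2: which virtual port the Linux server is plugged into.
esxi2_local_ports = {(5000, "00:00:00:00:02:02"): "vport-linux"}

frame = Frame("00:00:00:00:01:01", "00:00:00:00:02:02", b"GET / HTTP/1.1")
packet = encapsulate(frame, 5000, "172.16.0.21", esxi1_mac_to_vtep)
port, delivered = decapsulate(packet, "172.16.0.22", esxi2_local_ports)
print(port, delivered == frame)  # vport-linux True
```

The point the model tries to show is that the inner frame arrives untouched; only the outer addressing between the two VTEPs is visible to the physical network.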

 

This is how an NSX virtual switch handles switching. At first you may say this makes no sense at all… wouldn't a VLAN just be easier? There are a number of benefits this brings:

  • Limits your spanning tree to potentially only the top-of-rack switches, if architected correctly
  • Allows you to expand past the 4096 VLAN limit
  • Opens the door for other NSX services (which I will post about in the future.)

 

As I mentioned, this is just my understanding; I do not have inside knowledge. If I have made a mistake, let me know and I'll test, then correct it.

© 2014, Joseph Griffiths. All rights reserved.

17 Thoughts to “Deep Dive: How does the NSX vSwitch Work”

  1. Hi Joseph,
    great post! Just one point. I think there is a small typo in the MAC tables in the last picture.
    The IP address for entry 2 with MAC 00:00:00:00:02:02 should be 192.168.10.11.
    Regards,
    Thomas

  2. RE: Traditional switches: “If the MAC address is unknown by the switch it will forward the message out it’s default gateway.” If the MAC address is unknown, it will flood it out all ports except the one it came into. This is default switch/bridge behavior for the last 20+ years…

  3. ” It is critical to understand that ARP does not return the MAC address of the final destination it only returns the mac address of the next hop. This is because ethernet is only concerned with next hop via mac not end destination.”

    This is not totally correct as it assumes the other device is always on another IP subnet. It does not consider if the other device is on the same subnet in which case ARP response will have the actual MAC addr of the other end device.

  4. Great deep dive post Joseph. I believe that when the source host (in your example, Windows) attempts to communicate with the destination host (in your example, Linux) via IP address and realizes that it is on a different subnet, it immediately ARPs for the default gateway. The host knows what network it is on so it can quickly compare the destination IP address and determine that it is a different network.

  5. Joseph

    I sincerely thank you because during last days I tried to find some good basic explanation of concepts behind NSX and I can definitively say that is the best. Very clear even for me that (unfortunately), have not deep network knowledge.
    Now I have a couple of questions:
    1) Step by step guide:
    Do you know a real step by step guide? I would start with a basic lab with some esxi hosts and NSX but I have not idea how to start. I mean: in a physical world, I could have just 2/3 esxi hosts connected to a couple of cisco switches and go on creating some vlans on the switch, create a cluster, some distributed switches (using vlan) and so on…
    Now…supposing I would play with NSX what is the first step? Is there some guide from scratch?
    2) In your opinion, is possible to create a decent NSX lab using vmware workstation?

    Best
    Marco

  6. Marco,

    Thanks for reading and your kind comments. I’ll attempt to address your questions:
    1. There are a number of guides available some of the easiest I have seen are on vcdx133.com (http://vcdx133.com/category/nsx/) Rene really takes some time to provide great articles. I would suggest that you start with the DFW on your current vSphere then work into routing, switching etc… Check out the order in Rene’s articles I think they will help. I’ll also see if time allows for me to publish more.
    2. Workstation will make it hard. If you can ensure that your base OS is able to handle jumbo frames (1600 MTU is a requirement) then it can work… anything that fragments the packets below 1600 MTU will break NSX. For NSX at home I am using a HP 24 gig switch connected to each ESXi host with dual 1GB ports. Then I use my home linksys router as the router. I then static route all 172.0.0.0/8 networks to NSX. It works well. You can do NSX with a single ESXi host and learn it very well; you don't need multiple hosts to learn NSX. For learning you might want to start with the VMware hands on labs at hol.vmware.com. They provide learning environments that you can play with and break.

    I hope it helps thanks for reading and commenting. Let me know if I can help more.

    Joseph

  7. joseph

    thanks a lot for your kind reply.
    Ok I’ll have a look at vcdx133.com anyway I can’t wait for your articles! 🙂
    I have been really impressed by your ability to explain difficult concepts in a really simple way. While you can find lot of easy and excellent articles that explain how to build from scratch a good vmware infrastructure lab, I didn’t find the same for NSX.

    bye
    Marco

  8. Hi Joseph,

    Great explanation! Two things.

    1) Are you sure your picture about ARP is correct? I would assume the ARP request is only about IP from the same subnet.

    2) When ENTRY1 is being distributed to ESXi2 I believe IP 172.16.0 22 should be distributed as well.

  9. Thanks for reading, and sorry about the delayed response. On #1 you are correct and I have corrected the post. On #2 you are correct that the VTEP that hosts the MAC is distributed to each member of the control cluster. The VTEP IPs are not part of the virtual routing and need to be VLAN backed, and they are distributed to each member of the transport zone. I personally don't know if the IP addresses or some internal pointer is used to denote their connection in the MAC table. My guess is it's an internal pointer that takes up less memory and space, but I don't know.
