Deep Dive: How does NSX Distributed routing work

As a continuation of my previous post, How does NSX virtual switch work, I am now writing about how routing works.   I should thank Ron Flax and Elver Sena for walking through this process with me.   Ron is learning about what I mean by knowledge transfer and being very patient with it.   This post will detail how routing works in NSX.  The more I learn about NSX the more it makes sense.  It is really a great product.  My posts are 100% about VMware NSX, not multi-hypervisor NSX.

How does routing work anyways

Much like my last post, I think it’s important to understand how routing works in a physical environment.   Simply put, when you want to go from one subnet to another subnet (a different layer 2 segment) you have to use a router.   The router receives a packet and has two choices (some routers have more but this is generic):

  • I know that destination network lives down this path and send it
  • I don’t know that destination network and forward it out my default gateway

IP packets continue to forward upward until a router knows how to deliver the IP packet.  It all sounds pretty simple.
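To make the two choices concrete, here is a minimal sketch of that decision in Python.  The route entries and the names `interface-A`, `interface-B` and `upstream-router` are made up for illustration; real routers hold far more state than this.

```python
import ipaddress

# Hypothetical routing table: (destination network, where to send it).
ROUTES = [
    (ipaddress.ip_network("192.168.10.0/24"), "interface-A"),
    (ipaddress.ip_network("192.168.20.0/24"), "interface-B"),
]
DEFAULT_GATEWAY = "upstream-router"  # hypothetical name


def next_hop(destination: str) -> str:
    """Return where a router would send a packet for `destination`."""
    address = ipaddress.ip_address(destination)
    # Collect every route whose network contains the destination.
    matches = [(net, hop) for net, hop in ROUTES if address in net]
    if matches:
        # Choice 1: a known network; prefer the most specific prefix.
        return max(matches, key=lambda m: m[0].prefixlen)[1]
    # Choice 2: unknown network; forward it out the default gateway.
    return DEFAULT_GATEWAY


print(next_hop("192.168.20.15"))  # interface-B
print(next_hop("8.8.8.8"))        # upstream-router
```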

Standardized routing protocols

It would be simple if someone placed an IP subnet at a location and never changed it.  We could all learn a static route to the location and never have it change.   Think of the internet like a freeway.  Every so often we need to do road construction; this may cause your journey to take longer using an alternate route, but you will still get there.  I am afraid that the world of IT is constant change, so standardized dynamic routing protocols were created to update routers about these changes: exterior gateway protocols such as BGP and interior gateway protocols such as OSPF.  I will not go into these protocols because I have a limited understanding and because it’s not relevant for the topic today (it will be relevant later).   It’s important to understand that routes can change and there is an orderly way of updating them (think DNS for routing.. sorta).

IP

A key component of routing is the Internet Protocol (IP) address.  This is a unique address that we all use every single day.   There are public IP addresses and internal (private) IP addresses.  NSX can use either and has solutions for bridging both.  This article will use two subnets, 192.168.10.0/24 and 192.168.20.0/24.   The /24 after the IP address denotes the size of the range in CIDR notation.  For this article it’s enough to note that these ranges are on different layer 2 segments and normally cannot talk to each other without a router.
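If CIDR notation is new to you, Python’s standard `ipaddress` module is a quick way to poke at it.  This is only a small illustration; the address 192.168.10.10 for a virtual machine is my assumption, not something taken from a real configuration.

```python
import ipaddress

net_a = ipaddress.ip_network("192.168.10.0/24")
net_b = ipaddress.ip_network("192.168.20.0/24")

# A /24 leaves 8 host bits, so each range holds 2**8 = 256 addresses.
print(net_a.num_addresses)  # 256

windows = ipaddress.ip_address("192.168.10.10")  # hypothetical VM address
print(windows in net_a)       # True
print(windows in net_b)       # False
print(net_a.overlaps(net_b))  # False: two separate subnets
```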

Setup

We are going to place these two networks on different VXLAN-backed segments, each identified by a VXLAN network identifier (VNI), as shown below:

[Diagram]

If you are struggling with the term VNI just replace it with VLAN and it’s about the same thing (read more about the differences in my last post).   In this diagram we see that each virtual machine is connected to its own layer 2 segment and will not be able to talk to the other or to anything else.  We could deploy another virtual machine into VNI 5000 with the address 192.168.10.11 and they would be able to talk using NSX switching, but no crossing from VNI 5000 to VNI 5001 will be allowed.  In a physical environment a router would be required to allow this communication.  In NSX this is also true:

[Diagram]

Both of the networks shown would set their default gateway to be the router.  Notice the use of a distributed router.  In a physical environment this would be a single router or a cluster.  NSX uses a distributed router.  Its capability scales up as you scale up your environment: each time you add a server you get more routing capacity.

Where does the distributed router live?

This was a challenge for me when I first started working with NSX; I thought everything was a virtual machine.   The NSX vSwitch is really just a code extension of the dVS.  This is also true of the router.  The hypervisor kernel does the routing with minimal overhead.  This provides an optimal path for data: if the destination is on the same machine, communication never leaves the machine (much like switching in a normal VSS).  The data plane for the router lives on the dVS.    There are a number of components to consider:

  • Distributed router – code that lives on each ESXi host as part of the dVS that handles routing.
  • NSX Routing Control VM – this virtual machine controls aspects of routing (such as BGP peering).  It is in the control plane, not the data plane (in other words, it is not required to forward routed traffic).  (Design Tip: You can make it highly available by clicking the HA button at any time; this will create another VM with an anti-affinity rule)
  • NSX Control cluster – This is the control cluster mentioned in my last post.   It syncs configuration from the management and control plane elements to the data plane.

How does NSX routing work?

Here is the really neat part.  A router’s job is to deliver IP packets.  It is not concerned with whether the packets should be delivered; it just flings IP packets to their destination.   So let’s go through a basic routing situation in NSX.  Assume that the Windows virtual machine wants to talk to the Linux virtual machine.

[Diagram]

 

The process is like this:

  1. The local L3 router becomes aware of each virtual machine as it talks out and updates the control cluster with an ARP entry including the VNI and ESXi node
  2. The control cluster updates all members of the same transport zone so everyone knows the ARP entries
  3. The Windows virtual machine wants to visit the website on Linux, which is on a different subnet, so it ARPs for its default gateway
  4. ESXi1’s DLR (Distributed Logical Router) returns its own MAC address
  5. Windows sends a packet to ESXi1’s DLR
  6. The local DLR knows that Linux is on VNI 5001 so it routes the packet to the local VNI 5001 on ESXi1
  7. The switch on ESXi1 knows that Linux lives on ESXi2 so it sends the packet to VTEP1
  8. VTEP1 sends the packet to VTEP2
  9. VTEP2 drops the packet into VNI 5001 and Linux gets the message
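The steps above can be sketched as a toy model in Python.  This is my own simplification and not NSX code: the `VM` class, the `INVENTORY` table and the IP addresses are all hypothetical stand-ins for the information the control cluster shares with every host.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    name: str
    ip: str
    vni: int
    host: str


# Hypothetical inventory, standing in for what the control cluster
# distributes to every host in the transport zone (steps 1 and 2).
INVENTORY = {
    "192.168.10.10": VM("Windows", "192.168.10.10", 5000, "ESXi1"),
    "192.168.20.10": VM("Linux", "192.168.20.10", 5001, "ESXi2"),
}


def trace(src_ip: str, dst_ip: str) -> list[str]:
    """List the hops a packet takes from src_ip to dst_ip."""
    src, dst = INVENTORY[src_ip], INVENTORY[dst_ip]
    hops = []
    if src.vni != dst.vni:
        # Different segment: the DLR on the *source* host does the routing.
        hops.append(f"{src.name} sends the packet to the DLR on {src.host}")
        hops.append(f"DLR on {src.host} routes VNI {src.vni} -> VNI {dst.vni}")
    if src.host != dst.host:
        # Only now does the packet leave the host, carried between VTEPs.
        hops.append(f"VTEP on {src.host} sends to VTEP on {dst.host}")
    hops.append(f"switch on {dst.host} delivers to {dst.name} on VNI {dst.vni}")
    return hops


for hop in trace("192.168.10.10", "192.168.20.10"):
    print(hop)
```

Notice that the VTEP hop only appears when the two virtual machines are on different hosts, which is the whole point of doing the routing locally.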

It really makes sense if you think about it.  It works just like almost any router or switch you have ever used.  You just have to get used to the distributed nature.   The greatest strength of NSX is the ability to handle everything locally.   If Linux were on the same ESXi host then the packet would never leave ESXi1 to get to Linux.

What is the MAC address and IP address of the DLR?

Here is where the fun begins.   It is the same on each host:

[Diagram]

Yep, it’s not a typo: each local router is seen as the default gateway for each VNI.  Since the layer 2 networking is done over VXLAN (via the VTEPs) each local router can have the same IP address and MAC address.  The kernel code knows to route locally and it all works.   This does present one problem: external access.

External Access

In order for your network to be accessible from external networks the DLR has to present the default gateway outside, but if each instance has the same IP / MAC, who responds to requests to route traffic?   One instance gets elected as the designated instance (DI) and answers all requests.   If a message needs to be sent to an ESXi host other than the one running the DI then it routes like above.   It’s a simple but great process that works.

Network Isolation

What if your designated instance becomes isolated?  There is an internal heartbeat that, if not responded to, will cause a new DI election to happen.   What if networking fails on my ESXi host?  Well, then every other instance will continue to communicate with everyone else; packets destined for the failed host will fail.

Failure of the control cluster

What if the control cluster fails?   Well, since all the routing is distributed and held locally everything will continue to operate.  Any new elements in the virtual world may fail but everything else will be good.  It’s a good idea to ensure that you have enough control cluster nodes and redundancy as they are a critical component of the control plane.

Deep Dive: How does the NSX vSwitch Work

Edit: Thanks to Ron Flax and Todd Craw for helping me correct some errors.

I have been blessed of late to be involved in some VMware NSX deployments and I am really excited about the technology.   I am by no means a master of NSX but I will post about my understanding as a method to spread information and assist with my personal learning.   In this post I will be covering only the switch capabilities of NSX.

 

Traditional Switches

The key element of a layer 2 ethernet switch is the MAC address.  This is a unique (perhaps)  identifier on a network card.  Each network adapter should have a unique address.   A traditional physical switch learns the mac addresses connected on each port when the network device first tries to communicate.  For example:

[Diagram]

When you power on the Windows physical server the physical switch learns that MAC 00:00:00:00:01:01 is connected to port 1.  Any message destined for 00:00:00:00:01:01 should be sent to port 1.   This allows the switch to create logical connections between ports and limit the amount of wasted traffic.   This entry in the switch’s MAC table (sometimes called a CAM table) stays present for 5 minutes (user configurable) and is refreshed whenever the server uses its network card.   The Linux server on port 2 is discovered exactly the same way, by physically talking on the port, and the table is updated for port 2.   If Windows wants to talk to Linux their communication never leaves the switch as long as they are in the same subnet.   If the destination MAC address is unknown to the switch it will flood the frame out all other ports with hopes that something will respond.
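Here is a rough sketch of that learning behaviour in Python.  It is a toy model under my own assumptions (the `LearningSwitch` class and a four port switch are invented for the example), with the 5 minute aging time expressed as 300 seconds.

```python
class LearningSwitch:
    """Toy layer 2 switch: learns source MACs, floods unknown destinations."""

    def __init__(self, ports, aging_seconds=300):
        self.ports = list(ports)
        self.aging_seconds = aging_seconds
        self.mac_table = {}  # MAC -> (port, time last seen)

    def receive(self, in_port, src_mac, dst_mac, now):
        """Return the list of ports the frame is sent out of."""
        # Learn (or refresh) which port the sender lives on.
        self.mac_table[src_mac] = (in_port, now)
        # Age out entries that have not been refreshed in time.
        self.mac_table = {
            mac: (port, seen)
            for mac, (port, seen) in self.mac_table.items()
            if now - seen <= self.aging_seconds
        }
        if dst_mac in self.mac_table:
            return [self.mac_table[dst_mac][0]]
        # Unknown destination: flood out every port except the ingress one.
        return [p for p in self.ports if p != in_port]


switch = LearningSwitch(ports=[1, 2, 3, 4])
WINDOWS, LINUX = "00:00:00:00:01:01", "00:00:00:00:02:02"
print(switch.receive(1, WINDOWS, LINUX, now=0))  # [2, 3, 4] (flooded)
print(switch.receive(2, LINUX, WINDOWS, now=1))  # [1]
print(switch.receive(1, WINDOWS, LINUX, now=2))  # [2]
```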

Address Resolution Protocol (ARP)

ARP is a protocol used to resolve IP addresses to their MAC addresses.  It is critical to understand that ARP does not return the MAC address of the final destination; it only returns the MAC address of the next hop, unless the final destination is on the same subnet.  This is because Ethernet is only concerned with the next hop via MAC, not the end destination.

[Diagram]

You can follow the communication with ARPs between each layer of the diagram.  The key point is that if the IP is not local then the sender ARPs for its default gateway, the gateway returns its own MAC, and the packet is forwarded through the gateway.
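The rule a host uses to decide who to ARP for can be written in a few lines.  This is a sketch only: the function name `arp_target` and the gateway address 192.168.10.1 are my assumptions for the example.

```python
import ipaddress


def arp_target(source_interface: str, destination: str, gateway: str) -> str:
    """Return the IP address a host ARPs for when sending to `destination`."""
    local_network = ipaddress.ip_interface(source_interface).network
    if ipaddress.ip_address(destination) in local_network:
        # Same subnet: ARP for the destination itself.
        return destination
    # Different subnet: ARP for the next hop, the default gateway.
    return gateway


# Hypothetical host 192.168.10.10/24 with gateway 192.168.10.1
print(arp_target("192.168.10.10/24", "192.168.10.11", "192.168.10.1"))
# 192.168.10.11
print(arp_target("192.168.10.10/24", "192.168.20.10", "192.168.10.1"))
# 192.168.10.1
```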

Traditional Virtual Switches

In order to understand the NSX vSwitch it is critical that you understand how the traditional virtual switch works.  In a traditional virtual switch (VSS and dVS) the switch learns the MAC addresses of virtual machines when they are powered on.  As soon as a virtual machine is assigned a switch port it becomes hard-coded in the MAC table for that virtual switch.   Anything that is local to that switch in the same VLAN or segment will be delivered locally.    Otherwise the virtual switch just forwards the message out its uplink and allows the physical switches to resolve the connection.

NSX Virtual Switch

The NSX virtual switch adds functionality on top of the traditional virtual switch.  The key feature is the ability to use VXLAN to span layer 2 segments between hosts without the use of multiple stretched VLANs.   VXLAN also allows you to stretch layer 2 to distant datacenters and supports up to 16 million segments vs the current limit of 4096 VLANs.  There are some common components that need to be understood:

  • VTEP (VXLAN Tunnel End Point)  – this is an ESXi virtual adapter that has its own VLAN and IP address including gateway.  This interface must be set for 1600 MTU and all physical switches/routers that handle this traffic must allow at least 1600 MTU.
  • NSX virtual switch (also called logical switch) – This is a software kernel based construct that does the heavy lifting. This is deployed to a dVS switch and works as an extension to the dVS.
  • NSX Manager – This is the management plane for NSX, it acts as a central point for communication, scripting and control.  It is only required when making changes as part of the management plane.
  • NSX Control cluster – This is a series of virtual machines that are clustered via software.  Each node (there should be an odd number and at least three) contains all required information and load is distributed between all of them.  (Best Practice: Do a DRS rule to keep these on separate hosts, future releases may do this for you)
  • VNI (VXLAN network identifier) – this is an identifier used by VXLAN to separate networks (think VLAN tag).  In NSX they start at 5000 and go to roughly 16 million.  It is easiest for people to think of VLAN tags when working with VNIs.
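Two of the numbers in this list fall out of simple arithmetic, shown here as a short Python sketch.  The header sizes are the standard ones for VXLAN over IPv4 with no VLAN tag on the inner frame, and the 1500 byte guest MTU is my assumption of a typical virtual machine setting.

```python
GUEST_MTU = 1500     # assumed MTU inside the virtual machine
INNER_ETHERNET = 14  # the guest's own Ethernet header is carried too
VXLAN_HEADER = 8
OUTER_UDP = 8
OUTER_IPV4 = 20

# Size of the outer IP packet the VTEP has to send without fragmenting.
required_mtu = GUEST_MTU + INNER_ETHERNET + VXLAN_HEADER + OUTER_UDP + OUTER_IPV4
print(required_mtu)         # 1550
print(1600 - required_mtu)  # 50 bytes of headroom at 1600 MTU

# A VLAN ID is 12 bits wide, a VNI is 24 bits wide.
print(2 ** 12)  # 4096
print(2 ** 24)  # 16777216
```

So 1600 is a comfortable round number rather than an exact requirement, and the "16 million" figure is simply what a 24 bit identifier gives you.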

With all the terminology out of the way it’s time to get down to the path.   The NSX virtual switch includes one key capability: the ability to switch packets between nodes or clusters without having layer 2 stretched between the clusters.  For my networking friends this means a reduction in spanning tree issues.

So let me lay it out below:

[Diagram]

We have a three node NSX control cluster that has been deployed.  We have two ESXi hosts running dVS’s with the NSX virtual switch.  VXLAN has been enabled and a virtual network VNI:5000 has been created.   The VTEPs have been configured.   We have created two virtual machines as shown in green.  Neither has been connected to the VNI network yet.

 

Time to learn our first MAC:

  • We connect the Windows server to VNI:5000 as shown below
  • The MAC table on our local switch is updated (learns) then passes its learned information to the control cluster
  • The control cluster passes it to all members of the logical switch (there are three methods to pass the information which I will cover in another post unicast, multicast and hybrid)

[Diagram]

 

This syncing of the MAC table ensures that each member of the VNI knows how to handle switching, creating a distributed switch (like a switch stack that has multiple switches that act as one).
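As a rough mental model, the learn-and-sync loop looks something like the Python below.  The `ControlCluster` and `Host` classes are invented for this sketch and are not NSX APIs; for simplicity the toy controller pushes every entry to every registered host instead of only the members of the logical switch, and it ignores the unicast, multicast and hybrid distribution methods entirely.

```python
class ControlCluster:
    """Toy controller: collects MAC reports and pushes them to every host."""

    def __init__(self):
        self.hosts = []

    def register(self, host):
        self.hosts.append(host)

    def report(self, vni, mac, vtep):
        # Share the learned entry so every host knows which VTEP owns the MAC.
        for host in self.hosts:
            host.mac_table[(vni, mac)] = vtep


class Host:
    def __init__(self, name, vtep, controller):
        self.name = name
        self.vtep = vtep
        self.mac_table = {}  # (VNI, MAC) -> VTEP that owns the MAC
        self.controller = controller
        controller.register(self)

    def connect_vm(self, vni, mac):
        # Learn locally, then hand the entry to the control cluster.
        self.mac_table[(vni, mac)] = self.vtep
        self.controller.report(vni, mac, self.vtep)


controller = ControlCluster()
esxi1 = Host("ESXi1", "VTEP1", controller)
esxi2 = Host("ESXi2", "VTEP2", controller)
esxi1.connect_vm(5000, "00:00:00:00:01:01")
print(esxi2.mac_table)  # {(5000, '00:00:00:00:01:01'): 'VTEP1'}
```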

When we power on the Linux server the same method is used:

  • We connect the Linux server to VNI:5000 as shown below
  • The MAC table on our local switch is updated (learns) then passes its learned information to the control cluster
  • The control cluster passes it to all members of the logical switch (there are three methods to pass the information which I will cover in another post unicast, multicast and hybrid)

[Diagram]

Now we have an ARP table available on each switch that works great.   Let’s follow the flow of communication.  Assume the Windows server wants to open a web page on the Linux server on port 80:

  • User on Windows server brings up Internet Explorer and types in 192.168.10.11
  • Windows server sends out an ARP request for 192.168.10.11
  • ESXi1’s virtual switch returns the MAC address 00:00:00:00:02:02
  • Windows server sends out an IP packet with the destination MAC address of 00:00:00:00:02:02
  • ESXi1’s virtual switch forwards the packet out VTEP1 by encapsulating it, destined for the IP of VTEP2
  • VTEP2 opens the packet, removes the VXLAN encapsulation and forwards the packet to the ESXi2 virtual switch on VNI:5000
  • The switch on ESXi2 sends the packet to the virtual port that the Linux server’s network card is connected on.
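The encapsulation and decapsulation steps above can be illustrated with the 8 byte VXLAN header itself.  This is a minimal sketch that follows the header layout in RFC 7348; the function names are mine, and the outer Ethernet, IP and UDP headers that VTEP1 would also add are left out to keep it short.

```python
import struct


def vxlan_header(vni: int) -> bytes:
    """Build the 8 byte VXLAN header for a given VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    flags = 0x08  # "I" flag: the VNI field is valid
    # Layout: 1 byte flags, 3 reserved bytes, 3 bytes VNI, 1 reserved byte.
    return struct.pack("!II", flags << 24, vni << 8)


def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """What VTEP1 does (minus the outer Ethernet/IP/UDP headers)."""
    return vxlan_header(vni) + inner_frame


def decapsulate(payload: bytes) -> tuple[int, bytes]:
    """What VTEP2 does: recover the VNI and the original frame."""
    _flags_word, vni_word = struct.unpack("!II", payload[:8])
    return vni_word >> 8, payload[8:]


frame = b"original ethernet frame from Windows"  # placeholder bytes
packet = encapsulate(frame, 5000)
print(packet[:8].hex())     # 0800000000138800
print(decapsulate(packet))  # (5000, b'original ethernet frame from Windows')
```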

 

This is how an NSX virtual switch handles switching.  At first you may say this makes no sense at all… wouldn’t a VLAN just be easier?   There are a number of benefits this brings:

  • Limits your Spanning tree to potentially top of rack switches if architected correctly
  • Allows you to expand past the 4096 VLAN limit
  • Opens the door for other NSX services (which I will post about in the future.)

 

As I mentioned, this is just my understanding; I do not have inside knowledge.  If I have made a mistake let me know and I’ll test then correct it.

Central Ohio VMware Lunch and Learn

I have been toying with the idea of starting a community series of lunch and learn sessions to assist people in learning about VMware technology.   I am happy to announce that the first session will be Sep. 25th at Noon at:

 

OARnet – Bale Conference Room

1224 Kinnear Road

Columbus, OH 43212

 

It was very kind of my previous employer to be willing to host us for these sessions.   I am excited to announce that VMware education has also provided some certification discount codes for me to pass out.   The format will be a bit loose.  I will be focusing on VCP content but it will be open to discussion.   I want it to be a forum.   I have also invited others to present in the future and hope to make it a monthly occurrence.   The topic for this month will be vSphere networking.  It will be a great refresher course for anyone looking to study for the VCP-NV.    The event is 100% open to the public and we have seating for about 60 people.  There is standing room for about 40 more.   Bring your lunch and join us.  Feel free to contact me via comments or twitter if you have questions or would like to present a future topic.  The one request I have is that this is a technical conversation, not a sales pitch.  I want it to be a discussion between technical people.

 

Looking forward to seeing you there.