Deep Dive: How does NSX Distributed routing work

As a continuation of my previous post, How does NSX virtual switch work, I am now writing about how the routing works. I should thank Ron Flax and Elver Sena for walking through this process with me. Ron is learning about what I mean by knowledge transfer and being very patient with it. This post will detail how routing works in NSX. The more I learn about NSX the more it makes sense. It is really a great product. My posts are 100% about VMware NSX, not multi-hypervisor NSX.

How does routing work anyways

Much like my last post, I think it’s important to understand how routing works in a physical environment. Simply put, when you want to go from one subnet to another subnet (layer 2 segment) you have to use a router. The router receives a packet and has two choices (some routers have more but this is generic):

I know the destination network lives down this path, so I send it that way
I don’t know the destination network, so I forward it to my default gateway

IP packets continue to forward upward until a router knows how to deliver the IP packet. It all sounds pretty simple.
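
The two choices above can be sketched in a few lines of Python. This is only an illustration of the generic decision, not how any particular router is implemented; the interface names (eth1, eth2) and the default gateway address 10.0.0.1 are made up for the example.

```python
import ipaddress

def next_hop(routes, default_gateway, destination):
    """Return where a router would send a packet for `destination`.

    routes: dict mapping ip_network -> outgoing path (hypothetical names).
    default_gateway: where to send anything we have no route for.
    """
    dest = ipaddress.ip_address(destination)
    # Choice 1: we know a network that contains the destination.
    matches = [(net, hop) for net, hop in routes.items() if dest in net]
    if matches:
        # If several networks match, prefer the most specific one.
        _, hop = max(matches, key=lambda m: m[0].prefixlen)
        return hop
    # Choice 2: we do not know the network, hand it to the default gateway.
    return default_gateway

# Hypothetical routing table for illustration only.
ROUTES = {
    ipaddress.ip_network("192.168.10.0/24"): "eth1",
    ipaddress.ip_network("192.168.20.0/24"): "eth2",
}

print(next_hop(ROUTES, "10.0.0.1", "192.168.20.10"))
print(next_hop(ROUTES, "10.0.0.1", "8.8.8.8"))
```

The first call matches a known network and goes out that path; the second has no match and is handed to the default gateway, which repeats the same decision one hop further up.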

Dynamic routing protocols

It would be simple if someone placed an IP subnet at a location and never changed it. We could all learn a static route to the location and never have it change. Think of the internet like a freeway. Every so often we need to do road construction. This may cause your journey to take longer using an alternate route, but you will still get there. I am afraid that the world of IT is constant change, so standardized dynamic routing protocols were created to update routers about these changes (BGP, an exterior gateway protocol, OSPF, an interior gateway protocol, etc.). I will not go into these protocols because I have a limited understanding and because it’s not relevant for the topic today (it will be relevant later). It’s important to understand that routes can change and there is an orderly way of updating (think DNS for routing.. sorta).

A key component of routing is the Internet Protocol (IP) address. This is a unique address that we all use every single day. There are public IP addresses and internal IP addresses. NSX can use either and has solutions for bridging both. This article will use two subnets, 192.168.10.0/24 and 192.168.20.0/24. The /24 after the IP address denotes the size of the range in CIDR notation. For this article it’s enough to note that these ranges are on different layer 2 segments and normally cannot talk to each other without a router.
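
If the CIDR notation is unfamiliar, Python's standard ipaddress module can show what a /24 means for the two subnets in this article. The two virtual machine addresses used here (192.168.10.10 and 192.168.20.10) are assumptions picked for illustration.

```python
import ipaddress

net_a = ipaddress.ip_network("192.168.10.0/24")
net_b = ipaddress.ip_network("192.168.20.0/24")

# A /24 leaves 8 bits for hosts: 256 addresses, netmask 255.255.255.0.
print(net_a.num_addresses)
print(net_a.netmask)

# Assumed example addresses for the two virtual machines.
windows = ipaddress.ip_address("192.168.10.10")
linux = ipaddress.ip_address("192.168.20.10")

# Windows is inside the first subnet, Linux is not.
print(windows in net_a, linux in net_a)

# The two ranges do not overlap, so traffic between them needs a router.
print(net_a.overlaps(net_b))
```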

Setup

We are going to place these two networks on different VXLAN-backed segments, each identified by a VXLAN Network Identifier (VNI), as shown below:

If you are struggling with the term VNI just replace it with VLAN and it’s about the same thing (read more about the differences in my last post). In this diagram we see that each virtual machine is connected to its own layer 2 segment and will not be able to talk to the other or to anything else. We could deploy another virtual machine into VNI 5000 with the address 192.168.10.11 and the two would be able to talk using NSX switching, but no crossing from VNI 5000 to VNI 5001 will be allowed. In a physical environment a router would be required to allow this communication. In NSX this is also true:

Both of the networks shown would set their default gateway to be the router. Notice the use of distributed router. In a physical environment this would be a single router or a cluster. NSX uses a distributed router. Its capability scales up as you scale up your environment: each time you add a server you get more routing capacity.

Where does the distributed router live?

This was a challenge for me when I first started working with NSX: I thought everything was a virtual machine. The NSX vSwitch is really just a code extension of the dVS. This is also true of the router. The hypervisor kernel does the routing with minimal overhead. This provides an optimal path for data: if the source and destination are on the same machine, communication never leaves the machine (much like switching in a normal VSS). The data plane for the router lives on the dVS. There are a number of components to consider:

Distributed router – code that lives on each ESXi host as part of the dVS that handles routing.
NSX Routing Control VM – a virtual machine that controls aspects of routing (such as BGP peering). It is in the control plane, not the data plane (in other words, it is not required to do routing). (Design Tip: You can make it highly available by clicking the HA button at any time; this will create another VM with an anti-affinity rule.)
NSX Control cluster – This is the control cluster mentioned in my last post. It syncs configuration from the management and control plane elements to the data plane.

How does NSX routing work?

Here is the really neat part. A router’s job is to deliver IP packets. It is not concerned with whether the packets should be delivered; it just flings packets to their destination. So let’s go through a basic routing situation in NSX. Assume that the Windows virtual machine wants to talk to the Linux virtual machine.

The process is like this:

The local L3 router (the DLR, Distributed Logical Router) becomes aware of each virtual machine as it talks out and updates the control cluster with an ARP entry including VNI and ESXi node
The control cluster updates all members of the same transport zone so everyone knows the ARP entries
The Windows virtual machine wants to visit the website on Linux, so it ARPs for its default gateway
ESXi1’s DLR returns its own MAC address
Windows sends a packet to ESXi1’s DLR
The local DLR knows that Linux is on VNI 5001, so it routes the packet to the local VNI 5001 on ESXi1
The switch on ESXi1 knows that Linux lives on ESXi2, so it sends the packet to VTEP1
VTEP1 sends the packet to VTEP2
VTEP2 drops the packet into VNI 5001 and Linux gets the message
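
The walk above can be mimicked with a toy trace in Python. This is a sketch of the decision flow only, not NSX code: the VM class, the ARP_TABLE dictionary and the IP addresses are all assumptions made for the example, and the real tables hold more than this.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VM:
    name: str
    ip: str
    vni: int
    host: str

# Assumed inventory matching the scenario in this post.
VMS = [
    VM("Windows", "192.168.10.10", 5000, "ESXi1"),
    VM("Linux", "192.168.20.10", 5001, "ESXi2"),
]

# Stand-in for the entries the control cluster shares with every host.
ARP_TABLE = {vm.ip: vm for vm in VMS}

def trace(src_ip, dst_ip):
    """Return the list of hops a packet takes from src_ip to dst_ip."""
    src, dst = ARP_TABLE[src_ip], ARP_TABLE[dst_ip]
    steps = []
    if src.vni == dst.vni:
        steps.append(f"switch on {src.host}: same VNI {src.vni}, no routing needed")
    else:
        # Routing always happens on the source host, in the kernel.
        steps.append(f"DLR on {src.host}: route VNI {src.vni} -> VNI {dst.vni}")
    if src.host == dst.host:
        steps.append(f"switch on {src.host}: deliver locally to {dst.name}")
    else:
        steps.append(f"VTEP on {src.host}: encapsulate and send to VTEP on {dst.host}")
        steps.append(f"VTEP on {dst.host}: drop into VNI {dst.vni}, deliver to {dst.name}")
    return steps

for step in trace("192.168.10.10", "192.168.20.10"):
    print(step)
```

Moving the Linux entry to ESXi1 in the table makes the VTEP steps disappear from the trace, which is the "handle everything locally" behaviour.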

It really makes sense if you think about it. It works just like almost any router or switch you have ever used. You just have to get used to the distributed nature. The greatest strength of NSX is the ability to handle everything locally. If Linux were on the same ESXi host, then the packet would never leave ESXi1 to get to Linux.

What is the MAC address and IP address of the DLR?

Here is where the fun begins. It is the same on each host:

Yep, it’s not a typo: each router is seen as the default gateway for each VNI. Since the layer 2 networking is done over VXLAN (via VTEP), each local router can have the same IP address and MAC address. The kernel code knows to route it locally and it all works. This does present one problem: external access.

External Access

In order for your network to be accessible from external networks the DLR has to present the default gateway to the outside, but if each instance has the same IP / MAC, who responds to requests to route traffic? One instance gets elected as the designated instance (DI) and answers all such requests. If a message needs to be sent to an ESXi host other than the one running the DI, then it is routed as described above. It’s a simple but great process that works.

Network Isolation

What if your designated instance becomes isolated? There is an internal heartbeat that, if not responded to, will cause a new DI election to happen. What if networking fails on my ESXi host? Well, then every other instance will continue to communicate with everyone else, but packets destined for the failed host will fail.
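
The heartbeat and re-election idea can be sketched as below. To be clear, this is not how NSX picks a designated instance; I do not describe the real election rule in this post, so the "first live host in sorted order wins" rule and the function names are purely hypothetical stand-ins.

```python
def elect_di(hosts, alive):
    """Pick a designated instance among hosts that answer the heartbeat.

    The selection rule (first live host in sorted order) is a made-up
    placeholder, not the real NSX election logic.
    """
    candidates = sorted(h for h in hosts if alive.get(h, False))
    return candidates[0] if candidates else None

def check_heartbeat(current_di, hosts, alive):
    """Keep the current DI while it answers, otherwise elect a new one."""
    if current_di is not None and alive.get(current_di, False):
        return current_di
    return elect_di(hosts, alive)

hosts = ["ESXi1", "ESXi2", "ESXi3"]
alive = {"ESXi1": True, "ESXi2": True, "ESXi3": True}

di = elect_di(hosts, alive)
print("initial DI:", di)

# The DI becomes isolated and stops answering the heartbeat.
alive[di] = False
di = check_heartbeat(di, hosts, alive)
print("DI after isolation:", di)
```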

Failure of the control cluster

What if the control cluster fails? Well, since all the routing is distributed and held locally, everything already in place will continue to operate. Any new elements in the virtual world may fail, but everything else will be good. It’s a good idea to ensure that you have enough controller nodes and redundancy, as they are a critical component of the control plane.

6 Replies to “Deep Dive: How does NSX Distributed routing work”

B Ram says:

October 23, 2014 at 8:28 pm

“Local LDR knows that Linux is on VNI 5001 so it routes the packet to the local VNI 5001 on ESXi1”

Are you stating that every ESXi host will have a local instance of every VNI in the DC? Please elaborate.

1. Joseph Griffiths says:
  
  October 27, 2014 at 10:26 am
  
  Thanks for the question. VNI 5001 (VXLAN Network Identifier) is stretched between all hosts in the same transport zone. The flow is just like this:
  1. Linux1 on VNI 5002 on ESXihost1 sends a message to Linux2 on VNI 5001
  2. Packet leaves Linux1 and goes to the vSwitch
  3. vSwitch knows the destination is not on VNI 5002 and sends the packet to the DLR
  4. DLR knows that Linux2 is on VNI 5001, so it drops the packet into VNI 5001 on ESXihost1
  5. The vSwitch on ESXihost1 knows that Linux2 lives on ESXiHost2, so it sends the packet to the VTEP to go to ESXiHost2
  6. The packet is VXLAN encapsulated and sent to ESXiHost2
  7. ESXiHost2 VTEP gets the packet and drops it into VNI 5001
  8. VNI 5001 delivers the packet to Linux2
  
  That’s the process for communication. Please let me know if you have additional questions.
  
  Thanks,
  Joseph
  
marco says:

December 7, 2017 at 4:51 pm

Hi Joseph

At the end of section “How does NSX routing work?” You wrote:

If Linux was on the same ESXi host then the packet would never leave ESXi1 to get to Linux.

Just a curiosity, even though I know that this probably doesn’t make much sense:

Suppose I have a non-NSX environment with ESXi1 having a DVS with a couple of port groups managing “traditional” VLANs, let’s say 600 and 700. I know that in this case the VMs’ traffic needs to be forwarded to the gateway, but by deploying a DLR, would it be theoretically possible to manage these VLANs in the same way as VXLANs, meaning that traffic between VLANs (on the same host) wouldn’t leave the host? What I know for sure is that this would be possible using something like vyos or another piece of software able to do routing, but I don’t know if this would be possible using NSX.
Again, maybe in a real environment this doesn’t make any sense, but it is just a theoretical question.

Best
Marco

1. Joseph Griffiths says:
  
  December 12, 2017 at 5:18 pm
  
  It is possible to do this by bridging the VLAN with a VXLAN and making the DLR the gateway for the VLAN. It’s not a great long term solution, as there are a number of impacts from this choice. For example, all traffic to the VLAN will go out a single ESXi host (the host that is running the control VM). Failure of the control VM will cause traffic to the VLAN to fail until it’s restored, so you want a failover control VM. Bridging exists for two use cases: physical entities that require VLAN access, or converting current VLANs into VXLANs without major interruption of services.
  
Tarun says:

October 9, 2018 at 1:39 pm

Hi joseph,
Thanks for the awesome article. My question is: what will happen if the Control VM is powered off? Will the routing already in place still work?

1. Let’s say I have an ESG running iBGP with the DLR. When traffic comes to the ESG, how will the ESG know where to route traffic if the DLR is powered off?
2. Will the DLR LIF work if the DLR control VM is powered off?

Tarun Gupta

1. joseph says:
  
  December 22, 2018 at 2:01 pm
  
  Great questions, sorry for the delayed response. It got caught in my spam filter.
  
  1. The DLR cannot be powered off because it’s distributed; it would require all participating ESXi hosts to be powered off. Now, if the control VM is powered off and not recovered, then control plane distributed routing will fail, and thus any traffic between the ESG and DLR would fail. There are a few workarounds. First, put the control VM in active/passive failover. Second, consider default routes for traffic if possible.
  
  2. DLR will continue to work without the control VM but upstream communication to the ESG may be interrupted depending on routing protocol.