Deep Dive: Configuration Maximums for dVS

Recently I have been thinking about the configuration maximums of the current vSphere distributed switches.  The Configuration Maximums document for 5.5 states the following:

– Total virtual network switch ports per host (VDS and VSS ports) – 4096
– Maximum active ports per host (VDS and VSS) – 1016
– Hosts per distributed switch – 1000
– Static/Dynamic port groups per distributed switch – 6500
– Ephemeral port groups per distributed switch – 1016
– Distributed virtual network switch ports per vCenter – 60000

The first question concerns the relationship between these two numbers:

– Total virtual network switch ports per host (VDS and VSS ports) – 4096
– Maximum active ports per host (VDS and VSS) – 1016

In order to explain these numbers you must have some context about how a vDS and VSS work and allocate ports:

  • Virtual standard switch (VSS) – allocates ports statically when a port group is created on the local ESXi host, so if you allocate 24 ports to a port group then 24 ports are taken.
  • Virtual distributed switch (dVS) – allocates ports to the port group in vCenter, but each individual ESXi host only allocates ports for its currently powered-on machines (assuming dynamic or static port binding). So if you create a dVS port group with 24 ports but there is only one virtual machine in the port group, it only takes one port on its assigned ESXi host.

Ephemeral ports on a dVS work just like a VSS, so each local ESXi host allocates all the ports in the port group.
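To make the difference concrete, here is a minimal sketch of the allocation model described above (the function name and numbers are hypothetical, not a VMware tool):

    # Hypothetical example: ports consumed on one ESXi host for a 24-port port group.
    # VSS / ephemeral: every port in the port group is allocated on the host.
    # dVS static/dynamic: only ports for VMs actually on that host are allocated.

    def host_ports_consumed(port_group_size, vms_on_this_host, binding):
        if binding in ("vss", "ephemeral"):
            return port_group_size          # all ports exist on the local host
        elif binding in ("static", "dynamic"):
            return vms_on_this_host         # one port per connected, powered-on VM
        raise ValueError("unknown binding type")

    print(host_ports_consumed(24, 1, "vss"))      # 24 ports taken locally
    print(host_ports_consumed(24, 1, "static"))   # 1 port taken locally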

 

What is a proxy switch?

Proxy or ghost switch is a term you may see around to reference the local copy of the vDS on each host.  The proxy switch contains only the information relevant to its virtual machines.  When you vMotion a virtual machine to a host, vCenter allocates a new port on that ESXi host and syncs a new proxy configuration to that switch alone.

What is the difference between an active port and total ports?

An active port is defined differently depending on the switch type:

  • VSS – every port in a port group is considered active on each ESXi host
  • dVS (static or dynamic binding) – only ports in use on the ESXi host are active
  • dVS (ephemeral binding) – all ports in the port group are allocated, and therefore active, on every ESXi host

 

So in order to hit the 4096 total ports you would need a combination of VSS and dVS ports.  When using a single dVS you will hit the 1016 active-port limit and never reach the 4096 total ports.

Let's look at some dVS switch maximums:

– Static/Dynamic port groups per distributed switch – 6500
– Ephemeral port groups per distributed switch – 1016

These are software limits: static and dynamic port groups are enforced by the dVS at vCenter and have no relationship to the ESXi hosts.  Ephemeral port groups have a hard limit of 1016, which aligns with the maximum number of active ports per host (that assumes 1016 port groups with a single port each).

How about the last set of numbers:

– Hosts per distributed switch – 1000
– Distributed virtual network switch ports per vCenter – 60000

Not much to say here.  The 60,000-port limit creates a boundary that may keep you from allocating 1,000 ports to every port group; it is per vCenter, not per dVS, so the limit spans multiple vDSs.
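As a quick sanity check, a back-of-the-napkin calculation (hypothetical numbers) shows how generous port group sizing eats into the vCenter-wide pool:

    # Hypothetical sizing check against the 60,000 dVS ports per vCenter limit.
    VC_PORT_LIMIT = 60000

    ports_per_port_group = 1000
    port_groups = 80

    total = ports_per_port_group * port_groups
    print(total, "ports requested;", "over the limit" if total > VC_PORT_LIMIT else "fits")
    # 80 x 1,000 = 80,000 ports, which blows past 60,000 even though
    # 80 port groups is nowhere near the 6,500 port group maximum.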

Best practices and design considerations:

Given that only active ports take memory on an ESXi host, there is no reason not to allocate larger port groups; then again, since port groups can be grown dynamically, there is also no reason not to keep them small.  I vote for something in between: it provides the best manageability without getting close to the maximums.

VMware NSX: how to firewall between IPs and related issues

The first thing everyone does with NSX is try to create firewall rules between IP addresses.  I consider this a mistake because the DFW can key off much better markers than IP addresses.  Either way, at some point you will want to use IP addresses in your rules.  This post will describe how to set up firewall rules between IP addresses.

 

Setup:

I have two Linux machines each on their own subnet:

Linux1 – 172.16.1.10 – 172.16.1.0/24 network

Linux3 – 172.16.10.10 – 172.16.10.0/24 network

Routing is set up between the hosts so they can connect to each other.  I would like to block all traffic except SSH between these subnets.  We are going to assume that both of these networks exist in NSX.

NSX Setup:

First we have to set up an IP set in NSX Manager.  This is, surprisingly enough, a set of IP addresses.

  • Log in to the vSphere web client
  • Click Networking & Security
  • Select your NSX Manager and expand it
  • Select Manage -> Grouping Objects
  • On the lower pane select IP Sets
  • Press the green plus button to add a new set
  • Set up each set as shown below (a scripted alternative is sketched after the screenshots):

[screenshots: one IP set per subnet]
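If you would rather script this step, the NSX-v REST API exposes IP sets. The sketch below is a hedged example: the endpoint path, XML fields, NSX Manager address and credentials are assumptions based on the NSX 6.x API guide, so verify them against your version before using it.

    # Hedged sketch: create an IP set via the NSX Manager REST API (NSX-v 6.x style).
    # The endpoint and XML schema below are assumptions - check your API guide.
    import requests

    NSX_MGR = "https://nsx-manager.lab.local"     # hypothetical address
    AUTH = ("admin", "password")                  # hypothetical credentials

    ipset_xml = """<ipset>
      <name>Subnet-172.16.1.0</name>
      <value>172.16.1.0/24</value>
    </ipset>"""

    resp = requests.post(
        f"{NSX_MGR}/api/2.0/services/ipset/globalroot-0",   # scope = globalroot-0
        auth=AUTH,
        data=ipset_xml,
        headers={"Content-Type": "application/xml"},
        verify=False,                                        # lab only - self-signed cert
    )
    print(resp.status_code, resp.text)   # the new IP set objectId comes back on success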

Tale of multiple cities:

Here is where NSX gets interesting: you have multiple ways to block access.  First, a little background on the firewall constructs in NSX:

  • Security Groups – groups of machines or other constructs; they can include IP sets, MAC sets, and dynamic name-based wildcard criteria.  They can contain whole datacenters or a single virtual machine, and membership can be very dynamic with boolean conditions.
  • Security Policies – groups of firewall rules and introspection services that are applied to security groups.  Each firewall policy assumes it is assigned to one or more security groups, so your source or destination needs to be the policy's assigned security group; the opposite side (source or destination) needs to be either another security group or any.

Remember we want the following rules:

  • SSH between 172.16.1.0/24 and 172.16.10.0/24 should be allowed bi-directional
  • Everything else between them should be blocked

Within these constructs there are a number of possible options for the firewalls:

  • Option 1 – rules in this order
    • Firewall rule allowing ssh between source: assigned policy group and destination: 172.16.10.0/24
    • Firewall rule allowing ssh between source: 172.16.10.0/24 and destination: assigned policy group
    • Firewall rule blocking any between source: assigned policy group and destination: 172.16.10.0/24
    • Firewall rule blocking any between source: 172.16.10.0/24 and destination: assigned policy group
    • Assign the security policy to 172.16.1.0/24
  • Option 2 – Security Groups
    • Firewall rule allowing ssh between source: assigned policy group and destination: assigned policy group
    • Firewall rule blocking any between source: assigned policy group and destination: assigned policy group
    • Assign the security policy to 172.16.1.0/24 and 172.16.10.0/24
  • Option 3 – Two policies
    • Policy 1
    • Firewall rule allowing ssh between source: Assigned Policy group and destination: 172.16.10.0/24
    • Firewall rule blocking any between source: Assigned Policy group and destination: 172.16.10.0/24
    • Assign Policy 1 to 172.16.1.0/24
    • Policy 2
    • Firewall rule allowing ssh between source: Assigned Policy group and destination: 172.16.1.0/24
    • Firewall rule blocking any between source: Assigned Policy group and destination: 172.16.1.0/24
    • Assign Policy 2 to 172.16.10.0/24
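To make the ordering in these options concrete, here is a tiny, purely illustrative sketch of first-match rule evaluation modeled on option 1's rule order (this is not NSX code, just the layered-rules concept):

    # Illustrative only: first-match evaluation of layered rules (option 1 ordering).
    from ipaddress import ip_address, ip_network

    RULES = [  # evaluated top to bottom; first match wins
        {"src": "172.16.1.0/24",  "dst": "172.16.10.0/24", "port": 22,    "action": "allow"},
        {"src": "172.16.10.0/24", "dst": "172.16.1.0/24",  "port": 22,    "action": "allow"},
        {"src": "172.16.1.0/24",  "dst": "172.16.10.0/24", "port": "any", "action": "block"},
        {"src": "172.16.10.0/24", "dst": "172.16.1.0/24",  "port": "any", "action": "block"},
    ]

    def evaluate(src, dst, port):
        for rule in RULES:
            if (ip_address(src) in ip_network(rule["src"])
                    and ip_address(dst) in ip_network(rule["dst"])
                    and rule["port"] in (port, "any")):
                return rule["action"]
        return "default rule"

    print(evaluate("172.16.1.10", "172.16.10.10", 22))   # allow (SSH)
    print(evaluate("172.16.1.10", "172.16.10.10", 80))   # block (everything else)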

The first question anyone will ask is: why would I not use option 2?  It's smaller and easier to read, and it does accomplish the same goal.  It does, however, lack granularity in the design.  What if you had a third subnet, 172.16.20.0/24, and you only wanted it to access 172.16.1.0/24?  Option 1 could easily do this, while option 2 would mistakenly open up 172.16.10.0/24 as well.  This is the heart of firewall design: layer rules to create granularity.  I am not a master of the firewall but I do have a few suggestions:

  • Outbound firewall rules sound great but will immediately kill you in complexity
  • Protect the end points: apply rules to the destination (think apply rules to the web server instead of every PC).  If you need source-based rules, apply them on the destination.
  • Use naming conventions that describe the purpose of the rule, e.g. Allow-SSH-Into-Production
  • Consider using a DROP all on your default rule and then applying only allow rules in security groups
  • Rules that are part of the default section and not created in Service Composer don't show up in the Service Composer GUI, so don't use them beyond the default DROP; apply everything else as a security policy

 

Let’s do Option 1

  • Return to Networking & Security and select Service Composer
  • Select Security Groups and create a security group for each IP set

[screenshots: creating the security group for the first subnet]

  • Repeat for the other subnet
  • Click on security policies
  • Create a new policy as shown below

[screenshots: security policy creation wizard]

  • Now that you have it built, you just need to apply it to a security group
  • Click on the text of your Security Policy
  • Select Manage -> Security Groups
  • Click edit and add 172.16.1.0/24

Now your rules should work.  You can test with ping and SSH.  Using the same dialogs you can create option 2 or 3.  The same discipline you use for firewalls on physical entities needs to apply to the DFW: think before you create, or you will end up in firewall sprawl.

Deep Dive: How does NSX Distributed Firewall work

This is a continuation of my posts on NSX features; you can find other posts on the Deep Dive page.  My favorite feature of VMware NSX is the distributed firewall.  It provides some long-overdue security features.  At one time I worked in an environment where we wanted to ensure that every type of traffic was filtered with a firewall.  This was an attempt to increase security.  We wanted to ensure that there was no east <-> west traffic between hosts, so every machine was in its own subnet.  Each virtual machine was deployed alone inside a /27 subnet.  Every communication required a trip to the firewall, which was also serving as a router.

This model worked but made us very firewall-centric.  Everything required multiple firewall changes.  Basic provisioning took weeks because of the constant need for more firewall changes.  In addition, we wanted secondary controls, so each host ran its own host-based firewall as well.  This model caused a few major design constraints: you had to buy larger firewalls to handle all the routing, and you had to take your firewall guys to lunch all the time to avoid mega rage.

Enter the distributed firewall

The distributed firewall applies firewall rules in the hypervisor kernel at the virtual machine's network interface, right above the guest OS.  This has a few advantages:

  • No one on the OS can change the firewall rules
  • Only traffic that should be on the network is on the network; everything else gets blocked before leaving the virtual machine (think cost savings and less garbage traffic)
  • You can inspect each packet before it gets to the network and take action (lots of third-party plugins will be able to do this)
  • You can scale out your firewall's capacity by adding more hosts, in a modular fashion that matches your server growth

The firewall has an API for third-party solutions like virus scanners or IDS.  This allows them to be part of the data stream in real time.

Components of Distributed firewall (DFW)

The DFW has a management plane, control plane and data plane which should be familiar to network admins.

  • Management plane – done via the vCenter plugin or API access to NSX Manager.  This allows you to use any vCenter object as the source or destination (datacenter, VM name, vNIC, etc.).  It also allows you to define IP ranges for more traditional firewalling between IPs.
  • Control plane – handled by NSX Manager; it takes changes from vCenter, stores them in a central database, and pushes the rules down to each ESXi host (the local copy lives in /etc/vmware/vsfwd/vsipfw_ruleset.dat on each ESXi host).
  • Data plane – the ESXi hosts are the data plane, doing the actual work of the firewall.  All firewall functions take place in kernel modules on the ESXi hosts.  Remember that enforcement is done locally at the virtual machine, reducing the traffic on the wire.

Each vNIC gets its own instance of the DFW, put in place and managed by a daemon called vsfwd.

How does it work?

Each firewall rule is created and applied via the NSX Manager GUI or API.  When published, NSX Manager pushes all rules down to each ESXi host, and the hosts create a file on disk which holds all the firewall rules.  The ESXi host applies rules to each instance of the DFW; when a change happens in vCenter (remember the management plane – a new vNIC, a VLAN change, etc.) the firewall rules are re-consulted.  IP-based rules require VMware Tools to identify the IP address or addresses of the server.

How about vMotion?

Since the rules are applied to the virtual machine's vNIC, they move with the virtual machine when vMotion is used; there is no effect.

How about HA events?

Rules are loaded off disk and applied to virtual machines.

What about if NSX Manager is not available?

Rules are loaded off disk. New systems will get the rule sets that apply to them; for example, if my new server is called Web-Machine12 and I have rules applied to all VMs named Web-*, then it will get them from disk.  This encourages the use of naming standards.
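As a small illustration of why naming standards matter here, a wildcard scope like the Web-* example above behaves like a simple pattern match (conceptual sketch only, not how NSX stores its membership criteria internally):

    # Conceptual sketch: wildcard-style matching of VM names to a rule scope.
    from fnmatch import fnmatch

    rule_scope = "Web-*"
    new_vms = ["Web-Machine12", "App-Machine03", "Web-DB01"]

    for vm in new_vms:
        print(vm, "->", "gets the Web rules" if fnmatch(vm, rule_scope) else "no match")
    # A strict naming convention is what makes this kind of rule reliable.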

How about if I create a new virtual machine and it does not have any rules?

At the bottom is a default rule (some vote for allow all, others for deny all; I vote deny all), so your machine will get deny all.

Group and Policies

The DFW has the concept of Security Groups (yep, like it sounds): groups of similar systems that can be hard-coded to specific entities or kept dynamic using expressions against any vCenter entity.  It also has Security Policies: groups of like-minded rules to be processed in order.  So you define the scope of the rules in the security groups and define what is done in the security policies.  It can be a one-to-many reference on both sides (a security group can have many policies and a policy can have many groups), providing the ability to layer rules.

How do I track my firewall drops / accepts?

This is the first thing your firewall guys are going to ask for… and I don't like the answer right now.  Drops and accepts are logged to each ESXi host's syslog, so you need to centralize your host logs and do some searches to gather the firewall entries into one place.  If you search your host logs for "vsip_pkt" (in 6.1 this changed to dfwpktlogs:) you will find the firewall drops / accepts.
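Until you have a proper log tool in place, even a quick script over an exported syslog file can pull the DFW entries together. A minimal sketch (the file path is hypothetical; the match strings are the ones mentioned above):

    # Minimal sketch: pull DFW packet log lines out of an exported ESXi syslog file.
    # "vsip_pkt" is the pre-6.1 tag, "dfwpktlogs" the 6.1+ tag (see above).
    import re

    LOGFILE = "/tmp/esxi-syslog-export.log"   # hypothetical export location
    pattern = re.compile(r"vsip_pkt|dfwpktlogs")

    with open(LOGFILE) as f:
        for line in f:
            if pattern.search(line):
                print(line.rstrip())          # each hit is a firewall drop/accept record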

 

Deep Dive: How does NSX Distributed routing work

As a continuation of my previous post, How does the NSX vSwitch work, I am now writing about how the routing works.  I should thank Ron Flax and Elver Sena for walking through this process with me.  Ron is learning what I mean by knowledge transfer and is being very patient with it.  This post will detail how routing works in NSX.  The more I learn about NSX the more it makes sense; it is really a great product.  My posts are 100% about VMware NSX, not multi-hypervisor NSX.

How does routing work anyways

Much like my last post, I think it's important to understand how routing in a physical environment works.  Simply put, when you want to go from one subnet to another subnet (layer 2 segment) you have to use a router.  The router receives a packet and has two choices (some routers have more but this is generic):

  • I know the destination network lives down this path, so I send it that way
  • I don't know the destination network, so I forward it out my default gateway

IP packets keep getting forwarded upstream until a router knows how to deliver them.  It all sounds pretty simple.
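That two-choice decision fits in a few lines of code (illustrative only; real routers use longest-prefix matching over much larger tables):

    # Illustrative sketch of the two choices a router makes for each packet.
    from ipaddress import ip_address, ip_network

    ROUTE_TABLE = {                       # known destination networks -> next hop
        ip_network("192.168.10.0/24"): "interface-A",
        ip_network("192.168.20.0/24"): "interface-B",
    }
    DEFAULT_GATEWAY = "upstream-router"   # where everything unknown is sent

    def forward(dst_ip):
        dst = ip_address(dst_ip)
        for network, next_hop in ROUTE_TABLE.items():
            if dst in network:
                return next_hop           # choice 1: I know this network
        return DEFAULT_GATEWAY            # choice 2: I don't, use the default gateway

    print(forward("192.168.20.55"))       # interface-B
    print(forward("8.8.8.8"))             # upstream-router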

Standardized Exterior Gateway protocols

It would be simple if someone placed an IP subnet at a location and never changed it.  We could all learn a static route to the location and never have it change.  Think of the internet like a freeway: every so often we need to do road construction, which may cause your journey to take longer via an alternate route, but you will still get there.  I am afraid that the world of IT is constant change, so protocols were created to dynamically update routers about these changes, and standardized gateway protocols were born (BGP, OSPF, etc.).  I will not go into these protocols because I have a limited understanding and because they are not relevant for the topic today (they will be relevant later).  It's important to understand that routes can change and there is an orderly way of updating them (think DNS for routing… sorta).

IP

A key component of routing is the Internet Protocol.  An IP address is a unique address that we all use every single day.  There are public IP addresses and internal IP addresses; NSX can use either and has solutions for bridging both.  This article will use two subnets: 192.168.10.0/24 and 192.168.20.0/24.  The /24 after the IP address denotes the size of the range in CIDR notation.  For this article it's enough to note that these ranges are on different layer 2 segments and normally cannot talk to each other without a router.

Setup

We are going to place these two networks on different VXLAN-backed networks (VNIs) as shown below:

[diagram: each virtual machine connected to its own VNI]

If you are struggling with the term VNI, just replace it with VLAN and it's about the same thing (read more about the differences in my last post).  In this diagram we see that each virtual machine is connected to its own layer 2 segment and will not be able to talk to the other or anything else.  We could deploy another machine into VNI 5000 with the address 192.168.10.11 and the two would be able to talk using NSX switching, but no crossing from VNI 5000 to VNI 5001 would be allowed.  In a physical environment a router would be required to allow this communication; in NSX this is also true:

[diagram: distributed router connecting VNI 5000 and VNI 5001]

Both of the networks shown would set their default gateway to be the router.  Notice the use of a distributed router: in a physical environment this would be a single router or a cluster, but NSX uses a distributed router.  Its capacity scales as you scale your environment; each time you add a host you get more routing capacity.

Where does the distributed router live?

This was a challenge for me when I first started working with NSX; I thought everything was a virtual machine.  The NSX vSwitch is really just a code extension of the dVS, and this is also true of the router: the hypervisor kernel does the routing with minimal overhead.  This provides an optimal path for data; if the source and destination are on the same machine, communication never leaves the machine (much like switching in a normal VSS).  The data plane for the router lives on the dVS.  There are a number of components to consider:

  • Distributed router – code that lives on each ESXi host as part of the dVS and handles routing.
  • NSX routing control VM – a virtual machine that controls aspects of routing (such as BGP peering); it is in the control plane, not the data plane (in other words, it is not required for routing to happen).  (Design tip: you can make it highly available by clicking the HA button at any time; this creates another VM with an anti-affinity rule.)
  • NSX control cluster – the control cluster mentioned in my last post.  It syncs configuration from the management and control plane elements down to the data plane.

How does NSX routing work?

Here is the really neat part.  A router's job is to deliver IP packets; it is not concerned with whether the packets should be delivered, it just flings IPs to their destination.  So let's go through a basic routing situation in NSX.  Assume the Windows virtual machine wants to talk to the Linux virtual machine.

[diagram: Windows VM on VNI 5000 on ESXi1, Linux VM on VNI 5001 on ESXi2]

 

The process is like this:

  1. The local L3 router becomes aware of each virtual machine as it talks out and updates the control cluster with an ARP entry, including the VNI and ESXi node
  2. The control cluster updates all members of the same transport zone so everyone knows the ARP entries
  3. The Windows virtual machine wants to visit the website on Linux, so it sends an ARP request
  4. ESXi1's DLR (Distributed Logical Router) returns its own MAC address
  5. Windows sends a packet to ESXi1's DLR
  6. The local DLR knows that Linux is on VNI 5001, so it routes the packet to the local VNI 5001 on ESXi1
  7. The switch on ESXi1 knows that Linux lives on ESXi2, so it sends the packet to VTEP1
  8. VTEP1 sends the packet to VTEP2
  9. VTEP2 drops the packet into VNI 5001 and Linux gets the message

It really makes sense if you think about it.  It works just like any router or switch you have ever used; you just have to get used to the distributed nature.  The greatest strength of NSX is the ability to handle everything locally: if the Linux VM were on the same ESXi host, the packet would never leave ESXi1 to get to it.

What is the MAC address and IP address of the DLR?

Here is where the fun begins.   It is the same on each host:

[diagram: identical DLR IP and MAC addresses on every host]

Yep, it's not a typo: every host's router is seen as the default gateway for each VNI.  Since the layer 2 networking is done over VXLAN (via VTEPs), each local router can have the same IP address and MAC address.  The kernel code knows to route locally and it all works.  This does present one problem: external access.

External Access

In order for your network to be accessible from external networks, the DLR has to present the default gateway outside.  But if each instance has the same IP / MAC, who responds to requests to route traffic?  One instance gets elected as the designated instance (DI) and answers all such requests.  If a message needs to be sent to an ESXi host other than the one running the DI, it routes like above.  It's a simple but great process that works.

Network Isolation

What if your designated instance becomes isolated?  There is an internal heartbeat that, if not responded to, causes a new DI election to happen.  What if networking fails on an ESXi host?  Then every other instance will continue to communicate with everyone else, and only packets destined for the failed host will fail.

Failure of the control cluster

What if the control cluster fails?  Since all the routing is distributed and held locally, everything will continue to operate.  Provisioning new elements in the virtual world may fail, but everything else will be fine.  It's a good idea to ensure that you have enough control cluster nodes and redundancy, as they are a critical component of the control plane.

Deep Dive: How does the NSX vSwitch Work

Edit: Thanks to Ron Flax and Todd Craw for helping me correct some errors.

I have been blessed of late to be involved in some VMware NSX deployments and I am really excited about the technology.  I am by no means a master of NSX, but I will post about my understanding as a way to spread information and assist with my personal learning.  In this post I will be covering only the switch capabilities of NSX.

 

Traditional Switches

The key element of a layer 2 Ethernet switch is the MAC address.  This is a (mostly) unique identifier on a network card; each network adapter should have a unique address.  A traditional physical switch learns the MAC addresses connected to each port when the network device first tries to communicate.  For example:

[diagram: Windows server on switch port 1, Linux server on switch port 2]

When you power on the Windows physical server, the physical switch learns that MAC 00:00:00:00:01:01 is connected to port 1.  Any message destined for 00:00:00:00:01:01 should be sent to port 1.  This allows the switch to create logical connections between ports and limit the amount of wasted traffic.  This entry in the switch's MAC table (sometimes called a CAM table) stays present for 5 minutes (user configurable) and is refreshed whenever the server uses its network card.  The Linux server on port 2 is discovered exactly the same way, by physically talking on the port, and the table is updated for port 2.  If Windows wants to talk to Linux, their communication never leaves the switch as long as they are in the same subnet.  If the destination MAC address is unknown to the switch, it floods the frame out all ports in the hope that something will respond.
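The learning behaviour described above fits in a few lines of pseudo-switch code (purely illustrative; ageing timers and VLANs are left out):

    # Illustrative MAC learning: learn source MAC -> port, forward known, flood unknown.
    mac_table = {}   # MAC address -> switch port (the CAM table described above)

    def handle_frame(src_mac, dst_mac, in_port, num_ports=24):
        mac_table[src_mac] = in_port                 # learn/refresh the sender's port
        if dst_mac in mac_table:
            return [mac_table[dst_mac]]              # deliver out the one known port
        return [p for p in range(1, num_ports + 1) if p != in_port]   # flood

    handle_frame("00:00:00:00:01:01", "ff:ff:ff:ff:ff:ff", in_port=1)  # Windows talks, learned on port 1
    print(handle_frame("00:00:00:00:02:02", "00:00:00:00:01:01", in_port=2))  # Linux replies -> [1]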

Address Resolution Protocol (ARP)

ARP is a protocol used to resolve IP addresses to their MAC addresses.  It is critical to understand that ARP does not return the MAC address of the final destination; it only returns the MAC address of the next hop, unless the final destination is on the same subnet.  This is because Ethernet is only concerned with the next hop MAC, not the end destination.

[diagram: ARP resolution at each hop toward the destination]

You can follow the ARP exchanges between each layer of the diagram; the key point is that if the destination IP is not local, the traffic is sent to the default gateway's MAC and forwarded on from there.

Traditional Virtual Switches

In order to understand the NSX vSwitch it is critical that you understand how the traditional virtual switch works.  In a traditional virtual switch (VSS and dVS) the switch learns the MAC addresses of virtual machines when they are powered on.  As soon as a virtual machine is assigned a switch port, its MAC becomes hard-coded in the MAC table for that virtual switch.  Anything that is local to that switch in the same VLAN or segment will be delivered locally; otherwise the virtual switch just forwards the message out its uplink and allows the physical switches to resolve the connection.

NSX Virtual Switch

The NSX virtual switch adds functionality beyond the traditional virtual switch.  The key feature is the ability to use VXLAN to span layer 2 segments between hosts without stretching multiple VLANs.  VXLAN also allows stretching layer 2 to distant datacenters and supports up to 16 million segments versus the current limit of 4096 VLANs.  There are some common components that need to be understood:

  • VTEP (VXLAN Tunnel End Point) – an ESXi VMkernel adapter that has its own VLAN and IP address, including a gateway.  This interface must be set for 1600 MTU, and all physical switches/routers that handle this traffic must allow at least 1600 MTU.
  • NSX virtual switch (also called the logical switch) – a software, kernel-based construct that does the heavy lifting.  It is deployed to a dVS switch and works as an extension of the dVS.
  • NSX Manager – the management plane for NSX; it acts as a central point for communication, scripting and control.  It is only required when making changes as part of the management plane.
  • NSX control cluster – a series of virtual machines that are clustered via software.  Each node (there should be an odd number, and at least three) contains all required information, and load is distributed between them.  (Best practice: create a DRS rule to keep these on separate hosts; future releases may do this for you.)
  • VNI (VXLAN Network Identifier) – an identifier used by VXLAN to separate networks (think VLAN tag); they start at 5000 and go to 16 million.  It is easiest to think of VLAN tags when working with VNIs.

With all the terminology out of the way, it's time to get down to the packet path.  The NSX virtual switch brings one key capability: the ability to switch packets between nodes or clusters without having layer 2 stretched between the clusters.  For my networking friends, this means a reduction in spanning tree issues.

So let me lay it out below:

[diagram: two ESXi hosts with VTEPs, a three-node control cluster, and VNI 5000]

We have a three-node NSX control cluster that has been deployed.  We have two ESXi hosts running dVSs with the NSX virtual switch.  VXLAN has been enabled and a virtual network, VNI 5000, has been created.  The VTEPs have been configured.  We have created two virtual machines as shown in green; neither has been connected to the VNI network yet.

 

Time to learn our first MAC:

  • We connect the Windows server to VNI 5000 as shown below
  • The MAC table on our local switch is updated (learned), and the host passes its learned information to the control cluster
  • The control cluster passes it to all members of the logical switch (there are three methods to pass the information – unicast, multicast and hybrid – which I will cover in another post)

[diagram: MAC table for the Windows VM synced to all hosts]

 

This syncing of the MAC table ensures that each member of the VNI knows how to handle switching, creating a distributed switch (like a switch stack in which multiple switches act as one).

When we power on the Linux server the same method is used:

  • We connect the Linux server to VNI 5000 as shown below
  • The MAC table on our local switch is updated (learned), and the host passes its learned information to the control cluster
  • The control cluster passes it to all members of the logical switch (again via unicast, multicast or hybrid mode)

[diagram: MAC tables for both VMs synced to all hosts]

Now we have a MAC table available on each switch, which works great.  Let's follow the flow of communication (a small sketch of the forwarding decision follows the steps).  Assume the following: the Windows server wants to open a web page on the Linux server on port 80:

  • The user on the Windows server brings up Internet Explorer and types in 192.168.10.11
  • The Windows server sends out an ARP request for 192.168.10.11
  • ESXi1's virtual switch returns the MAC address 00:00:00:00:02:02
  • The Windows server sends out an IP packet with the destination MAC address of 00:00:00:00:02:02
  • ESXi1's virtual switch forwards the packet out VTEP1 by encapsulating it, destined for the IP of VTEP2
  • VTEP2 receives the packet, removes the VTEP encapsulation, and forwards the packet to ESXi2's virtual switch on VNI 5000
  • The switch on ESXi2 sends the packet to the virtual port that the Linux server's network card is connected to
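The forwarding decision in that flow boils down to a table lookup: which VTEP owns the destination MAC on this VNI. A conceptual sketch (not the real NSX data structures):

    # Conceptual sketch: the control-cluster-synced table maps (VNI, MAC) -> owning VTEP.
    vni_mac_table = {
        (5000, "00:00:00:00:01:01"): "VTEP1",   # Windows VM on ESXi1
        (5000, "00:00:00:00:02:02"): "VTEP2",   # Linux VM on ESXi2
    }
    LOCAL_VTEP = "VTEP1"   # we are ESXi1 in this example

    def forward_on_vni(vni, dst_mac):
        remote_vtep = vni_mac_table.get((vni, dst_mac))
        if remote_vtep is None:
            return "unknown MAC - handle via unicast/multicast/hybrid replication"
        if remote_vtep == LOCAL_VTEP:
            return "deliver locally, frame never leaves the host"
        return f"VXLAN-encapsulate and send to {remote_vtep}"

    print(forward_on_vni(5000, "00:00:00:00:02:02"))   # encapsulate toward VTEP2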

 

This is how an NSX virtual switch handles switching.  At first you may say this makes no sense at all… wouldn't a VLAN just be easier?  There are a number of benefits this brings:

  • Limits your spanning tree domain to potentially just the top-of-rack switches, if architected correctly
  • Allows you to expand past the 4096 VLAN limit
  • Opens the door for other NSX services (which I will post about in the future.)

 

As I mentioned, this is just my understanding; I do not have inside knowledge.  If I have made a mistake, let me know and I'll test and then correct it.

Design Scenario: Gigabit network and iSCSI ESXi 5.x

Many months ago I posted some design tips on the VMware forums (I am Gortee there, if you are wondering).  Today a user updated the thread with a new scenario looking for some advice.  While it would be a bad idea personally and professionally for me to give specific advice without a design engagement, I thought I might provide some thoughts about the scenario here.  This will allow me to justify some design choices I might make in the situation.  In no way should this be taken as law; in reality everyone's situation is different and small requirements can really change the design.  The original post is here.

The scenario provided was the following:

3 ESXI hosts (2xDell R620,1xDell R720) each with 3×4 port NICS (12 ports total), 64GB RAM. (Wish I would have put more on them ;-))

1 Dell MD3200i iSCSI disk array with 12 x 450GB SAS 15K Drives (11+1 Spare) w/2 4 port GB Ethernet Ports

2 x Dell 5424 switches dedicated for traffic between the MD3200i and the 3 Hosts

Each host is connected to the iSCSI network though 4 dedicated NIC Ports across two different cards

Each Host has 1 dedicated VMotion Nic Port connected to its own VLAN connected to a stacked N3048 Dell Layer 3 switch

Each Host will have 2 dedicated (active\standby) Nic ports (2 different NIC Cards) for management

Each Hosts will have a dedicated NIC for backup traffic (Has its own Layer 3 dedicated network/switch)

Each host will use the remaining 4 Nic Ports (two different NIC cards) for the production/VM traffic)

 would you be so kind to give me some recommendations based on our environment?

Requirements

  • Support 150 virtual machines
  • Do not interrupt systems during the design changes

Constraints

  • Cannot buy new hardware
  • Not all traffic is vlan segmented
  • Lots of 1GB ports per server

Assumptions

  • Standard Switches only (Assumed by me)
  • Software iSCSI is in use (Assumed again by me)
  • Not using Enterprise plus licenses

 

Storage

Dell MD3200i iSCSI disk array with 12 x 450GB SAS 15K Drives (11+1 Spare) w/2 4 port GB Ethernet Ports

2 x Dell 5424 switches dedicated for traffic between the MD3200i and the 3 Hosts

Each host is connected to the iSCSI network though 4 dedicated NIC Ports across two different cards

I personally have never used this array model; the vendor should be included in the design to make sure my suggestions here are valid for this storage system.  Looking at the VMware HCL we learn the following:

  • Only supported on ESXi 4.1 U1 through 5.5 (no 5.5 U1 yet so don’t update)
  • You should be using VMW_PSP_RR (Round Robin) for path failover
  • The array supports the following VAAI primitives: Block Zero, Full Copy, HW Assisted Locking

The following suggestions should apply to physical cabling:

[diagram: iSCSI cabling between the hosts, the two switches and the array]

Looking at the diagram I made the following design choices:

  • From my limited understanding of the array, the cabling follows the best practice guide I could find.
  • Connections from the ESXi hosts to the switches are made to create as much redundancy as possible, including all available cards.  It is critical that the storage be as redundant as possible.
  • Each uplink (physical NIC) should be connected to an individual VMkernel port group, and each port group should be configured with only one uplink.
  • Physical switches and port groups should use untagged (native) ports, assuming these switches don't do anything other than provide storage traffic between these four devices (three ESXi hosts and one array); if the array and switches provide storage to other things, follow your vendor's best practices for segmenting traffic.
  • Port binding for iSCSI should be configured as per the VMware and vendor documents

New design considerations from storage:

  • Four 1Gb links represent the maximum traffic the storage system will provide
  • The array does not support 5.5 U1 yet, so don't upgrade
  • We have some VAAI primitives to help speed up processes and avoid SCSI locks
  • Software iSCSI requires that forged transmits be allowed on the vSwitch

Advice to speed up iSCSI storage

  • Find your bottleneck – is it switch speed, array processors, or ESXi software iSCSI – and solve it.
  • You might want to consider Storage DRS to automatically balance capacity and I/O metrics (requires an Enterprise Plus license but saves so much time); note that it also has an impact on CBT backups, making them do a full backup.
  • Hardware iSCSI adapters might also be worth the time… though they have little real benefit in the 5.x generation of ESXi.

 

Networking

We will assume that we now have 8 total 1Gb ports available on each host.  The current network architecture looks like this (avoiding the question of how many virtual switches):

[diagram: current network layout]

I may have made mistakes in my reading, but a few items pop out to me:

  • vMotion does not have any redundancy, which means if that card fails we will have to power off VMs to move them to another host.
  • Backup also does not have redundancy, which is less of an issue than the vMotion network.
  • Not all traffic has redundant switches, creating single points of failure.

A few assumptions have to be made:

  • No single virtual machine will require more than 1Gb of traffic at any time (otherwise we would have to look into LACP or EtherChannel solutions)
  • Management traffic, vMotion and virtual machine traffic can live on the same switches as long as they are segmented with VLANs

 

Recommended design:

[diagram: recommended network design]

  • Combine the management switch and VM traffic switch into dual-function switches that carry both types of traffic.
  • Use VLAN tags to put vMotion and management traffic on the same two uplinks, providing card redundancy (configured active / passive).  This could also be configured with multi-nic vMotion, but I would avoid that due to the complexity around management network starvation in your situation.
  • Backup continues to have its own two adapters to avoid contention.

This does require some careful planning and may not be the best possible use of links.   I am not sure you need 6 links for your VM traffic but it cannot hurt.

 

Final Thoughts:

Is any design perfect?  Nope, there is lots of room for error and unknowns.  Look at the design and let me know what I missed.  Tell me how you would have done it differently… share so we can both learn.  Either way I hope it helps.

Deep Dive: Network Health check

vSphere 5.1 introduced one of my favorite new features: network health check.  This feature is designed to identify problems with MTU and VLAN settings.  It is easy enough to set up MTU and VLANs in ESXi, especially with a dVS, but in most environments the vSphere admins don't control the physical switches, making confirmation of the upstream configuration hard.  The health check resolves these issues.  It is only available on dVS switches and only via the web client (I know, time to start using that web client… your magical fat client is going away).  If you have an upstream issue with MTU you will get an alert in vCenter.  You can find the health check by selecting the dVS and clicking on the Manage tab; in the middle pane you will see Health check, which you can edit and enable.  You came here because you want to know how it works.

 

MTU

The MTU check is easy.  Each host sends a ping message to the other nodes.  This ping message has a special header that tells the network not to fragment (split) the packet, and it carries a payload (empty data) to make the ping the size of the maximum MTU.  If the host gets a reply to the ping, it knows the MTU is correct; if it fails, we know the MTU is bad.  Each node checks its MTU at an interval.  You can manually check your MTU with vmkping, but the syntax has changed between 5.0, 5.1 and 5.5, so look up the latest syntax.
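The arithmetic behind that test is worth spelling out: the ping payload has to be the MTU minus the IP and ICMP headers, or the don't-fragment flag will make it fail even on a healthy link. A quick sketch (standard header sizes assumed, no VXLAN overhead included):

    # Payload size needed for a don't-fragment ping test of a given MTU.
    IP_HEADER = 20     # bytes, assuming no IP options
    ICMP_HEADER = 8    # bytes

    def ping_payload_for_mtu(mtu):
        return mtu - IP_HEADER - ICMP_HEADER

    print(ping_payload_for_mtu(1500))   # 1472 bytes for a standard MTU
    print(ping_payload_for_mtu(9000))   # 8972 bytes for jumbo frames
    # If a ping of this size with the don't-fragment bit set gets a reply,
    # the path supports the MTU; if not, something upstream is smaller.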

 

VLAN

Checking the VLANs is a little more complex, because each VLAN has to be checked.  One host on the vDS (I'm not sure which one, but I am willing to bet it's the master node) sends out a broadcast layer 2 packet on the VLAN, then waits for each node to reply to the broadcast with a unicast layer 2 packet.  You can determine which hosts have VLAN issues based upon who reports back.  I assume a host marked as bad then tries to broadcast itself as a method to identify failed configuration or partitions.  This test is repeated on each VLAN and at regular intervals.  It only works when two peers can connect.

Teaming policy

In ESXi 5.5 they added a check of the teaming policy against the physical switch.  This check identifies mismatches between IP hash teaming and switches that are not configured for EtherChannel/LACP.

 

Negative Effect of Health check

So why should you not use health check?  Well, it does produce some traffic, and it does require you to use the web client to enable it and to determine which VLANs are bad… otherwise I cannot figure out a reason not to use it.  It is a simple and easy way to find issues.

Design Advice on health check

Health check is a proactive way to find upstream VLAN or MTU issues before you deploy production to that VLAN.  It saves a ton of time when troubleshooting and fighting between the networking and server teams.  I really cannot see a reason not to use it.  I have not tested the required bandwidth, but it cannot be huge.  My two cents: turn it on if you have a vDS… if you don't have a vDS, I hope you only have ten or fewer VLANs.

Deep Dive: vSphere Traffic Shaping

Traffic shaping is all about the bad actor scenario.  We have hundreds of virtual machines that all get along with each other.  Then the application team deploys an appliance that goes nuts and starts to use its link at 100%.  Suddenly you get a call about database and website outages.  How do you deal with the application team's bad actor?  My wife would be very unhappy if she could not take her hot shower in the morning because Bob upstairs took an extra-long shower an hour ago.  Sharing resources is great as long as resources are unlimited, not over-provisioned, and usage patterns stay static.  In the real world none of those things are true: you are likely limited on resources, over-provisioned, and your traffic patterns change every single day.  Limits allow us to create constraints upon portions of resources in order to control bad actors.

Limits (available on any type of switch)

Limits are, as expected, limits that a machine cannot cross.  This allows a machine to see a 10Gb uplink but use at most 1Gb.  The slowdown is injected into the communication stream via normal protocol methods.  The limit settings in VMware can be applied on the port group, or on a dvPort or dvPort group; notice the difference – on dVS switches we can apply limits on ports as well as port groups.  Limits on standard switches apply to outbound traffic only, while on a dVS they can be inbound and outbound.  There are three settings on limits:

  • Average bandwidth – the average number of bits per second to allow across the port
  • Peak bandwidth – the maximum bits per second to allow across the port when it is sending a burst of traffic; this caps the bandwidth used while the burst bonus is being spent
  • Burst size – the maximum number of bytes to allow in a burst.  This can be viewed as a bank: when you don't use all your average bandwidth, credit is stored up to the burst size to be spent when needed (see the quick math after this list)
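A quick way to reason about those three numbers is to treat the burst size as a bank that tells you how long a port can run at peak before being pushed back down to the average. A small illustrative calculation with hypothetical values:

    # Illustrative traffic-shaping math: how long can a port burst above its average?
    average_bps = 100_000_000      # 100 Mbit/s average bandwidth (hypothetical)
    peak_bps    = 500_000_000      # 500 Mbit/s peak bandwidth (hypothetical)
    burst_bytes = 50_000_000       # 50 MB burst size (hypothetical)

    burst_bits = burst_bytes * 8
    seconds_at_peak = burst_bits / (peak_bps - average_bps)
    print(f"{seconds_at_peak:.1f} seconds at peak before the limit pulls it back")  # 1.0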

 

Limits of the Limits

Limits produce some, well… limits.  Limits are always enforced, meaning that even if bandwidth is available it will not be allocated to the port group / port.  Limits on VSSs are outbound only, meaning you can still flood a switch.  Limits are not reservations: machines without limits can consume all available resources on a system.  So effectively limits are only useful to protect everyone else from a known bad actor; they are not a sharing method.  Limits on the network do have their place, but I would avoid general use if possible.

 

Network IO Control a better choice

Network IO Control (NIOC) is available only on the vDS.  It provides a solution to the bad actor problem while keeping flexibility.  NIOC is applied to outbound traffic.  NIOC works very much like resource pools for compute and memory: you set up a NIOC share (resource pool) with a number between 1 and 100.  vSphere comes with some system-defined NIOC shares like vMotion and management, and you can also define new resource pools and assign them to port groups.  NIOC only comes into play during times of contention on the uplink.  All NIOC shares are calculated on an uplink-by-uplink basis: the shares of all active traffic types on the uplink are added together.  For example, assume my uplink has the following shares:

  • Management 10
  • vMotion 20
  • iSCSI 40
  • Virtual machines 50

If contention arises and only management, iSCSI and virtual machine traffic are active, we have 100 total shares.  That number is then used to divide the total available bandwidth on the uplink.  Let's assume we have a 10Gb uplink; each active traffic type would then get, based on shares:

  • Management 1Gb
  • iSCSI 4Gb
  • Virtual machines 5Gb

This example also assumes each type is trying to use 100% of its allocation.  If management is only using 100Mb, the others get its leftover amount divided by their share amounts (in this case the remaining 900Mb is split across the 90 remaining shares – 40 for iSCSI and 50 for virtual machines).  If a new traffic type becomes active, the shares are recalculated to meet the demand.  This lets you work out worst-case guarantees for each traffic type; with all four types active there are 120 total shares, so the minimums work out to roughly:

  • Management will get at least ~0.8Gb
  • vMotion will get at least ~1.7Gb
  • iSCSI will get at least ~3.3Gb
  • Virtual machines will get at least ~4.2Gb

There is one wrinkle to this plan with multi-nic vMotion but I will address that in another post.
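The share arithmetic above is simple enough to check in a few lines; this sketch reproduces the 10Gb example with only management, iSCSI and virtual machine traffic active, and the worst case with all four types active:

    # NIOC share math from the example above (10 Gbit uplink).
    SHARES = {"management": 10, "vmotion": 20, "iscsi": 40, "vm": 50}
    UPLINK_GBIT = 10

    def bandwidth_under_contention(active):
        total = sum(SHARES[t] for t in active)
        return {t: round(UPLINK_GBIT * SHARES[t] / total, 2) for t in active}

    print(bandwidth_under_contention(["management", "iscsi", "vm"]))
    # {'management': 1.0, 'iscsi': 4.0, 'vm': 5.0}
    print(bandwidth_under_contention(["management", "vmotion", "iscsi", "vm"]))
    # {'management': 0.83, 'vmotion': 1.67, 'iscsi': 3.33, 'vm': 4.17}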

 

Design Choices

Limits have their uses, but they are hard to manage and really hard to diagnose… Imagine coming into a vSphere environment where limits are in place but you did not know; it could take a week to figure out that they were causing the issues.  My vote: use them sparingly.  NIOC, on the other hand, should be used in almost every environment with Enterprise Plus licenses.  It really has no drawback and provides controls on traffic.

Deep Dive: vSphere Network Load Balancing

In vSphere, load balancing is a hot topic.  As the load per physical host increases, so does the need for more bandwidth.  Traditionally this was done with EtherChannel or LACP, which bond together multiple links so they act like a single link.  This also helps avoid loops.

What the heck is a loop?

A loop exists any time two layer 2 (Ethernet) endpoints have multiple connections to each other.

 

It is possible to create a bridged loop with two virtual switches if care is not taken, but virtual switches by default will not create loops.  On the physical switch side, protocols like spanning tree were created to solve this problem: STP disables a link if a loop is detected, and if the enabled link goes down, STP turns the disabled link back on.  This process works for redundancy but does not do anything if link 1 is not a big enough pipe to handle the full load.  VMware has provided a number of load balancing algorithms to provide more bandwidth.

Options

  • Route Based on Originating virtual port (Default)
  • Route Based on IP Hash
  • Route Based on Source MAC Hash
  • Route Based on Physical NIC Load (LBT)
  • Use Explicit Failover Order

 

In order to explain each of these options, assume we have an ESXi host with two physical network cards called nic1 and nic2.  It's important to understand that the load balancing options can be configured at the virtual switch or port group level, allowing for lots of different load balancing on the same server.

Route Based on Originating virtual port (Default)

The physical NIC to be used is determined by the ID of the virtual port to which the VM is connected.  Each virtual machine is connected to a virtual switch, which has a number of virtual ports, and each port has a number.  Once assigned, the port does not change unless the VM changes ESXi hosts.  This number is the virtual port ID.  I don't know the exact method used, but I assume it's something as simple as odds and evens for two NICs: everything odd goes to uplink 1 while everything even goes to uplink 0.  This method has the lowest virtual switch processing overhead and works with any network configuration; it does not require any special physical switch configuration.  You can see, though, that it does not really load balance.  Let's assume you have a lot of port groups, each with only one virtual machine sitting on virtual port 0: in this case all virtual machines would use the same uplink, leaving the other unused.

Route Based on IP Hash

The physical NIC to be used is determined by a hash of the source and destination IP addresses.  This method provides load balancing across multiple physical network cards from a single virtual machine; it's the only method that allows a single virtual machine to use the bandwidth of multiple physical NICs.  It has one major drawback: the physical switches must be configured to use EtherChannel (802.3ad link aggregation) so they present both network links as a single link, to avoid problems.  This is a major design choice.  It also does not provide perfect load balancing.  Let's assume you have an application server that does 80% of its traffic with a database server: their communication will always happen across the same link, and they will never use the bandwidth of two links, because their hash will always assign them the same link.  In addition this method uses a lot of CPU.

  • When using etherchannel only a single switch may be used
  • Beacon probing is not supported on IP Hash
  • vDS is required for LACP
  • Troubleshooting is difficult because each source/destination combination may take a different path (some virtual machine paths may work while others do not, in an inconsistent pattern); the hash sketch below shows why a given pair is pinned to one link
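The behaviour is commonly described as an XOR of the source and destination IP addresses modulo the number of active uplinks; treat the sketch below as an approximation of that idea (not VMware's exact code). It demonstrates why the same pair of machines always lands on the same uplink:

    # Approximation of IP-hash uplink selection: same src/dst pair -> same uplink every time.
    import ipaddress

    def ip_hash_uplink(src, dst, num_uplinks):
        s = int(ipaddress.ip_address(src))
        d = int(ipaddress.ip_address(dst))
        return (s ^ d) % num_uplinks

    print(ip_hash_uplink("10.0.0.10", "10.0.0.50", 2))   # always the same uplink for this pair
    print(ip_hash_uplink("10.0.0.10", "10.0.0.51", 2))   # a different pair may hash elsewhere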

Route Based on Source Mac Hash

The physical NIC to be used is determined by a hash created from the virtual machine's source MAC address.  This method provides a more balanced approach than originating virtual port.  Each virtual machine will still only ever use a single link, but load will be distributed.  This method has low CPU overhead and does not require any physical switch configuration.

Route Based on Physical NIC Load (Distributed Virtual Switch Required also called LBT)

The physical NIC to be used is determined by load.  The NICs are used in order (nic1 then nic2); no traffic will be moved to nic2 until nic1 is utilized above 75% capacity for 30 seconds.  Once this happens, traffic flows are moved to the next available NIC, and they stay on that NIC until another LBT event moves traffic again.  LBT does require the dVS and some CPU overhead.  It does not allow a single virtual machine to gain more than 100% of a single link's speed, but it does balance traffic among all links during times of contention.

Use Explicit Failover Order

The physical NIC to be used is the highest NIC on the list of available NICs; the others will not be used unless the first NIC is unavailable.  This method does no load balancing and should only be used in very special cases (like multi-nic vMotion).

 

Design Advice

Which one should you use?  It depends on your needs.  Recently a friend told me they never changed the default because they never get close to saturating a single link.  While this approach has merit, and I wish more people understood their network metrics, you may need to plan for the future.  There are two questions I use to determine which option to pick:

  • Do you have any virtual machines that alone require more than a single links bandwidth? (If yes then the only option is IP Hash and LACP or etherchannel)
  • Do you have vDS’s? (If yes then use Route based on physical nic load, if no then use default or source MAC)

Simply put, LBT is a lot more manageable and easier to configure.

Deep Dive: Virtual Switch Security settings and Port Binding

Security Settings:

Three options are available on a virtual switch.  These settings can be set at the switch layer then overwritten on individual port groups.

  • Promiscuous Mode – allows the guest adapter to detect all frames passed on the vSwitch that are in the same VLAN as the guest, which allows for packet sniffing.  This is not port mirroring of the physical network; the guest only sees traffic that reaches its own vSwitch and VLAN, plus broadcast traffic.
  • MAC Address Changes – allows the guest to change its MAC address.  If set to reject, all frames destined for a MAC not in the .vmx file are dropped at the switch.
  • Forged Transmits – if set to reject, all frames sent from the guest with a source MAC address that does not match the .vmx file are dropped.

Security settings advise:

Set all three to reject on the switch, keeping your operating system admins in a box while protecting shared resources.  Then add individual policies to each port group as needed.  If you are wondering where accepting is needed, one of the use cases is nested virtualization… which requires all three to be set to accept.

Port Binding:

Port binding is a setting that allows you to determine how and when the ports on a virtual switch are allocated.  Currently there are three port binding options:

  • Static binding (default)
  • Dynamic binding
  • Ephemeral binding

Static binding – a port is allocated to a virtual machine when it is added to the port group.  Once allocated, the VM continues to use the port until it is removed from the port group (via deletion or a move to another port group).  Network stats with static binding are kept through power off and vMotion.

Dynamic binding – will be removed in the near future.  Ports are allocated only when a virtual machine is powered on and its virtual network card is connected; they are dynamically allocated when needed.  Network stats are kept through vMotion but not power off.

Ephemeral binding – is a lot like a standard vSwitch in that it can be managed from vCenter or the ESXi host.  Ports are allocated when the virtual machine is powered on and its NIC is connected.  One major difference is that dvPorts are created on demand, while the other binding types create them when the port group is created.  This process takes more RAM and processor power, so there are limits on the number of ephemeral ports available.  Ephemeral ports are used for recovery when vCenter is down and may help with vCenter availability.  All stats are lost when you vMotion or power off the virtual machine.

Port Group Type advice:

I would use static binding on almost everything.  Ephemeral has a high cost and does not scale.  I do personally use ephemeral for vCenter because I use 100% dVS switches.  If you are using standard switches just use static across the board.