Configuring an NSX load balancer from the API

A customer asked me this week if there were any examples of configuring the NSX load balancer via vRealize Automation.  I was surprised when Google didn't turn up any.  The NSX API guide (which is one of the best guides around) provides the details for how to call each element.  You can download it here. Once you have the PDF, navigate to page 200, which is the start of the load balancer section.

Too many Edge devices

NSX load balancers are Edge services gateways.   One NSX environment may have a few Edges while another has hundreds, but not all of them are load balancers.   A quick API lookup of all Edges provides this information (my NSX Manager is 192.168.10.28, hence its use in all the examples):

https://192.168.10.28/api/4.0/edges
        <edgeSummary>
            <objectId>edge-57</objectId>
            <objectTypeName>Edge</objectTypeName>
            <vsmUuid>420CD713-469F-7053-8281-A7BD66A1CD46</vsmUuid>
            <nodeId>92484cee-ab3c-4ed2-955e-e5bd135f5be5</nodeId>
            <revision>2</revision>
            <type>
                <typeName>Edge</typeName>
            </type>
            <name>LB-1</name>
            <clientHandle></clientHandle>
            <extendedAttributes/>
            <isUniversal>false</isUniversal>
            <universalRevision>0</universalRevision>
            <id>edge-57</id>
            <state>deployed</state>
            <edgeType>gatewayServices</edgeType>
            <datacenterMoid>datacenter-21</datacenterMoid>
            <datacenterName>Home</datacenterName>
            <tenantId>default</tenantId>
            <apiVersion>4.0</apiVersion>
            <recentJobInfo>
                <jobId>jobdata-34935</jobId>
                <status>SUCCESS</status>
            </recentJobInfo>
            <edgeStatus>GREEN</edgeStatus>
            <numberOfConnectedVnics>1</numberOfConnectedVnics>
            <appliancesSummary>
                <vmVersion>6.2.0</vmVersion>
                <vmBuildInfo>6.2.0-2982179</vmBuildInfo>
                <applianceSize>compact</applianceSize>
                <fqdn>NSX-edge-57</fqdn>
                <numberOfDeployedVms>1</numberOfDeployedVms>
                <activeVseHaIndex>0</activeVseHaIndex>
                <vmMoidOfActiveVse>vm-283</vmMoidOfActiveVse>
                <vmNameOfActiveVse>LB-1-0</vmNameOfActiveVse>
                <hostMoidOfActiveVse>host-29</hostMoidOfActiveVse>
                <hostNameOfActiveVse>vmh1.griffiths.local</hostNameOfActiveVse>
                <resourcePoolMoidOfActiveVse>resgroup-27</resourcePoolMoidOfActiveVse>
                <resourcePoolNameOfActiveVse>Resources</resourcePoolNameOfActiveVse>
                <dataStoreMoidOfActiveVse>datastore-31</dataStoreMoidOfActiveVse>
                <dataStoreNameOfActiveVse>SYN8-NFS-GEN-VOL1</dataStoreNameOfActiveVse>
                <statusFromVseUpdatedOn>1478911807005</statusFromVseUpdatedOn>
                <communicationChannel>msgbus</communicationChannel>
            </appliancesSummary>
            <hypervisorAssist>false</hypervisorAssist>
            <allowedActions>
                <string>Change Log Level</string>
                <string>Add Edge Appliance</string>
                <string>Delete Edge Appliance</string>
                <string>Edit Edge Appliance</string>
                <string>Edit CLI Credentials</string>
                <string>Change edge appliance size</string>
                <string>Force Sync</string>
                <string>Redeploy Edge</string>
                <string>Change Edge Appliance Core Dump Configuration</string>
                <string>Enable hypervisorAssist</string>
                <string>Edit Highavailability</string>
                <string>Edit Dns</string>
                <string>Edit Syslog</string>
                <string>Edit Automatic Rule Generation Settings</string>
                <string>Disable SSH</string>
                <string>Download Edge TechSupport Logs</string>
            </allowedActions>
        </edgeSummary>

 

This is the summary for a single Edge gateway. In my case, 57 Edges have been deployed over the life of my NSX environment and 15 are active right now, but only edge-57 is a load balancer.   This summary does not provide anything that distinguishes an Edge acting as a load balancer from an Edge acting as a firewall.   In order to identify whether it's a load balancer, I have to query its load balancer configuration using:

https://192.168.10.28/api/4.0/edges/edge-57/loadbalancer/config

Notice the addition of the edge-57 name to the query.   It returns:

<loadBalancer>
    <version>2</version>
    <enabled>true</enabled>
    <enableServiceInsertion>false</enableServiceInsertion>
    <accelerationEnabled>false</accelerationEnabled>
    <monitor>
        <monitorId>monitor-1</monitorId>
        <type>tcp</type>
        <interval>5</interval>
        <timeout>15</timeout>
        <maxRetries>3</maxRetries>
        <name>default_tcp_monitor</name>
    </monitor>
    <monitor>
        <monitorId>monitor-2</monitorId>
        <type>http</type>
        <interval>5</interval>
        <timeout>15</timeout>
        <maxRetries>3</maxRetries>
        <method>GET</method>
        <url>/</url>
        <name>default_http_monitor</name>
    </monitor>
    <monitor>
        <monitorId>monitor-3</monitorId>
        <type>https</type>
        <interval>5</interval>
        <timeout>15</timeout>
        <maxRetries>3</maxRetries>
        <method>GET</method>
        <url>/</url>
        <name>default_https_monitor</name>
    </monitor>
    <logging>
        <enable>false</enable>
        <logLevel>info</logLevel>
    </logging>
</loadBalancer>

Notice that this Edge has the load balancer enabled (true) along with some default monitors.   For comparison, here is an Edge without the feature enabled:

https://192.168.10.28/api/4.0/edges/edge-56/loadbalancer/config
<loadBalancer>
    <version>1</version>
    <enabled>false</enabled>
    <enableServiceInsertion>false</enableServiceInsertion>
    <accelerationEnabled>false</accelerationEnabled>
    <monitor>
        <monitorId>monitor-1</monitorId>
        <type>tcp</type>
        <interval>5</interval>
        <timeout>15</timeout>
        <maxRetries>3</maxRetries>
        <name>default_tcp_monitor</name>
    </monitor>
    <monitor>
        <monitorId>monitor-2</monitorId>
        <type>http</type>
        <interval>5</interval>
        <timeout>15</timeout>
        <maxRetries>3</maxRetries>
        <method>GET</method>
        <url>/</url>
        <name>default_http_monitor</name>
    </monitor>
    <monitor>
        <monitorId>monitor-3</monitorId>
        <type>https</type>
        <interval>5</interval>
        <timeout>15</timeout>
        <maxRetries>3</maxRetries>
        <method>GET</method>
        <url>/</url>
        <name>default_https_monitor</name>
    </monitor>
    <logging>
        <enable>false</enable>
        <logLevel>info</logLevel>
    </logging>
</loadBalancer>

Enabled is false with the same default monitors.   So now we know how to identify which Edges are load balancers:

  • Get a list of all Edges via the API and pull out each objectId element
  • Query each objectId's load balancer config and match on enabled = true (a quick sketch of this follows)
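
A minimal shell sketch of those two steps, assuming basic authentication as admin and nothing fancier than grep for parsing the XML:

NSX=https://192.168.10.28
CRED=admin:password   # hypothetical credentials

# Step 1: pull every Edge id out of the summary list
for EDGE in $(curl -sk -u "$CRED" "$NSX/api/4.0/edges" | grep -oE 'edge-[0-9]+' | sort -u); do
  # Step 2: check whether the load balancer feature is enabled on this Edge
  if curl -sk -u "$CRED" "$NSX/api/4.0/edges/$EDGE/loadbalancer/config" \
       | grep -q '<enabled>true</enabled>'; then
    echo "$EDGE has load balancing enabled"
  fi
done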


Adding virtual servers

Assuming the application profile and pool are already in place, you can add a virtual server with a POST and an XML body like the one below (the virtual server IP must already be assigned to the Edge as an interface):

https://192.168.10.28/api/4.0/edges/edge-57/loadbalancer/config/virtualservers
<virtualServer>
<name>http_vip_2</name>
<description>http virtualServer 2</description>
<enabled>true</enabled>
<ipAddress>192.168.10.18</ipAddress>
<protocol>http</protocol>
<port>443,6000-7000</port> 
<connectionLimit>123</connectionLimit>
<connectionRateLimit>123</connectionRateLimit>
<applicationProfileId>applicationProfile-1</applicationProfileId>
<defaultPoolId>pool-1</defaultPoolId>
<enableServiceInsertion>false</enableServiceInsertion>
<accelerationEnabled>true</accelerationEnabled>
</virtualServer>
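
As a rough curl sketch (assuming basic authentication as admin and that the XML body above is saved as vip.xml), the POST looks like this:

curl -sk -u admin:password -X POST -H "Content-Type: application/xml" \
  -d @vip.xml \
  https://192.168.10.28/api/4.0/edges/edge-57/loadbalancer/config/virtualservers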

[screenshot]

You can see it’s been created.  A quick query:

https://192.168.10.28/api/4.0/edges/edge-57/loadbalancer/config/virtualservers
<loadBalancer>
    <virtualServer>
        <virtualServerId>virtualServer-5</virtualServerId>
        <name>http_vip_2</name>
        <description>http virtualServer 2</description>
        <enabled>true</enabled>
        <ipAddress>192.168.10.18</ipAddress>
        <protocol>http</protocol>
        <port>443,6000-7000</port>
        <connectionLimit>123</connectionLimit>
        <connectionRateLimit>123</connectionRateLimit>
        <defaultPoolId>pool-1</defaultPoolId>
        <applicationProfileId>applicationProfile-1</applicationProfileId>
        <enableServiceInsertion>false</enableServiceInsertion>
        <accelerationEnabled>true</accelerationEnabled>
    </virtualServer>
</loadBalancer>

 

Shows it's been created.  To delete it, just pass the virtualServerId to a DELETE:

https://192.168.10.28/api/4.0/edges/edge-57/loadbalancer/config/virtualservers/virtualserverID
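
For example, a hedged sketch (again assuming basic authentication) that removes the virtual server created above:

curl -sk -u admin:password -X DELETE \
  https://192.168.10.28/api/4.0/edges/edge-57/loadbalancer/config/virtualservers/virtualServer-5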

 

Pool Members

For pools you have to PUT the full pool configuration to add a backend member (or, for that matter, to remove one).  So you first query the existing pools:

https://192.168.10.28/api/4.0/edges/edge-57/loadbalancer/config/pools
<?xml version="1.0" encoding="UTF-8"?>
<loadBalancer>
    <pool>
        <poolId>pool-1</poolId>
        <name>pool-1</name>
        <algorithm>round-robin</algorithm>
        <transparent>false</transparent>
    </pool>
</loadBalancer>

Then you form your PUT with the data elements you need (taken from the API guide):

https://192.168.10.28/api/4.0/edges/edge-57/loadbalancer/config/pools/pool-1
<pool>
<name>pool-1</name>
<description>pool-tcp-snat</description>
<transparent>false</transparent>
<algorithm>round-robin</algorithm>
<monitorId>monitor-3</monitorId>
<member>
<ipAddress>192.168.10.14</ipAddress>
<weight>1</weight>
<port>80</port>
<minConn>10</minConn>
<maxConn>100</maxConn>
<name>m5</name>
<monitorPort>80</monitorPort>
</member>
</pool>
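
A curl sketch of that PUT, assuming basic authentication and that the pool body above is saved as pool.xml:

curl -sk -u admin:password -X PUT -H "Content-Type: application/xml" \
  -d @pool.xml \
  https://192.168.10.28/api/4.0/edges/edge-57/loadbalancer/config/pools/pool-1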

In the client we see a member added:

[screenshot]

Tie it all together

Each of these actions has corresponding update, delete, and query functions.  The real challenge is taking the API inputs and turning them into user-friendly vRealize Automation request inputs.    NSX continues to amaze me as a great product with a very powerful and well-documented API.    I have run into very few issues trying to figure out how to do anything in NSX with the API.  In a future post I may provide some vRealize Orchestrator actions to speed up configuration of load balancers.

vSphere 6.5 features that are exciting to me

Well, yesterday VMware announced vSphere 6.5 and VSAN 6.5; both are huge leaps forward in technology.   They address some major challenges my customers face, and I wanted to share a few features that I think are awesome:

vSphere 6.5

  • High Availability in the vCenter Appliance – if you wanted a reason to switch to the appliance, this has to be it… for years I have asked for high availability for vCenter, and now we have it.   I look forward to testing and blogging about failure scenarios with this new version.  This has been my #1 ask for the platform for the last three years!  We are not talking about VMware HA here; we are talking about active/standby appliances.
  • VM Encryption – notice this is a feature of vSphere, not VSAN. This is huge: the hypervisor can encrypt virtual machines at rest and while they are being vMotioned.   It's a huge enabler for public cloud, allowing you to ensure your data is secured with your own encryption keys.   This is going to make a lot of compliance folks happy and enable some serious hybrid cloud.
  • Integrated Containers – a Docker-compatible interface for containers in vSphere, allowing you to spawn stateless containers while enforcing security, compliance, and monitoring using vSphere tools (NSX, etc.) – this allows you to run traditional and next-generation applications side by side.

VSAN 6.5

  • iSCSI support – VSAN will be able to act as an iSCSI target for physical workloads – think SQL Server failover clustering and Oracle RAC.   This is huge: VSAN can now be an iSCSI server with easy policy-based management and scalable performance.

There are a lot more announcements, but these features are just awesome.    You can read more about vSphere 6.5 here and VSAN 6.5 here.

vRO scriptable task to return top level folder of a VM

Every so often you have nested folders in a vCenter and want to return only the top level folder.  Here is a function to return the top level folder only:

function return_folder_path(vm)
{
    // Walk up the inventory tree, collecting folder names until we leave the folder hierarchy
    var parent = vm.parent;
    var parentNames = new Array();
    var vmPathName = "";
    while (parent instanceof VcFolder){
        parentNames.push(parent.name);
        parent = parent.parent;
    }
    // Skip the hidden root "vm" folder (the last element) and build the path top-down
    for (var i = (parentNames.length - 2); i >= 0; i--){
        vmPathName += "/" + parentNames[i];
    }
    return vmPathName;
}

 

Latency-sensitivity features of ESXi

In ESXi 5.5 a latency sensitivity feature was added to the web client.  The feature applies to individual virtual machines and was added specifically for latency-sensitive applications.  VMware's full recommendations for latency-sensitive applications can be found here.   There are four latency-sensitivity settings exposed in the web client; each denotes a maximum latency, in microseconds, for CPU scheduling:

  • Low
  • Normal – latency-sensitivity features are disabled; this is the default setting for all virtual machines
  • Medium
  • High – the feature is enabled

This left me with questions about the Low and Medium settings.   The official documentation does not even mention them as options.  I have requested that they be documented in the future, but what I can tell you is this:

  • Low = worse than normal shares; think of it as limits for latency sensitivity
  • Medium = not disabled, but not fully enabled

End result: don't use Low or Medium; you should be running either Normal or High today.
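
If you want to confirm what a given VM is set to, here is a rough check from the ESXi shell. This assumes (on my part) that the web client setting is stored in the VMX as sched.cpu.latencySensitivity; the VM name myvm is hypothetical:

# hypothetical VM named myvm; expect a line like: sched.cpu.latencySensitivity = "high"
grep -i latencysensitivity /vmfs/volumes/SYN8-NFS-GEN-VOL1/myvm/myvm.vmx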

What does Enabled do?

CPU:

Enabling latency sensitivity essentially gives the VM exclusive access to physical resources, bypassing virtualization layers and tuning the network virtualization layer.   Because of this bypass access to CPU, the number of virtual CPUs should not exceed the number of physical cores, and you should leave two cores free for VMkernel threads.   Removing CPU scheduling at the virtualization layer reduces latency on CPU operations even more than a CPU reservation alone can. The effect is visible in esxtop as 100% RUN for each vCPU of the VM.

RAM:

Enabling the feature creates an automatic reservation for the full virtual machine memory.

Network:

Network frames will not be coalesced when enabled.

Design Guidance:

Don't use this feature unless you really need the latency sensitivity.  Design clusters around latency-sensitive workloads instead of mixing them into commodity clusters.   The latency-sensitivity settings do not play well with others.

Rarely used ESXCLI commands

I have had this list for a while and figured I would share it.   It's a collection of commands that I have used:

How to get ESX version

[root@vmh1:~] esxcli system version get
Product: VMware ESXi
Version: 6.0.0
Build: Releasebuild-3825889
Update: 2
Patch: 37

How to get host uuid

[root@vmh1:~] esxcli system uuid get
5650706f-983e-c8cc-2315-001018530d76

How to get hostname and domain

[root@vmh1:~] esxcli system hostname get
Domain Name: griffiths.local
Fully Qualified Domain Name: vmh1.griffiths.local
Host Name: vmh1

How to get current load

[root@vmh1:~] esxcli system process stats load get
Load1Minute: 0.13
Load15Minutes: 0.15
Load5Minutes: 0.15

Count of running processes

[root@vmh1:~] esxcli system process stats running get
Running Processes: 667

Boot device information

[root@vmh1:~] esxcli system boot device get
Boot Filesystem UUID: c1341d71-f540cf9f-60c5-9b28cab78741
Boot NIC:
Stateless Boot NIC:

List of loaded modules

[root@vmh1:~] esxcli system module list
Name      Is Loaded  Is Enabled
--------  ---------  ----------
vmkernel  true       true
chardevs  true       true
user      true       true

Info on a specific module

[root@vmh1:~] esxcli system module get -m vmkernel
Module: vmkernel
Module File:
License:
Version: Version Releasebuild-3825889
Build Type:
Provided Namespaces:
Required Namespaces:
Containing VIB: unknown
VIB Acceptance Level: unknown

Log rotation and syslog settings

[root@vmh1:~] esxcli system syslog config get
Default Network Retry Timeout: 180
Dropped Log File Rotation Size: 100
Dropped Log File Rotations: 10
Enforce SSLCertificates: false
Local Log Output: /scratch/log
Local Log Output Is Configured: false
Local Log Output Is Persistent: true
Local Logging Default Rotation Size: 1024
Local Logging Default Rotations: 8
Log To Unique Subdirectory: false
Message Queue Drop Mark: 90
Remote Host: udp://log.griffiths.local:514

CPU Information

[root@vmh1:~] esxcli hardware cpu global get
CPU Packages: 1
CPU Cores: 4
CPU Threads: 4
Hyperthreading Active: false
Hyperthreading Supported: false
Hyperthreading Enabled: true
HV Support: 3
HV Replay Capable: true
HV Replay Disabled Reasons:

[root@vmh1:~] esxcli hardware cpu list
CPU:0
Id: 0
Package Id: 0
Family: 6
Model: 23
Type: 0
Stepping: 10
Brand: GenuineIntel
Core Speed: 2000070668
Bus Speed: 333345099
APIC ID: 0x0
Node: 0
L2 Cache Size: 6291456
L2 Cache Associativity: 24
L2 Cache Line Size: 64
L2 Cache CPU Count: 2
L3 Cache Size: -1
L3 Cache Associativity: -1
L3 Cache Line Size: -1
L3 Cache CPU Count: 2

Memory and NUMA information

[root@vmh1:~] esxcli hardware memory get
Physical Memory: 34358984704 Bytes
Reliable Memory: 0 Bytes
NUMA Node Count: 1

Dump bios information and settings

[root@vmh1:~] smbiosDump
Dumping live SMBIOS data!
BIOS Info: #1
Size: 0x00018
Vendor: "Hewlett-Packard"
Version: "786F4 v01.32"
Date: "10/15/2008"
Start Address: 0xe0000
ROM Size: 1024 kB

 

Videos on how to use vRealize Orchestrator

A few weeks ago I presented on automation at the Louisville VMUG.   During the session I mentioned that vRealize Orchestrator is really a good skill to learn: it allows you to become the orchestrator of external services, and it's a critical skill going forward.  I didn't want to bore the group by watching how-to videos during the session, but I promised to post them.   Here they are:

Part 1

Part 2

Enjoy

Remove a vCenter from a PSC Domain

With the new Platform Services Controller (PSC) architecture, I expect we will see a lot more vCenters joined to a central PSC.   At one point I had two vCenters connected to one PSC.  I deleted one vCenter but continued to get the error:

Could not connect to one or more vCenter Server systems.


My PSC log was riddled with connection failures and it seemed to run really slowly.   Removal is easy and, thanks to the VMware team, well documented.   In my case the PSC was a Linux appliance, so I ran the following command:

cmsso-util unregister --hostID 1234 --node-pnid vCenter2.griffiths.local --username administrator@vsphere.local --passwd password


This is all documented in the following KB: 2106736.

I could not locate an explanation of what hostID to use, so I duplicated the one from the KB article and it worked.  My errors are gone.

vRO add all virtual machines to NSX exception list

Almost everyone implements NSX in a brownfield environment.   Switching the DFW to a default deny is the safest security posture, but it's hard to do in a brownfield environment: suddenly denying all traffic is a bad idea, and converting every application into DFW rules at once is not practical.  One way to solve this is to put every virtual machine on the exception list, then move each machine off the list once you have created the correct allow rules for it.   I didn't want to add all my machines manually via the GUI, so I explored the API.

How to explore the API for NSX

VMware's beta developer center provides the easiest way to explore the NSX API.   You can find the NSX section here.  Searching the API for "excep" quickly turned up the following answer:

[screenshot]

As you can see, there are three methods (GET, PUT, DELETE).  It's always safe to start with a GET, as it does not produce changes.   Using Postman for Chrome, I quickly connected to NSX; my settings are below:

[screenshot]

The return from this GET was a long list of machines that I had manually added to the exception list.  Each virtual machine in the list is identified by its objectId, which is exactly what the PUT and DELETE methods take, so the following worked perfectly:

Delete
https://192.168.10.28/api/2.1/app/excludelist/vm-47

Put
https://192.168.10.28/api/2.1/app/excludelist/vm-47
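
As a rough curl sketch (basic authentication as admin assumed), the three calls look like this:

# GET    - list the current exclude list
curl -sk -u admin:password https://192.168.10.28/api/2.1/app/excludelist
# PUT    - add vm-47 to the list
curl -sk -u admin:password -X PUT https://192.168.10.28/api/2.1/app/excludelist/vm-47
# DELETE - remove vm-47 from the list
curl -sk -u admin:password -X DELETE https://192.168.10.28/api/2.1/app/excludelist/vm-47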

A quick GET showed that vm-47 was back on the list.  Now we have one issue: the designation and inventory of objectIds is not a construct of NSX but of vCenter.

The Plan

In order to be successful I needed to do the following:

  • Gather a list of all objectIds from vCenter
  • PUT them one at a time into NSX's exclude list
  • Have some way to orchestrate it all together

No surprise, I turned to vRealize Orchestrator.   I wanted to keep it to generic REST connections and not use the NSX plugin.   So my journey began.

Orchestrator REST for NSX

  • Log in to Orchestrator
  • Switch to the workflow view
  • Expand the library and locate the Add a REST host workflow
  • Run the workflow

[screenshots of the Add a REST host workflow inputs]

  • Hit submit and wait for it to complete
  • You can verify the connection by visiting the administration section and expanding REST connections

Now we need to add a REST operation for adding a VM to the exception list.

  • Locate the Add REST operation workflow
  • Run it
  • Fill it out as shown (a PUT against /api/2.1/app/excludelist/{vm-id})

[screenshot]

You now have a PUT operation that takes an input of {vm-id} before it can run.  To test, we go back to Postman, delete vm-47, and do a GET to verify it's gone:

Delete:

https://192.168.10.28/api/2.1/app/excludelist/vm-47

Get:

https://192.168.10.28/api/2.1/app/excludelist

 

It is missing from the GET.   Now we need to run our REST operation:

  • Locate the workflow called: Invoke a REST operation
  • Run it as shown below

[screenshots of the Invoke a REST operation workflow inputs]

Once it completed, a quick Postman GET showed me vm-47 was back on the exclude list.   Now I am ready for prime time.

Creation of an Add to Exclude List workflow

I need to create a workflow that just runs the REST operation to add a VM to the exclude list.

  • Copy the Invoke a REST operation workflow
  • The new workflow should be called AddNSXExclude
  • Edit the new workflow
  • Go to Inputs and remove all param_xxx except param_0
  • Move everything else but param_0 to Attributes


  • Let's edit the attributes next
  • Click on the value for the restOperation attribute and set it to the "Put on Exclude List .." operation you created earlier


  • Go to the Schema and edit the REST call scriptable task
  • Remove all param_xxx except param_0 from the IN on the scriptable task


  • Edit the top line of the scripting to read like this:

var inParamtersValues = [param_0];

  • Close the scriptable task
  • Click on presentation and remove everything but the content question


Now we have a new issue: we need the workflow not to error when the return code is not 200 (for example, if the object is already on the exception list).  We just want everything on the list right away, so edit your schema to remove everything but the REST call:

[screenshot]

 

Put it all together with a list of virtual machines

Time for a new workflow with a scriptable task.

  • In the General tab, add a single attribute that is an array of strings


  • Add a scriptable task to the schema
  • Add a foreach element to the schema after the scriptable task
    • Link the foreach loop to the AddNSXExclude workflow you made in the last step
    • Link vmid to param_0


  • Edit the scriptable task and add the following code:

//get list of all VMs in the vCenter inventory
var vms = System.getModule("com.vmware.library.vc.vm").getAllVMs();

// collect each VM's managed object id (vm-XX) to feed the exclude list calls
var vmid = new Array();

for each (vm in vms)
{
    vmid.push(vm.id);
}

 

  • Add an IN for vmid and an OUT for vmid
  • Run it and you're done; you can see the response headers in the logs section

Hope it helps you automate some NSX.

What does apply to mean in NSX Firewall?

When I first started using NSX I ran into this little problem: what does apply to mean, and how should I use it?

Background

I believe the background for apply to comes from physical firewalls, which allowed you to apply rules to a specific interface.   Applying a rule to an interface had the following effects:

  • Limit the number of rules that have to be processed
  • Allow specific fine-grained controls

Applying rules to specific interfaces had a few issues:

  • You had to have a good understanding of the network topology in order to apply rules correctly
  • New interfaces may be missed by rules

You also had the ability to apply the rule to all interfaces.   On the surface, if you had enough hardware to process the rules everywhere, it worked great, but tons of interfaces that didn't need the rules now had them.    There are a few problems:

  • New interfaces would have no rules, and all rules would have to be applied to them
  • These rules exist only on a single firewall; rule creation is specific to that firewall

NSX Firewall

The NSX firewall takes a similar approach to rule application.  All firewall rules are created in NSX Manager and stored inside the NSX Manager database.   By default, rules are applied to the "distributed firewall", which applies them to every virtual machine's vNIC regardless of the virtual machine's location.   This creates the same problem as applying a rule on every interface: each vNIC has a long list of rules to attempt to match.

This is where the apply to tag becomes interesting.   In order to explain I’ll use a simple example:

Two virtual machines: 172.16.0.2 on VNI 5000 and 172.16.20.2 on VNI 5002.

My default firewall rule set allows them to communicate without any issues.  Let's assume I want to block all traffic between these machines, so I create the following rule:

[screenshot: the firewall rule]

Source:  172.16.0.2 virtual machine

Destination: 172.16.20.2 virtual machine

Service: Any

Action: Block

Apply to: Distributed firewall (default)

 

Using Traceflow we can identify where it was blocked:

[Traceflow output]

You can clearly see that the default of the distributed firewall applied the drop action at the source.   This is really great because it limits the traffic on the physical wire.   Since the object is a managed object in NSX, the rule is enforced as early as possible.   If you have a physical entity that is not managed by NSX, the rule will be applied at the destination.   This is hard to prove because Traceflow cannot provide visibility into physical entities.

What does apply to do?

Simply put, it tells NSX where to apply the firewall rule.  Let's examine some of the options for my rule above:

  • Host
  • Cluster
  • Virtual machine
  • IP or Mac set
  • etc..

It provides the full list of objects that DFW rules can be made with, including dynamic sets and tags.   This is really powerful.   For the sake of this example, let's apply the rule to the destination virtual machine instead of the distributed firewall.

[screenshot: the updated firewall rule]

Using traceflow we can see the results:

[Traceflow output]

My attempted connection was dropped at the destination, where I applied the firewall rule.    You can also see how, between steps 7 and 8, the message left host 3 and went across my physical network to host 1 (a black hole of visibility).

Why use the apply to feature?

  • Reduce the number of rules applied to each vNIC
  • Enforce the rule at a specific location (think situations with VM overlap or rule overlap)

Apply to does add to the complexity of the environment and troubleshooting, but it can limit scope.   This is where careful planning and understanding of the environment can really help.   Arkin can help as well, but that's another day's post.

Greatest tool for NSX!

I want to let you in on a little secret of NSX called Traceflow.   It was made available in the 6.2 release and I am in love with it.   In order to explain my love, let's start with a history lesson:

History Lesson (Get off my lawn kids time)

Back in the old days (pretty much right now in every enterprise) you had a bunch of switches, routers and firewalls.   When a server was having a problem communicating with another server you had to trace its MAC address through every hop manually.   You might be lucky and use a SIEM to identify if a firewall was dropping the traffic.   Understanding each hop of the traffic is a pain.    It takes time and can be very complex in enterprise implementations.

Enter NSX

NSX does some complex routing, switching, and firewalling.   In the past, your visibility into the process was articles like mine.   With Traceflow you can prove your theory and identify data paths.    It still does not have visibility beyond the NSX world and into the physical network; hopefully some day we will have that too, but Traceflow can get you pretty close.

Where is this traceflow of which you speak?

Log in to vCenter, select Networking & Security, and it's on the right side most of the way down.   It allows you to select a source and a destination, then inject packets.   The NSX components report back as the injected packet passes by, allowing you to trace the flow of communication.

Show me some meat

Sounds good.  Let's assume we have two virtual machines, 172.16.0.2 and 172.16.0.3, both on VNI (think VLAN) 5000.   They are on the same ESXi host and there are no firewall rules blocking traffic.   Here is the output from Traceflow:

[Traceflow output: both VMs on the same host]

Look at that: the injected packet came from 172.16.0.2, hit the vNIC firewall, then was forwarded directly to 172.16.0.3's vNIC firewall and into the machine.   This is simple and exactly what we expect.  Let's do the exact same thing, except move the second machine to another ESXi host:

[Traceflow output: VMs on different hosts]

Now we have added the VTEP (VXLAN tunnel endpoint) connection between ESXi hosts.  VTEP communication is layer 3 between ESXi hosts, stretching VNI 5000 between hosts whether they are far apart or right next to each other.

Neat meat, but it really only shows layer 2 communication and that's easy

How about some routing, then?  Two virtual machines: 172.16.0.2 on VNI 5000 and 172.16.10.2 on VNI 5001, both on the same ESXi host:

[Traceflow output: routing between VNI 5000 and VNI 5001]

Look at that: now we see the logical router in the mix, taking the traffic from logical switch LS-172.16.0 and routing it to logical switch LS-172.16.10.   Suddenly the flow of traffic is not a mystery.

What about if the firewall is blocking the traffic?

I assumed you would ask so here is a new firewall rule I added:

[screenshot: the new firewall rule]

And the traceflow:

[Traceflow output: packet dropped by the firewall rule]

Yep my packet was dropped and it tells me where and what rule number blocked it.

What is the only problem with traceflow?

That it does not show the traffic flow on my physical network.   This should be fairly simple to add, given that all my NSX traffic is routed; we should not have complex layer 2 stretches or lots of VLANs to worry about.   It's just routed communication that can start at the top of rack with the correct design.