Redeploy NSX Edges to a different cluster / datacenter

First Issue my bad

I ran into an interesting issue in my home lab.  I recently replaced all my older HP servers with Intel NUC’s.  I could not be happier with the results.   Once I replaced all the ESXi hosts I mounted the storage and started up my virtual machines including vCenter.   Once vCenter and NSX Manager were available I moved all the ESXi hosts to the distributed switch.   This normal process was complicated by NSX.    I should have added the ESXi hosts to the transport zone allowing NSX to join the distributed switch.   Failure to do this made the NSX VXLAN process fail.   I could not prepare the hosts… ultimately I removed the VXLAN entries from the distributed switch and then re-prepared which re-created the VXLAN entries on the switch.   (This is not a good idea if you use it in production so follow the correct path.

Second Issue nice to know

This process generated a second issue the original cluster and datacenter on which my NSX edges used to live was gone.   I assumed that I could just re-deploy NSX edges from the manager.   While this is true the configuration assumes that it will be deploying the Edges to the same datacenter, resource pool and potentially the same host as when it was created.   So if I have a failure and expect to just bring up NSX manager and redeploy to a new cluster it will not work.   You have to adjust the parameters for the edges you can do this via the API or GUI.   I wanted to demonstrate the API method:

I needed to change the resource pool, datastore, and host for my Edge.   I identified my Edge via the identifier name in the GUI.  (edge-8 for me)  Grabbed my favorite REST tool (postman) and formed a query on the current state:

This returned the configuration for this edge device.  If you need to identify all edges just do

Then I needed the VMware identifier for resource pool, datastore and host – this can all be gathered via the REST API but I went for Powershell because it was faster for me.  I used the following commands in PowerCLI:

 

 

Once identified I was ready to form my adjusted query:

 

<appliances>
<applianceSize>compact</applianceSize>
<appliance>
<highAvailabilityIndex>0</highAvailabilityIndex>
<vcUuid>500cfc30-5b2a-6bae-32a3-360e0315ccd3</vcUuid>
<vmId>vm-924</vmId>
<resourcePoolId>domain-c861</resourcePoolId>
<resourcePoolName>domain-c861</resourcePoolName>
<datastoreId>datastore-865</datastoreId>
<datastoreName>datastore-865</datastoreName>
<hostId>host-881</hostId>
<vmFolderId>group-v122</vmFolderId>
<vmFolderName>NSX</vmFolderName>
<vmHostname>esg1-0</vmHostname>
<vmName>ESG-1-0</vmName>
<deployed>true</deployed>
<cpuReservation>
<limit>-1</limit>
<reservation>1000</reservation>
</cpuReservation>
<memoryReservation>
<limit>-1</limit>
<reservation>512</reservation>
</memoryReservation>
<edgeId>edge-9</edgeId>
<configuredResourcePool>
<id>domain-c26</id>
<name>domain-c26</name>
<isValid>false</isValid>
</configuredResourcePool>
<configuredDataStore>
<id>datastore-31</id>
<isValid>false</isValid>
</configuredDataStore>
<configuredHost>
<id>host-29</id>
<isValid>false</isValid>
</configuredHost>
<configuredVmFolder>
<id>group-v122</id>
<name>NSX</name>
<isValid>true</isValid>
</configuredVmFolder>
</appliance>
<deployAppliances>true</deployAppliances>
</appliances>

I used a PUT against https://{nsx-manager-ip}/api/4.0/edges/{edgeId}/appliances  with the above body in xml/application.   Then I was able to redeploy my edge devices without any challenge.

NSX Manager still running but disconnected from vCenter

A quick note in case you run into this issue.   I was running into problems where my NSX manager was running and everything seemed fine (NSX manager login / Console) but I could not manage NSX elements from inside vCenter.   No NSX manager was showing up.   Reconnecting to vCenter or rebooting would resolve this issue but then I had the problem again the next day.   I could not figure out the issue… then it dawned on me what happens every day…. BACKUP!   Somehow my NSX manager was added to the nightly backup and it would lose connection during this time.    Here is the only approved method for backing up a NSX manager:

  1. Use the configuration backup in the NSX manager administration console to make normal and regular backups

 

To recover a NSX manager do the following:

  1. Deploy a new NSX manager using OVF (same version of NSX as backup) with same IP as original manager
  2. Restore the configuration from the backup
  3. Reboot the NSX manager to ensure clean configuration
  4. Ensure it shows up in the GUI

 

Image level backups are not supported or a good idea 🙂

Configuring a NSX load balancer from API

A customer asked me this week if there was any examples of customers configuring the NSX load balancer via vRealize Automation.   I was surprised when google didn’t turn up any examples.  The NSX API guide (which is one of the best guides around) provides the details for how to call each element.  You can download it here. Once you have the PDF you can navigate to page 200 which is the start of the load balancer section.

Too many Edge devices

NSX load balancers are Edge service gateways.   A normal NSX environment may have a few while others may have hundreds but not all are load balancers.   A quick API lookup of all Edges provides this information: (my NSX manager is 192.168.10.28 hence the usage in all examples)

 

This is for a single Edge gateway in my case I have 57 Edges deployed over the life of my NSX environment and 15 active right now.   But only Edge-57 is a load balancer.   This report does not provide anything that can be used to identify it as a load balancer from a Edge as a firewall.   In order to identify if it’s a load balancer I have to query it’s load balancer configuration using:

Notice the addition of the edge-57 name to the query.   It returns:

Notice that this edge has load balancer enabled as true with some default monitors.   To compare here is a edge without the feature enabled:

Enabled is false with the same default monitors.   So now we know how to identify which edges are load balancers:

  • Get list of all Edges via API and pull out id element
  • Query each id element for load balancer config and match on true

 

 

Adding virtual servers

You can add virtual servers assuming the application profile and pools are already in place with a POST command with a XML body payload like this (the virtual server IP must already be assigned to the Edge as an interface):

capture

You can see it’s been created.  A quick query:

 

Shows it’s been created.  To delete just use the virtualServerId and pass to DELETE

 

Pool Members

For pools you have to update the full information to add a backend member or for that matter remove a member.  So you first query it:

Then you form your PUT with the data elements you need (taken from API guide).

In the client we see a member added:

capture

Tie it all together

Each of these actions have a update delete and query function that can be done.  The real challenge is taking the API inputs and creating user friendly data into vRealize Input to make it user friendly.    NSX continues to amaze me as a great product that has a very powerful and documented API.    I have run into very little issues trying to figure out how to do anything in NSX with the API.  In a future post I may provide some vRealize Orchestrator actions to speed up configuration of load balancers.

 

 

 

 

 

 

 

 

 

vRO add all virtual machines to NSX exception list

Almost everyone is using a brownfield environment to implement NSX.   Switching the DFW firewall to deny all is the safest bet but hard to do with brownfield environments.   Denying all traffic is a bad idea.  Doing a massive application conversion at once into DFW rules is not practical.  One method to solve this issue is to create an exception for all virtual machines then move them out of exception once you have created the correct allow rules for the machine.   I didn’t want to manually via the GUI add all my machines so I explored the API.

How to explore the API for NSX

VMware’s beta developer center provides the easiest way to explore the NSX API.   You can find the NSX section here.  Searching the api for “excep” quickly turned up the following answer:

api

As you can see there are three methods (get, put, delete).  It’s always safe to start with a get as it does not produce changes.   Using postman for chrome.  I was quickly connected to NSX see my setting below:

n1

The return from this get was lots of lines of machines that I had manually added to the exception list.  For example the following

n1

Looking at this virtual machine you can see it’s identified by the objectID which aligns with the put and delete functions the following worked perfectly:

Delete
https://192.168.10.28/api/2.1/app/excludelist/vm-47

Put
https://192.168.10.28/api/2.1/app/excludelist/vm-47

A quick get showed the vm-47 was back on the list.  Now we had one issue the designation and inventory of objectID’s is not a construct of NSX but of vCenter.

The Plan

In order to be successful in my plans I needed to do the following

  • Gather list of all objectID’s from vCenter
  • Put the list one at a time into NSX’s exclude list
  • Have some way to orchestrate it all together

No surprise I turned to vRealize orchestrator.   I wanted to keep it generic rest connections and not use NSX plugins.   So my journey began.

Orchestrator REST for NSX

  • Login to orchestrator
  • Switch to workflow view
  • Expand the library and locate the add rest host workflow
  • Run the workflow

1

2

3

4

  • Hit submit and wait for it to complete
  • You can verify the connect by visiting the administration section and expanding rest connections

Now we need to add a rest operation for addition to the exception list.

  • Locate the Add REST operation workflow
  • Run it
  • Fill out as shown

1

You now have a put method that takes input of {vm-id} before it can run.  In order to test we go back to our Postman and delete vm-47 and do a get to verify it’s gone:

Delete:

https://192.168.10.28/api/2.1/app/excludelist/vm-47

Get:

https://192.168.10.28/api/2.1/app/excludelist

 

It’ is missing from the get.   Now we need to run our REST operation

  • Locate the workflow called: Invoke a REST operation
  • Run it as shown below

1

2

3

Once completed a quick postman get showed me vm-47 is back on the exclude list.   Now I am ready for prime time.

Creation of an Add to Exclude List workflow

I need to create a workflow that just runs the rest operation to add to exclude list.

  • Copy the Invoke a REST operation
  • New workflow should be called AddNSXExclude
  • Edit new workflow
  • Go to inputs and remove all param_xxx except param_0
  • Move everything else but param_0 to Attributes

1

  • Let’s edit the attributes next
  • Click on the value for ther restOperation and set it to “Put on Exclude List ..” operation you created earlier

1

  • Go to the Schema and edit the REST call scriptable task
  • Remove all param_xxx except param_0 from the IN on the scriptable task

1

  • Edit the top line of the scripting to read like this:

var inParamtersValues = [param_0];

  • Close the scriptable task
  • Click on presentation and remove everything but the content question

1

Now we have a new issue.  We need to have it not error when the return code is not 200.  For example if the object is already on the exception list.   We just want everything on the list right away.   So edit your schema to remove everything but the rest call:

1

 

Put it all together with a list of virtual machines

Time for a new workflow with a scriptable task.

  • In the general tab put a single attribute that is an array of string

1

  • Add a scriptable task to the schema
  • Add a foreach element to the schema after the scriptable task
    • Link the foreach look to the AddNSX workflow you made in the last step
    • Link vmid to param_0

1

  • Edit the scriptable task and add the following code:

//get list of all VM’s
vms = System.getModule(“com.vmware.library.vc.vm”).getAllVMs();

var vmid = new Array();

for each (vm in vms)
{

vmid.push(vm.id);

}

 

  • Add an IN for vmid and an OUT for vmid
  • Run it and your complete you can see the response headers in the logs section

Hope it helps you automate some NSX.

What does apply to mean in NSX Firewall?

When I first started using NSX I ran into this little problem.   What does apply to mean and how should I use it?

Background

I believe the background for the apply to is from physical firewalls.   They allowed you to apply rules to a specific interface.   Applying to an interface had the following effects:

  • Limit the number of rules that have to be processed
  • Allow specific fine-grained controls

Applying rules to specific interfaces had a few issues:

  • You had to have a good understanding of the network topology in order apply rules correctly
  • New interfaces may be missed by rules

You also had the ability to apply the rule to all interfaces that existed.   On the surface if you had enough hardware to apply the rules everywhere it worked great.  Tons of interfaces who didn’t need the rules now had them.    There are a few problems:

  • New interfaces would have no rules and all rules would have to be applyed to them
  • These rules exist only on a single firewall rule creation is specific to that firewall

NSX Firewall

The NSX firewall takes a similar approach to firewall application.  All firewall rules are created in NSX manager and stored inside the NSX manager database.   By default rules are applied to the “distributed firewall”.  This will apply the rules to all virtual machines vNIC, regardless of the virtual machines location.   This creates the same problem as applying on every interface, each vNIC will have a long list of rules to attempt to match.

This is where the apply to tag becomes interesting.   In order to explain I’ll use a simple example:

Two virtual machines: 172.16.0.2 on VNI 5000 and 172.16.20.2 on VNI 5002.

My default firewall rule set allows them to communicate without any issues.  Let assume I want to block all traffic between these machines so I create the following rule:

pic1

Source:  172.16.0.2 virtual machine

Destination: 172.16.20.2 virtual machine

Service: Any

Action: Block

Apply to: Distributed firewall (default)

 

Using Traceflow we can identify where it was blocked:

pic2

You can see clearly the default of distributed switch applied the drop action to the source.   This is really great because it limits the traffic on the physical wire.   Since the object is known as a managed object in NSX the rule is enforced as soon as possible.   If you have a physical entity that is not managed by NSX the rule will be applied upon the destination.   This is hard to prove because traceflow cannot provide visibility to physical entities.

What does apply to do?

Simply put it tells NSX where to apply the firewall rule.  Lets examine some of the options for my rule above:

  • Host
  • Cluster
  • Virtual machine
  • IP or Mac set
  • etc..

It provides the full list of objects that DFW rules can made with including dynamic sets and tags.   This is really powerful.   For the sake of this example lets apply the rule to the destination virtual machine instead of the DFW.

pic1

Using traceflow we can see the results:

pic2

My attempted connection was dropped at the destination where I applied the firewall rule.    You can also see how it between 7 and 8 the message left host 3 and went across my physical network to host 1 (black hole of visibility)

Why use the apply to feature?

  • Reduce the amount of rules applied to each vNIC
  • Enforce the rule at a specific location (think situations with VM overlap or rule overlap)

Apply to does add to the complexity of the environment and troubleshooting but can limit scope.   This is where careful planning and understanding of the environment can really help.   Arkin can help as well but that’s another days post.

Greatest tool for NSX!

I want to let you in on a little secret of NSX called Traceflow.   It was made available in the 6.2 release and I am in love with it.   In order to explain my love let’s do a history lesson:

History Lesson (Get off my lawn kids time)

Back in the old days (pretty much right now in every enterprise) you had a bunch of switches, routers and firewalls.   When a server was having a problem communicating with another server you had to trace its MAC address through every hop manually.   You might be lucky and use a SIEM to identify if a firewall was dropping the traffic.   Understanding each hop of the traffic is a pain.    It takes time and can be very complex in enterprise implementations.

Enter NSX

NSX does some complex routing, switching and firewalling.   Your visibility into the process in the past was articles like mine.   With traceflow you can prove your theory and identify data paths.    It still does not have visibility beyond the NSX world and into the physical.   Hopefully some day we will have that too.   Traceflow can get you pretty close.

Where is this traceflow of which you speak?

Login to vCenter, select networking and security and it’s on the right side most of the way down.   It allows you to select a source and a destination then inject packets.   The NSX components report back as the injected packet passes by allowing you to trace the flow of communication.

Show me some meat

Sounds good.  Lets assume we have two virtual machines 172.16.0.2 and 172.16.0.3 both on VNI (think vlan) 5000.   They are on the same ESXi host.   There are no firewall rules blocking traffic.   Here is the output from traceflow:

first_same_host

Look at that.  The injected packet came from 172.16.0.2 and hit the vNIC FW then was forwarded directly to 172.16.0.3’s vNIC firewall and into the machine.   This is simple and exactly what we expect.  Let do the exact same thing except move the second machine to another ESXi host:

second_diff_hosts

Now we have added the VTEP (virtual tunnel end point) connection between ESXi hosts.  VTEP communication is layer 3 between ESXi hosts creating a stretch of VNI 5000 between distances or right next to each other.

Neat meat but it really only shows layer 2 communication that’s easy

How about some routing then.  Two virtual machines 172.16.0.2 VNI 5000 and 172.16.10.2 VNI 5001.   Each on the same ESXi host:

Third_usingtwo_networks

Look at that now we see the logical router in the mix taking the traffic from Logical switch (LS-172.16.0) and routing it to Logical router LS-172.16.10.   Suddenly the flow of traffic is not a mystery.

What about if the firewall is blocking the traffic?

I assumed you would ask so here is a new firewall rule I added:

rule

And the traceflow:

after_fw_rule_added_1

Yep my packet was dropped and it tells me where and what rule number blocked it.

What is the only problem with traceflow?

That is does not show the traffic flow on my physical network.   This should be very simple given that all my traffic for NSX is routed we should not have complex layer 2 stretches or lots of vlans to ensure are in place.   It’s just routed communication that can start at top of rack with the correct design.

NSX Controller forever deploying never working

I ran into an issue with NSX in the home lab where a new NSX controller was deploying and a power outage interrupted the deployment.  This left me in a state of deploying forever.  After waiting a day and being in the same state I removed the inoperable virtual machine from vCenter.  The issue persisted in NSX.

bad show

As you can see controller-18 is forever deploying.  The NSX manager command line showed it as deploying so it’s a database issue somewhere.  Since the deployment action was in place it was impossible to remove the controller and the cluster health was bad with only two controllers.   I don’t have any magic method for working with the NSX manager database (I assume it’s postgres but I really don’t know) other than the API.  So off to the API I went.   First I wanted to query for all controllers to make sure I had the correct name (ID field in picture above).   So I setup my REST connection for

https://IP_of_Manager/api/2.0/vdn/controller

no-controller

I returned that the ID is indeed controller-18.   Once I knew the controller number is was a simple delete method with the right command line:

Capture

Removal via

https://nsx_manager_ip/api/2.0/vdn/controller/controller-18?forceRemoval=True

 

After this command I queried again to confirm it was gone:

after

Since 18 was my last controller it returned nothing.   Hopefully if you have a stuck deploying NSX controller this article will help you remove it.