Redeploy NSX Edges to a different cluster / datacenter

First issue: my bad

I ran into an interesting issue in my home lab. I recently replaced all my older HP servers with Intel NUCs and could not be happier with the results. Once I replaced all the ESXi hosts I mounted the storage and started up my virtual machines, including vCenter. Once vCenter and NSX Manager were available I moved all the ESXi hosts to the distributed switch. This normally routine process was complicated by NSX: I should have added the ESXi hosts to the transport zone and let NSX join them to the distributed switch. Because I skipped that step, the NSX VXLAN preparation failed and I could not prepare the hosts. Ultimately I removed the VXLAN entries from the distributed switch and re-prepared the hosts, which re-created the VXLAN entries on the switch. (This is not a good approach for production, so follow the correct path there.)

Second issue: nice to know

This process exposed a second issue: the original cluster and datacenter where my NSX Edges used to live were gone. I assumed that I could just redeploy the Edges from NSX Manager. While this is true, the configuration assumes it will deploy the Edges to the same datacenter, resource pool, and potentially the same host as when they were created. So if I have a failure and expect to just bring up NSX Manager and redeploy to a new cluster, it will not work. You have to adjust the deployment parameters for the Edges, which you can do via the API or the GUI. I wanted to demonstrate the API method:

I needed to change the resource pool, datastore, and host for my Edge. I identified my Edge via its identifier in the GUI (edge-8 for me), grabbed my favorite REST tool (Postman), and formed a query for the current state:

GET https://{nsx-manager-ip}/api/4.0/edges/edge-8/appliances

This returned the configuration for this Edge device. If you need to identify all Edges, use:

GET https://{nsx-manager-ip}/api/4.0/edges

Then I needed the VMware identifiers for the resource pool, datastore, and host. These can all be gathered via the REST API, but I went with PowerShell because it was faster for me. I used the following commands in PowerCLI:

 

get-vmhost | fl        # returned host-881

get-resourcepool | fl  # returned domain-c861

get-datastore | fl     # returned datastore-865
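If you would rather not read through the full fl output, the managed object IDs can also be pulled directly from each object's MoRef in PowerCLI. This is just a sketch; the cluster, datastore, and host names below are placeholders from my lab, so substitute your own:

# Placeholders – substitute your own cluster, datastore, and host names
(Get-Cluster -Name "Basement").ExtensionData.MoRef.Value          # e.g. domain-c861
(Get-Datastore -Name "vsanDatastore").ExtensionData.MoRef.Value   # e.g. datastore-865
(Get-VMHost -Name "esx1.lab.local").ExtensionData.MoRef.Value     # e.g. host-881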

 

Once identified I was ready to form my adjusted query:

 

<appliances>
<applianceSize>compact</applianceSize>
<appliance>
<highAvailabilityIndex>0</highAvailabilityIndex>
<vcUuid>500cfc30-5b2a-6bae-32a3-360e0315ccd3</vcUuid>
<vmId>vm-924</vmId>
<resourcePoolId>domain-c861</resourcePoolId>
<resourcePoolName>domain-c861</resourcePoolName>
<datastoreId>datastore-865</datastoreId>
<datastoreName>datastore-865</datastoreName>
<hostId>host-881</hostId>
<vmFolderId>group-v122</vmFolderId>
<vmFolderName>NSX</vmFolderName>
<vmHostname>esg1-0</vmHostname>
<vmName>ESG-1-0</vmName>
<deployed>true</deployed>
<cpuReservation>
<limit>-1</limit>
<reservation>1000</reservation>
</cpuReservation>
<memoryReservation>
<limit>-1</limit>
<reservation>512</reservation>
</memoryReservation>
<edgeId>edge-9</edgeId>
<configuredResourcePool>
<id>domain-c26</id>
<name>domain-c26</name>
<isValid>false</isValid>
</configuredResourcePool>
<configuredDataStore>
<id>datastore-31</id>
<isValid>false</isValid>
</configuredDataStore>
<configuredHost>
<id>host-29</id>
<isValid>false</isValid>
</configuredHost>
<configuredVmFolder>
<id>group-v122</id>
<name>NSX</name>
<isValid>true</isValid>
</configuredVmFolder>
</appliance>
<deployAppliances>true</deployAppliances>
</appliances>

I used a PUT against https://{nsx-manager-ip}/api/4.0/edges/{edgeId}/appliances with the above body as application/xml. Then I was able to redeploy my Edge devices without any challenge.
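If you prefer PowerShell over Postman, the same PUT can be done with Invoke-RestMethod. This is only a sketch under a few assumptions: the NSX Manager address, the credential, and the appliances.xml file (holding the edited body above) are placeholders, and -SkipCertificateCheck requires PowerShell 6 or later for the lab's self-signed certificate.

# Minimal sketch of the PUT with Invoke-RestMethod (PowerShell 7 shown for -SkipCertificateCheck)
$nsxManager = "192.168.10.20"                       # placeholder NSX Manager address
$cred       = Get-Credential                        # NSX Manager admin account
$body       = Get-Content -Raw -Path ".\appliances.xml"   # the edited appliance body above
Invoke-RestMethod -Uri "https://$nsxManager/api/4.0/edges/edge-8/appliances" `
    -Method Put -Body $body -ContentType "application/xml" `
    -Credential $cred -SkipCertificateCheck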

Public cloud has forced change

Readers will immediately cry foul at this title. Public cloud adoption is not huge; even in the most die-hard cloud-only shops it's only around 40%. Still, I believe public cloud has had, and will continue to have, a transformative effect on private cloud. The presence of a second option has forced the incumbent's hand. I will not detail the challenges of public cloud adoption; that is a blog post for another day. I want to focus on the elements that public cloud's presence has forced into our private, on-prem clouds:

  • Life cycle management for the hypervisor is now table stakes – gone are the days of hypervisor-specific teams, because on public cloud that cost rolls into the operational budget. Quite simply, upgrading, maintaining, and tweaking the hypervisor needs to become easier and cost a lot less OpEx.
  • Delivery of traditional IT services needs to become transparent and quick – the buzzword "agility" applies. The business does not care how many engineers it takes to screw in the server; they just want it now.
  • Consumers of IT services don't like limits or scale issues – all on-prem offerings need some form of elasticity.
  • No one really wants or needs IaaS (Infrastructure as a Service); they want a platform, because only platforms provide value to developers, who in turn provide value to the business. The platform has to include multiple servers, networking and networking constructs, security, and authentication.
  • Cost is important like never before. If you don't control and understand your cost, comparisons will be made to public cloud and you will lose.
  • Service catalogs are only useful if they change and are responsive to business needs (think application development life cycle).
  • Infrastructure people need to learn from development – the future is automated and created by developers who understand infrastructure. You can try to stand still, but it will not last.
  • IT shops now want to spend their IT budget incrementally; very few want to make a huge IT purchase every three years.

I want to be clear I believe public and especially hybrid cloud should be part of every IT strategy.   It’s a critical reality in our world.   I also believe that private cloud is here to stay for many years but expectations will continue to change based upon public cloud experience.

The real question for me is how the new edge of IoT will force public cloud's hand.

Double your storage capacity without buying a new storage shelf

I spent a good portion of my career moving storage from one array to another.   The driver is normally something like this:

  • Cost of older array (life cycle time)
  • New capacity, speed or feature

So off we went on another interrupting migration of LUNs and data. At one point I was sold on physical storage virtualization appliances. They stood in front of the array and allowed me to move data between arrays without interruption to the WWID or application. I loved them; what a great solution. Then Storage vMotion became available and 95% of the workloads were running in VMware. I no longer needed the storage virtualization appliance and my life became very VMware focused.

 

New Storage paradigm

With the advent of all-flash arrays and HCI (all-flash or mixed), performance (speed) has almost gone away as a reason for moving data off arrays. Most arrays offer the same features, replication capability aside. So now we are migrating to new arrays and storage shelves because of capacity or life cycle issues. Storage arrays and their storage shelves have a real challenge with linear growth: they expect you to make a bet on the next three years of capacity. HCI allows a much better linear growth model for storage.

My HCI Gripe

My greatest gripe with HCI solutions is that everyone needs more storage, but that does not always mean you need more compute. Vendors that provide hardware-locked (engineered) platforms suffer from this challenge. The small box provides 10TB, medium 20TB, and large 40TB. Which do I buy if I need 30TB? I am once again stuck in the make-a-bet problem from arrays (at least it's a smaller bet). The software-based platforms, including VSAN (full disclosure: at the time of writing I work for VMware and have run VSAN in my home lab for three years), have the advantage of offering better mixed sizing and linear growth.

What about massive growth?

What happens when you need to double your storage with HCI and you don't have spare drive bays available? Do you buy a new set of compute and migrate to it? That's just a repeat of the storage array model. Recently at some meetings a friend from the Storage and Availability group let me know the VSAN answer to this problem: quite simply, replace the drives in your hosts with larger drives in a rolling fashion. You should keep clusters uniform, but it's totally possible to replace all current drives with new double-capacity drives. Double the size of your storage for only the cost of the drives. (Doubling the size of the cache tier is a more complex operation.) Once the new capacity is available and the host is out of maintenance mode, data is migrated onto the new disks by VSAN.

What is the process?

It’s documented in chapter 11 of the VSAN administration guide : https://pubs.vmware.com/vsphere-60/topic/com.vmware.ICbase/PDF/virtual-san-600-administration-guide.pdf

A high-level overview of the steps (please use the official documentation); a hedged PowerCLI sketch follows the list:

  1. Maintenance mode the host
  2. Remove the disk from the disk group
  3. Replace the disk you removed with the new capacity drive
  4. Rescan for drives
  5. Add disk back into the disk group
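Below is a rough PowerCLI sketch of those five steps using the vSAN cmdlets in the VMware.VimAutomation.Storage module. Treat it as an assumption-laden outline rather than a procedure: the host name and the new disk's canonical name are placeholders, the EnsureAccessibility migration mode is just one choice, and you should verify that these cmdlets and parameters exist in your PowerCLI/vSAN versions before relying on them.

# Rough PowerCLI sketch of the rolling drive swap – verify cmdlet support for your PowerCLI/vSAN version
Import-Module VMware.VimAutomation.Storage
$vmhost = Get-VMHost "esx1.lab.local"                              # placeholder host name

# 1. Put the host into maintenance mode (ensure accessibility keeps VM data available elsewhere)
Set-VMHost -VMHost $vmhost -State Maintenance -VsanDataMigrationMode EnsureAccessibility

# 2. Remove the old capacity disk from its disk group
$dg      = Get-VsanDiskGroup -VMHost $vmhost
$oldDisk = Get-VsanDisk -VsanDiskGroup $dg | Where-Object { -not $_.IsCacheDisk } | Select-Object -First 1
Remove-VsanDisk -VsanDisk $oldDisk -Confirm:$false

# 3. Physically replace the drive, then 4. rescan for the new device
Get-VMHostStorage -VMHost $vmhost -RescanAllHba

# 5. Add the new, larger capacity disk back into the disk group (canonical name is a placeholder)
New-VsanDisk -CanonicalName "naa.5000000000000001" -VsanDiskGroup $dg

# Exit maintenance mode when done
Set-VMHost -VMHost $vmhost -State Connected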

 

Migrating off a distributed virtual switch to a standard switch – Article 2

Normally people want to migrate from virtual standard switches to distributed switches. I am a huge fan of the distributed switch and feel it should be used everywhere. The distributed switch becomes a challenge, however, when you want to migrate hosts to a new vCenter. I have seen a lot of migrations to new vCenters done by detaching the ESXi hosts and connecting them to the new vCenter. This process works great, assuming you are not using the distributed switch. Removing or working with VMs on a ghosted VDS is a real challenge, so remove it before you migrate to a new vCenter.

In this multi-article solution I’ll provide some steps to migrate off a VDS to a VSS.

Article 2: Migrating the host off the VDS. In the last article we moved all the virtual machines off the VDS to a VSS. We now need to migrate the vMotion and management interfaces off the VDS to the VSS. This step will interrupt management of the ESXi host. Virtual machines will not be interrupted, but the management (and vMotion) interfaces will be. You must have console access to the ESXi host for this to work. Steps at a glance:

  1. Confirm that a switch port exists for management and vMotion
  2. Remove vMotion, etc.. from VDS and add to VSS
  3. Remove management from VDS and add to VSS
  4. Confirm settings

Confirm that a switch port exists for management and vMotion

Before you begin, examine the VSS to confirm that the management and vMotion port groups were created correctly by Article 1's script. Once you're sure the VLAN settings for the port groups are correct, you can move to the next step. You may also want to confirm your host isolation settings: it's possible these steps will cause an HA failure if you take too long to switch over and don't have independent datastore networking. Best practice would be to disable HA or switch the isolation response to leave powered on.

Remove vMotion, etc.. from VDS and add to VSS

Log in to the ESXi host via the console or SSH. (Comments are preceded with #)

#use the following command to identify virtual adapters on your dvs

esxcfg-vswitch -l

# sample output from my home lab

DVS Name        Num Ports   Used Ports   Configured Ports   MTU     Uplinks
dvSwitch        1792        7            512                1600    vmnic1

  DVPort ID          In Use     Client
  675                0
  676                1          vmnic1
  677                0
  678                0
  679                1          vmk0
  268                1          vmk1
  139                1          vmk2

 

# We can see three VMkernel adapters on this host; use the following command to identify their purpose and IP addresses

esxcfg-vmknic -l

# Sample output from my home lab (some details cut to make it more readable)

Interface   Port Group/DVPort   IP Family   IP Address
vmk0        679                 IPv4        192.168.10.16
vmk1        268                 IPv4        192.168.10.26
vmk2        139                 IPv4        192.168.10.22

 

Align your vmk numbers with vCenter to identify which adapter provides which function (in my case vmk0 is management, vmk1 is vMotion, and vmk2 is FT).

 

# We can now move every adapter other than management (vmk0 in my case)
# We will start with vmk1, which is on dvSwitch port 268

esxcfg-vmknic -d -v 268 -s "dvSwitch"

 

# Then add vmk1 to the vMotion port group on the VSS (vSwitch0 in my case)

esxcfg-vmknic -a -i 192.168.10.26 -n 255.255.255.0 -p PG-vMotion

 

# Remove FT (vmk2) from the VDS

esxcfg-vmknic -d -v 139 -s "dvSwitch"

# Then add vmk2 to the FT port group on the VSS

esxcfg-vmknic -a -i 192.168.10.22 -n 255.255.255.0 -p PG-FT
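As an aside, for the non-management interfaces (vMotion and FT) you could do the same move from PowerCLI while the host is still connected to vCenter. This is only an alternative sketch under my assumptions: the host name, switch name, port group, and IP are placeholders, and you should not use this approach for the management vmk since removing it drops the connection PowerCLI depends on.

# Alternative sketch: move vmk1 with PowerCLI instead of the ESXi shell (do NOT use this for the management vmk)
$vmhost = Get-VMHost "esx1.lab.local"                      # placeholder host name
# Remove the old vMotion adapter that lives on the VDS
Get-VMHostNetworkAdapter -VMHost $vmhost -VMKernel -Name vmk1 |
    Remove-VMHostNetworkAdapter -Confirm:$false
# Re-create it on the standard switch port group with the same IP
New-VMHostNetworkAdapter -VMHost $vmhost -VirtualSwitch "StandardSwitch" -PortGroup "PG-vMotion" `
    -IP 192.168.10.26 -SubnetMask 255.255.255.0 -VMotionEnabled:$true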

 

Remove management from VDS and add to VSS

Remove management (this stage will interrupt management access to the ESXi host, so make sure you have console access). You might want to pre-type the add command in the console before you execute the remove. If you are having trouble getting a shell on an ESXi host, do the following:

  • Log in to the console and go to Troubleshooting Options -> Enable ESXi Shell

  • Press Alt-Ctrl-F1 to enter the shell and log in

 

Remove management:

esxcfg-vmknic -d -v 679 -s "dvSwitch"

 

Add management to VSS:

esxcfg-vmknic -a -i 192.168.10.16 -n 255.255.255.0 -p PG-Mgmt

 

Confirm settings

Ping the host to ensure management networking has returned. Wait a couple of minutes and ensure the host reconnects in vCenter. After you move the host to a new vCenter you can remove the old VDS:

  • Go to the host in vCenter and select the DVS; it should provide a remove button.

 

 

 

Migrating off a distributed virtual switch to a standard switch – Article 1

Normally people want to migrate from virtual standard switches to distributed switches. I am a huge fan of the distributed switch and feel it should be used everywhere. The distributed switch becomes a challenge, however, when you want to migrate hosts to a new vCenter. I have seen a lot of migrations to new vCenters done by detaching the ESXi hosts and connecting them to the new vCenter. This process works great, assuming you are not using the distributed switch. Removing or working with VMs on a ghosted VDS is a real challenge, so remove it before you migrate to a new vCenter.

In this multi-article solution I’ll provide some steps to migrate off a VDS to a VSS.

It's important to understand that, assuming the networking is correct, this process should not interrupt customer virtual machines. The movement from a distributed switch to a standard switch will at most lose a single ping: when you assign a virtual machine a new network adapter port, a gratuitous ARP is sent out the new adapter. If you only have two physical network adapters, this process does remove network adapter redundancy while moving.

Step 1: Create a VSS with the same port groups

You need to create a standard switch with port groups on the correct VLAN IDs. You can do this manually, but one of the challenges of the standard switch is that the port group names must match exactly, including case, to avoid vMotion errors (one great reason for the VDS). So we need a script to create the standard switch and port groups. I used PowerCLI (sorry, Orchestrator friends, I didn't do it in Orchestrator this time).

Code:

#Import modules for PowerCLI
Import-Module -Name VMware.VimAutomation.Core
Import-Module -Name VMware.VimAutomation.Vds

#Variables to change
$standardSwitchName = "StandardSwitch"
$dvSwitchName = "dvSwitch"
$cluster = "Basement"
$vCenter = "192.168.10.14"

#Connect to vCenter
Connect-VIServer -Server $vCenter

#Get all port groups on the DVS with their VLAN IDs
$dvsPGs = Get-VirtualSwitch -Name $dvSwitchName | Get-VirtualPortGroup | Select Name, @{N="VLANId";E={$_.ExtensionData.Config.DefaultPortConfig.Vlan.VlanId}}, NumPorts

#Get all ESXi hosts in a cluster
$vmhosts = Get-Cluster -Name $cluster | Get-VMHost

#Loop ESXi hosts
foreach ($vmhost in $vmhosts)
{
    #Create new VSS
    $vswitch = New-VirtualSwitch -VMHost $vmhost -Name $standardSwitchName -Confirm:$false

    #Loop port groups and create them on the VSS
    foreach ($dvsPG in $dvsPGs)
    {
        #Validate the VLAN ID is a number (the DVUplink port group returns an array)
        if ($dvsPG.VLANId -is [int])
        {
            New-VirtualPortGroup -Name $dvsPG.Name -VirtualSwitch $vswitch -VlanId $dvsPG.VLANId -Confirm:$false
        }
    }
}

 

Explained:

  • Provide variables

  • Connect to vCenter

  • Get all port groups into $dvsPGs

  • Get all ESXi hosts

  • Loop though ESXi hosts one at a time

  • Create the new standard switch

  • Loop through port groups and create them with same name as DVS and VLAN ID

 

This will create a virtual standard switch with the same VLAN and port group configuration as your DVS.

 

I like to be able to validate that the source and destination are configured the same, so this PowerCLI script provides the check:

Code:

#Validation check DVS vs VSS for differences

$dvsPGs = Get-VirtualSwitch -Name $dvSwitchName | Get-VirtualPortGroup | Select Name, @{N="VLANId";E={$_.ExtensionData.Config.DefaultPortConfig.Vlan.VlanId}}, NumPorts

#Get all ESXi hosts in a cluster
$vmhosts = Get-Cluster -Name $cluster | Get-VMHost

#Loop ESXi hosts
foreach ($vmhost in $vmhosts)
{
    #Get the VSS port groups for this host
    $VSSPortGroups = $vmhost | Get-VirtualSwitch -Name $standardSwitchName | Get-VirtualPortGroup

    #Loop the DVS port groups and look for a matching VSS port group
    foreach ($dvsPG in $dvsPGs)
    {
        if ($dvsPG.VLANId -is [int])
        {
            $match = $FALSE
            foreach ($VSSPortGroup in $VSSPortGroups)
            {
                if ($dvsPG.Name -eq $VSSPortGroup.Name)
                {
                    $match = $TRUE
                }
            }
            if ($match -eq $FALSE)
            {
                Write-Host "Did not find a match for DVS: "$dvsPG.Name" on "$vmhost.Name
            }
        }
    }
}

 

Explained:

  • Get the VDS

  • Get all ESXi hosts

  • Loop through VM hosts

  • Get port groups on standard switch

  • Loop through the DVS port groups and look for matching port groups on the standard switch

  • If missing then output missing element

 

 

Now we need to give the standard switch an uplink (this is critical, otherwise VMs will fail when moved).
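One way to do that from PowerCLI is to attach a free physical NIC to the new VSS on each host. This is only a sketch under the assumption that vmnic2 is an unused uplink in your hosts (pick whichever NIC you actually have spare), and it reuses the $vmhosts and $standardSwitchName variables from the script above.

#Attach a spare physical NIC (placeholder: vmnic2) to the new standard switch on every host
#Assumes $vmhosts and $standardSwitchName from the previous script
foreach ($vmhost in $vmhosts)
{
    $vss  = Get-VirtualSwitch -VMHost $vmhost -Name $standardSwitchName
    $pnic = Get-VMHostNetworkAdapter -VMHost $vmhost -Physical -Name "vmnic2"
    Add-VirtualSwitchPhysicalNetworkAdapter -VirtualSwitch $vss -VMHostPhysicalNic $pnic -Confirm:$false
}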

 

Once it has an uplink you can use the following script to move all virtual machines:

 

Code:

#Move virtual machine network adapters to the new standard switch
$vms = Get-VM

foreach ($vm in $vms)
{
    #Grab the standard switch on the host where this VM runs
    $vss = Get-VirtualSwitch -Name $standardSwitchName -VMHost $vm.VMHost

    #Check that the virtual switch has at least one physical adapter
    if ($vss.ExtensionData.Pnic.Count -gt 0)
    {
        #Get the VM's network adapters
        $adapters = $vm | Get-NetworkAdapter

        #Loop through adapters
        foreach ($adapter in $adapters)
        {
            #Get the VSS port group of the same name (returns the port group on all hosts)
            $VSSPortGroups = Get-VirtualPortGroup -Name $adapter.NetworkName -VirtualSwitch $standardSwitchName

            #Loop the hosts
            foreach ($VSSPortGroup in $VSSPortGroups)
            {
                #Search for the port group on our VM's host
                if ([string]$VSSPortGroup.VMHostId -eq [string]$vm.VMHost.Id)
                {
                    #Change the network adapter to the standard switch port group
                    Set-NetworkAdapter -NetworkAdapter $adapter -Portgroup $VSSPortGroup -Confirm:$false
                }
            }
        }
    }
}

 

Explained:

  • Used same variables from previous script

  • Get all virtual machines (you could use get-vm "name-of-vm" to test a single VM)

  • Loop through all virtual machines one at a time

  • Get the VSS for the VM (host specific)

  • Check for at least one physical uplink to switch (gut / sanity check)

  • Loop through the adapters on a virtual machine

  • For each adapter get VDS port group name and switch the adapter

 

 

 

 

 

Design for Platform services controller (PSC)

This is the first part in a series about building PSC architecture; the rest of the articles are here:

The platform services controller (PSC) that was introduced in vSphere 6.0 has been a source of challenges for a lot of people upgrading into it. I have struggled to identify the best architecture to follow. This article assumes that you want a multi-vCenter single sign-on domain with external PSCs. There are a few key items to consider when architecting PSCs:

Recovery

  • If you lose all PSCs you cannot connect a vCenter to a new PSC; you must re-install the vCenter, losing all data
  • To recover when all PSCs have failed, restore a single PSC from backup (image-level backup is supported) and then redeploy new PSCs for the rest. Restoring multiple PSCs may introduce inconsistencies depending on the time of backup.
  • In 6.5 a vCenter cannot be repointed to a PSC in a different site in the same domain (6.0 can)
  • No 6.x version of vCenter supports repointing to a PSC in a different domain
  • If you lose all PSCs at a site, you can install new PSCs at that site as long as at least one PSC at another site survived, then repoint the vCenter to the new PSC

 

Replication

  • All PSC replication is bi-directional, but it is not automatically arranged in a ring (big one)
  • By default each PSC replicates with only a single other PSC (the one you select when installing the additional PSC)
  • Site names have nothing to do with replication today; they are a logical construct for load balancers and future usage
  • Changes are not unique to a site but to a domain – in other words, all changes at all sites are replicated to all other PSCs, assuming they are part of the domain

 

Availability

  • A vCenter points to a single PSC, never more than one at a time
  • PSCs behind a load balancer (up to 4 supported) are active/passive via the load balancer configuration
  • If you use a load balancer configuration for PSCs and the active PSC fails, the load balancer repoints to another PSC and no reconfiguration is required
  • Site name is important with load balancers: you should place all PSCs behind a load balancer in their own site, and non-load-balanced PSCs at the same location should have a different site name

 

Features

  • PSCs have to be part of the same domain to use enhanced linked mode

 

Performance

  • A PSC can replicate to one or many other PSCs (with a performance impact when replicating to many). You want to minimize the number of replication partners because of that impact.

Topology

  • A ring is the supported topology best practice today
  • PSCs know each other by IP address or domain name (ensure DNS, including PTR records, is correct) – using IP is discouraged because it can never be changed, while use of FQDN allows for IP mobility
  • PSCs are authentication sources, so NTP is critical, and using the same NTP source across all PSCs is critical (if you join one PSC to AD, all need to be joined to the same AD – best not to mix appliance and Windows PSCs)
  • The only reason to have external PSCs is to use enhanced linked mode – if you don't need ELM, use an embedded PSC with vCenter and back the vCenter up at the same time – see http://vmware.com/go/psctree

 

Scalability

  • Current limits are 8 PSCs per domain in 6.0 and 10 per domain in 6.5

 

With all of these items in hand, here are some design tips:

  • Always have n+1 PSCs; in other words, never have a single PSC in a domain when using ELM
  • Have a solid method for restoring your PSCs – image-level backup or the 6.5 restore feature

 

So what is the correct topology for PSCs?

This is a challenging question. Let's identify some design elements to consider:

  • Failure of a single component should not create replication partitions
  • Complexity of setup should be minimized
  • Number of replication agreements should be minimized for performance reasons
  • Scaling out additional PSCs should be as simple as possible

Ring

I spent some time in the ISP world and learned to love rings. They create two paths to every destination and are easy to set up and maintain. They do have issues when two points fail at the same time, potentially creating routing partitions until one of the two is restored. VMware recommends a ring topology for PSCs at the time of this article, as shown below:

Let’s review this topology against the design elements:

  • Failure of a single component should not create replication partitions
    • True – due to the ring there are two ways for everything to replicate
  • Complexity of setup should be minimized
    • The setup ensures redundancy without lots of manually created, performance-impacting replication agreements (one manual agreement)
  • Number of replication agreements should be minimized for performance reasons
    • True
  • Scaling out additional PSC’s should be as simple as possible
    • Adding a new PSC means the following:
      • Add new PSC joined to LAX-2
      • Add new agreement between new PSC and SFO-1
      • Remove agreement between LAX-2 and SFO-1

This looks mostly simple, but you do need to track which agreement is providing your ring's backup loop, which is a manual documentation process today.

Ring with additional redundancy

The VMware Validated Design states that for a two-site enhanced linked mode topology you should build the following:

A few items to illustrate (in case you have not read the VVD)

  • Four vCenters
  • Four PSC’s (in blue)
  • Each PSC replicates with its same-site peer and one remote-site peer, making sure its changes are stored at two sites and in two copies that are then replicated locally and remotely (all four get it)

Let’s evaluate against the design elements:

  • Failure of a single component should not create replication partitions
    • True – due to the ring there are four ways for everything to replicate
  • Complexity of setup should be minimized
    • The setup requires forethought and at least one manual replication agreement
  • Number of replication agreements should be minimized for performance reasons
    • It has more replication agreements
  • Scaling out additional PSC’s should be as simple as possible
    • Adding a new PSC means potentially more replication agreements or more design

 

Update: The VVD team reached out and wanted to be clear that adding additional sites is pretty easy. I believe the challenge comes when you try to identify disaster zones. Because PSCs replicate all changes everywhere, it does not matter if all replication agreements fail; you can still regenerate a site.

Which option should I use?

That is really up to you. I personally love the simplicity of a ring. Neither of these options increases availability of the PSC layer; they are about data consistency and integrity. Use a load balancer if your management plane SLA does not support downtime.

NSX Manager still running but disconnected from vCenter

A quick note in case you run into this issue. I was running into a problem where my NSX Manager was running and everything seemed fine (NSX Manager login / console), but I could not manage NSX elements from inside vCenter; no NSX Manager was showing up. Reconnecting to vCenter or rebooting would resolve the issue, but then I had the problem again the next day. I could not figure it out... then it dawned on me what happens every day: BACKUP! Somehow my NSX Manager had been added to the nightly backup and it would lose its vCenter connection during that window. Here is the only approved method for backing up an NSX Manager:

  1. Use the configuration backup in the NSX manager administration console to make normal and regular backups

 

To recover an NSX Manager do the following:

  1. Deploy a new NSX manager using OVF (same version of NSX as backup) with same IP as original manager
  2. Restore the configuration from the backup
  3. Reboot the NSX manager to ensure clean configuration
  4. Ensure it shows up in the GUI

 

Image level backups are not supported or a good idea 🙂

VMkernel types updated with design guidance for multi-site

Holy crap, what do all these VMware VMkernel types mean? I started this article and realized I had already written one here. It's sad when Google leads you to something you wrote... looks like I don't remember too well. Perhaps I should just go yell for the kids to get off my lawn now. I wanted to take a minute to revise my post with some new things I have learned and some guidance.


From my previous post:

  • vMotion traffic – Required for vMotion – Moves the state of virtual machines (active datadisk svMotion, active memory and execution state) during a vMotion
  • Provisioning traffic – Not required; will use the management network if not set up – used for cold migration, cloning, and snapshot creation (powered-off virtual machines = cold)
  • Fault tolerance traffic (FT)  – Required for FT – Enables fault tolerance traffic on the host – only a single adapter may be used for FT per host
  • Management traffic – Required – Management of host and vCenter server
  • vSphere replication traffic – Only needed if using vSphere replication– outgoing replication data from ESXi host to vSphere replication server
  • vSphere replication NFC traffic – Only needed if using vSphere replication – handles incoming replication data on the target replication site
  • Virtual SAN – Required for VSAN – virtual san traffic on the host
  • VXLAN – used for NSX; not controlled from the add-VMkernel-interface workflow.

I wanted to provide a little better explanation of the design elements around some of these interfaces. Specifically I want to focus on vMotion and Provisioning traffic. Let's create a few scenarios and see which interface is used, assuming I have all the VMkernel interfaces listed above:

  1. VM1 is running and we want to migrate from host1 to host2 at datacenter1 – vMotion
  2. VM1 is running with a snapshot and we want to migrate from host1 to host2 at datacenter1 – Provisioning traffic (if it does not exist management network is used)
  3. VM1 is running with a snapshot and we want to storage migrate from host1 DC1 to host4 DC3 – storage vMotion – Provisioning traffic (if it does not exist management network is used)
  4. VM1 is not running and we want to migrate from host1 to host2 at datacenter1 – Provisioning traffic (very low bandwidth used)
  5. VM1 is not running has a snapshot and we want to migrate from host1 to host2 at datacenter1 – Provisioning traffic (very low bandwidth used)
  6. VM2 is being created at datacenter1 – Provisioning traffic

 

So, design guidance: in a multi-site implementation you should have the following interfaces if you wish to separate the TCP/IP stacks or use Network I/O Control to avoid bad-neighbor situations. (Or you could just assign it all to the management vmk and go nuts on that interface = bad idea.)

  • Management
  • vMotion
  • Provisioning

Use of other vmkernel interfaces depends on if you are using replication, vSAN or NSX.
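For the vMotion piece, a hedged PowerCLI sketch of adding a dedicated VMkernel adapter per host is below. The cluster, switch, port group, and IP range are placeholders, and tagging an adapter for Provisioning traffic is not shown here; to my knowledge that is not a simple switch on this cmdlet and is typically done in the UI or via the host's VirtualNicManager API, so verify the approach for your version.

# Sketch: add a dedicated vMotion VMkernel adapter on each host (names, port group, and IPs are placeholders)
$i = 10
foreach ($vmhost in Get-Cluster "Basement" | Get-VMHost)
{
    New-VMHostNetworkAdapter -VMHost $vmhost -VirtualSwitch "StandardSwitch" -PortGroup "PG-vMotion" `
        -IP "192.168.20.$i" -SubnetMask 255.255.255.0 -VMotionEnabled:$true
    $i++
}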

Should you have multi-nic vMotion? 

Multi-NIC vMotion enables faster vMotion of multiple virtual machines off a host (as long as they don't have snapshots). It is still a good idea if you have large VMs or lots of VMs on a host.

Should you have multi-nic Provisioning?

No idea if it's even supported or a good idea. The Provisioning network is used for long-distance vMotion, so the idea might be good... I would not use it today.

Should IT build a castle or a mobile home?

So I have many hobbies to keep my mind busy during idle times... like when driving a car. One of my favorite hobbies is identifying the best candidate locations to live in if the zombie apocalypse were to happen. As I drive between locations I see many different buildings, and I attempt to rate the large ones by how zombie-proof they are. There are many things to consider in the perfect zombie defense location, for example:

  • Avoiding buildings with large amounts of windows or first floor windows
  • Buildings made of materials that cannot be bludgeoned open, for example stone
  • More than one exit but not too many exits
  • A location that can be defended on all sides and allows visible approach

There are many other considerations, like proximity to water and food, etc., but basically I am looking for the modern equivalent of a castle.

OK, what does this have to do with IT?

Traditional infrastructure is architected like a castle: its primary goal is to secure the perimeter and be imposing enough to keep people out. During a zombie attack this model is great, until they get in; then it becomes a graveyard. IT architects, myself included, spend a lot of time considering all the factors required to build the perfect castle. There are considerations like:

  • Availability
  • Recoverability
  • Manageability
  • Performance
  • Security

All of these have to be considered, and as you add another wing to your castle, every one of these design elements must be reconsidered for the whole castle. We cannot add a new wing that bridges the moat without extending the moat, and so on. Our drive to build the perfect castle has created a monolithic drag. While development teams move from annual releases to quarters or weeks or days, we continue to attempt to control the world from a perimeter design perspective. If we could identify all possible additions to the castle at the beginning, we could potentially account for them. This was true in the castle days: there were only so many ways to get into the castle and so many methods to break in. Even worse, the castle provides lots of nooks and locations for zombies to hide and attack me when I'm not expecting it. This is the challenge with a zombie attack: they don't follow the rules; they just might create a ladder out of zombie bodies and get into your castle (World War Z style). If we compare zombies to the challenges being thrown at IT today, the story holds. How do we deal with constant change and the unknown? How do we become agile to change? Is it by building a better castle?

Introducing the mobile home


Today I realized that the perfect solution to my zombie question is the mobile home. We can all assume that I need a place to sleep, something that I can secure with reasonable assurance. I can reinforce the walls and windows of a mobile home, and I gain something I don't have with a castle: mobility. I can move my secured location and goods to new locations. My mobile home is large enough to provide for my needs without providing too many places for zombies to hide. IT needs this type of mobility. Cloud has provided faster time to market for many enterprises, but in reality you are only renting space in someone else's castle. There are all types of methods to secure your valuables from mine, but in reality we are at the mercy of the castle owner. What if my service could become a secured mobile home? That would provide the agility I need in the long run. The roach motel is very alive and well in cloud providers today: many providers have no cross-provider capabilities, while others provide tools to transform the data between formats. My mobile home needs to be secure and not reconfigured each time I move between locations while looking for resources or avoiding attack. We need to reconsider IT as a secured mobile home and start to build this model. Some functions to consider in my mobile home:

  • Small enough to provide the required functions (bathroom, kitchen, and sleeping space; in IT terms, business value) and not an inch larger than required
  • Self-contained security that encircles the service
  • Mobility without interruption of services

Thanks for reading my rant.  Please feel free to provide your favorite zombie hiding location or your thoughts on the future of IT.

 

Breaking out an SSO/PSC to enable enhanced linked mode

The single sign-on (SSO) component used to be a fairly painless portion of vCenter (once we got to 5.5; in 5.0 it was a major pain). It was essentially a lightweight directory (vsphere.local) and a gateway to Active Directory. The platform services controller (PSC) in vCenter 6 is a completely different animal. It performs a lot of new functions that are not easy to transfer between instances. For example, the PSC does the following:

  • Handles and stores SSL certificates
  • Handles and stores license keys
  • Handles and stores permissions via global permissions layer
  • Handles and stores replication of Tags and Categories
  • Provides built-in automatic replication between different sites

Why does it do all these and why do I care?

Well, VMware has come to understand that virtual machines cannot be bound to a specific location; more and more customers want hybrid and multi-site capabilities while keeping the same management. A lot of the management functions are based around tags and permissions, so having an overarching layer to provide that functionality is huge. I assume that we are going to see more features pushed up to the PSC layer in order to make cross-site / cross-vCenter features available.

Architectural change

In 6.0 VMware changed the architecture to make external PSCs the preferred mode of operation. In fact they support up to 8 replicated PSCs, and there are two constructs that matter:

  • Domain (traditionally this has been vsphere.local)
  • Sites (Physical locations)

Site designation changes how the PSCs and their multi-master replication behave (choosing to replicate to a single instance at each site and then having that instance replicate to local nodes)

The change to external PSCs is a challenge for many users. First let me be clear about a constraint: you can only have one domain, and merging domains is not supported. Once you get to 6 you cannot leave a domain and join a different domain; I have not seen instructions for it and it does not seem to be supported. In 5 you can leave an SSO domain and join a different domain, so if you are still on 5 and wish to join multiple machines to the same domain, do it while on 5 using SSO. If you wish to move from an embedded PSC to an external PSC, the process is pretty simple:

  1. Install a new PSC (can be windows or Linux) joined to the embedded PSC
  2. Repoint the vCenter to the new PSC (instructions here)
  3. Remove the old PSC

The key takeaway, for all of you who might have nodded off during this article, is this: make any topology changes to vCenter domains before upgrading to 6.