Advice to VCDX candidates from a Double VCDX

“Sometimes it’s the journey that teaches you a lot about your destination.”  – Drake

Update: I have updated the wording in the constraints section to reflect a Twitter comment; thanks for the fix to the wording.

The VMware Certified Design Expert certification (VCDX) represents the highest tier of VMware's certifications.  I recently contributed to a panel of VCDXs at VMworld.  Candidates considering the VCDX certification had the opportunity to ask the panel questions.  The questions illustrated that candidates were concerned about the Herculean effort required to achieve the certification.  I wanted to take this opportunity to provide some guidance I have learned as a mentor.  I believe anyone can become a VCDX.  It does require hard work, but it is very achievable.

 

Requirements, Constraints, Assumptions and Risks

Becoming a VMware certified design expert does not mean you have to be the most technical person in the room.  It does mean you have to know how to align technology to business needs.  My experience has taught me that I can tell right away whether a VCDX proposal will be successful based upon its requirements, constraints, assumptions and risks.  The ability to gather business and technical requirements is a key skill for any design expert.  Your technical requirements should be aligned to the business requirements. It's important to understand the difference between business and technical requirements:

  • Business Requirements – Define how the delivered product provides value. Other words often used are outcomes or expected benefits.  For example, the solution must meet regulatory compliance.
  • Technical Requirements – Define the technical "must haves" to achieve the outcome. For example, the solution must be able to fail over and fail back from a disaster and support an RTO of four hours.

Many VCDX documents are solely focused on technical requirements and miss the “why” that drives the design.   Understanding the difference between requirements and constraints is another challenge for many candidates:

  • Requirements – Things the design must meet, such as: establish an RTO of four hours or provide capacity for twenty percent growth over the next three years.
  • Constraints – Things that form limits or boundaries that apply to the design.  For example, a specific vendor relationship or reuse of current hardware.  Constraints should be met by the design unless they are resolved via conflict resolution.

Once you have established your requirements and constraints you are left with assumptions and risks:

  • Assumptions – Things you believe to be true but cannot verify. For example, storage usage will grow at the same rate as compute usage, or the sample data provided represents reality.
  • Risks – Things that threaten the project's ability to meet the business requirements. If you identify risks, they should be documented in this section.  Every project has risks; for example, staff skills or timelines.

 

Correctly creating requirements and constraints that align with the elements of design is critical to a successful submission.  Identification of assumptions and risks provides important protection to the architect.  The goal of a VCDX design is to align technology to meet the requirements and constraints, not to provide the best technology mix.

 

Elements of Design

When working with infrastructure, VMware has designated five elements that should be considered in each design choice.  Each design choice should be evaluated against the elements of design for impact.  I personally like to use the acronym RAMPS to help me remember these elements:

  • Recoverability – the choice's effect on disaster recovery
  • Availability – the choice's effect on SLAs
  • Manageability – the choice's effect on management cost
  • Performance – the choice's effect on performance
  • Security – the choice's effect on security

It is not uncommon for availability, recoverability, security or performance to have a negative impact on manageability.  Not all choices can have a net benefit to all elements of design.  The tie breaker in these conflicts should be the requirements.  Conflicts between design elements may exist even after evaluating the requirements, which is why a design includes a conflict resolution section.  Conflict resolution is where the customer of the solution acknowledges the conflict and mitigates it in some form.  Make sure your design identifies its conflicts.  Each requirement and constraint should be aligned to an element of design.  When gathering business requirements, consider the RAMPS impact of each requirement to help build a full list of requirements and constraints.  Each technical requirement or constraint should be aligned to a single element of RAMPS; for example, "support an RTO of four hours" aligns to recoverability.

 

Fun with Formats

Every single candidate struggles with document format.  The VCDX requires far more detail than most enterprise designs.  Format paralysis has slowed, if not stopped, many candidates.  My suggestion is to identify an outline that aligns with the blueprint:

  • Overview
  • Requirements, constraints, assumptions and risks
  • Conceptual architecture
  • Logical architecture
  • Physical architecture
  • Security
  • Appendix

 

Each of the different layers of architecture should address the sub-elements: compute, storage, networking, applications, recovery, virtual machine, management, etc.  You cannot pay lip service to the conceptual and logical architectures.  They must be developed just like the physical architecture.  Design choices should be justified against RAMPS, with conflicts identified.  The secret is to determine a format and start writing; don't get stuck on format.  In the end, the format is not as important as the content, assuming the reviewer can locate the items required in the blueprint.

 

Time Management

Every candidate struggles with time.  We have family, friends, hobbies, faith and work conflicting with the VCDX goal.  My advice is to set a goal with a timeline.  Agree upon a set time each day.  Exercise discipline to work on the VCDX during that time and you will achieve your goal.  For me it was 8:00 – 9:00 PM each night: after my kids' bedtime and before spending time with my wife.  I had to sacrifice computer game time, social media time and blogging time, but after six months I was done.  This model has worked for me to achieve two VCDX certifications and put me on the path to my third.  I'd like to end where I began: I believe everyone can achieve this certification with hard work.  To start, get a mentor by visiting vcdx.vmware.com and searching for a mentor (including me).

Redeploy NSX Edges to a different cluster / datacenter

First issue: my bad

I ran into an interesting issue in my home lab.  I recently replaced all my older HP servers with Intel NUCs.  I could not be happier with the results.  Once I replaced all the ESXi hosts, I mounted the storage and started up my virtual machines, including vCenter.  Once vCenter and NSX Manager were available, I moved all the ESXi hosts to the distributed switch.  This normal process was complicated by NSX.  I should have added the ESXi hosts to the transport zone, allowing NSX to join them to the distributed switch.  Failure to do this made the NSX VXLAN preparation fail.  I could not prepare the hosts… ultimately I removed the VXLAN entries from the distributed switch and then re-prepared, which re-created the VXLAN entries on the switch.  (This is not a good idea in production, so follow the correct path.)

Second issue: nice to know

This process surfaced a second issue: the original cluster and datacenter where my NSX Edges used to live were gone.  I assumed that I could just redeploy the NSX Edges from the manager.  While this is true, the configuration assumes that it will be deploying the Edges to the same datacenter, resource pool and potentially the same host as when they were created.  So if I have a failure and expect to just bring up NSX Manager and redeploy to a new cluster, it will not work.  You have to adjust the parameters for the Edges; you can do this via the API or the GUI.  I wanted to demonstrate the API method:

I needed to change the resource pool, datastore, and host for my Edge.  I identified my Edge via the identifier name in the GUI (edge-8 for me), grabbed my favorite REST tool (Postman) and formed a query on the current state:
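A rough PowerShell equivalent of that Postman GET (the manager address and credentials are placeholders from my lab; on newer PowerShell you may need -SkipCertificateCheck for self-signed lab certificates):

    #Prompt for the NSX Manager admin account (placeholder address below)
    $cred = Get-Credential
    #GET the current appliance configuration for my edge (edge-8)
    Invoke-RestMethod -Uri "https://192.168.10.28/api/4.0/edges/edge-8/appliances" -Method Get -Credential $cred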

This returned the configuration for this Edge device.  If you need to identify all Edges, just do a GET against https://{nsx-manager-ip}/api/4.0/edges.

Then I needed the VMware identifiers for the resource pool, datastore and host – these can all be gathered via the REST API, but I went for PowerShell because it was faster for me.  I used the following commands in PowerCLI:
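Something like the following works; the cluster name comes from my lab, while the datastore and host names are placeholders:

    #Managed object ID of the cluster – resourcePoolId accepts this (e.g. domain-c861)
    (Get-Cluster -Name "Basement").ExtensionData.MoRef.Value
    #Managed object ID of the target datastore (e.g. datastore-865)
    (Get-Datastore -Name "NUC-DS1").ExtensionData.MoRef.Value
    #Managed object ID of the target host (e.g. host-881)
    (Get-VMHost -Name "192.168.10.16").ExtensionData.MoRef.Value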

 

 

Once identified I was ready to form my adjusted query:

 

<appliances>
  <applianceSize>compact</applianceSize>
  <appliance>
    <highAvailabilityIndex>0</highAvailabilityIndex>
    <vcUuid>500cfc30-5b2a-6bae-32a3-360e0315ccd3</vcUuid>
    <vmId>vm-924</vmId>
    <resourcePoolId>domain-c861</resourcePoolId>
    <resourcePoolName>domain-c861</resourcePoolName>
    <datastoreId>datastore-865</datastoreId>
    <datastoreName>datastore-865</datastoreName>
    <hostId>host-881</hostId>
    <vmFolderId>group-v122</vmFolderId>
    <vmFolderName>NSX</vmFolderName>
    <vmHostname>esg1-0</vmHostname>
    <vmName>ESG-1-0</vmName>
    <deployed>true</deployed>
    <cpuReservation>
      <limit>-1</limit>
      <reservation>1000</reservation>
    </cpuReservation>
    <memoryReservation>
      <limit>-1</limit>
      <reservation>512</reservation>
    </memoryReservation>
    <edgeId>edge-9</edgeId>
    <configuredResourcePool>
      <id>domain-c26</id>
      <name>domain-c26</name>
      <isValid>false</isValid>
    </configuredResourcePool>
    <configuredDataStore>
      <id>datastore-31</id>
      <isValid>false</isValid>
    </configuredDataStore>
    <configuredHost>
      <id>host-29</id>
      <isValid>false</isValid>
    </configuredHost>
    <configuredVmFolder>
      <id>group-v122</id>
      <name>NSX</name>
      <isValid>true</isValid>
    </configuredVmFolder>
  </appliance>
  <deployAppliances>true</deployAppliances>
</appliances>

I used a PUT against https://{nsx-manager-ip}/api/4.0/edges/{edgeId}/appliances with the above body as application/xml.  Then I was able to redeploy my Edge devices without any challenge.
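If you would rather stay in PowerShell than Postman, a minimal sketch of the same PUT (placeholder manager address and credentials again, with the edited XML saved to a file first):

    #Read the edited XML body and PUT it back against the edge
    $body = Get-Content -Path "edge-appliances.xml" -Raw
    Invoke-RestMethod -Uri "https://192.168.10.28/api/4.0/edges/edge-8/appliances" -Method Put -Body $body -ContentType "application/xml" -Credential $cred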

Public cloud has forced change

Readers will immediately cry foul at this title.  Public cloud adoption is not huge; even in the most die-hard cloud-only shops it's only around 40%.  I believe public cloud has had, and will continue to have, a transformative effect on private cloud.  The presence of a second option has forced the incumbent option's hand.  I will not detail the challenges of public cloud adoption; that is a blog post for another day.  I want to focus on elements that public cloud's presence has forced into our private on-prem clouds:

  • Life cycle management for the hypervisor is now table stakes – gone are the days of hypervisor-specific teams; on public cloud you roll that cost into the operational budget.  Quite simply, upgrading / maintaining / tweaking the hypervisor needs to become easier and cost a lot less OpEx.
  • Delivery of traditional IT services needs to become transparent and quick – the buzzword agility applies – the business does not care how many engineers it takes to screw in the server; they just want it now
  • Consumers of IT services don't like limits or scale issues – all on-prem offerings need to have some form of elasticity
  • No one really wants or needs IaaS (Infrastructure as a Service); they want platform, because only platforms provide value to developers, who in turn provide value to the business.  Platform has to include multiple services, networking constructs, security, and authentication.
  • Cost is important like never before… if you don't control / understand your cost, comparisons will be made to public cloud and you will lose.
  • Service catalogs are only useful if they change and are responsive to business needs (think application development life cycle)
  • Infrastructure people need to learn from development – the future is automated and created by developers who understand infrastructure – you can try to stand still but it will not last.
  • IT shops now want to spend IT budget incrementally; very few shops want to buy IT as a CapEx spend every three years

 

I want to be clear: I believe public and especially hybrid cloud should be part of every IT strategy.  It's a critical reality in our world.  I also believe that private cloud is here to stay for many years, but expectations will continue to change based upon the public cloud experience.

The real question for me is how the new edge of IoT will force public cloud's hand.

Double your storage capacity without buying a new storage shelf

I spent a good portion of my career moving storage from one array to another.   The driver is normally something like this:

  • Cost of older array (life cycle time)
  • New capacity, speed or feature

So off we went on another interrupting migration of LUNs and data.  At one point I was sold on physical storage virtualization appliances.  They stood in front of the array and allowed me to move data between arrays without interruption to the WWID or application.  I loved them; what a great solution.  Then Storage vMotion became available and 95% of the workloads were running in VMware.  I no longer needed the storage virtualization appliance, and my life became very VMware focused.

 

New Storage paradigm

With the advent of all-flash arrays and HCI (all-flash or mixed), performance (speed) has almost gone away as a reason for moving data off arrays.  Most arrays offer the same features, replication capability aside.  So now we are migrating to new arrays / storage shelves because of capacity or life cycle issues.  Storage arrays and their storage shelves have a real challenge with linear growth.  They expect you to make a bet on the next three years of capacity.  HCI allows a much better linear growth model for storage.

My HCI Gripe

My greatest gripe with HCI solutions is that everyone needs more storage, but that does not always mean you need more compute.  Vendors that provide hardware-locked (engineered) platforms suffer from this challenge.  The small box provides 10TB, medium 20TB and large 40TB.  Which do I buy if I need 30TB?  I am once again stuck in the bet-making problem from arrays (at least it's a smaller bet).  The software-based platforms, including VSAN (full disclosure – at the time of writing I work for VMware and have run VSAN in my home lab for three years), have the advantage of offering better mixed sizing and linear growth.

What about massive growth?

What happens when you need to double your storage with HCI and you don't have spare drive bays available?  Do you buy a new set of compute and migrate to it?  That's just a replacement of the storage array model…  Recently at some meetings, a friend from the Storage and Availability group let me know the VSAN solution to this problem.  Quite simply, replace the drives in your compute with larger drives in a rolling fashion.  You should create uniform clusters, but it's totally possible to replace all current drives with new double-capacity drives.  Double the size of your storage for only the cost of the drives.  (Doubling the size of cache is a more complex operation.)  Once the new capacity is available and the host is out of maintenance mode, data is migrated by VSAN onto the new disks.

What is the process?

It's documented in chapter 11 of the VSAN administration guide: https://pubs.vmware.com/vsphere-60/topic/com.vmware.ICbase/PDF/virtual-san-600-administration-guide.pdf

A high-level overview of the steps (please use the official documentation); a PowerCLI sketch follows the list:

  1. Place the host in maintenance mode
  2. Remove the disk from the disk group
  3. Replace the disk you removed with the new capacity drive
  4. Rescan for drives
  5. Add disk back into the disk group
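For PowerCLI fans, here is a minimal sketch of those five steps on one host, assuming the vSAN cmdlets from VMware.VimAutomation.Storage, a single disk group, and hypothetical host and disk names; the official documentation remains the authority:

    #Hypothetical host name – substitute your own
    $vmhost = Get-VMHost -Name "192.168.10.16"
    #1. Place the host in maintenance mode (pick the vSAN data migration mode your cluster can afford)
    Set-VMHost -VMHost $vmhost -State Maintenance -VsanDataMigrationMode EnsureAccessibility
    #2. Remove the old capacity disk from its disk group (canonical name is a placeholder)
    $diskGroup = Get-VsanDiskGroup -VMHost $vmhost
    $oldDisk = Get-VsanDisk -VsanDiskGroup $diskGroup | Where-Object { $_.CanonicalName -eq "naa.500a07510f86d685" }
    Remove-VsanDisk -VsanDisk $oldDisk -Confirm:$false
    #3. Physically replace the drive, then 4. rescan so ESXi sees the new device
    Get-VMHostStorage -VMHost $vmhost -RescanAllHba | Out-Null
    #5. Add the new larger drive back into the disk group (placeholder canonical name again)
    New-VsanDisk -CanonicalName "naa.500a07510f86d686" -VsanDiskGroup $diskGroup
    #Exit maintenance mode so vSAN can start using the new capacity
    Set-VMHost -VMHost $vmhost -State Connected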

 

Migrating off a distributed virtual switch to a standard switch – Article 2

Normally people want to migrate from virtual standard switches to distributed switches.  I am a huge fan of the distributed switch and feel it should be used everywhere.  The distributed switch becomes a challenge when you want to migrate hosts to a new vCenter.  I have seen a lot of migrations to new vCenters done by detaching the ESXi hosts and connecting to the new vCenter.  This process works great, assuming you are not using the distributed switch.  Removing or working with VMs on a ghosted VDS is a real challenge.  So remove it before you migrate to a new vCenter.

In this multi-article solution I’ll provide some steps to migrate off a VDS to a VSS.

Article 2: Migrating the host off the VDS.  In the last article we moved all the virtual machines off the VDS to a VSS.  We now need to migrate the vMotion and management interfaces off the VDS to a VSS.  This step will cause an interruption to the management of the ESXi host.  Virtual machines will not be interrupted, but management traffic will be.  You must have console access to the ESXi host for this to work.  Steps at a glance:

  1. Confirm that a switch port exists for management and vMotion
  2. Remove vMotion, etc. from the VDS and add to the VSS
  3. Remove management from VDS and add to VSS
  4. Confirm settings

Confirm that a switch port exists for management and vMotion

Before you begin, examine the VSS to confirm that the management and vMotion port groups were created correctly by Article 1's script.  Once you're sure the VLAN settings for the port groups are correct, you can move to the next step.  You may also want to confirm your host isolation settings; it's possible these steps will cause an HA failure if you take too long to switch over and don't have independent datastore networking.  Best practice would be to disable HA or switch to the "leave powered on" isolation response.
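If you prefer PowerCLI to clicking through the UI for this check, something like the following lists the port groups and VLAN IDs on the new VSS (the host name is a placeholder; the switch name matches Article 1's variable):

    #List the VSS port groups and VLAN IDs on one host to compare against the DVS
    Get-VMHost -Name "192.168.10.16" | Get-VirtualSwitch -Name "StandardSwitch" | Get-VirtualPortGroup | Select-Object Name, VLanId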

Remove vMotion, etc. from the VDS and add to the VSS

Log in to the ESXi host via console or SSH.  (Comments are preceded with #)

#Use the following command to identify the virtual adapters on your DVS
esxcfg-vswitch -l

#Sample output from my home lab
DVS Name         Num Ports   Used Ports  Configured Ports  MTU     Uplinks
dvSwitch         1792        7           512               1600    vmnic1

  DVPort ID           In Use      Client
  675                 0
  676                 1           vmnic1
  677                 0
  678                 0
  679                 1           vmk0
  268                 1           vmk1
  139                 1           vmk2

#We have three virtual adapters on this host; use the following command to identify their use and IP addresses
esxcfg-vmknic -l

#Sample output from my home lab (some details cut to make it more readable)
Interface  Port Group/DVPort   IP Family IP Address
vmk0       679                 IPv4      192.168.10.16
vmk1       268                 IPv4      192.168.10.26
vmk2       139                 IPv4      192.168.10.22

 

Align your vmk numbers with vCenter to identify which adapter provides which function (vmk0 management, vmk1 vMotion, vmk2 FT).
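You can also confirm this from PowerCLI rather than the UI; a quick sketch (host name is a placeholder):

    #List VMkernel adapters with their management / vMotion / FT flags
    Get-VMHost -Name "192.168.10.16" | Get-VMHostNetworkAdapter -VMKernel | Select-Object Name, IP, ManagementTrafficEnabled, VMotionEnabled, FaultToleranceLoggingEnabled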

 

#We can now move every adapter other than management, which in my case is vmk0.
#We will start with vmk1, currently on dvSwitch port 268:
esxcfg-vmknic -d -v 268 -s "dvSwitch"

#Then add vmk1 to vSwitch0:
esxcfg-vmknic -a -i 192.168.10.26 -n 255.255.255.0 -p PG-vMotion

#Remove FT (vmk2) and re-add it to the VSS:
esxcfg-vmknic -d -v 139 -s "dvSwitch"
esxcfg-vmknic -a -i 192.168.10.22 -n 255.255.255.0 -p PG-FT

 

Remove management from VDS and add to VSS

Remove management (this stage will interrupt management access to the ESXi host – make sure you have console access).  You might want to pre-type the add command in the console before you execute the remove.  If you are having trouble getting a shell on the ESXi host, do the following:

  • Log in to the console and go to Troubleshooting Options -> Enable ESXi Shell

  • Press Alt-Ctrl-F1 to enter the shell and log in

 

Remove management:

esxcfg-vmknic -d -v 679 -s "dvSwitch"

 

Add management to VSS:

esxcfg-vmknic -a -i 192.168.10.16 -n 255.255.255.0 -p PG-Mgmt

 

Confirm settings

Ping the host to ensure management networking has returned.  Wait a couple of minutes and ensure the host reconnects to vCenter.  After you move the host to a new vCenter, you can remove the ghosted VDS via:

  • Go to the host in vCenter and select the DVS; it should provide a remove button.

 

 

 

Migrating off a distributed virtual switch to a standard switch – Article 1

Normally people want to migrate from virtual standard switches to distributed switches.  I am a huge fan of the distributed switch and feel it should be used everywhere.  The distributed switch becomes a challenge when you want to migrate hosts to a new vCenter.  I have seen a lot of migrations to new vCenters done by detaching the ESXi hosts and connecting to the new vCenter.  This process works great, assuming you are not using the distributed switch.  Removing or working with VMs on a ghosted VDS is a real challenge.  So remove it before you migrate to a new vCenter.

In this multi-article solution I’ll provide some steps to migrate off a VDS to a VSS.

It's important to understand that, assuming the networking is correct, this process should not interrupt customer virtual machines.  The movement from a distributed switch to a standard switch will at most lose a ping.  When you assign a new network adapter, a gratuitous ARP is sent out the new adapter.  If you only have two network adapters, this process does remove network adapter redundancy while moving.

Step 1: Create a VSS with the same port groups

You need to create a standard switch with port groups on the correct VLAN IDs.  You can do this manually, but one of the challenges of the standard switch is that the name must be exactly the same, including case, to avoid vMotion errors.  (One great reason for the VDS.)  So we will use a script to create the standard switch and port groups, using PowerCLI (sorry, Orchestrator friends, I didn't do it in Orchestrator this time).

Code:

    #Import modules for PowerCLI
    Import-Module -Name VMware.VimAutomation.Core
    Import-Module -Name VMware.VimAutomation.Vds

    #Variables to change
    $standardSwitchName = "StandardSwitch"
    $dvSwitchName = "dvSwitch"
    $cluster = "Basement"
    $vCenter = "192.168.10.14"

    #Connect to vCenter
    Connect-VIServer -Server $vCenter

    #Get the DVS port groups with their VLAN IDs
    $dvsPGs = Get-VirtualSwitch -Name $dvSwitchName | Get-VirtualPortGroup | Select Name, @{N="VLANId";E={$_.ExtensionData.Config.DefaultPortConfig.Vlan.VlanId}}, NumPorts

    #Get all ESXi hosts in a cluster
    $vmhosts = Get-Cluster -Name $cluster | Get-VMHost

    #Loop ESXi hosts
    foreach ($vmhost in $vmhosts)
    {
        #Create the new VSS
        $vswitch = New-VirtualSwitch -VMHost $vmhost -Name $standardSwitchName -Confirm:$false

        #Loop the DVS port groups and create them on the VSS
        foreach ($dvsPG in $dvsPGs)
        {
            #Validate the VLAN ID is a number; the DVUplink port group returns an array
            if ($dvsPG.VLANId -is [int])
            {
                New-VirtualPortGroup -Name $dvsPG.Name -VirtualSwitch $vswitch -VlanId $dvsPG.VLANId -Confirm:$false
            }
        }
    }

 

Explained:  

  • Provide variables

  • Connect to vCenter

  • Get all port groups into $dvsPGs

  • Get all ESXi hosts

  • Loop through ESXi hosts one at a time

  • Create the new standard switch

  • Loop through port groups and create them with same name as DVS and VLAN ID

 

This will create a virtual standard switch with the same VLAN and port group configuration as your DVS.    

 

I like to be able to validate that the source and destination are configured the same, so this PowerCLI script provides the checking:

Code:

    #Validation check: DVS vs VSS for differences
    $dvsPGs = Get-VirtualSwitch -Name $dvSwitchName | Get-VirtualPortGroup | Select Name, @{N="VLANId";E={$_.ExtensionData.Config.DefaultPortConfig.Vlan.VlanId}}, NumPorts

    #Get all ESXi hosts in a cluster
    $vmhosts = Get-Cluster -Name $cluster | Get-VMHost

    #Loop ESXi hosts
    foreach ($vmhost in $vmhosts)
    {
        #Get the VSS port groups for this host
        $VSSPortGroups = $vmhost | Get-VirtualSwitch -Name $standardSwitchName | Get-VirtualPortGroup

        #Loop the DVS port groups and look for a match on the VSS
        foreach ($dvsPG in $dvsPGs)
        {
            if ($dvsPG.VLANId -is [int])
            {
                $match = $FALSE
                foreach ($VSSPortGroup in $VSSPortGroups)
                {
                    if ($dvsPG.Name -eq $VSSPortGroup.Name)
                    {
                        $match = $TRUE
                    }
                }
                #Report any DVS port group missing from this host's VSS
                if ($match -eq $FALSE)
                {
                    Write-Host "Did not find a match for DVS: "$dvsPG.Name" on "$vmhost.Name
                }
            }
        }
    }

 

Explained:

  • Get the VDS

  • Get all ESXi hosts

  • Loop through VM hosts

  • Get port groups on standard switch

  • Loop through the DVS port groups and look for matches on the standard switch

  • If missing then output missing element

 

 

Now we need to give the standard switch an uplink (this is critical; otherwise VMs will fail when moved).
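A minimal sketch of that step in PowerCLI (the host name and vmnic are placeholders, and it assumes the vmnic has already been freed from the DVS uplinks):

    #Give the standard switch a physical uplink on one host
    $vmhost = Get-VMHost -Name "192.168.10.16"
    $vss = Get-VirtualSwitch -VMHost $vmhost -Name $standardSwitchName -Standard
    $pnic = Get-VMHostNetworkAdapter -VMHost $vmhost -Physical -Name "vmnic0"
    Add-VirtualSwitchPhysicalNetworkAdapter -VirtualSwitch $vss -VMHostPhysicalNic $pnic -Confirm:$false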

 

Once it has an uplink you can use the following script to move all virtual machines:

 

Code:

    #Move virtual machines to the new switch
    $vms = Get-VM

    foreach ($vm in $vms)
    {
        #Grab the standard switch on this VM's host
        $vss = Get-VirtualSwitch -Name $standardSwitchName -VMHost $vm.VMHost

        #Check that the virtual switch has at least one physical adapter
        if ($vss.ExtensionData.Pnic.Count -gt 0)
        {
            $adapters = $vm | Get-NetworkAdapter

            #Loop through the VM's network adapters
            foreach ($adapter in $adapters)
            {
                #Get the VSS port group of the same name; this returns the port group from every host
                $VSSPortGroups = Get-VirtualPortGroup -Name $adapter.NetworkName -VirtualSwitch $standardSwitchName

                #Loop the hosts
                foreach ($VSSPortGroup in $VSSPortGroups)
                {
                    #Search for the port group on our VM's host
                    if ([string]$VSSPortGroup.VMHostId -eq [string]$vm.VMHost.Id)
                    {
                        #Change the network adapter to the standard switch port group
                        Set-NetworkAdapter -NetworkAdapter $adapter -Portgroup $VSSPortGroup -Confirm:$false
                    }
                }
            }
        }
    }

 

Explained:  

  • Used same variables from previous script

  • Get all virtual machines (you could use get-vm "name-of-vm" to test a single VM)

  • Loop through all virtual machines one at a time

  • Get the VSS for the VM (host specific)

  • Check for at least one physical uplink to switch (gut / sanity check)

  • Loop through the adapters on a virtual machine

  • For each adapter get VDS port group name and switch the adapter

 

 

 

 

 

Design for Platform services controller (PSC)

This is the first part in a series about building PSC architecture; the rest of the articles are here:

The platform services controller introduced in vSphere 6.0 has been a source of challenge for a lot of people upgrading into it.  I have struggled to identify the best architecture to follow.  This article assumes that you want to have a multi-vCenter single sign-on domain with external PSCs.  There are a few key items to consider in architecting PSCs:

Recovery

  • If you lose all PSCs you cannot connect a vCenter to a new PSC; you must re-install the vCenter, losing all data
  • To recover when all PSCs have failed, restore a single PSC from backup (image-level backup is supported), then redeploy new PSCs for the rest.  Restoring multiple PSCs may introduce inconsistencies depending on the time of backup.
  • In 6.5 a vCenter cannot be repointed to a PSC in a different site in the same domain (6.0 can)
  • No 6.x version of vCenter supports repointing to a PSC in a different domain
  • If you lose all PSCs at a site, you can install new PSCs at the site as long as at least one PSC at another site survived, then repoint the vCenter to the new PSCs

 

Replication

  • All PSC replication is bi-directional but not automatically in a ring (big one)
  • By default each PSC replicates with only a single other PSC (the one you select when installing the additional PSC)
  • Site names do not have anything to do with replication today; they are a logical construct for load balancers and future usage
  • Changes are not unique to a site but to a domain – in other words, all changes at all sites are replicated to all other PSCs, assuming they are part of the domain

 

Availability

  • vCenter points to a single PSC, never more than one at a time
  • PSCs behind a load balancer (up to four supported) are active/passive via the load balancer configuration
  • If you use a load balancer configuration for PSCs and the active PSC fails, the load balancer repoints to another PSC and no reconfiguration is required
  • Site name is important with load balancers: you should place all PSCs behind a load balancer in their own site; non-load-balanced PSCs at the same site should have a different site name

 

Features

  • PSCs have to be part of the same domain to use enhanced linked mode

 

Performance

  • A PSC can replicate to one or many other PSCs (with a performance impact when many).  You want to minimize the number of replication partners because of that performance impact.

Topology

  • A ring is the supported topology best practice today
  • PSCs know each other by IP address or domain name (ensure DNS is correct, including PTR records) – using IP is discouraged because it can never be changed; use of FQDN allows for IP mobility
  • PSCs are authentication sources, so NTP is critical, and the same NTP source across all PSCs is critical.  (If you join one PSC to AD, all need to be joined to the same AD – best not to mix appliance and Windows PSCs)
  • The only reason to have external PSCs is to use enhanced linked mode – if you don't need ELM, use an embedded PSC with vCenter and back the vCenter up at the same time – see http://vmware.com/go/psctree

 

Scalability

  • Current limits are 8 PSCs in a domain in 6.0 and 10 in a domain in 6.5

 

With all of these items in hand here are some design tips:

  • Always have n+1 PSCs; in other words, never have a single PSC in a domain when using ELM
  • Have a solid method for restoring your PSCs – image-level backup or the 6.5 restore feature

 

So what is the correct topology for PSCs?

This is a challenging question.  Let's identify some design elements to consider:

  • Failure of a single component should not create replication partitions
  • Complexity of setup should be minimized
  • The number of replication agreements should be minimized for performance reasons
  • Scaling out additional PSCs should be as simple as possible

Ring

I spent some time in the ISP world and learned to love rings.  They create two paths to every destination and are easy to set up and maintain.  They do have issues when two points fail at the same time, potentially creating routing partitions until one of the two is restored.  VMware recommends a ring topology for PSCs at the time of this article, as shown below:

Let’s review this topology against the design elements:

  • Failure of a single component should not create replication partitions
    • True; due to the ring there are two ways for everything to replicate
  • Complexity of setup should be minimized
    • The setup ensures redundancy without lots of manually created, performance-impacting replication agreements (one manual agreement)
  • Number of replication agreements should be minimized for performance reasons
    • True
  • Scaling out additional PSCs should be as simple as possible
    • Adding a new PSC means the following:
      • Add the new PSC joined to LAX-2
      • Add a new agreement between the new PSC and SFO-1
      • Remove the agreement between LAX-2 and SFO-1

This looks mostly simple, but you do need to track who is providing your ring backup loop, which is a manual documentation process today.

Ring with additional redundancy

The VMware Validated Design states that for a two-site enhanced linked mode topology you should build the following:

A few items to illustrate (in case you have not read the VVD):

  • Four vCenters
  • Four PSCs (in blue)
  • Each PSC replicates with its same-site peer and one remote-site peer, thus making sure its changes are stored at two sites, in two copies that are then replicated locally and remotely (all four get it)

Let’s evaluate against the design elements:

  • Failure of a single component should not create replication partitions
    • True; due to the ring there are four ways for everything to replicate
  • Complexity of setup should be minimized
    • The setup requires forethought and at least one manual replication agreement
  • Number of replication agreements should be minimized for performance reasons
    • It has more replication agreements
  • Scaling out additional PSCs should be as simple as possible
    • Adding a new PSC means potentially more replication agreements or more design

 

Update: The VVD team reached out and wanted to be clear that adding additional sites is pretty easy.  I believe the challenge comes when you try to identify disaster zones.  Because PSCs replicate all changes everywhere, it does not matter if all replication agreements fail; you can still regenerate a site.

Which option should I use?

That is really up to you.  I personally love the simplicity of a ring.  Neither of these options increases availability of the PSC layer; they are about data consistency and integrity.  Use a load balancer if your management plane SLA does not support downtime.