Design for Platform services controller (PSC)

This is the first part in a series about building PSC architecture; the rest of the articles are linked here:

The Platform Services Controller (PSC), introduced in vSphere 6.0, has been a source of challenges for many people upgrading to it. I have struggled to identify the best architecture to follow. This article assumes that you want a multi-vCenter single sign-on domain with external PSCs. There are a few key items to consider when architecting PSCs:


  • If you lose all PSCs, you cannot connect a vCenter to a new PSC; you must reinstall vCenter, losing all data
  • To recover from all PSCs failing, restore a single PSC from backup (image-level backup is supported), then redeploy new PSCs for the rest.   Restoring multiple PSCs may introduce inconsistencies depending on the time of each backup.
  • In 6.5, vCenter cannot be repointed to a PSC in a different site in the same domain (6.0 can)
  • No 6.x version of vCenter supports repointing to a PSC in a different domain
  • If you lose all PSCs at a site, you can install new PSCs at that site as long as at least one PSC at another site survived, then repoint the vCenter to a new PSC



  • All PSC replication is bi-directional but not automatically arranged in a ring (this is a big one)
  • By default each PSC replicates with only a single other PSC (the one you select when installing the additional PSC)
  • Site names have nothing to do with replication today; they are a logical construct for load balancers and future usage
  • Changes are not unique to a site but to a domain – in other words, all changes at all sites are replicated to all other PSCs, assuming they are part of the domain



  • vCenter points to a single PSC, never more than one at a time
  • PSCs behind a load balancer (up to four supported) are active/passive via the load balancer configuration
  • If you use a load balancer and the active PSC fails, the load balancer repoints to another PSC and no reconfiguration is required
  • Site name is important with load balancers: place all PSCs behind a load balancer in their own site; non-load-balanced PSCs at the same location should use a different site name



  • PSCs have to be part of the same domain to use Enhanced Linked Mode



  • A PSC can replicate to one or many other PSCs (with a performance impact when there are many).   You want to minimize the number of replication partners because of this performance impact.


  • A ring is the supported best-practice topology today
  • PSCs know each other by IP address or FQDN (ensure DNS is correct, including PTR records) – using an IP is discouraged because it can never be changed, while an FQDN allows for IP mobility.
  • PSCs are authentication sources, so NTP is critical, and using the same NTP source across all PSCs is critical.  (If you join one PSC to AD, all need to be joined to the same AD – best not to mix appliance and Windows PSCs)
  • The only reason to have external PSCs is to use Enhanced Linked Mode – if you don’t need ELM, use an embedded PSC with vCenter and back up vCenter at the same time – see



  • Current limits are 8 PSCs per domain in 6.0 and 10 per domain in 6.5


With all of these items in hand, here are some design tips:

  • Always have n+1 PSCs – in other words, never have a single PSC in a domain when using ELM
  • Have a solid method for restoring your PSCs – image-level backup or the 6.5 restore feature


So what is the correct topology for PSCs?

This is a challenging question.  Let’s identify some design elements to consider:

  • Failure of a single component should not create replication partitions
  • Complexity of setup should be minimized
  • Number of replication agreements should be minimized for performance reasons
  • Scaling out additional PSC’s should be as simple as possible


I spent some time in the ISP world and learned to love rings.   They create two paths to every destination and are easy to set up and maintain.   They do have issues when two points fail at the same time, potentially partitioning the ring until one of the two is restored.   VMware recommends a ring topology for PSCs at the time of this article, as shown below:

Let’s review this topology against the design elements:

  • Failure of a single component should not create replication partitions
    • True: the ring gives two paths for every change to replicate
  • Complexity of setup should be minimized
    • The setup ensures redundancy without lots of manually created, performance-impacting replication agreements (only one manual agreement is required)
  • Number of replication agreements should be minimized for performance reasons
    • True
  • Scaling out additional PSC’s should be as simple as possible
    • Adding a new PSC means the following:
      • Add the new PSC joined to LAX-2
      • Add a new agreement between the new PSC and SFO-1
      • Remove the agreement between LAX-2 and SFO-1

It looks mostly simple, but you do need to track who is providing your ring’s backup loop, which is a manual documentation process today.
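Under the assumption that you track the topology as an ordered list of PSC names, the scale-out steps above can be sketched as a small helper that derives which agreements to add and remove. The PSC names and data structures here are illustrative only – this models the topology on paper, not any vSphere API.

```python
# Sketch: maintaining a PSC replication ring as an ordered list of nodes.
# Agreements are the edges between consecutive nodes plus the closing edge.

def ring_agreements(ring):
    """Return the set of replication agreements implied by the ring order."""
    return {frozenset((ring[i], ring[(i + 1) % len(ring)]))
            for i in range(len(ring))}

def insert_psc(ring, new_psc, join_partner):
    """Insert new_psc into the ring next to join_partner.

    Returns (new_ring, agreements_to_add, agreements_to_remove) so the
    operator knows which manual agreement to create and which to delete.
    """
    old = ring_agreements(ring)
    idx = ring.index(join_partner)
    new_ring = ring[:idx + 1] + [new_psc] + ring[idx + 1:]
    new = ring_agreements(new_ring)
    return new_ring, new - old, old - new

ring = ["SFO-1", "SFO-2", "LAX-1", "LAX-2"]
new_ring, add, remove = insert_psc(ring, "LAX-3", "LAX-2")
# Joining LAX-3 to LAX-2 creates one agreement automatically; the agreement
# back to SFO-1 must be created manually, and the old LAX-2 <-> SFO-1
# agreement removed to keep the ring minimal.
```

This is exactly the bookkeeping you would otherwise track in documentation: the helper reports the one agreement to create and the one to remove.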

Ring with additional redundancy

The VMware Validated Design states that for a two-site Enhanced Linked Mode topology you should build the following:

A few items to point out (in case you have not read the VVD):

  • Four vCenters
  • Four PSCs (in blue)
  • Each PSC replicates with its same-site peer and one remote-site peer, ensuring its changes are stored at two sites in two copies that are then replicated locally and remotely (all four PSCs get every change)

Let’s evaluate against the design elements:

  • Failure of a single component should not create replication partitions
    • True – the added redundancy gives four paths for everything to replicate
  • Complexity of setup should be minimized
    • The setup requires forethought and at least one manual replication agreement
  • Number of replication agreements should be minimized for performance reasons
    • It has more replication agreements
  • Scaling out additional PSC’s should be as simple as possible
    • Adding a new PSC means potentially more replication agreements or more design


Update: The VVD team reached out and wanted to be clear that adding additional sites is pretty easy.   I believe the challenge comes when you try to identify disaster zones.   Because PSCs replicate all changes everywhere, it does not matter if all replication agreements fail; you can still regenerate a site.

Which option should I use?

That is really up to you.  I personally love the simplicity of a ring.  Neither of these options increases availability of the PSC layer; they are about data consistency and integrity.   Use a load balancer if your management-plane SLA does not support downtime.

Design Scenario: Gigabit network and iSCSI ESXi 5.x

Many months ago I posted some design tips on the VMware forums (I am Gortee there, if you are wondering).   Today a user updated the thread with a new scenario looking for some advice.  While it would be a bad idea personally and professionally for me to give specific advice without a design engagement, I thought I would provide some thoughts about the scenario here.  This allows me to justify some design choices I might make in the situation.   In no way should this be taken as law.  In reality everyone’s situation is different, and small requirements can really change a design.   The original post is here.

The scenario provided was the following:

3 ESXI hosts (2xDell R620,1xDell R720) each with 3×4 port NICS (12 ports total), 64GB RAM. (Wish I would have put more on them ;-))

1 Dell MD3200i iSCSI disk array with 12 x 450GB SAS 15K Drives (11+1 Spare) w/2 4 port GB Ethernet Ports

2 x Dell 5424 switches dedicated for traffic between the MD3200i and the 3 Hosts

Each host is connected to the iSCSI network through 4 dedicated NIC Ports across two different cards

Each Host has 1 dedicated VMotion Nic Port connected to its own VLAN connected to a stacked N3048 Dell Layer 3 switch

Each Host will have 2 dedicated (active\standby) Nic ports (2 different NIC Cards) for management

Each Hosts will have a dedicated NIC for backup traffic (Has its own Layer 3 dedicated network/switch)

Each host will use the remaining 4 Nic Ports (two different NIC cards) for the production/VM traffic)

 would you be so kind to give me some recommendations based on our environment?


Requirements:

  • Support 150 virtual machines
  • Do not interrupt systems during the design changes


Constraints:

  • Cannot buy new hardware
  • Not all traffic is VLAN segmented
  • Lots of 1Gb ports per server


Assumptions:

  • Standard switches only (assumed by me)
  • Software iSCSI is in use (assumed again by me)
  • Not using Enterprise Plus licenses



Dell MD3200i iSCSI disk array with 12 x 450GB SAS 15K Drives (11+1 Spare) w/2 4 port GB Ethernet Ports

2 x Dell 5424 switches dedicated for traffic between the MD3200i and the 3 Hosts

Each host is connected to the iSCSI network through 4 dedicated NIC Ports across two different cards

I personally have never used this array model, so the vendor should be included in the design to make sure my suggestions here are valid for this storage system.  Looking at the VMware HCL we learn the following:

  • Only supported on ESXi 4.1 U1 through 5.5 (no 5.5 U1 yet, so don’t update)
  • You should be using VMW_PSP_RR (Round Robin) for path failover
  • The array supports the following VAAI primitives: Block Zero, Full Copy, HW Assisted Locking

The following suggestions should apply to physical cabling:


Looking at the diagram I made the following design choices:

  • From my limited understanding of the array, the cabling follows the best-practice guide I could find.
  • Connections from the ESXi hosts to the switches are made to create as much redundancy as possible, using all available cards.  It is critical that storage be as redundant as possible.
  • Each uplink (physical NIC) should connect to an individual vmkernel port group, and each port group should be configured with only one active uplink.
  • Physical switches and port groups can stay on the native VLAN, assuming these switches do nothing other than carry storage traffic between these four devices (three ESXi hosts and one array).  If the array and switches provide storage to other systems, follow your vendor’s best practices for segmenting traffic.
  • iSCSI port binding should be configured per VMware and vendor documentation
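As a sanity check, the one-uplink-per-port-group rule above can be expressed as a small validator. The port-group names, vmnic numbers, and data structure below are a hypothetical inventory, not pulled from any vSphere API.

```python
# Sketch: validate the iSCSI port binding layout described above.
# Each iSCSI vmkernel port group must have exactly one active uplink and
# no standby uplinks, or it cannot be bound to the software iSCSI adapter.

def validate_iscsi_portgroups(portgroups):
    """portgroups: dict of name -> {"active": [...], "standby": [...]}."""
    problems = []
    for name, teaming in portgroups.items():
        if len(teaming.get("active", [])) != 1:
            problems.append(f"{name}: needs exactly one active uplink")
        if teaming.get("standby"):
            problems.append(f"{name}: standby uplinks not allowed for binding")
    return problems

# Hypothetical layout: four vmkernel port groups, one per physical NIC port,
# spread across two cards (vmnic2/3 on one card, vmnic6/7 on the other).
layout = {
    "iSCSI-A": {"active": ["vmnic2"], "standby": []},
    "iSCSI-B": {"active": ["vmnic3"], "standby": []},
    "iSCSI-C": {"active": ["vmnic6"], "standby": []},
    "iSCSI-D": {"active": ["vmnic7"], "standby": []},
}
assert validate_iscsi_portgroups(layout) == []
```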

New design considerations from storage:

  • Four 1Gb ports represent the maximum traffic the array will provide
  • The array does not support 5.5 U1 yet, so don’t upgrade
  • We have some VAAI primitives to help speed up processes and avoid SCSI locks
  • Software iSCSI requires that forged transmits be allowed on the virtual switch

Advice to speed up iSCSI storage

  • Find your bottleneck – switch speeds, array processors, or ESXi software iSCSI – and solve it.
  • You might want to consider Storage DRS to automatically balance space usage and I/O load (requires an Enterprise Plus license but saves so much time).  Note it also has an impact on CBT backups, forcing a full backup after a move.
  • Hardware iSCSI adapters might also be worth the time… though they have little real benefit in the 5.x generation of ESXi



We will assume that we now have 8 total 1Gb ports available on each host.   Our current network architecture looks like this (avoiding the question of how many virtual switches):


I may have made mistakes in my reading, but a few items pop out to me:

  • vMotion does not have any redundancy, which means if that card fails we will have to power off VMs to move them to another host.
  • Backup also does not have redundancy, which is less of an issue than the vMotion network
  • Not all traffic has redundant switches, creating single points of failure

A few assumptions have to be made:

  • No single virtual machine will require more than 1Gb of traffic at any time (otherwise we would have to look into LACP or EtherChannel solutions)
  • Management, vMotion, and virtual machine traffic can live on the same switches as long as they are segmented with VLANs


Recommended design:


  • Combine the management switch and VM traffic switch into dual-function switches that carry both types of traffic.
  • Use VLAN tags to put vMotion and management traffic on the same two uplinks, providing card redundancy (configured active/passive).  This could also be configured with multi-NIC vMotion, but I would avoid that due to the complexity around management-network starvation in your situation.
  • Backup continues to have its own two adapters to avoid contention

This does require some careful planning and may not be the best possible use of links.   I am not sure you need six links for your VM traffic, but it cannot hurt.


Final Thoughts:

Is any design perfect?  Nope – lots of room for error and unknowns.  Look at the design and let me know what I missed.  Tell me how you would have done it differently… share so we can both learn.  Either way, I hope it helps.

Radically simple storage design with VMware

Storage is my bread and butter.  I cut my teeth on fine storage arrays from EMC.  Since then I have moved on to many different vendors, and I have learned one truth: storage can be hard or simple.   VMware can make storage easy.   I am very excited about SDS (software-defined storage).  I personally love VSAN and Nutanix – they are the commercial solution to something Google figured out long ago.   Storage is simple, but storage arrays are hard.   VMware has been making great strides to simplify storage, but I find lots of people are afraid to use them.   They prefer to stick with inflexible arrays and provisioning methods.   Please don’t get me wrong – these designs are required for some solutions.  Some transactional processing requires insane IOPS or low latency.  This design is for the rest of you.

Design Overview:

You have a mixed VMware cluster with highly available shared storage, running lots of different applications.  Some of your virtual machines have many drives spread all over your storage LUNs.   Some of your virtual machines have 2TB drives attached, so you have standardized on 4TB LUNs for all VMFS datastores.  All of your LUNs are thin provisioned.  You need to provide a solution that is easy to manage but avoids LUNs running out of disk space in the middle of the night.  You are also concerned about performance: you would love an automated way to move virtual machines if I/O on a LUN becomes a problem.


The following assumptions have been made:

  • You have Enterprise Plus licensing
  • You are running 5.5 and all datastores are native VMFS-5
  • You do not have an array that provides auto-tiering
  • You do not need to take path selection or the physical array into account



VMware’s datastore clusters provide for all the requirements and needs.  By putting all storage in a datastore cluster, management of storage becomes easy.  Just group storage together based on I/O characteristics (do not mix 15,000 RPM disks with 7,200 RPM disks) into a datastore cluster.  Enable Storage DRS and your life just got a lot easier; enable fully automated Storage DRS for ease of management.   This will help you place new virtual machines and move virtual machines off LUNs that are above a certain space-usage threshold (80% by default).   Now you just need to enable I/O-latency moves.  This will move virtual machines to other datastores if the latency on a datastore passes a threshold (15 ms by default) for a specific duration.   I have used Storage DRS just like this, with over 2,000,000 successful storage moves without a single outage.
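The two Storage DRS triggers can be written out as plain logic to make the behavior concrete. This is a sketch of the decision only, not the real algorithm: the thresholds are parameters (both are configurable in the cluster settings), and the datastore numbers are made up for illustration.

```python
# Sketch: the two Storage DRS triggers – space usage and I/O latency –
# expressed as a simple check. Real SDRS also weighs duration, growth,
# and migration cost; this only models which trigger would fire.

def sdrs_wants_move(used_gb, capacity_gb, observed_latency_ms,
                    space_threshold=0.80, latency_threshold_ms=15.0):
    """Return which trigger(s), if any, would generate a recommendation."""
    reasons = []
    if used_gb / capacity_gb > space_threshold:
        reasons.append("space")
    if observed_latency_ms > latency_threshold_ms:
        reasons.append("io_latency")
    return reasons

# A 4 TB datastore at 3.5 TB used trips the space trigger only.
print(sdrs_wants_move(3500, 4096, 8.0))
```

The point of the sketch is that the two triggers are independent: a mostly empty datastore can still shed VMs if its latency stays high, and a quiet datastore can still shed VMs purely on space.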


Abstract storage -> Pool -> Automate


All are provided by this design.

Radically simple networking design with VMware


VMware has lots of great options and features.  Filtering through all the best practices combined with legacy knowledge can be a real challenge.  I envy people starting with VMware now – they don’t have knowledge of all the things that were broken in 3.5, 4.0, 4.1, etc.  It’s been a great journey, but you have to be careful not to let legacy knowledge influence the designs of today.   In this design I will provide a radically simple solution to networking with VMware.


Design overview:

You have been given a VMware cluster running on HP blades.  Each blade has a total of 20Gb of potential bandwidth that can be divided any way you want.   You should make management of this solution easy and provide as much bandwidth as possible to each traffic type.  You have the following traffic types:

  • Management
  • vMotion
  • Fault Tolerance
  • Backup
  • Virtual machine

Your storage is Fibre Channel and not in scope for the network design.   Your chassis is connected to two upstream switches that are stacked.  You cannot configure the switches beyond assigning VLANs.


This design takes into account the following assumptions:

  • Etherchannel and LAG are not desired or available
  • You have Enterprise Plus licensing and vCenter

Physical NIC/switch Design:

We want a simple solution with maximum available bandwidth.  This means we should use two 10Gb NICs on our blades.   The connections to the switch for each NIC should be identical (exact same VLANs) and include the VLANs for management, FT, vMotion, backup, and all virtual machines, each with its own VLAN ID for security purposes.  This solution provides the following benefits:

  • Maximum bandwidth available to all traffic types
  • Easy configuration on the switch and NICs (identical configuration)

The one major drawback to this solution is that some environments require physical separation of traffic, with traffic segregated by NIC.

Virtual Switch Design:

On the virtual switch side we will use a vDS.  In the past there have been major concerns about using a vDS for management and vCenter; there are a number of chicken-and-egg scenarios that come into play.   If you still have concerns, make the port group for vCenter ephemeral so it does not need vCenter to allocate ports.   Otherwise the vDS brings a lot to the table over standard switches, including:

  • Centralized consistent configuration
  • Traffic Shaping with NIOC
  • Load based teaming
  • Netflow
  • vDS automatic health check


Traffic Shaping:

The first thing to understand about traffic shaping with NIOC is that it only applies to outbound (egress) traffic and is applied per host.   We use a numeric value known as a share to enforce traffic prioritization.  Share values are only used during times of contention.   This ability allows you to ensure nothing uses 100% of a link while other neighbors want access to the link.   This is a unique and awesome feature that automates traffic policing in VMware solutions.  You can read about the default NIOC pools here.   I suggest you leave the default pools in place with their default values and then add a custom pool for backup.   Shares are assigned a value from 1 to 100.  Another design factor is that traffic types that are idle are not counted in the share algorithm.   For example, assume shares of 10 for management, 25 for FT, 25 for vMotion, and 50 for virtual machine traffic:


You would assume that the total shares would be 10+25+25+50 = 110, but if you are not using any FT traffic then it’s 10+25+50 = 85.  Either way, this total is divided into the link bandwidth, so the worst-case scenario (100% contention with all traffic types active) would get the following:

  • Management (20 × 10/110) ≈ 1.8 Gb
  • FT (20 × 25/110) ≈ 4.5 Gb
  • vMotion (20 × 25/110) ≈ 4.5 Gb
  • Virtual machine (20 × 50/110) ≈ 9.1 Gb

And remember, this is per host.   You will want to adjust the default settings to fit your requirements and traffic patterns.
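The worst-case arithmetic above can be reproduced in a few lines. The share values and the 20Gb link figure come from the example; the function itself is just the proportional-share math, not anything NIOC exposes.

```python
# Sketch: worst-case bandwidth per NIOC pool under full contention.
# Shares only matter under contention, and idle pools drop out of the total.

shares = {"management": 10, "ft": 25, "vmotion": 25, "vm": 50}
link_gb = 20  # total bandwidth per host

def worst_case(shares, link_gb, active=None):
    """Bandwidth each active pool is guaranteed when all active pools contend."""
    active = active or list(shares)
    total = sum(shares[p] for p in active)
    return {p: round(link_gb * shares[p] / total, 1) for p in active}

print(worst_case(shares, link_gb))
# With FT idle, the remaining pools split the link across 85 shares instead,
# so every remaining pool's guaranteed slice grows.
print(worst_case(shares, link_gb, ["management", "vmotion", "vm"]))
```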

This design has some real advantages:

  • The vMotion NIC is seen as 10Gb, which means you can do 8 concurrent vMotions at the same time
  • No more wasted bandwidth
  • Easy to set up and forget about

Load balancing:

Load-balancing algorithms in vSphere each have their own personality and physical requirements, but we want simple above everything else.  So we choose load-based teaming (LBT), known as “route based on physical NIC load” on the vDS.  This is a great choice for Enterprise Plus customers.  When any one uplink’s utilization exceeds 75% of capacity over a 30-second window, some of the traffic is moved to another link.  This configuration works with any number of uplinks without any configuration on the physical switch.  We avoid loops because a given virtual port does not share uplinks: for example, virtual machine 1 will use uplink1 exclusively while virtual machine 2 uses uplink2.   With this load-balancing method we don’t have to assign different uplink priorities to port groups to balance traffic – just let LBT handle it.    It is 100% fire and forget.  If you find you need more bandwidth, just add more uplinks to the switch and you will be using them.

Radically simple networking

It’s simple and it works.  Here is a simple diagram of the solution:



Once set up, it scales and provides for all your needs.   It’s consistent, clean, and designed around possible failures.   It allows all traffic types to use as much network as needed unless contention is present.   Just think of it as DRS for networking.   I just wish I could handle my physical switches this way… maybe some day with NSX.