Design for Platform Services Controller (PSC)

This is the first part in a series about building a PSC architecture. The rest of the articles are here:

The Platform Services Controller, introduced in vSphere 6.0, has been a source of challenges for a lot of people upgrading to it. I have struggled to identify the best architecture to follow. This article assumes that you want a multi-vCenter single sign-on domain with external PSCs. There are a few key items to consider when architecting PSCs:

Recovery

If you lose all PSCs, you cannot connect a vCenter to a new PSC; you must reinstall the vCenter, losing all of its data
To recover when all PSCs have failed, restore a single PSC from backup (image-level backup is supported), then redeploy new PSCs for the rest. Restoring multiple PSCs may introduce inconsistencies, depending on when each backup was taken.
In 6.5, vCenter cannot be repointed to a PSC in a different site in the same domain (in 6.0 it can)
No 6.x version of vCenter supports repointing to a PSC in a different domain
If you lose all PSCs at a site, you can install new PSCs at that site as long as at least one PSC at another site survived, then repoint the vCenter to the new PSC

Replication

All PSC replication is bidirectional, but it is not automatically arranged in a ring (this is a big one)
By default, each PSC replicates with only a single other PSC (the one you select when installing the additional PSC)
Site names have nothing to do with replication today; they are a logical construct for load balancers and future use
Changes are not scoped to a site but to a domain. In other words, all changes at all sites are replicated to all other PSCs that are part of the domain
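
To make the "bidirectional but not automatically a ring" point concrete, here is a minimal Python sketch that models replication agreements as unordered pairs. It is not VMware tooling: the PSC names, the install order, and both helper functions are hypothetical, and the only assumption taken from the points above is that each newly installed PSC gets one agreement with the partner chosen at install time.

```python
from collections import defaultdict

def agreements_from_installs(installs):
    """Build agreements from (new_psc, partner_chosen_at_install) pairs.

    Each agreement is a frozenset because replication is bidirectional:
    (A, B) and (B, A) describe the same agreement.
    """
    return {frozenset((new, partner)) for new, partner in installs}

def partner_counts(agreements):
    """Count how many replication partners each PSC has."""
    counts = defaultdict(int)
    for pair in agreements:
        for psc in pair:
            counts[psc] += 1
    return dict(counts)

# Hypothetical install order: each PSC is joined to the one deployed before it.
installs = [("LAX-2", "LAX-1"), ("SFO-1", "LAX-2"), ("SFO-2", "SFO-1")]
agreements = agreements_from_installs(installs)
counts = partner_counts(agreements)

# In a ring every PSC has exactly two partners; here the two ends have one.
every_psc_has_two_partners = all(n == 2 for n in counts.values())
print(counts)
print("looks like a ring" if every_psc_has_two_partners
      else "chain, not a ring: one more agreement is needed to close the loop")
```

In this model the default installs produce a chain, and the loop only closes once an extra agreement between the two ends is created by hand.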

Availability

vCenter points to a single PSC, never more than one at a time
PSCs behind a load balancer (up to 4 supported) are active/passive via load balancer configuration
If you use a load balancer for PSCs and the active PSC fails, the load balancer redirects to another PSC and no reconfiguration is required
Site name is important with load balancers: you should place all PSCs behind a load balancer in their own site. Non-load-balanced PSCs at the same location should have a different site name

Features

PSCs have to be part of the same domain to use enhanced linked mode

Performance

A PSC can replicate with one or many other PSCs, but each additional partner has a performance impact, so you want to minimize the number of replication partners.

Topology

A ring is the supported best-practice topology today
PSCs know each other by IP address or domain name (ensure DNS is correct, including PTR records). Using an IP address is discouraged because it can never be changed; using an FQDN allows for IP mobility.
PSCs are authentication sources, so NTP is critical, and using the same NTP source across all PSCs is critical. (If you join one PSC to AD, all of them need to be joined to the same AD. It is best not to mix appliance and Windows PSCs.)
The only reason to have external PSCs is to use enhanced linked mode. If you don’t need ELM, use an embedded PSC with vCenter and back it up together with vCenter; see http://vmware.com/go/psctree
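
The DNS point above (forward and PTR records must agree) is easy to check before deploying. Below is a rough Python sketch using only the standard library `socket` module. The function name and the example FQDN are hypothetical, and the resolver functions are passed in as parameters so the check can be exercised without a live DNS server.

```python
import socket

def forward_reverse_match(fqdn, resolve=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Return True when fqdn resolves to an IP whose PTR record names fqdn."""
    try:
        ip = resolve(fqdn)                      # forward (A record) lookup
        ptr_name, aliases, _ = reverse(ip)      # reverse (PTR record) lookup
    except OSError:                             # covers socket.gaierror/herror
        return False
    names = {ptr_name.lower().rstrip(".")}
    names |= {alias.lower().rstrip(".") for alias in aliases}
    return fqdn.lower().rstrip(".") in names

# Stand-in resolvers so the example runs anywhere (hypothetical data).
fake_forward = lambda name: "10.0.0.5"
good_ptr = lambda ip: ("psc-lax-1.example.local", [], [ip])
bad_ptr = lambda ip: ("some-other-host.example.local", [], [ip])

print(forward_reverse_match("psc-lax-1.example.local", fake_forward, good_ptr))
print(forward_reverse_match("psc-lax-1.example.local", fake_forward, bad_ptr))
```

Called with just an FQDN, the function falls back to the system resolver, so it can be run against each planned PSC name from the management network.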

Scalability

Current limits are 8 PSCs in a domain in 6.0 and 10 in a domain in 6.5

With all of these items in hand, here are some design tips:

Always have n+1 PSCs; in other words, never have a single PSC in a domain when using ELM
Have a solid method for restoring your PSCs: image-level backup or the 6.5 restore feature

So what is the correct topology for PSCs?

This is a challenging question. Let’s identify some design elements to consider:

Failure of a single component should not create replication partitions
Complexity of setup should be minimized
Number of replication agreements should be minimized for performance reasons
Scaling out additional PSCs should be as simple as possible
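
The first design element can be tested on paper before anything is deployed. The sketch below is a simplified model under stated assumptions: a "component" is a whole PSC (link failures are not modelled), the PSC names are hypothetical, and the helper functions are illustrative rather than part of any VMware tool.

```python
def reachable(agreements, start, failed):
    """Return the surviving PSCs that can still exchange changes with `start`."""
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for a, b in agreements:
            if failed in (a, b):
                continue  # agreements with the failed PSC carry no changes
            for here, there in ((a, b), (b, a)):  # replication is bidirectional
                if here == current and there not in seen:
                    seen.add(there)
                    stack.append(there)
    return seen

def survives_any_single_failure(pscs, agreements):
    """True if losing any one PSC leaves the survivors in one replication group."""
    for failed in pscs:
        survivors = [p for p in pscs if p != failed]
        if survivors and reachable(agreements, survivors[0], failed) != set(survivors):
            return False
    return True

pscs = ["LAX-1", "LAX-2", "SFO-1", "SFO-2"]
chain = [("LAX-1", "LAX-2"), ("LAX-2", "SFO-1"), ("SFO-1", "SFO-2")]
ring = chain + [("SFO-2", "LAX-1")]  # one extra agreement closes the loop

print(survives_any_single_failure(pscs, chain))  # False
print(survives_any_single_failure(pscs, ring))   # True
```

The chain fails the test because losing a middle PSC splits the survivors into two groups, while the ring leaves the remaining PSCs connected after any single loss.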

Ring

I spent some time in the ISP world and learned to love rings. They create two paths to every destination and are easy to set up and maintain. They do have issues when two points fail at the same time, which can partition routing until one of the two is restored. VMware recommends a ring topology for PSCs at the time of this article, as shown below:

Let’s review this topology against the design elements:

Failure of a single component should not create replication partitions
- True: because of the ring, there are two paths for every change to replicate
Complexity of setup should be minimized
- The setup ensures redundancy without lots of manually created, performance-impacting replication agreements (only one manual agreement is needed)
Number of replication agreements should be minimized for performance reasons
- True
Scaling out additional PSCs should be as simple as possible
- Adding a new PSC means the following:
  - Add new PSC joined to LAX-2
  - Add new agreement between new PSC and SFO-1
  - Remove agreement between LAX-2 and SFO-1
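
The three steps above can be sketched as a small set operation. This is an illustration only: the four-node starting ring, the new PSC name LAX-3, and both helper functions are assumptions made for the example, since the original diagram is not reproduced here. No real PSC commands are invoked.

```python
def add_psc_to_ring(agreements, new_psc, join_to, close_with):
    """Return a new agreement set with `new_psc` spliced between two neighbors.

    Agreements are frozensets because replication is bidirectional.
    """
    old_link = frozenset((join_to, close_with))
    if old_link not in agreements:
        raise ValueError(f"{join_to} and {close_with} are not replication partners")
    updated = set(agreements)
    updated.add(frozenset((new_psc, join_to)))     # step 1: join new PSC to LAX-2
    updated.add(frozenset((new_psc, close_with)))  # step 2: new PSC <-> SFO-1
    updated.remove(old_link)                       # step 3: drop LAX-2 <-> SFO-1
    return updated

def partners(agreements, psc):
    """List the replication partners of one PSC, sorted for stable output."""
    return sorted(other for pair in agreements if psc in pair
                  for other in pair - {psc})

# Hypothetical starting ring; LAX-2 <-> SFO-1 is the agreement being replaced.
ring = {frozenset(p) for p in [("SFO-1", "SFO-2"), ("SFO-2", "LAX-1"),
                               ("LAX-1", "LAX-2"), ("LAX-2", "SFO-1")]}
bigger = add_psc_to_ring(ring, "LAX-3", join_to="LAX-2", close_with="SFO-1")

print(len(bigger))                # 5
print(partners(bigger, "LAX-3"))  # ['LAX-2', 'SFO-1']
```

The old agreement is removed last, mirroring the order of the steps, so the ring is never open while the new PSC is being added.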

This looks mostly simple, but you do need to track which agreement closes the loop of your ring, and that is a manual documentation process today.

Ring with additional redundancy

The VMware Validated Design (VVD) states that for a two-site enhanced linked mode topology you should build the following:

A few items to illustrate (in case you have not read the VVD):

Four vCenters
Four PSCs (in blue)
Each PSC replicates with its same-site peer and one remote-site peer, so its changes are stored in two copies at two sites, which are then replicated locally and remotely (all four get them)

Let’s evaluate against the design elements:

Failure of a single component should not create replication partitions
- True: because of the ring, there are four ways for everything to replicate
Complexity of setup should be minimized
- The setup requires forethought and at least one manual replication agreement
Number of replication agreements should be minimized for performance reasons
- It has more replication agreements
Scaling out additional PSCs should be as simple as possible
- Adding a new PSC means potentially more replication agreements or more design

Update: The VVD team reached out and wanted to be clear that adding additional sites is pretty easy. I believe the challenge comes when you try to identify disaster zones. Because PSCs replicate all changes everywhere, it does not matter if all replication agreements fail; you can still regenerate a site.

Which option should I use?

That is really up to you. I personally love the simplicity of a ring. Neither of these options increases the availability of the PSC layer; they are about data consistency and integrity. Use a load balancer if your management plane SLA does not allow for downtime.