Platform Services Controller Failures

Recently I was asked to review the vCenter 6 architecture in preparation for a move from 5.5 to 6.0. Part of this process required a recommendation on best practices for the 6.0 vCenter architecture. I have always found that the best way to understand a product is to induce failures. It helps you understand the gaps so you can architect the best solution.

 

The Tale of Roles

vCenter has had a mixed-up past. It has gone from being physical to virtual on a single server, to breaking out the database and vCenter roles, to breaking out every role. People have literally made a career out of dividing and smashing together vCenter to meet customer needs. In vCenter 6 VMware provides just two roles, which can be deployed together or separately:

  • Platform Services Controller (PSC)
  • vCenter Server (VC)

 

The roles are divided as shown:

The Platform Services Controller (PSC) provides the following functions:

  • License service
  • Component Manager
  • Identity Management service
  • Service Control agent
  • Security Token service
  • Common logging service
  • Syslog health service
  • Authentication Framework
  • Certificate service
  • Directory service

 

The vCenter Server (VC) provides the following functions:

  • Inventory service
  • vCenter service
  • Auto deploy services
  • VMware update manager
  • Web client
  • Third party plugins

 

In addition, vCenter requires a database from the supported list (SQL Server or Oracle). If you upgrade from a large-scale deployment with each of the above services on its own server, they will all be pulled into these two roles. You cannot deploy an Inventory-only VC server. You are stuck with a one-machine or two-machine deployment model. VMware has improved the watchdog process in Java, making sure that the inventory and vpxd processes cannot kill each other (a problem in 5.5). In 6 a Java master process manages all the Java processes to control resource usage, instead of multiple independent Java processes. Communication between these roles is done via API access, so you can mix and match Linux and Windows machines. At this time it is not supported to mix Linux and Windows PSCs behind the same load balancer. I personally would stick with one format, either Linux or Windows, for support purposes.

 

PSC Options

PSCs can be embedded (installed on the same machine as VC) or stand-alone (on their own machine). PSCs can also sit behind a load balancer. PSC has two critical concepts:

  • vSphere domain: an authentication domain, not tied to Active Directory. Each PSC is joined to a single domain, normally called vsphere.local, which identifies the local authentication source of the PSC.
  • PSC site: each vSphere domain can have multiple sites, which are user-defined strings. Each PSC joined to a site understands its location and the other PSCs at the same site. Replication between sites is done via a master at each site, which cuts down on network traffic.

Domain is a carry-over from SSO and is known in most implementations as vsphere.local (it’s the local domain for local SSO authentication). Domain becomes a critical concept with Enhanced Linked Mode, below. Site is a designation given to PSCs so they understand who is close and who is far away for replication and potential failover options. It’s not heavily used right now, but keep your eyes on this option in the future.

 

Enhanced Linked Mode (ELM)

This is one of my favorite new features: it allows vCenters in the same PSC domain to be connected. You can log in to a single vCenter web client and access all other vCenters in the same domain. It has a current limit of 10 vCenters but provides a single point of administration for multiple vCenters, allowing for a better scale-out approach. It does require that all PSCs servicing the vCenters be in the same domain and be stand-alone PSCs (embedded PSCs are not supported when joined to a domain). It also offers the following features across all linked machines:

  • Single pane of glass for all vCenters via the web client (using enhanced linked mode)
  • Common central location for tags and categories
  • Permissions applied in a single location
  • Central authentication for all VMware services (future release)
  • Storing and generation of SSL certificates in a single location
  • Replication between sites for Authentication

Sizing for PSC

Currently the PSC has the following limits, as per the configuration maximums document:

Item                                                      | Maximum
Max PSCs per vSphere domain                               | 8
Max PSCs behind a load balancer                           | 4
Max objects within a vSphere domain (users and groups)    | 1,000,000
Max number of vCenters to a single PSC                    | 4
Max number of vCenters in a vSphere domain                | 10
Max number of Web Client sessions                         | 180

 

This sizing does put some limits on your architecture. The most critical is the number of web client sessions. If you have four vCenters in your ELM and log in to one, you have just created one web client session for yourself on each vCenter. Even if you never use the other vCenters, that is a session on all of them until you log out. So if you are going to have 180 users at the same time you might have a problem.
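To make the arithmetic concrete, here is a minimal Python sketch that checks a planned design against the limits quoted above. The `DomainDesign` class and `check_design` function are made-up names for illustration, not anything VMware ships, and the sketch assumes (as described above) that every web client login consumes one session on each linked vCenter.

```python
from dataclasses import dataclass

# Limits as quoted in the table above (from the configuration maximums document).
MAX_PSC_PER_DOMAIN = 8
MAX_VCENTERS_PER_PSC = 4
MAX_VCENTERS_PER_DOMAIN = 10
MAX_WEB_CLIENT_SESSIONS = 180


@dataclass
class DomainDesign:
    """Hypothetical description of one vSphere domain; not a VMware API."""
    vcenters_per_psc: dict[str, int]  # PSC name -> number of vCenters pointed at it
    concurrent_web_users: int         # people logged in to the web client at once


def check_design(design: DomainDesign) -> list[str]:
    """Return human-readable warnings; an empty list means no limit is hit."""
    problems = []
    if len(design.vcenters_per_psc) > MAX_PSC_PER_DOMAIN:
        problems.append("too many PSCs in the domain")
    for psc, count in design.vcenters_per_psc.items():
        if count > MAX_VCENTERS_PER_PSC:
            problems.append(f"{psc} serves too many vCenters")
    if sum(design.vcenters_per_psc.values()) > MAX_VCENTERS_PER_DOMAIN:
        problems.append("too many vCenters in the domain")
    # Assumption from the text: with ELM, one login opens a session on every
    # linked vCenter, so each vCenter carries one session per logged-in user.
    if design.concurrent_web_users >= MAX_WEB_CLIENT_SESSIONS:
        problems.append("web client sessions at or above the limit on every vCenter")
    return problems
```

For example, `check_design(DomainDesign({"site1psc": 1, "site2psc": 1, "site3psc": 1}, 50))` returns an empty list, while 180 concurrent users triggers the session warning no matter how the vCenters are spread out.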

API Sessions

API and solution connections are not counted as web client logins, and in 6 they have a limit of 500 connections at a time. So the web client limit applies specifically to human users.

 

PSC Failures (the whole point of this article)

So I wanted to test what happens when PSCs or VCs fail on 6. Here is my test setup:

[Diagram: test setup]

We have three sites, each with its own VC and PSC. Each vCenter has a one-to-one relationship with the PSC above it. All are in the same PSC domain, allowing ELM. I also wanted to test mixing and matching Linux and Windows, so I used the following:

Item            | Description
Site 1 PSC      | Linux based appliance
Site 1 vCenter  | Linux based appliance
Site 2 PSC      | Windows based PSC
Site 2 vCenter  | Linux based appliance
Site 3 PSC      | Linux based appliance
Site 3 vCenter  | Windows based vCenter

 

Test Case 1:

Failure of a PSC while user is logged in to vCenter and attempting to manage each vCenter:

Scenario                                | Site1vc    | Site2vc    | Site3vc    | Logged into
Failure of site2psc logged in to site3  | Manageable | Manageable | Manageable | Site3
Failure of site3psc logged in to site3  | Manageable | Manageable | Manageable | Site3
Failure of site1psc logged in to site3  | Manageable | Manageable | Manageable | Site3
Failure of site2psc logged in to site1  | Manageable | Manageable | Manageable | Site1
Failure of site3psc logged in to site1  | Manageable | Manageable | Manageable | Site1
Failure of site1psc logged in to site1  | Manageable | Manageable | Manageable | Site1
Failure of site2psc logged in to site2  | Manageable | Manageable | Manageable | Site2
Failure of site3psc logged in to site2  | Manageable | Manageable | Manageable | Site2
Failure of site1psc logged in to site2  | Manageable | Manageable | Manageable | Site2

 

The short version of this test: if you are logged in to any of the vCenters when a PSC becomes unavailable, all primary vCenter functions remain available (like access to VMs). PSC functions for the connected vCenter are not available (for example, you cannot add global tags, although tags already applied are still present and working).

Test Case 2:

Failure of a PSC, then a user attempts to log in:

Scenario                                      | Login works | Logging into
Failure of site2psc trying to log into site1  | Yes         | Site1
Failure of site2psc trying to log into site2  | No          | Site2
Failure of site2psc trying to log into site3  | Yes         | Site3
Failure of site1psc trying to log into site1  | No          | Site1
Failure of site1psc trying to log into site2  | Yes         | Site2
Failure of site1psc trying to log into site3  | Yes         | Site3
Failure of site3psc trying to log into site1  | Yes         | Site1
Failure of site3psc trying to log into site2  | Yes         | Site2
Failure of site3psc trying to log into site3  | No          | Site3

As you can see, a user is unable to log in to a vCenter if its PSC is unavailable. Simply put, a vCenter only allows its connected PSC to authorize authenticated users. If the failed PSC becomes available again after you have logged in, the web client will eventually let you see and manage its vCenter. All other vCenters are available.
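The login rule observed in this test is simple enough to write down as a small Python model. This is only a sketch of the behavior seen in my lab, with the site names used as plain labels; it does not call any VMware API.

```python
# One-to-one vCenter-to-PSC mapping from the test setup (labels only).
PSC_FOR_VCENTER = {
    "site1vc": "site1psc",
    "site2vc": "site2psc",
    "site3vc": "site3psc",
}


def can_log_in(vcenter: str, failed_pscs: set[str]) -> bool:
    """A login succeeds only if the PSC connected to that vCenter is up."""
    return PSC_FOR_VCENTER[vcenter] not in failed_pscs


def login_matrix() -> list[tuple[str, str, bool]]:
    """Rebuild the nine scenarios as (failed PSC, target vCenter, login works)."""
    return [
        (failed, vcenter, can_log_in(vcenter, {failed}))
        for failed in ("site2psc", "site1psc", "site3psc")
        for vcenter in ("site1vc", "site2vc", "site3vc")
    ]
```

Each failed PSC blocks exactly one login target (its own vCenter), which matches the three "No" rows in the results.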

Test Case 3:

Failure of vCenter while logged in to ELM (this is the placebo test)

 

Scenario                                  | Site1vc    | Site2vc    | Site3vc    | Logged in to
Failure of site2vc when logged into site1 | Manageable | No         | Manageable | site1
Failure of site2vc when logged into site2 | No         | No         | No         | site2
Failure of site2vc when logged into site3 | Manageable | No         | Manageable | site3
Failure of site1vc when logged into site1 | No         | No         | No         | site1
Failure of site1vc when logged into site2 | No         | Manageable | Manageable | site2
Failure of site1vc when logged into site3 | No         | Manageable | Manageable | site3
Failure of site3vc when logged into site1 | Manageable | Manageable | No         | site1
Failure of site3vc when logged into site2 | Manageable | Manageable | No         | site2
Failure of site3vc when logged into site3 | No         | No         | No         | site3

If a vCenter goes away you cannot manage it. It takes about 120 seconds for the failed vCenter to be removed from the web client as a manageable object. During that time some cached displays will still work, although they seem slow. If the vCenter returns, it requires a web client refresh to display as a manageable object.
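The results above follow two rules, which can be sketched in Python as a tiny model. Again this is just a restatement of what I observed, with hypothetical function names, and it ignores the roughly 120 second window where cached displays still respond.

```python
VCENTERS = ("site1vc", "site2vc", "site3vc")


def manageable(target: str, failed_vc: str, logged_into: str) -> bool:
    """Can `target` be managed when `failed_vc` is down?"""
    if logged_into == failed_vc:
        # The web client you were using is gone, so nothing is manageable.
        return False
    # Otherwise every vCenter except the failed one stays manageable.
    return target != failed_vc


def scenario_row(failed_vc: str, logged_into: str) -> list[bool]:
    """One row of results: manageability of site1vc, site2vc, site3vc."""
    return [manageable(vc, failed_vc, logged_into) for vc in VCENTERS]
```

For instance, `scenario_row("site2vc", "site1vc")` gives `[True, False, True]`, and any row where the failed vCenter is the one you logged in to is all `False`.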

Overall Failures

No real surprises on the PSC architecture failures. I am happy that ELM continues to work even if a PSC has failed, for as long as your session lasts. (New sessions will not work.)

 

Possible Enterprise architecture

One possible solution is to use a load balancer to create a high-availability pair of PSCs, then tie more than one vCenter into this HA pair (or into a group of up to 4 PSCs):

[Diagram: possible enterprise architecture]

This solution will control the number of PSCs in your environment. Remote branch offices or other sites can be tied into the same ELM as shown on the right. At this time VMware does not support mixing Linux and Windows behind a load balancer (make it one or the other). In addition, the PSCs behind the load balancer should all be assigned to the same PSC site. I have seen other architectures, including a great discussion with @arielsanchezmor about a possible highly available PSC across sites. I look forward to hearing if that architecture is supported by VMware.
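The constraints on a load-balanced PSC group mentioned in this article (at most 4 PSCs, one operating system, one PSC site) can be captured in a short Python check. The `Psc` class and `check_lb_group` function are hypothetical names for a design review, not part of any VMware tooling.

```python
from dataclasses import dataclass

MAX_PSC_BEHIND_LOAD_BALANCER = 4  # from the sizing table in this article


@dataclass(frozen=True)
class Psc:
    """Hypothetical record describing one PSC in a design document."""
    name: str
    os: str    # "linux" or "windows"
    site: str  # user-defined PSC site string


def check_lb_group(pscs: list[Psc]) -> list[str]:
    """Return the rules a load-balanced PSC group breaks (empty if none)."""
    problems = []
    if len(pscs) > MAX_PSC_BEHIND_LOAD_BALANCER:
        problems.append("more than 4 PSCs behind one load balancer")
    if len({p.os for p in pscs}) > 1:
        problems.append("Linux and Windows PSCs mixed behind one load balancer")
    if len({p.site for p in pscs}) > 1:
        problems.append("PSCs behind the load balancer are in different PSC sites")
    return problems
```

A pair such as `Psc("psc1", "linux", "hq")` and `Psc("psc2", "linux", "hq")` passes cleanly, while swapping one of them to Windows flags the mixed-OS rule.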

 

Upgrade paths:

The simple answer: whatever you want the architecture to be in 6, build it first in 5.5. Don’t upgrade to 6 and then make the PSC stand-alone; it does not work well. VMware support has a lot of great articles on breaking out SSO onto its own machine and very little on PSC right now. Each week the PSC documentation gets better, but take my advice and break your deployment into your designated architecture on 5.5 before you go to 6.

 

vBrownBag presentation:

I did a 15-minute vBrownBag presentation on this topic at VMworld, which you can watch here:

 
