Get a VCDX certification without a VCAP certification for the next three weeks

This is huge for those willing to make the investment. It applies to VCP-Cloud and VCP-DT only. You can submit a VCDX design in those tracks and potentially earn the VCDX without passing the VCAP or the new VCIX exams. As a current VCP-Cloud and VCDX-DCV, I am super tempted to make the next three weeks really bad for the chance at a second VCDX. If you hold one of these VCPs, you should really go out, grab a VCDX mentor and make it happen. It’s roughly $300 to submit a design, worth every penny. Take it from me, the VCDX is huge. It’s my understanding this only applies to the April 1st submission deadline. Good luck.

 

Official announcement:

https://communities.vmware.com/thread/504038#.VPiBatfUNX8.twitter#sf36749421

VCIX-NV Study Guide Objective 1.2

To see other posts in this series go here.

This section deals with upgrading from older versions of vShield to NSX. The simple answer is that there is a specific order that must be followed. Upgrades from vShield require version 5.5. Most of the work is done in the GUI via vCenter, except for vShield Manager, which is replaced by NSX Manager. Most of these steps roughly follow the process in this document.

Product name translation:

Roughly, here are the old names and the new names (or the new service providing the function):

vShield Manager -> NSX Manager

Virtual Wires -> NSX Logical Switch

vShield App -> NSX Firewall

vShield Edge -> NSX Edge

vShield Endpoint -> vShield Endpoint

Data Security -> NSX Data Security

 

Practicing this process:

Unless you want to spend a few hours configuring all the vShield products, this is hard to practice. You can, however, do the upgrade from vShield Manager to NSX Manager really quickly. Just download vShield Manager and set it up as follows:

  • Deploy OVF
  • Power on
  • Console login as admin with password of default
  • type enable with password of default
  • type setup
  • Setup your IP settings
  • Wait 5 minutes
  • Login via IP with web browser and do upgrade

The rest of the upgrade requires that you understand the vShield products, which is not required for NV, so I vote you skip it and just be familiar with the process, order and requirements.

 

Objective  1.2 Denotes the following items:

Upgrade vShield Manager 5.5 to NSX Manager 6.x.

Upgrading vShield Manager to NSX Manager can only be done from version 5.5 of vShield.  It also requires the following things:

  • vCenter 5.5
  • vShield Data Security uninstalled
  • vShield Edges be upgraded to 5.5

 

Process:

  1. Download the NSX upgrade bundle called vCNS to NSX for vSphere xxx Upgrade Bundle
  2. Login to vShield Manager and click Settings & Reports
  3. Click Updates tab and click upload upgrade bundle
  4. Click Choose file Browse to the vCNS to NSX for vSphere xxx Upgrade Bundle  and click open
  5. Click Upload file – this process will take some time
  6. Click Install to begin the upgrade process
  7. Click confirm install – this will reboot the vShield manager – none of the other components are rebooted
  8. After upgrade visit the ip address of your vShield manager again via https
  9. Login and look at summary page to confirm you are running NSX Manager
  10. Log off all windows and close your browser to clear cache
  11. Login to vSphere Web client
  12. Shutdown your NSX Manager vm and increase memory to 12GB and vCPU to 4

Upgrade NSX Manager 6.0 to NSX Manager 6.0.x

Upgrade Virtual Wires to Logical Switches

Virtual wires must be upgraded to NSX logical switches to use NSX features.   The process is required even if you don’t use virtual wires.   In order for this to work you need to upgrade your vShield manager to NSX manager and make sure it’s connected to vSphere.

Process

  • Login to Web client
  • Click the Networking & Security tab, then click Installation
  • Click Host Preparation
  • Virtual wires will show as Legacy
  • Click update on each wire
  • Wait for them to show green and no longer legacy

Upgrade vShield App to NSX Firewall

You can only upgrade vShield App 5.5 to NSX.  It requires that vShield manager be upgraded to NSX manager and virtual wires upgraded to NSX logical switches.

  • A pop up window should ask if you want to upgrade
  • Click update and wait
  • Done

Upgrade vShield Edge 5.5 to NSX Edge 6.x

This upgrade requires the following:

  • vShield Edge 5.5
  • NSX Manager
  • Virtual wires upgraded to NSX logical switches

Processes:

  • Login to web client
  • Networking & Security tab
  • NSX Edges button
  • Select upgrade version from actions menu on each edge
  • After completion, check the version number tab

Upgrade vShield Endpoint 5.x to vShield Endpoint 6.x

This upgrade requires the following:

  • vShield Manager upgraded to NSX Manager
  • Virtual wires upgraded to NSX Logical switches

Process:

  • Login to web client
  • Networking & Security tab
  • Click Installation
  • Click Service deployments tab
  • Click on upgrade available
  • Select datastore (must be shared) and network and ok

Upgrade to NSX Data Security

There is no clean upgrade path: you have to uninstall vShield Data Security before upgrading to NSX Manager, then re-register the solution with NSX if it is available.
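The ordering and prerequisites in this objective can be written down in a few lines of Python as a study aid. This is only a sketch: the step names and the `PREREQS` table are my own labels built from the requirements listed in this post, not anything exposed by vShield or NSX.

```python
# Study aid only: step names are labels taken from this post, not product APIs.
MANAGER = "vShield Manager 5.5 -> NSX Manager"
WIRES = "Virtual Wires -> Logical Switches"
APP = "vShield App 5.5 -> NSX Firewall"
EDGE = "vShield Edge 5.5 -> NSX Edge"
ENDPOINT = "vShield Endpoint 5.x -> vShield Endpoint 6.x"

# Each step maps to the steps that must already be complete.
PREREQS = {
    MANAGER: [],
    WIRES: [MANAGER],
    APP: [MANAGER, WIRES],
    EDGE: [MANAGER, WIRES],
    ENDPOINT: [MANAGER, WIRES],
}


def can_upgrade(step, done):
    """Return True when every prerequisite of `step` is in `done`."""
    return all(req in done for req in PREREQS[step])


def next_steps(done):
    """List the steps that are not done yet but are ready to run."""
    return [s for s in PREREQS if s not in done and can_upgrade(s, done)]
```

Calling `next_steps(set())` returns only the manager upgrade, which is the point to remember for the exam: nothing else can start until vShield Manager has become NSX Manager.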

How does VM and Application Monitoring work?

Almost everyone uses VMware HA to protect against host failures.  At the same time almost no one uses VM and Application monitoring.  I believe the lack of usage is due to two factors:

  • Lack of understanding of how it works leading to fears about it causing outages
  • It is disabled by default in vSphere

 

VM Monitoring

How does it work?

In its most basic form it works just like host HA: when a virtual machine fails, it restarts the operating system. Sounds simple and perfect, right? Here is where the fear begins. No one likes the idea of possible false reboots, so it’s important to understand how it detects a failure.

  1. VMware Tools heartbeats – This is configurable, built-in monitoring between VMware Tools and the host running the virtual machine.  (How do they communicate?  Via a built-in virtual chip in the virtual hardware)
  2. Monitoring the I/O on the virtual machine –  The VM monitoring checks for any I/O in the last 120 seconds – both network and storage

If both of these checks fail, your virtual machine is restarted (HA resets the VM). If you have an operating system issue, it is possible that your virtual machine will end up in a constant reboot loop. To limit this, by default a virtual machine will only be restarted three times in an hour.

VM monitoring is implemented by the HA agent (FDM) on each host. Since it is tied to FDM, it operates without vCenter. FDM and VM monitoring do report to vCenter, but they work without vCenter being present. Each failure detection mechanism is configurable to meet your availability needs. The heartbeats have the following possible settings:

  • Failure interval – How long the heartbeat has to fail to restart a VM
  • Minimum uptime – VM monitoring will wait this long before starting to monitor the VM tool heartbeats
  • Maximum per-VM resets – Max number of restarts within the maximum resets time window
  • Maximum resets time window – The window used to evaluate the maximum per-VM resets
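To make the interaction between these settings concrete, here is a minimal Python sketch of the decision as I understand it from the description above. It is not the FDM implementation; the class and function names are hypothetical, and the only defaults taken from this post are the 120 second I/O window and three resets per hour.

```python
from dataclasses import dataclass


@dataclass
class MonitoringSettings:
    failure_interval: float           # seconds of missed heartbeats before acting
    minimum_uptime: float             # seconds to wait after power-on before monitoring
    max_resets: int = 3               # default from this post: three resets...
    reset_window: float = 3600.0      # ...within one hour
    io_stats_interval: float = 120.0  # default I/O look-back window in seconds


def should_reset(settings, uptime, since_heartbeat, since_io, reset_ages):
    """Decide whether a VM would be reset under the rules described above.

    reset_ages holds, for each earlier reset, how many seconds ago it happened.
    """
    if uptime < settings.minimum_uptime:
        return False  # still inside the grace period after power-on
    heartbeat_failed = since_heartbeat >= settings.failure_interval
    io_idle = since_io >= settings.io_stats_interval
    if not (heartbeat_failed and io_idle):
        return False  # both checks must fail before anything happens
    recent = sum(1 for age in reset_ages if age <= settings.reset_window)
    return recent < settings.max_resets
```

For example, with a hypothetical `MonitoringSettings(failure_interval=30, minimum_uptime=120)`, a VM that has missed heartbeats for 45 seconds but wrote to disk 10 seconds ago is left alone, which is exactly the protection against false reboots.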

 

You can adjust the I/O monitoring window with the advanced setting das.iostatsInterval; by default it is 120 seconds. hostd implements the I/O monitoring and may extend failure detection by an additional 30 seconds. (This is because we need the stats from the last xx seconds and hostd rolls up stats every 30 seconds.) So if you expect your machine to restart after 120 seconds of failure, understand that it may take 150 seconds depending on hostd polling.
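The arithmetic is simple enough to put in a helper. This is just a sketch of the numbers quoted above (a 120 second window and a 30 second hostd roll-up); the function name is my own.

```python
def detection_window(io_stats_interval=120, hostd_rollup=30):
    """Return (best_case, worst_case) seconds before the I/O check can fail."""
    best_case = io_stats_interval
    worst_case = io_stats_interval + hostd_rollup
    return best_case, worst_case
```

If you raise the I/O window to 300 seconds, plan for up to 330 seconds before a hung machine is restarted.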

Many people are uncomfortable with VM restarts because they may never gather the source of the failure.   VM monitoring implements a screenshot feature that will store a screen shot of the failure in the virtual machine directory each time it is restarted.

 

Application Monitoring what is that?

Application monitoring takes advantage of third-party APIs via VMware Tools to monitor the application. These tools can be used to reboot the machine or restart the application. VMware exposes these APIs to vendors and leaves the monitoring and control in the vendor’s hands.

 

Design Choices

From my perspective I cannot see a reason not to implement VM monitoring. It can be disabled on specific virtual machines if they have known issues, but it will greatly increase your uptime.

 

Let me know your experiences with these features.

Design Scenario: Gigabit networking with 10GB for storage SMB setup

Yesterday I got a comment on an older blog article asking for some help.

Caution

While it would be a bad idea personally and professionally for me to give specific advice without a design engagement, I thought I might provide some thoughts about the scenario here. This will allow me to justify some design choices I might make in the situation. In no way should this be taken as law. In reality everyone’s situation is different and small requirements can really change the design. Please do not blindly create this infrastructure; these are only guidelines. It does not take into account specific vendor best practices (because I am too lazy to look them up).

 

Information provided:

We are a SMB that’s starting to cross over to the world of virtualization. I could really use your help on our network design. This is the current equipment we have:

 

2 (ESXi Hosts) Dell R630 with 512GB Ram, 2×4 1GB port NICS each (8 Total each host) and 2 x dual port 10GB NIC(4 Total) on each host

 

Equal Logic PS6210XS SAN with Dual 10GB Controllers

 

2 Dell N4032F 10GbE switch

 

We are planning to use the 10GbE for the SAN(isolated) and use the remaining 8 x 1GB port for Management/vMotion and our Server Network.

 

How would you go about designing the network for our environment?

 

Requirements

  • Must use the current hardware

 

Constraints

  • The 10GB network adapters are for isolated SAN only

 

Assumptions

  • Since this customer is an SMB I doubt they will buy Enterprise Plus licenses, so we will design around standard switches
  • The virtual machine / management network ports are distributed across two different upstream switches
  • Your storage solution supports some type of multipathing with two switches

 

The question was related to networking so here we go:

Virtual machine and vSphere networking

It’s hard to make a determination here without understanding the number of virtual machines and their network bandwidth needs. It is really tempting to use two of the four 10Gb NICs for the vSphere and virtual machine networking. Because of the constraint, we will avoid that temptation.

Management Network

Management is easy.  vCenter and console access I assume.  If this is true I would assign two network adapters to Management.  One active the other standby.   You really want two in order to assure it’s up and for host isolation.

vMotion network

Our hosts are large (512GB of RAM), which leads me to believe we are going to have a lot of virtual machines on each host. With only two hosts I am very concerned about taking down one host to patch and how long it will take to move virtual machines between hosts with a single 1GB network adapter. You might want to consider multi-NIC vMotion, which introduces complexity in the vSphere design and manageability. You should weigh how often you are going to schedule downtime on a host against that complexity. My guess is that you will not patch all that often in an SMB, so I would assign two network adapters to vMotion. One should be active and the other standby. You can use the same network adapters as management, just in the opposite order.  (Nic1 active for management, nic2 standby for management; nic1 standby for vMotion, nic2 active for vMotion)
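To put a rough number on the patching concern, here is a back-of-the-envelope Python estimate. It is a sketch under generous assumptions: the host memory is fully in use, the link runs at line rate, and there is no protocol overhead or re-copying of changed memory pages, so real evacuations will be slower. The function name and the `efficiency` knob are my own.

```python
def evacuation_minutes(memory_gb, link_gbps, nics=1, efficiency=1.0):
    """Ideal time in minutes to move memory_gb of VM memory over vMotion.

    memory_gb  -- gigabytes of active VM memory on the host
    link_gbps  -- speed of one vMotion adapter in gigabits per second
    nics       -- adapters carrying vMotion traffic (multi-NIC vMotion)
    efficiency -- fraction of line rate actually achieved (assumed, 0-1)
    """
    if link_gbps <= 0 or nics < 1 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps and nics must be positive, efficiency in (0, 1]")
    gigabits = memory_gb * 8  # bytes to bits
    seconds = gigabits / (link_gbps * nics * efficiency)
    return seconds / 60
```

`evacuation_minutes(512, 1)` comes out at roughly 68 minutes for a completely full host, and two adapters halve that, which is the trade-off to weigh against the added complexity.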

Virtual machine networks

At this point we have 6 adapters left for virtual machines.  Assign them all to virtual machines.   What really matters is the load balancing we use for these adapters.  Let’s be clear you cannot provide more than 1GB of total bandwidth to an individual virtual machine with this configuration without using port channel or LACP configurations.   I assume you don’t want to mess with port channel or virtual port channel across two switches.  So we need to look at the remaining options for balancing and using these nics:

Options (taken from here.) with IP hash removed due to lack of port channel, Route based on physical nic load removed due to lack of enterprise plus

  • Route based on the originating port ID: Choose an uplink based on the virtual port where the traffic entered the virtual switch.
  • Route based on a source MAC hash: Choose an uplink based on a hash of the source Ethernet.
  • Use explicit failover order: Always use the highest order uplink from the list of Active adapters which passes failover detection criteria.

There is a holy war between factions of VMware on which one to choose.  None will balance traffic perfectly.  Personally I would go with the default load balancing method of Route based on originating port ID.
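Here is a small Python illustration of why port ID based balancing never gives a single virtual machine more than one adapter of bandwidth. The modulo mapping is my simplifying assumption for illustration only; the real virtual switch decides placement internally. What matters is that each virtual port is pinned to exactly one uplink.

```python
def pick_uplink(virtual_port_id, active_uplinks):
    """Pin a virtual port to one uplink (illustrative modulo mapping)."""
    return active_uplinks[virtual_port_id % len(active_uplinks)]


def spread(port_ids, active_uplinks):
    """Count how many virtual ports land on each uplink."""
    counts = {uplink: 0 for uplink in active_uplinks}
    for port_id in port_ids:
        counts[pick_uplink(port_id, active_uplinks)] += 1
    return counts
```

A virtual machine with one virtual NIC has one port, so all of its traffic leaves through one physical adapter. The balancing only shows up across many virtual machines, and it balances port counts, not actual traffic.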

How many VLANS

If possible please use a different VLAN for at least the following: Management, vMotion and virtual machines.  Multiple virtual machine vlans are wonderful.   It is critical from a security perspective that vMotion not be shared.

How many virtual switches

Now to the question of virtual switches. Remember, no Enterprise Plus, so we are using standard switches. These have to have the same configuration, including case sensitivity, on each host (good thing we only have 2 hosts). You might want to consider configuring them via a script (I have an older blog post on that somewhere). You have two sets of network adapters: vMotion/Management and virtual machine. I would connect them all to the same virtual switch just for ease of management. So your setup would look like this, assuming your 1GB NICs come into ESXi as nic0 – nic7:

vSwitch0 (PG = port group)

  • PG-vMotion – Active nic1, Standby nic0
  • PG-Management – Active nic0, Standby nic1
  • Port groups for virtual machines (one port group per VLAN) – Active nic2-nic7
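Because standard switches must match on every host, it can help to keep the intended layout as data and check it before touching a host. The Python sketch below records the vSwitch0 layout above; the dictionary shape, the `PG-VM-example` name and the helper functions are hypothetical, and nothing here talks to ESXi.

```python
# Intended vSwitch0 layout from this post; "PG-VM-example" stands in for
# the per-VLAN virtual machine port groups.
VSWITCH0 = {
    "PG-Management": {"active": ["nic0"], "standby": ["nic1"]},
    "PG-vMotion": {"active": ["nic1"], "standby": ["nic0"]},
    "PG-VM-example": {
        "active": ["nic2", "nic3", "nic4", "nic5", "nic6", "nic7"],
        "standby": [],
    },
}


def validate(layout):
    """Return a list of problems found in a port group layout."""
    problems = []
    for name, pg in layout.items():
        if not pg["active"]:
            problems.append(f"{name}: no active uplink")
        overlap = set(pg["active"]) & set(pg["standby"])
        if overlap:
            problems.append(f"{name}: {sorted(overlap)} listed as active and standby")
    return problems


def shares_active_uplink(layout, pg_a, pg_b):
    """True when two port groups are active on the same adapter."""
    return bool(set(layout[pg_a]["active"]) & set(layout[pg_b]["active"]))
```

Running `validate(VSWITCH0)` on both hosts' intended configuration is a cheap way to catch a typo before it turns into a vMotion failure.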

Storage networking

This choice is determined by the vendor best practices. It’s been a while since I worked with Equal Logic, and you should use Dell’s documentation 100% before doing anything. Let me say that again: consult Dell’s documentation before doing this and make sure it aligns. Any EQLogic master is welcome to add via comments. I assume you will be using software iSCSI for these connections. You have 4 total 10GB NICs with two switches. I would create another virtual standard switch for these connections (does it have to be another switch? No, but I would for ease of management). So it’s pretty cut and dried, two dual port NICs like this:

Card 1 Port 1  – we will call it nic8

Card 1 Port 2 – we will call it nic9

Card 2 Port 1 – we will call it nic10

Card 2 Port 2 – we will call it nic11

We have the following switches

SwitchA

SwitchB

I would do the following physical connections:

SwitchA – nic8, nic10

SwitchB – nic9, nic11

 

Normally software iSCSI has you set up a port group per uplink, all on the same VLAN (or native if your switches are only doing iSCSI). So I would create the following port groups:

PG-iSCSI-Nic8-SwitchA

PG-iSCSI-Nic9-SwitchB

PG-iSCSI-Nic10-SwitchA

PG-iSCSI-Nic11-SwitchB

 

Assign the NICs to be active only on their designated port groups (nic8 active on PG-iSCSI-Nic8-SwitchA and unused on all others). Then set up iSCSI storage. The failover policy on the port groups should be set to explicit failover order.
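The one-uplink-per-port-group rule is mechanical, so it can be generated rather than typed. This Python sketch builds the four port groups from the cabling above; the dictionary layout and function name are my own, and it only produces a plan to review against Dell's documentation, not a configuration.

```python
# Physical cabling from this post: which switch each 10GB port plugs into.
UPLINK_TO_SWITCH = {
    "nic8": "SwitchA",
    "nic9": "SwitchB",
    "nic10": "SwitchA",
    "nic11": "SwitchB",
}


def iscsi_port_groups(uplink_to_switch):
    """Build one port group per uplink: that uplink active, all others unused."""
    groups = {}
    for nic, switch in uplink_to_switch.items():
        name = f"PG-iSCSI-{nic.capitalize()}-{switch}"
        groups[name] = {
            "active": [nic],
            "unused": [other for other in uplink_to_switch if other != nic],
        }
    return groups
```

Each generated port group has exactly one active adapter and no standby adapters, which is what lets the storage multipathing, rather than the virtual switch, decide which path to use.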

 

Last Thoughts

With limited information it’s hard to comment on additional options. I would carefully consider and implement percentage based admission control (think 50% or more reserved on each host). If possible, monitor your network bandwidth usage to make sure your virtual machines are getting the bandwidth they need. I hope this rant is useful to someone. Leave me your thoughts or questions.

Warning to all readers using Snapshot CBT based backups

Over the last few days I have become aware of a pretty nasty bug with VMware snapshot API based backups (any image-based solution that is not array based and uses Changed Block Tracking; I will not give names). This bug has been around for a while but has recently been fixed. The problem happens when you expand a currently presented drive by 128GB or more. This expansion triggers a bug in CBT that makes all CBT based backups from that point on junk. You will not be able to restore them. It’s a major pain in the butt. What is worse, you cannot detect this issue until you restore. So here is how you create the bug:

  • Expand a currently presented drive by 128GB or more
  • Do a CBT backup
  • Try to restore that backup or any following backup

You can work around this issue with the following process:

  • Expand a currently presented drive by 128GB or more
  • Disable CBT
  • Re-enable CBT
  • Do a new full backup
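If you track disk sizes in a change log or inventory export, a small check can flag the disks that need the workaround. This Python sketch uses the 128GB growth figure quoted above as its threshold; the function names and the tuple format are hypothetical, and you would need to feed it your own before and after sizes.

```python
CBT_RISK_THRESHOLD_GB = 128  # growth figure quoted in this post


def needs_cbt_reset(old_size_gb, new_size_gb, threshold_gb=CBT_RISK_THRESHOLD_GB):
    """True when a disk grew enough to warrant disabling and re-enabling CBT."""
    return (new_size_gb - old_size_gb) >= threshold_gb


def flag_disks(resizes):
    """Return the names of disks that need a CBT reset and a new full backup.

    resizes is an iterable of (disk_name, old_size_gb, new_size_gb) tuples.
    """
    return [name for name, old, new in resizes if needs_cbt_reset(old, new)]
```

Anything returned by `flag_disks` should go through the disable, re-enable and full backup steps before you trust its next restore.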

This bug has been around since the 4.1 days and I have never run into it. I believe this is because I have mostly worked in Linux heavy shops. We always added a new drive and used logical volume management to expand the mount points, thus avoiding this issue.

Please give me some good news

Well, today I can say this problem is fixed in 5.5 U4, so patch away. The patch does not fix machines that are already backing up incorrectly; it just avoids future occurrences. You can read more about it here.

How to make your support contracts work for you!

Support contracts – also known as the get out of jail free card. Allow me to illustrate the situation. You have run into a production outage. You have tried all the normal fixes and even googled the crap out of the problem without any resolution. Now you are left with admitting you have no idea how to resolve the problem or calling the support team. You have avoided calling support because you are convinced that you will be able to solve the problem in a few minutes. You know that calling support means you will have to spend the next hour on the phone trying to explain the problem to someone who will ask you if it’s plugged in. It’s painful… but sometimes it’s the only way to get resolution. I have been known to be brutal to vendors. I learned from the best. It seems that these days the vendor needs to be scared of losing your business before they will pull in the right resources. Right? Wrong!

Help support help you

I recently started working with VMware BCS (Business Critical Support). For those not familiar, it’s an elevated support contract that gives you an assigned support engineer. In addition, you have access to BCS engineers for first line support. The quality of the engineers in BCS is a lot higher, and my time to resolution has really been reduced. They really read the logs and can be required to provide detailed root cause analysis on problems. In short, I have been really impressed. So does my story end with a plug for BCS or buy more VMware? Nope. What really impressed me was an email from my BCS assigned engineer Frederic Giroux. Here is his email, posted with his permission (I trimmed some sections that may not apply to non-BCS customers):

Hi guys!

 

As most of you do not know, in a previous life, I was a paramedic (for 16 years). Very early in training, we are taught how to gather a story from patients, family members or simple bystanders. The accuracy of the story might very well change the outcome. Will the patient have permanent damages or not or even if he will survive the traumatic events he is facing are often based on the story provided to the medical staff.

 

In IT, the story of a support case is certainly not as critical. Nobody will die (or should not anyway except in rare cases), but disturbances can be quite important and how the story is gathered will also affect the outcome in the sense that it will take more or less time to resolution. A good story, clearly identifying the symptoms and putting them in perspective will allow the IT staff (you, your colleagues, VMware Support and other vendors) to better isolate the solution and work faster.

 

You may have already faced a situation, or seen one, where the medical staff asks questions to a patient, looking like they are not doing anything short of asking questions, and the patient or family members getting very impatient (or even panicking) and starting yelling at the staff. I know I have often faced this situation. Getting the story out is almost as important as the treatment itself. Asking about allergies, medication, time of last meal, description of symptoms, etc. is paramount.

 

Do not give up… I have a point to make ;-) In IT, it is similar. Again, the outcome may not be as critical as in the medical field, but when you are stuck in a bad situation, with the pressure coming from all sides, you may very well feel like it is close to life-threatening ;-) I know the feeling…

 

Now, the point to all of this… When I read the story in SRs, I realize that very few of you have a medical backgrounds ;-) The stories are often sketchy, poorly documented and we, TSEs, are trying to guess what the issue really is. So, I wanted to give you some tricks on how to write a good description and help us help you in resolving the case.

 

Start with a general description of the symptoms. Include the exact error message (if available) and, if possible, a screenshot.

 

Then, answer the following questions:

 

  1. What products and versions are involved? Include build numbers.
  2. What is affected? What VMs, hosts, clusters or systems? Please provide names.
  3. How severe are the symptoms? Down, partially down? Mission critical or not?
  4. Do you know what could have provoked the symptoms (changes recently made)?
  5. Did it start suddenly or gradually? Details.
  6. Provide dates and times. When did it start? When did you try the failed operation? Looking at logs, it helps tremendously to know where (or when) to look.
  7. What steps have you taken to correct the issue? Did it work (partially or in full)?
  8. Do you have a workaround? If so, what is it and is it sustainable and, if so, for how long?

These three are not always necessary:

 

  1. Provide the host server brand and model. Please include firmware versions.
  2. For storage issues, provide the storage array brand and model. Please include firmware versions.
  3. If applicable, do not hesitate to include the topology of the environment as an attached document.

If you do not know an answer, say it. That way, we will know you do not know and it will be clear.

 

Finally, add further description as you see fit. Do not hesitate to tell us about KB articles you already checked and the outcome.

 

Trick #1: Take the above questions and paste them into the case log with your answers. Do this for every case, even the ones looking more obvious as they may not be obvious to the TSE working the case.

 

Trick #2: When you create the case, you may open with it with the brief description and, after that is done, you can send a full email with all the details. It is easier to write in full screen and you can easily add screenshots into the email.

 

Trick #3: Do this for every SR you open, even non urgent ones. Practice makes perfect :-)

 

Some guidelines should be respected to avoid delays and confusion.

 

  • Make sure your text is clear, concise and to the point. Review it carefully and, if possible, have someone review it as well. This is time well spent as it may save hours, even days, because it is better understood by support (remember my example above on medical staff gathering a story while the patient is panicking).
  • Avoid political information. Remain technical and factual.
  • Upload all the necessary files immediately. Do not wait to be asked. You will be saving time.

— End of Quote

Wow! that’s a lot of information

Points to Fred: this is the first time my support infrastructure has provided me education on how to make my experience better. I don’t know how many tickets I have opened with a single one line statement and logs. (Fred knows, and it’s not a good number.) I love that my support engineer took the time to help refine the process. As I tell my co-workers all the time, just let me know the human protocol to get it done and I will follow it. Fred is giving us the support protocol to get it done. Also, as you answer these questions you might find that you resolve the issue yourself. As anyone who has done design might tell you, it’s the process that produces good infrastructure, not the idea. I suggest that you consider following his process and see if it helps your support situation.

 

What do you think?

Have you had a good experience with support ?  What made it that way?

 

 

All Paths Down my new short term enemy

Edit: Thanks to comments on Twitter from Duncan Epping and others, I have corrected some errors in the original article.  This is one of the things I love about the internet, I can make a mistake and others are kind enough to help me correct it.

 

Most of my VMware career I have been blessed with very solid fiber channel arrays.  These arrays have rarely gone down and when they do a reboot of the whole environment normally solves the issues (really have only done this once and it was a software bug in the array).    In so many ways this single point of failure (the storage array) is still a major problem in our journey to the software defined datacenter.    Recently during functionality tests we ran into the dreaded All Paths Down (APD)  situation.   My experience with APD has prompted this post.   In order to understand APD you have to understand Permanent device loss (PDL).

 

What is PDL?

PDL is when your storage array is removing a lun or is about to reboot, and it sends SCSI codes to ESXi to let it know the lun or path is going away. It is the same as my renter letting me know he is moving away. I have some warning and I can prepare. I also know he is really leaving and not coming back. PDL has seen a number of improvements over the years. At this point, if your ESXi host gets a PDL and has a virtual machine on that storage, it starts an HA event. If any other ESXi host can mount that storage, it will power on the virtual machine and return it to operation. If the storage is 100% lost due to PDL, the virtual machine will appear as disconnected and be unavailable. PDL is not desirable: any data not committed to storage will be lost. Virtual machines may be very unhappy with this interruption and require manual recovery, but at least they try to restart. You can resolve PDL by rebooting or rescanning ESXi. Once storage is present you can restart virtual machines.

 

Why is APD the ugly brother to PDL?

APD is very different from PDL. There are no SCSI codes; the storage just goes 100% away. It is my renter moving out in the middle of the night without any warning. I have no idea if they are coming back or what the situation could be. I want to be very clear: All Paths Down, as the name suggests, means all paths to a storage lun are down at the same time. No warning, no notice, just not available. This is a bad situation. ESXi does not know if the lun is going to return. Much like with my rental apartment, I don’t want to paint and re-carpet until I am sure they are gone. This delayed response can cause me to lose money, but I want to be on the safe side. VMware has taken the same approach. During an APD situation VMware does nothing. Virtual machines continue to run in memory. Each operating system acts differently. I believe Windows continues to run with memory cache using FIFO (which means data will be lost because it cannot be written to disk). Once storage has returned, Windows will write to disk like nothing was lost. Linux, once it finds its storage to be unwritable, goes read-only (this can be resolved once storage is back with an OS remount or reboot). This problem is complicated by the fact that ESXi will constantly try to write to these devices. This creates load on the ESXi host (because it’s scanning for storage that is not present) and can cause hostd to crash, making an ESXi host disconnect from vCenter. In 5.1 they added an advanced parameter (Misc.APDTimeout, default 140 seconds) which causes the rescans to stop after 140 seconds. From that point forward the host waits for the storage to identify its presence. As you can imagine, APD is bad. You can read more about APD and PDL in a number of VMware KB articles, but this is a really good link.
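The timeout behaviour described above can be summarised as a two-phase timeline. The Python sketch below is only a mental model built from this post's description; the phase names are mine, and the only real value in it is the 140 second default of Misc.APDTimeout.

```python
APD_TIMEOUT_DEFAULT = 140  # seconds, default of Misc.APDTimeout per this post


def apd_phase(seconds_since_apd, apd_timeout=APD_TIMEOUT_DEFAULT):
    """Describe what the host is doing at a given point in an APD event.

    Before the timeout the host keeps retrying the device; afterwards it
    stops and waits for the storage to announce itself again.
    """
    if seconds_since_apd < 0:
        raise ValueError("seconds_since_apd cannot be negative")
    if seconds_since_apd < apd_timeout:
        return "retrying"
    return "waiting for device"
```

Note that neither phase does anything for the virtual machines themselves: they keep running in memory the whole time, which is why APD hurts so much.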

 

Wait, how do I even get an all paths down?

Well… here is the fun part. That depends on your array. There is a good chance that if you’re reading this article you have network-based storage or you are running a metro cluster. Most other customers will not see this issue unless they run into a bug or really bad change management. If you have fiber channel arrays, either all your HBAs or both of your fiber switches must fail at the same time to create an APD. If you have network storage it can be caused by broadcast storms, multiple switch failures, etc., but it can only happen if your traditional networking and storage networking are separate. If you have them together on the same switch then you would have a host isolation event and HA would work.

You said something about Metro right?

Correct. vMSC (vSphere Metro Storage Cluster, or stretched cluster) is one situation where you will potentially see APD. In vMSC you have two sites and a single cluster stretched between them. Your storage is synchronously replicated between sites. If you lose storage in only a single site then you could have APD and be in a world of hurt. You have created a solution that is meant to prevent downtime by having two sites and the ability to vMotion between them, but now you have virtual machines running and potentially losing data. Very bad things.

My hyper-xxx solution avoids this issue 100%

It is true that some hyperconverged solutions avoid APD. Some do this by making the storage local to the workload. Others distribute the storage to avoid the issue. Most vendors share the network for both storage and networking, making an APD impossible. A failure would mean the host is isolated and your host isolation response would solve the issue.

Why does VMware allow this state to continue?

Well, the first and best answer is that it’s a rare condition. I will throw out a C3PO prediction and call it 1:10,000. It’s pretty rare assuming the following is true: you have redundant fabrics and you have good, documented change processes. The best way to avoid APD is to architect it away. Redundant dedicated paths to storage are your friend. To be 100% fair to VMware, they have made a number of enhancements over the years to reduce the impact of APD issues (for example the Misc.APDTimeout change in 5.1).

 

What about Metro?

Again it’s rare.  If you are building metro spend the money on the architecture.  In this case you will want to reboot your hosts on the failed side and allow them to HA to the other side.

 

What is the good news?

Well, I do have some good news. Once again VMware has provided a solution. In vSphere 6.0 you will have a feature called VM Component Protection (read more here) which allows you to choose what to do in a PDL or APD situation. It includes timers and actions (like shutting down the VM and having HA restart it on another host if possible). A solid future solution to a rare event from VMware.