Edit: Thanks to comments on Twitter from Duncan Epping, MAN1$H, and Martin Geda Banda, I have corrected some errors in the original article. This is one of the things I love about the internet: I can make a mistake and others are kind enough to help me correct it.
For most of my VMware career I have been blessed with very solid fiber channel arrays. These arrays have rarely gone down, and when they do, a reboot of the whole environment normally solves the issue (I have really only done this once, and it was a software bug in the array). In so many ways this single point of failure (the storage array) is still a major problem in our journey to the software-defined datacenter. Recently, during functionality tests, we ran into the dreaded All Paths Down (APD) situation. My experience with APD prompted this post. In order to understand APD you first have to understand Permanent Device Loss (PDL).
What is PDL?
PDL occurs when your storage array is removing a LUN or is about to reboot: it sends SCSI codes to ESXi to let it know the LUN or path is going away. It is the same as my renter letting me know he is moving away. I have some warning and I can prepare. I also know he is really leaving and not coming back. PDL handling has seen a number of improvements over the years. At this point, if your ESXi host gets a PDL and has a virtual machine on that storage, it starts an HA event. If any other ESXi host can mount that storage, it will power on the virtual machine and return it to operation. If the storage is 100% lost due to PDL, the virtual machine will appear as disconnected and be unavailable. PDL is not desirable: any data not committed to storage will be lost. Virtual machines may be very unhappy with this interruption and require manual recovery, but at least they try to restart. You can resolve PDL by rebooting or rescanning ESXi. Once storage is present you can restart virtual machines.
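As a rough illustration of the restart decision described above, here is a minimal Python sketch. It is not how HA is actually implemented; the function name, the host names, and the visibility map are all hypothetical. It only captures the rule that a virtual machine can be restarted on another host that still sees the storage, and stays unavailable if no host does.

```python
from typing import Dict, Optional, Set


def pick_restart_host(
    datastore: str,
    failed_host: str,
    visible: Dict[str, Set[str]],
) -> Optional[str]:
    """Return another host that still sees the datastore, or None.

    `visible` maps a host name to the datastores it can currently mount.
    Every name here is made up for illustration; this is a toy model,
    not the HA placement algorithm.
    """
    for host in sorted(visible):
        if host != failed_host and datastore in visible[host]:
            return host
    return None  # storage is gone everywhere: the VM stays unavailable


visible = {
    "esx01": set(),            # received the PDL and lost the LUN
    "esx02": {"datastore1"},   # still sees the LUN
}
print(pick_restart_host("datastore1", "esx01", visible))  # esx02
```

If every host has lost the datastore, the function returns None, which corresponds to the "disconnected and unavailable" case in the text.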
Why is APD the ugly brother to PDL?
APD is very different from PDL. There are no SCSI codes; the storage just goes 100% away. It is my renter moving out in the middle of the night without any warning. I have no idea if they are coming back or what the situation could be. I want to be very clear: All Paths Down, as the name suggests, means all paths to a storage LUN are down at the same time. No warning, no notice, just not available. This is a bad situation. ESXi does not know if the LUN is going to return. Much like with my rental apartment, I don’t want to paint and re-carpet until I am sure they are gone. This delayed response can cause me to lose money, but I want to be on the safe side. VMware has taken the same approach. During an APD situation VMware does nothing. Virtual machines continue to run in memory. Each operating system acts differently. I believe Windows continues to run with a memory cache using FIFO (which means data will be lost because it cannot be written to disk). Once storage has returned, Windows will write to disk like nothing was lost. Linux, once it finds its storage to be unwritable, goes read-only (this can be resolved once storage is back with an OS remount or reboot). This problem is complicated by the fact that ESXi will constantly try to write to these devices. This creates load on the ESXi host (because it is scanning for storage that is not present) and can cause hostd to crash, making an ESXi host disconnect from vCenter. In 5.1 VMware added an advanced parameter (Misc.APDTimeout, default 140 seconds) which causes the rescans to stop after 140 seconds. From that point forward the host waits for the storage to identify its presence. As you can imagine, APD is bad. You can read more about APD and PDL in a number of VMware KB articles but this is a really good link.
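To make the difference between the two conditions concrete, here is a small Python sketch of how a device's state could be classified. It is a simplified model under stated assumptions, not ESXi's actual logic: the class, field, and state names are hypothetical, and the only value taken from the article is the 140 second default of Misc.APDTimeout.

```python
from dataclasses import dataclass

APD_TIMEOUT_SECONDS = 140  # default of Misc.APDTimeout, per the article


@dataclass
class DeviceStatus:
    """Hypothetical snapshot of one LUN as seen by one host."""
    paths_up: int                     # number of working paths
    pdl_code_received: bool           # array announced the loss via SCSI codes
    seconds_without_paths: float = 0.0


def classify(status: DeviceStatus, apd_timeout: float = APD_TIMEOUT_SECONDS) -> str:
    """Return a coarse state label for the device (labels are made up)."""
    if status.pdl_code_received:
        return "PDL"          # the array said the device is gone for good
    if status.paths_up > 0:
        return "OK"           # at least one path still works
    if status.seconds_without_paths < apd_timeout:
        return "APD"          # silent loss, host keeps retrying
    return "APD_TIMEOUT"      # host stops retrying and waits for storage


print(classify(DeviceStatus(paths_up=0, pdl_code_received=False,
                            seconds_without_paths=30)))   # APD
print(classify(DeviceStatus(paths_up=0, pdl_code_received=False,
                            seconds_without_paths=200)))  # APD_TIMEOUT
```

The key point the sketch tries to show is that PDL is an explicit signal from the array, while APD is inferred only from the absence of any working path.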
Wait, how do I even get an all paths down?
Well… here is the fun part. That depends on your array. There is a good chance that if you’re reading this article you have network-based storage or you are running a metro cluster. Most other customers will not see this issue unless they run into a bug or really bad change management. If you have fiber channel arrays, either all your HBAs or both your fiber switches must fail at the same time to create an APD. If you have network storage it can be caused by broadcast storms, multiple switch failures, etc., but it can only happen if you keep your traditional networking and storage networking separate. If you have them together on the same switch then you would have a host isolation event and HA would work.
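The fiber channel case can be sketched with a tiny path model in Python. This is only an illustration: it assumes a host with two HBAs, each cabled to its own switch, and the component names (hba0, switchA, and so on) are invented. A path survives only if both its HBA and its switch are healthy, and APD means no path survives.

```python
from typing import Iterable, List, Set, Tuple

Path = Tuple[str, str]  # (HBA, switch) pair; names are hypothetical

# Assumed layout: two independent fabrics, one path through each.
PATHS: List[Path] = [("hba0", "switchA"), ("hba1", "switchB")]


def live_paths(paths: Iterable[Path], failed: Set[str]) -> List[Path]:
    """Keep only the paths whose HBA and switch are both still healthy."""
    return [p for p in paths if p[0] not in failed and p[1] not in failed]


def is_apd(paths: Iterable[Path], failed: Set[str]) -> bool:
    """All Paths Down: not a single usable path is left."""
    return len(live_paths(paths, failed)) == 0


print(is_apd(PATHS, {"hba0"}))                # False: hba1/switchB still works
print(is_apd(PATHS, {"hba0", "hba1"}))        # True: every HBA is gone
print(is_apd(PATHS, {"switchA", "switchB"}))  # True: both switches are gone
```

A single component failure leaves one path alive, which is why APD on a redundant fabric usually needs a double failure, a bug, or a bad change.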
You said something about Metro, right?
Correct. vMSC (vSphere Metro Storage Cluster, or stretched cluster) is one situation where you will potentially see APD. In vMSC you have two sites and a single cluster stretched between them. Your storage is synchronously replicated between sites. If you lose storage in only a single site then you could have APD and be in a world of hurt. You have created a solution that is supposed to prevent downtime by having two sites and the ability to vMotion between them, but now you have virtual machines running and potentially losing data. Very bad things.
My hyper-xxx solution avoids this issue 100%
It is true that some hyperconverged solutions avoid APD. Some do this by making the storage local to the workload. Others distribute the storage to avoid the issue. Most vendors share the same network for both storage and traditional networking, making an APD impossible. A failure would mean the host is isolated and your host isolation response would solve the issue.
Why does VMware allow this state to continue?
Well, the first and best answer is that it’s a rare condition. I will throw out a C-3PO prediction and call it 1:10,000. It’s pretty rare assuming the following is true: you have redundant fabrics and you have good, documented change processes. The best way to avoid APD is to architect it away. Redundant dedicated paths to storage are your friend. To be 100% fair to VMware, they have made a number of enhancements over the years to reduce the impact of APD issues (for example, the Misc.APDTimeout change in 5.1).
What about Metro?
Again, it’s rare. If you are building metro, spend the money on the architecture. If you do hit APD on one side, you will want to reboot your hosts on the failed side and allow HA to restart their virtual machines on the other side.
What is the good news?
Well, I do have some good news. Once again VMware has provided a solution. In vSphere 6.0 you will have a feature called VM Component Protection (read more here), which allows you to choose what to do in a PDL or APD situation. It includes timers and actions (like shutting down the VM and having HA restart it on another host if possible). A solid future solution to a rare event from VMware.
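The idea of "timers and actions" can be sketched as a small policy table in Python. This is a hedged illustration only: the class, field names, action strings, and the 180 second delay are assumptions made up for the example, and they are not the real vSphere settings or API.

```python
from dataclasses import dataclass


@dataclass
class ProtectionPolicy:
    """Hypothetical per-cluster policy; names and values are illustrative."""
    pdl_action: str = "restart_elsewhere"
    apd_action: str = "restart_elsewhere"
    apd_delay_seconds: int = 180  # assumed wait before acting on an APD


def respond(condition: str, seconds_elapsed: float, policy: ProtectionPolicy) -> str:
    """Pick an action for a storage condition seen by a host."""
    if condition == "PDL":
        # The array confirmed the loss, so there is nothing to wait for.
        return policy.pdl_action
    if condition == "APD":
        # The storage may come back, so act only after the timer expires.
        if seconds_elapsed >= policy.apd_delay_seconds:
            return policy.apd_action
        return "wait"
    return "none"


policy = ProtectionPolicy()
print(respond("PDL", 0, policy))    # restart_elsewhere
print(respond("APD", 60, policy))   # wait
print(respond("APD", 240, policy))  # restart_elsewhere
```

The design point is that PDL can be acted on immediately because the loss is confirmed, while APD gets a timer because nobody knows whether the storage is coming back.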