Almost everyone uses VMware HA to protect against host failures. At the same time, almost no one uses VM and Application Monitoring. I believe the lack of usage comes down to two factors:
- A lack of understanding of how it works, which leads to fears that it will cause outages
- It is disabled by default in vSphere
How does it work?
In its most basic form it works just like host HA: when a virtual machine fails, it restarts the operating system. Sounds simple and perfect, right? Here is where the fear begins. No one likes the idea of possible false reboots, so it’s important to understand how it detects a failure.
- VMware Tools heartbeats – This is configurable, built-in internal monitoring between VMware Tools and the host running the virtual machine. (How do they communicate? Via a built-in virtual chip in the virtual hardware.)
- Monitoring the I/O on the virtual machine – The VM monitoring checks for any I/O in the last 120 seconds – both network and storage
If both of these checks fail, your virtual machine is restarted via a reset. If you have an operating system issue, it is possible for your virtual machine to end up in a constant reboot loop, so the number of restarts is capped: by default a virtual machine will only be rebooted three times in an hour.
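To make the "both conditions" rule concrete, here is a minimal sketch of the decision in Python. It is only a model of the logic described above, not VMware code: the function name and parameters are made up for illustration, and the 120-second default is the I/O window mentioned above.

```python
def should_reset(heartbeat_failed: bool,
                 seconds_since_last_io: float,
                 io_stats_interval: float = 120.0) -> bool:
    """Model of the two-signal check (hypothetical helper, not a VMware API).

    A reset is only warranted when the VMware Tools heartbeat has failed
    AND no network or storage I/O was seen within the I/O window.
    """
    no_recent_io = seconds_since_last_io >= io_stats_interval
    return heartbeat_failed and no_recent_io


# Heartbeat lost, but the VM wrote to disk 10 seconds ago: leave it alone.
print(should_reset(heartbeat_failed=True, seconds_since_last_io=10))
# Heartbeat lost and no I/O for 200 seconds: treat it as a failure.
print(should_reset(heartbeat_failed=True, seconds_since_last_io=200))
```

The point of the second signal is that a stopped or crashed VMware Tools service alone does not trigger a restart as long as the guest is still doing I/O.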
VM monitoring is implemented by the HA agent, FDM, on each host. Since it is tied to FDM, it will operate without vCenter. FDM and VM monitoring do report to vCenter, but they work without vCenter being present. Each failure detection mechanism is configurable to meet your availability needs. The heartbeat monitoring has the following possible settings:
- Failure interval – How long the heartbeat has to fail to restart a VM
- Minimum uptime – VM monitoring will wait this long after the VM starts before it begins monitoring the VMware Tools heartbeats
- Maximum per-VM resets – Max number of restarts within the maximum resets time window
- Maximum resets time window – The window used to evaluate the maximum per-VM resets
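Here is a rough sketch of how these four settings interact, written as a small Python model. The class and field names are hypothetical and the failure interval and minimum uptime values are placeholders for illustration only; the three resets per hour limit is the default mentioned earlier. Check your own cluster for the real values.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MonitoringSettings:
    failure_interval: float = 30.0      # seconds, placeholder value
    minimum_uptime: float = 120.0       # seconds, placeholder value
    max_per_vm_resets: int = 3          # "three times..."
    max_resets_window: float = 3600.0   # "...in an hour"


@dataclass
class ResetPolicy:
    settings: MonitoringSettings
    reset_times: deque = field(default_factory=deque)

    def may_reset(self, now: float, uptime: float,
                  heartbeat_missing_for: float) -> bool:
        s = self.settings
        # Heartbeats are not monitored until the VM has been up long enough.
        if uptime < s.minimum_uptime:
            return False
        # The heartbeat must be missing for the whole failure interval.
        if heartbeat_missing_for < s.failure_interval:
            return False
        # Forget resets that have aged out of the time window.
        while self.reset_times and now - self.reset_times[0] >= s.max_resets_window:
            self.reset_times.popleft()
        # Refuse further resets once the per-VM limit is reached.
        return len(self.reset_times) < s.max_per_vm_resets

    def record_reset(self, now: float) -> None:
        self.reset_times.append(now)


policy = ResetPolicy(MonitoringSettings())
for t in (1000.0, 1500.0, 2000.0):
    if policy.may_reset(now=t, uptime=300.0, heartbeat_missing_for=45.0):
        policy.record_reset(t)
# A fourth failure inside the same hour is not reset again.
print(policy.may_reset(now=2500.0, uptime=300.0, heartbeat_missing_for=45.0))
```

This is what stops a broken operating system from rebooting forever: after the limit is hit the VM is left alone until the window has passed.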
You can adjust the I/O monitoring interval with the advanced setting das.iostatsinterval; by default it is 120 seconds. The hostd service implements the I/O monitoring and may add up to 30 seconds to failure detection. This is because VM monitoring needs the stats from the last das.iostatsinterval seconds, and hostd rolls up stats every 30 seconds. So if you expect your machine to restart after 120 seconds of failure, understand that it may take up to 150 seconds depending on hostd polling.
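The timing above is simple arithmetic, sketched here in Python so you can plug in your own interval. The function and constant names are made up for illustration; the 30-second rollup and 120-second default come from the paragraph above.

```python
HOSTD_ROLLUP_SECONDS = 30  # hostd rolls up stats every 30 seconds


def detection_window(io_stats_interval: int = 120) -> tuple[int, int]:
    """Return (best case, worst case) seconds before a lack of I/O is noticed."""
    best_case = io_stats_interval
    worst_case = io_stats_interval + HOSTD_ROLLUP_SECONDS
    return best_case, worst_case


print(detection_window())    # default das.iostatsinterval
print(detection_window(60))  # a shorter, more aggressive interval
```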
Many people are uncomfortable with VM restarts because they may never find the cause of the failure. VM monitoring implements a screenshot feature that stores a screenshot of the failed virtual machine's console in the virtual machine directory each time it is restarted.
Application Monitoring: what is that?
Application monitoring takes advantage of third-party APIs via VMware Tools to monitor the application. These APIs can be used to restart the machine or the application. VMware exposes the APIs to vendors and leaves the monitoring and control in the vendors' hands.
From my perspective, I cannot see a reason not to implement VM monitoring. It can be disabled on specific virtual machines if they have known issues, and for everything else it should increase your uptime.
Let me know your experiences with these features.