How does VM and Application Monitoring work?

Almost everyone uses VMware HA to protect against host failures, yet almost no one uses VM and Application Monitoring.  I believe the lack of adoption comes down to two factors:

  • Lack of understanding of how it works, leading to fear that it will cause outages
  • It is disabled by default in vSphere

 

VM Monitoring

How does it work?

In its most basic form it works just like host HA: when a virtual machine fails, the guest operating system is restarted.   Sounds simple and perfect, right?   Here is where the fear begins.   No one likes the idea of a false reboot, so it's important to understand how a failure is detected.

  1. VMware Tools heartbeats – Configurable, built-in monitoring between VMware Tools and the host running the virtual machine.  (How do they communicate?  Via a built-in virtual chip in the virtual hardware.)
  2. I/O monitoring on the virtual machine – VM monitoring checks for any I/O, both network and storage, in the last 120 seconds.

If both of these checks fail, your virtual machine is restarted via a guest reboot.  If the guest has an operating system problem, it is possible the virtual machine could end up in a constant reboot loop, so by default a virtual machine will only be restarted three times within an hour.

VM monitoring is implemented by the HA agent (FDM) on each host.   Since it is tied to FDM it operates without vCenter; FDM and VM monitoring report to vCenter but do not depend on it being present.   Each failure detection mechanism is configurable to meet your availability needs.  The heartbeats have the following settings (a configuration sketch follows the list):

  • Failure interval – How long the heartbeat has to be missing before the VM is restarted
  • Minimum uptime – VM monitoring waits this long after power-on before it starts watching the VMware Tools heartbeats
  • Maximum per-VM resets – Maximum number of restarts within the maximum resets time window
  • Maximum resets time window – The window used to evaluate the maximum per-VM resets
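
Here is a minimal pyVmomi sketch of what enabling VM monitoring and tuning these settings on a cluster might look like.  The vCenter name, credentials, cluster path and the specific values are assumptions, not recommendations; adjust them to your environment.

```python
# A minimal sketch (not a recommendation) of enabling VM Monitoring on a cluster
# and tuning the heartbeat settings with pyVmomi. The vCenter, credentials, the
# cluster inventory path and the values themselves are all assumptions.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()          # lab only; use proper certificates in production
si = SmartConnect(host="vc.example.com", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
cluster = si.RetrieveContent().searchIndex.FindByInventoryPath("Datacenter/host/Prod-Cluster")

monitoring = vim.cluster.VmToolsMonitoringSettings(
    enabled=True,
    vmMonitoring="vmMonitoringOnly",   # heartbeat/I/O monitoring without application monitoring
    failureInterval=30,                # seconds the heartbeat must be missing
    minUpTime=120,                     # wait this long after power-on before monitoring
    maxFailures=3,                     # maximum per-VM resets...
    maxFailureWindow=3600)             # ...within this time window (seconds)

spec = vim.cluster.ConfigSpecEx(dasConfig=vim.cluster.DasConfigInfo(
    vmMonitoring="vmMonitoringOnly",
    defaultVmSettings=vim.cluster.DasVmSettings(vmToolsMonitoringSettings=monitoring)))

WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))
Disconnect(si)
```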

 

You can adjust the I/O monitoring interval with the advanced setting das.iostatsinterval; by default it is 120 seconds.  hostd implements the I/O monitoring and may add up to 30 seconds to the failure detection time, because the check needs the stats from the last interval and hostd only rolls stats up every 30 seconds.   So if you expect your machine to restart after 120 seconds of failure, understand that it may take up to 150 seconds depending on hostd polling.
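
If you do want to change it, the setting is pushed as an HA advanced option.  A small sketch continuing with the cluster object from the snippet above (the 180-second value is only an illustration; the 120-second default is usually fine):

```python
# Sketch continuing from the previous snippet: set das.iostatsinterval as an
# HA advanced option. 180 seconds is only an illustration of the mechanism.
opt = vim.option.OptionValue(key="das.iostatsinterval", value="180")
spec = vim.cluster.ConfigSpecEx(dasConfig=vim.cluster.DasConfigInfo(option=[opt]))
WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))
```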

Many people are uncomfortable with automatic VM restarts because they may never find the source of the failure.   To help with that, VM monitoring has a screenshot feature that stores a screenshot of the failed guest in the virtual machine's directory each time it is restarted.

 

Application Monitoring: what is that?

Application monitoring takes advantage of APIs exposed through VMware Tools so third-party agents can monitor the application itself.   These agents can be used to restart the application or reboot the machine.   VMware exposes the APIs to the vendors and leaves the monitoring and control in the vendors' hands.

 

Design Choices

From my perspective I cannot see a reason not to implement VM monitoring.  It can be disabled on specific virtual machines with known issues, and it will greatly increase your uptime.

 

Let me know your experiences with these features.

Design Scenario: Gigabit networking with 10GB for storage SMB setup

Yesterday I got a comment on an older blog article asking for some help.

Caution

While it would be a bad idea personally and professionally for me to give specific advice without a design engagement, I thought I might provide some thoughts about the scenario here.  This will allow me to justify some design choices I might make in the situation.   In no way should this be taken as law.  In reality everyone's situation is different, and small requirements can really change the design.   Please do not blindly build this infrastructure; these are only guidelines.  It does not take into account specific vendor best practices (because I am too lazy to look them up).

 

Information provided:

We are a SMB that’s starting to cross over to the world of virtualization. I could really use your help on our network design. This is the current equipment we have:

 

2 (ESXi Hosts) Dell R630 with 512GB Ram, 2×4 1GB port NICS each (8 Total each host) and 2 x dual port 10GB NIC(4 Total) on each host

 

Equal Logic PS6210XS SAN with Dual 10GB Controllers

 

2 Dell N4032F 10GbE switch

 

We are planning to use the 10GbE for the SAN(isolated) and use the remaining 8 x 1GB port for Management/vMotion and our Server Network.

 

How would you go about designing the network for our environment?

 

Requirements

  • Must use the current hardware

 

Constraints

  • The 10GB network adapters are for isolated SAN only

 

Assumptions

  • Since this customer is an SMB I doubt they will buy Enterprise Plus licenses, so we will design around standard switches
  • The virtual machine / management network ports are distributed across two different upstream switches
  • Your storage solution supports some type of multipathing with two switches

 

The question was related to networking so here we go:

Virtual machine and vSphere networking

It's hard to make a determination here without understanding the number of virtual machines and their network bandwidth needs.   It is really tempting to use two of the 10Gb NICs (there are four in total) for the vSphere and virtual machine networking.  Due to the constraints, we will avoid that temptation.

Management Network

Management is easy: vCenter and console access, I assume.  If that is true I would assign two network adapters to management, one active and the other standby.   You really want two to keep it available and to avoid false host isolation events.

vMotion network

Our hosts are large (512GB of RAM), which leads me to believe we are going to have a lot of virtual machines on each host.   With only two hosts I am very concerned about taking down one host to patch, and how long it will take to move virtual machines between hosts over a single 1Gb network adapter.  You might want to consider multi-NIC vMotion, but it introduces complexity into the vSphere design and its manageability.    You should weigh how often you are going to schedule downtime on a host against that complexity.  My guess is that an SMB will not patch all that often.   So I would assign two network adapters to vMotion, one active and the other standby, sharing the same adapters as management but in the opposite order (nic0 active for management with nic1 standby; nic1 active for vMotion with nic0 standby).

Virtual machine networks

At this point we have six adapters left for virtual machines.  Assign them all to virtual machines.   What really matters is the load balancing we use for these adapters.  Let's be clear: you cannot provide more than 1Gb of total bandwidth to an individual virtual machine with this configuration without using port channel or LACP configurations.   I assume you don't want to mess with port channels or virtual port channels across two switches, so we need to look at the remaining options for balancing and using these NICs:

The options (taken from here), with IP hash removed due to the lack of port channels and route based on physical NIC load removed due to the lack of Enterprise Plus:

  • Route based on the originating port ID: Choose an uplink based on the virtual port where the traffic entered the virtual switch.
  • Route based on a source MAC hash: Choose an uplink based on a hash of the source Ethernet MAC address.
  • Use explicit failover order: Always use the highest order uplink from the list of Active adapters which passes failover detection criteria.

There is a holy war within the VMware community over which one to choose.  None will balance traffic perfectly.  Personally I would go with the default load balancing method, route based on originating port ID.

How many VLANS

If possible please use a different VLAN for at least the following: management, vMotion and virtual machines.  Multiple virtual machine VLANs are wonderful.   It is critical from a security perspective that the vMotion VLAN not be shared.

How many virtual switches

Now to the question of virtual switches.   Remember, no Enterprise Plus, so we are using standard switches.  These have to have the same configuration, including case sensitivity, on each host (good thing we only have two hosts).   You might want to consider configuring them via a script (I have an older blog post on that somewhere, and a pyVmomi sketch follows the layout below).   You have two sets of network adapters, vMotion/management and virtual machine, and I would connect them all to the same virtual switch just for ease of management.   So your setup would look like this, assuming your 1Gb NICs come into ESXi as vmnic0 – vmnic7 (shortened to nic0 – nic7 below):

vSwitch0 – all eight 1Gb uplinks, with the following port groups (PG):

  • PG-vMotion – Active: nic1, Standby: nic0
  • PG-Management – Active: nic0, Standby: nic1
  • Virtual machine port groups (one port group per VLAN) – Active: nic2 – nic7
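
Here is a minimal pyVmomi sketch of that layout on a single host.  The host name, credentials and VLAN IDs are assumptions, and vSwitch0 with vmnic0 attached normally already exists after installation; treat this as an illustration of the API rather than a finished script, and repeat it (or loop) for each host.

```python
# A sketch, not a finished script: build the vSwitch0 layout described above on
# one ESXi host with pyVmomi. Host name, credentials and VLAN IDs are assumptions.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab only
si = SmartConnect(host="esx01.example.com", user="root", pwd="VMware1!", sslContext=ctx)
host = si.RetrieveContent().searchIndex.FindByDnsName(dnsName="esx01.example.com",
                                                      vmSearch=False)
netsys = host.configManager.networkSystem

# Attach all eight 1Gb uplinks to vSwitch0 (it usually already exists with vmnic0).
netsys.UpdateVirtualSwitch("vSwitch0", vim.host.VirtualSwitch.Specification(
    numPorts=128,
    bridge=vim.host.VirtualSwitch.BondBridge(nicDevice=["vmnic%d" % i for i in range(8)])))

def add_pg(name, vlan, active, standby=None):
    """Create a port group on vSwitch0 with an explicit active/standby NIC order."""
    teaming = vim.host.NetworkPolicy.NicTeamingPolicy(
        policy="loadbalance_srcid",   # route based on originating virtual port ID
        nicOrder=vim.host.NetworkPolicy.NicOrderPolicy(activeNic=active,
                                                       standbyNic=standby or []))
    netsys.AddPortGroup(portgrp=vim.host.PortGroup.Specification(
        name=name, vlanId=vlan, vswitchName="vSwitch0",
        policy=vim.host.NetworkPolicy(nicTeaming=teaming)))

add_pg("PG-Management", 10, active=["vmnic0"], standby=["vmnic1"])          # VLAN 10 assumed
add_pg("PG-vMotion",    20, active=["vmnic1"], standby=["vmnic0"])          # VLAN 20 assumed
add_pg("PG-Servers",    30, active=["vmnic%d" % i for i in range(2, 8)])    # VM VLAN 30 assumed

Disconnect(si)
```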

Storage networking

This choice is determined by the vendor best practices.  It's been a while since I worked with EqualLogic, and you should follow Dell's documentation 100% before doing anything.  Let me say that again: consult Dell's documentation before doing this and make sure it aligns.   Any EqualLogic expert is welcome to add to this via the comments.   I assume you will be using software iSCSI for these connections.   You have four 10Gb NICs in total and two switches.   I would create another standard virtual switch for these connections (does it have to be another switch? No, but I would for ease of management).  So it's pretty cut and dried with two dual-port NICs, like this:

Card 1 Port 1  – we will call it nic8

Card 1 Port 2 – we will call it nic9

Card 2 Port 1 – we will call it nic10

Card 2 Port 2 – we will call it nic11

We have the following switches

SwitchA

SwitchB

I would do the following physical connections:

SwitchA – nic8, nic10

SwitchB – nic9, nic11

 

Normally software iSCSI has you set up one port group per uplink, all on the same VLAN (or the native VLAN if your switches are only doing iSCSI). So I would create the following port groups:

PG-iSCSI-Nic8-SwitchA

PG-iSCSI-Nic9-SwitchB

PG-iSCSI-Nic10-SwitchA

PG-iSCSI-Nic11-SwitchB

 

Assign each NIC to be active only on its designated port group (nic8 active on PG-iSCSI-Nic8-SwitchA and unused on all the others), then set up the software iSCSI adapter and bind these port groups to it.   The teaming policy on these port groups should be explicit failover with a single active uplink, which is what iSCSI port binding requires.
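
Here is a minimal pyVmomi sketch of that iSCSI setup on one host.  The VMkernel IP addresses, the vSwitch name and the software iSCSI adapter name (vmhba64) are assumptions; again, check this against Dell's EqualLogic documentation before using anything like it.

```python
# A sketch, not a vendor-validated procedure: create the iSCSI port groups with a
# single active uplink and bind the resulting VMkernel ports to the software iSCSI
# adapter. IP addresses, vSwitch name and the vmhba name are assumptions.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab only
si = SmartConnect(host="esx01.example.com", user="root", pwd="VMware1!", sslContext=ctx)
host = si.RetrieveContent().searchIndex.FindByDnsName(dnsName="esx01.example.com",
                                                      vmSearch=False)
netsys = host.configManager.networkSystem

# Dedicated standard switch for storage, using the four 10Gb uplinks.
netsys.AddVirtualSwitch("vSwitch1", vim.host.VirtualSwitch.Specification(
    numPorts=64,
    bridge=vim.host.VirtualSwitch.BondBridge(
        nicDevice=["vmnic8", "vmnic9", "vmnic10", "vmnic11"])))

# One port group per uplink, explicit failover with exactly one active NIC,
# plus a VMkernel port that gets bound to the software iSCSI adapter.
iscsi_pgs = {
    "PG-iSCSI-Nic8-SwitchA":  ("vmnic8",  "10.10.10.11"),
    "PG-iSCSI-Nic9-SwitchB":  ("vmnic9",  "10.10.10.12"),
    "PG-iSCSI-Nic10-SwitchA": ("vmnic10", "10.10.10.13"),
    "PG-iSCSI-Nic11-SwitchB": ("vmnic11", "10.10.10.14"),
}
for name, (nic, ip) in iscsi_pgs.items():
    teaming = vim.host.NetworkPolicy.NicTeamingPolicy(
        policy="failover_explicit",
        nicOrder=vim.host.NetworkPolicy.NicOrderPolicy(activeNic=[nic], standbyNic=[]))
    netsys.AddPortGroup(portgrp=vim.host.PortGroup.Specification(
        name=name, vlanId=0, vswitchName="vSwitch1",
        policy=vim.host.NetworkPolicy(nicTeaming=teaming)))
    vmk = netsys.AddVirtualNic(portgroup=name, nic=vim.host.VirtualNic.Specification(
        ip=vim.host.IpConfig(dhcp=False, ipAddress=ip, subnetMask="255.255.255.0")))
    host.configManager.iscsiManager.BindVnic(iScsiHbaName="vmhba64", vnicDevice=vmk)

Disconnect(si)
```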

 

Last Thoughts

With limited information it's hard to comment on additional options.  I would carefully consider and implement percentage-based admission control (think 50% or more reserved, since with only two hosts you need to be able to lose one).  If possible, monitor your network bandwidth usage to make sure your virtual machines are getting the bandwidth they need.   I hope this rant is useful to someone.  Leave me your thoughts or questions.
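
For what it's worth, here is a minimal pyVmomi sketch of switching a cluster to percentage-based admission control; the cluster path and credentials are assumptions.

```python
# A sketch: enable percentage-based admission control with 50% of CPU and memory
# reserved (with two hosts you need to survive losing one). Names are assumptions.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab only
si = SmartConnect(host="vc.example.com", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
cluster = si.RetrieveContent().searchIndex.FindByInventoryPath("Datacenter/host/SMB-Cluster")

policy = vim.cluster.FailoverResourcesAdmissionControlPolicy(
    cpuFailoverResourcesPercent=50,
    memoryFailoverResourcesPercent=50)
spec = vim.cluster.ConfigSpecEx(dasConfig=vim.cluster.DasConfigInfo(
    admissionControlEnabled=True,
    admissionControlPolicy=policy))

WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))
Disconnect(si)
```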

Warning to all readers using Snapshot CBT based backups

Over the last few days I have become aware of a pretty nasty bug with VMware snapshot API based backups (any image-based solution that is not array based and uses Change Block Tracking; I will not give names).  This bug has been around for a while but has recently been fixed.   The problem happens when you expand a currently presented drive by 128 GB or more.   This expansion triggers a bug in CBT that makes all subsequent CBT-based backups junk.  You will not be able to restore them.   It's a major pain in the butt.   What is worse, you cannot detect the issue until you restore.  So here is how you create the bug:

  • Expand a currently presented drive by 128 GB or more
  • Do a CBT backup
  • Try to restore that backup or any following backup

You can work around this issue with the following process (a scripted sketch of the CBT toggle follows the list):

  • Expand a currently presented drive by 128 GB or more
  • Disable CBT
  • Re-enable CBT
  • Do a new full backup
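
For the disable/re-enable step, here is a minimal pyVmomi sketch.  The vCenter and VM names are assumptions, the VM must have no snapshots, and the change only takes effect after a stun/unstun cycle (for example a snapshot create/delete or a power cycle); your backup vendor's own guidance takes precedence.

```python
# A sketch: toggle Change Block Tracking off and back on for one VM. Names are
# assumptions; follow your backup vendor's guidance for the full reset procedure.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab only
si = SmartConnect(host="vc.example.com", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
vm = si.RetrieveContent().searchIndex.FindByDnsName(dnsName="fileserver01.example.com",
                                                    vmSearch=True)

for enabled in (False, True):            # disable CBT, then re-enable it
    spec = vim.vm.ConfigSpec(changeTrackingEnabled=enabled)
    WaitForTask(vm.ReconfigVM_Task(spec))
# Remember to run a new full backup afterwards.

Disconnect(si)
```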

This bug has been around since the 4.1 days and I have never run into it.  I believe this is because I have mostly worked in Linux-heavy shops.  We always added a new drive and used logical volume management to expand the mount points, thus avoiding this issue.

Please give me some good news

Well, today I can say this problem is fixed in 5.5 U4, so patch away.  It does not fix machines that are already backing up incorrectly; it just avoids future occurrences.  You can read more about it here.

How to make your support contracts work for you!

Support contracts – also known as the get out of jail free card.   Allow me to illustrate the situation.  You have run into a production outage.  You have tried all the normal fixes and googled the crap out of the problem without any resolution.    Now you are left with admitting you have no idea how to resolve the problem or calling the support team.    You have avoided calling support because you are convinced you will be able to solve the problem in a few minutes.  You know that calling support means spending the next hour on the phone trying to explain the problem to someone who will ask you if it's plugged in.    It's painful… but sometimes it's the only way to get resolution.   I have been known to be brutal to vendors.   I learned from the best.  It seems that these days the vendor needs to be scared of losing your business before they will pull in the right resources.   Right?   — Wrong!

Help support help you

I recently started working with VMware BCS (Business Critical Support).  For those not familiar, it's an elevated support contract that gives you access to a dedicated support engineer, and in addition you get BCS engineers for first-line support.  The quality of the engineers in BCS is a lot higher, and my time to resolution has really been reduced.  They really read the logs and can be required to provide detailed root cause analysis on problems.   In short, I have been really impressed.   So does my story end with a plug for BCS or buy more VMware?  Nope.  What really impressed me was an email from my assigned BCS engineer, Frederic Giroux.   Here is his email, posted with his permission (trimmed of some sections that may not apply to non-BCS customers):

Hi guys!

 

As most of you do not know, in a previous life I was a paramedic (for 16 years). Very early in training, we are taught how to gather a story from patients, family members or simple bystanders. The accuracy of the story might very well change the outcome. Whether the patient will have permanent damage, or even whether he will survive the traumatic events he is facing, often depends on the story provided to the medical staff.

 

In IT, the story of a support case is certainly not as critical. Nobody will die (or should not anyway except in rare cases), but disturbances can be quite important and how the story is gathered will also affect the outcome in the sense that it will take more or less time to resolution. A good story, clearly identifying the symptoms and putting them in perspective will allow the IT staff (you, your colleagues, VMware Support and other vendors) to better isolate the solution and work faster.

 

You may have already faced a situation, or seen one, where the medical staff asks questions to a patient, looking like they are not doing anything short of asking questions, and the patient or family members getting very impatient (or even panicking) and starting yelling at the staff. I know I have often faced this situation. Getting the story out is almost as important as the treatment itself. Asking about allergies, medication, time of last meal, description of symptoms, etc. is paramount.

 

Do not give up… I have a point to make 😉 In IT, it is similar. Again, the outcome may not be as critical as in the medical field, but when you are stuck in a bad situation, with the pressure coming from all sides, you may very well feel like it is close to life-threatening 😉 I know the feeling…

 

Now, the point to all of this… When I read the story in SRs, I realize that very few of you have a medical background 😉 The stories are often sketchy, poorly documented and we, TSEs, are trying to guess what the issue really is. So, I wanted to give you some tricks on how to write a good description and help us help you in resolving the case.

 

Start with a general description of the symptoms. Include the exact error message (if available) and, if possible, a screenshot.

 

Then, answer the following questions:

 

  1. What products and versions are involved? Include build numbers.
  2. What is affected? What VMs, hosts, clusters or systems? Please provide names.
  3. How severe are the symptoms? Down, partially down? Mission critical or not?
  4. Do you know what could have provoked the symptoms (changes recently made)?
  5. Did it start suddenly or gradually? Details.
  6. Provide dates and times. When did it start? When did you try the failed operation? Looking at logs, it helps tremendously to know where (or when) to look.
  7. What steps have you taken to correct the issue? Did it work (partially or in full)?
  8. Do you have a workaround? If so, what is it and is it sustainable and, if so, for how long?

These three are not always necessary:

 

  1. Provide the host server brand and model. Please include firmware versions.
  2. For storage issues, provide the storage array brand and model. Please include firmware versions.
  3. If applicable, do not hesitate to include the topology of the environment as an attached document.

If you do not know an answer, say it. That way, we will know you do not know and it will be clear.

 

Finally, add further description as you see fit. Do not hesitate to tell us about KB articles you already checked and the outcome.

 

Trick #1: Take the above questions and paste them into the case log with your answers. Do this for every case, even the ones looking more obvious as they may not be obvious to the TSE working the case.

 

Trick #2: When you create the case, you may open it with a brief description and, after that is done, send a full email with all the details. It is easier to write in full screen and you can easily add screenshots into the email.

 

Trick #3: Do this for every SR you open, even non urgent ones. Practice makes perfect 🙂

 

Some guidelines should be respected to avoid delays and confusion.

 

  • Make sure your text is clear, concise and to the point. Review it carefully and, if possible, have someone review it as well. This is time well spent as it may save hours, even days, because it is better understood by support (remember my example above on medical staff gathering a story while the patient is panicking).
  • Avoid political information. Remain technical and factual.
  • Upload all the necessary files immediately. Do not wait to be asked. You will be saving time.

— End of Quote

Wow! That's a lot of information

Points to Fred: this is the first time my support infrastructure has educated me on how to make my experience better.   I don't know how many tickets I have opened with a single one-line statement and logs (Fred knows, and it's not a good number).   I love that my support engineer took the time to help refine the process.   As I tell my co-workers all the time, just let me know the human protocol to get it done and I will follow it.   Fred is giving us the support protocol to get it done.    Also, as you answer these questions you might find that you resolve the issue yourself.   As anyone who has done design will tell you, it's the process that provides good infrastructure, not the idea.   I suggest you follow the process he provided and see if it helps your support situation.

 

What do you think?

Have you had a good experience with support?  What made it that way?

 

 

All Paths Down my new short term enemy

Edit: Thanks to comments on Twitter from Duncan Epping and others, I have corrected some errors in the original article.  This is one of the things I love about the internet: I can make a mistake and others are kind enough to help me correct it.

 

For most of my VMware career I have been blessed with very solid Fibre Channel arrays.  These arrays have rarely gone down, and when they did, a reboot of the whole environment normally solved the issue (I have really only done this once, and it was a software bug in the array).    In so many ways this single point of failure (the storage array) is still a major problem in our journey to the software-defined datacenter.    Recently, during functionality tests, we ran into the dreaded All Paths Down (APD) situation.   My experience with APD has prompted this post.   In order to understand APD you have to understand Permanent Device Loss (PDL).

 

What is PDL?

PDL is when your storage array, because it is removing a LUN or about to reboot, sends SCSI sense codes to ESXi to let it know the LUN or path is going away.   It is the same as my renter letting me know he is moving out: I have some warning and I can prepare, and I also know he is really leaving and not coming back.   PDL handling has seen a number of improvements over the years.  At this point, if an ESXi host gets a PDL and has a virtual machine on that storage, it starts an HA event.   If any other ESXi host can mount that storage it will power on the virtual machine and return it to operation.  If the storage is 100% lost due to PDL, the virtual machine will appear as disconnected and be unavailable.   PDL is not desirable: any data not committed to storage will be lost, and virtual machines may be very unhappy with the interruption and require manual recovery, but at least they try to restart.   You can resolve a PDL by rebooting or rescanning ESXi.  Once the storage is present again you can restart the virtual machines.

 

Why is APD the ugly brother to PDL?

APD is very different from PDL.  There are no SCSI codes; the storage just goes 100% away.  It is my renter moving out in the middle of the night without any warning.    I have no idea if they are coming back or what the situation could be.   I want to be very clear: All Paths Down, as the name suggests, means all paths to a storage LUN are down at the same time.   No warning, no notice, just not available.    This is a bad situation.  ESXi does not know if the LUN is going to return.  Much like with my rental apartment, I don't want to paint and re-carpet until I am sure they are gone.    This delayed response can cost me money, but I want to be on the safe side.   VMware has taken the same approach: during an APD situation VMware does nothing.  Virtual machines continue to run in memory, and each operating system acts differently.   I believe Windows continues to run with memory cache using FIFO (which means data will be lost because it cannot be written to disk); once storage has returned, Windows will write to disk like nothing was lost.   Linux, once it finds its storage unwritable, goes read-only (this can be resolved once storage is back with an OS remount or reboot).      The problem is complicated by the fact that ESXi will constantly retry I/O to these devices.  This creates load on the ESXi host (because it's scanning for storage that is not present) and can cause hostd to crash, making an ESXi host disconnect from vCenter.   In 5.1 VMware added an advanced parameter (Misc.APDTimeout, default 140 seconds) which causes the retries to stop after 140 seconds; from that point forward the host waits for the storage to announce its presence.   As you can imagine, APD is bad.  You can read more about APD and PDL in a number of VMware KB articles, but this is a really good link.
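
For reference, here is a minimal pyVmomi sketch of checking (and, if you really have a reason to, changing) the Misc.APDTimeout advanced setting on a host; the host and vCenter names are assumptions, and the default of 140 seconds is usually best left alone.

```python
# A sketch: read and (optionally) update the Misc.APDTimeout advanced setting on
# one ESXi host. Host name and credentials are assumptions; leave the 140-second
# default alone unless you have a documented reason to change it.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab only
si = SmartConnect(host="vc.example.com", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
host = si.RetrieveContent().searchIndex.FindByDnsName(dnsName="esx01.example.com",
                                                      vmSearch=False)
opt_mgr = host.configManager.advancedOption

current = opt_mgr.QueryOptions("Misc.APDTimeout")
print(current[0].key, current[0].value)

# Depending on the option's declared type you may need to cast the value (int vs long).
opt_mgr.UpdateOptions(changedValue=[vim.option.OptionValue(key="Misc.APDTimeout",
                                                           value=140)])
Disconnect(si)
```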

 

Wait, how do I even get an All Paths Down?

Well… here is the fun part: that depends on your array.   There is a good chance that if you're reading this article you have network-based storage or you are running a metro cluster.   Most other customers will not see this issue unless they run into a bug or really bad change management.    If you have Fibre Channel arrays, you must have either all your HBAs or both of your fabric switches fail at the same time to create an APD.   If you have network storage it can be caused by broadcast storms, multiple switch failures and so on, but it can only happen if your traditional networking and storage networking are separate.    If they are together on the same switch, then you would have a host isolation event instead and HA would work.

You said something about Metro right?

Correct.  vMSC (vSphere Metro Storage Cluster, or stretched cluster) is one situation where you could see an APD.   In vMSC you have two sites and a single cluster stretched between them, with storage synchronously replicated between sites.  If you lose storage in only a single site then you could have an APD and be in a world of hurt.  You have created a solution that is supposed to prevent downtime by having two sites and the ability to vMotion between them, but now you have virtual machines running and potentially losing data.   Very bad things.

My hyper-xxx solution avoids this issue 100%

It is true that some hyperconverged solutions avoid APD.  Some do this by making the storage local to the workload; others use distribution to avoid the issue.   Most of these vendors share the network for both storage and VM traffic, making an APD impossible: a failure would mean the host is isolated, and your host isolation response would solve the issue.

Why does VMware allow this state to continue?

Well, the first and best answer is that it's a rare condition.  I will throw out a C-3PO prediction and call it 1:10,000.   It's pretty rare assuming the following is true: you have redundant fabrics and you have good, documented change processes.   The best way to avoid APD is to architect it away; redundant, dedicated paths to storage are your friend.  To be 100% fair to VMware, they have made a number of enhancements over the years to reduce the impact of APD issues (for example the Misc.APDTimeout change in 5.1).

 

What about Metro?

Again it’s rare.  If you are building metro spend the money on the architecture.  In this case you will want to reboot your hosts on the failed side and allow them to HA to the other side.

 

What is the good news?

Well, I do have some good news.  Once again VMware has provided a solution.  vSphere 6.0 will have a feature called VM Component Protection (read more here) which allows you to choose what to do in a PDL or APD situation.  It includes timers and actions (like shutting down the VM and restarting it on another host if possible).    A solid future solution from VMware to a rare event.

 

vSphere 6.0 What excites me

Yesterday VMware announced the general release of vSphere 6.0.   It is not yet available for download or install but it’s exciting.  There are lots of really awesome new features and improvements.  I want to focus on what I consider to be the two most exciting features:

  • vMotion improvements
  • VVols

 

vMotion improvements:

  • Long Distance vMotion – You can now live vMotion across distances of up to 100ms RTT (up from 10ms RTT), removing distance as a factor
  • Cross vSwitch and vCenter vMotion – Move virtual machines live between different virtual switches and different vCenters, removing almost all boundaries on vMotion
  • vMotion of MSCS VMs using pRDMs – I could not find much on this; I really want to know more
  • vMotion L2 adjacency restrictions removed – vMotion no longer requires a layer 2 stretch; you can now route vMotion, something I have been trying to get an RPQ on for a while. (When combined with NSX you no longer require any spanning tree protocols; you could route everything)

 

What does all this mean?  Better designs that remove past layer 2 requirements, and the true ability to migrate active workloads into the cloud.  Make no mistake, all these changes are about flexible movement into the cloud.   Combine them with NSX and your workload can move anywhere.   These changes also have a big impact on vSphere metro clusters.  I would love to see HA domains span more than one site using multiple vCenters and site replication for failover of load.  (I expect it to come in the next version; just my personal thoughts.)

 

VVols

This is the holy grail of software-defined storage: giving the storage system awareness of the individual object or virtual machine.  This enables granular, per-object performance.   It can enable flash clone/copy-style backups, removing the current kludgy process.   I have been super excited about VVols for a while now.  I saw a demo at VMworld 2013 done by HDS that was awesome.  This object-level storage approach enables features like replication, deduplication, compression, storage tiering at the individual disk level, and real performance metrics across the complete storage system and stack.   This is really going to shake up the hyperconverged storage vendors' solutions.   Mark my words: the first vendor to adopt and correctly market VVols will be huge.

 

Here is the problem: storage vendors have to enable and support the features.  Expect some of the startup storage vendors to support it right away, while larger vendors may take a while to catch up.

 

There are a ton of improvements in the new version, including a solution to the lack of HA for vCenter (multi-processor FT).

 

Future

It is clear that VMware feels the future is in hybrid cloud and mobility, and I am sure they have lots of smart research to prove it.   Compliance and configuration management continue to be my largest problems with the public cloud.   I think solutions like NSX start to resolve the public cloud problems.    I look forward to the future with my friend VMware.  If anyone from VMware ever reads this article, please consider HA across vCenters with replication as an option; it would be just perfect (maybe combined with VMware VSAN to boost sales).