VMware Network Failover Policies

I have been reading up on these policies, and I find it odd that almost everything I read talks about the load balancing policies but not much about the failover policies, so here we go.

Network Failover Detection

This is the method used to detect when a failure occurs on the physical network. There are two options:

  • Link Status only: This uses the link state of the physical NIC to its next-hop switch port to test for an outage. The problem is that this does not detect misconfigurations or upstream outages on switches; it is essentially a port-down condition.
  • Beacon Probing: This is a process of sending out a packet on all NICs in a team, destined for the other NICs in the team. This way a NIC can detect upstream issues if it cannot talk to its team members. It also uses link status. This method has a problem if you don’t have at least three members, which can create isolation of all NICs; it’s the standard quorum issue in clusters. Assume you have two NICs and one of them is misconfigured in an upstream switch. They both beacon probe, and when they are unable to reach each other they both determine they have failed and all NICs go down. (An example of setting the detection mode from the command line follows below.)
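The detection mode is set per vSwitch (or overridden per port group). A minimal sketch of checking and changing it from the ESXi shell, assuming a standard vSwitch named vSwitch0 (an example name; verify the esxcli options against your ESXi version):

# Show the current failover policy, including the failure detection method
esxcli network vswitch standard policy failover get -v vSwitch0

# Switch failure detection from link status to beacon probing
esxcli network vswitch standard policy failover set -v vSwitch0 --failure-detection beacon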

Notify Switches

This setting determines whether, when a failover occurs, the virtual switch should notify the physical switches so their MAC/CAM tables get updated right away. In almost all situations you want the default of Yes, but you can choose No and wait about 5 minutes for the ARP/CAM tables to expire before your VMs’ traffic flows again. (This is not an issue with HA, because operating systems send out a gratuitous ARP when booting.)
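The same esxcli failover policy covers this setting; a quick sketch, again assuming an example standard vSwitch named vSwitch0:

# Keep notify switches at its default of enabled
esxcli network vswitch standard policy failover set -v vSwitch0 --notify-switches true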

Failback

This setting determines how a NIC is returned to active use after a failure.

  • Yes – Return the adapter to active use right away when available
  • No – Adapter is set to inactive until an administrator changes it manually

This requires some discussion. Assume you have an adapter that is failing due to hardware. When it fails, everything moves to your second adapter, and you don’t yet know whether it is a hardware or software issue. When it comes back (say, after a reboot), if Failback is set to Yes it is back servicing your loads until it fails again; if set to No, it stays out until you allow it back into the mix. An even worse scenario is a NIC that is flapping up and down, or a switch going up and down. Both of these scenarios create interruptions you may not want. Then again, changing it to No means you have all your traffic on a single link, perhaps causing contention.
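If No turns out to be the right answer for your environment, the setting can also be flipped from the ESXi shell; a rough sketch with the same example vSwitch name:

# Disable failback so a recovered NIC stays out until you bring it back deliberately
esxcli network vswitch standard policy failover set -v vSwitch0 --failback false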

VMware Virtual SAN (vSAN): Future or just for small business?

I mentioned recently that I have been playing with VMware’s Virtual SAN (vSAN) in beta.  VMware is really pushing the software defined datacenter.  Since they already have software defined compute and memory, that really only leaves storage and network.  This year at VMworld they announced products to fill both of these needs.  I will discuss network virtualization (NSX) in another post.  The key question on my mind is whether vSAN is for small businesses or whether they expect it to take over the enterprise.  Everything in this article is merely my opinion; I do not have any insider knowledge, so enjoy it with a little salt.  Let me be the first to say I am very excited about vSAN, and in order to understand my thoughts it requires a little history lesson:

VMware Virtual Storage Appliance

A few years ago VMware saw a large hole in their market adoption: small business.  For years small businesses had been buying servers and shoving them into a closet.  This was a model they could afford, and it worked.  As VMware saw enterprise adoption fill up, they needed new customers.  Small business would not mind getting the advantages of virtualization; in fact, in a lot of cases it would fit them even better: flexibility, vMotion, etc.  There was one critical issue in the way: central shared storage.  A small business could run a single node of ESXi and gain some of the benefits of virtualization, but this did not provide VMware with any profit or software support.  So how do we get around the entrance cost for storage?  The simple approach had already been proven by white-box and home labs: NFS datastores provided by a virtual machine (in other words, fake shared storage).  The VSA was VMware’s first and second attempt to address this issue.  It worked OK but was slow compared to traditional SANs, and it also suffered from a lack of scalability.

Slow Adoption by SAN providers

VMware desperately wants to provide new storage features via their software offerings.  As such they created VASA (vSphere Storage APIs for Storage Awareness, introduced in vSphere 5.0), a set of extensions which allowed storage arrays to integrate into vSphere via plug-ins.  VASA allows the storage provider to serve up capabilities around availability, provisioning and performance.  When combined with VAAI, VMware was finally able to design software around the arrays.  But very few vendors adopted the standards, and their implementations were bare bones, done just to certify to customers that they supported these APIs/standards.  This slow and poor implementation really hurt VMware’s ability to move quickly on software defined storage solutions (like replication, some forms of FT, and stretched cluster solutions).

How does cache fit into the discussion

Cache has always been a part of high end arrays.  They use RAM to store reads and writes to speed up common operations.  In my environment we have lots of arrays, for example:

  • Array 1: HP EVA 4400 with 126 1 TB 7,200 RPM drives and 1 GB of total cache per controller
  • Array 2: HP EVA 4400 with 126 300 GB 15,000 RPM drives and 1 GB of total cache per controller
  • Array 3: HP P9500 with 50 300 GB 15,000 RPM drives and 96 GB of total cache per controller
  • Array 4: Hitachi HUS VM with 600 1 TB drives and 64 GB of total cache per controller

It should be noted that these arrays fall into two different categories:

  • The EVAs are ALUA arrays that stripe all data across all disks, creating some artificial performance.
  • The P9500 is really a Hitachi VSP, meaning it is the big brother of a HUS VM, so they are essentially the same minus the cache differences.  They are true active/active arrays with 8+2 (or in my case 14+2) RAID groups.  They also support front-ending other arrays.

In order to lay out my thoughts around cache, I will give you the following scenario:

  • A virtual machine running a heavy Sybase server.  It has the ability to completely trash storage with a single badly written query using temp tables, so it is a mixed load of reads and writes.
  • Operating system metrics were observed and the completion time of the query was recorded.

So here are the basic results (this is just an illustration of my thoughts):

  • Query on EVA 1: operating system shows waiting on storage, takes 45 seconds
  • Query on EVA 2: operating system shows waiting on storage, takes 42 seconds
  • Query on P9500 (array 3): operating system shows waiting on processor, takes 22 seconds
  • Query on HUS VM (array 4): operating system shows waiting on processor, takes 23 seconds
  • Query on EVA 1 while presented through the P9500: operating system shows waiting on processor, takes 23 seconds
  • Query on EVA 2 while presented through the P9500: operating system shows waiting on processor, takes 23 seconds

Full disclosure: during all these tests the P9500 and HUS VM were servicing other production VMware and non-VMware loads.  So the takeaway is that cache is everything, until it is full.  Disk speed does not equal performance; cache does.  So we all need a lot more cache.

What is VMware vSAN

Well, now we are down to the nuts and bolts.  vSAN combines local disks and SSDs on servers into a single distributed (think RAIN) VMFS datastore over a VSAN network.  It can scale very large and includes a ton of features, including:

  • Striping across disks for performance at the VMDK level
  • A truly VMFS-aware storage system
  • The ability to set the number of copies you want of your VM (0 copies, 1 copy or mirror, or additional copies)
  • Storage built into the kernel to create the shortest path for I/O
  • Thin provisioning for speed of copy and deployment
  • Use of SSD as a 30% write / 70% read cache
  • Adjustment of striping or replication copies on the fly

In short, vSAN uses SSD (remember that server SSDs come in 400 GB / 800 GB / 1.2 TB models) for cache, then uses slower back-end disk for sequential work and offload.  This allows us to run a VMware cluster without any shared storage.
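In the beta you do almost everything through the web client, but you can also look at a host’s view of vSAN from the ESXi shell. A rough sketch from memory (the esxcli vsan namespace is beta-era and may change before release, so treat these purely as illustrations):

# Show whether this host has joined a VSAN cluster and what its role is
esxcli vsan cluster get

# List the local SSDs and magnetic disks this host has contributed to VSAN
esxcli vsan storage list

# Show which VMkernel interface is tagged for VSAN traffic
esxcli vsan network list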

Future or just for small business?

Everyone I ask answers with “it depends,” which is really true.  But if you want my two cents, it’s the future.  Companies like Nutanix have been in a similar market for a few years and done very well.  VMware’s vSAN is only in beta and needs to grow up a little… and we will see new features, but I consider it to be the future of virtualization and the datacenter.  I think it will start in the poorly priced EUC (VDI) market as an inexpensive entrance point… but as it proves its ability (just think about it: 800 GB of cache per server, not 256 GB shared on the array) it will get adoption.

Understand thy place, vSAN

Do not get me wrong, I am not suggesting that traditional SANs are dead.  I have been working with SANs for over ten years, and they have their place.  They will continue to exist, but VMware would like to have full control over the stack and additional products to sell.  In trade they promise increased flexibility, features and speed of deployment… all good things.  SAN makers will counter with smaller arrays and new features, and the battle will begin.

 

Schedule Storage DRS I/O moves

It’s 10 AM Monday morning.  Everything is going great.  Then you start to get lots of tickets… everyone is complaining about slow response from your virtual servers.  Time to check vCenter… yep, Storage vMotion just kicked off a rebalance at the worst possible time.  How do you avoid this?  Until 5.1 you could not.  In 5.1 they added, inside the web client (yep, web client only – if you do it in the C# client it will not stick), the ability to schedule it.  In my case I want moves based upon a full LUN to happen anytime and I/O-based moves to happen only during evening hours.

 

  • Log in to the web client
  • vCenter -> Datastore Clusters -> select your cluster -> select Manage at the top
  • Go to All Actions -> Add and add a new schedule.
  • Set up the schedule to Disable at 6:00 AM
  • Give it a name and schedule a time
  • Then create another schedule to Enable at 6:00 PM

Enjoy the freedom from I/O moves during the morning.
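If you prefer scripting, a coarser alternative is to flip the whole datastore cluster’s Storage DRS automation level from a pair of Windows scheduled tasks using PowerCLI.  Note this is blunter than the web client schedule above because it affects space-based moves as well as I/O moves.  A minimal sketch, assuming a cluster named ProdDatastoreCluster (an example name):

# Connect to vCenter (prompts for credentials if none are cached)
Connect-VIServer -Server vcenter.example.com

# Morning task: drop Storage DRS to manual so no moves kick off during the day
Get-DatastoreCluster -Name "ProdDatastoreCluster" | Set-DatastoreCluster -SdrsAutomationLevel Manual -Confirm:$false

# Evening task: put it back to fully automated
Get-DatastoreCluster -Name "ProdDatastoreCluster" | Set-DatastoreCluster -SdrsAutomationLevel FullyAutomated -Confirm:$false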

vCloud Director virtual machine unable to make TCP connections to the internet

Weird one here.  I was working inside a vCloud org and I had a new virtual machine that was unable to browse the internet.  DNS would work… traceroute would work… at first I figured it was a configuration issue as mentioned in the KB here.  But it was not; the same issue existed as long as the virtual machine was behind the vShield Edge appliance.  If I took it in front (exposed to the internet) everything was fine.  I ended up calling VMware support and it turned out to be IPv6 on the Windows virtual machine.  IPv6 is not supported on vShield Edge, as documented here.  Disabling IPv6 in Windows is a registry modification and a reboot, so it’s a pain… make sure you turn it off or it will be a fun fail.
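For reference, the commonly documented way to disable IPv6 in Windows is the DisabledComponents registry value; something like this from an elevated command prompt, followed by a reboot (verify the value against Microsoft’s own guidance for your OS before using it):

rem Disable IPv6 components (0xFF leaves only the IPv6 loopback)
reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters /v DisabledComponents /t REG_DWORD /d 0xFF /f

rem Reboot for the change to take effect
shutdown /r /t 0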

 

Reset Windows 2008 R2 Admin password

So I did it… I cloned a domain-joined virtual machine… I did not sysprep it and I left the domain… so when I went back to the original machine, no domain logins would work.  Even worse, I lost the local admin password.  So how do you reset it?  In Linux it’s always a reboot away… in Windows it’s a reboot and a few steps:

  • Boot from the Microsoft Windows Server 2008 DVD
  • From the Install Windows menu, click “Next”.
  • Select “Repair your computer”
  • In the System Recovery Options, select the Operating System instance that you wish to repair and click “Next”.
  • Select “Command Prompt”
  • At the command prompt, run the following commands:
    c:
    cd windows\system32
    ren Utilman.exe Utilman.exe.old
    copy cmd.exe Utilman.exe
  • Reboot the server allowing Windows to load as normal
  • At the logon screen, press Windows Key + U.
  • At the command prompt, enter the following command:
    net user administrator New_PASSWORD
  • Log into the server with New_PASSWORD
  • Reboot into the repair command prompt

c:
cd windows\system32
del utilman.exe
copy Utilman.exe.old utilman.exe

Reboot and enjoy

VMware vShield setup hostname

I never thought to set up the hostname on a vShield Manager, but it’s a pretty critical step.  It can be done via the console with the following commands:

  • login as admin
  • enable
  • configure terminal
  • hostname your_fqdn

VMware “Lost access to volume” on Brocade SAN switches

Earlier today I ran across the following errors in my ESXi logs:

Nov 15 08:03:10 esx02 Hostd: [6BA33B90 info 'Vimsvc.ha-eventmgr'] Event 305408 : Lost access to volume 51f7bd04-ace1a4c9-0114-0017a4770000 (LUN23) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

It would always recover right away, within 1 second, and since I am on an active/active array it was not really an interruption, but it did bother me.  So I did some research and came up with fill words.

The process, as I understand it, is that switches use these fill words during link initialization.  Up until firmware 6.1 this was done by sending IDLE/IDLE messages, and both sides would respond with IDLE; this was specific to devices running at 1G/2G/4G speeds.  Brocade then added new fill word modes as shown below:

MODE          MEANING
Mode 0        Use IDLE in link init and IDLE as fill word
Mode 1        Use ARB in link init and ARB as fill word
Mode 2        Use IDLE in link init and ARB as fill word
Mode 3        Try mode 1 first; if it fails then try mode 2

You should set this mode per your storage and server vendors’ best practices.  I know that for my HP blades using 8 Gb Virtual Connect it is critical to use Mode 3, which I was not using.  You can check your current config by logging into the switch and using this command:

portcfgshow port_number

And you can switch it via:

portcfgfillword <Port#> <Mode>
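For example, to check port 10 and then move it to mode 3, it would look roughly like this (port 10 is just an example, and the exact syntax has varied a bit between Fabric OS releases, so check the command reference for your firmware):

portcfgshow 10
portcfgfillword 10 3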

You can read a much better-written article on the matter here.