How does storage multipathing work?

Every week I spend some time answering questions on the VMware forums.  It also gives me great ideas for blog posts, just like this one.  It started with a simple question: how does multipathing work?  It came with a lot of well-thought-out, specific questions.  I tried to answer them, but figured it would be best handled with some diagrams and a blog post.  I will focus this post on Fibre Channel multipathing.  First, it's important to understand that Fibre Channel is essentially Layer 2 communication that uses frames to carry SCSI commands.  Fibre Channel switches are tuned to pass those SCSI frames as fast as possible.

Types of Arrays

There are really three types of connectivity with Fibre Channel (FC) arrays:

  • Active/Active – I/O can be sent to a LUN via any of the array's storage processors (SPs) and ports.  This is normally implemented in larger arrays with lots of cache.  Writes are sent to cache and then destaged to disk; since everything lands in cache, the SP and port do not matter.
  • Active/Passive – I/O is sent to the single SP and port that owns the LUN.  If I/O is sent down any other path, it is denied by the array.
  • Pseudo Active/Active – I/O can be sent down any SP and port, but there is an SP and port combination that owns the LUN.  Traffic sent to the owner of the LUN is much faster than traffic sent to non-owners.

The most common implementation of pseudo active/active is Asymmetric Logical Unit Access (ALUA), defined in the SCSI-3 standard.  In ALUA the SP identifies the owner of a LUN to the host with SCSI sense codes.

Access States

ALUA has a few possible access states for any SP/port combination (the sketch after this list shows how a host might rank them):

  • Active/Optimized (AO) – the SP and port that own the LUN; the best path to use for performance.
  • Active/Non-Optimized (ANO) – an SP and port that can be used to access the LUN, but more slowly than the AO path.
  • Transitioning – the LUN is changing from one state to another and is not available for I/O; rarely used in ALUA arrays today.
  • Standby – not active but available; rarely used in ALUA arrays today.
  • Unavailable – the SP and port are not available.
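
To make that ranking concrete, here is a tiny Python sketch of how a host might prefer one state over another.  It is purely an illustration of the concept; the class names and fields are made up for this post and are not VMware or vendor code.

```python
from enum import Enum

class AccessState(Enum):
    """ALUA access states, roughly in order of preference."""
    ACTIVE_OPTIMIZED = 1      # AO: owning SP/port, best performance
    ACTIVE_NON_OPTIMIZED = 2  # ANO: usable, but slower than AO
    STANDBY = 3               # available but not active
    TRANSITIONING = 4         # changing state, not usable for I/O
    UNAVAILABLE = 5           # SP/port not available

class Path:
    """A hypothetical host-side view of one SP/port combination."""
    def __init__(self, name, state):
        self.name = name
        self.state = state

def best_path(paths):
    """Pick the most preferred path that can actually carry I/O."""
    usable = [p for p in paths if p.state in
              (AccessState.ACTIVE_OPTIMIZED, AccessState.ACTIVE_NON_OPTIMIZED)]
    if not usable:
        return None
    return min(usable, key=lambda p: p.state.value)

# Example: SPa's port is AO, SPb's port is ANO -> the host prefers SPa.
paths = [Path("SPa:port1", AccessState.ACTIVE_OPTIMIZED),
         Path("SPb:port1", AccessState.ACTIVE_NON_OPTIMIZED)]
print(best_path(paths).name)  # SPa:port1
```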

In an active/active array the following states exist:

  • Active – all SPs and ports should be in this state.
  • Unavailable – the SP and port are not available.

In an active/passive array the following states exist:

  • Active – the SP and port used to access the LUN (the single owner).
  • Standby – an SP and port that become active if the current active path is gone.
  • Transitioning – the path is switching to Active or Standby.

ALUA arrays also have target port groups (TPGs), which are groups of SPs and ports that share the same state.  For example, all the ports on a single SP may form one TPG, since the LUN is owned by that SP.

How does your host know what the state is?

Great question.  The host and the array communicate state using SCSI commands.  There are lots of commands in the standard; I will show the three management commands from ALUA arrays that are the most interesting (a sketch of the exchange follows the list):

  • Inquiry – asks the device a SCSI question, for example whether it supports ALUA.
  • Report Target Port Groups – reports which TPG has the optimized path.
  • Set Target Port Groups – asks the array to switch target port group ownership.
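
Here is a rough Python sketch of that exchange.  Again, this is just a conceptual model, not the real SCSI byte-level protocol or any array's firmware; the AluaArray class and its method names are invented for illustration.

```python
class AluaArray:
    """Toy model of an ALUA array with one target port group per SP."""

    def __init__(self):
        # Which SP's TPG currently holds the Active/Optimized (AO) state, per LUN.
        self.ao_owner = {"LUN1": "SPa", "LUN2": "SPb"}

    def report_target_port_groups(self, lun):
        """Like REPORT TARGET PORT GROUPS: tell the host each TPG's state for a LUN."""
        owner = self.ao_owner[lun]
        return {sp: ("AO" if sp == owner else "ANO") for sp in ("SPa", "SPb")}

    def set_target_port_groups(self, lun, new_owner):
        """Like SET TARGET PORT GROUPS: ask the array to move AO ownership."""
        self.ao_owner[lun] = new_owner
        return self.report_target_port_groups(lun)
```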

 

This brings up some fun scenarios: who can initiate these commands, and when?  All of the following scenarios use an ALUA array.

Setup:

So we have a server with two HBAs connected to SAN switches.  In turn, the SPs are connected to the SAN switches.  SPa owns LUN1 (its AO path) and SPb owns LUN2 (its AO path).


 

Consider the following failures (the sketch after this list walks through the first one using the toy model from above):

  • HBA1 fails – assuming the pathing software in the OS is set up correctly (more on this later), the operating system accesses LUN1 via the ANO path to SPb so it can keep reaching its storage.  It then initiates a Set Target Port Groups command to SPb, asking it to take over LUN1.  The request is fulfilled, and the array reports the new target port group states to all known systems so they use SPb as the AO path for LUN1.
  • SPa fails – assuming the pathing in the OS is set up correctly, access to LUN1 via SPa fails, so the OS fails over to SPb and initiates the LUN failover.
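
Using the toy AluaArray class from the sketch above, the first failure scenario might play out something like this (still purely illustrative):

```python
array = AluaArray()
print(array.report_target_port_groups("LUN1"))  # {'SPa': 'AO', 'SPb': 'ANO'}

# HBA1 fails: the host can only reach SPb now. It keeps I/O flowing over the
# ANO path, then asks the array to move AO ownership of LUN1 to SPb.
print(array.set_target_port_groups("LUN1", "SPb"))  # {'SPa': 'ANO', 'SPb': 'AO'}
```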

This is designed just to show the interaction.  In a real environment you would want SAN switches A and B each connected to both SPa and SPb, if possible, for redundancy.

How does ESXi deal with paths?

ESXi has three possible path states:

  • Active
  • Standby
  • Dead – cable unplugged, bad connection, or failed switch

It will always try to access the LUN via any available path.

Why does path selection policy matter?

The path selection policy can make a huge difference.  For example, if you have an ALUA array you would not use the round robin path selection policy; doing so would cause at least half your I/Os to go down the ANO path, which would be slow.  ESXi supports three policies out of the box:

  • Fixed – honors the preferred (AO) path whenever it is available; most commonly used with ALUA arrays.
  • Most Recently Used (MRU) – ignores the preferred path and uses the most recently used path until it is dead (used with active/passive arrays).
  • Round Robin (RR) – sends a fixed number of I/Os or bytes down a path, then switches to the next path.  Ignores AO.  Normally used with active/active arrays.

The number of I/Os or bytes sent before switching in RR can be configured, but it defaults to 1,000 I/Os and 10,485,760 bytes (10 MB).  The sketch below illustrates the behavior.
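
Here is a minimal sketch of that round-robin behavior, assuming a switch after 1,000 I/Os or 10,485,760 bytes, whichever limit is hit first.  It is a conceptual model only, not ESXi's actual PSP code.

```python
class RoundRobinSelector:
    """Toy model of a round-robin path selection policy."""

    def __init__(self, paths, iops_limit=1000, bytes_limit=10485760):
        self.paths = paths            # list of path names
        self.current = 0              # index of the path in use
        self.iops_limit = iops_limit
        self.bytes_limit = bytes_limit
        self.ios_sent = 0
        self.bytes_sent = 0

    def select_path(self, io_size):
        """Return the path for the next I/O, rotating when a limit is reached."""
        if self.ios_sent >= self.iops_limit or self.bytes_sent >= self.bytes_limit:
            self.current = (self.current + 1) % len(self.paths)
            self.ios_sent = 0
            self.bytes_sent = 0
        self.ios_sent += 1
        self.bytes_sent += io_size
        return self.paths[self.current]

# With two paths and 4 KB I/Os, the selector stays on path A for 1,000 I/Os,
# then rotates to path B, and so on. On an ALUA array where one of those
# paths is ANO, a large share of the I/Os would take the slow path.
rr = RoundRobinSelector(["A", "B"])
used = [rr.select_path(4096) for _ in range(2000)]
print(used.count("A"), used.count("B"))  # 1000 1000
```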

Which policy should you use?  That depends on your storage array, and you should work with your vendor to understand their best practices.  In addition, a number of vendors have their own multipathing software that you should use where appropriate (for example, EMC's PowerPath).

 

VMware Virtual SAN (vSAN): Future or just for small business?

I mentioned recently that I have been playing with VMware's Virtual SAN (vSAN) beta.  VMware is really pushing the software-defined datacenter.  Since they already have software-defined compute and memory, that really only leaves storage and network.  This year at VMworld they announced products to fill both of these needs.  I will discuss network virtualization (NSX) in another post.  The key question on my mind is whether vSAN is for small businesses or whether VMware expects it to take over the enterprise.  Everything in this article is merely my opinion; I do not have any insider knowledge, so enjoy it with a grain of salt.  Let me be the first to say I am very excited about vSAN, and understanding my thoughts requires a little history lesson:

VMware Virtual Storage Appliance

A few years ago VMware saw a large hole in their market adoption: small business.  For years small businesses had been buying servers and shoving them into a closet.  It was a model they could afford, and it worked.  As VMware saw enterprise adoption fill up, they needed new customers.  Small businesses would not mind getting the advantages of virtualization; in fact, in a lot of cases it fits them even better: flexibility, vMotion, etc.  There was one critical issue in the way: central shared storage.  A small business could run a single node of ESXi and gain some of virtualization's benefits, but that did not provide VMware with much profit or software support.  So how do we get around the entry cost of storage?  The simple approach had already been proven by whitebox and home labs: NFS datastores served up by a virtual machine (in other words, fake shared storage).  The VSA was VMware's first and second attempt to address this issue.  It worked OK, but it was slow compared to traditional SANs and it suffered from a lack of scalability.

Slow Adoption by SAN providers

VMware desperately wants to provide new storage features via their software offerings.  To that end they created VASA (vSphere Storage APIs for Storage Awareness, introduced in vSphere 5.0), a set of extensions that allows storage arrays to integrate into vSphere via plug-ins.  VASA allows the storage provider to serve up capabilities around availability, provisioning, and performance.  Combined with VAAI, VMware was finally able to design software around the arrays.  But very few vendors adopted the standards, and their implementations were bare bones, just enough to certify support for these APIs to customers.  This slow and poor adoption really hurt VMware's ability to move quickly on software-defined storage solutions (like replication, some forms of FT, and stretched-cluster solutions).

How does cache fit into the discussion?

Cache has always been a part of high-end arrays.  They use RAM to store reads and writes to speed up common operations.  In my environment we have lots of arrays, for example:

  • Array 1: HP EVA 4400 with 126 1 TB 7,200 RPM drives and 1 GB of total cache per controller
  • Array 2: HP EVA 4400 with 126 300 GB 15,000 RPM drives and 1 GB of total cache per controller
  • Array 3: HP P9500 with 50 300 GB 15,000 RPM drives and 96 GB of total cache per controller
  • Array 4: Hitachi HUS VM with 600 1 TB drives and 64 GB of total cache per controller

It should be noted that these arrays fall into two different classes:

  • The EVAs are ALUA arrays that stripe all data across all disks, creating some artificial performance.
  • The P9500 is really a Hitachi VSP, meaning it is the big brother of the HUS VM, so they are essentially the same minus the cache differences.  They are true active/active arrays with 8+2 or, in my case, 14+2.  They also support front-ending other arrays.

To explain my thoughts around cache I will give you the following scenario:

  • A virtual machine running a heavy Sybase server.  It has the ability to completely trash storage with a single badly written query using temp tables, so it is a mixed load of reads and writes.
  • Operating system metrics were observed and the completion time of the query was recorded.

Here are the basic results; this is just an illustration of my thinking:

  • Query on EVA 1: operating system shows waiting on storage; completes in 45 seconds
  • Query on EVA 2: operating system shows waiting on storage; completes in 42 seconds
  • Query on the P9500 (array 3): operating system shows waiting on processor; completes in 22 seconds
  • Query on the HUS VM (array 4): operating system shows waiting on processor; completes in 23 seconds
  • Query on EVA 1 while presented through the P9500: operating system shows waiting on processor; completes in 23 seconds
  • Query on EVA 2 while presented through the P9500: operating system shows waiting on processor; completes in 23 seconds

Full disclosure: during all these tests the P9500 and HUS VM were servicing other production VMware and non-VMware loads.  The takeaway: cache is everything until it is full.  Disk speed does not equal performance; cache does.  So we all need a lot more cache.

What is VMware vSAN?

Well, now we are down to the nuts and bolts.  vSAN combines local disks and SSDs on servers into a single distributed (think RAIN) VMFS datastore over a vSAN network.  It can scale very large and includes a ton of features, including:

  • Striping across disks for performance at the VMDK level
  • A truly VMFS-aware storage system
  • The ability to set the number of copies you want of your VM (0 copies, 1 copy or mirror, or additional copies)
  • Storage built into the kernel to create the shortest path for I/O
  • Thin provisioning for speed of copying and deployment
  • Use of SSD as a 30% write / 70% read cache
  • Adjustment of striping or replication copies on the fly

In short, vSAN uses SSDs (remember that server SSDs come in 400 GB / 800 GB / 1.2 TB models) for cache, then uses slower back-end disk for sequential work and destaging.  This allows us to run a VMware cluster without any traditional shared storage.  A quick sketch of the mirror-copy math is below.
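
As a back-of-the-napkin illustration of the mirroring feature above, here is a small sketch based on the commonly documented rule of thumb that tolerating n failures means n+1 copies of the data and at least 2n+1 hosts.  Treat the exact math as my assumption rather than confirmed beta behavior.

```python
def vsan_mirror_math(failures_to_tolerate):
    """Rough sizing for a vSAN mirroring policy (illustrative rule of thumb)."""
    copies = failures_to_tolerate + 1          # data replicas of each object
    min_hosts = 2 * failures_to_tolerate + 1   # replicas plus a witness majority
    return copies, min_hosts

for ftt in range(0, 3):
    copies, hosts = vsan_mirror_math(ftt)
    print(f"tolerate {ftt} failure(s): {copies} copies, at least {hosts} hosts")
# tolerate 0 failure(s): 1 copies, at least 1 hosts
# tolerate 1 failure(s): 2 copies, at least 3 hosts
# tolerate 2 failure(s): 3 copies, at least 5 hosts
```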

Future or just for small business?

Everyone I ask answers with "it depends," which is really true.  But if you want my two cents, it's the future.  Companies like Nutanix have been in a similar market for a few years and done very well.  VMware's vSAN is only in beta and needs to grow up a little, and we will see new features, but I consider it to be the future of virtualization and the datacenter.  I think it will start with the price-sensitive EUC (VDI) market as an inexpensive entrance, but as it proves its ability (just think about it: 800 GB of cache per server instead of 256 GB shared on the array) it will gain adoption.

Understand thy place, vSAN

Do not get me wrong: I am not suggesting that traditional SANs are dead.  I have been working with SANs for over ten years and they have their place.  They will continue to exist, but VMware would like to have full control over the stack and additional products to sell.  In trade, they promise increased flexibility, more features, and faster deployment, all good things.  SAN makers will counter with smaller arrays and new features, and the battle will begin.

 

Schedule Storage DRS I/O moves

It's 10 AM Monday morning.  Everything is going great.  Then you start to get lots of tickets: everyone is complaining about slow response from your virtual servers.  Time to check vCenter... yep, Storage vMotion just kicked off a rebalance at the worst possible time.  How do you avoid this?  Until 5.1 you could not.  In 5.1 they added the ability to schedule it inside the web client (yep, web client only; if you do it in the C# client it will not stick).  In my case I want moves based on a full LUN to happen anytime, and I/O-based moves to happen only during evening hours.

 

  • Log in to the web client
  • Go to vCenter -> Datastore Clusters -> select your cluster -> select Manage at the top
  • Under All Actions, choose Add to create a new schedule
  • Set up a schedule to disable I/O load balancing at 6:00 AM
  • Give it a name and schedule a time
  • Then create a second schedule to enable I/O load balancing at 6:00 PM
Enjoy the freedom from I/O moves during the morning.

VMware vSAN

For those of you who have not heard of vSAN, it's VMware's new software-defined storage solution.  It uses local disks on servers to form distributed VMFS datastores, and it provides performance via SSDs used as read and write cache.  It's exciting stuff.  VMware has posted a great slide deck on how to configure and install it here.

Also, Cormac Hogan has posted a whole series on vSAN that rocks, here.

I have been playing around with vSAN for about a week and I am really excited about it.  I'll post some of my thoughts and discoveries soon.