Storage in Virtualization: Is It a Real Problem?

As a VMUG leader and a double VCDX I have seen one technology trend only increase over the years: the number of storage vendors!   Last year at our VMUG UserCon every sponsor looking for a presentation slot was a storage vendor.   We had to choose between storage vendors and other storage vendors; I would have killed for another type of vendor.  In past years we had presentations from backup vendors, management tools, monitoring tools and IT service companies.   Now it's all storage companies.   As a double VCDX I get contacted by start-up companies looking to sell their products to VMware customers.  Some are well-known companies, others are still in stealth, but they all have the same request: how do we get VMware guys to buy our awesome technology?   Almost all of these companies are using Supermicro white-box hardware with some secret sauce.  The sauce is what makes them different: some are web-scale, others are all-flash or promise awesome dedupe ratios.   All are attempting to address some segment of storage problems.  It really begs the question: is there a storage problem?

 

What does storage provide?

Storage essentially provides two things that virtualization professionals care about:

  • Capacity (Space to store information)
  • Performance (divided into IOPS and latency)
    • IOPS – input/output operations per second; the number of commands you can shovel into the system
    • Latency – how long it takes to shovel each I/O end to end

Each vendor layers software on top to improve these metrics, for example dedupe for capacity or hot blocking for performance.   Essentially, providing these two things is the role of a storage system.

How has virtualization made it worse?

Virtualization has made management of these metrics a challenge.   In traditional storage a single entity controls a LUN or mount.  It runs an application that has certain predictable usage patterns for the LUN.   For example, a web server does a lot of reads and a few writes.   We can identify and classify this usage pattern and thus "right size" the LUN to meet these needs, in terms of both capacity and performance.   Virtualization created a new pattern: lots of guest servers with different applications sharing the same LUN.   This makes the usage metrics pretty wild.   The storage system has no idea what the virtual machines are doing beyond a bulk understanding of reads and writes.   This seems like a problem, but in reality the storage system just sees reads and writes and does not care, unless capacity or performance for that LUN is exhausted.    This issue might drive the acquisition of more performant storage in order to meet the needs of our new "super LUNs," but in most cases it just takes advantage of unused capacity on a storage array.

What does desktop virtualization have to do with storage?

Desktop virtualization taught us a very important lesson about storage.  Operating systems are 90% idle except during boot, when they do a lot of IOPS: lots of reads and some writes, all putting pressure on disk.   Desktop virtualization introduced a new pattern of pressure on disk.  At eight or nine AM everyone would boot up their virtualized desktop (spawning new desktops and booting the OSes), putting massive pressure on storage.   This caused storage systems to fail, and if the array was shared with traditional server virtualization, everything failed.   Traditional storage vendors' solution to this problem was to buy a bigger array with more cache and capacity.  This created stranded capacity and was a huge CapEx expenditure when desktop virtualization was "supposed" to save us money.

Role of Cache

The rise of SSD has provided a dramatic improvement in the amount of cache available in arrays.   Cache provides ultra-fast disk for initial writes and common reads, reducing latency and improving IOPS.   I remember the days when 1GB of cache was awesome; these days arrays can have 800GB of cache or more.   Cache allows you to buy larger and slower capacity disks while delivering better performance to the virtualized application.    Cache is a critical component in today's storage solutions.

How to solve desktop virtualization

Vendors saw a gap in technology around desktop virtualization that was not being filled by traditional array vendors.    This gap can be defined as:

  • The array was not meeting my performance needs without buying more arrays
  • I need to separate my IOPS for desktop virtualization away from servers

 

This gave rise to two solutions:

  • Hyper-converged infrastructure
  • All Flash arrays

 

 

Hyper-converged

Hyper-converged infrastructure has many different definitions depending on who you ask.  For the purpose of this article it's a combination of x86 hardware with local hard drives.  This combination provides the compute and a software-based clustered storage solution for virtualization.    The local hard drives on each compute node contribute to the required cluster file system.  This model has long been used by large service providers like Google and Amazon.  These are normally presented to ESXi over NFS.  The market leader at this time is Nutanix, who really cut their teeth solving desktop virtualization problems.  They have since moved successfully into traditional server virtualization.   Their success has encouraged other vendors to enter the market, including SimpliVity (OmniCube) and VMware (Virtual SAN).   Each vendor has some mix of the secret sauce to address a perceived problem.   It's beyond the scope of this article to compare these solutions, but they all take advantage of at least one SSD drive per compute node as cache.   This local cache can be very large compared to traditional arrays, with some solutions using 1TB or more of local cache.   Each compute node serves as a storage controller, allowing a scale-out approach to capacity and performance.  Hyper-converged solutions have seen huge growth in the market and do effectively resolve the desktop problem, depending on scale.  They do introduce a new problem: balanced scalability.   Simply put, I may need additional storage without needing more controllers or compute capacity, but in order to get more storage I have to buy more nodes.   This balanced-scale issue is addressed by vendors providing different mixes of storage and compute nodes.

 

All Flash Arrays

With the rise of SSD the cost keeps getting lower, so traditional array vendors started producing all-flash arrays.   Flash provides insane amounts of IOPS per disk, but lower capacity.  Each month the capacity increases and the cost of SSD drops, making the all-flash array (AFA) a very real, cost-effective solution.   Years ago I was asked to demo a newly emerging flash solution called RamSAN.  The initial implementation was 150,000 IOPS in a single 2U unit.   I was tasked with testing its limits.  I wanted to avoid artificial testing so I threw a lot of VMware database workloads at the array (all test workloads, of course).   I quickly found out that the solution might be able to do 150,000 IOPS, but my HBAs (2 per host) did not have enough queue depth to deliver 150,000 IOPS.   All-flash arrays introduced some new problems:

  • Performance bottleneck moved from the disk to the controller on the array
  • Capacity was costly
  • New bottlenecks like queue depth could be an issue

I remember buying 40TB of SSD in a more recent array.  The SSD drives combined were capable of 300K IOPS, while the controllers could not push more than 120K IOPS; a single controller could do 60K IOPS.   Quickly the controller became my problem, one that I could not overcome without buying a new array with additional controllers.    Traditional array vendors struggled with this setup, bound by their controller architecture.  A number of startup vendors entered the market with scale-out controllers.  All-flash solutions can potentially solve the desktop problem, but at a steep cost.
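As a rough sketch (assuming a reasonably current ESXi host; the numbers are purely illustrative), you can compare what the array promises against what your host queues can actually deliver:

# Per-device queue depth shows up in the device details (look for the max queue depth field)
esxcli storage core device list

# Back-of-the-envelope ceiling per LUN: outstanding I/Os divided by service time.
# A queue depth of 32 at 0.5 ms per I/O is roughly 32 / 0.0005 = 64,000 IOPS,
# so two HBAs each driving one 32-deep LUN queue top out near 128,000 IOPS,
# which is how a 150,000 IOPS array can sit half idle behind shallow queues.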

 

Problem with both solutions

Both solutions suffer from the same problems:

  • Stranded capacity in IOPS or storage capacity (more of either than you need)
  • Storage controllers cannot meet performance needs

All of these issues happen because of a lack of understanding of the true application metrics.   vCenter understands the application metrics; the array understands reads and writes at a LUN level.   This lack of understanding of each virtual machine as an independent element does not allow the administrator to increase the priority or preference of individual machines.  Hyper-converged solutions have two additional challenges:

  • Increased network bandwidth for data replication (an issue Fibre Channel arrays avoid, though NAS arrays share it)
  • Blades rarely have enough space for multiple hard drives

The value proposition for hyper-converged is that you can replace your costly array with compute nodes full of hard drives.  This is a real cost savings, but only if you are due for a technology refresh on both compute and storage, and both budgets are aligned and agreed to be spent on hyper-converged.  Getting the storage team to give up funds for hard drives in compute can be a hard proposition.

 

How to understand the smallest atomic unit

Lots of vendors understand this problem and have different ways of approaching it, including:

  • VVols
  • Local compute cache
  • NFS

Essentially, to understand the smallest unit you have to understand the individual files and how they are connected.   The VMFS file system holds all this information; block-based arrays only understand block-based reads and writes.   Individual files are invisible to a block-based array.

 

VVols

Developed by VMware, VVols provide a translation method between block-based storage systems and the hypervisor using protocol endpoints.  These protocol endpoints run on the storage controllers, or in line with the controllers, to allow the array to understand the file system and individual files.   This translation allows the array to act upon a single virtual machine on a LUN instead of operating on the whole LUN.   We can apply performance policies, snapshots and all array operations to individual virtual machines.   This is a great solution but has two problems:

  • The protocol endpoints, much like controllers, have scalability issues if not implemented correctly
  • Vendor adoption has been very slow

 

Local compute cache

This process adds SSD or RAM and creates a cache for virtual machine reads and writes.  This cache can be assigned to individual machines or shared across the whole compute node.  This method has an understanding of individual machines and accelerates reads and writes.   In order to cache writes it's critical that the writes be redundant, so normally a write has to be committed to the cache of at least two different compute nodes before being acknowledged to the operating system.  This ensures that the data is protected during a single compute node failure.   The current leader providing read and write cache solutions like this is PernixData.  This process ensures local performance enhancement at the lowest atomic level, but it does share some common challenges with hyper-converged, including:

  • Every compute node must have local SSD for the solution to accelerate I/O
  • Network bandwidth is used for replication (meaning you need more 10Gb or you have to share it)

NFS

NFS has been around for years.  It's a method for sharing a file system to Linux and Unix hosts.   VMware supports it natively, and it's the only supported datastore type (other than VMware VSAN) that does not run VMFS.  VMs on NFS are files on the NFS file system.  This gives the storage array or server full understanding of the individual files.   This exposure can be a huge advantage when looking at backup products and site-to-site replication.    Until NFS 4.1 support arrived (vSphere 6) there were a number of drawbacks to NFS, including the lack of multipathing.  Those have been removed, and NFS provides the full object-based storage solution that VVols promise.   Scalability can be a problem, with maximums on the number of virtual machines and objects on a single volume, or with capacity limits around controllers.   NFS-based solutions are network based and thus create network load.  In addition, NFS does not natively provide any per-file performance enhancement; it just deals with I/O in and out.   Lots of vendors have implemented solutions to enhance NFS.
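For reference, here is a minimal sketch of attaching an NFS export from the ESXi command line (the host name, export path and volume name are made up for the example):

# Mount an NFS v3 export as a datastore
esxcli storage nfs add --host=nas01.lab.local --share=/vol/vmware_ds1 --volume-name=nfs_ds1

# List currently mounted NFS datastores
esxcli storage nfs list

# vSphere 6 adds a separate nfs41 namespace (esxcli storage nfs41 ...) for NFS 4.1 with multipathing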

What is best and does it solve the issue?

I started this post with the question: is there a problem with storage?  Well, lots of vendors seem to think so and want to sell us stuff to solve it.   From my experience, I suggest we have a few issues:

  • Backup is a major mess; in vSphere it's hard to manage and keep working without constant care and feeding
  • Storage arrays don't have any understanding of the lowest atomic unit and thus cannot protect us from bad neighbors on the same LUN; this becomes more of an issue in large hosting environments
  • Performance (IOPS) is rarely the issue, except in specific use cases or small business, thanks to oversized arrays
  • Queue depth is rarely the problem, except in specific use cases
  • Capacity seems to be the buzz problem, and the price per GB just keeps getting lower every year

Backup

I believe we need to get to object-based storage so we can solve the backup problem.  Doing VDP backups or LUN snapshots does not allow management at the lowest atomic unit.  The current model causes crashes and outages and struggles to work well.  It's not a product issue; it's an implementation and technology issue that needs a dramatic change to resolve.

Local knowledge at the lowest level

The object I manage is a virtual machine.  My storage array friend manages a LUN with multiple virtual machines (sometimes hundreds – yes, I am looking at you, NFS).  Until we manage at the same atomic level we will have problems aligning policies and performance.   I think policy-based enforcement with shares is a great way to go, something like SIOC that is enforced by the array.    Hot blocking, all-flash, etc. are all attempts to get around the essential communication gap between hosts and arrays.   Future storage cannot be bound by two storage controllers; it needs to scale to meet needs.   The hyper-converged folks have a big advantage on this problem.    The future of storage is not block, except in mixed enterprise environments (I am looking at you, mainframe).   You need to get comfortable with network-based storage and architect for it.   Buy switches and interfaces on your compute just for storage traffic; don't mix it.  Architect a super highway to your storage that is separate from your normal network traffic.

Performance

If performance is your issue, then solve it locally; don't buy another array.  Local cache will save you a lot.   Scale-up arrays or hyper-converged are both options, but local SSD will be a lot cheaper than a rip and replace, and it's easier on management cost.

What should I choose?

It depends on your needs.   If I were presented with a green field that is going to run all virtualized workloads today, I would seriously consider hyper-converged.  Storage arrays are more mature but move a lot slower on updates.  I would move toward a software-defined solution instead of one baked into hardware.   I think that central understanding of the lowest atomic unit is critical going forward.   If you have a mixed storage environment or an investment in Fibre Channel, large arrays with cache make sense.   If you are looking to solve VDI issues, I would consider hyper-converged or lots of cache.   The future is going to hold some interesting times.  I need storage to provide the following:

  • No controller lock-in; it needs to scale to meet my needs
  • It needs to understand each virtual machine's individual identity
  • It should include backup and restore capabilities at the VM level
  • It has to include data at rest encryption (yes I didn’t mention this but it’s huge)
  • Policy based performance (allocate shares, limits and reservations)
  • Include methods to move the data between multiple providers (move in and out of cloud)

Does it sound like a unicorn… yep it is… Someone go invent it and sell it to me.

 

Brocade Zoning via Scripting for FOS 7

About four years ago I wrote about how to do Fibre Channel zoning on Brocade switches using scripts.   The CLI on Brocade is really feature-rich but not super documented… most people use the GUI.     Well, the times have changed and so have the commands, so here is the super-duper updated command set for FOS 7.   You can read the old post here.

Assume that we are making a single zone with an HBA and a storage system:

Storage_SPA  50:01:43:81:02:45:DE:47
Server_HBA  50:01:23:45:FE:34:52:12

Steps at a glance:

  1. Use alicreate to create aliases
  2. Use zonecreate to create zones
  3. Use zoneadd to add an additional alias
  4. Use cfgadd to add the new zone to the active set
  5. Use cfgsave to save the active set
  6. Use cfgenable to enable the set

 

Step 1: alicreate “NAME”, “WWN”

alicreate “Storage_SPA”, “50:01:43:81:02:45:DE:47”

alicreate “Server_HBA”, “50:01:23:45:FE:34:52:12”

 

Step 2: zonecreate “NAME”, “First_Alias”

zonecreate “Server_To_Storage_SPA”, “Storage_SPA”

 

Step 3: zoneadd “NAME”, “Second_Alias”

zoneadd “Server_To_Storage_SPA”, “Server_HBA”

(use again to add multiples)

 

Step 4: cfgadd “Your_Config_Name”, “Zone_Name”

cfgadd “production_cfg”, “Server_To_Storage_SPA”

 

Step 5: cfgsave

cfgsave

 

Step 6: cfgenable Your_Config_Name

cfgenable production_cfg

 

You can also check your work with

zoneshow “Your_Zone_Name”
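Putting it all together, the whole job for the example above can be pasted as one block (the config name production_cfg is an assumption; substitute your switch's active configuration):

alicreate "Storage_SPA", "50:01:43:81:02:45:DE:47"
alicreate "Server_HBA", "50:01:23:45:FE:34:52:12"
zonecreate "Server_To_Storage_SPA", "Storage_SPA"
zoneadd "Server_To_Storage_SPA", "Server_HBA"
cfgadd "production_cfg", "Server_To_Storage_SPA"
cfgsave
cfgenable production_cfg
zoneshow "Server_To_Storage_SPA"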

 

Thanks for reading

Update: @MartinMune2341 provided the link to the latest CLI reference guide here.  Thank you, sir.

How does storage multipathing work?

Every week I spend some time answering questions on the VMware forums.  It also provides great ideas for blog posts, just like this one.   It started with a simple question: how does multipathing work?   Along with a lot of well thought out, specific follow-up questions.   I tried to answer the questions but figured it would be best with some diagrams and a blog post.    I will focus this post on Fibre Channel multipathing.  First it's important to understand that Fibre Channel is nothing more than L2 communication using frames to push SCSI commands.   Fibre Channel switches are tuned to pass SCSI frames as fast as possible.

Types of Arrays

There are really three types of connectivity with Fibre Channel (FC) arrays:

  • Active/Active – I/O can be sent to a LUN via any of the array's storage processors (SPs) and ports.  Normally this is implemented in larger arrays with lots of cache.  Writes are sent to the cache then destaged to disk.   Since everything lands in cache, the SP and port do not matter.
  • Active/Passive – I/O is sent down to the single SP and port that owns the LUN.  If I/O is sent down any other path it is denied by the array.
  • Pseudo Active/Active – I/O can be sent down any SP and port, but there is an SP and port combination that owns the LUN.  Traffic sent to the owner of the LUN is much faster than traffic sent to non-owners.

The most common implementation of pseudo active/active is asymmetric logical unit access (ALUA), defined in the SCSI-3 protocol.  In ALUA the array identifies the owner of a LUN with SCSI sense codes.

Access States

ALUA has a few possible access states for any SP and port combination:

  • Active/Optimized (AO) – the SP and port that own the LUN; the best possible path to use for performance
  • Active/Non-Optimized (ANO) – an SP and port that can be used to access the LUN, but slower than the AO path
  • Transitioning – the LUN is changing from one state to another and is not available for I/O; not used by most ALUA arrays now
  • Standby – not active but available; not used by most ALUA arrays now
  • Unavailable – SP and port not available

In an active/active array the following states exist:

  • Active – all SPs and ports should be in this state
  • Unavailable – SP and port not available

In an active/passive array the following states exist:

  • Active – the SP and port used to access the LUN (single owner)
  • Standby – an SP and port available if the active one is gone
  • Transitioning – switching to Active or Standby

In ALUA arrays you also have target port groups (TPGs), which are sets of SP ports that share the same state.  For example, all the ports on a single SP may be one TPG since the LUN is owned by that SP.
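On the ESXi side you can see how this is interpreted. As a quick illustration (device names will differ per environment), an ALUA array normally gets claimed by VMW_SATP_ALUA or a vendor-specific SATP:

# Shows the Storage Array Type plugin (SATP) and Path Selection Policy claimed for each device
esxcli storage nmp device list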

How does your host know what the state is?

Great question.  Using SCSI commands, a host and array communicate state.   There are lots of commands in the standard; I will show three management commands from ALUA arrays since they are the most interesting:

  • Inquiry – ask a SCSI question
  • Report target port groups – reports which TPG has the optimized path
  • Set target port groups – asks the array to switch target port group ownership

 

This brings up some fun scenarios: who can initiate these commands and when…  All of these will use an ALUA array.

Setup:

So we have a server with two HBAs connected to SAN switches.  In turn the SPs are connected to the SAN switches.  SPa owns LUN1 (AO) and SPb owns LUN2 (AO).

[Diagram: each HBA connects through a SAN switch to one SP; SPa presents LUN1 as AO and SPb presents LUN2 as AO]

 

Consider the following failures:

  • HBA1 fails – assuming the pathing software on the OS is set correctly (more on this later), the operating system accesses LUN1 via the ANO path to SPb to continue to access storage.  It then issues a set target port groups command to SPb asking it to take over LUN1.  The request is fulfilled and the array sends out a report target port groups to all known systems telling them to use SPb as the AO path for LUN1.
  • SPa fails – assuming the pathing in the OS is good, access to LUN1 via SPa fails, so the OS fails over to SPb and initiates the LUN failover.

This is designed just to show the interaction; in a real environment you would want SAN switches A and B both connected to SPa and SPb if possible for redundancy.

How does ESXi deal with paths?

ESXi has three possible path states:

  • Active
  • Standby
  • Dead – cable unplug, bad connection / switch

It will always try to access the LUN via any available path.
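You can see what ESXi thinks of each path from the command line; a quick sketch (the device ID is made up for the example):

# List every path with its runtime state (active, standby, dead)
esxcli storage core path list

# Narrow the output to a single device
esxcli storage core path list -d naa.60060160a6b02e00384d3c3cf37de011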

Why does path selection policy matter?

The path selection policy can make a huge difference.  For example, if you have an ALUA array you would not use the round robin path selection policy; doing so would cause at least half your I/Os to go down the ANO path, which would be slow.   ESXi supports three policies out of the box:

  • Fixed – honors the preferred (AO) path whenever it is available; most commonly used with ALUA arrays
  • Most Recently Used (MRU) – ignores the preferred path and uses the most recently used path until it's dead (used with active/passive arrays)
  • Round Robin (RR) – sends a fixed number of I/Os or bytes down a path, then switches to the next path.  Ignores AO.  Normally used with active/active arrays

The number of I/Os or bytes sent before switching in RR can be configured, but it defaults to 1,000 I/Os and 10,485,760 bytes.
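Both the policy and the RR switching threshold can be changed per device if your array vendor recommends it. A sketch, with a made-up device ID and example values (follow your vendor's guidance before changing anything):

# Switch a device to round robin
esxcli storage nmp device set -d naa.60060160a6b02e00384d3c3cf37de011 --psp VMW_PSP_RR

# Drop the round robin switching threshold from 1000 I/Os to 1
esxcli storage nmp psp roundrobin deviceconfig set -d naa.60060160a6b02e00384d3c3cf37de011 --type iops --iops 1

# Verify the result
esxcli storage nmp device list -d naa.60060160a6b02e00384d3c3cf37de011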

Which policy should you use?  That depends on your storage array, and you should work with your vendor to understand their best practices.  In addition, a number of vendors have their own multipathing plugins that you may want to use (for example EMC's PowerPath).

 

VMware Lost access to volume on Brocade SAN switches

Earlier today I ran across the following errors in my ESXi logs:

Nov 15 08:03:10 esx02 Hostd: [6BA33B90 info 'Vimsvc.ha-eventmgr'] Event 305408 : Lost access to volume 51f7bd04-ace1a4c9-0114-0017a4770000 (LUN23) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.

It would always recover right away, within one second, and since I am on an active/active array it was not really an interruption, but it did bother me.  So I did some research and came up with fill words.

The process as I understand it is that switches use these fill words to set up link initialization.  Up until firmware 6.1 this was done by sending IDLE/IDLE messages; both sides would respond with IDLE, which was specific to devices running 1G/2G/4G speeds.  Brocade then added new fill word modes as shown below:

MODE          MEANING
Mode 0        Use IDLE in link init and IDLE as fill word
Mode 1        Use ARB in link init and ARB as fill word
Mode 2        Use IDLE in link init and ARB as fill word
Mode 3         Try mode 1 first; if it fails then try mode 2

You should set this mode per your storage and server vendors' best practices.  I know that for my HP blades using 8Gb Virtual Connect it is critical to use Mode 3, which I was not using.  You can check your current config by logging into the switch and using this command:

portcfgshow port_number

And you can switch it via:

portcfgfillword <Port#> <Mode>
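For example, to check one of my blade-facing ports and move it to Mode 3 (the port number is illustrative, and expect a brief link bounce on the port when the fill word changes):

portcfgshow 4
portcfgfillword 4 3
portcfgshow 4

The first portcfgshow confirms the current fill word setting and the second confirms the change took.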

You can read a much better written article on that matter here.

Lock down port speed on a brocade switch

Well this one is pretty simple and a lot of vendors still want the port speeds locked down.  So first of all use

 

switchshow

 

to show the current speed, then use

portcfgspeed port# speed

 

Speed can be:

  • s – auto-negotiate
  • 1 – 1 Gb
  • 2 – 2 Gb
  • 4 – 4 Gb
  • 8 – 8 Gb
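For example, to lock an illustrative port 4 at 8 Gb and then confirm it in the speed column:

portcfgspeed 4 8
switchshow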

 

 

Upgrade the firmware on a brocade switch

Yep, I love doing this… it's really easy as long as you have an FTP server (or run FTP on your desktop) and the firewalls allow it…

 

The Brocade switch needs to be able to download the firmware from an FTP server.  Once you have FTP available it's easy to install the firmware.  The process downloads the firmware and installs it on the secondary CP, then moves the load over to the secondary CP.  Once all load is on the secondary CP it upgrades the first CP.  This way the code is completely loaded and verified before load gets moved to the new code.

 

Once you have the FTP server set up you can log in to the switch via SSH (or telnet – yuck, disable it and use SSH) and issue the following command:

 

firmwaredownload

 

It will walk you through a series of prompts to help you install the firmware.  The most important is the location, which is relative to where the FTP server drops you… for example, mine drops you into the user's home directory, which is /home/username, and I have my firmware unpacked into /v6.4.2.b, so the full path is /home/username/v6.4.2.b, but when connecting via FTP the path is v6.4.2.

 

You will be disconnected during the firmware update; you can reconnect pretty quickly and use the following command to check status:

 

firmwaredownloadstatus

 

This will display the current status of the download.  You can also run version to check that the firmware has been upgraded.
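Once everything settles I like to confirm the result; firmwareshow (which, if memory serves, lists the firmware on both partitions) and version both work:

firmwareshow
version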

 

 

Boot From SAN, multipath.conf and initrd

At work almost everything I run boots from SAN.  This allows flexibility when hardware problems arise.  It does cause one issue: when moving storage systems, how do you move your boot-from-SAN LUNs?  Well, there is the old dd trick to do a block-level copy between two similarly sized LUNs.  This alone will not work, because the WWID of your boot-from-SAN LUN will change.  When this happens the following all become issues:

1. You need to change your BIOS settings to point at the new boot-from-SAN WWID (modern HBAs trick the BIOS into thinking a single LUN is a local disk so it can be booted from; this is normally done in the HBA BIOS, at least for QLogic)

2. You need to identify the LUN to your multipathing software so it's redundant (I will use Linux device mapper multipathing, which is handled by a kernel driver and the multipath.conf file)

 

Both of those are straightforward and easy, but when you change them it just does not work.  Why not?  Well, it's because of how Linux boots and handles multipath devices.

Your boot loader has no kernel and as such cannot load data off a multipathed device (it needs a kernel driver to do it), so how does it solve this issue?  Well, it compiles both the driver and multipath.conf into the initrd.  Yep, you heard me right: the initrd has a copy of the multipath.conf you have been changing for years without ever running mkinitrd.   How does it work?

1. GRUB stage 1 hands off to the hard drive partition (a single path of your /boot partition)

2. GRUB stage 1.5 loads the initrd, which contains the multipath driver and a copy of multipath.conf covering at least the boot-from-SAN LUN

3. GRUB stage 2 loads the normal kernel, which then reloads multipath.conf from disk, enabling all the other disks listed in there

 

So to solve this issue, when you change the WWID of your boot-from-SAN LUN (or optionally any LUN) you remake your initrd with the following command:

 

mkinitrd /boot/initrd-new-boot.img your-kernel-version

 

For some reason on RHEL, when I issue this command with the name of my current initrd and -f to force it, multipath does not get added like it should; use a different name and it works.  Then just change your /boot/grub/menu.lst to boot the new initrd and you're all set.
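For reference, a GRUB legacy menu.lst stanza pointing at the new initrd might look like this; the kernel version and root device are made up, so match your own:

title Linux (boot from SAN, rebuilt initrd)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-398.el5 ro root=/dev/mapper/mpath0p2
        initrd /initrd-new-boot.img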

 

How do I tell what's in my initrd?

 

cd /tmp

mkdir initrd

cd initrd

zcat /boot/initrd-version.img | cpio -id

and you will have an unpacked initrd that you can browse… try looking for etc/multipath.conf inside it.

 

You can recompile it with

cd /tmp/initrd
find . | cpio -o -H newc | gzip -9 > ../initrd.gz

 

 

Enjoy

 

Troubleshooting multipathd

At work I use Linux native multipathing (MPIO provided via device mapper) a lot.  It can be a bit hard to troubleshoot.  I have found the best way is to use the interactive console.   From there you can do a lot of things; I'll illustrate the three most common ones I use:

Show current config

# multipathd -k
>> show config

Reload changes to multipath.conf

# multipathd -k
>> reconfigure

Check paths

# multipathd -k
>> show paths

In addition, the following command outputs a log of debug information that can really help.

multipathd -v4 
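Outside the interactive console, dumping the assembled maps and the state of every path is also handy (standard device-mapper-multipath tooling, nothing vendor specific):

multipath -ll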

Brocade Fiber Switches Clear stats

Well, time to gather stats and clear them again.  Log into your Brocade switch via SSH and type the following commands:

portStatsClear
portLogClear

To see the stats you can run:

portErrShow
portLogShow

I will be posting more on switches in the near future…