Double your storage capacity without buying a new storage shelf

I spent a good portion of my career moving storage from one array to another.   The driver is normally something like this:

  • Cost of older array (life cycle time)
  • New capacity, speed or feature

So off we went on another disruptive migration of LUNs and data.  At one point I was sold on physical storage virtualization appliances.   They stood in front of the array and allowed me to move data between arrays without interruption to the WWID or application.   I loved them; what a great solution.   Then Storage vMotion became available and 95% of the workloads were running in VMware.   I no longer needed the storage virtualization appliance and my life became very VMware focused.


New Storage paradigm

With the advent of all-flash arrays and HCI (all flash or mixed), performance (speed) has almost gone away as a reason for moving data off arrays.  Most arrays offer the same features, replication capability aside.   So now we are migrating to new arrays / storage shelves because of capacity or life cycle issues.   Storage arrays and their storage shelves have a real challenge with linear growth.   They expect you to make a bet on the next three years' capacity.   HCI allows a much better linear growth model for storage.

My HCI Gripe

My greatest gripe with HCI solutions is that everyone eventually needs more storage, but that does not always mean you need more compute.   Vendors that provide hardware-locked (engineered) platforms suffer from this challenge.   The small box provides 10TB, the medium 20TB and the large 40TB.   Which do I buy if I need 30TB?   I am once again stuck in the making-a-bet problem from arrays (at least it's a smaller bet).   The software-based platforms, including VSAN (full disclosure – at time of writing I work for VMware and have run VSAN in my home for three years), have the advantage of offering better mixed sizing and linear growth.

What about massive growth?

What happens when you need to double your storage with HCI and you don't have spare drive bays available?   Do you buy a new set of compute and migrate to it?  That's just a replacement of the storage array model…  Recently at some meetings a friend from the Storage and Availability group let me know the VSAN solution to this problem.   Quite simply: replace the drives in your compute with larger drives in a rolling fashion.   You should create uniform clusters, but it's totally possible to replace all current drives with new double-capacity drives.   Double the size of your storage for only the cost of the drives.   (Doubling the size of cache is a more complex operation.)  Once the host is out of maintenance mode and the new capacity is available, VSAN migrates data onto the new disks.

What is the process?

It's documented in chapter 11 of the VSAN administration guide:

A high-level overview of the steps (please follow the official documentation):

  1. Maintenance mode the host
  2. Remove the disk from the disk group
  3. Replace the disk you removed with the new capacity drive
  4. Rescan for drives
  5. Add disk back into the disk group
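On the command line, one pass of that rolling swap looks roughly like this. This is a hedged sketch, not the official procedure: the naa. device IDs are placeholders, and the flags should be verified against your ESXi version and the admin guide.

```shell
# Enter maintenance mode while keeping VSAN objects accessible
esxcli system maintenanceMode set -e true -m ensureObjectAccessibility
esxcli vsan storage list                      # note the cache SSD backing the disk group
esxcli vsan storage remove -d naa.OLD_CAPACITY_DISK
# ...physically swap the drive, then rescan...
esxcli storage core adapter rescan --all
esxcli vsan storage add -s naa.CACHE_SSD -d naa.NEW_CAPACITY_DISK
esxcli system maintenanceMode set -e false
```

Repeat host by host; VSAN rebalances onto the larger disks as each host comes back.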


vSphere 6.5 features that are exciting to me

Well, yesterday VMware announced vSphere 6.5 and VSAN 6.5; both are huge leaps forward in technology.   They address some major challenges my customers face and I wanted to share a few features that I think are awesome:

vSphere 6.5

  • High Availability in vCenter Appliance – if you wanted a reason to switch to the appliance, this has to be it… For years I have asked for high availability for vCenter; now we have it.   I look forward to testing and blogging about failure scenarios with this new version.  This has to be my #1 ask for the platform for the last three years!  To be clear, we are not talking about VMware HA; we are talking about active / standby appliances.
  • VM Encryption – notice this is a feature of vSphere, not VSAN – this is huge: the hypervisor can encrypt virtual machines at rest and while being vMotioned.   This is a huge enabler for public cloud, allowing you to ensure your data is secure with your own encryption keys.   This is going to make a lot of compliance folks happy and enable some serious hybrid cloud.
  • Integrated Containers – Docker compatible interface for containers in vSphere allowing you to spawn stateless containers while enforcing security, compliance and monitoring using vSphere tools (NSX etc..) – this allows you to run traditional and next generation applications side by side.

VSAN 6.5

  • iSCSI support – VSAN will be able to act as an iSCSI target for physical workloads – e.g. SQL failover clustering and Oracle RAC.   This is huge: VSAN can now be an iSCSI server with easy policy-based management and scalable performance.

There are a lot more announcements but these features are just awesome.    You can read more about vSphere 6.5 here and VSAN 6.5 here.

Pernix Data 30 days later

I have been interested in PernixData since its initial release; the idea of using flash to accelerate storage is not new to me.   Anyone who reads my blog-based rants has found that I am a huge supporter of larger cache on storage arrays.   I have always found having more cache will make up for any speed issues on drives.   My thoughts on this are simple: if 90% of my storage writes and reads come from cache, I run at near the speed of the cache.   Spend your money on larger cache instead of faster spinning disks and performance is improved.   Almost every storage array vendor has been using SSD to speed up arrays for the last four years.   They all suffer from the same problem: they treat all I/Os equally, without any knowledge of workload.
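The arithmetic behind that claim is simple; here is a sketch with assumed numbers (0.1 ms for cache, 5 ms for spinning disk – your latencies will differ):

```shell
# effective latency = hit * cache_ms + (1 - hit) * disk_ms
awk 'BEGIN {
  hit = 0.90; cache_ms = 0.1; disk_ms = 5.0
  printf "%.2f ms effective\n", hit * cache_ms + (1 - hit) * disk_ms
}'
# prints "0.59 ms effective" -- nearly an order of magnitude under the raw disk
```

Even at a 90% hit rate the remaining 10% of misses dominates the average, which is why the cache algorithm matters so much.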

The Netflix problem

The best way to explain this problem is using Netflix.   They have implemented a system where you rate a show with stars.   It then compares your ratings against everyone else's and locates users with similar ratings to yours.   Once located, it uses those users' recommendations to find new shows for you.   This is great… assuming you have 100% the same taste in shows as those users.   The algorithm has advanced a lot in the last five years and is much more accurate and complex.  It's pretty accurate for me except for one problem… my wife and kids share the same Netflix account and my children love to rate everything.    This produces the world's worst set of recommendations… I get little-girl TV shows mixed with Downton Abbey and sci-fi movies.   It's a mess… Netflix literally has no idea how to recommend shows to me.    This problem exists for storage arrays with cache.   Choosing which data should be cached for reads is hard, because there are lots of different workloads competing for cache.    I don't want to devalue the algorithms used by storage vendors; they, much like Netflix's, are a work of evolving art.  But with everyone profiled into one mass, everyone's performance suffers.   Netflix understood this problem and created user profiles to solve it.  They added simple versions of localized intelligence to the process.   These pockets of intelligent ratings are used to provide recommendations for the local needs.

Pernix is the intelligent user profile

Pernix is just like Netflix user profiles, it’s installed locally on each ESXi server.  It caches for that ESXi host (and replicates writes for others).   It can be configured to cache everything on the host, datastore or virtual machine.   It provides the following features:

  • The only local SSD write cache that I know of outside hyper-converged solutions
  • Local SSD read cache
  • Great management interface
  • Metrics on usage
  • Replication of writes to multiple SSD’s for data protection


Pernix is built for vSphere

Pernix installs as a VIB into the kernel and does not require a reboot.   It has a Web Client interface and a C# client interface.   It does require a Windows server and SQL server for reporting.   It is quick and easy to install and operate.  The cache can be SSDs or memory for pure speed.    Pernix works only in vSphere, so it's 100% customized for vSphere.


My local Pernix SEs were kind enough to provide me a download and license for PernixData.   My home lab has been documented on this blog before, but the current solution is 3 HP nodes with 32GB of RAM each as shown below:


I added a 120GB SanDisk SSD to each node for this test.    My storage 'array' is an older Synology NAS with two mirrored 2TB 7,200 RPM disks via iSCSI and NFS.  My rough math says I should be getting about 80 IOPS total from this solution, which really sucks; oddly it has always worked for me.  I didn't have any desire to create artificial workloads for my tests, I just wanted to see how it accelerated my everyday workload.   All of these tests were done in vSphere 5.5 U2.

Pernix Look and feel

Pernix provides a simple and powerful user interface.  I really like the experience even in the web client.   They use pictures to quickly show you where problems exist.



As you can see, lots of data is presented in a great graphical interface.   They also provide performance charts on every resource using Pernix.  Without reading any manual other than the Pernix quick start guide, I was able to install their solution in 15 minutes and have it caching my whole environment.  It was awesome.

How do we determine storage performance?

This is a constant question; every vendor has a different metric they want to use to explain why their solution is better.   If it's a fiber channel array they want to talk about latency, then IOPS.   If it's an all-flash NAS, it's IOPS, then latency.    So we will use these two metrics for the tests:

  • Latency – the time it takes to commit a write or service a read
  • IOPS – Input / Outputs per second

I wanted to avoid using Pernix’s awesome graphs for my tests so I chose to use vRealize Operations to provide all recorded metrics.
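If you would rather pull raw numbers than rely on a monitoring product, esxtop in batch mode can capture the same counters; the interval and sample count below are arbitrary choices of mine:

```shell
# 30 samples at 2-second intervals, written as CSV for offline analysis;
# the output includes per-device IOPS and latency counters.
esxtop -b -d 2 -n 30 > perf-baseline.csv
```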



The VM that gives my environment the biggest storage workout is vRealize Log Insight.   It has been known to record 300 IOPS in this environment.    Generating IOPS is easy: just click around the prebuilt dashboards with the time slider set to all time, and read IOPS fly up like crazy.   So my averages before Pernix were as follows:

  • Max IOPS: 350
  • Max Latency: 19 ms
  • Average Latency: 4 ms


Now with Pernix

I set up Pernix to cache all virtual machines in my datacenter.  With Pernix I clicked around on multiple days and performed lots of random searches.  I loaded down a SQL server with lots of garbage inserts to create writes.   Nothing perfectly scientific with control groups; I just wanted to kick the tires.   After a month with Pernix I got the following metrics:

  • Max IOPS: 4,000
  • Max Latency: 14 ms
  • Average Latency: 1.2 ms


So the results clearly show a massive increase in IOPS.  Some may say: sure, you are using SSDs for the first time, which is true.   But the increase is not just SSD speed, because the latency is greatly improved as well, which is representative of the local cache.   Imagine using enterprise-grade SSDs with much larger capacity.  Will Pernix improve storage performance?  It depends, but there is a very good chance.
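For what it's worth, the before/after numbers above work out as follows (just arithmetic on the figures from this post):

```shell
# max IOPS improvement and average latency reduction
awk 'BEGIN {
  printf "IOPS: %.1fx\n", 4000 / 350
  printf "avg latency cut: %.0f%%\n", (1 - 1.2 / 4.0) * 100
}'
# prints "IOPS: 11.4x" and "avg latency cut: 70%"
```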

Use Cases

With my home lab hat removed I need to talk about some enterprise use cases:

  • Any environment or workload where you need to reduce latency
  • Any environment where a workload needs more IOPS than the current solution can provide

Both of these use cases apply where latency or IOPS has a direct cost.   Pernix can also be used as a general speed enhancer on slower environments or to improve legacy arrays.   It does push toward a scale-up approach to clustering.   Larger cluster nodes with larger SSDs will cost less than lots of nodes, since Pernix is licensed per node.   Putting in larger nodes does have a big impact on failure domains that should be taken into account.

My only Gripe

My only gripe with Pernix is the cost.  Compared to large storage arrays it is really cheap.  The problem is budgets… I need more storage performance, which means the storage team buys more storage arrays, not the compute team.  Getting that budget transferred is hard because storage budgets are thin already.     This will change: hyper-converged is becoming widely accepted, and Pernix will really shine in this world.   Pernix just released the read cache for free, making it a very tempting product.   They are a smart company with a great product, on the right path: bringing storage performance as close to the workload as possible with an added element of intelligence.

Change in VMware 5.5 U2 ATS can cause storage outages!

Update:  VMware has posted the following KB and there is a really good article by Cormac Hogan on the matter.   I have also posted a PowerCLI script to resolve the issue.


Yesterday I was alerted to the fact that there was a change in the VMware 5.5 U2 heartbeat method.  In U2 and vSphere 6 it now uses ATS on VAAI-enabled arrays to do heartbeats.   Some arrays are experiencing outages due to this change.   It's not clear to me which arrays exactly are affected, other than IBM, which has posted an article here.   It seems to cause one of the following symptoms: host disconnects from vCenter, or storage disconnects from host.  As you can see, one of these (storage) is a critical problem, potentially creating an all-paths-down situation.

The fix suggested by IBM disables the ATS lock method and returns it to pre-U2 behavior.   It's my understanding that this is an advanced setting that can be applied without a reboot.  I have also been told that this advanced setting can be applied via host profile or PowerCLI.
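If (and only if) VMware support directs you to it, the setting can be checked and flipped per host with esxcli. The option name below (VMFS3.UseATSForHBOnVMFS5) is the one being circulated for this issue; verify it against the official KB before touching anything:

```shell
# 1 = ATS-based heartbeats (new U2 behavior), 0 = pre-U2 SCSI-read heartbeats
esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5
esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5
```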

It is very early in the process; in all cases you should open a VMware ticket to get their advice on how to deal with this issue.   They are working on the problem and should produce a KB with more information when possible.   I personally would not apply this setting unless you are experiencing the issue as identified by VMware.   I wish I had more information, but it has not happened in my environment.


Post comments if you are experiencing this issue with more information.  I will update the article once the KB is posted.

Storage vMotion and Change block Tracking are now friends

A lot of readers may be aware that Storage vMotion is an awesome feature that I love.  It had one nasty side effect: it reset Changed Block Tracking (CBT).   This is a huge problem for any backup product that uses CBT (Veeam, or anything else that does an image-level backup).   It means that after a Storage vMotion you have to do a full backup.  It's painful and wasteful.   It means that most enterprise environments refuse to use automated Storage DRS moves (one of my favorite features) because of their impact on backups.    Well, the pain is now over… and has been for a while, I just missed it :).   If you look at the release notes for ESXi 5.5 U2 you will find the following note:


  • Changed Block Tracking is reset by storage vMotion
    Performing storage vMotion operation on vSphere 5.x resets Change Block Tracking(CBT).
    For more information, see KB 2048201

    This issue is resolved in this release.


I wish it told us more about the change or why CBT is no longer reset, but I guess I'll accept it as is.   The previous workaround was: don't use Storage vMotion, or do a full backup after… which is not a workaround so much as a consequence.  Either way, enjoy moving the world again.

Enable Stateless Cache on Auto Deploy

Auto Deploy Really?

Yes, a big Auto Deploy post is going to follow soon.   I can really see the benefit of Auto Deploy in larger environments.  I'll be posting the architectural recommendations and failure scenarios soon.   Today I am posting about stateless cache and USB.

What is stateless cache and why do I care?

Stateless cache allows your auto-deployed ESXi host (TFTP image running in memory) to be installed on a local hard drive.  This enables you to boot the last running configuration without the presence of the TFTP server.  It's a really good protection method.   It is enabled by editing the host profile; in 5.5 it can be enabled using the fat client:

  1. Select the profile and right click on it
  2. Select Edit
  3. Expand System Image Cache Configuration
  4. Click on System Image Cache Profile Settings
  5. Select the drop down and choose the stateless caching mode you want.


This all sounds great, but we had a heck of a time trying to get stateless caching to SD cards working on our UCS gear.   A coworker discovered that SD cards are seen as USB devices.  Once we selected “Enable stateless caching to a USB disk on the host”  everything worked.

Design Constraints

Using stateless caching will protect you against a failure of TFTP and even vCenter but DHCP and DNS are both still required for the following reasons:

  • DHCP to get IP address information
  • DNS to get hostname of ESXi host


Stateless does not remove all dependencies but it does allow quick provisioning.

Warning to all readers using Snapshot CBT based backups

Over the last few days I have become aware of a pretty nasty bug with VMware snapshot API based backups (any image-based solution that is not array based and uses change block tracking; I will not give names).  This bug has been around for a while but has recently been fixed.   The problem happens when you expand a currently presented drive by 128GB or more.   This expansion causes a bug in CBT that makes all CBT-based backups junk.  You will not be able to restore them.   It's a major pain in the butt.   What is worse, you cannot detect this issue until you restore.  So here is how you trigger the bug:

  • Expand a currently presented drive by 128GB or more
  • Do a CBT backup
  • Try to restore that backup or any following backup

You can work around this issue with the following process:

  • Expand a currently presented drive by 128GB or more
  • Disable CBT
  • Re-enable CBT
  • Do a new full backup

This bug has been around since the 4.1 days and I have never run into it.  I believe this is because I have mostly worked in Linux-heavy shops.  We always added a new drive and used logical volume management to expand the mount points, thus avoiding this issue.
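That LVM-based alternative looks something like this. A hedged sketch: the volume group (vg0), logical volume (lv_data) and device name (/dev/sdb) are assumptions for illustration, and this presumes an ext3/ext4 filesystem:

```shell
# Bring a newly added virtual disk into the existing volume group
# instead of growing the original VMDK past the 128GB CBT threshold.
pvcreate /dev/sdb
vgextend vg0 /dev/sdb
lvextend -l +100%FREE /dev/vg0/lv_data
resize2fs /dev/vg0/lv_data     # online grow for ext3/ext4
```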

Please give me some good news

Well, today I can say this problem is fixed in 5.5 U4, so patch away.  It does not fix machines that are already backing up incorrectly; it just avoids future occurrences.  You can read more about it here.

All Paths Down my new short term enemy

Edit: Thanks to comments on Twitter from Duncan Epping and others, I have corrected some errors in the original article.  This is one of the things I love about the internet: I can make a mistake and others are kind enough to help me correct it.


Most of my VMware career I have been blessed with very solid fiber channel arrays.  These arrays have rarely gone down, and when they do, a reboot of the whole environment normally solves the issue (I have really only done this once, and it was a software bug in the array).    In so many ways this single point of failure (the storage array) is still a major problem in our journey to the software-defined datacenter.    Recently during functionality tests we ran into the dreaded All Paths Down (APD) situation.   My experience with APD has prompted this post.   In order to understand APD you have to understand Permanent Device Loss (PDL).


What is PDL?

PDL is when your storage array, because it is removing a LUN or about to reboot, sends SCSI sense codes to ESXi to let it know the LUN or path is going away.   It is the same as my renter letting me know he is moving away: I have some warning and I can prepare.  I also know he is really leaving and not coming back.   PDL handling has seen a number of improvements over the years.  At this point, if your ESXi host gets a PDL and has a virtual machine on that storage, it starts an HA event.   If any other ESXi host can mount that storage, it will power on the virtual machine and return it to operation.  If the storage is 100% lost due to PDL, the virtual machine will appear as disconnected and be unavailable.   PDL is not desirable: any data not committed to storage will be lost.   Virtual machines may be very unhappy with this interruption and require manual recovery, but at least they try to restart.   You can resolve PDL by rebooting or rescanning ESXi.  Once storage is present you can restart virtual machines.


Why is APD the ugly brother to PDL?

APD is very different from PDL.  There are no SCSI codes; storage just goes 100% away.  It is my renter moving out in the middle of the night without any warning.    I have no idea if they are coming back or what the situation could be.   I want to be very clear: All Paths Down, as the name suggests, means all paths to a storage LUN are down at the same time.   No warning, no notice, just not available.    This is a bad situation.  ESXi does not know if the LUN is going to return.  Much like with my rental apartment, I don't want to paint and re-carpet until I am sure they are gone.    This delayed response can cost me money, but I want to be on the safe side.   VMware has taken the same approach.   During an APD situation VMware does nothing.  Virtual machines continue to run in memory.   Each operating system acts differently.   I believe Windows continues to run with memory cache using FIFO (which means data will be lost because it cannot be written to disk); once storage has returned, Windows will write to disk like nothing was lost.   Linux, once it finds its storage unwritable, goes read-only (this can be resolved once storage is back with an OS remount or reboot).      This problem is complicated by the fact that ESXi will constantly try to reach these devices.   This creates load on the ESXi host (because it's scanning for storage that is not present) and can cause hostd to crash, making an ESXi host disconnect from vCenter.   In 5.1 they added an advanced parameter (Misc.APDTimeout, default 140 seconds) which causes the rescans to stop after 140 seconds.   From that point forward they wait for the storage to identify its presence.   As you can imagine, APD is bad.  You can read more about APD and PDL in a number of VMware KB articles, but this is a really good link.
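The APD handling knobs can be inspected per host with esxcli; the values shown are what I would expect as defaults on 5.1+, but confirm them on your own build before changing anything:

```shell
# Is the 5.1+ APD handling enabled, and what is the current timeout?
esxcli system settings advanced list -o /Misc/APDHandlingEnable
esxcli system settings advanced list -o /Misc/APDTimeout
# Example only: lengthen the timeout (seconds) if your array takes longer to recover
esxcli system settings advanced set -o /Misc/APDTimeout -i 180
```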


Wait how do I even get a all paths down?

Well… here is the fun part: that depends on your array.  Good chance, if you're reading this article, you have network-based storage or you are running a metro cluster.   Most other customers will not see this issue unless they run into a bug or really bad change management.    If you have fiber channel arrays, you must have either all your HBAs or both your fiber switches fail at the same time to create an APD.   If you have network storage it can be caused by broadcast storms, multiple switch failures etc., but it can only happen if you keep your traditional networking and storage networking separate.    If you have them together on the same switch then you would have a host isolation event and HA would work.

You said something about Metro right?

Correct.  vMSC (vSphere Metro Storage Cluster – or stretch cluster) is one situation where you will potentially see APD.   In vMSC you have two sites and a single cluster stretched between them.  Your storage is synchronously replicated between sites.  If you lose storage in only a single site then you could have APD and be in a world of hurt.  You have created a solution that assures downtime prevention by having two sites and the ability to vMotion between them, but now you have virtual machines running and potentially losing data.   Very bad things.

My hyper-xxx solution avoids this issue 100%

It is true that some hyperconverged solutions have avoidance when it comes to APD.  Some do this by making the storage local to the workload.   Others use distribution to avoid the issue.   Most vendors share the network for both storage and networking, making an APD impossible: a failure would mean the host is isolated, and your host isolation response would solve the issue.

Why does VMware allow this state to continue?

Well, the first and best answer is that it's a rare condition.   I will throw out a C-3PO prediction and call it 1:10,000.   It's pretty rare assuming the following is true: you have redundant fabrics and you have good, documented change processes.   The best way to avoid APD is to architect it away.   Redundant dedicated paths to storage are your friend. To be 100% fair to VMware, they have done a number of enhancements over the years to reduce the impact of APD issues (for example the change in 5.1 to Misc.APDTimeout).


What about Metro?

Again it’s rare.  If you are building metro spend the money on the architecture.  In this case you will want to reboot your hosts on the failed side and allow them to HA to the other side.


What is the good news?

Well, I do have some good news.  Once again VMware has provided a solution.  In vSphere 6.0 you will have a feature called component protection (read more here) which allows you to choose what to do in a PDL and APD situation.  It includes timers and actions (like shut down the VM and HA it to another host if possible).    A solid future solution to a rare event from VMware.


Design Scenario: Gigabit network and iSCSI ESXi 5.x

Many months ago I posted some design tips on the VMware forums (I am Gortee there if you are wondering).   Today a user updated the thread with a new scenario looking for some advice.  While it would be a bad idea personally and professionally for me to give specific advice without a design engagement, I thought I might provide some thoughts about the scenario here.  This will allow me to justify some design choices I might make in the situation.   In no way should this be taken as law.  In reality everyone's situation is different, and little requirements can really change the design.   The original post is here.

The scenario provided was the following:

3 ESXI hosts (2xDell R620,1xDell R720) each with 3×4 port NICS (12 ports total), 64GB RAM. (Wish I would have put more on them ;-))

1 Dell MD3200i iSCSI disk array with 12 x 450GB SAS 15K Drives (11+1 Spare) w/2 4 port GB Ethernet Ports

2 x Dell 5424 switches dedicated for traffic between the MD3200i and the 3 Hosts

Each host is connected to the iSCSI network though 4 dedicated NIC Ports across two different cards

Each Host has 1 dedicated VMotion Nic Port connected to its own VLAN connected to a stacked N3048 Dell Layer 3 switch

Each Host will have 2 dedicated (active\standby) Nic ports (2 different NIC Cards) for management

Each Hosts will have a dedicated NIC for backup traffic (Has its own Layer 3 dedicated network/switch)

Each host will use the remaining 4 Nic Ports (two different NIC cards) for the production/VM traffic)

 would you be so kind to give me some recommendations based on our environment?


Requirements:

  • Support 150 virtual machines
  • Do not interrupt systems during the design changes

Constraints:

  • Cannot buy new hardware
  • Not all traffic is VLAN segmented
  • Lots of 1GB ports per server

Assumptions:

  • Standard switches only (assumed by me)
  • Software iSCSI is in use (assumed again by me)
  • Not using Enterprise Plus licenses



Dell MD3200i iSCSI disk array with 12 x 450GB SAS 15K Drives (11+1 Spare) w/2 4 port GB Ethernet Ports

2 x Dell 5424 switches dedicated for traffic between the MD3200i and the 3 Hosts

Each host is connected to the iSCSI network though 4 dedicated NIC Ports across two different cards

I personally have never used this array model; the vendor should be included on the design to make sure the suggestions here are valid with this storage system.  Looking at the VMware HCL we learn the following:

  • Only supported on ESXi 4.1 U1 through 5.5 (no 5.5 U1 yet so don’t update)
  • You should be using the VMW_PSP_RR (Round Robin) for path fail over
  • The array supports the following VAAI natives Block Zero,Full Copy,HW Assisted Locking

The following suggestions should apply to physical cabling:


Looking at the diagram I made the following design choices:

  • From my limited understanding of the array, the cabling follows the best practice guide I could find.
  • Connection from the ESXi hosts to switches are done to create as much redundancy as possible including all available cards.  It is critical that the storage be as redundant as possible.
  • Each uplink (physical nic) should be configured to connect to an individual vmkernel port group.  Each port group should be configured with only one uplink.
  • Physical switches and port groups should be configured to use the native VLAN, assuming these switches don't do anything other than provide storage traffic between these four devices (three ESXi hosts and one array).  If the array and switches provide storage to other systems, you should follow your vendor's best practices for segmenting traffic.
  • Port binding for iSCSI should be configured as per VMware document and vendor documents
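A hedged esxcli sketch of the port-binding and path-policy pieces: vmhba33, vmk1/vmk2 and the naa. ID are placeholders for your software iSCSI adapter, vmkernel ports and MD3200i device, so substitute your own values:

```shell
# Bind one vmkernel port per dedicated uplink to the software iSCSI adapter
esxcli iscsi networkportal add -A vmhba33 -n vmk1
esxcli iscsi networkportal add -A vmhba33 -n vmk2
# Set Round Robin on the array device (list IDs with 'esxcli storage nmp device list')
esxcli storage nmp device set -d naa.DEVICE_ID -P VMW_PSP_RR
```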

New design considerations from storage:

  • 4x 1Gb ports will be used to represent the max traffic the storage system will provide
  • The array does not support 5.5 U1 yet so don’t upgrade
  • We have some VAAI natives to help speed up processes and avoid SCSI locks
  • Software iSCSI requires that forged transmissions be allowed on the switch

Advise to speed up iSCSI storage

  • Find your bottleneck – is it switch speeds, array processors, or ESXi software iSCSI – and solve it.
  • You might want to consider Storage DRS to automatically balance load and I/O metrics (requires an Enterprise Plus license but saves so much time) – it also has an impact on CBT backups, forcing them to do a full backup after a move.
  • Hardware iSCSI adapters might also be worth the time… though they have little real benefit in the 5.x generation of ESXi



We will assume that we now have 8 total 1Gb ports available on each host.   We have a current network architecture that looks like this (avoiding the question of how many virtual switches):


I may have made mistakes in my reading, but a few items pop out to me:

  • vMotion does not have any redundancy, which means if that card fails we will have to power off VMs to move them to another host.
  • Backup also does not have redundancy which is less of an issue than the vMotion network
  • Not all traffic has redundant switches, creating single points of failure

A few assumptions have to be made:

  • No single virtual machine will require more than 1Gb of traffic at any time (otherwise we would have to look into LACP or EtherChannel solutions).
  • Management traffic, vMotion and virtual machine traffic can live on the same switches as long as they are segmented with VLANs


Recommended design:


  • Combine the management switch and VM traffic switch into dual function switches to provide both types of traffic.
  • This uses VLAN tags to include vMotion and management traffic on the same two uplinks, providing card redundancy (configured active / passive).  It could also be configured with multi-NIC vMotion, but I would avoid that due to complexity around management network starvation in your situation.
  • Backup continues to have its own two adapters to avoid contention
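On a standard vSwitch, the VLAN separation above amounts to something like this; vSwitch0 and the VLAN IDs are assumptions for illustration, not values from the poster's environment:

```shell
# Management and vMotion share the same two uplinks, split by VLAN tag
esxcli network vswitch standard portgroup add -v vSwitch0 -p Management
esxcli network vswitch standard portgroup set -p Management --vlan-id 10
esxcli network vswitch standard portgroup add -v vSwitch0 -p vMotion
esxcli network vswitch standard portgroup set -p vMotion --vlan-id 20
```

The active/standby uplink ordering per port group is then set in the teaming policy for each port group.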

This does require some careful planning and may not be the best possible use of links.   I am not sure you need 6 links for your VM traffic but it cannot hurt.


Final Thoughts:

Is any design perfect?  Nope, lots of room for error and unknowns.  Look at the design and let me know what I missed.  Tell me how you would have done it differently… share so we can both learn.  Either way I hope it helps.

GlusterFS and Virtualization

A new friend tipped me off to GlusterFS, which is a distributed file system for Linux.   With the market quickly shifting to hyper-converged solutions I find myself revisiting software-defined storage as a possible solution.  Each day I am confronted with new problems caused by the lack of agility in storage systems.  Monolithic arrays seem to rule my whole world.  They cause so many problems with disaster recovery… a vendor friend once told me it's nothing 300k in software licenses cannot solve.   This is the number one problem with storage vendors: they are not flexible and have not made any major advances in the last twenty years.   Don't get me wrong, there are new protocols (iSCSI, FCoE) and new technologies (dedupe, VAAI etc.), but at the end of the day the only thing that has really changed is capacity and cache sizes.  We have seen SSD improve performance, pushing the bottleneck to the controllers, but it's really the same game.   It's an expensive rich man's club where disaster recovery costs millions of dollars.  Virtualization has changed that market a little… a number of companies are using products like Zerto to replicate over long distances for disaster recovery.   There are a number of software-based replication solutions for virtualization (vSphere Replication every 15 minutes, Veeam etc.), and they solve one market.   What I really want is what Google has: distributed and replicated file systems.   My perfect world would look something like this:

  • Two servers at two different datacenters
  • Each having live access to the same data set
  • Read and write possible from each location
  • Data is stored at each location so no part of the server requires the other site
  • Self healing when a server or site is unavailable

Is this possible?  Yes, and lots of companies are doing this using their own methods.  GlusterFS was bought by Red Hat last year and turned into Red Hat Storage Server.   In fact, Red Hat has bought at least three companies in the last year that provide this type of distributed, replicated storage system.  This has been a move to create a standardized and supported backend for Swift (OpenStack storage bricks).   Thanks to Red Hat we can expect more from GlusterFS in the future.  Since I play around with VMware a lot, I wanted to try using GlusterFS as a backend for ESXi datastores via NFS.  This would allow me to have a virtual machine live-replicated to another site, using GlusterFS to do the replication.   Nearly no data loss when a site goes down.  In a DR situation there are VMFS locks and resignatures that have to take place, but for now I was just interested in performance.


Setting up GlusterFS is really easy.   We are going to set up a three-node replicated volume.

Enable EPEL to get the package:

   wget -P /etc/yum.repos.d

Install glusterfs on each server

   yum install glusterfs-server -y

Start glusterfs

   service glusterd start

Check Status

   service glusterd status

Enable at boot time

   chkconfig glusterd on

Configure SELinux and iptables

SELinux can be a pain with free gluster… you can figure out the rules with the troubleshooter and work them out, or run in disabled or permissive mode.  iptables should allow full communication between cluster members, and any network firewall should have similar rules.
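As a sketch, the host firewall rules might look something like the following. The port numbers reflect the gluster 3.x era (management on 24007-24008, one brick port per brick starting at 24009; later releases moved brick ports to 49152+), so check the version you are running:

```shell
# Allow glusterd management traffic between cluster members
iptables -A INPUT -p tcp --dport 24007:24008 -j ACCEPT
# One brick port per brick, counting up from 24009 (49152+ on newer releases)
iptables -A INPUT -p tcp --dport 24009:24012 -j ACCEPT
# Portmapper, needed for the built-in NFS server
iptables -A INPUT -p tcp --dport 111 -j ACCEPT
iptables -A INPUT -p udp --dport 111 -j ACCEPT
# Gluster's NFS service ports
iptables -A INPUT -p tcp --dport 38465:38467 -j ACCEPT
service iptables save
```

In a lab it is simpler to just permit all traffic between the cluster members' IPs.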

Create the trusted pool 

   server1:  gluster peer probe server2
   server2:  gluster peer probe server1
   server1:  gluster peer probe server3

Create a Volume

Mount some storage and make sure it’s not owned by root.  The storage should be the same size on each node.

  mkdir /glusterfs
  mount /dev/sdb1 /glusterfs

On a single node, create the volume called vol1 and start it:

  gluster volume create vol1 replica 3 server1:/glusterfs server2:/glusterfs server3:/glusterfs
  gluster volume start vol1

Check Status

  gluster volume info


Gluster serves NFS natively as part of its process, so you can use the showmount command to see gluster exports:

  showmount -e localhost

You can also mount from any node using NFS (or mount and then share it out).    If you are going to write locally, I recommend mounting locally with the glusterfs client:

  mount -t glusterfs server1:/vol1 /mnt

Mounting in VMware
You can mount from any node and gain the glusterfs replication, but if that node goes away you lose access to the storage.  To create a highly available solution you need to implement Linux heartbeat with VIPs, or use a load balancer for the NFS traffic.   (I may go into that in another article.)  For my tests a single point of failure was just fine.
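For reference, mounting the gluster NFS export as a datastore from the ESXi side looks something like this (the host name and datastore label here are placeholders, not from my lab):

```shell
# On the ESXi host: mount the gluster NFS export as a datastore
esxcli storage nfs add --host=server1 --share=/vol1 --volume-name=gluster-ds

# Confirm the datastore is mounted
esxcli storage nfs list
```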



I wanted to give it a try, so I set up some basic test cases.  In all of these cases except the remote node in #1, the same storage system was used:

  1. A three-node glusterfs file system on Linux virtual machines, serving the file system to ESXi.   Two nodes are local and one node is remote (140 miles away across a 10Gb internet link).
  2. A virtual machine providing native NFS to ESXi as a datastore, all local
  3. Native VMFS from a Fibre Channel SAN
  4. Native physical server access to the Fibre Channel SAN

In every case I used a virtual/physical machine running Red Hat Linux 6.5 x64 with 1GB of RAM and 1 CPU.



I used two test cases to measure write and read performance.  I know they are not perfect, but I was going for a rough idea.  In each case I took an average of 10 tests done at the same time.

Case 1

Use dd to write 100MB of random data to the file system and ensure it is synced back to the storage system.  The sync here is critical: it avoids skew from memory caching of writes.  The following command was used:

  dd if=/dev/urandom of=speedtest bs=1M count=100 conv=fdatasync
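Since each figure is an average of 10 runs, a small wrapper helps. This is a sketch assuming GNU dd (whose summary line on stderr ends in something like "…, 5.7 MB/s") and a C locale:

```shell
# avg_write_speed: run the dd write test N times and print the mean rate.
# Parses the trailing rate from GNU dd's summary line; the MB/s label
# assumes dd reports in MB/s (fast storage may report GB/s instead).
avg_write_speed() {
    runs=${1:-10}
    total=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # conv=fdatasync forces the data to disk before dd exits,
        # so the page cache does not skew the numbers
        rate=$(dd if=/dev/urandom of=speedtest bs=1M count=100 conv=fdatasync 2>&1 |
               awk '/copied/ {print $(NF-1)}')
        total=$(awk -v t="$total" -v r="$rate" 'BEGIN {print t + r}')
        rm -f speedtest
        i=$((i + 1))
    done
    awk -v t="$total" -v n="$runs" 'BEGIN {printf "%.1f MB/s\n", t / n}'
}
```

Usage: `avg_write_speed 10` prints one averaged figure per storage backend.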

Here are the numbers:

  1. 5.7 MB/s
  2. 5.6 MB/s
  3. 5.4 MB/s
  4. 8.2 MB/s

In this case only the direct SAN showed a major improvement over the virtual test cases.  Gluster performed pretty well.


Case 2

Timed cached reads using the hdparm command in Linux (figures in MB/s).   This has a number of issues, but it’s the easiest way to get a rough read benchmark:

  1. 6101
  2. 5198
  3. 8664
  4. 14614
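For reference, the cached-read test was along these lines (the device path is an assumption; substitute the disk backing each test case):

```shell
# -T times reads from the Linux buffer cache, reported in MB/s;
# run a few times and average, since single runs are noisy
hdparm -T /dev/sda
```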


End result… oddly, reads are a lot faster when using native VMFS and direct SAN.

Summary of results

My non-exhaustive testing suggests that it’s possible to use glusterfs as a backend for ESXi datastores, taking advantage of gluster replication to live-replicate between sites.   There are a lot more tests I would want to perform before I would ever consider this a production solution, but it’s possible.  (Bandwidth usage for this test was low: 100Mb or less across the whole pipe.)  I am concerned about what happens to glusterfs when you have many virtual machines running on a volume; the latency may kill the virtual machines.   I would love to repeat the test with some physical servers as the gluster nodes and really push the limits.   I would also like to see gluster add features like SSD caching on each server.  I could throw 1.6TB of SSD in each server and really make this solution fly.    There are other methods for gluster replication, like geo-replication, which I did not test.  Let me know what you think… or if you happen to have a bunch of servers you want to donate to my testing  :).    Thanks for reading my ramblings.