A new friend tipped me off to GlusterFS, a distributed file system for Linux. With the market quickly shifting to hyper-converged solutions, I find myself revisiting software-defined storage as a possible answer. Each day I am confronted with new problems caused by the lack of agility in storage systems. Monolithic arrays rule my whole world, and they cause so many problems with disaster recovery… a vendor friend once told me it’s nothing 300k in software licenses cannot solve.

This is the number one problem with storage vendors: they are not flexible and have not made any major advances in the last twenty years. Don’t get me wrong, there are new protocols (iSCSI, FCoE) and new technologies (dedupe, VAAI, etc.), but at the end of the day the only things that have really changed are capacity and cache sizes. We have seen SSDs improve performance, pushing the bottleneck to the controllers, but it’s really the same game: an expensive rich man’s club where disaster recovery costs millions of dollars. Virtualization has changed that market a little… a number of companies are using products like Zerto to replicate over long distances for disaster recovery, and there are other software-based replication solutions for virtualization (vSphere Replication every 15 minutes, Veeam, etc.) that serve that market.

What I really want is what Google has: distributed and replicated file systems. My perfect world would look something like this:
- Two servers at two different datacenters
- Each having live access to the same data set
- Read and write possible from each location
- Data is stored in full at each location, so neither site depends on the other
- Self healing when a server or site is unavailable
Is this possible? Yes, and lots of companies are doing it using their own methods. GlusterFS was bought by Red Hat last year and turned into Red Hat Storage Server. In fact, Red Hat has bought at least three companies in the last year that provide this type of distributed, replicated storage system, a move to create a standardized and supported backend for Swift (OpenStack object storage). Thanks to Red Hat we can expect more from GlusterFS in the future. Since I play around with VMware a lot, I wanted to try using GlusterFS as a backend for ESXi datastores via NFS. This would allow me to have a virtual machine live-replicated to another site using GlusterFS, with nearly no data loss when a site goes down. In a DR situation there are VMFS locks and resignatures that have to take place, but for now I was just interested in performance.
Setting up GlusterFS is really easy. We are going to set up a three-node replicated volume.
Enable EPEL to get the package:
wget -P /etc/yum.repos.d http://download.gluster.org/pub/gluster/glusterfs/LATEST/EPEL.repo/glusterfs-epel.repo
Install glusterfs on each server
yum install glusterfs-server -y
service glusterd start
service glusterd status
Enable at boot time
chkconfig glusterd on
Configure SELinux and iptables
SELinux can be a pain with the free Gluster packages. You can work out the needed rules with the SELinux troubleshooter, or run in permissive or disabled mode. iptables should allow all communication between cluster members, and any network firewall should have similar rules. The quick (and less secure) route looks like this:
setenforce 0
iptables -I INPUT -s <cluster subnet> -j ACCEPT
Create the trusted pool
server 1 - gluster peer probe server2
server 2 - gluster peer probe server1
server 1 - gluster peer probe server3
Verify the pool on any node with:
gluster peer status
Create a Volume
Mount some storage on each node and make sure the brick directory is not owned by root – the storage should be the same size on each node.
mount /dev/sdb1 /glusterfs
On a single node create the volume called vol1
gluster volume create vol1 replica 3 server1:/glusterfs server2:/glusterfs server3:/glusterfs
gluster volume start vol1
gluster volume info
Gluster serves NFS natively as part of its process, so you can use the showmount command to see Gluster exports
showmount -e localhost
You can also mount from any node using NFS (or mount it and share it out yourself). If you are going to write locally, I recommend mounting locally using the glusterfs client:
mount -t glusterfs server1:/vol1 /mnt
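To make that client mount survive reboots, an /etc/fstab entry like the following works (a sketch using the server1/vol1/mnt names from the example above; _netdev delays the mount until networking is up):

```
# /etc/fstab – remount the gluster volume at boot
server1:/vol1  /mnt  glusterfs  defaults,_netdev  0 0
```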
Mounting in VMware
You can mount from any node and gain GlusterFS replication, but if that node goes away you will lose access to storage. To create a highly available solution you need to implement Linux heartbeat with VIPs or use a load balancer for the NFS traffic. (I may go into that in another article.) For my tests a single point of failure was just fine.
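As a rough sketch of the VIP idea, keepalived can float a virtual IP between the Gluster nodes so NFS clients keep a stable mount target. The addresses, interface name, and instance name below are made up for illustration, and I have not tested this setup:

```
# /etc/keepalived/keepalived.conf (on the primary node)
vrrp_instance gluster_nfs {
    state MASTER            # BACKUP on the other nodes
    interface eth0          # interface carrying the NFS traffic
    virtual_router_id 51
    priority 100            # lower priority on the backups
    virtual_ipaddress {
        192.168.1.50/24     # the VIP that ESXi would mount
    }
}
```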
I wanted to give it a try, so I set up some basic test cases. In all of these cases except the remote node in #1, the same storage system was used:
- Set up a three-node GlusterFS file system on Linux virtual machines that will then serve the file system to ESXi. Two nodes are local and one node is remote (140 miles away across a 10Gb link).
- Set up a virtual machine to provide native NFS to ESXi as a datastore, all local
- Native VMFS from Fiber Channel SAN
- Native physical server access to Fiber Channel SAN
In every case I used a virtual/physical machine running RedHat Linux 6.5 x64 with 1GB of RAM and 1CPU.
I used two test cases, one for writes and one for reads. I know they are not perfect, but I was going for a rough idea. In each case I took an average of 10 runs done at the same time.
Use dd to write 100 MB of random data to the file system and ensure it is synced back to the storage system. The sync here is critical: it avoids skew from memory caching of writes. The following command was used:
dd if=/dev/urandom of=speedtest bs=1M count=100 conv=fdatasync
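Since each figure below is an average of 10 runs, a small helper like this automates the loop and averages the MB/s figure that dd reports. This is my own sketch, not part of the original test procedure; the function name is mine, and it parses GNU coreutils dd output, whose format can vary by version:

```shell
# avg_write_speed RUNS SIZE_MB TESTFILE
# Runs dd RUNS times, writing SIZE_MB megabytes of random data each time,
# and prints the mean throughput reported by dd.
avg_write_speed() {
    runs=$1 size_mb=$2 testfile=$3
    total=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # conv=fdatasync forces the data to disk before dd reports,
        # so the page cache does not inflate the number
        mbps=$(dd if=/dev/urandom of="$testfile" bs=1M count="$size_mb" \
                  conv=fdatasync 2>&1 | awk -F, '/copied/ {print $NF + 0}')
        total=$(awk -v t="$total" -v m="$mbps" 'BEGIN {print t + m}')
        i=$((i + 1))
    done
    rm -f "$testfile"
    awk -v t="$total" -v r="$runs" 'BEGIN {printf "%.1f MB/s\n", t / r}'
}
```

For example, avg_write_speed 10 100 speedtest would reproduce the ten 100 MB writes used here.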
Here are the numbers (in the same order as the test cases above):
- 5.7 MB/s
- 5.6 MB/s
- 5.4 MB/s
- 8.2 MB/s
In this case only direct SAN access showed a major improvement over the virtual test cases; Gluster performed pretty well.
Timed cached reads using the hdparm command in Linux. This has a number of issues, but it’s the easiest way to get a rough read number:
hdparm -tT /dev/sda
End result: oddly, reads were a lot faster when using native VMFS and direct SAN.
Summary of results
My non-exhaustive testing proves that it’s possible to use glusterfs as a backend for VMFS taking advantage of gluster replication to live replicate between sites. There are a lot more tests I would want to perform before I ever consider this a possible production solution but it’s possible. (Bandwidth usage for this test was low.. 100mb or less the whole pipe) I am concerned what happens to glusterfs when you have many virtual machines running on a volume it may kill the virtual machine. I would love to repeat the test with some physical servers as the gluster nodes and really push the limit. I would also like to see gluster include some features like SSD for caching on each server. I could throw 1.6 TB’s of SSD in each server and really fly this solution. There are other methods for gluster replication like geo-replication which was not tested. Let me know what you think… or if you happen to have a bunch of servers you want to donate to my testing :). Thanks for reading my ramblings.