This post grew out of a question in the official VMware forums that got me thinking about this problem. For years I have manually adjusted partition alignment on my Linux machines to get that little bit of extra performance from the disk. So the question was: with so many layers (physical array, array RAID, VMFS, guest OS), what is the best practice for alignment and block size?
VMFS 5 facts
First, let's start with some facts about VMFS version 5. (This assumes new LUNs formatted with VMFS 5, not upgraded LUNs. Upgraded LUNs retain their original block size.)
- Unified 1MB file block size (only present on new LUNs; upgraded LUNs retain the block size they had under 4.x)
- Supports very small files (<1KB) by storing them in the metadata rather than in the file blocks.
- Uses sub-blocks of 8K rather than 64K, which reduces the space used by small files.
- Partition table has been changed from MBR to GPT, which allows LUNs larger than 2TB (remember that the maximum size of a single drive presented to a guest is locked at 2TB minus 512 bytes until 5.5, which allows 62TB disks to be presented to guests)
- An upgrade does change the partition table, but only if the LUN is over 2TB
- Increased file count: VMFS-5 allows for more than 100,000 files
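As a side note on where the 2TB figure in the MBR bullet comes from, here is a minimal arithmetic sketch. It assumes the classic MBR layout with 32-bit sector numbers and 512-byte sectors; the function and constant names are just for illustration.

```python
# Sketch: the capacity ceiling of an MBR partition table.
# Assumption: 32-bit LBA fields and 512-byte sectors (the classic MBR layout).
SECTOR_BYTES = 512
MBR_LBA_BITS = 32
TIB = 2 ** 40


def mbr_max_bytes(sector_bytes: int = SECTOR_BYTES) -> int:
    """Largest capacity addressable with 32-bit sector numbers."""
    return (2 ** MBR_LBA_BITS) * sector_bytes


if __name__ == "__main__":
    limit = mbr_max_bytes()
    print(limit / TIB)            # 2.0 (TiB)
    print(limit - SECTOR_BYTES)   # 2 TiB minus one 512-byte sector
```

GPT uses 64-bit sector numbers, which is why it removes this ceiling.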
You can check the current settings of your LUN (block size and so on) with the following command:
vmkfstools -Pv 10 /vmfs/volumes/newly-created-vmfs5/
What is the deal with alignment:
Sectors on the disk
I will skip the legacy history lesson, but the simple answer is that hard drives are divided up into sectors using a method called Logical Block Addressing (LBA). LBA assigns each sector of the carved-up disk a number, and those numbers are used to address specific locations. The original sector size is 512 bytes, but there has been a big movement toward 4096-byte sectors. The larger sector size provides some great efficiency, but for backward compatibility most storage still accepts emulated 512-byte reads and writes, which actually operate on 4096-byte sectors underneath.
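To make the emulation concrete, here is a small sketch that maps 512-byte logical sector numbers onto 4096-byte physical sectors and counts how many physical sectors a request touches. The function names are hypothetical and the sizes are the two mentioned above; a real drive hides all of this in firmware.

```python
# Sketch: 512-byte logical sectors emulated on top of 4096-byte physical sectors.
LOGICAL_BYTES = 512
PHYSICAL_BYTES = 4096


def lba_to_byte_offset(lba: int, sector_bytes: int = LOGICAL_BYTES) -> int:
    """Byte offset on the disk of a given logical block address."""
    return lba * sector_bytes


def physical_sectors_touched(lba: int, count: int) -> int:
    """How many 4096-byte physical sectors a request of `count` logical sectors spans."""
    start = lba_to_byte_offset(lba)
    end = start + count * LOGICAL_BYTES          # exclusive end offset
    first = start // PHYSICAL_BYTES
    last = (end - 1) // PHYSICAL_BYTES
    return last - first + 1


if __name__ == "__main__":
    # A 4KB request (8 logical sectors) starting at LBA 0 fits one physical sector.
    print(physical_sectors_touched(0, 8))    # 1
    # The same request starting at LBA 63 straddles two physical sectors.
    print(physical_sectors_touched(63, 8))   # 2
```

A request that only partly covers a physical sector forces the drive to read the whole sector, modify it, and write it back, which is where the cost of misalignment comes from.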
RAID on the disks
So now we have disks cut into 4096-byte sectors and we want to give them some redundancy, so we stripe or mirror data across multiple disks. On top of our 4096-byte sectors we put a data format carved into increments of 4KB to 256KB, depending on the array. Since all of these options are multiples of 4KB, we never need half a sector to do a read. See the diagram below for the correct alignment:
Now if your RAID set used a stripe size of 13KB, this would cause all kinds of problems. Reads and writes would cross sector boundaries and carve them up into all kinds of messes.
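The 13KB case can be checked with a few lines of arithmetic. This is only a sketch: it lays stripe chunks end to end from offset 0 and counts how many of them start off a 4KB sector boundary. The chunk count of 16 and the function name are arbitrary choices for illustration.

```python
# Sketch: how many RAID chunks start in the middle of a 4KB sector.
KB = 1024
SECTOR_BYTES = 4 * KB


def misaligned_chunk_starts(chunk_bytes: int, chunks: int = 16) -> int:
    """Count chunks whose starting offset is not a multiple of the sector size."""
    return sum(
        1 for i in range(chunks) if (i * chunk_bytes) % SECTOR_BYTES != 0
    )


if __name__ == "__main__":
    print(misaligned_chunk_starts(13 * KB))  # 12 of 16 chunks start mid-sector
    print(misaligned_chunk_starts(64 * KB))  # 0, every chunk starts on a boundary
```

With a 13KB chunk only every fourth chunk lands on a sector boundary, while any multiple of 4KB never splits a sector.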
The good news is that storage vendors know this and make sure their RAID stripe size is a multiple of 4KB to avoid it. What size does your array use? That is completely up to your storage vendor, so you need to ask them.
VMFS Again
So how does VMFS handle alignment? It's simple: it uses a 1MB alignment size. So now we have gone from 4KB on the disk, to 4KB through 256KB for RAID, to 1MB for VMFS. There is another part to this story, which is disk alignment. I will cover disk alignment in more detail in the Guest OS section; at this point, just know that VMFS takes care of the issue.
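A quick way to see why these layers stack cleanly is to check that each size divides evenly into the one above it. The sketch below assumes power-of-two RAID stripe sizes between 4KB and 256KB, which is an illustrative list and not any particular vendor's set of options.

```python
# Sketch: each layer's unit should divide evenly into the layer above it.
KB = 1024
MB = 1024 * KB

SECTOR_BYTES = 4 * KB
VMFS_BLOCK_BYTES = 1 * MB
# Assumption: hypothetical power-of-two stripe sizes in the 4KB-256KB range.
STRIPE_SIZES = [4 * KB, 8 * KB, 16 * KB, 32 * KB, 64 * KB, 128 * KB, 256 * KB]


def stacks_cleanly(lower_bytes: int, upper_bytes: int) -> bool:
    """True when the upper unit is a whole multiple of the lower unit."""
    return upper_bytes % lower_bytes == 0


if __name__ == "__main__":
    for stripe in STRIPE_SIZES:
        on_sectors = stacks_cleanly(SECTOR_BYTES, stripe)
        under_vmfs = stacks_cleanly(stripe, VMFS_BLOCK_BYTES)
        print(f"{stripe // KB:>3}KB stripe: sectors ok={on_sectors}, VMFS ok={under_vmfs}")
```

Every size in this list passes both checks, so a 1MB VMFS block never ends partway through a stripe or a sector as long as it starts on a boundary.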
Guest OS
Now we are really having fun. The fine people who created your operating system needed some space at the front of every drive to hold partition information. In their wisdom they reserved the first 63 sectors (0 through 62), so the first partition starts at sector 63. With 512-byte sectors that is an offset of 32,256 bytes, which is not a multiple of 4KB, and thus creates a misalignment with the 4KB sectors. This misalignment causes cross-boundary reads and writes like those described in the RAID section. So you need to offset your partition to start on a 4KB boundary, such as sector 64 or, to match VMFS, sector 2048 (1MB). Failure to do this can cause performance problems. In addition, what block size should you use in your OS? If given the choice, you do not want to go smaller than 1MB, since that's the size VMFS uses. You also want to make sure it's divisible by 4KB. So what should you use? That is really up to you and your operating system. On Linux I normally use 4MB block sizes. Recent versions of Windows (Server 2008 and later) do this on their own and align correctly.
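The arithmetic behind partition start sectors is easy to verify. Here is a minimal sketch, assuming 512-byte logical sectors, that checks a few common start sectors against the 4KB sector boundary and the 1MB VMFS boundary; the helper names are made up for this example.

```python
# Sketch: is a partition's starting sector aligned to 4KB and to 1MB?
# Assumption: start sectors are counted in 512-byte logical sectors.
KB = 1024
MB = 1024 * KB
SECTOR_BYTES = 512


def partition_offset_bytes(start_sector: int) -> int:
    """Byte offset of the partition from the start of the disk."""
    return start_sector * SECTOR_BYTES


def is_aligned(start_sector: int, boundary_bytes: int) -> bool:
    """True when the partition starts exactly on the given boundary."""
    return partition_offset_bytes(start_sector) % boundary_bytes == 0


if __name__ == "__main__":
    for start in (63, 64, 2048):
        print(
            f"sector {start:>4}: offset {partition_offset_bytes(start):>7} bytes, "
            f"4KB aligned={is_aligned(start, 4 * KB)}, "
            f"1MB aligned={is_aligned(start, 1 * MB)}"
        )
```

Sector 63 fails both checks, sector 64 lines up with 4KB sectors only, and sector 2048 lines up with both. Inside a Linux guest you can usually read the current start sector from the output of `fdisk -l`.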
What is the big deal? How much do I really gain through all this math?
Not a whole lot, really, but it's worth building your templates with these boundaries in mind. Remember that all of this applies to physical machines as well, minus the 1MB VMFS size. It's a best practice, and with virtual machines you have hundreds of misaligned reads and writes contending for the same sectors, which can really add up.