How do VMware Snapshots work

When I studied computer science it was not taught as a raw science. My training did not require knowledge of how transistors worked, or even logic circuits. It focused mostly on programming languages and how to configure a web server. Why? Because these were the skills most likely to be used by a computer scientist in the field today. Very few people build computers from scratch; Intel has a corner on that market. Personally I wanted to understand all the under-the-hood components, so I took a minor in electrical engineering. It was worth my time and a great learning experience. I find that a lot of technology is like this… which includes VMware snapshots. I have had snapshots explained to me in every VMware course I have attended, and every answer is different. I have cobbled together lots of KB articles and other sources into this article. If something is missing or incorrect, let me know so I can fix it.

What is a Snapshot?

  • A snapshot file is only a change log of the original virtual disk
  • A virtual machine uses the disk descriptor file to access the most current snapshot, not the original disk
  • It is not a complete copy of the original disk
  • Snapshots are combined dynamically with the original disk to form the current state of the system
  • The snapshot change logs are sometimes called delta files
  • Think of them as a chain: to get the complete picture you need all the links, in order (see the illustration below)
  • Snapshot files will grow forever unless deleted (re-integrated into the original disk)
  • Using a lot of snapshots can affect the performance of the virtual machine
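
As an illustration, a virtual machine with two snapshots might have a disk chain like this on its datastore (file names hypothetical):

myvm.vmdk / myvm-flat.vmdk                  <- base disk (descriptor / data)
myvm-000001.vmdk / myvm-000001-delta.vmdk   <- first snapshot delta
myvm-000002.vmdk / myvm-000002-delta.vmdk   <- second snapshot delta (active)

All writes land in the active delta; reads fall back down the chain until a link holding the requested block is found.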

How do snapshots work?

On the surface the process seems simple. When you initiate a snapshot, the following steps are followed:

  1. A request to create a snapshot (CreateSnapshot) for a virtual machine is forwarded to the ESXi host running the virtual machine.
  2. If a memory snapshot is included, the ESXi host writes the memory of the virtual machine to disk.
  3. If quiescing is possible, the ESXi host requests that the guest OS quiesce the disks via VMware Tools.
  4. The ESXi host updates the virtual machine's snapshot database (.vmsd file) to record the snapshot.
  5. The ESXi host calls a function to make changes to the child disks (-delta.vmdk files, via the .vmdk descriptor).
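
You can drive this same flow from the ESXi Shell. A quick sketch (the VM ID of 42 is hypothetical; look yours up with the first command), where the arguments to snapshot.create are name, description, include-memory, and quiesce:

vim-cmd vmsvc/getallvms
vim-cmd vmsvc/snapshot.create 42 pre-patch "before patching" 0 1

This takes a quiesced snapshot without memory; memory and quiesce are mutually exclusive options in the vSphere Client, so the same applies here.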

What is a .vmdk descriptor?

Due to the nature of file systems, operating systems don't like file names changing mid-access. So VMware uses descriptor files: small text files, similar to symbolic links, that point to the real data files. The virtual machine keeps opening the same descriptor, which allows a snapshot to be created while access continues uninterrupted.
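
For illustration, a trimmed-down descriptor for a first snapshot delta disk looks something like this (file names hypothetical); parentFileNameHint is the field that chains the delta back to its parent:

# Disk DescriptorFile
version=1
CID=fffffffe
parentCID=fffffffd
createType="vmfsSparse"
parentFileNameHint="myvm.vmdk"

# Extent description
RW 16777216 VMFSSPARSE "myvm-000001-delta.vmdk"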

How can I identify snapshots?

  • Select the Virtual Disk and check the Disk File. If it is labeled as VM-000001.vmdk, the virtual machine is running on snapshot disks.
  • Run the following command from ESXi Shell:
find /vmfs/volumes/ -iname "*.vmx" -exec grep -Hie "-[0-9][0-9][0-9][0-9][0-9][0-9].vmdk" {} \;
  • List currently open delta disks via command line:
ls -l /vmfs/devices/deltadisks
  • Locate all delta disks on file system:
find /vmfs/volumes/ -iname "*delta.vmdk"
  • In PowerCLI:
Get-VM | Get-Snapshot | Format-List

Active directory authentication fails with ESXi 5.5 Fresh install

I ran into this issue last week while upgrading several ESXi environments to 5.5. On these hosts I needed to move the boot-from-SAN partition, so I did fresh installs from the disks. Everything was great until I tried to log in to the ESXi host with domain credentials. (By default ESXi looks for members of an AD group called ESX Admins.) All AD authentication requests failed. At first I thought it was an issue with the HP version of ESXi from Sep. 2013:

VMware-ESXi-5.5.0-1331820-HP-5.71.3-Sep2013

Turns out it's an issue with VMware's base image, so every fresh install could have this issue. When you set up AD authentication a number of daemons are started:

netlogond, lwiod, and lsassd

The first time they run, they create a number of directories and write PID files in /var/lock/subsys. Unfortunately this is where the error lies: the /var/lock/subsys directory does not exist. It was missed in the base image. I confirmed the issue still exists on the latest patch set as well. So here is the workaround:

Log in to ESXi via SSH or the vMA as root (since you cannot log in with your domain account):
mkdir /var/lock/subsys
/etc/init.d/netlogond restart; /etc/init.d/lwiod restart; /etc/init.d/lsassd restart;
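
One caveat: much of /var on ESXi lives in a ramdisk, so the directory may not survive a reboot. If it disappears on you, one possible way to make the workaround stick (a sketch, assuming the stock /etc/rc.local.d/local.sh boot script on 5.x) is to recreate the directory at boot:

# add above the final "exit 0" in /etc/rc.local.d/local.sh
mkdir -p /var/lock/subsys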

And test AD login… all should be good.  VMware should post a KB article in the near future.

UPDATE:

VMware has posted the KB : http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2075398

They state it's not an issue beyond the display of the message, and that if you refresh the status everything is running. Personally, my experience is you need the fix to make it work… but that's just me.

VMware Network IO Control

Network IO Control has been around since 4.1, but I mostly ignored it. We would mostly use HP's Virtual Connect to divide up 10Gb connections into smaller NICs. As I spend more time designing solutions I have found Network IO Control to be my best friend. It allows you to get the most out of a 10Gb connection at all times. The concept is simple: each type of network traffic is allocated a share (a number between 1 and 100). Network IO Control comes with some predefined traffic classes that are automatically assigned by vSphere:

  • vMotion
  • iSCSI
  • FT Logging
  • Management
  • NFS
  • Virtual machine traffic

In addition you can create your own user-defined classes of traffic, which can be assigned at the port group level. You need to keep a few things in mind when working with Network IO Control:

  • Network IO Control is evaluated at the dvUplink level (shares are per network uplink)
  • Network IO Control requires a vNetwork Distributed Switch (vDS)
  • Network IO Control shares are only used when contention is present (in other words, each type of traffic gets 100% of its requested bandwidth unless there is contention)
  • When evaluating shares, only active traffic is taken into account (for example, if you have NFS at 30, virtual machine traffic at 100, and vMotion at 50 but you don't use NFS, then you only have 150 shares dividing your 10Gb – see the worked numbers below)
  • Network IO Control only applies to outbound flow – it cannot do anything about shaping inbound flow
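
To put numbers on that share math, using the hypothetical values above: with NFS idle, 150 shares are active on the uplink, so under contention virtual machine traffic is entitled to 100/150 of the pipe (roughly 6.7Gb of a 10Gb uplink) and vMotion to 50/150 (roughly 3.3Gb). If NFS traffic starts flowing, all 180 shares become active and virtual machine traffic drops to 100/180, or roughly 5.6Gb.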

In addition, Network IO Control offers two more features:

  • Limits – just like CPU or memory limits (these let you make the customer think they have 10Gb for their virtual machine while never allowing them more than 1Gb) – I would avoid limits unless you have an odd use case
  • Load-Based Teaming – this feature moves traffic flows to an additional uplink once an uplink reaches 75% of capacity over a 30-second period. This is hands down the best load balancing option with VMware.

Network IO Control provides some awesome features that you should play with. You can read an older but still valid white paper here.

VMware technical white papers

I have always been a sucker for technical white papers. Most companies make them hard to find and search, and they also require you to sign up for an account. VMware provides them all via a searchable interface, which is great. I was reading the new technical white paper about how to use OpenStack with VMware, and that led me to their repository of white papers. Check out this interface:

https://www.vmware.com/resources/techresources/cat/125

I just wish I had more hours in my day.

New free VMware course: Network Virtualization Fundamentals

As VMware jumps into the world of network virtualization, they are ramping up their education services. They have released their first course in network virtualization, and it's free. It covers the following topics:

1. Why we need network virtualization
2. What VMware defines as network virtualization
3. What VMware products provide network virtualization

Much like the other free courses, it's 50% marketing and 50% product awareness. I assume they will soon release a VCA-level certification for network virtualization to complement the course. You can sign up for the free course here:

http://mylearn.vmware.com/mgrreg/courses.cfm?ui=www_edu&a=one&id_subject=52533

ESXi purple screen of death with errors E1000PollRxRing and E1000DevRx

Everyone loves the purple screen of death right?

This is a nasty problem that I ran into. It happens with Windows virtual machines using the E1000 adapter and is a known issue. Since it has been a known issue since the 5.0 days, I doubt VMware is going to resolve it. The only fix is to replace the E1000 adapters with VMXNET3 adapters. You really should be using VMXNET3 adapters anyway, but replacing them requires downtime. It's a pain. You can read the official VMware article here:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2059053
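
If you want to hunt down virtual machines still configured with E1000 adapters before scheduling the downtime, here is a quick sketch in the same spirit as the snapshot-hunting command earlier (it matches e1000 and e1000e, but only when the adapter type is set explicitly in the .vmx; a VM using the guest-OS default adapter may omit the virtualDev line entirely):

find /vmfs/volumes/ -iname "*.vmx" -exec grep -Hi 'ethernet[0-9]\.virtualDev = "e1000' {} \;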