March 2015 – Joseph Griffiths

March 28, 2015

Storage vMotion and Changed Block Tracking are now friends

A lot of readers may be aware that Storage vMotion is an awesome feature that I love. It had one nasty side effect: it reset Changed Block Tracking (CBT). This is a huge problem for any backup product that uses CBT (Veeam, or anything else that does an image-level backup). It means that after a Storage vMotion you have to do a full backup. It’s painful and wasteful. It also means that most enterprise environments refuse to use automated Storage DRS moves (one of my favorite features) because of their impact on backups. Well, the pain is now over… and has been for a while; I just missed it :). If you look at the release notes for ESXi 5.5 U2 you will find the following note:

Changed Block Tracking is reset by storage vMotion
Performing storage vMotion operation on vSphere 5.x resets Change Block Tracking(CBT).
For more information, see KB 2048201

This issue is resolved in this release.

I wish it told us more about the change or why CBT is no longer reset, but I guess I’ll accept it as is. The previous workaround was to avoid Storage vMotion or do a full backup afterwards… which is not a workaround but an effect. Either way, enjoy moving the world again.
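To see why losing the change map hurts so much, here is a toy model in Python. It is only a sketch: the class and its methods are made up for illustration and have nothing to do with how VMware actually implements CBT. The point is simply that a backup job without a valid change map has no choice but to read every block.

```python
class ToyDisk:
    """Toy virtual disk that remembers which blocks changed since the last backup."""

    def __init__(self, num_blocks):
        self.blocks = [0] * num_blocks
        self.changed = set()          # the "change map"
        self.tracking_valid = False   # no baseline until the first full backup

    def write(self, index, value):
        self.blocks[index] = value
        self.changed.add(index)

    def reset_tracking(self):
        """Throw the change map away, as a storage move did on affected builds."""
        self.changed.clear()
        self.tracking_valid = False

    def backup(self):
        """Return the block indexes a backup job would have to read."""
        if self.tracking_valid:
            to_copy = sorted(self.changed)            # incremental
        else:
            to_copy = list(range(len(self.blocks)))   # full backup
        self.changed.clear()
        self.tracking_valid = True
        return to_copy


disk = ToyDisk(num_blocks=8)
first = disk.backup()          # full: there is no baseline yet
disk.write(3, 42)
incremental = disk.backup()    # only the block that changed
disk.write(5, 7)
disk.reset_tracking()          # the change map is lost
after_reset = disk.backup()    # full again, even though one block changed
```

Scale the eight blocks up to a multi-terabyte VM and the cost of that last backup is obvious.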

March 22, 2015

Free Central Ohio Lunch and Learn session on HA and DRS April 1st

I have started up the Lunch and Learn sessions again. We have sign-ups and a schedule. I ask that you sign up so I can notify you if a session has been canceled. Your information will not be shared with anyone at all. You can sign up here:

goo.gl/JC6HlK

I will be conducting the discussion around HA and DRS on April 1st.

All sessions will be at:

Columbus Public Library Driving Park location

Small Meeting Room 2

1422 E Livingston Ave

Columbus, OH 43205

I am still looking for people willing to present on the other topics so please contact me if interested.

March 22, 2015

Who is the conductor in virtualization?

Disclaimer: This is a rant about technology. I will return you to your normal technical posts soon.

Recently I attended the Columbus Symphony with my daughter. She has an interest in music and I want to encourage it. As I was sitting in the theater watching the performers a few things struck me:

Redundancy
Scale out
Unity
The unknown problem
The role of the conductor

Redundancy

As you watch the symphony play there are many different instruments, each with individual functions. For each piece of sheet music there are two performers. While playing, if a page needs to be turned, one of the two performers stops playing and turns the page. At the next page turn the other performer takes a turn. This automated and orderly way to perform offline duties reminds me of infrastructure. We are constantly looking to remove single points of failure. We want to create redundancy so that if maintenance or failure occurs (page turning or broken strings) the whole performance is not affected. Failure does not happen often because each performer should do regular maintenance on their individual instrument. This redundancy is critical to a well-running infrastructure. We must be able to perform regular maintenance and stay redundant when a failure occurs. During periods when the pages are being turned it’s possible our infrastructure will not be at full strength; this is where scale out comes into play.

Scale out

It is easy to see how two people playing an instrument is not enough to provide the required volume and power for the performance. This is where we see the principle of scale out in play. I can add as many two-person violin groups as I need to produce the required volume. Adding more violins should be possible to meet the demands of the location or song. The challenge with scaling out is three-fold:

Expertise required
Management demands of scaling out
Balancing the needs as a whole

Expertise

In order to fill more chairs I need more highly trained performers. In infrastructure terms I need more specialized devices that are compatible. I cannot simply plug a kazoo player into a violin section and expect beautiful music; compatibility and skills are required. This tenet applies to all aspects of infrastructure.

Management demands of scaling out

As I scale out I quickly find it hard to manage so many people. Simply put, unless I can manage 100 people exactly as I manage 1, there is a cost associated. This is where scale out computer solutions have the advantage: assuming you buy a solution from the same vendor, we hope the pieces can be managed as one entity. I have found that vendors’ solutions don’t seem to have this level of intelligence. VMware has brought us vSphere, which does abstract and pool compute resources. It seems that a lot of storage and networking vendors have not discovered how to scale out without making it hard to manage.

Balancing the needs as a whole

Adding more violins increases the volume of my violins but may drown out all other instruments except the drums. This is not a desired effect. Adding more violins may require more of the other instruments to match the new scale. This is a very hard thing to balance. In storage systems we need balancing acts between IOPS, cache sizes, algorithms and spinning disks. In networking we see total throughput, hairpinning and redundant architectures all affecting our scale up. In compute we have the introduction of server-side flash and cache with the needs of the application as a whole. One cannot simply increase one metric without looking at its effect upon the whole.

Unity

In the symphony they all have a common goal. They know that goal from the start; they have practiced and trained for that goal (QA testing and programming logic). They require that all components do their job in unity to correctly achieve the goal. If one section of instruments is a few seconds off from the rest the performance is ruined (at least for those that can tell the difference). Their unity and timing are critical. Humans are prone to mistakes and they will happen. Performers will get out of sync and need to catch up. Infrastructure is the same way. If my networking chooses to delay a message for a few seconds everything else is affected. We need all the components to work perfectly every time. This is harder than it sounds. Computers run programs and cannot account for anything that was not provided in their program.

The Unknown Problem

Here is the big problem. It’s what we don’t know that will kill us. In religion there is the concept of absolute truth and relative truth. Absolute truth is truth based upon all the facts. The concept is that if we understood everything we could always make the right choice. We would be able to be perfect and create without failures. Religion is largely based upon following a being that has absolute truth. Relative truth is truth based upon our current understanding until proven incorrect; think the world is flat… now it’s roundish. Relative truth is the world that we deal with each day. In the performance, assumptions can be made about the required number of performers based on the size of the hall or past experience with the hall. Best practices around performance sizing can be made. The assumptions are just that: assumptions. They cannot take into account all possible eventualities. Disaster may strike, like a floor or roof caving in, or something as simple as an accident outside the theater may cause an ambulance’s siren to ring for half the performance. These factors are unknowns and are common. When writing code in college I often had my wife test out the application. It normally took her about 15 seconds to do something totally unexpected (by me) and break everything. It was so frustrating. Users and applications will do the unexpected. There are a lot of unknowns, like the effect of a lightning-induced power failure on your storage system (it’s not good, trust me). The unknown requires that we keep an open mind and adjust as needed. All IT is software defined. It does not matter if it’s on a chip or running in memory, it’s software defined. Firmware is software that runs on hardware. The critical concept for me from software defined IT is the ability to have intelligence and agility. I love the story about the last Google outage:
We had a bug introduced into our production networking and it was detected and removed automatically by the software. Can anyone else say awe-some, followed up by I am afraid of Skynet? (For non-US readers, Skynet is an A.I. from the Terminator movie series that tried to kill all humans.) This is intelligent and agile. The latest movement to define in software should provide quicker redundant fixes to the unknown problem.

The role of the conductor

The conductor’s role is to unify the performers, set a tempo, execute clear preparations and beats, and to listen critically and shape the sound of the ensemble (Wikipedia). So he is the big boss man who keeps the whole ship running perfectly. He could be called the architect but it’s simply untrue. The architect is a person who works with relative truth, old truth and observed truth. In order to understand my problem with the architect being called the conductor I have to illustrate another challenge: the music changes. The symphony plays a song then changes to the next song. Their roles and goals change. Violins may have had a heavy role in the last song and a very minor role in this song, making that scale out of performers unnecessary. The game is constantly changing, and each song has its own challenges. Infrastructure has a much larger problem: there is no common goal. Take for example that I am running 200 virtual machines. Each virtual machine has a different role and different needs. They are like 200 garage bands playing at the same time. No amount of conducting can solve the lack of similar goals. It will sound really bad or at least really loud. Each application really needs its own conductor and space. They need to be able to get access to resources in an intelligent way without affecting other applications.

Who is the conductor?

Like it or not, each of our applications is its own conductor. Treating them as a single entity with the same metrics is only asking for trouble. We have been given a number of tools in the compute arena to manage individual applications, like reservations, DRS, SDRS, NIOC etc. This allows the conductor of vSphere to understand some metrics around our little bands. This knowledge is even automated from time to time to make our life easier, for example with DRS. This understanding of our applications ends at the compute layer. Storage and networking treat everyone the same. There have been some inroads into this problem: QoS and IOPS allocations. At the end of the day storage systems want to deal with writes and reads, the network wants to deal with transfer of data, and neither wants to be intelligent about the 200 applications running on those ESXi hosts. When I provisioned storage to a single server it was easy. Now I provision storage to potentially 32 servers running 4,000 little bands. I need a master conductor, I need agility, I need to scale, I need unity, I need something that allows my applications to be their own conductors and most of all I need intelligence. I need all these things to work together in concert at my individual operating system layer. I need virtualized networking and storage. I need the same magic VMware brought with ESXi to the other realms. This post is not a slam on vendors; they do an awesome job and I geek out on their stuff every day. This is not easy or it would already be done. There are vendors out there doing parts of this today. We need to find them and support them to bring change.

March 12, 2015

Sending vCenter Alarms via SNMP: Challenges with hidden event names

I have been working with vSphere to get internally generated alarms sent to an SNMP trap receiver for ticket generation. This process seemed simple on the surface but proved quite challenging. The high level steps are as follows:

Choose which vCenter Alarms need to be ticketed
Configure the Alarms to send SNMP events to SNMP trap
Download the vSphere MIB and install in SNMP Trap
Configure the Alarm actions on SNMP Trap
Tickets get opened

Choose which vCenter Alarms need to be ticketed

You can get a list of current alarms with PowerCLI as follows:

Get-AlarmDefinition

Configure Alarms to send SNMP

vCenter must be configured to use SNMP with the following lines:

# Replace with the name of your vCenter server
$srv = "vcenterservername"

Get-AdvancedSetting -Entity $srv -Name snmp.receiver.2.community | Set-AdvancedSetting -Value public
Get-AdvancedSetting -Entity $srv -Name snmp.receiver.2.enabled | Set-AdvancedSetting -Value $true
Get-AdvancedSetting -Entity $srv -Name snmp.receiver.2.name | Set-AdvancedSetting -Value 192.168.1.10

The following will add SNMP to the alarms:

Get-AlarmDefinition -Name "Alarm1" | New-AlarmAction -Snmp

Configure the Alarm Action in SNMP Trap

I ran into a number of issues that generated this community post. The essential issue is that all SNMP events generated by vSphere come in as the same type of event, vpxaAlarmInfo. The details of the event contain an internal name. This is where the problem begins. The name for any custom created alarm is simply the name you gave it. For example, if I create an alarm called JoeTest then it’s called JoeTest. Sounds simple, right? Well… no, because the VMware built-in alarms don’t follow this naming convention. The Host connection and power state alarm (the easiest one for me to generate) is named alarm.HostConnectionStateAlarm. This makes my mappings for any VMware generated events very hard. So I went on a quest to locate these names.
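The mapping problem on the trap receiver side can be sketched in a few lines of Python. This is an illustration only: the queue names and both function names are made up, and your SNMP trap software will have its own way of expressing the same lookup. It assumes the alarm name has already been pulled out of the trap details.

```python
# Hypothetical mapping from the alarm name carried in the trap to a ticket queue.
TICKET_QUEUES = {
    "alarm.HostConnectionStateAlarm": "virtualization-oncall",  # built-in alarm
    "JoeTest": "lab",                                            # custom alarm
}


def classify_alarm(name):
    """Built-in alarms arrive with an internal alarm.* name, custom ones as typed."""
    return "built-in" if name.startswith("alarm.") else "custom"


def route_trap(name, default_queue="unmapped"):
    """Pick a ticket queue for the alarm name found in a trap."""
    return TICKET_QUEUES.get(name, default_queue)


route_trap("alarm.HostConnectionStateAlarm")   # routed by its internal name
route_trap("Host connection and power state")  # display name: falls to the default
```

The second call shows the trap: if you key your mappings on the display name you see in the vSphere client, built-in alarms never match.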

The Quest for the names

My first stop was PowerCLI using the command:

$bob = Get-AlarmDefinition -Name "Host connection and power state"

$bob | fl

This fine PowerShell did not produce the alarm.HostConnectionStateAlarm name. It did produce an Alarm-145 (unique to my vCenter). I tried lots of ways to work on this object, like Get-View etc… without any luck.

My next stop was the MOB (Managed Object Browser), also known as my least favorite place. Using the following MOB URLs I was able to learn everything about the alarm except the name for SNMP:

https://vcenter/mob/?moid=alarm-145

https://vcenter/mob/?moid=alarm-145&doPath=info

https://vcenter/mob/?moid=alarm-145&doPath=info.action.action

https://vcenter/mob/?moid=alarm-145&doPath=info.expression.expression

https://vcenter/mob/?moid=alarm-145&doPath=info.setting
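If you need to walk the same paths for several alarms, the URLs above follow a simple pattern. Here is a small Python sketch that builds them; the function name is mine, and the host name and alarm ID are just the ones from my examples, so substitute your own.

```python
from urllib.parse import urlencode


def mob_url(vcenter, moid, do_path=None):
    """Build a Managed Object Browser URL for a managed object ID."""
    params = {"moid": moid}
    if do_path:
        params["doPath"] = do_path
    return f"https://{vcenter}/mob/?{urlencode(params)}"


# The property paths visited above; None means the object itself.
PATHS = [None, "info", "info.action.action", "info.expression.expression", "info.setting"]
urls = [mob_url("vcenter", "alarm-145", path) for path in PATHS]
```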

This led me to my last stop, the vCenter database. Some finely crafted searches produced a number of tables with the alarm.xxx information. I was left with the VPX_EVENT_ARG table. It seems to be a table of all events in the system. Inside this I was able to locate names that seemed to fit. A few more minutes did not produce any primary keys to link to the alarm tables. I was stuck, so I punted. The following is the SQL command I used to produce the alarm names:

select distinct OBJ_NAME from [vCenter].[dbo].[VPX_EVENT_ARG] where obj_name like '%alarm%'

It produced the following built in Alarm names:

alarm.BatteryHealthAlarm
alarm.BMCHealthAlarm
alarm.ConsistencyGroupViolation
alarm.DatastoreDiskUsageAlarm
alarm.DatastoreInMultipleDatacenters
alarm.DatastoreStorageComplianceAlarm
alarm.ExitStandbyErrorAlarm
alarm.FanHealthAlarm
alarm.HAcannotFindMaster
alarm.HAfailoverFailed
alarm.HAfailoverInProgress
alarm.HAhostStatus
alarm.HAinsufficientFailoverResources
alarm.HAvmMonitoringAction
alarm.HAvmMonitoringError
alarm.HealthStatusChangedAlarm
alarm.HostConnectionStateAlarm
alarm.HostConnectivityAlarm
alarm.HostCPUUsageAlarm
alarm.HostErrorAlarm
alarm.HostEsxCosSwapAlarm
alarm.HostLicenseEditionNotAllowed
alarm.HostMemoryUsageAlarm
alarm.HostVendorProviderRegistrationAlarm
alarm.IormNonVIWorkloadAlarm
alarm.LicenseCapacityExceededAlarm
alarm.LicenseError
alarm.LicenseNonComplianceAlarm
alarm.LicenseUserThresholdExceededAlarm
alarm.LunCapabilityAlarm
alarm.MemoryHealthAlarm
alarm.MigrateBindToVMKAlarm
alarm.MigrationErrorAlarm
alarm.NetworkConnectivityLostAlarm
alarm.NetworkRedundancyDegradedAlarm
alarm.NetworkRedundancyLostAlarm
alarm.OtherHealthAlarm
alarm.PowerHealthAlarm
alarm.ProcessorHealthAlarm
alarm.SELHealthAlarm
alarm.SiocNotSupportedHostAlarm
alarm.StorageConnectivityAlarm
alarm.StorageHealthAlarm
alarm.StoragePodOutOfSpace
alarm.StoragePodSDRSNotSupportedHost
alarm.StoragePodSDSRecommendation
alarm.SystemBoardHealthAlarm
alarm.TemperatureHealthAlarm
alarm.ThinProvisionedLunAlarm
alarm.VCHealthStateChangedAlarm
alarm.VdsHCMTUMatchAlarm
alarm.VdsHCMTUSupportedAlarm
alarm.VdsHCTeamingMatchAlarm
alarm.VdsHCVlanTrunkedAlarm
alarm.VFlashResourceHealthAlarm
alarm.VFlashResourceUsageAlarm
alarm.VmCPUUsageAlarm
alarm.VmDiskConsolidationNeededAlarm
alarm.VmErrorAlarm
alarm.VmFaultToleranceLatencyStatusAlarm
alarm.VmFaultToleranceStateChangedAlarm
alarm.VmMemoryUsageAlarm
alarm.VmNoCompatibleHostForSecondaryAlarm
alarm.VmStorageComplianceAlarm
alarm.VmTimedoutStartingSecondaryAlarm
alarm.VoltageHealthAlarm
alarm.VsanClusterLicenseExpiryAlarm
alarm.VsanHostSsdOverUsageAlarm

Testing two additional events confirmed I was on the correct track.
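If you want to see what the query is doing without touching a production vCenter database, here is a Python sketch against an in-memory SQLite table. The table below is a stand-in I made up: only the OBJ_NAME column comes from my query, and the EVENT_ID column and the rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for VPX_EVENT_ARG; the real table has more columns than this.
conn.execute("CREATE TABLE VPX_EVENT_ARG (EVENT_ID INTEGER, OBJ_NAME TEXT)")
conn.executemany(
    "INSERT INTO VPX_EVENT_ARG VALUES (?, ?)",
    [
        (1, "alarm.HostConnectionStateAlarm"),
        (2, "alarm.HostConnectionStateAlarm"),  # same alarm fired twice
        (3, "alarm.VmCPUUsageAlarm"),
        (4, "esx01.example.local"),              # an event argument that is not an alarm
    ],
)

# DISTINCT collapses repeated events into one name per alarm.
rows = conn.execute(
    "SELECT DISTINCT OBJ_NAME FROM VPX_EVENT_ARG "
    "WHERE OBJ_NAME LIKE '%alarm%' ORDER BY OBJ_NAME"
).fetchall()
alarm_names = [name for (name,) in rows]
conn.close()
```

Keep in mind the limitation this implies: an alarm that has never fired in your environment has no event rows, so it will not show up in the list.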

End Result

Yep, I don’t have a clue how they link, but it produced a list that seems to work and I am using it. I hope it helps you save some time.

March 12, 2015

Enable Stateless Cache on Auto Deploy

Auto Deploy Really?

Yes, a big Auto Deploy post will be following soon. I can really see the benefit of Auto Deploy in larger environments. I’ll be posting the architectural recommendations and failure scenarios soon. Today I am posting about stateless cache and USB.

What is stateless cache and why do I care?

Stateless cache allows your auto deployed ESXi host (TFTP image running in memory) to be installed on a local hard drive. This enables you to boot the last running configuration without the presence of the TFTP server. It’s a really good protection method. It is enabled by editing the host profile and in 5.5 it can be enabled using the fat client:

Select the profile and right click on it
Select Edit
Expand System Image Cache Configuration
Click on System Image Cache Profile Settings
Select the drop down and choose the stateless caching mode you want.

This all sounds great, but we had a heck of a time trying to get it to stateless cache to SD cards on our UCS gear. A coworker discovered that SD cards are seen as USB devices. Once we selected “Enable stateless caching to a USB disk on the host” everything worked.

Design Constraints

Using stateless caching will protect you against a failure of TFTP and even vCenter but DHCP and DNS are both still required for the following reasons:

DHCP to get IP address information
DNS to get hostname of ESXi host

Stateless does not remove all dependencies but it does allow quick provisioning.
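The dependency picture above can be written down as a tiny Python check. It is a simplification of what I described, not a model of the real boot process: the service names and the function are made up, and it only encodes the rule that DHCP and DNS are always needed while TFTP and vCenter are only needed when there is no cached image.

```python
def can_boot(available, stateless_cache):
    """Return (ok, missing_services) for a host booting with the given services up."""
    required = {"dhcp", "dns"}              # always needed
    if not stateless_cache:
        required |= {"tftp", "vcenter"}     # needed to pull the image over the network
    missing = required - set(available)
    return (not missing, sorted(missing))


can_boot({"dhcp", "dns"}, stateless_cache=True)    # boots from the cached image
can_boot({"dhcp", "dns"}, stateless_cache=False)   # stuck without TFTP and vCenter
```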

March 4, 2015

VCIX-NV Study Guide Objective 1.2

To see other posts in this series go here.

This section deals with upgrading from older versions of vShield to NSX. The simple answer is there is a specific order that must be followed. Upgrades from vShield require version 5.5. Most of it is done in the GUI via vCenter, except for the vShield Manager, which will be replaced by NSX Manager. Most of these processes roughly follow the documented process in this document.

Product name translation:

Roughly here are the old names to new names or new service providing function:

vShield Manager -> NSX Manager

Virtual Wires -> NSX Logical Switch

vShield App -> NSX Firewall

vShield Edge -> NSX Edge

vShield Endpoint -> vShield Endpoint

Data Security -> NSX Data Security

Practicing this process:

Unless you want to take a few hours configuring all vShield products it’s hard to practice. You can do the upgrade from vShield Manager to NSX manager really quickly. Just download the vShield Manager and setup with the following:

Deploy OVF
Power on
Console login as admin with password of default
type enable with password of default
type setup
Setup your IP settings
Wait 5 minutes
Login via IP with web browser and do upgrade

The rest of the upgrade requires you understand vShield products which is not required for NV so I vote you skip it and be familiar with process, order and requirements.

Objective 1.2 Denotes the following items:

Upgrade vShield Manager 5.5 to NSX Manager 6.x.

Upgrading vShield Manager to NSX Manager can only be done from version 5.5 of vShield. It also requires the following things:

vCenter 5.5
vShield Data Security uninstalled
vShield Edges be upgraded to 5.5
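For study purposes, the requirements above can be turned into a small pre-check. This is a Python sketch with made-up function names and inputs; it does not talk to vShield or vCenter, you feed it the versions yourself. I am treating the vCenter requirement as a minimum of 5.5, which is my assumption.

```python
def major_minor(version):
    """Turn a version string such as '5.5.3' into the tuple (5, 5)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def manager_upgrade_blockers(manager_version, vcenter_version,
                             data_security_installed, edge_versions):
    """List the reasons a vShield Manager to NSX Manager upgrade should not start."""
    blockers = []
    if major_minor(manager_version) != (5, 5):
        blockers.append("vShield Manager must be at 5.5")
    if major_minor(vcenter_version) < (5, 5):   # assumption: 5.5 is a minimum
        blockers.append("vCenter must be at 5.5")
    if data_security_installed:
        blockers.append("vShield Data Security must be uninstalled")
    for name, version in sorted(edge_versions.items()):
        if major_minor(version) != (5, 5):
            blockers.append(f"vShield Edge {name} must be upgraded to 5.5")
    return blockers


ready = manager_upgrade_blockers("5.5.3", "5.5", False, {"edge-1": "5.5.2"})
not_ready = manager_upgrade_blockers("5.1.2", "5.5", True, {"edge-1": "5.1"})
```

An empty list means you are clear to start the process below.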

Process:

Download the NSX upgrade bundle called vCNS to NSX for vSphere xxx Upgrade Bundle
Login to vShield Manager and click Settings & Reports
Click Updates tab and click upload upgrade bundle
Click Choose file Browse to the vCNS to NSX for vSphere xxx Upgrade Bundle and click open
Click Upload file – this process will take some time
Click Install to begin the upgrade process
Click confirm install – this will reboot the vShield manager – none of the other components are rebooted
After upgrade visit the ip address of your vShield manager again via https
Login and look at summary page to confirm you are running NSX Manager
Log off all windows and close your browser to clear cache
Login to vSphere Web client
Shut down your NSX Manager VM and increase memory to 12GB and vCPU to 4

Upgrade NSX Manager 6.0 to NSX Manager 6.0.x
Upgrade Virtual Wires to Logical Switches

Virtual wires must be upgraded to NSX logical switches to use NSX features. The process is required even if you don’t use virtual wires. In order for this to work you need to upgrade your vShield manager to NSX manager and make sure it’s connected to vSphere.

Process

Login to Web client
Networking & Security tab, click Installation
Click Host Preparation
Virtual wires will show as Legacy
Click update on each wire
Wait for them to show green and no longer legacy

Upgrade vShield App to NSX Firewall

You can only upgrade vShield App 5.5 to NSX. It requires that vShield manager be upgraded to NSX manager and virtual wires upgraded to NSX logical switches.

A pop up window should ask if you want to upgrade
Click update and wait
Done

Upgrade vShield 5.5 to NSX Edge 6.x

This upgrade requires the following:

vShield 5.5
NSX Manager
Virtual wires upgraded to NSX logical switches

Processes:

Login to web client
Networking & Security tab
NSX Edges button
Select upgrade version from actions menu on each edge
After complete, check the version number tab

Upgrade vShield Endpoint 5.x to vShield Endpoint 6.x

This upgrade requires the following:

vShield Manager upgraded to NSX Manager
Virtual wires upgraded to NSX Logical switches

Process:

Login to web client
Networking & Security tab
Click Installation
Click Service deployments tab
Click on upgrade available
Select datastore (must be shared) and network and ok

Upgrade to NSX Data Security

There is no clean upgrade path. You have to remove Data Security before the install of NSX Manager, and then re-register the solution with NSX if available.
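Since the exam cares about order, it can help to write the prerequisites from this objective down as a dependency graph and let Python sort them. This is a study aid, not a VMware tool: the step names are my own shorthand for the sections above, and it needs Python 3.9 or later for graphlib.

```python
from graphlib import TopologicalSorter

# Each step maps to the steps that must be finished before it.
PREREQS = {
    "NSX Manager": {"Remove vShield Data Security"},
    "Logical Switches": {"NSX Manager"},
    "NSX Firewall": {"NSX Manager", "Logical Switches"},
    "NSX Edge": {"NSX Manager", "Logical Switches"},
    "vShield Endpoint": {"NSX Manager", "Logical Switches"},
    "NSX Data Security": {"NSX Manager"},
}

# static_order() returns one valid order; steps with no dependency
# between them (for example Firewall and Edge) may come out either way.
order = list(TopologicalSorter(PREREQS).static_order())
for step_number, step in enumerate(order, start=1):
    print(step_number, step)
```

The thing to remember is what the graph makes obvious: Data Security comes off first, the Manager goes before everything else, and the logical switches go before the Firewall, Edge and Endpoint upgrades.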