Deep Dive: vSphere Network Load Balancing

In vSphere, load balancing is a hot topic. As the load per physical host increases, so does the need for more bandwidth. Traditionally this was done with etherchannel or LACP, which bond multiple links together so they look and act like a single link. This helps avoid loops.

What the heck is a loop?

A loop occurs any time two layer 2 (ethernet) endpoints have multiple connections to each other.

 

It is possible to create a bridged loop with two virtual switches if care is not taken. Virtual switches by default will not create loops. On the physical switch side, protocols like spanning tree (STP) were created to solve this issue. STP disables a link if a loop is detected. If the enabled link goes down, STP turns on the disabled link. This process works for redundancy but does nothing if a single link is not a big enough pipe to handle the full load. VMware has provided a number of load balancing algorithms to provide more bandwidth.

Options

  • Route Based on Originating virtual port (Default)
  • Route Based on IP Hash
  • Route Based on Source MAC Hash
  • Route Based on Physical NIC Load (LBT)
  • Use Explicit Failover Order

 

To explain each of these options, assume we have an ESXi host with two physical network cards called nic1 and nic2. It’s important to understand that the load balancing options can be configured at the virtual switch or port group level, allowing different load balancing policies on the same server.

Route Based on Originating virtual port (Default)

The physical nic to be used is determined by the ID of the virtual port to which the VM is connected. Each virtual machine is connected to a virtual switch which has a number of virtual ports, and each port has a number. This number is the virtual port ID. Once assigned, the port does not change unless the virtual machine moves to another ESXi host. I don’t know the exact method used, but I assume it’s something as simple as odds and evens for two nics: everything odd goes to uplink 1 while everything even goes to uplink 0. This method has the lowest virtual switch processing overhead and works with any network configuration. It does not require any special physical switch configuration. You can see, though, that it does not really load balance. Let’s assume you have a lot of port groups with only one virtual machine each, on port 0. In this case all virtual machines would use the same uplink, leaving the other unused.
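The odds-and-evens guess can be written as a simple modulo over the port number. This is only a sketch of that guess, not VMware's actual algorithm; the function name `uplink_for_port` and the uplink names are made up for the example.

```python
def uplink_for_port(port_id: int, uplinks: list[str]) -> str:
    """Pick an uplink from the virtual port ID (hypothetical odd/even rule)."""
    if not uplinks:
        raise ValueError("at least one uplink is required")
    # Assumption: a plain modulo, so with two nics even port IDs use the
    # first uplink in the list and odd port IDs use the second.
    return uplinks[port_id % len(uplinks)]


if __name__ == "__main__":
    nics = ["nic1", "nic2"]
    # Four virtual machines that all sit on port 0: every one of them
    # maps to the same uplink and the other nic stays idle.
    for port_id in [0, 0, 0, 0]:
        print(port_id, uplink_for_port(port_id, nics))
```

The point of the sketch is that the port number alone decides the uplink, so the spread is only as even as the port numbers happen to be.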

Route Based on IP Hash

The physical nic to be used is determined by a hash of the source and destination IP addresses. This method provides load balancing across multiple physical network cards from a single virtual machine. It’s the only method that allows a single virtual machine to use the bandwidth of multiple physical nics. It has one major drawback: the physical switches must be configured to use etherchannel (802.3ad link aggregation) so they present both network links as a single link. This is a major design choice. It also does not provide perfect load balancing. Let’s assume that you have an application server that does 80% of its traffic with a database server. Their hash will always assign them the same link, so their communication will always happen across that link and will never use the bandwidth of two links. In addition, this method uses a lot of CPU.

  • When using etherchannel only a single switch may be used
  • Beacon probing is not supported on IP Hash
  • vDS is required for LACP
  • Troubleshooting is difficult because each destination/source combination may take a different path.  (Some virtual machine paths may work while others will not, in an inconsistent pattern.)
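To see why one application/database pair never spreads across links, here is a minimal sketch of an IP hash. It assumes the hash is an XOR of the two addresses taken modulo the number of uplinks, which is how this policy is commonly described; the real ESXi implementation may differ in detail, and `uplink_for_flow` is a name invented for the example.

```python
import ipaddress


def uplink_for_flow(src_ip: str, dst_ip: str, uplink_count: int) -> int:
    """Return the uplink index for a source/destination IP pair."""
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    # Assumption: XOR the two addresses and take the result modulo the
    # number of uplinks. The same pair therefore always gets the same link.
    return (src ^ dst) % uplink_count


if __name__ == "__main__":
    app = "10.0.0.10"
    # The same database always maps to the same uplink, while a different
    # destination may map to the other one.
    for db in ["10.0.0.20", "10.0.0.20", "10.0.0.21"]:
        print(app, "->", db, "uplink", uplink_for_flow(app, db, 2))
```

Because the result depends on both addresses, one virtual machine can use both links, but only when it talks to destinations that hash differently.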

Route Based on Source MAC Hash

The physical nic to be used is determined by a hash created from the virtual machine’s source MAC address. This method provides a more balanced approach to load balancing than originating virtual port. Each virtual machine will always use only a single link, but load will be distributed. This method has a low CPU overhead and does not require any physical switch configuration.

Route Based on Physical NIC Load (Distributed Virtual Switch required, also called LBT)

The physical nic to be used is determined by load. Initial placement works like originating virtual port. No traffic is moved until a nic is utilized above 75% of its capacity for 30 seconds. Once this happens, traffic flows are moved to the next available nic. They stay on that nic until another LBT event moves them. LBT requires the dVS and adds some CPU overhead. It does not allow a single virtual machine to use more than 100% of a single link’s speed. It does balance traffic among all links during times of contention.
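The 75% for 30 seconds rule can be sketched as below. Two things here are my assumptions and not confirmed behavior: "above 75% for 30 seconds" is read as the mean of one-per-second utilization samples over the last 30 seconds, and a flow is moved to the least loaded other nic. The function names are hypothetical.

```python
from statistics import mean

SATURATION_THRESHOLD = 0.75  # 75% of link capacity, from the text
WINDOW_SECONDS = 30          # observation window, from the text


def is_saturated(samples: list[float]) -> bool:
    """True when mean utilization over a full window exceeds the threshold.

    `samples` holds one utilization reading (0.0 to 1.0) per second.
    """
    window = samples[-WINDOW_SECONDS:]
    return len(window) == WINDOW_SECONDS and mean(window) > SATURATION_THRESHOLD


def pick_target(utilization: dict[str, list[float]], current: str) -> str:
    """Return the nic a flow should use after one LBT check."""
    if not is_saturated(utilization[current]):
        return current  # no LBT event, traffic stays where it is
    # Assumption: move to the least loaded of the other nics.
    others = {
        nic: (mean(samples[-WINDOW_SECONDS:]) if samples else 0.0)
        for nic, samples in utilization.items()
        if nic != current
    }
    return min(others, key=others.get) if others else current
```

A short burst above 75% does not trigger a move in this sketch; only a full 30-second window does.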

Use Explicit Failover Order

The physical nic to be used is the highest available nic on the list. The others will not be used unless the first nic is unavailable. This method does no load balancing and should only be used in very special cases (like multi-nic vMotion).
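Explicit failover is simple enough to show in a few lines. This is a sketch of the rule as described above, with a hypothetical `active_uplink` helper and a made-up link-state dictionary.

```python
from typing import Optional


def active_uplink(failover_order: list[str], link_up: dict[str, bool]) -> Optional[str]:
    """Return the first nic in the ordered list whose link is up, or None."""
    for nic in failover_order:
        if link_up.get(nic, False):
            return nic
    return None  # every nic in the list is down


if __name__ == "__main__":
    order = ["nic1", "nic2"]
    print(active_uplink(order, {"nic1": True, "nic2": True}))
    print(active_uplink(order, {"nic1": False, "nic2": True}))
```

Only one nic ever carries traffic at a time, which is why this option provides redundancy but no load balancing.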

 

Design Advice

Which one should you use? It depends on your needs. Recently a friend told me they never changed the default because they never get close to saturating a single link. While this approach has merit, and I wish more people understood their network metrics, you may need to plan for the future. There are two questions I use to determine which to use:

  • Do you have any virtual machines that alone require more than a single link’s bandwidth? (If yes, then the only option is IP Hash with LACP or etherchannel)
  • Do you have vDS’s? (If yes, then use Route Based on Physical NIC Load; if no, then use the default or Source MAC Hash)

Simply put, LBT is a lot more manageable and easier to configure.
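The two questions above boil down to a small decision function. This is just my rule of thumb written as code; the function name and the returned strings are illustrative.

```python
def recommend_policy(vm_needs_more_than_one_link: bool, has_vds: bool) -> str:
    """Map the two design questions to a load balancing policy."""
    if vm_needs_more_than_one_link:
        # Only IP hash lets a single VM use more than one physical link.
        return "Route Based on IP Hash (with LACP or etherchannel)"
    if has_vds:
        return "Route Based on Physical NIC Load (LBT)"
    return "Route Based on Originating virtual port or Source MAC Hash"


if __name__ == "__main__":
    print(recommend_policy(vm_needs_more_than_one_link=False, has_vds=True))
```

The first question is checked first because a yes there leaves only one option, whatever switch type you have.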

Do IT certifications really matter?

Twice in the last week people in IT have asked me this question. My answer has been: it depends. When I first started my career I hated certifications. This is mostly because in college I attended a Microsoft certification course. It was a memorize-the-content, don’t-worry-if-you-don’t-understand type of course. It seemed pointless to me… I passed the test and still had never worked with half the stuff I was tested on. The memorized information was soon lost and nothing other than a piece of paper was gained. This tainted my view of certifications. For many years I did not see the point and avoided them. A few years ago an employer encouraged me to get a VMware certification. They also offered to pay. So I took them up on the offer and got the VCP certification. The required course for the certification was good because it allowed a lot of time for question and answer sessions. The instructor knew the material very well. It was a good course. With a little additional study I passed the test and had another IT certification.

What did I learn?

Knowing I was going to have to take the VCP test made my course learning more meaningful. I was able to learn with intent. I realized that the certification itself might not have value, but the knowledge did… So since that time I have used certifications to motivate myself to learn.

Wait… certifications should translate into more money right?

While it is true my jobs continue to pay more as time goes along I do not believe this is because of my certifications.  I think it’s because of what I learned while doing the certifications.   Will certifications ensure more money?  Not always.   But more knowledge and skills will translate to more ability to do.

So you convinced me … what certs should I do?

Well, here is the tough one. I can tell you which certifications I see on a lot of resumes and job postings:

  • ITIL – This one is on every resume.  Buy a book off Amazon and take the test… it’s not hard and people want it a lot.
  • VMware certification – Virtualization is hot… but only a few places have virtualization-only admins. VCP is normally enough. VCAP and above are not seen much on job postings. (Don’t get me wrong, I am all about geeking out with VMware certs… as shown by my VCDX, but in translation to jobs VCAP will not help you more than VCP… VCDX will, but it’s a long journey.) The most fun test on that journey is VCAP-DCA (it’s a live test that makes you actually do things; it’s so much fun)
  • RedHat certification (normally RHCE) – Red Hat is still the leader in enterprise Linux and their cert is a practical test that requires that you do things, not just know them.
  • Windows Certification – They are a lot better than they used to be and look great for Windows jobs
  • PMP – If you want to get into technical project management this is the cert.
  • CCNA – If you are interested in networking start here… even if you don’t have Cisco in your shop.

 

Live Tests

My final note is a shout out to all testing systems that require you to work with a real environment like the VCAP-DCA, CCNA or RHCE.  These tests require you know how to do things and are awesome.  No pointless memorization required.  We need more IT tests like this…

Deep Dive: Virtual Switch Security settings and Port Binding

Security Settings:

Three options are available on a virtual switch. These settings can be set at the switch layer and then overridden on individual port groups.

  • Promiscuous Mode – This allows the guest adapter to detect all frames passed on the vSwitch that are in the same VLAN as the guest. Allows for packet sniffing. This is not port mirroring. When set to reject, a guest only sees its own traffic and any broadcast traffic.
  • MAC Address Change – Allows the guest to change its MAC address. If set to reject, all frames for a MAC address that is not in the .vmx file are dropped at the switch.
  • Forged Transmits – If set to reject, all frames from the guest with a MAC address that does not match the .vmx file are dropped.
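The two MAC-related settings come down to comparing a frame's MAC address with the one in the .vmx file. Here is a rough sketch of that check for unicast frames only; broadcast and multicast handling is left out, and the function names and the "accept"/"reject" strings are my own shorthand, not a VMware API.

```python
def normalize(mac: str) -> str:
    """Lower-case a MAC address and use ':' as the separator."""
    return mac.replace("-", ":").lower()


def outbound_allowed(frame_src_mac: str, vmx_mac: str, forged_transmits: str = "reject") -> bool:
    """Forged Transmits: check the source MAC of a frame sent by the guest."""
    if forged_transmits == "accept":
        return True
    return normalize(frame_src_mac) == normalize(vmx_mac)


def inbound_allowed(frame_dst_mac: str, vmx_mac: str, mac_changes: str = "reject") -> bool:
    """MAC Address Change: check the destination MAC of a frame sent to the guest."""
    if mac_changes == "accept":
        return True
    return normalize(frame_dst_mac) == normalize(vmx_mac)
```

With both settings on reject, a guest that changes its own MAC address inside the operating system can neither send nor receive with the new address in this model.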

Security settings advice:

Set all three to reject on the switch, keeping your operating system admins in a box while protecting shared resources. Then add individual policies to each port group as needed. If you are wondering where it’s needed, one of the use cases is nested virtualization, which requires all three to be set to accept.

Port Binding:

Port binding is a setting that allows you to determine how and when the ports on a virtual switch are allocated.  Currently there are three port binding options:

  • Static binding (default)
  • Dynamic binding
  • Ephemeral binding

Static Binding – A port is allocated to a virtual machine when it is added to the port group. Once allocated, the virtual machine continues to use the port until it is removed from the port group (via deletion or a move to another port group). Network stats with static binding are kept through power off and vMotion.

Dynamic Binding – will be removed in the near future. Ports are allocated only when a virtual machine is powered on and the virtual network card is connected.  They are dynamically allocated when needed.  Network stats are kept through vMotion but not power off.

Ephemeral Binding – Is a lot like a standard vSwitch: it can be managed from vCenter or the ESXi host. Ports are allocated when the virtual machine is powered on and the nic is connected. One major difference is that dvPorts are created on demand; all other binding types create them when the port group is created. This process takes more RAM and processor power, so there are limits on the number of ephemeral ports available. Ephemeral ports are used for recovery when vCenter is down and may help with vCenter availability. All stats are lost when you vMotion or power off the virtual machine.

Port Group Type advice:

I would use static binding on almost everything.  Ephemeral has a high cost and does not scale.  I do personally use ephemeral for vCenter because I use 100% dVS switches.  If you are using standard switches just use static across the board.

 

Deep Dive: Standard Virtual Switch vs Distributed Virtual Switch

Let the wars begin. This article will discuss the current state of affairs between virtual switches in ESXi. I have not included any third party switches because I believe they are quickly dropping out of the picture with NSX.

 

What’s the deal with these virtual switches?

Virtual switches are a lot like ethernet layer 2 switches.  They have a lot of the same common features.  Both switch types feature the following configurable items:

  • Uplinks – connections from the virtual switch to the outside world – physical network cards
  • Port Groups – groups of virtual ports with similar configuration

In addition both switch types support:

  • Layer 2 traffic handling
  • VLAN segmentation
  • 802.1Q tagging
  • nic teaming
  • Outbound traffic shaping

So the first question everyone asks is: if two virtual machines are in the same VLAN and on the same server, does their communication leave the server?

No… two VMs in the same VLAN on the same ESXi host can communicate without leaving the switch.

 

Port Groups what are they?

Much like the name suggests, port groups are groups of ports. They can be best described as a number of virtual ports (think physical ports 1-10) that are configured the same. Port groups can have a defined number of ports and be expanded at will (like a 24 port switch or 48 port switch). There are two generic types of port groups:

  • Virtual machine
  • VMkernel

Virtual machine port groups are for guest virtual machines. VMkernel port groups are for ESXi management functions and storage. The following are valid uses for VMkernel ports:

  • Management Traffic
  • Fault Tolerance Traffic
  • IP based storage
  • vMotion traffic

You can have one or many port groups for VMkernel but each requires a valid IP address that can reach other VMkernel ports in the cluster.

At time of writing (5.5) the following maximums apply:

  • Total switch ports per host: 4096
  • Maximum active ports: 1016
  • Port groups per standard switch: 512
  • Port groups per distributed switch: 6500
  • VSS port groups per host: 1000

So as you can see vDS scales a lot higher.

Standard Virtual Switch

The standard switch has one real advantage: it does not require enterprise plus licensing. It has far fewer features and some drawbacks, including:

  • No configuration sync – you have to create all port groups exactly the same on each host or lots of things will fail (even upper case vs lower case will cause it to fail)

Where do standard switches make sense? In small shops with a single port group they make a lot of sense. If you need to host 10 virtual machines on the same subnet then standard switches will work fine.

Advice

  • Use scripts to deploy switches and keep them in sync to avoid manual errors
  • Always try vMotions between all hosts before and after each change to ensure nothing is broken
  • Don’t go complex on your networking design – it will not pay off

Distributed Virtual Switch

Well the distributed virtual switch is a different animal.  It is configured by vCenter and deployed to each ESXi host.  The configuration is in sync.  It has the following additional features:

  • Inbound Traffic Shaping – Throttle incoming traffic to the switch – useful to slow down traffic to a bad neighbor
  • VM network port block – Block the port
  • Private VLANs – This feature requires physical switches that support PVLAN so you can create VLANs inside of VLANs
  • Load-based teaming – Best possible load balancing (another article on this topic later)
  • Network vMotion – Because the dVS is owned by vCenter traffic stats and information can move between hosts when a virtual machine moves… on a standard switch that information is lost with a vMotion
  • Per port policy – dVS allows you to define policy at the port level instead of port group level
  • Link Layer Discovery Protocol – LLDP enables virtual to physical port discovery (your network admins can see info on your virtual switches and you can see network port info – great for troubleshooting and documentation)
  • User defined network i/o control – you can shape outgoing traffic to help avoid starvation
  • Netflow – dVS can output netflow traffic
  • Port Mirroring – ports can be configured to mirror for diagnostic and security purposes

As you can see there are a lot of features on the vDS, with two drawbacks:

  • Requires enterprise plus licensing
  • Requires vCenter to make any changes

The last drawback has produced a number of hybrid solutions over the years. At this point VMware has created a workaround with the ephemeral port group type and the network recovery features of the console.

Advice in using:

  • Back up your switch with PowerCLI (a number of good scripts out there)
  • Don’t go crazy just because you can: if you don’t need the feature, don’t use it
  • Test your vCenter to confirm you can recover from a failure

So get to the point which one should I use?

Well to take the VCDX model here are the elements of design:

Availability

  • VSS – deployed and defined on each ESXi host with no external requirements (+ for availability)
  • dVS – deployed and defined by vCenter, which is required to provision new ports or port groups (- for availability)

Manageability

  • VSS – pain to manage in most environments and does not scale with lots of port groups or complex solutions (- for manageability)
  • dVS – central management; can be deployed to multiple hosts or clusters in the same datacenter (+ for manageability)

Performance

  • VSS – performance is fine no effect on quality
  • dVS – performance is fine no effect on quality other than it can scale up a lot larger

Recoverability

  • VSS – is deployed to each host and stored on each host… if you lose it you have to rebuild from scratch and manually add VMs to the new switch (- for recoverability)
  • dVS – is deployed from vCenter and you always have it as long as you have vCenter. If you lose vCenter you have to start from scratch and cannot add new hosts (don’t remove your vCenter, it’s a very bad idea). (+ for recoverability, as long as you have a way to never lose your vCenter, which does not exist yet)

Security

  • VSS – Offers basic security features, not much more
  • dVS – Wider range of security features (+ for security)

 

End Result:

dVS is better in most ways but costs more money. If you want to use dVS it might be best to host vCenter on another cluster or otherwise ensure its availability.

 

 

Deep Dive: vSphere Network Link Failure Settings

In this series of posts I will be tackling different topics around vSphere and attempting to explain what they mean and how to use them. This article will discuss the link failover detection methods.

 

Link Fail over detection

Link failover detection is a critical component in any infrastructure. This is the method ESXi uses to determine if a link has failed and should not be used for traffic. ESXi provides two options:

  • Link Status
  • Beacon Probing

 

Link Status

Link status is just as it sounds: the link is either up or down. This method can detect switch failure or cable failure on the next hop. For example, if switch A were to lose power, ESXi would move all possible traffic from NIC1 and NIC2 to NIC3 and NIC4.

[Figure: Drawing1]

 

Link status does have some drawbacks:

  • It cannot detect mis-configuration on the switches or upstream.
  • It cannot detect upstream failures (for example the router attached to each switch)

For these reasons it is critical that you implement some type of link state tracking on your network gear. A common setup is to configure ports to shut down when their uplink ports fail. This type of link state tracking is a function of the switch gear, and it’s critical that it be configured all the way to the ESXi ports so ESXi sees a link failure. It still cannot overcome misconfiguration. This is really bad in situations where MTU is misconfigured upstream. For this reason VMware implemented a network health check that can help identify MTU mismatches and VLAN issues. I would 100% recommend turning it on. It’s a free health check that can save you hours.

Beacon probing

Beacon probing is a really simple process. It requires at least three network cards. Each network card sends out a broadcast message. As each nic receives the other network cards’ broadcasts, it knows it is not isolated from the others and assumes good link state. This process has a number of advantages:

  • Can detect upstream failures
  • Can detect some misconfigurations

It does have some downsides:

  • Requires at least three network cards for a quorum (2 would vote each other out)
  • Can lead to false positives

I would like to explain the false positives. There are a number of situations where a broadcast message might not reach its destination. During these times all links determined to be isolated would be shut down. You could put your host into an isolation event very quickly, all at once.
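A small sketch makes the quorum problem and the false positive case easier to see. It assumes a nic is treated as isolated when it hears no beacons from any of its peers; that rule and the `isolated_nics` helper are simplifications for illustration, not ESXi's exact logic.

```python
def isolated_nics(heard: dict[str, set[str]]) -> set[str]:
    """Return the nics that received no beacons from any peer.

    `heard[nic]` is the set of other nics whose beacons `nic` received.
    """
    return {nic for nic, peers in heard.items() if not peers}


if __name__ == "__main__":
    # Three nics: nic3 hears nobody while the other two hear each other,
    # so nic3 can be singled out.
    print(isolated_nics({"nic1": {"nic2"}, "nic2": {"nic1"}, "nic3": set()}))
    # Two nics: when the beacons are lost, both nics look isolated and
    # there is no third vote to break the tie.
    print(isolated_nics({"nic1": set(), "nic2": set()}))
```

The false positive case is the same picture with more nics: if the broadcasts are dropped somewhere upstream, every nic hears nothing and all of them are marked isolated at once.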

 

Link State Tracking Choice

This one is 100% up to you. If you only have two or fewer network cards use link status. If you have three or more then you might want to use beacon probing. Either way, test every possible failure scenario for possible issues before deploying in production.

 

Notify Switch of failure

Should you notify the switch of a failure? I would think this is a good idea. Without going into a discussion of ARP, this setting sends out reverse ARP (RARP) frames after a failover event. These frames allow physical switches to quickly update their MAC address tables. Without these updates, traffic destined for moved virtual machines may take up to five minutes to find its way to the new link. This is unlikely but possible in complex network configurations. My vote is always yes… I cannot think of a downside, but suggest one if you know it.

 

Failback:

When set to yes, this setting moves traffic back to a link once it recovers from a link failure. If set to no, you have to manually move the traffic flow back. There are two schools of thought on this matter. Failback yes creates an automated fail back when outages occur. Less work is good. But it’s possible that a link starts flapping and traffic keeps moving back and forth all night between working and failed links… causing availability problems in your environment. It’s really up to your requirements, but I suggest that if you use failback: no, you enable a vSphere alarm to let you know so you can re-add the link after the failure is resolved.

 

Radically simple storage design with VMware

Storage is my bread and butter. I cut my teeth on fine storage arrays from EMC. Since then I have moved on to many different vendors and I have learned one truth: storage can be hard or simple. VMware can make storage easy. I am very excited about SDS (software defined storage). I personally love VSAN and Nutanix; they are the commercial solution to something Google figured out long ago. Storage is simple but storage arrays are hard. VMware has been making great strides to simplify storage, but I find lots of people are afraid to use these features. They prefer to stick with inflexible arrays and provisioning methods. Please don’t get me wrong, these designs are required for some solutions. Some transactional processing requires insane IOPS or low latency. This design is for the rest of you.

Design Overview:

You have a VMware cluster with highly available shared storage. It is a mixed cluster running lots of different applications. Some of your virtual machines have lots of drives spread all over your storage LUNs. Some of your virtual machines have 2TB drives attached, so you have standardized on 4TB LUNs for all VMFS datastores. All of your LUNs are thin provisioned. You need to provide a solution that is easy to manage but avoids LUNs running out of disk space in the middle of the night. You are also concerned about performance: you would love an automated way to move virtual machines if I/O on a LUN is a problem.

Assumptions:

The following assumptions have been made:

  • You have enterprise plus licensing
  • You are running 5.5 and all VMFS LUNs are natively formatted at VMFS 5.x or later
  • You do not have an array that provides auto tiering
  • You do not need to take into account path selection in the process or physical array

 

Storage:

VMware’s storage cluster provides for all the requirements and needs. By placing all storage in a storage cluster, management of storage becomes easy. Just group storage together based on IO metrics (do not mix 15,000 RPM disks with 7,200 RPM disks) into a pool or datastore cluster. Enable storage DRS and your life just got a lot easier. Enable automated storage DRS for ease of management. This will help you place new virtual machines and move virtual machines off LUNs that are above a certain space threshold (80% by default). Now you just need to enable IO latency moves. This will move virtual machines to other datastores if the latency on the datastore passes a threshold (default 15ms) for a specific duration. I have used storage DRS just like this with over 2,000,000 successful storage moves without a single outage.
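The space and latency rules can be sketched as a pair of checks. This is not how Storage DRS is implemented; it is a simplified model with made-up names (`Datastore`, `over_threshold`, `pick_destination`), and the thresholds are passed in as arguments instead of being hard-coded, since they are configurable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Datastore:
    name: str
    capacity_gb: float
    used_gb: float
    latency_ms: float

    @property
    def utilization(self) -> float:
        """Fraction of capacity in use (0.0 to 1.0)."""
        return self.used_gb / self.capacity_gb


def over_threshold(ds: Datastore, space_threshold: float, latency_threshold_ms: float) -> bool:
    """True when the datastore is too full or too slow."""
    return ds.utilization > space_threshold or ds.latency_ms > latency_threshold_ms


def pick_destination(cluster: list[Datastore], source: Datastore, vm_size_gb: float,
                     space_threshold: float, latency_threshold_ms: float) -> Optional[Datastore]:
    """Pick the emptiest datastore that stays under both thresholds after the move."""
    candidates = [
        ds for ds in cluster
        if ds is not source
        and (ds.used_gb + vm_size_gb) / ds.capacity_gb <= space_threshold
        and ds.latency_ms <= latency_threshold_ms
    ]
    return min(candidates, key=lambda ds: ds.utilization, default=None)
```

Checking the destination after adding the virtual machine's size matters: moving a large disk onto a nearly full LUN would just relocate the problem.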

 

Abstract storage -> Pool -> Automate

 

All are provided by this design.

PowerCLI Change vSphere Alarms to send emails

vSphere 5 comes with a slew of really great alarms prebuilt by VMware. There is only one problem: by default they all just alarm in the GUI. I really want to avoid having someone log in to find out status. One organization I worked for had a ticketing system that accepted input as emails. It was not the cleanest but it worked. So here are some generic alarms that I recommend switching to email notifications.

 

“Datastore usage on disk”
“Health status changed Alarm”
“Health status monitoring”
“Insufficient vSphere HA failover resources”
“VMKernel NIC not configured correctly”
“vSphere HA failover in progress”
“vSphere HA host status”

 

Yes you can manually change them to send emails in the GUI… but it’s a lot quicker to use PowerCLI.  Here are the commands:

 

Get-AlarmDefinition "Datastore usage on disk" | New-AlarmAction -Email -To "your_email@domain.com" -Subject "Datastore usage on disk"
Get-AlarmDefinition "Health status changed Alarm" | New-AlarmAction -Email -To "your_email@domain.com" -Subject "Health status changed Alarm"
Get-AlarmDefinition "Health status monitoring" | New-AlarmAction -Email -To "your_email@domain.com" -Subject "Health status monitoring"
Get-AlarmDefinition "Insufficient vSphere HA failover resources" | New-AlarmAction -Email -To "your_email@domain.com" -Subject "Insufficient vSphere HA failover resources"
Get-AlarmDefinition "VMKernel NIC not configured correctly" | New-AlarmAction -Email -To "your_email@domain.com" -Subject "VMKernel NIC not configured correctly"
Get-AlarmDefinition "vSphere HA failover in progress" | New-AlarmAction -Email -To "your_email@domain.com" -Subject "vSphere HA failover in progress"
Get-AlarmDefinition "vSphere HA host status" | New-AlarmAction -Email -To "your_email@domain.com" -Subject "vSphere HA host status"

 

 

VMware Certified Design Expert (VCDX) Journey

Some of my regular readers will have noticed that I have been missing from the blog of late. My family has been noticing me missing in life for the last six months. This is all due to my goal to become a VMware Certified Design Expert. It’s a certification created by VMware to identify individuals who are IT architects. It’s a design focused certification instead of a technical one. It requires that you create a datacenter or multi-datacenter design and submit it. Once the submission is accepted you have an opportunity to defend it before a group of current VCDXs for a chance at becoming a VCDX. On Thursday, exactly one day after defending my design, I learned that I passed the defense. I am now VCDX #143. You can read about the VCDX program here. The number gives you an idea of how few people hold the certification. You can see the full list here. I would like to share some things about my journey in order to help and encourage others to join me. If you don’t want to read my story, skip to the bottom; I have some bullet point tips to help you.

 

A year ago I had an experience that had a profound effect on my career path. My employer kindly sent me to VMworld. At the time I was the primary VMware systems administrator as well as the lead for all systems administration. As I attended the conference my eyes were opened to all the future possibilities. It was 2012, the year of automation and cloud computing. During the sessions I was able to interact with a lot of the virtualization personalities and the attendees. I found a group of kindred spirits and an excitement and passion for IT that I had lost in the last few years. I came home excited; my parents gave me a ride home from the airport and mentioned multiple times how excited I was. For those who know me, excited is not a term normally associated with me. At this point I held the VMware Certified Professional (VCP) 5 certification that my employer had encouraged me to obtain. I came back with the desire to learn more about VMware products.

I have always learned more while on a project. I have to have a deadline or life will get in the way. So I set a goal, along with my employer’s support (they were awesome), that I would become a VMware Certified Advanced Professional (VCAP) by the next VMworld. Turns out the goal was way too far out, and I did nothing for a long time. A few months before VMworld 2013 I really started hitting the VCAP-DCA (Data Center Administration) content. I was determined to install and play with every VMware feature, which as it turns out is a great way to prepare for this exam. I had also been working with my employer to install and set up vCloud Director in order to become a service provider of IaaS. This allowed me to get a VCP-Cloud certification. When it was posted that exams would be 75% off at VMworld I had a concrete goal to achieve. I took both the VCAP-DCA and the IaaS test during the general sessions at VMworld 2013, and I managed to pass both. Now realizing that certifications provide the solid ground to force me to learn, I looked at the certification track for my next learning path. VCAP-DCD (Data Center Design) seemed like the most logical next step. After VMworld they offered a coupon for 50% off certifications if done within two months. I had my goal and bought a test. I read and studied a lot for this test. I have done a lot of architecting, but in education we didn’t always use the same terms. I found the VCAP-DCD book from VMware Press the greatest help. The night before the test I came down with a 103 degree fever, but unwilling to waste my money and with Pearson’s inflexible reschedule policy in my way, I took the test. It went by quickly and I was faced with a score of 297. Passing is 300. Exactly two weeks later, without a fever, I passed the test. Now I was faced with the VCDX-DCV certification. I blogged about my experiences with both exams in other posts.

 

VCDX-DCV the Journey

I started on my design in Dec 2013. This was after securing this goal with my wife, family and employer. Through this process my employer has been very supportive, which is critical. If you are an employer wondering if your employee will benefit from this certification, the answer is a simple yes. It’s cheaper than a week of training at $1200, and trust me, I learned more about vSphere and architecture during the design than I have in the last three years combined.

The design
I started with a fictional design based on VSAN. I was going for the wow factor. I spent most of the month playing with VSAN and reading every word about VSAN. In January, as I struggled to fit the fictional design around business requirements, I gave it up. I went with a real world design. If you have not read it somewhere else, read it here… don’t do 100% fictional. Do something you know. VCDX is not about the wow factor; it’s about meeting a customer’s requirements and constraints in your design. It’s not about the perfect design. It’s about meeting the customer’s needs. You may have to add some portion of fiction to your design to meet the VCDX requirements (included in the blueprint; read it over and over, it’s the greatest clue to what they want).
I did have to add some fictional elements in order to show mastery of all the elements of design but I kept them limited. Remember to include business requirements: SLA’s, RTO, RPO and define them in detail. Once I had an idea around my design, I needed to figure out the format. I again struggled with the format to use. My day to day job did not use anything as formal as was requested by the design submission document. I used several publicly available documents and started to pattern a document but I still struggled to find the correct format.

Solutions Enablement toolkit

I read in multiple places that a lot of the people who submitted used the Solutions Enablement Toolkit (SET) from VMware as a template. This toolkit is only available to partners. This posed a problem. I worked for a partner on an enterprise licensing agreement, but they did not have the required partner level to access the kit. In order to get the partner level I had to pass a whole bunch of pre-sales and post-sales certifications. So after two days of certifications and online learning (thank goodness the tests are free to partners) I was able to access the SET for vSphere. These documents really helped me, and partners really have a leg up with them. I did not use them as templates beyond understanding what each document type should contain. If you are not with a partner then I recommend using this document as a template, with a lot more detail. Remember that everything should align with the elements of design (RAMPS: Recoverability, Availability, Manageability, Performance, Security). Make sure you justify each design choice against these elements and understand the effect, positive and negative, of each design choice on these elements.

Adjusted Expectations

So fast forward, we are now at late February. My goal was to get VCDX-DCV at VMworld 2014, but they didn’t have any defenses scheduled for VMworld. I had a choice between Cambridge, MA in July or late fall in Palo Alto. Since I had some question whether my employer would pay for the trip to defend, I went for Cambridge. This gave me three months to complete the design by May 9th. I feel it’s critical to have balance in your life. I did not have any all night design document writing sessions. I did however spend about one hour after work and two hours after my kids went to bed each night for three months. I took three days off work, and it really did occupy a lot of my time and thoughts. I took the 8th off work and submitted my design by 5 PM. They were kind enough to let me know they got my documents. At this point I had invested about 600-700 hours in the process. Most of this time was spent figuring out the format I wanted to use for my documents. I changed my mind a lot and reformatted the document multiple times. The real document only took about 100 hours plus some proofreading.

Perfect Document

After providing my perfect document (about 600 pages in total) I went on vacation with my family: a carefully crafted beach visit to Lake Michigan, timed to end on the day I would hear back whether I would get to defend. I was happy to hear that I was invited to defend and terrified that I now had to prepare to be judged. I started reviewing my design document to create slides… and there were errors… things I had missed and new problems I discovered. Which goes to show you no design is perfect, and it’s ok to have errors. Just let them know you changed something and why… and of course what effect that change had on the elements of design.

Slides and presentation

Putting together a slide deck, I stuck to the recommendations in the VCDX book. I did not do anything fancy. I practiced my presentation a few times but not a ton… I mostly studied VMware books (HA and DRS deep dive etc.) and my design.

Defense

I am not allowed to talk a huge amount about the defense. What I can tell you is I really enjoyed spending time with the other VCDXs explaining my design. It was just like VMworld all over again… I had a great time. They asked great questions and helped me to think critically about my design choices. I spent two years as a missionary going door to door, which perhaps prepared me for this experience. After eight hours every day for two years of having people reject your message, you get used to rejection and different opinions. It was great practice for this experience and life as a whole. My panelists were awesome and helped me a lot.

Ok … Ok enough about me, here are the tips

  • Get lots of practice talking about VMware in public speaking situations (Your local VMUG is awesome for this)  My local VMUG allowed me to present  three times in the last year… it was great experience.
  • Get lots of practice public speaking… in your church community or something somewhere…. practice and push yourself out of that comfortable skin
  • Study to learn… figure out how you learn… if it’s writing after you learn, start a blog… if it’s teaching others, start a lunch and learn session at your work… Figure out your learning style and start using it on VMware products
  • Set deadlines and don’t miss them… also don’t give up with failure.  The only real way to fail is to not try or stop trying.  You can do it… you’re a smart person
  • Reach out to people on Twitter… (yes I said the evil word.)  #VCDX find local resources that are studying and join a VCDX group
  • Focus on the business side of it all… I bet you’re great at technical but it’s time to understand the business… they are why we exist.  Find out what the business wants and why.  Learn the business lingo and terms.
  • Don’t make a design choice just because…. know why you make the choice and the consequences then stick by your design.   Don’t let the first stiff wind blow you down.
  • Use a white board and pictures to explain your thoughts…  a picture is really worth a million words.  Learn to use Visio or some application like it…  Draw pictures to explain everything.    Trust me use pictures…
  • Conflicts are your friend.  Every design has conflicts and risks… if yours does not, it’s not real.  Conflicts just show critical thinking.
  • Use RAMPS like crazy.   Think about it, drink it, and live it…
  • VCDX is about becoming, not achieving.  It’s not a one time event, it’s a lifelong journey.   My journey to VCDX did not really begin in January and it did not end in July, it continues.  Life should be about becoming something…  How can you become a better architect?
  • The journey to VCDX has taught me so much more about virtualization technology… networking… storage… disaster recovery and people skills.  Find your weak spot and learn about it
  • Getting your design accepted is a huge deal, don’t treat that lightly.  Many people don’t try or never get that far.
  • Read the VCDX book and blueprint and live it… don’t think you know better just follow them… it’s like being in college again… figure out what the professor wants and do it… don’t worry if you think you know a better way.
  • View the recorded VCDX boot camps and attend a boot camp if you can.. they both help a lot.
  • If you’re ready to submit a design you already know a lot about VMware products… go do what you know, be confident.  I am a VCDX now and I don’t know everything… There are lots of holes in my knowledge…
  • Do a real design with real requirements and constraints… don’t go for WOW factor… just make a good design.
  • Know what else you could have done… and the effect of that choice.  If you had six more nic’s how would you use them?
  • Mock Design sessions… get on twitter and locate a mock design and defense group they are happening all around and will really help you prepare.
  • Push yourself.. if you don’t enjoy public speaking… do it.  If you don’t enjoy speaking up to defend your design choices do it… push yourself
  • Do not sacrifice your family, health, employment etc… to the certification.  Balance is hard in life but worth it.   When you die you don’t want your tombstone to say divorced VCDX.  No amount of success in life can replace failure in your home.
  • Enough with my preaching… I hope it helps.  Contact me via twitter or here anytime if you have questions.

 

What’s Next for me?

Well, a VCDX-Cloud would be a good choice for me since I am doing a lot of cloud now… but in reality I think I might focus on a CCNA… because I am a little weak on networking and I want to round out my experience, plus it will justify buying some cool looking Cisco switches.  (Yes, that means you are going to see Cisco posts soon.)

 

Network IO Control failing to shape traffic with multi-nic vMotion

I have always used Network IO Control (NIOC) to shape my traffic on a virtual switch (Enterprise Plus required).   It does a great job of balancing traffic when contention comes into play.  Unfortunately it cannot shape traffic as it comes into the virtual switch.  It can only shape traffic going out.   A new friend (@VMPrime) pointed this out to me at the perfect time.   He was knowledgeable and encouraging, an all around great guy.   He pointed out that NIOC only has an effect on traffic flows exiting the host.  When traffic arrives from another host, NIOC on the receiving side has no effect.  I remember reading about this but the terminology was a little fuzzy from a VMware perspective.  Joe provided a simple scenario where that lack of control could be a problem.

Take into account the following scenario.  Assume that we have a two host cluster, each host running two 10Gb NICs.  We have VLANs for management and virtual machines, and we have set up multi-NIC vMotion as shown in diagram 1 below.   We have NIOC set up with shares to protect each traffic type during contention.  Assume that the network utilization of host A is 2Gb while the network utilization of host B is 15Gb.  Assuming that host B has capacity for all of host A’s virtual workloads, I put host A into maintenance mode.  Host A now utilizes up to 18Gb of network to transfer the running state of its virtual machines to host B: host A’s NIOC kicks in, preserving 2Gb for virtual machines and allocating 18Gb to vMotion.  We are now shoving 18Gb into host B, whose virtual machines need 15Gb, over links that total only 20Gb.  Now both sides are contending for bandwidth, and we might have availability issues on our host.  In addition, the vMotion might fail.
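
To put rough numbers on that scenario, here is a small Python sketch. It is only back-of-the-envelope math, not how NIOC itself works: the function names are made up for illustration, the figures from the scenario are treated as gigabits per second, and protocol overhead is ignored.

```python
# Back-of-the-envelope math for the two host scenario (all values in Gbit/s).
# Function names are hypothetical; nothing here calls a VMware API.

LINK_SPEED = 10      # one 10Gb NIC
NICS_PER_HOST = 2    # two uplinks per host


def host_capacity(nics=NICS_PER_HOST, link_speed=LINK_SPEED):
    """Total uplink bandwidth of one host."""
    return nics * link_speed


def inbound_excess(vm_traffic, incoming_vmotion, capacity):
    """How far demand on the receiving host exceeds its capacity (0 if it fits)."""
    demand = vm_traffic + incoming_vmotion
    return max(0, demand - capacity)


capacity = host_capacity()          # 20
vmotion_from_a = capacity - 2       # host A keeps 2 for its VMs, vMotion gets 18
excess_on_b = inbound_excess(15, vmotion_from_a, capacity)

print(capacity, vmotion_from_a, excess_on_b)  # 20 18 13
```

Host B is being asked to take 33Gb on 20Gb of uplinks, and because its NIOC only shapes what leaves the host, nothing on host B gets to decide who loses.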

How do we solve this issue?

This is exactly why we have network limits. Unlike CPU and memory limits, NIOC limits can really help with this exact issue. Putting a limit on vMotion of, for example, 2.5Gb per link would create a scenario where it could never use more than 5Gb per host. Will this still have an effect? Maybe. It’s a cost-benefit analysis. You have to weigh your options, and you might have to adjust your limit lower.
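
One rough way to pick a starting limit is to work backwards from the headroom on the busiest host, then look at what the limit costs you in migration time. The sketch below is only an estimate under stated assumptions: the helper names are hypothetical, the 256 GB of VM memory is an invented example, and the time estimate ignores dirty-page re-copies and protocol overhead, so real vMotions will take longer.

```python
# Hypothetical helpers for sizing a vMotion limit (bandwidth in Gbit/s).

def vmotion_limit_per_link(host_capacity, vm_traffic, vmotion_links):
    """Split the receiving host's spare bandwidth evenly across vMotion links."""
    headroom = max(0, host_capacity - vm_traffic)
    return headroom / vmotion_links


def transfer_seconds(memory_gigabytes, rate_gbps):
    """Naive time to copy VM memory once: gigabytes * 8 bits / rate."""
    return memory_gigabytes * 8 / rate_gbps


# Host B from the scenario: 20 of capacity, 15 used by VMs, two vMotion links.
limit = vmotion_limit_per_link(20, 15, 2)
print(limit)                              # 2.5

# Cost side of the trade-off for an assumed 256 GB of running VM memory.
print(transfer_seconds(256, 18))          # no limit, vMotion takes 18
print(transfer_seconds(256, 2 * limit))   # 409.6 seconds with the limit
```

That is the cost-benefit part: the limit keeps host B out of trouble, but evacuating a host takes several times longer, and if the busiest host runs hotter than 15Gb the limit has to come down further.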

 

Drawing1