vRO get all VM’s

I have been spending less time than I would like in vRO but I wanted to share some of my findings in a brief format.  Here is the code in a scriptable task that can get all virtual machines across all vCenters connected to your vRO instance.

var vCenters = VcPlugin.allSdkConnections;

for each (var vCenter in vCenters) {
    System.log(vCenter.name);
    var clusters = vCenter.getAllClusterComputeResources();
    for each (var cluster in clusters) {
        System.log(cluster.name);
        // cluster.host holds the ESXi hosts in this cluster, host.vm the VMs on each host
        for each (var host in cluster.host) {
            for each (var vm in host.vm) {
                System.log(vm.name);
                // do your per vm action here
            }
        }
    }
}

There are better ways to gather every virtual machine, but I wanted to demonstrate how to walk down the layers. (You can get the VMs without walking vCenters and clusters; I'll show that at the bottom.) This code will be very familiar to PowerCLI users who do this type of action all the time. I have included lots of system logging to help you follow the walk; feel free to remove it. Some highlights are as follows:

  • Get the vCenter SDK connections into vCenters
    • From this you can call almost any object the vCenter SDK exposes
  • Identify vCenters one at a time into vCenter
  • Identify clusters one at a time into cluster
  • Identify the hosts in each cluster one at a time into host
  • Identify VMs one at a time into vm
  • Take some action on each VM

You can of course shorten this code with:

var vCenters = VcPlugin.allSdkConnections;

for each (var vCenter in vCenters) {
    var vms = vCenter.allVirtualMachines;
    for each (var vm in vms) {
        System.log(vm.name);
        // do your per vm action here
    }
}

See how that is shorter? It's a pretty cool feature. One thing to remember is that the data returned into vms is not a text field. It is an array of virtual machine objects, and each object carries the properties of one VM. With vm.name I am reading a single property (the name) of a single VM object.
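To make the "array of objects" point concrete, here is a minimal sketch that collects the VM names into a plain array. It uses ordinary for loops so it can be tried outside vRO as well. The function name collectVmNames and the stub data are my own inventions for illustration; inside vRO you would pass VcPlugin.allSdkConnections instead of the stub.

```javascript
// Collect VM names from a list of vCenter connections.
// In vRO, pass VcPlugin.allSdkConnections; the stub below only mimics its shape.
function collectVmNames(connections) {
    var names = [];
    for (var i = 0; i < connections.length; i++) {
        var vms = connections[i].allVirtualMachines || [];
        for (var j = 0; j < vms.length; j++) {
            // Each element is an object; name is just one of its properties.
            names.push(vms[j].name);
        }
    }
    return names;
}

// Hypothetical stand-in data so the function can be run without a vCenter.
var stubConnections = [
    { name: "vcenter-a", allVirtualMachines: [{ name: "web01" }, { name: "db01" }] },
    { name: "vcenter-b", allVirtualMachines: [{ name: "app01" }] }
];

var vmNames = collectVmNames(stubConnections);
```

Returning an array of names (or of the VM objects themselves) makes it easier to hand the result to a later workflow element than logging inside the loop.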

Enjoy and let me know if I can help.

Firewall Ports for PowerCLI with vRO

I have always assumed that vSphere's extensions to PowerShell (PowerCLI) all used the standard vCenter API endpoint (https://vsphere.name/sdk). Normally the server I run PowerCLI on is also the server I run the vSphere fat client on, meaning more than 443 is open for the fat client. When we deployed vRO (Orchestrator, for those confused by the new naming) we needed a PowerShell host so that vRO could initiate a number of PowerCLI and PowerShell-only functions. You can read about how to implement a PowerShell host on VMware's blog here. It works great: vRO is able to execute PowerShell and get information passed back. We did run into some issues that turned out to be firewall blocks. Initially we found that some PowerCLI commands used 10443 (the inventory service API) directly, not just the assumed 443. This was a surprise. Digging around, I have been unable to locate information on the exact ports used, and VMware was also unable to identify them. What we did find was that if you open the fat client (C#) ports, all of the commands will work. So here is my list, all TCP:

Inventory service
10443 10080

vCenter service
80 8080 8443 9443 10443 443 903 902

Web client
9443 9090

ESXi
427 22 80 443 902
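Several ports appear in more than one group, so for a firewall request it can help to flatten the list. The sketch below does only that: it copies the port numbers from the list above and returns each distinct port once, sorted. The names portGroups and uniquePorts are made up for this example.

```javascript
// Port groups copied from the list above (all TCP).
var portGroups = {
    "Inventory service": [10443, 10080],
    "vCenter service": [80, 8080, 8443, 9443, 10443, 443, 903, 902],
    "Web client": [9443, 9090],
    "ESXi": [427, 22, 80, 443, 902]
};

// Return every distinct port across all groups, sorted numerically.
function uniquePorts(groups) {
    var seen = {};
    var result = [];
    for (var group in groups) {
        if (!groups.hasOwnProperty(group)) { continue; }
        var ports = groups[group];
        for (var i = 0; i < ports.length; i++) {
            if (!seen[ports[i]]) {
                seen[ports[i]] = true;
                result.push(ports[i]);
            }
        }
    }
    return result.sort(function (a, b) { return a - b; });
}

var allPorts = uniquePorts(portGroups);
```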

Some of these have been removed in vSphere 6.  I hope it helps you remove a potential issue in advance.   If someone knows the exact ports or has the ability / time to test them all feel free to comment.

VMkernel interfaces in vSphere 6

Not everyone has noticed the new types of VMkernel interfaces in vSphere 6. Here is a quick note to identify the types of interfaces available:

  • vMotion traffic – Required for vMotion – Moves the state of virtual machines (active data disk for svMotion, active memory and execution state) during a vMotion
  • Provisioning traffic – Not required, uses the management network if not set up – cold migration, cloning and snapshot creation (powered off virtual machines = cold)
  • Fault tolerance traffic (FT) – Required for FT – Enables fault tolerance traffic on the host – only a single adapter may be used for FT per host
  • Management traffic – Required – Management of host and vCenter server
  • vSphere replication traffic – Only needed if using vSphere replication – outgoing replication data from ESXi host to vSphere replication server
  • vSphere replication NFC traffic – Only needed if using vSphere replication – handles incoming replication data on the target replication site
  • Virtual SAN – Required for VSAN – virtual SAN traffic on the host

The purpose of the multiple interface types is that in 6 you can now route each of these types of traffic, allowing you to segment it even more. (In ESXi 5.x only management had a TCP/IP stack.) I recommend that you create a unique subnet for each type of traffic you use. In addition, many of them support multiple concurrent NICs (like multi-NIC vMotion), which can improve performance. When possible, set up multi-NIC.

Server has a weak ephemeral Diffie-Hellman public key vRealize Orchestrator Script

So our friends in the browser world decided which lengths of encryption keys are no longer secure. In this case it affects Chrome and Firefox. vRO has the ability to not present the insecure keys, but the current version does not do this out of the box. I know a lot of VMware administrators come from the Windows side of the house and are uncomfortable with the Linux command line. Since I came from Linux I figured I would provide a script to resolve this issue. It really does not do anything that is not already provided in the VMware KB. It does help you avoid using vi, which I know is the bane of most Windows admins.

So here is how to use it:

  1. Log in to your vCO server with PuTTY (as root)
  2. Log in to your vCO server with WinSCP (as root)
  3. Copy the script provided into the base directory with WinSCP
  4. Switch to your PuTTY session and type:
     mv change_key.sh.txt change_key.sh
     chmod 755 change_key.sh
  5. Then type exactly:
     ./change_key.sh

Wait for it to complete and you are done.   You can also just run the commands in the script manually.   I hope it helps a little.

change_key.sh

How does memory ballooning work

After my memory management post there were a number of discussions about how memory ballooning works.   It’s a rather neat way to free up memory by making the guest operating system choose which things to swap.  There are three types of memory when discussing ballooning:

  • Guest Operating system virtual memory – virtual memory mapping handled by the guest operating system requested by applications
  • Guest Operating system memory – “Real memory” owned by the operating system – this would be the physical RAM on the operating system if it was not virtualized
  • Hypervisor (ESXi) Memory – memory on the ESXi host mapped by ESXi and translated in parts into guest operating system memory

 

In order to understand ballooning it’s critical to understand how memory allocation works within a virtual machine:

  • You start an application on your VM that requires 2GB of RAM
  • Application requests 2GB of Guest virtual memory
  • Guest operating system allocates system memory for the 2GB
  • Hypervisor has Guest operating system memory mapped to physical RAM
  • You stop application using 2GB of RAM
  • The 2GB of Guest virtual memory is freed
  • The 2GB of guest operating system memory is freed – known as free memory
  • The 2GB of Hypervisor memory is not freed – in fact the hypervisor has no idea it is no longer used – the operating system does not deallocate and erase the 2GB of memory, it just unlinks it

This creates an issue where the hypervisor is allocating RAM to a server that is not required.

For this discussion there are two types of memory on a guest:

  • Active memory – in use memory
  • Free memory – memory not used by operating system

What does the Ballooning driver do?

The driver, which is part of VMware tools (no tools, no ballooning), is a process that can run with the highest guest privilege and request memory (it runs as vmmemctl.sys). This process chews up free RAM. The hypervisor, as the initiator of this RAM request, understands which blocks are allocated in virtual and guest operating system memory and then deallocates them in hypervisor memory, making that hypervisor memory available for other guest operating systems.

Where does it go wrong?

Ballooning is normally a bad thing because it does not just target free memory. Its job is to remove memory pressure on the hypervisor to avoid hypervisor swapping (which is bad). It will eat into active memory on an operating system, forcing the operating system to swap/page out active memory. This is better than hypervisor swapping but still a really bad thing for performance.

How much will the balloon driver eat?

Each operating system has a minimum recommended memory value (google the minimum recommended memory for your OS); for example, RHEL 7 has 512MB. On top of this value the advanced setting Mem.AppBalloonMaxSlack is added, which is set to 16MB by default (it is adjustable from 1MB to 256MB). So for my RHEL 7 system it will balloon all RAM down to 528MB. The impact of this on high memory systems is huge, while low memory systems may see no impact.
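The arithmetic above is simple enough to sketch. The two helper functions below are hypothetical names of mine, not anything ESXi exposes; they only restate the rule as described here (OS minimum plus Mem.AppBalloonMaxSlack) and show why a large VM has far more memory exposed to ballooning than a small one.

```javascript
// Lowest guest memory the balloon driver will leave, per the rule described above:
// the OS minimum recommended memory plus Mem.AppBalloonMaxSlack.
function balloonFloorMb(minRecommendedMb, maxSlackMb) {
    if (maxSlackMb === undefined) { maxSlackMb = 16; } // default value quoted in the text
    if (maxSlackMb < 1 || maxSlackMb > 256) {
        throw new Error("Mem.AppBalloonMaxSlack must be between 1 and 256 MB");
    }
    return minRecommendedMb + maxSlackMb;
}

// How much of a VM's configured memory is exposed to ballooning.
function maxBalloonableMb(configuredMb, minRecommendedMb, maxSlackMb) {
    return Math.max(0, configuredMb - balloonFloorMb(minRecommendedMb, maxSlackMb));
}

var rhel7Floor = balloonFloorMb(512);        // 512 + 16 = 528
var bigVm = maxBalloonableMb(65536, 512);    // 64GB VM: 65536 - 528 = 65008
var smallVm = maxBalloonableMb(1024, 512);   // 1GB VM: 1024 - 528 = 496
```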

How can I avoid Ballooning?

  • Avoid memory contention situations
  • Avoid over provisioning servers
  • Make a reservation on a server (reservation does not allow ballooning to claim any memory)
  • Do not install VMware tools (Really bad idea)

 

Really the secret is just avoid memory ballooning.  It’s better than hypervisor swapping but it’s still bad.

 

Phantom Ballooning on Virtual machine

Recently I ran into an issue where esxtop was showing a virtual machine that was ballooning even though my ESXi host was in a high state (see this article for state information). A quick check of the guest operating system found the ballooning driver running, but no more memory than usual was in use. (This virtual machine used a pretty static 3GB of RAM.) I wanted to understand and duplicate this situation, so I loaded up my home lab with two Linux virtual machines with 40GB of RAM each (my ESXi hosts are 32GB each). I took the following steps:

  • Power on each virtual machine and bring up top
  • SSH to the ESXi host run esxtop and switch to Memory (M) and filter to virtual machines only (V) then add only the fields D, J, K, Q
  • Monitor the normal load of the virtual machines and ESXi host for 10 minutes

During this time I found that one of my virtual machines was using 3GB of RAM and the other 0.5GB. So I started to apply pressure. Knowing that I needed to create a soft state to force ballooning, I added 27GB of synthetic RAM usage to a single operating system (using the Linux command stress). I found that I quickly moved into a hard state, and ballooning and compression began. After two minutes I turned off the stress application using 27GB and allowed the virtual machine to return to 3GB used (ESXi host 3.5GB used). The screenshot below was taken after 10 minutes of no memory pressure:

[Screenshot: esxtop memory view after 10 minutes without memory pressure]

As you can see, we are still showing ballooning even though top in the operating system showed it had returned to its original value of 3GB:

[Screenshot: top output in the guest operating system]

 

So we had a problem: the guest was not ballooning but esxtop was showing ballooning. I can assume from this that ballooned memory is not reclaimed until the guest requests it again. I found that if I initiated a 25GB request, the ballooning metrics in esxtop would clear. So ballooning without an active soft state can indicate over-provisioned RAM on a guest and that a soft state once existed.

What about vMotion

I wanted to test the effects of this phantom ballooning on vMotion. As expected, the metric is 100% cleared after a vMotion and is not set again unless a soft state is reached on the destination ESXi host.

Memory Management in ESXi 6

A good friend recently reviewed the what's new in vSphere 6 course and had some questions. That generated a number of really great discussions and this blog article. At about the same time I had a customer asking why their virtual machine was showing ballooning even though there was no memory pressure. This generated some research and thought organization that went into this article:

Memory reclamation techniques

VMware uses a number of memory reclamation techniques when under pressure. The methods get worse in terms of overall cost as the pressure increases. I don't want to dedicate much time to this or duplicate others' blog entries. The following processes are in use:

  • Transparent Page Sharing – This (pre 5.5 U4) used to share common pages between all guests on the same host. Now pages are shared only within each VM due to security concerns. See this article
  • Guest Ballooning – This is an in-guest driver that asks the guest operating system for memory, forcing the guest to swap intelligently
  • Memory Compression – Pages are compressed and stored on a cache setup on main memory (set to 10% of total memory by default)
  • Hypervisor Swapping – Hypervisor swaps pages to disk… major effect on performance.

minFree the state to rule them all

minFree is an internal metric used by ESXi to denote when the reclamation techniques should be used. Each technique is triggered once a limit of minFree is reached. minFree is based upon the total RAM available on an ESXi host. You determine minFree using these rules:

  1. First 28GB of physical memory in host = 899MB
  2. Add 1% of remaining physical memory to the 899MB value in step 1

 

For example here are some common minFree numbers:

Total RAM    minFree
28GB         899MB
48GB         1103.80MB
72GB         1349.56MB
144GB        2086.84MB
256GB        3233.72MB
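The two rules translate into a one-line formula. Here is a small sketch of it, assuming 1GB = 1024MB; the function name minFreeMb is mine. The rules do not say what happens on hosts with less than 28GB, so returning a flat 899MB for them is purely my assumption.

```javascript
// minFree in MB for a host with totalGb of physical RAM, following the two rules above:
// 899 MB for the first 28 GB, plus 1% of whatever is left (1 GB = 1024 MB).
// Assumption: hosts smaller than 28 GB simply get the 899 MB base.
function minFreeMb(totalGb) {
    var baseMb = 899;
    var remainingMb = Math.max(0, totalGb - 28) * 1024;
    return baseMb + remainingMb * 0.01;
}

var hostSizesGb = [28, 72, 144];
for (var i = 0; i < hostSizesGb.length; i++) {
    console.log(hostSizesGb[i] + "GB -> " + minFreeMb(hostSizesGb[i]).toFixed(2) + "MB");
}
// prints:
// 28GB -> 899.00MB
// 72GB -> 1349.56MB
// 144GB -> 2086.84MB
```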

5.5 Overview with Memory

Each version of ESXi has memory states that determine which technique gets used. In 5.5 the states are as follows:

High 100 % of minFree – TPS

Soft 64 % of minFree – Ballooning

Hard 32 % of minFree – Memory compression

Low 16% of minFree – Swapping

So on a 144GB host minFree =2086.84MB it would look like this:

Free      Used       State   Memory reclamation method
2.03GB    141.97GB   High    TPS enabled
1.30GB    142.7GB    Soft    Ballooning
0.65GB    143.3GB    Hard    Memory compression
0.32GB    143.68GB   Low     Swapping

This model worked well, but as you can see the gap between TPS and hypervisor swapping is a very small 1.7GB. Two virtual machines could consume this at the same time, leaving no time for TPS to break large pages down and save space. It was a common problem that a host would go from a high state directly to low and swapping. Normally when you reach swapping things go bad for your applications.

 

Changes in ESXi 6

vSphere 6 added an additional state to allow memory pages to be broken down (from perhaps 2MB to 4KB) a lot sooner. I believe this is due in part to two factors:

  • The change to TPS to no longer share between VMs but only within the same VM – meaning very few large pages will be shared, but breaking them into 4KB pages might provide some savings
  • The small headroom that ESXi gives between states – even on large memory hosts the difference between TPS and hypervisor swapping is 2GB, or a single virtual machine

 

So the change introduces the clear state to replace the 5.5 high state as detailed below:

High 400 % of minFree – Large page break

Clear 100% of minFree – TPS begins

Soft 64 % of minFree – Ballooning

Hard 32 % of minFree – Memory compression – compressed and swapped out

Low 16% of minFree – Swapping
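To see what these percentages mean on a real host, the sketch below turns a minFree value into a free-memory threshold per state. The names STATE_PERCENT and stateThresholdsMb are hypothetical; the percentages are the ones listed above, and 2086.84MB is the minFree of the 144GB example host.

```javascript
// Percentages of minFree for each ESXi 6 state, as listed above.
var STATE_PERCENT = { High: 400, Clear: 100, Soft: 64, Hard: 32, Low: 16 };

// Free-memory threshold in MB for each state, given a host's minFree in MB.
function stateThresholdsMb(minFreeMb) {
    var thresholds = {};
    for (var state in STATE_PERCENT) {
        if (STATE_PERCENT.hasOwnProperty(state)) {
            thresholds[state] = minFreeMb * STATE_PERCENT[state] / 100;
        }
    }
    return thresholds;
}

// 2086.84 MB is the minFree of the 144GB example host.
var example = stateThresholdsMb(2086.84);
for (var name in example) {
    console.log(name + ": " + example[name].toFixed(2) + "MB free");
}
```

On that host the High threshold lands at roughly 8.3GB free, so large pages start being broken down well before the roughly 2GB mark where TPS begins.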

This new clear state allows the pages to be broken much sooner before TPS is enacted.   The odd thing is every ESXi host I have ever seen is in a high state even before it gets to 400% or 100% of minFree.   These are the documented levels but the state seems to be static unless a lower status is achieved.

 

How to identify which state you are in

Use esxtop on the ESXi host.   Choose M for memory and the state is listed at the top right side:

[Screenshot: esxtop memory view with the state shown at the top right]

You can identify current virtual machine ballooning, swapping and compression via esxtop:

[Screenshot: esxtop per-VM ballooning, swapping and compression counters]

Nutanix and Acropolis test drive for $1 an hour

Catchy title eh? Well, I have been wanting to test drive Nutanix community edition for a while now. It allows you to set up a Nutanix cluster on almost any hardware. It also runs Nutanix's new hypervisor Acropolis (KVM based). My desire to set this up has always been limited by my time and the need to clear out some hardware for the test. Nutanix was also kind enough to provide me with access to their training portal for free, so I am able to learn about their products using their interactive learning system. It is by far one of the most advanced online teaching systems I have ever experienced. But no teaching tool is the same as playing with the real thing. We have Nutanix at work, but I was not present for the original setup and don't do much of the day-to-day configuration. So I wanted a low-cost playground.

Enter Ravello systems

Ravello made a huge splash this year at VMworld, taking some of the best in show awards. In addition they offered all vExperts 1,000 free hours of access per month. Mix that with the 1,000 free hours I get as an RHCE and I have a lot of cloud space available. For those who have not used Ravello, they provide a cloud front end to public cloud providers Amazon and Google, allowing you to set up nested hypervisors including Nutanix Acropolis and ESXi. You do have to bring your own licenses. The interface is simple and clean but very powerful: a perfect example of what cloud provisioning should be. In addition, people can share templates via libraries for you to use.

Enter the Nutanix Library

It's available here. Log in to your Ravello account, then visit that page and click add to library. Now you can deploy a complete community edition of Nutanix in the cloud with two clicks. It's really impressive. Here are the steps to deploy:

 

Deployment

  • Select Library -> Blueprints
  • Select Nutanix Community Edition
  • Select Create application
  • Name your application
  • Click on the Nutanix CE icon in the center of the screen

[Screenshot: the Nutanix CE icon in the Ravello application view]

  • On the right side information will be presented about the virtual machine (that will run everything, including nested virtual machines)
  • I wanted to make mine accessible via the internet (yes it’s very insecure)
  • Click on services
  • Under each service select Advanced and Enable SNAT
  • After they are all enabled click save at bottom
  • Now click the publish button
  • Select optimize for performance
  • You can select to auto power down after xx hours to avoid costs
  • You can see your billing rate per hour for your server (this is per hour of turned on server not deployed)

[Screenshot: Ravello publish dialog showing the hourly billing rate]

So you can see I am able to run this virtualized Hypervisor for $1.0131 per hour.

  • It will take a few minutes to boot up and generate all required SSL keys for the first time (mine took about 20 minutes before it was totally ready to go)
  • Once it’s booted up select application and your application name
  • The summary tab will show the status and ports
  • Click open on the 9440 and you should be presented with Nutanix login

[Screenshot: Nutanix login page]

First time login is admin:admin and you will be required to change the password. If all goes well you are now able to deploy nested virtual machines on your hypervisor. It does require that you have a free Nutanix account; it even offers to let you sign up live. For bonus points deploy two of them and get them to replicate a virtual machine. The sky is the limit.


 

Wrap up

I am personally really amazed at how easily it all worked. It turned literally hours of configuration and fiddling (which would be a great learning experience) into 20 minutes. Both Nutanix and Ravello should be commended for these awesome services. I wish that other vendors could provide a complex POC in 20 minutes. I think Ravello has a major future in the market. Give it a try; it's worth $20 to play with it. Let me know what crazy things you try. I know that in the coming weeks replication is on my mind. A quick guide to Nutanix community edition can be found here.

ESXi Thousands of failed logins from 127.0.0.1

A co-worker brought this to my attention.   We run a daily vCheck on all clusters and it identified thousands of failed logins like this:

 

info 'Vimsvc.ha-eventmgr' opID=hostd-2bce] Event 11090 : Cannot login user @127.0.0.1: no permission

They all appeared in the /var/log/hostd.log.   It was identified as caused by the script:

 

/sbin/hostd-probe ++group=host/vim/vmvisor/hostd-probe

 

which runs out of crontab on ESXi hosts every 5 minutes (/var/spool/cron/crontabs/root). This would cause the problem every time. The problem is caused by the dcui user not being in the /etc/vmware/hostd/authorization.xml file. You can see a healthy example from an ESXi host without any local users except the defaults here:

 

<ConfigRoot>
  <ACEData id="11">
    <ACEDataEntity>ha-folder-root</ACEDataEntity>
    <ACEDataId>11</ACEDataId>
    <ACEDataIsGroup>false</ACEDataIsGroup>
    <ACEDataPropagate>true</ACEDataPropagate>
    <ACEDataRoleId>-1</ACEDataRoleId>
    <ACEDataUser>dcui</ACEDataUser>
  </ACEData>
  <ACEData id="12">
    <ACEDataEntity>ha-folder-root</ACEDataEntity>
    <ACEDataId>12</ACEDataId>
    <ACEDataIsGroup>false</ACEDataIsGroup>
    <ACEDataPropagate>true</ACEDataPropagate>
    <ACEDataRoleId>-1</ACEDataRoleId>
    <ACEDataUser>vpxuser</ACEDataUser>
  </ACEData>
  <ACEData id="17">
    <ACEDataEntity>ha-folder-root</ACEDataEntity>
    <ACEDataId>17</ACEDataId>
    <ACEDataIsGroup>false</ACEDataIsGroup>
    <ACEDataPropagate>true</ACEDataPropagate>
    <ACEDataRoleId>-1</ACEDataRoleId>
    <ACEDataUser>root</ACEDataUser>
  </ACEData>
  <NextAceId>18</NextAceId>
</ConfigRoot>

 

My ESXi host, by contrast, was missing the dcui entry, so that account could not log in to execute its function. I have some ideas how it got removed, but the fix is simple. Log in to the host with the C# client and add the dcui user as an administrator. Then restart hostd:

 

/etc/init.d/hostd restart

 

Careful though: hostd does not always come back clean after a restart, so I would move any workload off the host before restarting it. I doubt many people will run into this issue, but I figured I would post it to save people time when googling.
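If you want to check a host quickly, the two symptoms described in this post (the failed login lines in hostd.log and the missing dcui entry in authorization.xml) can be tested with a few lines of script once you have copied the file contents off the host. This is only a sketch: the function names are made up, the XML check is a plain string search rather than a real XML parse, and the sample data is the log line quoted earlier plus a cut-down XML string of my own.

```javascript
// Count hostd.log lines that report a failed login, grouped by source address.
function countFailedLogins(logText) {
    var pattern = /Cannot login user (\S*)@(\S+): no permission/;
    var counts = {};
    var lines = logText.split("\n");
    for (var i = 0; i < lines.length; i++) {
        var match = pattern.exec(lines[i]);
        if (match) {
            var source = match[2];
            counts[source] = (counts[source] || 0) + 1;
        }
    }
    return counts;
}

// True when the authorization.xml text contains an ACE entry for the given user.
// Simple substring check, not a full XML parse.
function hasAceForUser(xmlText, user) {
    return xmlText.indexOf("<ACEDataUser>" + user + "</ACEDataUser>") !== -1;
}

// Sample data for illustration only.
var sampleLog = [
    "info 'Vimsvc.ha-eventmgr' opID=hostd-2bce] Event 11090 : Cannot login user @127.0.0.1: no permission",
    "info 'Vimsvc.ha-eventmgr' opID=hostd-2bce] Event 11090 : Cannot login user @127.0.0.1: no permission"
];
var sampleXml = "<ConfigRoot><ACEData id=\"12\"><ACEDataUser>vpxuser</ACEDataUser></ACEData></ConfigRoot>";

var failures = countFailedLogins(sampleLog.join("\n"));
var dcuiPresent = hasAceForUser(sampleXml, "dcui");
```

With the sample data, failures counts two failed logins from 127.0.0.1 and dcuiPresent is false, which is the combination I saw on the broken host.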